linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
@ 2016-06-16  0:28 Andy Lutomirski
  2016-06-16  0:28 ` [PATCH 01/13] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Andy Lutomirski
                   ` (14 more replies)
  0 siblings, 15 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

Since the dawn of time, a kernel stack overflow has been a real PITA
to debug, has caused nondeterministic crashes some time after the
actual overflow, and has generally been easy to exploit for root.

With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
that enable it (just x86 for now) get virtually mapped stacks with
guard pages.  This causes reliable faults when the stack overflows.

If the arch implements it well, we get a nice OOPS on stack overflow
(as opposed to panicing directly or otherwise exploding badly).  On
x86, the OOPS is nice, has a usable call trace, and the overflowing
task is killed cleanly.

This does not address interrupt stacks.

Andy Lutomirski (12):
  x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
  x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately
  mm: Track NR_KERNEL_STACK in pages instead of number of stacks
  mm: Move memcg stack accounting to account_kernel_stack
  fork: Add generic vmalloced stack support
  x86/die: Don't try to recover from an OOPS on a non-default stack
  x86/dumpstack: When OOPSing, rewind the stack before do_exit
  x86/dumpstack: When dumping stack bytes due to OOPS, start with
    regs->sp
  x86/dumpstack: Try harder to get a call trace on stack overflow
  x86/dumpstack/64: Handle faults when printing the "Stack:" part of an
    OOPS
  x86/mm/64: Enable vmapped stacks
  x86/mm: Improve stack-overflow #PF handling

Ingo Molnar (1):
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()

 arch/Kconfig                     | 12 ++++++++++
 arch/x86/Kconfig                 |  1 +
 arch/x86/entry/entry_32.S        | 11 +++++++++
 arch/x86/entry/entry_64.S        | 11 +++++++++
 arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++++++-
 arch/x86/include/asm/traps.h     |  6 +++++
 arch/x86/kernel/dumpstack.c      | 17 +++++++++++++-
 arch/x86/kernel/dumpstack_32.c   |  4 +++-
 arch/x86/kernel/dumpstack_64.c   | 16 ++++++++++---
 arch/x86/kernel/traps.c          | 32 +++++++++++++++++++++++++
 arch/x86/mm/fault.c              | 39 ++++++++++++++++++++++++++++++
 arch/x86/mm/init_64.c            | 27 ---------------------
 arch/x86/mm/pageattr.c           |  7 +++++-
 arch/x86/mm/tlb.c                | 15 ++++++++++++
 fs/proc/meminfo.c                |  2 +-
 kernel/fork.c                    | 51 ++++++++++++++++++++++++++++++----------
 mm/page_alloc.c                  |  3 +--
 17 files changed, 233 insertions(+), 49 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 01/13] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16  0:28 ` [PATCH 02/13] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated Andy Lutomirski
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Ingo Molnar, Andrew Morton,
	Andy Lutomirski, Denys Vlasenko, H . Peter Anvin, Oleg Nesterov,
	Peter Zijlstra, Rik van Riel, Thomas Gleixner, Waiman Long,
	linux-mm

From: Ingo Molnar <mingo@kernel.org>

So when memory hotplug removes a piece of physical memory from pagetable
mappings, it also frees the underlying PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory hotplugged.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Message-Id: <1442903021-3893-4-git-send-email-mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bce2e5d9edd4..c7465453d64e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -702,27 +702,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 	spin_unlock(&init_mm.page_table_lock);
 }
 
-/* Return true if pgd is changed, otherwise return false. */
-static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
-{
-	pud_t *pud;
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++) {
-		pud = pud_start + i;
-		if (pud_val(*pud))
-			return false;
-	}
-
-	/* free a pud table */
-	free_pagetable(pgd_page(*pgd), 0);
-	spin_lock(&init_mm.page_table_lock);
-	pgd_clear(pgd);
-	spin_unlock(&init_mm.page_table_lock);
-
-	return true;
-}
-
 static void __meminit
 remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 		 bool direct)
@@ -913,7 +892,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	unsigned long addr;
 	pgd_t *pgd;
 	pud_t *pud;
-	bool pgd_changed = false;
 
 	for (addr = start; addr < end; addr = next) {
 		next = pgd_addr_end(addr, end);
@@ -924,13 +902,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 
 		pud = (pud_t *)pgd_page_vaddr(*pgd);
 		remove_pud_table(pud, addr, next, direct);
-		if (free_pud_table(pud, pgd))
-			pgd_changed = true;
 	}
 
-	if (pgd_changed)
-		sync_global_pgds(start, end - 1, 1);
-
 	flush_tlb_all();
 }
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 02/13] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
  2016-06-16  0:28 ` [PATCH 01/13] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16  0:28 ` [PATCH 03/13] x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately Andy Lutomirski
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

This avoids pointless races in which another CPU or task might see a
partially populated global pgd entry.  These races should normally
be harmless, but, if another CPU propagates the entry via
vmalloc_fault and then populate_pgd fails (due to memory allocation
failure, for example), this prevents a use-after-free of the pgd
entry.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/pageattr.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 7a1f7bbf4105..6a8026918bf6 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1113,7 +1113,9 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr)
 
 	ret = populate_pud(cpa, addr, pgd_entry, pgprot);
 	if (ret < 0) {
-		unmap_pgd_range(cpa->pgd, addr,
+		if (pud)
+			free_page((unsigned long)pud);
+		unmap_pud_range(pgd_entry, addr,
 				addr + (cpa->numpages << PAGE_SHIFT));
 		return ret;
 	}
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 03/13] x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
  2016-06-16  0:28 ` [PATCH 01/13] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Andy Lutomirski
  2016-06-16  0:28 ` [PATCH 02/13] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16  0:28 ` [PATCH 04/13] mm: Track NR_KERNEL_STACK in pages instead of number of stacks Andy Lutomirski
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

It's currently only used in the EFI code, which is safe AFAICT.
Warn if anyone tries to use it on the normal kernel pgd.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/pageattr.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6a8026918bf6..e9b9c5cedbb8 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1996,6 +1996,9 @@ out:
 void kernel_unmap_pages_in_pgd(pgd_t *root, unsigned long address,
 			       unsigned numpages)
 {
+	/* Unmapping kernel entries from init_mm's pgd is not allowed. */
+	WARN_ON(root == init_mm.pgd);
+
 	unmap_pgd_range(root, address, address + (numpages << PAGE_SHIFT));
 }
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 04/13] mm: Track NR_KERNEL_STACK in pages instead of number of stacks
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (2 preceding siblings ...)
  2016-06-16  0:28 ` [PATCH 03/13] x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16 11:10   ` Vladimir Davydov
  2016-06-16 15:33   ` Josh Poimboeuf
  2016-06-16  0:28 ` [PATCH 05/13] mm: Move memcg stack accounting to account_kernel_stack Andy Lutomirski
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, linux-mm

Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
zone.  This only makes sense if each kernel stack exists entirely in
one zone, and allowing vmapped stacks could break this assumption.

It turns out that the code for tracking kernel stack allocations in
units of pages is slightly simpler, so just switch to counting
pages.

Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 fs/proc/meminfo.c | 2 +-
 kernel/fork.c     | 3 ++-
 mm/page_alloc.c   | 3 +--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 83720460c5bc..8338c0569a8d 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -145,7 +145,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 				global_page_state(NR_SLAB_UNRECLAIMABLE)),
 		K(global_page_state(NR_SLAB_RECLAIMABLE)),
 		K(global_page_state(NR_SLAB_UNRECLAIMABLE)),
-		global_page_state(NR_KERNEL_STACK) * THREAD_SIZE / 1024,
+		K(global_page_state(NR_KERNEL_STACK)),
 		K(global_page_state(NR_PAGETABLE)),
 #ifdef CONFIG_QUICKLIST
 		K(quicklist_total_size()),
diff --git a/kernel/fork.c b/kernel/fork.c
index 5c2c355aa97f..95bebde59d79 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -225,7 +225,8 @@ static void account_kernel_stack(struct thread_info *ti, int account)
 {
 	struct zone *zone = page_zone(virt_to_page(ti));
 
-	mod_zone_page_state(zone, NR_KERNEL_STACK, account);
+	mod_zone_page_state(zone, NR_KERNEL_STACK,
+			    THREAD_SIZE / PAGE_SIZE * account);
 }
 
 void free_task(struct task_struct *tsk)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6903b695ebae..2b0203b3a976 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4457,8 +4457,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_SHMEM)),
 			K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
 			K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
-			zone_page_state(zone, NR_KERNEL_STACK) *
-				THREAD_SIZE / 1024,
+			K(zone_page_state(zone, NR_KERNEL_STACK)),
 			K(zone_page_state(zone, NR_PAGETABLE)),
 			K(zone_page_state(zone, NR_UNSTABLE_NFS)),
 			K(zone_page_state(zone, NR_BOUNCE)),
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 05/13] mm: Move memcg stack accounting to account_kernel_stack
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (3 preceding siblings ...)
  2016-06-16  0:28 ` [PATCH 04/13] mm: Track NR_KERNEL_STACK in pages instead of number of stacks Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16  0:28 ` [PATCH 06/13] fork: Add generic vmalloced stack support Andy Lutomirski
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, linux-mm

We should account for stacks regardless of stack size.  Move it into
account_kernel_stack.

Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/fork.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 95bebde59d79..59e52f2120a3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -165,20 +165,12 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
 						  THREAD_SIZE_ORDER);
 
-	if (page)
-		memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
-					    1 << THREAD_SIZE_ORDER);
-
 	return page ? page_address(page) : NULL;
 }
 
 static inline void free_thread_info(struct thread_info *ti)
 {
-	struct page *page = virt_to_page(ti);
-
-	memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
-				    -(1 << THREAD_SIZE_ORDER));
-	__free_kmem_pages(page, THREAD_SIZE_ORDER);
+	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
@@ -227,6 +219,11 @@ static void account_kernel_stack(struct thread_info *ti, int account)
 
 	mod_zone_page_state(zone, NR_KERNEL_STACK,
 			    THREAD_SIZE / PAGE_SIZE * account);
+
+	/* All stack pages belong to the same memcg. */
+	memcg_kmem_update_page_stat(
+		virt_to_page(ti), MEMCG_KERNEL_STACK,
+		account * (THREAD_SIZE / PAGE_SIZE));
 }
 
 void free_task(struct task_struct *tsk)
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 06/13] fork: Add generic vmalloced stack support
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (4 preceding siblings ...)
  2016-06-16  0:28 ` [PATCH 05/13] mm: Move memcg stack accounting to account_kernel_stack Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16 17:25   ` Kees Cook
  2016-06-16  0:28 ` [PATCH 07/13] x86/die: Don't try to recover from an OOPS on a non-default stack Andy Lutomirski
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
vmalloc_node.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/Kconfig  | 12 ++++++++++++
 kernel/fork.c | 45 +++++++++++++++++++++++++++++++++++++--------
 2 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d794384a0404..1acd262036b0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -658,4 +658,16 @@ config ARCH_NO_COHERENT_DMA_MMAP
 config CPU_NO_EFFICIENT_FFS
 	def_bool n
 
+config HAVE_ARCH_VMAP_STACK
+       def_bool n
+
+config VMAP_STACK
+	bool "Use a virtually-mapped stack"
+	depends on HAVE_ARCH_VMAP_STACK
+	---help---
+	  Enable this if you want the use virtually-mapped kernel stacks
+	  with guard pages.  This causes kernel stack overflows to be
+	  caught immediately rather than causing difficult-to-diagnose
+	  corruption.
+
 source "kernel/gcov/Kconfig"
diff --git a/kernel/fork.c b/kernel/fork.c
index 59e52f2120a3..37234fa0ba9b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -158,19 +158,30 @@ void __weak arch_release_thread_info(struct thread_info *ti)
  * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
  * kmemcache based allocator.
  */
-# if THREAD_SIZE >= PAGE_SIZE
+# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
 static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 						  int node)
 {
+#ifdef CONFIG_VMAP_STACK
+	return __vmalloc_node_range(
+		THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
+		GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL,
+		0, node, __builtin_return_address(0));
+#else
 	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
 						  THREAD_SIZE_ORDER);
 
 	return page ? page_address(page) : NULL;
+#endif
 }
 
 static inline void free_thread_info(struct thread_info *ti)
 {
+#ifdef CONFIG_VMAP_STACK
+	vfree(ti);
+#else
 	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+#endif
 }
 # else
 static struct kmem_cache *thread_info_cache;
@@ -215,15 +226,33 @@ static struct kmem_cache *mm_cachep;
 
 static void account_kernel_stack(struct thread_info *ti, int account)
 {
-	struct zone *zone = page_zone(virt_to_page(ti));
+	struct zone *zone;
+
+	if (IS_ENABLED(CONFIG_VMAP_STACK) && !virt_addr_valid(ti)) {
+		int i;
+		struct vm_struct *vm = find_vm_area(ti);
 
-	mod_zone_page_state(zone, NR_KERNEL_STACK,
-			    THREAD_SIZE / PAGE_SIZE * account);
+		WARN_ON_ONCE(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
 
-	/* All stack pages belong to the same memcg. */
-	memcg_kmem_update_page_stat(
-		virt_to_page(ti), MEMCG_KERNEL_STACK,
-		account * (THREAD_SIZE / PAGE_SIZE));
+		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+			mod_zone_page_state(page_zone(vm->pages[i]),
+					    1, account);
+		}
+
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(
+			vm->pages[0], MEMCG_KERNEL_STACK,
+			account * (THREAD_SIZE / PAGE_SIZE));
+	} else {
+		zone = page_zone(virt_to_page(ti));
+		mod_zone_page_state(zone, NR_KERNEL_STACK,
+				    THREAD_SIZE / PAGE_SIZE * account);
+
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(
+			virt_to_page(ti), MEMCG_KERNEL_STACK,
+			account * (THREAD_SIZE / PAGE_SIZE));
+	}
 }
 
 void free_task(struct task_struct *tsk)
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 07/13] x86/die: Don't try to recover from an OOPS on a non-default stack
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (5 preceding siblings ...)
  2016-06-16  0:28 ` [PATCH 06/13] fork: Add generic vmalloced stack support Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16  0:28 ` [PATCH 08/13] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

It's not going to work, because the scheduler will explode if we try
to schedule when running on an IST stack or similar.

This will matter when we let kernel stack overflows (which are #DF)
call die().

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 2bb25c3fe2e8..36effb39c9c9 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -247,6 +247,9 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 		return;
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
+	if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
+	     & ~(THREAD_SIZE - 1)) != 0)
+		panic("Fatal exception on special stack");
 	if (panic_on_oops)
 		panic("Fatal exception");
 	do_exit(signr);
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 08/13] x86/dumpstack: When OOPSing, rewind the stack before do_exit
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (6 preceding siblings ...)
  2016-06-16  0:28 ` [PATCH 07/13] x86/die: Don't try to recover from an OOPS on a non-default stack Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16 17:50   ` Josh Poimboeuf
  2016-06-16  0:28 ` [PATCH 09/13] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

If we call do_exit with a clean stack, we greatly reduce the risk of
recursive oopses due to stack overflow in do_exit, and we allow
do_exit to work even if we OOPS from an IST stack.  The latter gives
us a much better chance of surviving long enough after we detect a
stack overflow to write out our logs.

I intentionally separated this from the preceding patch that
disables do_exit-on-OOPS on IST stacks.  This way, if we need to
revert this patch, we still end up in an acceptable state wrt stack
overflow handling.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_32.S   | 11 +++++++++++
 arch/x86/entry/entry_64.S   | 11 +++++++++++
 arch/x86/kernel/dumpstack.c | 13 +++++++++----
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 983e5d3a0d27..1499db695a88 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1153,3 +1153,14 @@ ENTRY(async_page_fault)
 	jmp	error_code
 END(async_page_fault)
 #endif
+
+ENTRY(rewind_stack_do_exit)
+	/* Prevent any naive code from trying to unwind to our caller. */
+	xorl	%ebp, %ebp
+
+	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esi
+	leal	-TOP_OF_KERNEL_STACK_PADDING-PT_OLDSS(%esi), %esp
+
+	call	do_exit
+1:	jmp 1b
+END(rewind_stack_do_exit)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1807ed..394cad73e890 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1423,3 +1423,14 @@ ENTRY(ignore_sysret)
 	mov	$-ENOSYS, %eax
 	sysret
 END(ignore_sysret)
+
+ENTRY(rewind_stack_do_exit)
+	/* Prevent any naive code from trying to unwind to our caller. */
+	xorl	%ebp, %ebp
+
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rax
+	leaq	-TOP_OF_KERNEL_STACK_PADDING-SS(%rax), %rsp
+
+	call	do_exit
+1:	jmp 1b
+END(rewind_stack_do_exit)
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 36effb39c9c9..d4d085e27d04 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -228,6 +228,8 @@ unsigned long oops_begin(void)
 EXPORT_SYMBOL_GPL(oops_begin);
 NOKPROBE_SYMBOL(oops_begin);
 
+extern void __noreturn rewind_stack_do_exit(int signr);
+
 void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 {
 	if (regs && kexec_should_crash(current))
@@ -247,12 +249,15 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 		return;
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
-	if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
-	     & ~(THREAD_SIZE - 1)) != 0)
-		panic("Fatal exception on special stack");
 	if (panic_on_oops)
 		panic("Fatal exception");
-	do_exit(signr);
+
+	/*
+	 * We're not going to return, but we might be on an IST stack or
+	 * have very little stack space left.  Rewind the stack and kill
+	 * the task.
+	 */
+	rewind_stack_do_exit(signr);
 }
 NOKPROBE_SYMBOL(oops_end);
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 09/13] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (7 preceding siblings ...)
  2016-06-16  0:28 ` [PATCH 08/13] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16 11:56   ` Borislav Petkov
  2016-07-08 12:07   ` [tip:x86/debug] x86/dumpstack: Honor supplied @regs arg tip-bot for Andy Lutomirski
  2016-06-16  0:28 ` [PATCH 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

The comment suggests that show_stack(NULL, NULL) should backtrace
the current context, but the code doesn't match the comment.  If
regs are given, start the "Stack:" hexdump at regs->sp.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack_32.c | 4 +++-
 arch/x86/kernel/dumpstack_64.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index 464ffd69b92e..91069ebe3c87 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -98,7 +98,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	int i;
 
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 5f1c6266eb30..603356a5597a 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -266,7 +266,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * back trace for this cpu:
 	 */
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (8 preceding siblings ...)
  2016-06-16  0:28 ` [PATCH 09/13] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16 18:16   ` Josh Poimboeuf
  2016-06-16  0:28 ` [PATCH 11/13] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

If we overflow the stack, print_context_stack will abort.  Detect
this case and rewind back into the valid part of the stack so that
we can trace it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index d4d085e27d04..400a2e17c1d1 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -100,6 +100,13 @@ print_context_stack(struct thread_info *tinfo,
 {
 	struct stack_frame *frame = (struct stack_frame *)bp;
 
+	/*
+	 * If we overflowed the stack into a guard page, jump back to the
+	 * bottom of the usable stack.
+	 */
+	if ((unsigned long)tinfo - (unsigned long)stack < PAGE_SIZE)
+		stack = (unsigned long *)tinfo + 1;
+
 	while (valid_stack_ptr(tinfo, stack, sizeof(*stack), end)) {
 		unsigned long addr;
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 11/13] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (9 preceding siblings ...)
  2016-06-16  0:28 ` [PATCH 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16  0:28 ` [PATCH 12/13] x86/mm/64: Enable vmapped stacks Andy Lutomirski
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

If we overflow the stack into a guard page, we'll recursively fault
when trying to dump the contents of the guard page.  Use
probe_kernel_address so we can recover if this happens.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack_64.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 603356a5597a..5e298638c790 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -276,6 +276,8 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 	stack = sp;
 	for (i = 0; i < kstack_depth_to_print; i++) {
+		unsigned long word;
+
 		if (stack >= irq_stack && stack <= irq_stack_end) {
 			if (stack == irq_stack_end) {
 				stack = (unsigned long *) (irq_stack_end[-1]);
@@ -285,12 +287,18 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		if (kstack_end(stack))
 			break;
 		}
+
+		if (probe_kernel_address(stack, word))
+			break;
+
 		if ((i % STACKSLOTS_PER_LINE) == 0) {
 			if (i != 0)
 				pr_cont("\n");
-			printk("%s %016lx", log_lvl, *stack++);
+			printk("%s %016lx", log_lvl, word);
 		} else
-			pr_cont(" %016lx", *stack++);
+			pr_cont(" %016lx", word);
+
+		stack++;
 		touch_nmi_watchdog();
 	}
 	preempt_enable();
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 12/13] x86/mm/64: Enable vmapped stacks
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (10 preceding siblings ...)
  2016-06-16  0:28 ` [PATCH 11/13] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16  4:17   ` Mika Penttilä
  2016-06-16  0:28 ` [PATCH 13/13] x86/mm: Improve stack-overflow #PF handling Andy Lutomirski
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

This allows x86_64 kernels to enable vmapped stacks.  There are a
couple of interesting bits.

First, x86 lazily faults in top-level paging entries for the vmalloc
area.  This won't work if we get a page fault while trying to access
the stack: the CPU will promote it to a double-fault and we'll die.
To avoid this problem, probe the new stack when switching stacks and
forcibly populate the pgd entry for the stack when switching mms.

Second, once we have guard pages around the stack, we'll want to
detect and handle stack overflow.

I didn't enable it on x86_32.  We'd need to rework the double-fault
code a bit and I'm concerned about running out of vmalloc virtual
addresses under some workloads.

This patch, by itself, will behave somewhat erratically when the
stack overflows while RSP is still more than a few tens of bytes
above the bottom of the stack.  Specifically, we'll get #PF and make
it to no_context and an oops without triggering a double-fault, and
no_context doesn't know about stack overflows.  The next patch will
improve that case.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++++++++++++-
 arch/x86/kernel/traps.c          | 32 ++++++++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                | 15 +++++++++++++++
 4 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0a7b885964ba..b624b24d1dc1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -92,6 +92,7 @@ config X86
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_EBPF_JIT			if X86_64
+	select HAVE_ARCH_VMAP_STACK		if X86_64
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8f321a1b03a1..14e4b20f0aaf 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,28 @@ struct tss_struct;
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss);
 
+/* This runs runs on the previous thread's stack. */
+static inline void prepare_switch_to(struct task_struct *prev,
+				     struct task_struct *next)
+{
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we switch to a stack that has a top-level paging entry
+	 * that is not present in the current mm, the resulting #PF will
+	 * will be promoted to a double-fault and we'll panic.  Probe
+	 * the new stack now so that vmalloc_fault can fix up the page
+	 * tables if needed.  This can only happen if we use a stack
+	 * in vmap space.
+	 *
+	 * We assume that the stack is aligned so that it never spans
+	 * more than one top-level paging entry.
+	 *
+	 * To minimize cache pollution, just follow the stack pointer.
+	 */
+	READ_ONCE(*(unsigned char *)next->thread.sp);
+#endif
+}
+
 #ifdef CONFIG_X86_32
 
 #ifdef CONFIG_CC_STACKPROTECTOR
@@ -39,6 +61,8 @@ do {									\
 	 */								\
 	unsigned long ebx, ecx, edx, esi, edi;				\
 									\
+	prepare_switch_to(prev, next);					\
+									\
 	asm volatile("pushl %%ebp\n\t"		/* save    EBP   */	\
 		     "movl %%esp,%[prev_sp]\n\t"	/* save    ESP   */ \
 		     "movl %[next_sp],%%esp\n\t"	/* restore ESP   */ \
@@ -103,7 +127,9 @@ do {									\
  * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
  * has no effect.
  */
-#define switch_to(prev, next, last) \
+#define switch_to(prev, next, last)					  \
+	prepare_switch_to(prev, next);					  \
+									  \
 	asm volatile(SAVE_CONTEXT					  \
 	     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */	  \
 	     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */	  \
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 00f03d82e69a..9cb7ea781176 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present",	segment_not_present)
 DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
+#ifdef CONFIG_VMAP_STACK
+static void __noreturn handle_stack_overflow(const char *message,
+					     struct pt_regs *regs,
+					     unsigned long fault_address)
+{
+	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
+		 (void *)fault_address, current->stack,
+		 (char *)current->stack + THREAD_SIZE - 1);
+	die(message, regs, 0);
+
+	/* Be absolutely certain we don't return. */
+	panic(message);
+}
+#endif
+
 #ifdef CONFIG_X86_64
 /* Runs on IST stack */
 dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 {
 	static const char str[] = "double fault";
 	struct task_struct *tsk = current;
+#ifdef CONFIG_VMAP_STACK
+	unsigned long cr2;
+#endif
 
 #ifdef CONFIG_X86_ESPFIX64
 	extern unsigned char native_irq_return_iret[];
@@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_nr = X86_TRAP_DF;
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we overflow the stack into a guard page, the CPU will fail
+	 * to deliver #PF and will send #DF instead.  CR2 will contain
+	 * the linear address of the second fault, which will be in the
+	 * guard page below the bottom of the stack.
+	 */
+	cr2 = read_cr2();
+	if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
+		handle_stack_overflow(
+			"kernel stack overflow (double-fault)",
+			regs, cr2);
+#endif
+
 #ifdef CONFIG_DOUBLEFAULT
 	df_debug(regs, error_code);
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..fbf036ae72ac 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -77,10 +77,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	unsigned cpu = smp_processor_id();
 
 	if (likely(prev != next)) {
+		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+			/*
+			 * If our current stack is in vmalloc space and isn't
+			 * mapped in the new pgd, we'll double-fault.  Forcibly
+			 * map it.
+			 */
+			unsigned int stack_pgd_index =
+				pgd_index(current_stack_pointer());
+			pgd_t *pgd = next->pgd + stack_pgd_index;
+
+			if (unlikely(pgd_none(*pgd)))
+				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+		}
+
 #ifdef CONFIG_SMP
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		this_cpu_write(cpu_tlbstate.active_mm, next);
 #endif
+
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
 		/*
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 13/13] x86/mm: Improve stack-overflow #PF handling
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (11 preceding siblings ...)
  2016-06-16  0:28 ` [PATCH 12/13] x86/mm/64: Enable vmapped stacks Andy Lutomirski
@ 2016-06-16  0:28 ` Andy Lutomirski
  2016-06-16  6:05 ` [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Heiko Carstens
  2016-06-16 17:24 ` Kees Cook
  14 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  0:28 UTC (permalink / raw)
  To: linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

If we get a page fault indicating kernel stack overflow, invoke
handle_stack_overflow().  To prevent us from overflowing the stack
again while handling the overflow (because we are likely to have
very little stack space left), call handle_stack_overflow() on the
double-fault stack

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/traps.h |  6 ++++++
 arch/x86/kernel/traps.c      |  6 +++---
 arch/x86/mm/fault.c          | 39 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c3496619740a..01fd0a7f48cd 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -117,6 +117,12 @@ extern void ist_exit(struct pt_regs *regs);
 extern void ist_begin_non_atomic(struct pt_regs *regs);
 extern void ist_end_non_atomic(void);
 
+#ifdef CONFIG_VMAP_STACK
+void __noreturn handle_stack_overflow(const char *message,
+				      struct pt_regs *regs,
+				      unsigned long fault_address);
+#endif
+
 /* Interrupts/Exceptions */
 enum {
 	X86_TRAP_DE = 0,	/*  0, Divide-by-zero */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9cb7ea781176..b389c0539eb9 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -293,9 +293,9 @@ DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
 #ifdef CONFIG_VMAP_STACK
-static void __noreturn handle_stack_overflow(const char *message,
-					     struct pt_regs *regs,
-					     unsigned long fault_address)
+__visible void __noreturn handle_stack_overflow(const char *message,
+						struct pt_regs *regs,
+						unsigned long fault_address)
 {
 	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
 		 (void *)fault_address, current->stack,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7d1fa7cd2374..c68b81f5659f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -753,6 +753,45 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * Stack overflow?  During boot, we can fault near the initial
+	 * stack in the direct map, but that's not an overflow -- check
+	 * that we're in vmalloc space to avoid this.
+	 *
+	 * Check this after trying fixup_exception, since there are handful
+	 * of kernel code paths that wander off the top of the stack but
+	 * handle any faults that occur.  Once those are fixed, we can
+	 * move this above fixup_exception.
+	 */
+	if (is_vmalloc_addr((void *)address) &&
+	    (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
+	     address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
+		register void *__sp asm("rsp");
+		unsigned long stack =
+			this_cpu_read(orig_ist.ist[DOUBLEFAULT_STACK]) -
+			sizeof(void *);
+		/*
+		 * We're likely to be running with very little stack space
+		 * left.  It's plausible that we'd hit this condition but
+		 * double-fault even before we get this far, in which case
+		 * we're fine: the double-fault handler will deal with it.
+		 *
+		 * We don't want to make it all the way into the oops code
+		 * and then double-fault, though, because we're likely to
+		 * break the console driver and lose most of the stack dump.
+		 */
+		asm volatile ("movq %[stack], %%rsp\n\t"
+			      "call handle_stack_overflow\n\t"
+			      "1: jmp 1b"
+			      : "+r" (__sp)
+			      : "D" ("kernel stack overflow (page fault)"),
+				"S" (regs), "d" (address),
+				[stack] "rm" (stack));
+		unreachable();
+	}
+#endif
+
 	/*
 	 * 32-bit:
 	 *
-- 
2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/13] x86/mm/64: Enable vmapped stacks
  2016-06-16  0:28 ` [PATCH 12/13] x86/mm/64: Enable vmapped stacks Andy Lutomirski
@ 2016-06-16  4:17   ` Mika Penttilä
  2016-06-16  5:33     ` Andy Lutomirski
  0 siblings, 1 reply; 60+ messages in thread
From: Mika Penttilä @ 2016-06-16  4:17 UTC (permalink / raw)
  To: Andy Lutomirski, linux-kernel, x86, Borislav Petkov
  Cc: Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf

Hi,

On 06/16/2016 03:28 AM, Andy Lutomirski wrote:
> This allows x86_64 kernels to enable vmapped stacks.  There are a
> couple of interesting bits.
> 
> First, x86 lazily faults in top-level paging entries for the vmalloc
> area.  This won't work if we get a page fault while trying to access
> the stack: the CPU will promote it to a double-fault and we'll die.
> To avoid this problem, probe the new stack when switching stacks and
> forcibly populate the pgd entry for the stack when switching mms.
> 
> Second, once we have guard pages around the stack, we'll want to
> detect and handle stack overflow.
> 
> I didn't enable it on x86_32.  We'd need to rework the double-fault
> code a bit and I'm concerned about running out of vmalloc virtual
> addresses under some workloads.
> 
> This patch, by itself, will behave somewhat erratically when the
> stack overflows while RSP is still more than a few tens of bytes
> above the bottom of the stack.  Specifically, we'll get #PF and make
> it to no_context and an oops without triggering a double-fault, and
> no_context doesn't know about stack overflows.  The next patch will
> improve that case.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/Kconfig                 |  1 +
>  arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++++++++++++-
>  arch/x86/kernel/traps.c          | 32 ++++++++++++++++++++++++++++++++
>  arch/x86/mm/tlb.c                | 15 +++++++++++++++
>  4 files changed, 75 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0a7b885964ba..b624b24d1dc1 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -92,6 +92,7 @@ config X86
>  	select HAVE_ARCH_TRACEHOOK
>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>  	select HAVE_EBPF_JIT			if X86_64
> +	select HAVE_ARCH_VMAP_STACK		if X86_64
>  	select HAVE_CC_STACKPROTECTOR
>  	select HAVE_CMPXCHG_DOUBLE
>  	select HAVE_CMPXCHG_LOCAL
> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index 8f321a1b03a1..14e4b20f0aaf 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -8,6 +8,28 @@ struct tss_struct;
>  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>  		      struct tss_struct *tss);
>  
> +/* This runs runs on the previous thread's stack. */
> +static inline void prepare_switch_to(struct task_struct *prev,
> +				     struct task_struct *next)
> +{
> +#ifdef CONFIG_VMAP_STACK
> +	/*
> +	 * If we switch to a stack that has a top-level paging entry
> +	 * that is not present in the current mm, the resulting #PF will
> +	 * will be promoted to a double-fault and we'll panic.  Probe
> +	 * the new stack now so that vmalloc_fault can fix up the page
> +	 * tables if needed.  This can only happen if we use a stack
> +	 * in vmap space.
> +	 *
> +	 * We assume that the stack is aligned so that it never spans
> +	 * more than one top-level paging entry.
> +	 *
> +	 * To minimize cache pollution, just follow the stack pointer.
> +	 */
> +	READ_ONCE(*(unsigned char *)next->thread.sp);
> +#endif
> +}
> +
>  #ifdef CONFIG_X86_32
>  
>  #ifdef CONFIG_CC_STACKPROTECTOR
> @@ -39,6 +61,8 @@ do {									\
>  	 */								\
>  	unsigned long ebx, ecx, edx, esi, edi;				\
>  									\
> +	prepare_switch_to(prev, next);					\
> +									\
>  	asm volatile("pushl %%ebp\n\t"		/* save    EBP   */	\
>  		     "movl %%esp,%[prev_sp]\n\t"	/* save    ESP   */ \
>  		     "movl %[next_sp],%%esp\n\t"	/* restore ESP   */ \
> @@ -103,7 +127,9 @@ do {									\
>   * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
>   * has no effect.
>   */
> -#define switch_to(prev, next, last) \
> +#define switch_to(prev, next, last)					  \
> +	prepare_switch_to(prev, next);					  \
> +									  \
>  	asm volatile(SAVE_CONTEXT					  \
>  	     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */	  \
>  	     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */	  \
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 00f03d82e69a..9cb7ea781176 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present",	segment_not_present)
>  DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
>  DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
>  
> +#ifdef CONFIG_VMAP_STACK
> +static void __noreturn handle_stack_overflow(const char *message,
> +					     struct pt_regs *regs,
> +					     unsigned long fault_address)
> +{
> +	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
> +		 (void *)fault_address, current->stack,
> +		 (char *)current->stack + THREAD_SIZE - 1);
> +	die(message, regs, 0);
> +
> +	/* Be absolutely certain we don't return. */
> +	panic(message);
> +}
> +#endif
> +
>  #ifdef CONFIG_X86_64
>  /* Runs on IST stack */
>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>  {
>  	static const char str[] = "double fault";
>  	struct task_struct *tsk = current;
> +#ifdef CONFIG_VMAP_STACK
> +	unsigned long cr2;
> +#endif
>  
>  #ifdef CONFIG_X86_ESPFIX64
>  	extern unsigned char native_irq_return_iret[];
> @@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>  	tsk->thread.error_code = error_code;
>  	tsk->thread.trap_nr = X86_TRAP_DF;
>  
> +#ifdef CONFIG_VMAP_STACK
> +	/*
> +	 * If we overflow the stack into a guard page, the CPU will fail
> +	 * to deliver #PF and will send #DF instead.  CR2 will contain
> +	 * the linear address of the second fault, which will be in the
> +	 * guard page below the bottom of the stack.
> +	 */
> +	cr2 = read_cr2();
> +	if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
> +		handle_stack_overflow(
> +			"kernel stack overflow (double-fault)",
> +			regs, cr2);
> +#endif
> +
>  #ifdef CONFIG_DOUBLEFAULT
>  	df_debug(regs, error_code);
>  #endif
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 5643fd0b1a7d..fbf036ae72ac 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -77,10 +77,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  	unsigned cpu = smp_processor_id();
>  
>  	if (likely(prev != next)) {
> +		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> +			/*
> +			 * If our current stack is in vmalloc space and isn't
> +			 * mapped in the new pgd, we'll double-fault.  Forcibly
> +			 * map it.
> +			 */
> +			unsigned int stack_pgd_index =
> +				pgd_index(current_stack_pointer());


stack pointer is still the previous task's, current_stack_pointer() returns that, not
next task's which was intention I guess. Things may happen to work if on same pgd, but at least the
boot cpu init_task_struct is special.


> +			pgd_t *pgd = next->pgd + stack_pgd_index;
> +
> +			if (unlikely(pgd_none(*pgd)))
> +				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
> +		}
> +
>  #ifdef CONFIG_SMP
>  		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
>  		this_cpu_write(cpu_tlbstate.active_mm, next);
>  #endif
> +
>  		cpumask_set_cpu(cpu, mm_cpumask(next));
>  
>  		/*
> 

--Mika

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/13] x86/mm/64: Enable vmapped stacks
  2016-06-16  4:17   ` Mika Penttilä
@ 2016-06-16  5:33     ` Andy Lutomirski
  2016-06-16 13:11       ` [kernel-hardening] " Rik van Riel
  0 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16  5:33 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Nadav Amit, Kees Cook, Josh Poimboeuf, Borislav Petkov,
	kernel-hardening, Brian Gerst, linux-kernel, X86 ML,
	Linus Torvalds

On Jun 15, 2016 9:32 PM, "Mika Penttilä" <mika.penttila@nextfour.com> wrote:
>
> Hi,
>

> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 5643fd0b1a7d..fbf036ae72ac 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -77,10 +77,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> >       unsigned cpu = smp_processor_id();
> >
> >       if (likely(prev != next)) {
> > +             if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> > +                     /*
> > +                      * If our current stack is in vmalloc space and isn't
> > +                      * mapped in the new pgd, we'll double-fault.  Forcibly
> > +                      * map it.
> > +                      */
> > +                     unsigned int stack_pgd_index =
> > +                             pgd_index(current_stack_pointer());
>
>
> stack pointer is still the previous task's, current_stack_pointer() returns that, not
> next task's which was intention I guess. Things may happen to work if on same pgd, but at least the
> boot cpu init_task_struct is special.

This is intentional.  When switching processes, we first switch the mm
and then switch the task.  We need to make sure that the prev stack is
mapped in the new mm or we'll double-fault and die after switching the
mm which still trying to execute on the old stack.

The change to switch_to makes sure that the new stack is mapped.

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (12 preceding siblings ...)
  2016-06-16  0:28 ` [PATCH 13/13] x86/mm: Improve stack-overflow #PF handling Andy Lutomirski
@ 2016-06-16  6:05 ` Heiko Carstens
  2016-06-16 17:50   ` Andy Lutomirski
  2016-06-17  3:58   ` Andy Lutomirski
  2016-06-16 17:24 ` Kees Cook
  14 siblings, 2 replies; 60+ messages in thread
From: Heiko Carstens @ 2016-06-16  6:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, x86, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf

On Wed, Jun 15, 2016 at 05:28:22PM -0700, Andy Lutomirski wrote:
> Since the dawn of time, a kernel stack overflow has been a real PITA
> to debug, has caused nondeterministic crashes some time after the
> actual overflow, and has generally been easy to exploit for root.
> 
> With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
> that enable it (just x86 for now) get virtually mapped stacks with
> guard pages.  This causes reliable faults when the stack overflows.
> 
> If the arch implements it well, we get a nice OOPS on stack overflow
> (as opposed to panicing directly or otherwise exploding badly).  On
> x86, the OOPS is nice, has a usable call trace, and the overflowing
> task is killed cleanly.

Do you have numbers which reflect the performance impact of this change?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/13] mm: Track NR_KERNEL_STACK in pages instead of number of stacks
  2016-06-16  0:28 ` [PATCH 04/13] mm: Track NR_KERNEL_STACK in pages instead of number of stacks Andy Lutomirski
@ 2016-06-16 11:10   ` Vladimir Davydov
  2016-06-16 17:21     ` Andy Lutomirski
  2016-06-16 15:33   ` Josh Poimboeuf
  1 sibling, 1 reply; 60+ messages in thread
From: Vladimir Davydov @ 2016-06-16 11:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, x86, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Johannes Weiner, Michal Hocko, linux-mm

On Wed, Jun 15, 2016 at 05:28:26PM -0700, Andy Lutomirski wrote:
...
> @@ -225,7 +225,8 @@ static void account_kernel_stack(struct thread_info *ti, int account)
>  {
>  	struct zone *zone = page_zone(virt_to_page(ti));
>  
> -	mod_zone_page_state(zone, NR_KERNEL_STACK, account);
> +	mod_zone_page_state(zone, NR_KERNEL_STACK,
> +			    THREAD_SIZE / PAGE_SIZE * account);

It won't work if THREAD_SIZE < PAGE_SIZE. Is there an arch with such a
thread size, anyway? If no, we should probably drop thread_info_cache.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/13] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp
  2016-06-16  0:28 ` [PATCH 09/13] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
@ 2016-06-16 11:56   ` Borislav Petkov
  2016-07-08 12:07   ` [tip:x86/debug] x86/dumpstack: Honor supplied @regs arg tip-bot for Andy Lutomirski
  1 sibling, 0 replies; 60+ messages in thread
From: Borislav Petkov @ 2016-06-16 11:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, x86, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf

On Wed, Jun 15, 2016 at 05:28:31PM -0700, Andy Lutomirski wrote:
> The comment suggests that show_stack(NULL, NULL) should backtrace
> the current context, but the code doesn't match the comment.  If
> regs are given, start the "Stack:" hexdump at regs->sp.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Very nice:

[    0.020008] unchecked MSR access error: RDMSR from 0x1b0 at rIP: 0xffffffff8102849f (init_intel_energy_perf.part.3+0xf/0xd0)
[    0.024003]  0000000000000000 ffffffff81c03f10 ffffffff81028a2f 0000000000000000
[    0.028799]  ffffffff81cad5e0 ffffffff81d5e920 ffffffff81c03f38 ffffffff810271f5
[    0.034887]  ffffffffffffffff ffff88007efd18c0 ffffffff81d5e920 ffffffff81c03f48
[    0.038883] Call Trace:
[    0.040009]  [<ffffffff81028a2f>] init_intel+0xdf/0x2b0
[    0.042055]  [<ffffffff810271f5>] identify_cpu+0x2f5/0x4f0
[    0.044005]  [<ffffffff81cdc198>] identify_boot_cpu+0x10/0x7a
[    0.045577]  [<ffffffff81cdc236>] check_bugs+0x9/0x2d
[    0.047033]  [<ffffffff81cd1ece>] start_kernel+0x3bd/0x3d9
[    0.048005]  [<ffffffff81cd1294>] x86_64_start_reservations+0x2f/0x31
[    0.049668]  [<ffffffff81cd13fe>] x86_64_start_kernel+0x168/0x176

That's exactly what I wanted to see! Thanks for the hints.

Here's what I did:

---
diff --git a/arch/x86/include/asm/kdebug.h b/arch/x86/include/asm/kdebug.h
index e5f5dc9787d5..1ef9d581b5d9 100644
--- a/arch/x86/include/asm/kdebug.h
+++ b/arch/x86/include/asm/kdebug.h
@@ -26,6 +26,7 @@ extern void die(const char *, struct pt_regs *,long);
 extern int __must_check __die(const char *, struct pt_regs *, long);
 extern void show_trace(struct task_struct *t, struct pt_regs *regs,
 		       unsigned long *sp, unsigned long bp);
+extern void show_stack_regs(struct pt_regs *regs);
 extern void __show_regs(struct pt_regs *regs, int all);
 extern unsigned long oops_begin(void);
 extern void oops_end(unsigned long, struct pt_regs *, int signr);
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 400a2e17c1d1..0c3436535b4a 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -206,6 +206,11 @@ void show_stack(struct task_struct *task, unsigned long *sp)
 	show_stack_log_lvl(task, NULL, sp, bp, "");
 }
 
+void show_stack_regs(struct pt_regs *regs)
+{
+	show_stack_log_lvl(current, regs, (unsigned long *)regs->sp, regs->bp, "");
+}
+
 static arch_spinlock_t die_lock = __ARCH_SPIN_LOCK_UNLOCKED;
 static int die_owner = -1;
 static unsigned int die_nest_count;
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 4bb53b89f3c5..fafc771568c7 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -1,6 +1,7 @@
 #include <linux/module.h>
 #include <asm/uaccess.h>
 #include <asm/traps.h>
+#include <asm/kdebug.h>
 
 typedef bool (*ex_handler_t)(const struct exception_table_entry *,
 			    struct pt_regs *, int);
@@ -46,8 +47,9 @@ EXPORT_SYMBOL(ex_handler_ext);
 bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup,
 			     struct pt_regs *regs, int trapnr)
 {
-	WARN_ONCE(1, "unchecked MSR access error: RDMSR from 0x%x\n",
-		  (unsigned int)regs->cx);
+	if (pr_warn_once("unchecked MSR access error: RDMSR from 0x%x at rIP: 0x%lx (%pF)\n",
+			 (unsigned int)regs->cx, regs->ip, (void *)regs->ip))
+		show_stack_regs(regs);
 
 	/* Pretend that the read succeeded and returned 0. */
 	regs->ip = ex_fixup_addr(fixup);
@@ -60,9 +62,10 @@ EXPORT_SYMBOL(ex_handler_rdmsr_unsafe);
 bool ex_handler_wrmsr_unsafe(const struct exception_table_entry *fixup,
 			     struct pt_regs *regs, int trapnr)
 {
-	WARN_ONCE(1, "unchecked MSR access error: WRMSR to 0x%x (tried to write 0x%08x%08x)\n",
-		  (unsigned int)regs->cx,
-		  (unsigned int)regs->dx, (unsigned int)regs->ax);
+	if (pr_warn_once("unchecked MSR access error: WRMSR to 0x%x (tried to write 0x%08x%08x) at rIP: 0x%lx (%pF)\n",
+			 (unsigned int)regs->cx, (unsigned int)regs->dx,
+			 (unsigned int)regs->ax,  regs->ip, (void *)regs->ip))
+		show_stack_regs(regs);
 
 	/* Pretend that the write succeeded. */
 	regs->ip = ex_fixup_addr(fixup);
diff --git a/include/linux/printk.h b/include/linux/printk.h
index f4da695fd615..0a961657e2a0 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -309,20 +309,25 @@ extern asmlinkage void dump_stack(void) __cold;
 #define printk_once(fmt, ...)					\
 ({								\
 	static bool __print_once __read_mostly;			\
+	int __ret_print_once = !__print_once;			\
 								\
 	if (!__print_once) {					\
 		__print_once = true;				\
 		printk(fmt, ##__VA_ARGS__);			\
 	}							\
+	__ret_print_once;					\
 })
+
 #define printk_deferred_once(fmt, ...)				\
 ({								\
 	static bool __print_once __read_mostly;			\
+	int __ret_print_once = !__print_once;			\
 								\
 	if (!__print_once) {					\
 		__print_once = true;				\
 		printk_deferred(fmt, ##__VA_ARGS__);		\
 	}							\
+	__ret_print_once;					\
 })
 #else
 #define printk_once(fmt, ...)					\

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 12/13] x86/mm/64: Enable vmapped stacks
  2016-06-16  5:33     ` Andy Lutomirski
@ 2016-06-16 13:11       ` Rik van Riel
  0 siblings, 0 replies; 60+ messages in thread
From: Rik van Riel @ 2016-06-16 13:11 UTC (permalink / raw)
  To: kernel-hardening, Mika Penttilä
  Cc: Nadav Amit, Kees Cook, Josh Poimboeuf, Borislav Petkov,
	Brian Gerst, linux-kernel, X86 ML, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 2006 bytes --]

On Wed, 2016-06-15 at 22:33 -0700, Andy Lutomirski wrote:
> 
> > > +++ b/arch/x86/mm/tlb.c
> > > @@ -77,10 +77,25 @@ void switch_mm_irqs_off(struct mm_struct
> > > *prev, struct mm_struct *next,
> > >       unsigned cpu = smp_processor_id();
> > > 
> > >       if (likely(prev != next)) {
> > > +             if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> > > +                     /*
> > > +                      * If our current stack is in vmalloc space
> > > and isn't
> > > +                      * mapped in the new pgd, we'll double-
> > > fault.  Forcibly
> > > +                      * map it.
> > > +                      */
> > > +                     unsigned int stack_pgd_index =
> > > +                             pgd_index(current_stack_pointer());
> > 
> > stack pointer is still the previous task's, current_stack_pointer()
> > returns that, not
> > next task's which was intention I guess. Things may happen to work
> > if on same pgd, but at least the
> > boot cpu init_task_struct is special.
> This is intentional.  When switching processes, we first switch the
> mm
> and then switch the task.  We need to make sure that the prev stack
> is
> mapped in the new mm or we'll double-fault and die after switching
> the
> mm which still trying to execute on the old stack.
> 
> The change to switch_to makes sure that the new stack is mapped.
> 

On a HARDENED_USERCOPY tangential note: by not allowing
copy_to/from_user access to vmalloc memory by default,
with exception of the stack, a task will only be able
to copy_to/from_user from its own stack, not another task's
stack, at least using the kernel virtual address the
kernel uses to access that stack.

This can be accomplished by simply not adding any vmalloc
checking code to the current HARDENED_USERCOPY patch set :)

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/13] mm: Track NR_KERNEL_STACK in pages instead of number of stacks
  2016-06-16  0:28 ` [PATCH 04/13] mm: Track NR_KERNEL_STACK in pages instead of number of stacks Andy Lutomirski
  2016-06-16 11:10   ` Vladimir Davydov
@ 2016-06-16 15:33   ` Josh Poimboeuf
  2016-06-16 17:39     ` Andy Lutomirski
  1 sibling, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2016-06-16 15:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, x86, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, linux-mm

On Wed, Jun 15, 2016 at 05:28:26PM -0700, Andy Lutomirski wrote:
> Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
> zone.  This only makes sense if each kernel stack exists entirely in
> one zone, and allowing vmapped stacks could break this assumption.
> 
> It turns out that the code for tracking kernel stack allocations in
> units of pages is slightly simpler, so just switch to counting
> pages.
> 
> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  fs/proc/meminfo.c | 2 +-
>  kernel/fork.c     | 3 ++-
>  mm/page_alloc.c   | 3 +--
>  3 files changed, 4 insertions(+), 4 deletions(-)

You missed another usage of NR_KERNEL_STACK in drivers/base/node.c.


-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/13] mm: Track NR_KERNEL_STACK in pages instead of number of stacks
  2016-06-16 11:10   ` Vladimir Davydov
@ 2016-06-16 17:21     ` Andy Lutomirski
  2016-06-16 19:20       ` Andy Lutomirski
  0 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16 17:21 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Johannes Weiner, Michal Hocko,
	linux-mm

On Thu, Jun 16, 2016 at 4:10 AM, Vladimir Davydov
<vdavydov@virtuozzo.com> wrote:
> On Wed, Jun 15, 2016 at 05:28:26PM -0700, Andy Lutomirski wrote:
> ...
>> @@ -225,7 +225,8 @@ static void account_kernel_stack(struct thread_info *ti, int account)
>>  {
>>       struct zone *zone = page_zone(virt_to_page(ti));
>>
>> -     mod_zone_page_state(zone, NR_KERNEL_STACK, account);
>> +     mod_zone_page_state(zone, NR_KERNEL_STACK,
>> +                         THREAD_SIZE / PAGE_SIZE * account);
>
> It won't work if THREAD_SIZE < PAGE_SIZE. Is there an arch with such a
> thread size, anyway? If no, we should probably drop thread_info_cache.

On a quick grep, I can't find any.  I'll add a BUILD_BUG_ON for now.

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (13 preceding siblings ...)
  2016-06-16  6:05 ` [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Heiko Carstens
@ 2016-06-16 17:24 ` Kees Cook
  14 siblings, 0 replies; 60+ messages in thread
From: Kees Cook @ 2016-06-16 17:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, x86, Borislav Petkov, Nadav Amit, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf

On Wed, Jun 15, 2016 at 5:28 PM, Andy Lutomirski <luto@kernel.org> wrote:
> Since the dawn of time, a kernel stack overflow has been a real PITA
> to debug, has caused nondeterministic crashes some time after the
> actual overflow, and has generally been easy to exploit for root.
>
> With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
> that enable it (just x86 for now) get virtually mapped stacks with
> guard pages.  This causes reliable faults when the stack overflows.
>
> If the arch implements it well, we get a nice OOPS on stack overflow
> (as opposed to panicing directly or otherwise exploding badly).  On
> x86, the OOPS is nice, has a usable call trace, and the overflowing
> task is killed cleanly.
>
> This does not address interrupt stacks.
>
> Andy Lutomirski (12):
>   x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
>   x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately
>   mm: Track NR_KERNEL_STACK in pages instead of number of stacks
>   mm: Move memcg stack accounting to account_kernel_stack
>   fork: Add generic vmalloced stack support
>   x86/die: Don't try to recover from an OOPS on a non-default stack
>   x86/dumpstack: When OOPSing, rewind the stack before do_exit
>   x86/dumpstack: When dumping stack bytes due to OOPS, start with
>     regs->sp
>   x86/dumpstack: Try harder to get a call trace on stack overflow
>   x86/dumpstack/64: Handle faults when printing the "Stack:" part of an
>     OOPS
>   x86/mm/64: Enable vmapped stacks
>   x86/mm: Improve stack-overflow #PF handling
>
> Ingo Molnar (1):
>   x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
>
>  arch/Kconfig                     | 12 ++++++++++
>  arch/x86/Kconfig                 |  1 +
>  arch/x86/entry/entry_32.S        | 11 +++++++++
>  arch/x86/entry/entry_64.S        | 11 +++++++++
>  arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++++++-
>  arch/x86/include/asm/traps.h     |  6 +++++
>  arch/x86/kernel/dumpstack.c      | 17 +++++++++++++-
>  arch/x86/kernel/dumpstack_32.c   |  4 +++-
>  arch/x86/kernel/dumpstack_64.c   | 16 ++++++++++---
>  arch/x86/kernel/traps.c          | 32 +++++++++++++++++++++++++
>  arch/x86/mm/fault.c              | 39 ++++++++++++++++++++++++++++++
>  arch/x86/mm/init_64.c            | 27 ---------------------
>  arch/x86/mm/pageattr.c           |  7 +++++-
>  arch/x86/mm/tlb.c                | 15 ++++++++++++
>  fs/proc/meminfo.c                |  2 +-
>  kernel/fork.c                    | 51 ++++++++++++++++++++++++++++++----------
>  mm/page_alloc.c                  |  3 +--
>  17 files changed, 233 insertions(+), 49 deletions(-)

This is great, thanks! This helps the up-coming usercopy
self-protection code, and makes stack overflows a much less
interesting target for attackers.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 06/13] fork: Add generic vmalloced stack support
  2016-06-16  0:28 ` [PATCH 06/13] fork: Add generic vmalloced stack support Andy Lutomirski
@ 2016-06-16 17:25   ` Kees Cook
  2016-06-16 17:37     ` Andy Lutomirski
  0 siblings, 1 reply; 60+ messages in thread
From: Kees Cook @ 2016-06-16 17:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, x86, Borislav Petkov, Nadav Amit, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf

On Wed, Jun 15, 2016 at 5:28 PM, Andy Lutomirski <luto@kernel.org> wrote:
> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
> vmalloc_node.
>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/Kconfig  | 12 ++++++++++++
>  kernel/fork.c | 45 +++++++++++++++++++++++++++++++++++++--------
>  2 files changed, 49 insertions(+), 8 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index d794384a0404..1acd262036b0 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -658,4 +658,16 @@ config ARCH_NO_COHERENT_DMA_MMAP
>  config CPU_NO_EFFICIENT_FFS
>         def_bool n
>
> +config HAVE_ARCH_VMAP_STACK
> +       def_bool n

In the style of HAVE_ARCH_SECCOMP_FILTER, can you detail all the
things an architecture needs to do to correctly support a vmap stack?
This will help with arch porting.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 06/13] fork: Add generic vmalloced stack support
  2016-06-16 17:25   ` Kees Cook
@ 2016-06-16 17:37     ` Andy Lutomirski
  0 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16 17:37 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, linux-kernel, x86, Borislav Petkov, Nadav Amit,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf

On Thu, Jun 16, 2016 at 10:25 AM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Jun 15, 2016 at 5:28 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
>> vmalloc_node.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/Kconfig  | 12 ++++++++++++
>>  kernel/fork.c | 45 +++++++++++++++++++++++++++++++++++++--------
>>  2 files changed, 49 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index d794384a0404..1acd262036b0 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -658,4 +658,16 @@ config ARCH_NO_COHERENT_DMA_MMAP
>>  config CPU_NO_EFFICIENT_FFS
>>         def_bool n
>>
>> +config HAVE_ARCH_VMAP_STACK
>> +       def_bool n
>
> In the style of HAVE_ARCH_SECCOMP_FILTER, can you detail all the
> things an architecture needs to do to correctly support a vmap stack?
> This will help with arch porting.

Done.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/13] mm: Track NR_KERNEL_STACK in pages instead of number of stacks
  2016-06-16 15:33   ` Josh Poimboeuf
@ 2016-06-16 17:39     ` Andy Lutomirski
  2016-06-16 19:39       ` Josh Poimboeuf
  0 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16 17:39 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	linux-mm

On Thu, Jun 16, 2016 at 8:33 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Wed, Jun 15, 2016 at 05:28:26PM -0700, Andy Lutomirski wrote:
>> Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
>> zone.  This only makes sense if each kernel stack exists entirely in
>> one zone, and allowing vmapped stacks could break this assumption.
>>
>> It turns out that the code for tracking kernel stack allocations in
>> units of pages is slightly simpler, so just switch to counting
>> pages.
>>
>> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: linux-mm@kvack.org
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  fs/proc/meminfo.c | 2 +-
>>  kernel/fork.c     | 3 ++-
>>  mm/page_alloc.c   | 3 +--
>>  3 files changed, 4 insertions(+), 4 deletions(-)
>
> You missed another usage of NR_KERNEL_STACK in drivers/base/node.c.

Thanks.

The real reason I cc'd you was so you could look at
rewind_stack_do_exit and the sneaky trick I did in no_context in the
last patch, though.  :)  Both survive objtool, but I figured I'd check
with objtool's author as well.  If there was a taint bit I could set
saying "kernel is hosed -- don't try to apply live patches any more",
I'd have extra confidence.

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-16  6:05 ` [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Heiko Carstens
@ 2016-06-16 17:50   ` Andy Lutomirski
  2016-06-16 18:14     ` Andy Lutomirski
  2016-06-17  3:58   ` Andy Lutomirski
  1 sibling, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16 17:50 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf

On Wed, Jun 15, 2016 at 11:05 PM, Heiko Carstens
<heiko.carstens@de.ibm.com> wrote:
> On Wed, Jun 15, 2016 at 05:28:22PM -0700, Andy Lutomirski wrote:
>> Since the dawn of time, a kernel stack overflow has been a real PITA
>> to debug, has caused nondeterministic crashes some time after the
>> actual overflow, and has generally been easy to exploit for root.
>>
>> With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
>> that enable it (just x86 for now) get virtually mapped stacks with
>> guard pages.  This causes reliable faults when the stack overflows.
>>
>> If the arch implements it well, we get a nice OOPS on stack overflow
>> (as opposed to panicing directly or otherwise exploding badly).  On
>> x86, the OOPS is nice, has a usable call trace, and the overflowing
>> task is killed cleanly.
>
> Do you have numbers which reflect the performance impact of this change?
>

Hmm.  My attempt to benchmark it caused some of the vmalloc core code
to hang.  I'll dig around.

FWIW, I expect some overhead on clone/fork (if it's high, then that
would be a good reason to improve vmalloc) and a small
workload-dependent overhead due to increased TLB pressure.

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 08/13] x86/dumpstack: When OOPSing, rewind the stack before do_exit
  2016-06-16  0:28 ` [PATCH 08/13] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
@ 2016-06-16 17:50   ` Josh Poimboeuf
  2016-06-16 17:57     ` Andy Lutomirski
  0 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2016-06-16 17:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, x86, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds

On Wed, Jun 15, 2016 at 05:28:30PM -0700, Andy Lutomirski wrote:
> If we call do_exit with a clean stack, we greatly reduce the risk of
> recursive oopses due to stack overflow in do_exit, and we allow
> do_exit to work even if we OOPS from an IST stack.  The latter gives
> us a much better chance of surviving long enough after we detect a
> stack overflow to write out our logs.
> 
> I intentionally separated this from the preceding patch that
> disables do_exit-on-OOPS on IST stacks.  This way, if we need to
> revert this patch, we still end up in an acceptable state wrt stack
> overflow handling.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/entry_32.S   | 11 +++++++++++
>  arch/x86/entry/entry_64.S   | 11 +++++++++++
>  arch/x86/kernel/dumpstack.c | 13 +++++++++----
>  3 files changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 983e5d3a0d27..1499db695a88 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -1153,3 +1153,14 @@ ENTRY(async_page_fault)
>  	jmp	error_code
>  END(async_page_fault)
>  #endif
> +
> +ENTRY(rewind_stack_do_exit)
> +	/* Prevent any naive code from trying to unwind to our caller. */
> +	xorl	%ebp, %ebp
> +
> +	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esi
> +	leal	-TOP_OF_KERNEL_STACK_PADDING-PT_OLDSS(%esi), %esp
> +
> +	call	do_exit
> +1:	jmp 1b
> +END(rewind_stack_do_exit)
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 9ee0da1807ed..394cad73e890 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1423,3 +1423,14 @@ ENTRY(ignore_sysret)
>  	mov	$-ENOSYS, %eax
>  	sysret
>  END(ignore_sysret)
> +
> +ENTRY(rewind_stack_do_exit)
> +	/* Prevent any naive code from trying to unwind to our caller. */
> +	xorl	%ebp, %ebp
> +
> +	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rax
> +	leaq	-TOP_OF_KERNEL_STACK_PADDING-SS(%rax), %rsp

I think this should be:

	leaq	-TOP_OF_KERNEL_STACK_PADDING-SIZEOF_PTREGS, %rsp

That way when it calls do_exit(), the stack frame will be placed at the
conventional spot where a smart unwinder would expect to find it.

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 08/13] x86/dumpstack: When OOPSing, rewind the stack before do_exit
  2016-06-16 17:50   ` Josh Poimboeuf
@ 2016-06-16 17:57     ` Andy Lutomirski
  0 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16 17:57 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds

On Thu, Jun 16, 2016 at 10:50 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Wed, Jun 15, 2016 at 05:28:30PM -0700, Andy Lutomirski wrote:
>> If we call do_exit with a clean stack, we greatly reduce the risk of
>> recursive oopses due to stack overflow in do_exit, and we allow
>> do_exit to work even if we OOPS from an IST stack.  The latter gives
>> us a much better chance of surviving long enough after we detect a
>> stack overflow to write out our logs.
>>
>> I intentionally separated this from the preceding patch that
>> disables do_exit-on-OOPS on IST stacks.  This way, if we need to
>> revert this patch, we still end up in an acceptable state wrt stack
>> overflow handling.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/entry/entry_32.S   | 11 +++++++++++
>>  arch/x86/entry/entry_64.S   | 11 +++++++++++
>>  arch/x86/kernel/dumpstack.c | 13 +++++++++----
>>  3 files changed, 31 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
>> index 983e5d3a0d27..1499db695a88 100644
>> --- a/arch/x86/entry/entry_32.S
>> +++ b/arch/x86/entry/entry_32.S
>> @@ -1153,3 +1153,14 @@ ENTRY(async_page_fault)
>>       jmp     error_code
>>  END(async_page_fault)
>>  #endif
>> +
>> +ENTRY(rewind_stack_do_exit)
>> +     /* Prevent any naive code from trying to unwind to our caller. */
>> +     xorl    %ebp, %ebp
>> +
>> +     movl    PER_CPU_VAR(cpu_current_top_of_stack), %esi
>> +     leal    -TOP_OF_KERNEL_STACK_PADDING-PT_OLDSS(%esi), %esp
>> +
>> +     call    do_exit
>> +1:   jmp 1b
>> +END(rewind_stack_do_exit)
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index 9ee0da1807ed..394cad73e890 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -1423,3 +1423,14 @@ ENTRY(ignore_sysret)
>>       mov     $-ENOSYS, %eax
>>       sysret
>>  END(ignore_sysret)
>> +
>> +ENTRY(rewind_stack_do_exit)
>> +     /* Prevent any naive code from trying to unwind to our caller. */
>> +     xorl    %ebp, %ebp
>> +
>> +     movq    PER_CPU_VAR(cpu_current_top_of_stack), %rax
>> +     leaq    -TOP_OF_KERNEL_STACK_PADDING-SS(%rax), %rsp
>
> I think this should be:
>
>         leaq    -TOP_OF_KERNEL_STACK_PADDING-SIZEOF_PTREGS, %rsp
>
> That way when it calls do_exit(), the stack frame will be placed at the
> conventional spot where a smart unwinder would expect to find it.

Whoops!

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-16 17:50   ` Andy Lutomirski
@ 2016-06-16 18:14     ` Andy Lutomirski
  2016-06-16 21:27       ` Andy Lutomirski
  0 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16 18:14 UTC (permalink / raw)
  To: Heiko Carstens, Paul McKenney
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf

Adding Paul, because RCU blew up.

On Thu, Jun 16, 2016 at 10:50 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Jun 15, 2016 at 11:05 PM, Heiko Carstens
> <heiko.carstens@de.ibm.com> wrote:
>> On Wed, Jun 15, 2016 at 05:28:22PM -0700, Andy Lutomirski wrote:
>>> Since the dawn of time, a kernel stack overflow has been a real PITA
>>> to debug, has caused nondeterministic crashes some time after the
>>> actual overflow, and has generally been easy to exploit for root.
>>>
>>> With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
>>> that enable it (just x86 for now) get virtually mapped stacks with
>>> guard pages.  This causes reliable faults when the stack overflows.
>>>
>>> If the arch implements it well, we get a nice OOPS on stack overflow
>>> (as opposed to panicing directly or otherwise exploding badly).  On
>>> x86, the OOPS is nice, has a usable call trace, and the overflowing
>>> task is killed cleanly.
>>
>> Do you have numbers which reflect the performance impact of this change?
>>
>
> Hmm.  My attempt to benchmark it caused some of the vmalloc core code
> to hang.  I'll dig around.

[  488.482010] Call Trace:
[  488.482389]  <IRQ>  [<ffffffff810da5f6>] sched_show_task+0xb6/0x110
[  488.483341]  [<ffffffff81108c7a>] rcu_check_callbacks+0x83a/0x840
[  488.484226]  [<ffffffff810dd48a>] ? account_system_time+0x7a/0x110
[  488.485157]  [<ffffffff8111c0f0>] ? tick_sched_do_timer+0x30/0x30
[  488.486133]  [<ffffffff8110d314>] update_process_times+0x34/0x60
[  488.487050]  [<ffffffff8111bb51>] tick_sched_handle.isra.13+0x31/0x40
[  488.488018]  [<ffffffff8111c128>] tick_sched_timer+0x38/0x70
[  488.488853]  [<ffffffff8110db2a>] __hrtimer_run_queues+0xda/0x250
[  488.489739]  [<ffffffff8110e263>] hrtimer_interrupt+0xa3/0x190
[  488.490630]  [<ffffffff810952f3>] local_apic_timer_interrupt+0x33/0x50
[  488.491660]  [<ffffffff81095d58>] smp_apic_timer_interrupt+0x38/0x50
[  488.492644]  [<ffffffff8194d022>] apic_timer_interrupt+0x82/0x90
[  488.493502]  [<ffffffff810ee1c0>] ? queued_spin_lock_slowpath+0x20/0x190
[  488.494550]  [<ffffffff8194c21b>] _raw_spin_lock+0x1b/0x20
[  488.495321]  [<ffffffff811b7a54>] find_vmap_area+0x14/0x60
[  488.496197]  [<ffffffff811b8f69>] find_vm_area+0x9/0x20
[  488.496922]  [<ffffffff810afb19>] account_kernel_stack+0x89/0x100
[  488.497885]  [<ffffffff810aff76>] free_task+0x16/0x50
[  488.498599]  [<ffffffff810b0042>] __put_task_struct+0x92/0x120
[  488.499525]  [<ffffffff810b4a66>] delayed_put_task_struct+0x76/0x80
[  488.500348]  [<ffffffff81107969>] rcu_process_callbacks+0x1f9/0x5e0
[  488.501208]  [<ffffffff810b7ca1>] __do_softirq+0xf1/0x280
[  488.501932]  [<ffffffff810b7f7e>] irq_exit+0x9e/0xa0
[  488.502955]  [<ffffffff81095d5d>] smp_apic_timer_interrupt+0x3d/0x50
[  488.503943]  [<ffffffff8194d022>] apic_timer_interrupt+0x82/0x90
[  488.504886]  <EOI>  [<ffffffff8194c20b>] ? _raw_spin_lock+0xb/0x20
[  488.505877]  [<ffffffff811b7c33>] ? __get_vm_area_node+0xc3/0x160
[  488.506812]  [<ffffffff810e013e>] ? task_move_group_fair+0x7e/0x90
[  488.507730]  [<ffffffff811b9360>] __vmalloc_node_range+0x70/0x280
[  488.508689]  [<ffffffff810b1ce5>] ? _do_fork+0xc5/0x370
[  488.509512]  [<ffffffff811ceb9b>] ? kmem_cache_alloc_node+0x7b/0x170
[  488.510502]  [<ffffffff81327418>] ? current_has_perm+0x38/0x40
[  488.511430]  [<ffffffff810b0501>] copy_process.part.46+0x141/0x1760
[  488.512449]  [<ffffffff810b1ce5>] ? _do_fork+0xc5/0x370
[  488.513285]  [<ffffffff8111faf3>] ? do_futex+0x293/0xad0
[  488.514093]  [<ffffffff810df14a>] ? check_preempt_wakeup+0x10a/0x240
[  488.515108]  [<ffffffff810d92c2>] ? wake_up_new_task+0xf2/0x180
[  488.516043]  [<ffffffff810b1ce5>] _do_fork+0xc5/0x370
[  488.516786]  [<ffffffff8112039d>] ? SyS_futex+0x6d/0x150
[  488.517615]  [<ffffffff810b2014>] SyS_clone+0x14/0x20
[  488.518385]  [<ffffffff81002b72>] do_syscall_64+0x52/0xb0
[  488.519239]  [<ffffffff8194c4e1>] entry_SYSCALL64_slow_path+0x25/0x25

The bug seems straightforward: vmap_area_lock is held, the RCU softirq
fires, and vmap_area_lock recurses and deadlocks.  Lockdep agrees with
my assessment and catches the bug immediately on boot.

What's the right fix?  Change all spin_lock calls on vmap_area_lock to
spin_lock_bh?  Somehow ask RCU not to call delayed_put_task_struct
from bh context?  I would naively have expected RCU to only call its
callbacks from thread context, but I was clearly wrong.

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow
  2016-06-16  0:28 ` [PATCH 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
@ 2016-06-16 18:16   ` Josh Poimboeuf
  2016-06-16 18:22     ` Andy Lutomirski
  0 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2016-06-16 18:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, x86, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds

On Wed, Jun 15, 2016 at 05:28:32PM -0700, Andy Lutomirski wrote:
> If we overflow the stack, print_context_stack will abort.  Detect
> this case and rewind back into the valid part of the stack so that
> we can trace it.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/kernel/dumpstack.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index d4d085e27d04..400a2e17c1d1 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -100,6 +100,13 @@ print_context_stack(struct thread_info *tinfo,
>  {
>  	struct stack_frame *frame = (struct stack_frame *)bp;
>  
> +	/*
> +	 * If we overflowed the stack into a guard page, jump back to the
> +	 * bottom of the usable stack.
> +	 */
> +	if ((unsigned long)tinfo - (unsigned long)stack < PAGE_SIZE)
> +		stack = (unsigned long *)tinfo + 1;

That will start walking the stack in the middle of the thread_info
struct.

I think you meant:

		stack = (unsigned long *)(tinfo + 1)

However, thread_info will have been overwritten anyway.  So maybe it
should just be:

		stack = tinfo;

(Though that still wouldn't quite work because the valid_stack_ptr()
check would fail...)

> +
>  	while (valid_stack_ptr(tinfo, stack, sizeof(*stack), end)) {
>  		unsigned long addr;

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow
  2016-06-16 18:16   ` Josh Poimboeuf
@ 2016-06-16 18:22     ` Andy Lutomirski
  2016-06-16 18:33       ` Josh Poimboeuf
  0 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16 18:22 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds

On Thu, Jun 16, 2016 at 11:16 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Wed, Jun 15, 2016 at 05:28:32PM -0700, Andy Lutomirski wrote:
>> If we overflow the stack, print_context_stack will abort.  Detect
>> this case and rewind back into the valid part of the stack so that
>> we can trace it.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/kernel/dumpstack.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
>> index d4d085e27d04..400a2e17c1d1 100644
>> --- a/arch/x86/kernel/dumpstack.c
>> +++ b/arch/x86/kernel/dumpstack.c
>> @@ -100,6 +100,13 @@ print_context_stack(struct thread_info *tinfo,
>>  {
>>       struct stack_frame *frame = (struct stack_frame *)bp;
>>
>> +     /*
>> +      * If we overflowed the stack into a guard page, jump back to the
>> +      * bottom of the usable stack.
>> +      */
>> +     if ((unsigned long)tinfo - (unsigned long)stack < PAGE_SIZE)
>> +             stack = (unsigned long *)tinfo + 1;
>
> That will start walking the stack in the middle of the thread_info
> struct.
>
> I think you meant:
>
>                 stack = (unsigned long *)(tinfo + 1)
>
> However, thread_info will have been overwritten anyway.  So maybe it
> should just be:
>
>                 stack = tinfo;
>
> (Though that still wouldn't quite work because the valid_stack_ptr()
> check would fail...)

I did mean what I wrote, because I wanted to start at the bottom of
the validly allocated area.  IOW I wanted to do the minimum possible
backward jump to make the code display something.

Eventually I want to make thread_info empty, in which case the
distinction won't matter so much.

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow
  2016-06-16 18:22     ` Andy Lutomirski
@ 2016-06-16 18:33       ` Josh Poimboeuf
  2016-06-16 18:37         ` Andy Lutomirski
  0 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2016-06-16 18:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds

On Thu, Jun 16, 2016 at 11:22:14AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 16, 2016 at 11:16 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Wed, Jun 15, 2016 at 05:28:32PM -0700, Andy Lutomirski wrote:
> >> If we overflow the stack, print_context_stack will abort.  Detect
> >> this case and rewind back into the valid part of the stack so that
> >> we can trace it.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >> ---
> >>  arch/x86/kernel/dumpstack.c | 7 +++++++
> >>  1 file changed, 7 insertions(+)
> >>
> >> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> >> index d4d085e27d04..400a2e17c1d1 100644
> >> --- a/arch/x86/kernel/dumpstack.c
> >> +++ b/arch/x86/kernel/dumpstack.c
> >> @@ -100,6 +100,13 @@ print_context_stack(struct thread_info *tinfo,
> >>  {
> >>       struct stack_frame *frame = (struct stack_frame *)bp;
> >>
> >> +     /*
> >> +      * If we overflowed the stack into a guard page, jump back to the
> >> +      * bottom of the usable stack.
> >> +      */
> >> +     if ((unsigned long)tinfo - (unsigned long)stack < PAGE_SIZE)
> >> +             stack = (unsigned long *)tinfo + 1;
> >
> > That will start walking the stack in the middle of the thread_info
> > struct.
> >
> > I think you meant:
> >
> >                 stack = (unsigned long *)(tinfo + 1)
> >
> > However, thread_info will have been overwritten anyway.  So maybe it
> > should just be:
> >
> >                 stack = tinfo;
> >
> > (Though that still wouldn't quite work because the valid_stack_ptr()
> > check would fail...)
> 
> I did mean what I wrote, because I wanted to start at the bottom of
> the validly allocated area.  IOW I wanted to do the minimum possible
> backward jump to make the code display something.

But why the "+ 1"?  Is that a hack to make it pass the valid_stack_ptr()
check?

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow
  2016-06-16 18:33       ` Josh Poimboeuf
@ 2016-06-16 18:37         ` Andy Lutomirski
  2016-06-16 18:54           ` Josh Poimboeuf
  0 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16 18:37 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds

On Thu, Jun 16, 2016 at 11:33 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Jun 16, 2016 at 11:22:14AM -0700, Andy Lutomirski wrote:
>> On Thu, Jun 16, 2016 at 11:16 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Wed, Jun 15, 2016 at 05:28:32PM -0700, Andy Lutomirski wrote:
>> >> If we overflow the stack, print_context_stack will abort.  Detect
>> >> this case and rewind back into the valid part of the stack so that
>> >> we can trace it.
>> >>
>> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> >> ---
>> >>  arch/x86/kernel/dumpstack.c | 7 +++++++
>> >>  1 file changed, 7 insertions(+)
>> >>
>> >> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
>> >> index d4d085e27d04..400a2e17c1d1 100644
>> >> --- a/arch/x86/kernel/dumpstack.c
>> >> +++ b/arch/x86/kernel/dumpstack.c
>> >> @@ -100,6 +100,13 @@ print_context_stack(struct thread_info *tinfo,
>> >>  {
>> >>       struct stack_frame *frame = (struct stack_frame *)bp;
>> >>
>> >> +     /*
>> >> +      * If we overflowed the stack into a guard page, jump back to the
>> >> +      * bottom of the usable stack.
>> >> +      */
>> >> +     if ((unsigned long)tinfo - (unsigned long)stack < PAGE_SIZE)
>> >> +             stack = (unsigned long *)tinfo + 1;
>> >
>> > That will start walking the stack in the middle of the thread_info
>> > struct.
>> >
>> > I think you meant:
>> >
>> >                 stack = (unsigned long *)(tinfo + 1)
>> >
>> > However, thread_info will have been overwritten anyway.  So maybe it
>> > should just be:
>> >
>> >                 stack = tinfo;
>> >
>> > (Though that still wouldn't quite work because the valid_stack_ptr()
>> > check would fail...)
>>
>> I did mean what I wrote, because I wanted to start at the bottom of
>> the validly allocated area.  IOW I wanted to do the minimum possible
>> backward jump to make the code display something.
>
> But why the "+ 1"?  Is that a hack to make it pass the valid_stack_ptr()
> check?

Yes.

But hmm.  Maybe the right fix is to drop the + 1 and to change the
last line of valid_stck_ptr from:

    return p > t && p < t + THREAD_SIZE - size;

to:

    return p >= t && p < t + THREAD_SIZE - size;

The current definition of valid_stack_ptr is certainly nonsensical.
It should either be p >= t or p >= t + 1.

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow
  2016-06-16 18:37         ` Andy Lutomirski
@ 2016-06-16 18:54           ` Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2016-06-16 18:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds

On Thu, Jun 16, 2016 at 11:37:07AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 16, 2016 at 11:33 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Thu, Jun 16, 2016 at 11:22:14AM -0700, Andy Lutomirski wrote:
> >> On Thu, Jun 16, 2016 at 11:16 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Wed, Jun 15, 2016 at 05:28:32PM -0700, Andy Lutomirski wrote:
> >> >> If we overflow the stack, print_context_stack will abort.  Detect
> >> >> this case and rewind back into the valid part of the stack so that
> >> >> we can trace it.
> >> >>
> >> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >> >> ---
> >> >>  arch/x86/kernel/dumpstack.c | 7 +++++++
> >> >>  1 file changed, 7 insertions(+)
> >> >>
> >> >> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> >> >> index d4d085e27d04..400a2e17c1d1 100644
> >> >> --- a/arch/x86/kernel/dumpstack.c
> >> >> +++ b/arch/x86/kernel/dumpstack.c
> >> >> @@ -100,6 +100,13 @@ print_context_stack(struct thread_info *tinfo,
> >> >>  {
> >> >>       struct stack_frame *frame = (struct stack_frame *)bp;
> >> >>
> >> >> +     /*
> >> >> +      * If we overflowed the stack into a guard page, jump back to the
> >> >> +      * bottom of the usable stack.
> >> >> +      */
> >> >> +     if ((unsigned long)tinfo - (unsigned long)stack < PAGE_SIZE)
> >> >> +             stack = (unsigned long *)tinfo + 1;
> >> >
> >> > That will start walking the stack in the middle of the thread_info
> >> > struct.
> >> >
> >> > I think you meant:
> >> >
> >> >                 stack = (unsigned long *)(tinfo + 1)
> >> >
> >> > However, thread_info will have been overwritten anyway.  So maybe it
> >> > should just be:
> >> >
> >> >                 stack = tinfo;
> >> >
> >> > (Though that still wouldn't quite work because the valid_stack_ptr()
> >> > check would fail...)
> >>
> >> I did mean what I wrote, because I wanted to start at the bottom of
> >> the validly allocated area.  IOW I wanted to do the minimum possible
> >> backward jump to make the code display something.
> >
> > But why the "+ 1"?  Is that a hack to make it pass the valid_stack_ptr()
> > check?
> 
> Yes.
> 
> But hmm.  Maybe the right fix is to drop the + 1 and to change the
> last line of valid_stck_ptr from:
> 
>     return p > t && p < t + THREAD_SIZE - size;
> 
> to:
> 
>     return p >= t && p < t + THREAD_SIZE - size;

Yeah, I think that would be much better.  Then it won't skip the first
value on the page.

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/13] mm: Track NR_KERNEL_STACK in pages instead of number of stacks
  2016-06-16 17:21     ` Andy Lutomirski
@ 2016-06-16 19:20       ` Andy Lutomirski
  0 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16 19:20 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Johannes Weiner, Michal Hocko,
	linux-mm

On Thu, Jun 16, 2016 at 10:21 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jun 16, 2016 at 4:10 AM, Vladimir Davydov
> <vdavydov@virtuozzo.com> wrote:
>> On Wed, Jun 15, 2016 at 05:28:26PM -0700, Andy Lutomirski wrote:
>> ...
>>> @@ -225,7 +225,8 @@ static void account_kernel_stack(struct thread_info *ti, int account)
>>>  {
>>>       struct zone *zone = page_zone(virt_to_page(ti));
>>>
>>> -     mod_zone_page_state(zone, NR_KERNEL_STACK, account);
>>> +     mod_zone_page_state(zone, NR_KERNEL_STACK,
>>> +                         THREAD_SIZE / PAGE_SIZE * account);
>>
>> It won't work if THREAD_SIZE < PAGE_SIZE. Is there an arch with such a
>> thread size, anyway? If no, we should probably drop thread_info_cache.
>
> On a quick grep, I can't find any.  I'll add a BUILD_BUG_ON for now.

Correction.  frv has 16KiB pages and 8KiB stacks.  I'll rework it to
count NR_KERNEL_STACK in KiB.  Sigh.

--Andy

>
> --Andy



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/13] mm: Track NR_KERNEL_STACK in pages instead of number of stacks
  2016-06-16 17:39     ` Andy Lutomirski
@ 2016-06-16 19:39       ` Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2016-06-16 19:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	linux-mm

On Thu, Jun 16, 2016 at 10:39:43AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 16, 2016 at 8:33 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Wed, Jun 15, 2016 at 05:28:26PM -0700, Andy Lutomirski wrote:
> >> Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
> >> zone.  This only makes sense if each kernel stack exists entirely in
> >> one zone, and allowing vmapped stacks could break this assumption.
> >>
> >> It turns out that the code for tracking kernel stack allocations in
> >> units of pages is slightly simpler, so just switch to counting
> >> pages.
> >>
> >> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> >> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >> Cc: Michal Hocko <mhocko@kernel.org>
> >> Cc: linux-mm@kvack.org
> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >> ---
> >>  fs/proc/meminfo.c | 2 +-
> >>  kernel/fork.c     | 3 ++-
> >>  mm/page_alloc.c   | 3 +--
> >>  3 files changed, 4 insertions(+), 4 deletions(-)
> >
> > You missed another usage of NR_KERNEL_STACK in drivers/base/node.c.
> 
> Thanks.
> 
> The real reason I cc'd you was so you could look at
> rewind_stack_do_exit and the sneaky trick I did in no_context in the
> last patch, though.  :)  Both survive objtool, but I figured I'd check
> with objtool's author as well.  If there was a taint bit I could set
> saying "kernel is hosed -- don't try to apply live patches any more",
> I'd have extra confidence.

I think it all looks fine from an objtool and a live patching
standpoint.  Other than my previous comment about setting the stack
pointer correctly before calling do_exit(), I didn't see anything else
which would mess up the stack of a sleeping task, which is all I really
care about.

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-16 18:14     ` Andy Lutomirski
@ 2016-06-16 21:27       ` Andy Lutomirski
  0 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-16 21:27 UTC (permalink / raw)
  To: Heiko Carstens, Paul McKenney
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf

On Thu, Jun 16, 2016 at 11:14 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> Adding Paul, because RCU blew up.
>
> On Thu, Jun 16, 2016 at 10:50 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, Jun 15, 2016 at 11:05 PM, Heiko Carstens
>> <heiko.carstens@de.ibm.com> wrote:
>>> On Wed, Jun 15, 2016 at 05:28:22PM -0700, Andy Lutomirski wrote:
>>>> Since the dawn of time, a kernel stack overflow has been a real PITA
>>>> to debug, has caused nondeterministic crashes some time after the
>>>> actual overflow, and has generally been easy to exploit for root.
>>>>
>>>> With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
>>>> that enable it (just x86 for now) get virtually mapped stacks with
>>>> guard pages.  This causes reliable faults when the stack overflows.
>>>>
>>>> If the arch implements it well, we get a nice OOPS on stack overflow
>>>> (as opposed to panicing directly or otherwise exploding badly).  On
>>>> x86, the OOPS is nice, has a usable call trace, and the overflowing
>>>> task is killed cleanly.
>>>
>>> Do you have numbers which reflect the performance impact of this change?
>>>
>>
>> Hmm.  My attempt to benchmark it caused some of the vmalloc core code
>> to hang.  I'll dig around.
>
> [  488.482010] Call Trace:
> [  488.482389]  <IRQ>  [<ffffffff810da5f6>] sched_show_task+0xb6/0x110
> [  488.483341]  [<ffffffff81108c7a>] rcu_check_callbacks+0x83a/0x840
> [  488.484226]  [<ffffffff810dd48a>] ? account_system_time+0x7a/0x110
> [  488.485157]  [<ffffffff8111c0f0>] ? tick_sched_do_timer+0x30/0x30
> [  488.486133]  [<ffffffff8110d314>] update_process_times+0x34/0x60
> [  488.487050]  [<ffffffff8111bb51>] tick_sched_handle.isra.13+0x31/0x40
> [  488.488018]  [<ffffffff8111c128>] tick_sched_timer+0x38/0x70
> [  488.488853]  [<ffffffff8110db2a>] __hrtimer_run_queues+0xda/0x250
> [  488.489739]  [<ffffffff8110e263>] hrtimer_interrupt+0xa3/0x190
> [  488.490630]  [<ffffffff810952f3>] local_apic_timer_interrupt+0x33/0x50
> [  488.491660]  [<ffffffff81095d58>] smp_apic_timer_interrupt+0x38/0x50
> [  488.492644]  [<ffffffff8194d022>] apic_timer_interrupt+0x82/0x90
> [  488.493502]  [<ffffffff810ee1c0>] ? queued_spin_lock_slowpath+0x20/0x190
> [  488.494550]  [<ffffffff8194c21b>] _raw_spin_lock+0x1b/0x20
> [  488.495321]  [<ffffffff811b7a54>] find_vmap_area+0x14/0x60
> [  488.496197]  [<ffffffff811b8f69>] find_vm_area+0x9/0x20
> [  488.496922]  [<ffffffff810afb19>] account_kernel_stack+0x89/0x100
> [  488.497885]  [<ffffffff810aff76>] free_task+0x16/0x50
> [  488.498599]  [<ffffffff810b0042>] __put_task_struct+0x92/0x120
> [  488.499525]  [<ffffffff810b4a66>] delayed_put_task_struct+0x76/0x80
> [  488.500348]  [<ffffffff81107969>] rcu_process_callbacks+0x1f9/0x5e0
> [  488.501208]  [<ffffffff810b7ca1>] __do_softirq+0xf1/0x280
> [  488.501932]  [<ffffffff810b7f7e>] irq_exit+0x9e/0xa0
> [  488.502955]  [<ffffffff81095d5d>] smp_apic_timer_interrupt+0x3d/0x50
> [  488.503943]  [<ffffffff8194d022>] apic_timer_interrupt+0x82/0x90
> [  488.504886]  <EOI>  [<ffffffff8194c20b>] ? _raw_spin_lock+0xb/0x20
> [  488.505877]  [<ffffffff811b7c33>] ? __get_vm_area_node+0xc3/0x160
> [  488.506812]  [<ffffffff810e013e>] ? task_move_group_fair+0x7e/0x90
> [  488.507730]  [<ffffffff811b9360>] __vmalloc_node_range+0x70/0x280
> [  488.508689]  [<ffffffff810b1ce5>] ? _do_fork+0xc5/0x370
> [  488.509512]  [<ffffffff811ceb9b>] ? kmem_cache_alloc_node+0x7b/0x170
> [  488.510502]  [<ffffffff81327418>] ? current_has_perm+0x38/0x40
> [  488.511430]  [<ffffffff810b0501>] copy_process.part.46+0x141/0x1760
> [  488.512449]  [<ffffffff810b1ce5>] ? _do_fork+0xc5/0x370
> [  488.513285]  [<ffffffff8111faf3>] ? do_futex+0x293/0xad0
> [  488.514093]  [<ffffffff810df14a>] ? check_preempt_wakeup+0x10a/0x240
> [  488.515108]  [<ffffffff810d92c2>] ? wake_up_new_task+0xf2/0x180
> [  488.516043]  [<ffffffff810b1ce5>] _do_fork+0xc5/0x370
> [  488.516786]  [<ffffffff8112039d>] ? SyS_futex+0x6d/0x150
> [  488.517615]  [<ffffffff810b2014>] SyS_clone+0x14/0x20
> [  488.518385]  [<ffffffff81002b72>] do_syscall_64+0x52/0xb0
> [  488.519239]  [<ffffffff8194c4e1>] entry_SYSCALL64_slow_path+0x25/0x25
>
> The bug seems straightforward: vmap_area_lock is held, the RCU softirq
> fires, and vmap_area_lock recurses and deadlocks.  Lockdep agrees with
> my assessment and catches the bug immediately on boot.
>
> What's the right fix?  Change all spin_lock calls on vmap_area_lock to
> spin_lock_bh?  Somehow ask RCU not to call delayed_put_task_struct
> from bh context?  I would naively have expected RCU to only call its
> callbacks from thread context, but I was clearly wrong.

I fixed (worked around?) it by caching the vm_struct * so I can skip
calling find_vm_area.  vfree works in IRQ context.  IMO this is still
a wee bit ugly.

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-16  6:05 ` [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Heiko Carstens
  2016-06-16 17:50   ` Andy Lutomirski
@ 2016-06-17  3:58   ` Andy Lutomirski
  2016-06-17  7:27     ` Heiko Carstens
  1 sibling, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-17  3:58 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf

On Wed, Jun 15, 2016 at 11:05 PM, Heiko Carstens
<heiko.carstens@de.ibm.com> wrote:
> On Wed, Jun 15, 2016 at 05:28:22PM -0700, Andy Lutomirski wrote:
>> Since the dawn of time, a kernel stack overflow has been a real PITA
>> to debug, has caused nondeterministic crashes some time after the
>> actual overflow, and has generally been easy to exploit for root.
>>
>> With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
>> that enable it (just x86 for now) get virtually mapped stacks with
>> guard pages.  This causes reliable faults when the stack overflows.
>>
>> If the arch implements it well, we get a nice OOPS on stack overflow
>> (as opposed to panicing directly or otherwise exploding badly).  On
>> x86, the OOPS is nice, has a usable call trace, and the overflowing
>> task is killed cleanly.
>
> Do you have numbers which reflect the performance impact of this change?
>

It seems to add ~1.5µs per thread creation/join pair, which is around
15% overhead.  I *think* the major cost is that vmalloc calls
alloc_kmem_pages_node once per page rather than using a higher-order
block if available.

Anyway, if anyone wants this to become faster, I think the way to do
it would be to ask some friendly mm folks to see if they can speed up
vmalloc.  I don't really want to dig in to the guts of the page
allocator.  My instinct would be to add a new interface
(GFP_SMALLER_OK?) to ask the page allocator for a high-order
allocation such that, if a high-order block is not immediately
available (on the freelist) then it should fall back to a smaller
allocation rather than working hard to get a high-order allocation.
Then vmalloc could use this, and vfree could free pages in blocks
corresponding to whatever orders it got the pages in, thus avoiding
the need to merge all the pages back together.

There's another speedup available: actually reuse allocations.  We
could keep a very small freelist of vmap_areas with their associated
pages so we could reuse them.  (We can't efficiently reuse a vmap_area
without its backing pages because we need to flush the TLB in the
middle if we do that.)

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-17  3:58   ` Andy Lutomirski
@ 2016-06-17  7:27     ` Heiko Carstens
  2016-06-17 17:38       ` Andy Lutomirski
  0 siblings, 1 reply; 60+ messages in thread
From: Heiko Carstens @ 2016-06-17  7:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, linux-kernel, X86 ML, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf

On Thu, Jun 16, 2016 at 08:58:07PM -0700, Andy Lutomirski wrote:
> On Wed, Jun 15, 2016 at 11:05 PM, Heiko Carstens
> <heiko.carstens@de.ibm.com> wrote:
> > On Wed, Jun 15, 2016 at 05:28:22PM -0700, Andy Lutomirski wrote:
> >> Since the dawn of time, a kernel stack overflow has been a real PITA
> >> to debug, has caused nondeterministic crashes some time after the
> >> actual overflow, and has generally been easy to exploit for root.
> >>
> >> With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
> >> that enable it (just x86 for now) get virtually mapped stacks with
> >> guard pages.  This causes reliable faults when the stack overflows.
> >>
> >> If the arch implements it well, we get a nice OOPS on stack overflow
> >> (as opposed to panicing directly or otherwise exploding badly).  On
> >> x86, the OOPS is nice, has a usable call trace, and the overflowing
> >> task is killed cleanly.
> >
> > Do you have numbers which reflect the performance impact of this change?
> >
> 
> It seems to add ~1.5µs per thread creation/join pair, which is around
> 15% overhead.  I *think* the major cost is that vmalloc calls
> alloc_kmem_pages_node once per page rather than using a higher-order
> block if available.
> 
> Anyway, if anyone wants this to become faster, I think the way to do
> it would be to ask some friendly mm folks to see if they can speed up
> vmalloc.  I don't really want to dig in to the guts of the page
> allocator.  My instinct would be to add a new interface
> (GFP_SMALLER_OK?) to ask the page allocator for a high-order
> allocation such that, if a high-order block is not immediately
> available (on the freelist) then it should fall back to a smaller
> allocation rather than working hard to get a high-order allocation.
> Then vmalloc could use this, and vfree could free pages in blocks
> corresponding to whatever orders it got the pages in, thus avoiding
> the need to merge all the pages back together.
> 
> There's another speedup available: actually reuse allocations.  We
> could keep a very small freelist of vmap_areas with their associated
> pages so we could reuse them.  (We can't efficiently reuse a vmap_area
> without its backing pages because we need to flush the TLB in the
> middle if we do that.)

That's rather expensive. Just for the records: on s390 we use gcc's
architecture specific compile options (kernel: CONFIG_STACK_GUARD)

 -mstack-guard=stack-guard
 -mstack-size=stack-size

These generate two additional instructions at the beginning of each
function prologue and verify that the stack size left won't be below a
specified number of bytes. If so it would execute an illegal instruction.

A disassembly looks like this (r15 is the stackpointer):

0000000000000670 <setup_arch>:
     670:       eb 6f f0 48 00 24       stmg    %r6,%r15,72(%r15)
     676:       c0 d0 00 00 00 00       larl    %r13,676 <setup_arch+0x6>
     67c:       a7 f1 3f 80             tmll    %r15,16256  <--- test if enough space left
     680:       b9 04 00 ef             lgr     %r14,%r15
     684:       a7 84 00 01             je      686 <setup_arch+0x16> <--- branch to illegal op
     688:       e3 f0 ff 90 ff 71       lay     %r15,-112(%r15)

The branch jumps actually into the branch instruction itself since the 0001
part of the "je" instruction is an illegal instruction.

This catches at least wild stack overflows because of two many functions
being called.

Of course it doesn't catch wild accesses outside the stack because e.g. the
index into an array on the stack is wrong.

The runtime overhead is within noise ratio, therefore we have this always
enabled.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-17  7:27     ` Heiko Carstens
@ 2016-06-17 17:38       ` Andy Lutomirski
  2016-06-20  5:58         ` Heiko Carstens
  0 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-17 17:38 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Nadav Amit, Kees Cook, Josh Poimboeuf, Borislav Petkov, X86 ML,
	kernel-hardening, Brian Gerst, linux-kernel, Linus Torvalds

On Jun 17, 2016 12:27 AM, "Heiko Carstens" <heiko.carstens@de.ibm.com> wrote:
>
> On Thu, Jun 16, 2016 at 08:58:07PM -0700, Andy Lutomirski wrote:
> > On Wed, Jun 15, 2016 at 11:05 PM, Heiko Carstens
> > <heiko.carstens@de.ibm.com> wrote:
> > > On Wed, Jun 15, 2016 at 05:28:22PM -0700, Andy Lutomirski wrote:
> > >> Since the dawn of time, a kernel stack overflow has been a real PITA
> > >> to debug, has caused nondeterministic crashes some time after the
> > >> actual overflow, and has generally been easy to exploit for root.
> > >>
> > >> With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
> > >> that enable it (just x86 for now) get virtually mapped stacks with
> > >> guard pages.  This causes reliable faults when the stack overflows.
> > >>
> > >> If the arch implements it well, we get a nice OOPS on stack overflow
> > >> (as opposed to panicing directly or otherwise exploding badly).  On
> > >> x86, the OOPS is nice, has a usable call trace, and the overflowing
> > >> task is killed cleanly.
> > >
> > > Do you have numbers which reflect the performance impact of this change?
> > >
> >
> > It seems to add ~1.5盜 per thread creation/join pair, which is around
> > 15% overhead.  I *think* the major cost is that vmalloc calls
> > alloc_kmem_pages_node once per page rather than using a higher-order
> > block if available.
> >
> > Anyway, if anyone wants this to become faster, I think the way to do
> > it would be to ask some friendly mm folks to see if they can speed up
> > vmalloc.  I don't really want to dig in to the guts of the page
> > allocator.  My instinct would be to add a new interface
> > (GFP_SMALLER_OK?) to ask the page allocator for a high-order
> > allocation such that, if a high-order block is not immediately
> > available (on the freelist) then it should fall back to a smaller
> > allocation rather than working hard to get a high-order allocation.
> > Then vmalloc could use this, and vfree could free pages in blocks
> > corresponding to whatever orders it got the pages in, thus avoiding
> > the need to merge all the pages back together.
> >
> > There's another speedup available: actually reuse allocations.  We
> > could keep a very small freelist of vmap_areas with their associated
> > pages so we could reuse them.  (We can't efficiently reuse a vmap_area
> > without its backing pages because we need to flush the TLB in the
> > middle if we do that.)
>
> That's rather expensive. Just for the records: on s390 we use gcc's
> architecture specific compile options (kernel: CONFIG_STACK_GUARD)
>
>  -mstack-guard=stack-guard
>  -mstack-size=stack-size
>
> These generate two additional instructions at the beginning of each
> function prologue and verify that the stack size left won't be below a
> specified number of bytes. If so it would execute an illegal instruction.
>
> A disassembly looks like this (r15 is the stackpointer):
>
> 0000000000000670 <setup_arch>:
>      670:       eb 6f f0 48 00 24       stmg    %r6,%r15,72(%r15)
>      676:       c0 d0 00 00 00 00       larl    %r13,676 <setup_arch+0x6>
>      67c:       a7 f1 3f 80             tmll    %r15,16256  <--- test if enough space left
>      680:       b9 04 00 ef             lgr     %r14,%r15
>      684:       a7 84 00 01             je      686 <setup_arch+0x16> <--- branch to illegal op
>      688:       e3 f0 ff 90 ff 71       lay     %r15,-112(%r15)
>
> The branch jumps actually into the branch instruction itself since the 0001
> part of the "je" instruction is an illegal instruction.
>
> This catches at least wild stack overflows because of two many functions
> being called.
>
> Of course it doesn't catch wild accesses outside the stack because e.g. the
> index into an array on the stack is wrong.
>
> The runtime overhead is within noise ratio, therefore we have this always
> enabled.
>

Neat!  What exactly does tmll do?  I assume this works by checking the
low bits of the stack pointer.

x86_64 would have to do:

movl %esp, %r11d
shll %r11d, $18
cmpl %r11d, <threshold>
jg error

Or similar.  I think the cmpl could be eliminated if the threshold
were a power of two by simply testing the low bits of the stack
pointer.

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-17 17:38       ` Andy Lutomirski
@ 2016-06-20  5:58         ` Heiko Carstens
  2016-06-20  6:01           ` Andy Lutomirski
  0 siblings, 1 reply; 60+ messages in thread
From: Heiko Carstens @ 2016-06-20  5:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Kees Cook, Josh Poimboeuf, Borislav Petkov, X86 ML,
	kernel-hardening, Brian Gerst, linux-kernel, Linus Torvalds

On Fri, Jun 17, 2016 at 10:38:24AM -0700, Andy Lutomirski wrote:
> > A disassembly looks like this (r15 is the stackpointer):
> >
> > 0000000000000670 <setup_arch>:
> >      670:       eb 6f f0 48 00 24       stmg    %r6,%r15,72(%r15)
> >      676:       c0 d0 00 00 00 00       larl    %r13,676 <setup_arch+0x6>
> >      67c:       a7 f1 3f 80             tmll    %r15,16256  <--- test if enough space left
> >      680:       b9 04 00 ef             lgr     %r14,%r15
> >      684:       a7 84 00 01             je      686 <setup_arch+0x16> <--- branch to illegal op
> >      688:       e3 f0 ff 90 ff 71       lay     %r15,-112(%r15)
> >
> > The branch jumps actually into the branch instruction itself since the 0001
> > part of the "je" instruction is an illegal instruction.
> >
> > This catches at least wild stack overflows because of two many functions
> > being called.
> >
> > Of course it doesn't catch wild accesses outside the stack because e.g. the
> > index into an array on the stack is wrong.
> >
> > The runtime overhead is within noise ratio, therefore we have this always
> > enabled.
> >
> 
> Neat!  What exactly does tmll do?  I assume this works by checking the
> low bits of the stack pointer.
> 
> x86_64 would have to do:
> 
> movl %esp, %r11d
> shll %r11d, $18
> cmpl %r11d, <threshold>
> jg error
> 
> Or similar.  I think the cmpl could be eliminated if the threshold
> were a power of two by simply testing the low bits of the stack
> pointer.

The tmll instruction tests if any of the higher bits within the 16k
stackframe address are set. In this specific case that would be bits 7-15
(mask 0x3f80). If no bit would be set we know that only up to 128 bytes
would be left on the stack, and thus trigger an exception.

This check does of course only work if a 16k stack is also 16k aligned,
which is always the case.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-20  5:58         ` Heiko Carstens
@ 2016-06-20  6:01           ` Andy Lutomirski
  2016-06-20  7:07             ` Heiko Carstens
  0 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-06-20  6:01 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Nadav Amit, Kees Cook, Josh Poimboeuf, Borislav Petkov, X86 ML,
	kernel-hardening, Brian Gerst, linux-kernel, Linus Torvalds

On Sun, Jun 19, 2016 at 10:58 PM, Heiko Carstens
<heiko.carstens@de.ibm.com> wrote:
> On Fri, Jun 17, 2016 at 10:38:24AM -0700, Andy Lutomirski wrote:
>> > A disassembly looks like this (r15 is the stackpointer):
>> >
>> > 0000000000000670 <setup_arch>:
>> >      670:       eb 6f f0 48 00 24       stmg    %r6,%r15,72(%r15)
>> >      676:       c0 d0 00 00 00 00       larl    %r13,676 <setup_arch+0x6>
>> >      67c:       a7 f1 3f 80             tmll    %r15,16256  <--- test if enough space left
>> >      680:       b9 04 00 ef             lgr     %r14,%r15
>> >      684:       a7 84 00 01             je      686 <setup_arch+0x16> <--- branch to illegal op
>> >      688:       e3 f0 ff 90 ff 71       lay     %r15,-112(%r15)
>> >
>> > The branch jumps actually into the branch instruction itself since the 0001
>> > part of the "je" instruction is an illegal instruction.
>> >
>> > This catches at least wild stack overflows because of two many functions
>> > being called.
>> >
>> > Of course it doesn't catch wild accesses outside the stack because e.g. the
>> > index into an array on the stack is wrong.
>> >
>> > The runtime overhead is within noise ratio, therefore we have this always
>> > enabled.
>> >
>>
>> Neat!  What exactly does tmll do?  I assume this works by checking the
>> low bits of the stack pointer.
>>
>> x86_64 would have to do:
>>
>> movl %esp, %r11d
>> shll %r11d, $18
>> cmpl %r11d, <threshold>
>> jg error
>>
>> Or similar.  I think the cmpl could be eliminated if the threshold
>> were a power of two by simply testing the low bits of the stack
>> pointer.
>
> The tmll instruction tests if any of the higher bits within the 16k
> stackframe address are set. In this specific case that would be bits 7-15
> (mask 0x3f80). If no bit would be set we know that only up to 128 bytes
> would be left on the stack, and thus trigger an exception.
>
> This check does of course only work if a 16k stack is also 16k aligned,
> which is always the case.
>

Oh, interesting.  How do you handle the case of a single function that
uses more than 128 bytes of stack?

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-20  6:01           ` Andy Lutomirski
@ 2016-06-20  7:07             ` Heiko Carstens
  0 siblings, 0 replies; 60+ messages in thread
From: Heiko Carstens @ 2016-06-20  7:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Kees Cook, Josh Poimboeuf, Borislav Petkov, X86 ML,
	kernel-hardening, Brian Gerst, linux-kernel, Linus Torvalds

On Sun, Jun 19, 2016 at 11:01:48PM -0700, Andy Lutomirski wrote:
> > The tmll instruction tests if any of the higher bits within the 16k
> > stackframe address are set. In this specific case that would be bits 7-15
> > (mask 0x3f80). If no bit would be set we know that only up to 128 bytes
> > would be left on the stack, and thus trigger an exception.
> >
> > This check does of course only work if a 16k stack is also 16k aligned,
> > which is always the case.
> >
> 
> Oh, interesting.  How do you handle the case of a single function that
> uses more than 128 bytes of stack?

The compiler uses the next larger value of the stackframe size that is a
power of 2 for checking. So another example with a stackframe size of 472
bytes would be the below one with a mask of 0x3e00:

0000000000392db8 <htree_inlinedir_to_tree>:
  392db8:       eb 6f f0 48 00 24       stmg    %r6,%r15,72(%r15)
  392dbe:       a7 f1 3e 00             tmll    %r15,15872
  392dc2:       b9 04 00 ef             lgr     %r14,%r15
  392dc6:       a7 84 00 01             je      392dc8 <htree_inlinedir_to_tree+0x10>
  392dca:       e3 f0 fe 28 ff 71       lay     %r15,-472(%r15)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH -v2 0/3] x86/MSR: Improve unhandled MSR access error message
@ 2016-07-04 22:31 Borislav Petkov
  2016-07-04 22:31 ` [PATCH -v2 1/3] x86/dumpstack: Honor supplied @regs arg Borislav Petkov
                   ` (3 more replies)
  0 siblings, 4 replies; 60+ messages in thread
From: Borislav Petkov @ 2016-07-04 22:31 UTC (permalink / raw)
  To: LKML; +Cc: Andrew Morton, Andy Lutomirski, X86 ML

From: Borislav Petkov <bp@suse.de>

Hi all,

here's v2 with review feedback integrated.

Changelog:

v1:

here's a simple patchset improving the error message we're dumping when
an unchecked MSR is being done. Patch 3 has an example which explains
the improvement in detail.


Andy Lutomirski (1):
  x86/dumpstack: Honor supplied @regs arg

Borislav Petkov (2):
  printk: Make the printk*once() variants return a value
  x86/dumpstack: Add show_stack_regs() and use it

 arch/x86/include/asm/kdebug.h  |  1 +
 arch/x86/kernel/dumpstack.c    |  5 +++++
 arch/x86/kernel/dumpstack_32.c |  4 +++-
 arch/x86/kernel/dumpstack_64.c |  4 +++-
 arch/x86/mm/extable.c          | 13 ++++++++-----
 include/linux/printk.h         | 17 ++++++++++++-----
 6 files changed, 32 insertions(+), 12 deletions(-)

-- 
2.7.3

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH -v2 1/3] x86/dumpstack: Honor supplied @regs arg
  2016-07-04 22:31 [PATCH -v2 0/3] x86/MSR: Improve unhandled MSR access error message Borislav Petkov
@ 2016-07-04 22:31 ` Borislav Petkov
  2016-07-04 22:31 ` [PATCH -v2 2/3] printk: Make the printk*once() variants return a value Borislav Petkov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 60+ messages in thread
From: Borislav Petkov @ 2016-07-04 22:31 UTC (permalink / raw)
  To: LKML; +Cc: Andrew Morton, Andy Lutomirski, X86 ML

From: Andy Lutomirski <luto@kernel.org>

The comment suggests that show_stack(NULL, NULL) should backtrace the
current context, but the code doesn't match the comment. If regs are
given, start the "Stack:" hexdump at regs->sp.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/efcd79bf4106d61f1cd258c2caa87f3a0618eeac.1466036668.git.luto@kernel.org
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/dumpstack_32.c | 4 +++-
 arch/x86/kernel/dumpstack_64.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index fef917e79b9d..948d77da3881 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -96,7 +96,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	int i;
 
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index d558a8a49016..a81e1ef73bf2 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -264,7 +264,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * back trace for this cpu:
 	 */
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
-- 
2.7.3

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH -v2 2/3] printk: Make the printk*once() variants return a value
  2016-07-04 22:31 [PATCH -v2 0/3] x86/MSR: Improve unhandled MSR access error message Borislav Petkov
  2016-07-04 22:31 ` [PATCH -v2 1/3] x86/dumpstack: Honor supplied @regs arg Borislav Petkov
@ 2016-07-04 22:31 ` Borislav Petkov
  2016-07-08 12:08   ` [tip:x86/debug] " tip-bot for Borislav Petkov
  2016-07-04 22:31 ` [PATCH -v2 3/3] x86/dumpstack: Add show_stack_regs() and use it Borislav Petkov
  2016-07-06 12:58 ` [PATCH -v2 0/3] x86/MSR: Improve unhandled MSR access error message Andy Lutomirski
  3 siblings, 1 reply; 60+ messages in thread
From: Borislav Petkov @ 2016-07-04 22:31 UTC (permalink / raw)
  To: LKML; +Cc: Andrew Morton, Andy Lutomirski, X86 ML

From: Borislav Petkov <bp@suse.de>

Have printk*once() return a bool which denotes whether the string was
printed or not so that calling code can react accordingly.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/printk.h | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index f4da695fd615..f136b22c7772 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -108,11 +108,14 @@ struct va_format {
  * Dummy printk for disabled debugging statements to use whilst maintaining
  * gcc's format checking.
  */
-#define no_printk(fmt, ...)			\
-do {						\
-	if (0)					\
-		printk(fmt, ##__VA_ARGS__);	\
-} while (0)
+#define no_printk(fmt, ...)				\
+({							\
+	do {						\
+		if (0)					\
+			printk(fmt, ##__VA_ARGS__);	\
+	} while (0);					\
+	0;						\
+})
 
 #ifdef CONFIG_EARLY_PRINTK
 extern asmlinkage __printf(1, 2)
@@ -309,20 +312,24 @@ extern asmlinkage void dump_stack(void) __cold;
 #define printk_once(fmt, ...)					\
 ({								\
 	static bool __print_once __read_mostly;			\
+	bool __ret_print_once = !__print_once;			\
 								\
 	if (!__print_once) {					\
 		__print_once = true;				\
 		printk(fmt, ##__VA_ARGS__);			\
 	}							\
+	unlikely(__ret_print_once);				\
 })
 #define printk_deferred_once(fmt, ...)				\
 ({								\
 	static bool __print_once __read_mostly;			\
+	bool __ret_print_once = !__print_once;			\
 								\
 	if (!__print_once) {					\
 		__print_once = true;				\
 		printk_deferred(fmt, ##__VA_ARGS__);		\
 	}							\
+	unlikely(__ret_print_once);				\
 })
 #else
 #define printk_once(fmt, ...)					\
-- 
2.7.3

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH -v2 3/3] x86/dumpstack: Add show_stack_regs() and use it
  2016-07-04 22:31 [PATCH -v2 0/3] x86/MSR: Improve unhandled MSR access error message Borislav Petkov
  2016-07-04 22:31 ` [PATCH -v2 1/3] x86/dumpstack: Honor supplied @regs arg Borislav Petkov
  2016-07-04 22:31 ` [PATCH -v2 2/3] printk: Make the printk*once() variants return a value Borislav Petkov
@ 2016-07-04 22:31 ` Borislav Petkov
  2016-07-08 12:08   ` [tip:x86/debug] " tip-bot for Borislav Petkov
  2016-07-06 12:58 ` [PATCH -v2 0/3] x86/MSR: Improve unhandled MSR access error message Andy Lutomirski
  3 siblings, 1 reply; 60+ messages in thread
From: Borislav Petkov @ 2016-07-04 22:31 UTC (permalink / raw)
  To: LKML; +Cc: Andrew Morton, Andy Lutomirski, X86 ML

From: Borislav Petkov <bp@suse.de>

Add a helper to dump supplied pt_regs and use it in the MSR exception
handling code to have precise stack traces pointing to the actual
function causing the MSR access exception and not the stack frame of the
exception handler itself.

The new output looks like this:

 unchecked MSR access error: RDMSR from 0xdeadbeef at rIP: 0xffffffff8102ddb6 (early_init_intel+0x16/0x3a0)
  00000000756e6547 ffffffff81c03f68 ffffffff81dd0940 ffffffff81c03f10
  ffffffff81d42e65 0000000001000000 ffffffff81c03f58 ffffffff81d3e5a3
  0000800000000000 ffffffff81800080 ffffffffffffffff 0000000000000000
 Call Trace:
  [<ffffffff81d42e65>] early_cpu_init+0xe7/0x136
  [<ffffffff81d3e5a3>] setup_arch+0xa5/0x9df
  [<ffffffff81d38bb9>] start_kernel+0x9f/0x43a
  [<ffffffff81d38294>] x86_64_start_reservations+0x2f/0x31
  [<ffffffff81d383fe>] x86_64_start_kernel+0x168/0x176

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/kdebug.h |  1 +
 arch/x86/kernel/dumpstack.c   |  5 +++++
 arch/x86/mm/extable.c         | 13 ++++++++-----
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kdebug.h b/arch/x86/include/asm/kdebug.h
index e5f5dc9787d5..1ef9d581b5d9 100644
--- a/arch/x86/include/asm/kdebug.h
+++ b/arch/x86/include/asm/kdebug.h
@@ -26,6 +26,7 @@ extern void die(const char *, struct pt_regs *,long);
 extern int __must_check __die(const char *, struct pt_regs *, long);
 extern void show_trace(struct task_struct *t, struct pt_regs *regs,
 		       unsigned long *sp, unsigned long bp);
+extern void show_stack_regs(struct pt_regs *regs);
 extern void __show_regs(struct pt_regs *regs, int all);
 extern unsigned long oops_begin(void);
 extern void oops_end(unsigned long, struct pt_regs *, int signr);
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index ef8017ca5ba9..d66e5ac823b2 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -197,6 +197,11 @@ void show_stack(struct task_struct *task, unsigned long *sp)
 	show_stack_log_lvl(task, NULL, sp, bp, "");
 }
 
+void show_stack_regs(struct pt_regs *regs)
+{
+	show_stack_log_lvl(current, regs, (unsigned long *)regs->sp, regs->bp, "");
+}
+
 static arch_spinlock_t die_lock = __ARCH_SPIN_LOCK_UNLOCKED;
 static int die_owner = -1;
 static unsigned int die_nest_count;
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 4bb53b89f3c5..fafc771568c7 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -1,6 +1,7 @@
 #include <linux/module.h>
 #include <asm/uaccess.h>
 #include <asm/traps.h>
+#include <asm/kdebug.h>
 
 typedef bool (*ex_handler_t)(const struct exception_table_entry *,
 			    struct pt_regs *, int);
@@ -46,8 +47,9 @@ EXPORT_SYMBOL(ex_handler_ext);
 bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup,
 			     struct pt_regs *regs, int trapnr)
 {
-	WARN_ONCE(1, "unchecked MSR access error: RDMSR from 0x%x\n",
-		  (unsigned int)regs->cx);
+	if (pr_warn_once("unchecked MSR access error: RDMSR from 0x%x at rIP: 0x%lx (%pF)\n",
+			 (unsigned int)regs->cx, regs->ip, (void *)regs->ip))
+		show_stack_regs(regs);
 
 	/* Pretend that the read succeeded and returned 0. */
 	regs->ip = ex_fixup_addr(fixup);
@@ -60,9 +62,10 @@ EXPORT_SYMBOL(ex_handler_rdmsr_unsafe);
 bool ex_handler_wrmsr_unsafe(const struct exception_table_entry *fixup,
 			     struct pt_regs *regs, int trapnr)
 {
-	WARN_ONCE(1, "unchecked MSR access error: WRMSR to 0x%x (tried to write 0x%08x%08x)\n",
-		  (unsigned int)regs->cx,
-		  (unsigned int)regs->dx, (unsigned int)regs->ax);
+	if (pr_warn_once("unchecked MSR access error: WRMSR to 0x%x (tried to write 0x%08x%08x) at rIP: 0x%lx (%pF)\n",
+			 (unsigned int)regs->cx, (unsigned int)regs->dx,
+			 (unsigned int)regs->ax,  regs->ip, (void *)regs->ip))
+		show_stack_regs(regs);
 
 	/* Pretend that the write succeeded. */
 	regs->ip = ex_fixup_addr(fixup);
-- 
2.7.3

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v2 0/3] x86/MSR: Improve unhandled MSR access error message
  2016-07-04 22:31 [PATCH -v2 0/3] x86/MSR: Improve unhandled MSR access error message Borislav Petkov
                   ` (2 preceding siblings ...)
  2016-07-04 22:31 ` [PATCH -v2 3/3] x86/dumpstack: Add show_stack_regs() and use it Borislav Petkov
@ 2016-07-06 12:58 ` Andy Lutomirski
  2016-07-06 13:11   ` Borislav Petkov
  3 siblings, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2016-07-06 12:58 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: LKML, Andrew Morton, X86 ML

On Mon, Jul 4, 2016 at 3:31 PM, Borislav Petkov <bp@alien8.de> wrote:
> From: Borislav Petkov <bp@suse.de>
>
> Hi all,
>
> here's v2 with review feedback integrated.

Just a silly overall thought: WARN_ON will be noticed by the 0day bot
and similar tools.  This series changes the message text a bit -- is
it still noticed?

--Andy

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v2 0/3] x86/MSR: Improve unhandled MSR access error message
  2016-07-06 12:58 ` [PATCH -v2 0/3] x86/MSR: Improve unhandled MSR access error message Andy Lutomirski
@ 2016-07-06 13:11   ` Borislav Petkov
  0 siblings, 0 replies; 60+ messages in thread
From: Borislav Petkov @ 2016-07-06 13:11 UTC (permalink / raw)
  To: Andy Lutomirski, kbuild-all; +Cc: LKML, Andrew Morton, X86 ML

On Wed, Jul 06, 2016 at 05:58:35AM -0700, Andy Lutomirski wrote:
> On Mon, Jul 4, 2016 at 3:31 PM, Borislav Petkov <bp@alien8.de> wrote:
> > From: Borislav Petkov <bp@suse.de>
> >
> > Hi all,
> >
> > here's v2 with review feedback integrated.
> 
> Just a silly overall thought: WARN_ON will be noticed by the 0day bot
> and similar tools.  This series changes the message text a bit -- is
> it still noticed?

No idea, let's CC them.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [tip:x86/debug] x86/dumpstack: Honor supplied @regs arg
  2016-06-16  0:28 ` [PATCH 09/13] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
  2016-06-16 11:56   ` Borislav Petkov
@ 2016-07-08 12:07   ` tip-bot for Andy Lutomirski
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-07-08 12:07 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: keescook, brgerst, tglx, mingo, bp, torvalds, luto, dvlasenk,
	fweisbec, hpa, bp, jpoimboe, luto, linux-kernel, peterz, akpm

Commit-ID:  ef16dd0c2a523d2e3975bb1bea9f5727e3e7146f
Gitweb:     http://git.kernel.org/tip/ef16dd0c2a523d2e3975bb1bea9f5727e3e7146f
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Tue, 5 Jul 2016 00:31:25 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 8 Jul 2016 11:33:18 +0200

x86/dumpstack: Honor supplied @regs arg

The comment suggests that show_stack(NULL, NULL) should backtrace the
current context, but the code doesn't match the comment. If regs are
given, start the "Stack:" hexdump at regs->sp.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1467671487-10344-2-git-send-email-bp@alien8.de
Link: http://lkml.kernel.org/r/efcd79bf4106d61f1cd258c2caa87f3a0618eeac.1466036668.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/dumpstack_32.c | 4 +++-
 arch/x86/kernel/dumpstack_64.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index fef917e..948d77d 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -96,7 +96,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	int i;
 
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index d558a8a..a81e1ef 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -264,7 +264,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * back trace for this cpu:
 	 */
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [tip:x86/debug] printk: Make the printk*once() variants return a value
  2016-07-04 22:31 ` [PATCH -v2 2/3] printk: Make the printk*once() variants return a value Borislav Petkov
@ 2016-07-08 12:08   ` tip-bot for Borislav Petkov
  2016-07-09  2:40     ` Joe Perches
  0 siblings, 1 reply; 60+ messages in thread
From: tip-bot for Borislav Petkov @ 2016-07-08 12:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: akpm, linux-kernel, hpa, luto, mingo, peterz, brgerst, tglx,
	luto, torvalds, jpoimboe, fweisbec, dvlasenk, bp, bp

Commit-ID:  069f0cd00df0abfb9252e0dbdc355e40e6ab75fc
Gitweb:     http://git.kernel.org/tip/069f0cd00df0abfb9252e0dbdc355e40e6ab75fc
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Tue, 5 Jul 2016 00:31:26 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 8 Jul 2016 11:33:18 +0200

printk: Make the printk*once() variants return a value

Have printk*once() return a bool which denotes whether the string was
printed or not so that calling code can react accordingly.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1467671487-10344-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/printk.h | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index f4da695..f136b22 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -108,11 +108,14 @@ struct va_format {
  * Dummy printk for disabled debugging statements to use whilst maintaining
  * gcc's format checking.
  */
-#define no_printk(fmt, ...)			\
-do {						\
-	if (0)					\
-		printk(fmt, ##__VA_ARGS__);	\
-} while (0)
+#define no_printk(fmt, ...)				\
+({							\
+	do {						\
+		if (0)					\
+			printk(fmt, ##__VA_ARGS__);	\
+	} while (0);					\
+	0;						\
+})
 
 #ifdef CONFIG_EARLY_PRINTK
 extern asmlinkage __printf(1, 2)
@@ -309,20 +312,24 @@ extern asmlinkage void dump_stack(void) __cold;
 #define printk_once(fmt, ...)					\
 ({								\
 	static bool __print_once __read_mostly;			\
+	bool __ret_print_once = !__print_once;			\
 								\
 	if (!__print_once) {					\
 		__print_once = true;				\
 		printk(fmt, ##__VA_ARGS__);			\
 	}							\
+	unlikely(__ret_print_once);				\
 })
 #define printk_deferred_once(fmt, ...)				\
 ({								\
 	static bool __print_once __read_mostly;			\
+	bool __ret_print_once = !__print_once;			\
 								\
 	if (!__print_once) {					\
 		__print_once = true;				\
 		printk_deferred(fmt, ##__VA_ARGS__);		\
 	}							\
+	unlikely(__ret_print_once);				\
 })
 #else
 #define printk_once(fmt, ...)					\

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [tip:x86/debug] x86/dumpstack: Add show_stack_regs() and use it
  2016-07-04 22:31 ` [PATCH -v2 3/3] x86/dumpstack: Add show_stack_regs() and use it Borislav Petkov
@ 2016-07-08 12:08   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Borislav Petkov @ 2016-07-08 12:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: luto, torvalds, bp, hpa, luto, bp, mingo, dvlasenk, tglx,
	linux-kernel, jpoimboe, peterz, brgerst, akpm, fweisbec

Commit-ID:  81c2949f7fdcf8ff681326669afde24962232670
Gitweb:     http://git.kernel.org/tip/81c2949f7fdcf8ff681326669afde24962232670
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Tue, 5 Jul 2016 00:31:27 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 8 Jul 2016 11:33:19 +0200

x86/dumpstack: Add show_stack_regs() and use it

Add a helper to dump supplied pt_regs and use it in the MSR exception
handling code to have precise stack traces pointing to the actual
function causing the MSR access exception and not the stack frame of the
exception handler itself.

The new output looks like this:

 unchecked MSR access error: RDMSR from 0xdeadbeef at rIP: 0xffffffff8102ddb6 (early_init_intel+0x16/0x3a0)
  00000000756e6547 ffffffff81c03f68 ffffffff81dd0940 ffffffff81c03f10
  ffffffff81d42e65 0000000001000000 ffffffff81c03f58 ffffffff81d3e5a3
  0000800000000000 ffffffff81800080 ffffffffffffffff 0000000000000000
 Call Trace:
  [<ffffffff81d42e65>] early_cpu_init+0xe7/0x136
  [<ffffffff81d3e5a3>] setup_arch+0xa5/0x9df
  [<ffffffff81d38bb9>] start_kernel+0x9f/0x43a
  [<ffffffff81d38294>] x86_64_start_reservations+0x2f/0x31
  [<ffffffff81d383fe>] x86_64_start_kernel+0x168/0x176

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1467671487-10344-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/kdebug.h |  1 +
 arch/x86/kernel/dumpstack.c   |  5 +++++
 arch/x86/mm/extable.c         | 13 ++++++++-----
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kdebug.h b/arch/x86/include/asm/kdebug.h
index e5f5dc9..1ef9d58 100644
--- a/arch/x86/include/asm/kdebug.h
+++ b/arch/x86/include/asm/kdebug.h
@@ -26,6 +26,7 @@ extern void die(const char *, struct pt_regs *,long);
 extern int __must_check __die(const char *, struct pt_regs *, long);
 extern void show_trace(struct task_struct *t, struct pt_regs *regs,
 		       unsigned long *sp, unsigned long bp);
+extern void show_stack_regs(struct pt_regs *regs);
 extern void __show_regs(struct pt_regs *regs, int all);
 extern unsigned long oops_begin(void);
 extern void oops_end(unsigned long, struct pt_regs *, int signr);
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index ef8017c..d66e5ac 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -197,6 +197,11 @@ void show_stack(struct task_struct *task, unsigned long *sp)
 	show_stack_log_lvl(task, NULL, sp, bp, "");
 }
 
+void show_stack_regs(struct pt_regs *regs)
+{
+	show_stack_log_lvl(current, regs, (unsigned long *)regs->sp, regs->bp, "");
+}
+
 static arch_spinlock_t die_lock = __ARCH_SPIN_LOCK_UNLOCKED;
 static int die_owner = -1;
 static unsigned int die_nest_count;
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 4bb53b8..fafc771 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -1,6 +1,7 @@
 #include <linux/module.h>
 #include <asm/uaccess.h>
 #include <asm/traps.h>
+#include <asm/kdebug.h>
 
 typedef bool (*ex_handler_t)(const struct exception_table_entry *,
 			    struct pt_regs *, int);
@@ -46,8 +47,9 @@ EXPORT_SYMBOL(ex_handler_ext);
 bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup,
 			     struct pt_regs *regs, int trapnr)
 {
-	WARN_ONCE(1, "unchecked MSR access error: RDMSR from 0x%x\n",
-		  (unsigned int)regs->cx);
+	if (pr_warn_once("unchecked MSR access error: RDMSR from 0x%x at rIP: 0x%lx (%pF)\n",
+			 (unsigned int)regs->cx, regs->ip, (void *)regs->ip))
+		show_stack_regs(regs);
 
 	/* Pretend that the read succeeded and returned 0. */
 	regs->ip = ex_fixup_addr(fixup);
@@ -60,9 +62,10 @@ EXPORT_SYMBOL(ex_handler_rdmsr_unsafe);
 bool ex_handler_wrmsr_unsafe(const struct exception_table_entry *fixup,
 			     struct pt_regs *regs, int trapnr)
 {
-	WARN_ONCE(1, "unchecked MSR access error: WRMSR to 0x%x (tried to write 0x%08x%08x)\n",
-		  (unsigned int)regs->cx,
-		  (unsigned int)regs->dx, (unsigned int)regs->ax);
+	if (pr_warn_once("unchecked MSR access error: WRMSR to 0x%x (tried to write 0x%08x%08x) at rIP: 0x%lx (%pF)\n",
+			 (unsigned int)regs->cx, (unsigned int)regs->dx,
+			 (unsigned int)regs->ax,  regs->ip, (void *)regs->ip))
+		show_stack_regs(regs);
 
 	/* Pretend that the write succeeded. */
 	regs->ip = ex_fixup_addr(fixup);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [tip:x86/debug] printk: Make the printk*once() variants return a value
  2016-07-08 12:08   ` [tip:x86/debug] " tip-bot for Borislav Petkov
@ 2016-07-09  2:40     ` Joe Perches
  2016-07-09  7:50       ` Borislav Petkov
  0 siblings, 1 reply; 60+ messages in thread
From: Joe Perches @ 2016-07-09  2:40 UTC (permalink / raw)
  To: akpm, linux-kernel, mingo, luto, hpa, peterz, brgerst, torvalds,
	tglx, luto, dvlasenk, fweisbec, jpoimboe, bp, bp,
	linux-tip-commits

On Fri, 2016-07-08 at 05:08 -0700, tip-bot for Borislav Petkov wrote:
> printk: Make the printk*once() variants return a value
[]
> diff --git a/include/linux/printk.h b/include/linux/printk.h
[]
> @@ -108,11 +108,14 @@ struct va_format {
>   * Dummy printk for disabled debugging statements to use whilst maintaining
>   * gcc's format checking.
>   */
> -#define no_printk(fmt, ...)			\
> -do {						\
> -	if (0)					\
> -		printk(fmt, ##__VA_ARGS__);	\
> -} while (0)
> +#define no_printk(fmt, ...)				\
> +({							\
> +	do {						\
> +		if (0)					\
> +			printk(fmt, ##__VA_ARGS__);	\
> +	} while (0);					\
> +	0;						\
> +})

This change isn't described in the commit message and there
doesn't seem to be a need to change this.

And as statement expressions these no longer need
"do { <foo> } while (0)" this could have been be simplified to

#define no_printk(fmt, ...)
({
	if (0)
		printk(fmt, ##__VA_ARGS__);
})

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [tip:x86/debug] printk: Make the printk*once() variants return a value
  2016-07-09  2:40     ` Joe Perches
@ 2016-07-09  7:50       ` Borislav Petkov
  2016-07-09 17:56         ` Joe Perches
  0 siblings, 1 reply; 60+ messages in thread
From: Borislav Petkov @ 2016-07-09  7:50 UTC (permalink / raw)
  To: Joe Perches
  Cc: akpm, linux-kernel, mingo, luto, hpa, peterz, brgerst, torvalds,
	tglx, luto, dvlasenk, fweisbec, jpoimboe, bp, linux-tip-commits

On Fri, Jul 08, 2016 at 07:40:48PM -0700, Joe Perches wrote:
> This change isn't described in the commit message and there
> doesn't seem to be a need to change this.

How do *you* know? Did *you* actually sit down and build a kernel with
your proposed change before sending a reply?

I'm pretty sure you didn't.

Well, there is a very good reason why I made that change but I'm not
going to tell you.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [tip:x86/debug] printk: Make the printk*once() variants return a value
  2016-07-09  7:50       ` Borislav Petkov
@ 2016-07-09 17:56         ` Joe Perches
  2016-07-10  6:49           ` Borislav Petkov
  0 siblings, 1 reply; 60+ messages in thread
From: Joe Perches @ 2016-07-09 17:56 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: akpm, linux-kernel, mingo, luto, hpa, peterz, brgerst, torvalds,
	tglx, luto, dvlasenk, fweisbec, jpoimboe, bp, linux-tip-commits

On Sat, 2016-07-09 at 09:50 +0200, Borislav Petkov wrote:
> On Fri, Jul 08, 2016 at 07:40:48PM -0700, Joe Perches wrote:
> > This change isn't described in the commit message and there
> > doesn't seem to be a need to change this.
> How do *you* know? Did *you* actually sit down and build a kernel with
> your proposed change before sending a reply?
> I'm pretty sure you didn't.

defconfigs both with and without CONFIG_PRINTK build
properly with the proposed change to this specific patch.

> Well, there is a very good reason why I made that change but I'm not
> going to tell you.

Borislav, your delightful personality always impresses.
Never change.

If there is a specific reason you know why this 0; value
must be added to a do {} while (0) to statement expression
macro conversion, it'd be good to write that in the
commit message.  It'd also be good to remove the useless
"do {} while (0);" surrounding a single statement.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [tip:x86/debug] printk: Make the printk*once() variants return a value
  2016-07-09 17:56         ` Joe Perches
@ 2016-07-10  6:49           ` Borislav Petkov
  2016-07-10  8:23             ` Joe Perches
  0 siblings, 1 reply; 60+ messages in thread
From: Borislav Petkov @ 2016-07-10  6:49 UTC (permalink / raw)
  To: Joe Perches
  Cc: akpm, linux-kernel, mingo, luto, hpa, peterz, brgerst, torvalds,
	tglx, luto, dvlasenk, fweisbec, jpoimboe, bp, linux-tip-commits

On Sat, Jul 09, 2016 at 10:56:55AM -0700, Joe Perches wrote:
> defconfigs both with and without CONFIG_PRINTK build
> properly with the proposed change to this specific patch.

Did you try latest tip/master?

> Borislav, your delightful personality always impresses.
> Never change.

What goes around comes around.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [tip:x86/debug] printk: Make the printk*once() variants return a value
  2016-07-10  6:49           ` Borislav Petkov
@ 2016-07-10  8:23             ` Joe Perches
  2016-07-10 12:06               ` Borislav Petkov
  0 siblings, 1 reply; 60+ messages in thread
From: Joe Perches @ 2016-07-10  8:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: akpm, linux-kernel, mingo, luto, hpa, peterz, brgerst, torvalds,
	tglx, luto, dvlasenk, fweisbec, jpoimboe, bp, linux-tip-commits

On Sun, 2016-07-10 at 08:49 +0200, Borislav Petkov wrote:
> On Sat, Jul 09, 2016 at 10:56:55AM -0700, Joe Perches wrote:
> > 
> > defconfigs both with and without CONFIG_PRINTK build
> > properly with the proposed change to this specific patch.
> Did you try latest tip/master?

Assuming tip is included in linux-next as of july 8, yes.

http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a379b037c5f3bcd0475ab2743b1f8a57fa970e22
[]
+tip		2c387f55aabaf88b6ee6857192a8f778bee05523

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [tip:x86/debug] printk: Make the printk*once() variants return a value
  2016-07-10  8:23             ` Joe Perches
@ 2016-07-10 12:06               ` Borislav Petkov
  2016-07-10 12:33                 ` Joe Perches
  0 siblings, 1 reply; 60+ messages in thread
From: Borislav Petkov @ 2016-07-10 12:06 UTC (permalink / raw)
  To: Joe Perches
  Cc: akpm, linux-kernel, mingo, luto, hpa, peterz, brgerst, torvalds,
	tglx, luto, dvlasenk, fweisbec, jpoimboe, bp, linux-tip-commits

On Sun, Jul 10, 2016 at 01:23:51AM -0700, Joe Perches wrote:
> Assuming tip is included in linux-next as of july 8, yes.

Try one which has http://git.kernel.org/tip/81c2949f7fdcf8ff681326669afde24962232670

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [tip:x86/debug] printk: Make the printk*once() variants return a value
  2016-07-10 12:06               ` Borislav Petkov
@ 2016-07-10 12:33                 ` Joe Perches
  0 siblings, 0 replies; 60+ messages in thread
From: Joe Perches @ 2016-07-10 12:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: akpm, linux-kernel, mingo, luto, hpa, peterz, brgerst, torvalds,
	tglx, luto, dvlasenk, fweisbec, jpoimboe, bp, linux-tip-commits

On Sun, 2016-07-10 at 14:06 +0200, Borislav Petkov wrote:
> On Sun, Jul 10, 2016 at 01:23:51AM -0700, Joe Perches wrote:
> > 
> > Assuming tip is included in linux-next as of july 8, yes.
> Try one which has http://git.kernel.org/tip/81c2949f7fdcf8ff681326669a
> fde24962232670

That commit isn't in -next and this change to add a return
value to no_printk in _this_ patch is, as I wrote, not
required here.

It's fine that no_printk return a 0 value, but it should
be introduced in a separate patch that introduces that
requirement or at least mentioned in the changelog of
this patch _why_ it's it to be changed.  Otherwise, the
change just appears spurious.

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2016-07-10 12:33 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-16  0:28 [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
2016-06-16  0:28 ` [PATCH 01/13] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Andy Lutomirski
2016-06-16  0:28 ` [PATCH 02/13] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated Andy Lutomirski
2016-06-16  0:28 ` [PATCH 03/13] x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately Andy Lutomirski
2016-06-16  0:28 ` [PATCH 04/13] mm: Track NR_KERNEL_STACK in pages instead of number of stacks Andy Lutomirski
2016-06-16 11:10   ` Vladimir Davydov
2016-06-16 17:21     ` Andy Lutomirski
2016-06-16 19:20       ` Andy Lutomirski
2016-06-16 15:33   ` Josh Poimboeuf
2016-06-16 17:39     ` Andy Lutomirski
2016-06-16 19:39       ` Josh Poimboeuf
2016-06-16  0:28 ` [PATCH 05/13] mm: Move memcg stack accounting to account_kernel_stack Andy Lutomirski
2016-06-16  0:28 ` [PATCH 06/13] fork: Add generic vmalloced stack support Andy Lutomirski
2016-06-16 17:25   ` Kees Cook
2016-06-16 17:37     ` Andy Lutomirski
2016-06-16  0:28 ` [PATCH 07/13] x86/die: Don't try to recover from an OOPS on a non-default stack Andy Lutomirski
2016-06-16  0:28 ` [PATCH 08/13] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
2016-06-16 17:50   ` Josh Poimboeuf
2016-06-16 17:57     ` Andy Lutomirski
2016-06-16  0:28 ` [PATCH 09/13] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
2016-06-16 11:56   ` Borislav Petkov
2016-07-08 12:07   ` [tip:x86/debug] x86/dumpstack: Honor supplied @regs arg tip-bot for Andy Lutomirski
2016-06-16  0:28 ` [PATCH 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
2016-06-16 18:16   ` Josh Poimboeuf
2016-06-16 18:22     ` Andy Lutomirski
2016-06-16 18:33       ` Josh Poimboeuf
2016-06-16 18:37         ` Andy Lutomirski
2016-06-16 18:54           ` Josh Poimboeuf
2016-06-16  0:28 ` [PATCH 11/13] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
2016-06-16  0:28 ` [PATCH 12/13] x86/mm/64: Enable vmapped stacks Andy Lutomirski
2016-06-16  4:17   ` Mika Penttilä
2016-06-16  5:33     ` Andy Lutomirski
2016-06-16 13:11       ` [kernel-hardening] " Rik van Riel
2016-06-16  0:28 ` [PATCH 13/13] x86/mm: Improve stack-overflow #PF handling Andy Lutomirski
2016-06-16  6:05 ` [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core) Heiko Carstens
2016-06-16 17:50   ` Andy Lutomirski
2016-06-16 18:14     ` Andy Lutomirski
2016-06-16 21:27       ` Andy Lutomirski
2016-06-17  3:58   ` Andy Lutomirski
2016-06-17  7:27     ` Heiko Carstens
2016-06-17 17:38       ` Andy Lutomirski
2016-06-20  5:58         ` Heiko Carstens
2016-06-20  6:01           ` Andy Lutomirski
2016-06-20  7:07             ` Heiko Carstens
2016-06-16 17:24 ` Kees Cook
2016-07-04 22:31 [PATCH -v2 0/3] x86/MSR: Improve unhandled MSR access error message Borislav Petkov
2016-07-04 22:31 ` [PATCH -v2 1/3] x86/dumpstack: Honor supplied @regs arg Borislav Petkov
2016-07-04 22:31 ` [PATCH -v2 2/3] printk: Make the printk*once() variants return a value Borislav Petkov
2016-07-08 12:08   ` [tip:x86/debug] " tip-bot for Borislav Petkov
2016-07-09  2:40     ` Joe Perches
2016-07-09  7:50       ` Borislav Petkov
2016-07-09 17:56         ` Joe Perches
2016-07-10  6:49           ` Borislav Petkov
2016-07-10  8:23             ` Joe Perches
2016-07-10 12:06               ` Borislav Petkov
2016-07-10 12:33                 ` Joe Perches
2016-07-04 22:31 ` [PATCH -v2 3/3] x86/dumpstack: Add show_stack_regs() and use it Borislav Petkov
2016-07-08 12:08   ` [tip:x86/debug] " tip-bot for Borislav Petkov
2016-07-06 12:58 ` [PATCH -v2 0/3] x86/MSR: Improve unhandled MSR access error message Andy Lutomirski
2016-07-06 13:11   ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).