linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core)
@ 2016-06-17 20:00 Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 01/13] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Andy Lutomirski
                   ` (12 more replies)
  0 siblings, 13 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

Since the dawn of time, a kernel stack overflow has been a real PITA
to debug, has caused nondeterministic crashes some time after the
actual overflow, and has generally been easy to exploit for root.

With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
that enable it (just x86 for now) get virtually mapped stacks with
guard pages.  This causes reliable faults when the stack overflows.

If the arch implements it well, we get a nice OOPS on stack overflow
(as opposed to panicking directly or otherwise exploding badly).  On
x86, the OOPS is nice, has a usable call trace, and the overflowing
task is killed cleanly.

On my laptop, this adds about 1.5µs of overhead to task creation,
which seems to be mainly caused by vmalloc inefficiently allocating
individual pages even when a higher-order page is available on the
freelist.

This does not address interrupt stacks.  It also does not address
the possibility of privilege escalation by a controlled stack
overflow that overwrites thread_info without hitting the guard page.
I'll send patches to address the latter issue once this series
lands.
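
For context: with the current layout, thread_info shares the stack
allocation (see the thread_union definition in include/linux/sched.h),
roughly:

    union thread_union {
        struct thread_info thread_info;  /* at the lowest stack addresses */
        unsigned long stack[THREAD_SIZE/sizeof(long)];
    };

so a bounded, attacker-controlled overflow can land in thread_info
without ever reaching the guard page below the allocation.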

It's worth noting that s390 has an arch-specific gcc feature that
detects stack overflows by adjusting function prologues.  Arches
with features like that may wish to avoid using vmapped stacks to
minimize the performance hit.

Ingo, once this gets a bit more review, would it make sense to
throw it into a separate branch in -tip?  I wouldn't mind seeing
some -next testing to give people a chance to shake out problems.
I'm particularly interested in whether there are any drivers that
expect virt_to_phys to work on stack addresses.  (I know that
virtio-net used to, but I fixed that a while back.)
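
For reference, the pattern that breaks is something like this
(hypothetical driver snippet, names made up for illustration; the fix
is to DMA from a heap allocation instead):

    static int send_cmd(struct device *dev)
    {
        u8 cmd[16] = { 0 };   /* on-stack DMA buffer -- broken */
        dma_addr_t handle;

        /*
         * With CONFIG_VMAP_STACK, cmd is a vmalloc address, so
         * virt_to_phys() -- and dma_map_single(), which assumes a
         * direct-map address -- returns garbage here.
         */
        handle = dma_map_single(dev, cmd, sizeof(cmd), DMA_TO_DEVICE);
        if (dma_mapping_error(dev, handle))
            return -ENOMEM;
        /* ... start the DMA, wait for completion ... */
        dma_unmap_single(dev, handle, sizeof(cmd), DMA_TO_DEVICE);
        return 0;
    }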

Changes from v1:
 - Fix rewind_stack_and_do_exit (Josh)
 - Fix deadlock under load
 - Clean up generic stack vmalloc code
 - Many other minor fixes

Andy Lutomirski (12):
  x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
  x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately
  mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
  mm: Move memcg stack accounting to account_kernel_stack
  fork: Add generic vmalloced stack support
  x86/die: Don't try to recover from an OOPS on a non-default stack
  x86/dumpstack: When OOPSing, rewind the stack before do_exit
  x86/dumpstack: When dumping stack bytes due to OOPS, start with
    regs->sp
  x86/dumpstack: Try harder to get a call trace on stack overflow
  x86/dumpstack/64: Handle faults when printing the "Stack:" part of an
    OOPS
  x86/mm/64: Enable vmapped stacks
  x86/mm: Improve stack-overflow #PF handling

Ingo Molnar (1):
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()

 arch/Kconfig                        | 29 +++++++++++++
 arch/ia64/include/asm/thread_info.h |  2 +-
 arch/x86/Kconfig                    |  1 +
 arch/x86/entry/entry_32.S           | 11 +++++
 arch/x86/entry/entry_64.S           | 11 +++++
 arch/x86/include/asm/switch_to.h    | 28 +++++++++++-
 arch/x86/include/asm/traps.h        |  6 +++
 arch/x86/kernel/dumpstack.c         | 19 ++++++++-
 arch/x86/kernel/dumpstack_32.c      |  4 +-
 arch/x86/kernel/dumpstack_64.c      | 16 +++++--
 arch/x86/kernel/traps.c             | 32 ++++++++++++++
 arch/x86/mm/fault.c                 | 39 +++++++++++++++++
 arch/x86/mm/init_64.c               | 27 ------------
 arch/x86/mm/pageattr.c              |  7 ++-
 arch/x86/mm/tlb.c                   | 15 +++++++
 drivers/base/node.c                 |  3 +-
 fs/proc/meminfo.c                   |  2 +-
 include/linux/mmzone.h              |  2 +-
 include/linux/sched.h               | 15 +++++++
 kernel/fork.c                       | 85 ++++++++++++++++++++++++++++---------
 mm/page_alloc.c                     |  3 +-
 21 files changed, 295 insertions(+), 62 deletions(-)

-- 
2.5.5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 01/13] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 02/13] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated Andy Lutomirski
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Ingo Molnar, Andrew Morton, Andy Lutomirski,
	Denys Vlasenko, H. Peter Anvin, Oleg Nesterov, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Waiman Long, linux-mm

From: Ingo Molnar <mingo@kernel.org>

So when memory hotplug removes a piece of physical memory from pagetable
mappings, it also frees the underlying PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory hotplugged.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Message-Id: <1442903021-3893-4-git-send-email-mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bce2e5d9edd4..c7465453d64e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -702,27 +702,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 	spin_unlock(&init_mm.page_table_lock);
 }
 
-/* Return true if pgd is changed, otherwise return false. */
-static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
-{
-	pud_t *pud;
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++) {
-		pud = pud_start + i;
-		if (pud_val(*pud))
-			return false;
-	}
-
-	/* free a pud table */
-	free_pagetable(pgd_page(*pgd), 0);
-	spin_lock(&init_mm.page_table_lock);
-	pgd_clear(pgd);
-	spin_unlock(&init_mm.page_table_lock);
-
-	return true;
-}
-
 static void __meminit
 remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 		 bool direct)
@@ -913,7 +892,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	unsigned long addr;
 	pgd_t *pgd;
 	pud_t *pud;
-	bool pgd_changed = false;
 
 	for (addr = start; addr < end; addr = next) {
 		next = pgd_addr_end(addr, end);
@@ -924,13 +902,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 
 		pud = (pud_t *)pgd_page_vaddr(*pgd);
 		remove_pud_table(pud, addr, next, direct);
-		if (free_pud_table(pud, pgd))
-			pgd_changed = true;
 	}
 
-	if (pgd_changed)
-		sync_global_pgds(start, end - 1, 1);
-
 	flush_tlb_all();
 }
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 02/13] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 01/13] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 03/13] x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately Andy Lutomirski
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

This avoids pointless races in which another CPU or task might see a
partially populated global pgd entry.  These races should normally
be harmless, but, if another CPU propagates the entry via
vmalloc_fault and then populate_pgd fails (due to memory allocation
failure, for example), this prevents a use-after-free of the pgd
entry.
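
Roughly, the interleaving being closed off looks like this (sketch,
not a literal trace):

    CPU 0: populate_pgd()              CPU 1
    ----------------------             ----------------------
    set pgd entry -> new pud table
                                       faults on a vmalloc address;
                                       vmalloc_fault() copies the
                                       half-populated pgd entry into
                                       its own pgd
    populate_pud() fails
    frees the pud table
                                       CPU 1's pgd still references
                                       the freed pud table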

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/pageattr.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 7a1f7bbf4105..6a8026918bf6 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1113,7 +1113,9 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr)
 
 	ret = populate_pud(cpa, addr, pgd_entry, pgprot);
 	if (ret < 0) {
-		unmap_pgd_range(cpa->pgd, addr,
+		if (pud)
+			free_page((unsigned long)pud);
+		unmap_pud_range(pgd_entry, addr,
 				addr + (cpa->numpages << PAGE_SHIFT));
 		return ret;
 	}
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 03/13] x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 01/13] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 02/13] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  2016-06-17 20:30   ` Borislav Petkov
  2016-06-17 20:00 ` [PATCH v2 04/13] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks Andy Lutomirski
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

It's currently only used in the EFI code, which is safe AFAICT.
Warn if anyone tries to use it on the normal kernel pgd.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/pageattr.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6a8026918bf6..e9b9c5cedbb8 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1996,6 +1996,9 @@ out:
 void kernel_unmap_pages_in_pgd(pgd_t *root, unsigned long address,
 			       unsigned numpages)
 {
+	/* Unmapping kernel entries from init_mm's pgd is not allowed. */
+	WARN_ON(root == init_mm.pgd);
+
 	unmap_pgd_range(root, address, address + (numpages << PAGE_SHIFT));
 }
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 04/13] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (2 preceding siblings ...)
  2016-06-17 20:00 ` [PATCH v2 03/13] x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  2016-06-20 13:16   ` Michal Hocko
  2016-06-17 20:00 ` [PATCH v2 05/13] mm: Move memcg stack accounting to account_kernel_stack Andy Lutomirski
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, linux-mm

Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
zone.  This only makes sense if each kernel stack exists entirely in
one zone, and allowing vmapped stacks could break this assumption.

Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on
all architectures.  Keep it simple and use KiB.

Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/base/node.c    | 3 +--
 fs/proc/meminfo.c      | 2 +-
 include/linux/mmzone.h | 2 +-
 kernel/fork.c          | 3 ++-
 mm/page_alloc.c        | 3 +--
 5 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..27dc68a0ed2d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -121,8 +121,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
 		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
 		       nid, K(i.sharedram),
-		       nid, node_page_state(nid, NR_KERNEL_STACK) *
-				THREAD_SIZE / 1024,
+		       nid, node_page_state(nid, NR_KERNEL_STACK_KB),
 		       nid, K(node_page_state(nid, NR_PAGETABLE)),
 		       nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
 		       nid, K(node_page_state(nid, NR_BOUNCE)),
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 83720460c5bc..239b5a06cee0 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -145,7 +145,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 				global_page_state(NR_SLAB_UNRECLAIMABLE)),
 		K(global_page_state(NR_SLAB_RECLAIMABLE)),
 		K(global_page_state(NR_SLAB_UNRECLAIMABLE)),
-		global_page_state(NR_KERNEL_STACK) * THREAD_SIZE / 1024,
+		global_page_state(NR_KERNEL_STACK_KB),
 		K(global_page_state(NR_PAGETABLE)),
 #ifdef CONFIG_QUICKLIST
 		K(quicklist_total_size()),
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02069c23486d..63f05a7efb54 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -127,7 +127,7 @@ enum zone_stat_item {
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
 	NR_PAGETABLE,		/* used for pagetables */
-	NR_KERNEL_STACK,
+	NR_KERNEL_STACK_KB,	/* measured in KiB */
 	/* Second 128 byte cacheline */
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
diff --git a/kernel/fork.c b/kernel/fork.c
index 5c2c355aa97f..be7f006af727 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -225,7 +225,8 @@ static void account_kernel_stack(struct thread_info *ti, int account)
 {
 	struct zone *zone = page_zone(virt_to_page(ti));
 
-	mod_zone_page_state(zone, NR_KERNEL_STACK, account);
+	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
+			    THREAD_SIZE / 1024 * account);
 }
 
 void free_task(struct task_struct *tsk)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6903b695ebae..a277dea926c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4457,8 +4457,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_SHMEM)),
 			K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
 			K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
-			zone_page_state(zone, NR_KERNEL_STACK) *
-				THREAD_SIZE / 1024,
+			zone_page_state(zone, NR_KERNEL_STACK_KB),
 			K(zone_page_state(zone, NR_PAGETABLE)),
 			K(zone_page_state(zone, NR_UNSTABLE_NFS)),
 			K(zone_page_state(zone, NR_BOUNCE)),
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 05/13] mm: Move memcg stack accounting to account_kernel_stack
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (3 preceding siblings ...)
  2016-06-17 20:00 ` [PATCH v2 04/13] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  2016-06-17 20:55   ` Josh Poimboeuf
  2016-06-20 13:02   ` Michal Hocko
  2016-06-17 20:00 ` [PATCH v2 06/13] fork: Add generic vmalloced stack support Andy Lutomirski
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, linux-mm

We should account for stacks regardless of stack size.  Move it into
account_kernel_stack.

Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/fork.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index be7f006af727..cd2abe6e4e41 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -165,20 +165,12 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
 						  THREAD_SIZE_ORDER);
 
-	if (page)
-		memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
-					    1 << THREAD_SIZE_ORDER);
-
 	return page ? page_address(page) : NULL;
 }
 
 static inline void free_thread_info(struct thread_info *ti)
 {
-	struct page *page = virt_to_page(ti);
-
-	memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
-				    -(1 << THREAD_SIZE_ORDER));
-	__free_kmem_pages(page, THREAD_SIZE_ORDER);
+	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
@@ -227,6 +219,11 @@ static void account_kernel_stack(struct thread_info *ti, int account)
 
 	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
 			    THREAD_SIZE / 1024 * account);
+
+	/* All stack pages belong to the same memcg. */
+	memcg_kmem_update_page_stat(
+		virt_to_page(ti), MEMCG_KERNEL_STACK,
+		account * (THREAD_SIZE / PAGE_SIZE));
 }
 
 void free_task(struct task_struct *tsk)
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 06/13] fork: Add generic vmalloced stack support
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (4 preceding siblings ...)
  2016-06-17 20:00 ` [PATCH v2 05/13] mm: Move memcg stack accounting to account_kernel_stack Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  2016-06-17 20:57   ` Josh Poimboeuf
  2016-06-20 13:36   ` Michal Hocko
  2016-06-17 20:00 ` [PATCH v2 07/13] x86/die: Don't try to recover from an OOPS on a non-default stack Andy Lutomirski
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
vmalloc_node.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/Kconfig                        | 29 +++++++++++++
 arch/ia64/include/asm/thread_info.h |  2 +-
 include/linux/sched.h               | 15 +++++++
 kernel/fork.c                       | 81 +++++++++++++++++++++++++++++--------
 4 files changed, 109 insertions(+), 18 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d794384a0404..a71e6e7195e6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -658,4 +658,33 @@ config ARCH_NO_COHERENT_DMA_MMAP
 config CPU_NO_EFFICIENT_FFS
 	def_bool n
 
+config HAVE_ARCH_VMAP_STACK
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stacks
+	  in vmalloc space.  This means:
+
+	  - vmalloc space must be large enough to hold many kernel stacks.
+	    This may rule out many 32-bit architectures.
+
+	  - Stacks in vmalloc space need to work reliably.  For example, if
+	    vmap page tables are created on demand, either this mechanism
+	    needs to work while the stack points to a virtual address with
+	    unpopulated page tables or arch code (switch_to and switch_mm,
+	    most likely) needs to ensure that the stack's page table entries
+	    are populated before running on a possibly unpopulated stack.
+
+	  - If the stack overflows into a guard page, something reasonable
+	    should happen.  The definition of "reasonable" is flexible, but
+	    instantly rebooting without logging anything would be unfriendly.
+
+config VMAP_STACK
+	bool "Use a virtually-mapped stack"
+	depends on HAVE_ARCH_VMAP_STACK
+	---help---
+	  Enable this if you want to use virtually-mapped kernel stacks
+	  with guard pages.  This causes kernel stack overflows to be
+	  caught immediately rather than causing difficult-to-diagnose
+	  corruption.
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index aa995b67c3f5..d13edda6e09c 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -56,7 +56,7 @@ struct thread_info {
 #define alloc_thread_info_node(tsk, node)	((struct thread_info *) 0)
 #define task_thread_info(tsk)	((struct thread_info *) 0)
 #endif
-#define free_thread_info(ti)	/* nothing */
+#define free_thread_info(tsk)	/* nothing */
 #define task_stack_page(tsk)	((void *)(tsk))
 
 #define __HAVE_THREAD_FUNCTIONS
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada26345..a37c3b790309 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1918,6 +1918,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_VMAP_STACK
+	struct vm_struct *stack_vm_area;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -1934,6 +1937,18 @@ extern int arch_task_struct_size __read_mostly;
 # define arch_task_struct_size (sizeof(struct task_struct))
 #endif
 
+#ifdef CONFIG_VMAP_STACK
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return t->stack_vm_area;
+}
+#else
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return NULL;
+}
+#endif
+
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
diff --git a/kernel/fork.c b/kernel/fork.c
index cd2abe6e4e41..ad77a6b07708 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -158,19 +158,38 @@ void __weak arch_release_thread_info(struct thread_info *ti)
  * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
  * kmemcache based allocator.
  */
-# if THREAD_SIZE >= PAGE_SIZE
+# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
 static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 						  int node)
 {
+#ifdef CONFIG_VMAP_STACK
+	struct thread_info *ti = __vmalloc_node_range(
+		THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
+		THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
+		0, node, __builtin_return_address(0));
+
+	/*
+	 * We can't call find_vm_area() in interrupt context, and
+	 * free_thread_info can be called in interrupt context, so cache
+	 * the vm_struct.
+	 */
+	if (ti)
+		tsk->stack_vm_area = find_vm_area(ti);
+	return ti;
+#else
 	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
 						  THREAD_SIZE_ORDER);
 
 	return page ? page_address(page) : NULL;
+#endif
 }
 
-static inline void free_thread_info(struct thread_info *ti)
+static inline void free_thread_info(struct task_struct *tsk)
 {
-	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+	if (task_stack_vm_area(tsk))
+		vfree(tsk->stack);
+	else
+		free_kmem_pages((unsigned long)tsk->stack, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
@@ -181,9 +200,9 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 	return kmem_cache_alloc_node(thread_info_cache, THREADINFO_GFP, node);
 }
 
-static void free_thread_info(struct thread_info *ti)
+static void free_thread_info(struct task_struct *tsk)
 {
-	kmem_cache_free(thread_info_cache, ti);
+	kmem_cache_free(thread_info_cache, tsk->stack);
 }
 
 void thread_info_cache_init(void)
@@ -213,24 +232,46 @@ struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-static void account_kernel_stack(struct thread_info *ti, int account)
+static void account_kernel_stack(struct task_struct *tsk, int account)
 {
-	struct zone *zone = page_zone(virt_to_page(ti));
+	struct zone *zone;
+	struct thread_info *ti = task_thread_info(tsk);
+	struct vm_struct *vm = task_stack_vm_area(tsk);
+
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
+
+	if (vm) {
+		int i;
 
-	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
-			    THREAD_SIZE / 1024 * account);
+		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
 
-	/* All stack pages belong to the same memcg. */
-	memcg_kmem_update_page_stat(
-		virt_to_page(ti), MEMCG_KERNEL_STACK,
-		account * (THREAD_SIZE / PAGE_SIZE));
+		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+			mod_zone_page_state(page_zone(vm->pages[i]),
+					    1, PAGE_SIZE / 1024 * account);
+		}
+
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(
+			vm->pages[0], MEMCG_KERNEL_STACK,
+			account * (THREAD_SIZE / PAGE_SIZE));
+	} else {
+		zone = page_zone(virt_to_page(ti));
+
+		mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
+				    THREAD_SIZE / 1024 * account);
+
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(
+			virt_to_page(ti), MEMCG_KERNEL_STACK,
+			account * (THREAD_SIZE / PAGE_SIZE));
+	}
 }
 
 void free_task(struct task_struct *tsk)
 {
-	account_kernel_stack(tsk->stack, -1);
+	account_kernel_stack(tsk, -1);
 	arch_release_thread_info(tsk->stack);
-	free_thread_info(tsk->stack);
+	free_thread_info(tsk);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
 	put_seccomp_filter(tsk);
@@ -342,6 +383,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 {
 	struct task_struct *tsk;
 	struct thread_info *ti;
+	struct vm_struct *stack_vm_area;
 	int err;
 
 	if (node == NUMA_NO_NODE)
@@ -354,11 +396,16 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (!ti)
 		goto free_tsk;
 
+	stack_vm_area = task_stack_vm_area(tsk);
+
 	err = arch_dup_task_struct(tsk, orig);
 	if (err)
 		goto free_ti;
 
 	tsk->stack = ti;
+#ifdef CONFIG_VMAP_STACK
+	tsk->stack_vm_area = stack_vm_area;
+#endif
 #ifdef CONFIG_SECCOMP
 	/*
 	 * We must handle setting up seccomp filters once we're under
@@ -390,14 +437,14 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->task_frag.page = NULL;
 	tsk->wake_q.next = NULL;
 
-	account_kernel_stack(ti, 1);
+	account_kernel_stack(tsk, 1);
 
 	kcov_task_init(tsk);
 
 	return tsk;
 
 free_ti:
-	free_thread_info(ti);
+	free_thread_info(tsk);
 free_tsk:
 	free_task_struct(tsk);
 	return NULL;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 07/13] x86/die: Don't try to recover from an OOPS on a non-default stack
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (5 preceding siblings ...)
  2016-06-17 20:00 ` [PATCH v2 06/13] fork: Add generic vmalloced stack support Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 08/13] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

It's not going to work, because the scheduler will explode if we try
to schedule when running on an IST stack or similar.

This will matter when we let kernel stack overflows (which are
delivered as #DF) call die().
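
The check added below is terse; it is equivalent to asking whether the
current stack pointer lies outside the THREAD_SIZE-aligned region that
holds the task stack, i.e. roughly (sketch, assuming the task stack is
THREAD_SIZE-aligned; the variable name is just for illustration):

    bool on_special_stack =
        (current_stack_pointer() & ~(THREAD_SIZE - 1)) !=
        ((current_top_of_stack() - 1) & ~(THREAD_SIZE - 1));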

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 2bb25c3fe2e8..36effb39c9c9 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -247,6 +247,9 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 		return;
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
+	if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
+	     & ~(THREAD_SIZE - 1)) != 0)
+		panic("Fatal exception on special stack");
 	if (panic_on_oops)
 		panic("Fatal exception");
 	do_exit(signr);
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 08/13] x86/dumpstack: When OOPSing, rewind the stack before do_exit
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (6 preceding siblings ...)
  2016-06-17 20:00 ` [PATCH v2 07/13] x86/die: Don't try to recover from an OOPS on a non-default stack Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 09/13] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we call do_exit with a clean stack, we greatly reduce the risk of
recursive oopses due to stack overflow in do_exit, and we allow
do_exit to work even if we OOPS from an IST stack.  The latter gives
us a much better chance of surviving long enough after we detect a
stack overflow to write out our logs.

I intentionally separated this from the preceding patch that
disables do_exit-on-OOPS on IST stacks.  This way, if we need to
revert this patch, we still end up in an acceptable state wrt stack
overflow handling.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_32.S   | 11 +++++++++++
 arch/x86/entry/entry_64.S   | 11 +++++++++++
 arch/x86/kernel/dumpstack.c | 13 +++++++++----
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 983e5d3a0d27..0b56666e6039 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1153,3 +1153,14 @@ ENTRY(async_page_fault)
 	jmp	error_code
 END(async_page_fault)
 #endif
+
+ENTRY(rewind_stack_do_exit)
+	/* Prevent any naive code from trying to unwind to our caller. */
+	xorl	%ebp, %ebp
+
+	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esi
+	leal	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
+
+	call	do_exit
+1:	jmp 1b
+END(rewind_stack_do_exit)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1807ed..b846875aeea6 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1423,3 +1423,14 @@ ENTRY(ignore_sysret)
 	mov	$-ENOSYS, %eax
 	sysret
 END(ignore_sysret)
+
+ENTRY(rewind_stack_do_exit)
+	/* Prevent any naive code from trying to unwind to our caller. */
+	xorl	%ebp, %ebp
+
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rax
+	leaq	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%rax), %rsp
+
+	call	do_exit
+1:	jmp 1b
+END(rewind_stack_do_exit)
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 36effb39c9c9..d4d085e27d04 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -228,6 +228,8 @@ unsigned long oops_begin(void)
 EXPORT_SYMBOL_GPL(oops_begin);
 NOKPROBE_SYMBOL(oops_begin);
 
+extern void __noreturn rewind_stack_do_exit(int signr);
+
 void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 {
 	if (regs && kexec_should_crash(current))
@@ -247,12 +249,15 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 		return;
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
-	if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
-	     & ~(THREAD_SIZE - 1)) != 0)
-		panic("Fatal exception on special stack");
 	if (panic_on_oops)
 		panic("Fatal exception");
-	do_exit(signr);
+
+	/*
+	 * We're not going to return, but we might be on an IST stack or
+	 * have very little stack space left.  Rewind the stack and kill
+	 * the task.
+	 */
+	rewind_stack_do_exit(signr);
 }
 NOKPROBE_SYMBOL(oops_end);
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 09/13] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (7 preceding siblings ...)
  2016-06-17 20:00 ` [PATCH v2 08/13] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

The comment suggests that show_stack(NULL, NULL) should backtrace
the current context, but the code doesn't match the comment.  If
regs are given, start the "Stack:" hexdump at regs->sp.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack_32.c | 4 +++-
 arch/x86/kernel/dumpstack_64.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index 464ffd69b92e..91069ebe3c87 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -98,7 +98,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	int i;
 
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 5f1c6266eb30..603356a5597a 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -266,7 +266,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * back trace for this cpu:
 	 */
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (8 preceding siblings ...)
  2016-06-17 20:00 ` [PATCH v2 09/13] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 11/13] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we overflow the stack, print_context_stack will abort.  Detect
this case and rewind back into the valid part of the stack so that
we can trace it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index d4d085e27d04..9cdf05d768cf 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -89,7 +89,7 @@ static inline int valid_stack_ptr(struct thread_info *tinfo,
 		else
 			return 0;
 	}
-	return p > t && p < t + THREAD_SIZE - size;
+	return p >= t && p < t + THREAD_SIZE - size;
 }
 
 unsigned long
@@ -100,6 +100,13 @@ print_context_stack(struct thread_info *tinfo,
 {
 	struct stack_frame *frame = (struct stack_frame *)bp;
 
+	/*
+	 * If we overflowed the stack into a guard page, jump back to the
+	 * bottom of the usable stack.
+	 */
+	if ((unsigned long)tinfo - (unsigned long)stack < PAGE_SIZE)
+		stack = (unsigned long *)tinfo;
+
 	while (valid_stack_ptr(tinfo, stack, sizeof(*stack), end)) {
 		unsigned long addr;
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 11/13] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (9 preceding siblings ...)
  2016-06-17 20:00 ` [PATCH v2 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 12/13] x86/mm/64: Enable vmapped stacks Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 13/13] x86/mm: Improve stack-overflow #PF handling Andy Lutomirski
  12 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we overflow the stack into a guard page, we'll recursively fault
when trying to dump the contents of the guard page.  Use
probe_kernel_address so we can recover if this happens.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack_64.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 603356a5597a..5e298638c790 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -276,6 +276,8 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 	stack = sp;
 	for (i = 0; i < kstack_depth_to_print; i++) {
+		unsigned long word;
+
 		if (stack >= irq_stack && stack <= irq_stack_end) {
 			if (stack == irq_stack_end) {
 				stack = (unsigned long *) (irq_stack_end[-1]);
@@ -285,12 +287,18 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		if (kstack_end(stack))
 			break;
 		}
+
+		if (probe_kernel_address(stack, word))
+			break;
+
 		if ((i % STACKSLOTS_PER_LINE) == 0) {
 			if (i != 0)
 				pr_cont("\n");
-			printk("%s %016lx", log_lvl, *stack++);
+			printk("%s %016lx", log_lvl, word);
 		} else
-			pr_cont(" %016lx", *stack++);
+			pr_cont(" %016lx", word);
+
+		stack++;
 		touch_nmi_watchdog();
 	}
 	preempt_enable();
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 12/13] x86/mm/64: Enable vmapped stacks
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (10 preceding siblings ...)
  2016-06-17 20:00 ` [PATCH v2 11/13] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  2016-06-17 20:00 ` [PATCH v2 13/13] x86/mm: Improve stack-overflow #PF handling Andy Lutomirski
  12 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

This allows x86_64 kernels to enable vmapped stacks.  There are a
couple of interesting bits.

First, x86 lazily faults in top-level paging entries for the vmalloc
area.  This won't work if we get a page fault while trying to access
the stack: the CPU will promote it to a double-fault and we'll die.
To avoid this problem, probe the new stack when switching stacks and
forcibly populate the pgd entry for the stack when switching mms.

Second, once we have guard pages around the stack, we'll want to
detect and handle stack overflow.

I didn't enable it on x86_32.  We'd need to rework the double-fault
code a bit and I'm concerned about running out of vmalloc virtual
addresses under some workloads.

This patch, by itself, will behave somewhat erratically when the
stack overflows while RSP is still more than a few tens of bytes
above the bottom of the stack.  Specifically, we'll get #PF and make
it to no_context and an oops without triggering a double-fault, and
no_context doesn't know about stack overflows.  The next patch will
improve that case.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++++++++++++-
 arch/x86/kernel/traps.c          | 32 ++++++++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                | 15 +++++++++++++++
 4 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0a7b885964ba..b624b24d1dc1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -92,6 +92,7 @@ config X86
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_EBPF_JIT			if X86_64
+	select HAVE_ARCH_VMAP_STACK		if X86_64
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8f321a1b03a1..14e4b20f0aaf 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,28 @@ struct tss_struct;
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss);
 
+/* This runs on the previous thread's stack. */
+static inline void prepare_switch_to(struct task_struct *prev,
+				     struct task_struct *next)
+{
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we switch to a stack that has a top-level paging entry
+	 * that is not present in the current mm, the resulting #PF will
+	 * be promoted to a double-fault and we'll panic.  Probe
+	 * the new stack now so that vmalloc_fault can fix up the page
+	 * tables if needed.  This can only happen if we use a stack
+	 * in vmap space.
+	 *
+	 * We assume that the stack is aligned so that it never spans
+	 * more than one top-level paging entry.
+	 *
+	 * To minimize cache pollution, just follow the stack pointer.
+	 */
+	READ_ONCE(*(unsigned char *)next->thread.sp);
+#endif
+}
+
 #ifdef CONFIG_X86_32
 
 #ifdef CONFIG_CC_STACKPROTECTOR
@@ -39,6 +61,8 @@ do {									\
 	 */								\
 	unsigned long ebx, ecx, edx, esi, edi;				\
 									\
+	prepare_switch_to(prev, next);					\
+									\
 	asm volatile("pushl %%ebp\n\t"		/* save    EBP   */	\
 		     "movl %%esp,%[prev_sp]\n\t"	/* save    ESP   */ \
 		     "movl %[next_sp],%%esp\n\t"	/* restore ESP   */ \
@@ -103,7 +127,9 @@ do {									\
  * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
  * has no effect.
  */
-#define switch_to(prev, next, last) \
+#define switch_to(prev, next, last)					  \
+	prepare_switch_to(prev, next);					  \
+									  \
 	asm volatile(SAVE_CONTEXT					  \
 	     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */	  \
 	     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */	  \
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 00f03d82e69a..9cb7ea781176 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present",	segment_not_present)
 DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
+#ifdef CONFIG_VMAP_STACK
+static void __noreturn handle_stack_overflow(const char *message,
+					     struct pt_regs *regs,
+					     unsigned long fault_address)
+{
+	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
+		 (void *)fault_address, current->stack,
+		 (char *)current->stack + THREAD_SIZE - 1);
+	die(message, regs, 0);
+
+	/* Be absolutely certain we don't return. */
+	panic(message);
+}
+#endif
+
 #ifdef CONFIG_X86_64
 /* Runs on IST stack */
 dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 {
 	static const char str[] = "double fault";
 	struct task_struct *tsk = current;
+#ifdef CONFIG_VMAP_STACK
+	unsigned long cr2;
+#endif
 
 #ifdef CONFIG_X86_ESPFIX64
 	extern unsigned char native_irq_return_iret[];
@@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_nr = X86_TRAP_DF;
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we overflow the stack into a guard page, the CPU will fail
+	 * to deliver #PF and will send #DF instead.  CR2 will contain
+	 * the linear address of the second fault, which will be in the
+	 * guard page below the bottom of the stack.
+	 */
+	cr2 = read_cr2();
+	if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
+		handle_stack_overflow(
+			"kernel stack overflow (double-fault)",
+			regs, cr2);
+#endif
+
 #ifdef CONFIG_DOUBLEFAULT
 	df_debug(regs, error_code);
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..fbf036ae72ac 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -77,10 +77,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	unsigned cpu = smp_processor_id();
 
 	if (likely(prev != next)) {
+		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+			/*
+			 * If our current stack is in vmalloc space and isn't
+			 * mapped in the new pgd, we'll double-fault.  Forcibly
+			 * map it.
+			 */
+			unsigned int stack_pgd_index =
+				pgd_index(current_stack_pointer());
+			pgd_t *pgd = next->pgd + stack_pgd_index;
+
+			if (unlikely(pgd_none(*pgd)))
+				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+		}
+
 #ifdef CONFIG_SMP
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		this_cpu_write(cpu_tlbstate.active_mm, next);
 #endif
+
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
 		/*
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 13/13] x86/mm: Improve stack-overflow #PF handling
  2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (11 preceding siblings ...)
  2016-06-17 20:00 ` [PATCH v2 12/13] x86/mm/64: Enable vmapped stacks Andy Lutomirski
@ 2016-06-17 20:00 ` Andy Lutomirski
  12 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 20:00 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we get a page fault indicating kernel stack overflow, invoke
handle_stack_overflow().  To prevent us from overflowing the stack
again while handling the overflow (because we are likely to have
very little stack space left), call handle_stack_overflow() on the
double-fault stack.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/traps.h |  6 ++++++
 arch/x86/kernel/traps.c      |  6 +++---
 arch/x86/mm/fault.c          | 39 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c3496619740a..01fd0a7f48cd 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -117,6 +117,12 @@ extern void ist_exit(struct pt_regs *regs);
 extern void ist_begin_non_atomic(struct pt_regs *regs);
 extern void ist_end_non_atomic(void);
 
+#ifdef CONFIG_VMAP_STACK
+void __noreturn handle_stack_overflow(const char *message,
+				      struct pt_regs *regs,
+				      unsigned long fault_address);
+#endif
+
 /* Interrupts/Exceptions */
 enum {
 	X86_TRAP_DE = 0,	/*  0, Divide-by-zero */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9cb7ea781176..b389c0539eb9 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -293,9 +293,9 @@ DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
 #ifdef CONFIG_VMAP_STACK
-static void __noreturn handle_stack_overflow(const char *message,
-					     struct pt_regs *regs,
-					     unsigned long fault_address)
+__visible void __noreturn handle_stack_overflow(const char *message,
+						struct pt_regs *regs,
+						unsigned long fault_address)
 {
 	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
 		 (void *)fault_address, current->stack,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7d1fa7cd2374..c68b81f5659f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -753,6 +753,45 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * Stack overflow?  During boot, we can fault near the initial
+	 * stack in the direct map, but that's not an overflow -- check
+	 * that we're in vmalloc space to avoid this.
+	 *
+	 * Check this after trying fixup_exception, since there are a handful
+	 * of kernel code paths that wander off the top of the stack but
+	 * handle any faults that occur.  Once those are fixed, we can
+	 * move this above fixup_exception.
+	 */
+	if (is_vmalloc_addr((void *)address) &&
+	    (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
+	     address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
+		register void *__sp asm("rsp");
+		unsigned long stack =
+			this_cpu_read(orig_ist.ist[DOUBLEFAULT_STACK]) -
+			sizeof(void *);
+		/*
+		 * We're likely to be running with very little stack space
+		 * left.  It's plausible that we'd hit this condition but
+		 * double-fault even before we get this far, in which case
+		 * we're fine: the double-fault handler will deal with it.
+		 *
+		 * We don't want to make it all the way into the oops code
+		 * and then double-fault, though, because we're likely to
+		 * break the console driver and lose most of the stack dump.
+		 */
+		asm volatile ("movq %[stack], %%rsp\n\t"
+			      "call handle_stack_overflow\n\t"
+			      "1: jmp 1b"
+			      : "+r" (__sp)
+			      : "D" ("kernel stack overflow (page fault)"),
+				"S" (regs), "d" (address),
+				[stack] "rm" (stack));
+		unreachable();
+	}
+#endif
+
 	/*
 	 * 32-bit:
 	 *
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 03/13] x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately
  2016-06-17 20:00 ` [PATCH v2 03/13] x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately Andy Lutomirski
@ 2016-06-17 20:30   ` Borislav Petkov
  2016-06-18 10:29     ` Andy Lutomirski
  0 siblings, 1 reply; 28+ messages in thread
From: Borislav Petkov @ 2016-06-17 20:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens

On Fri, Jun 17, 2016 at 01:00:39PM -0700, Andy Lutomirski wrote:
> It's currently only used in the EFI code, which is safe AFAICT.

"It is basically useful for a pagetable hierarchy which is not init_mm."

> Warn if anyone tries to use it on the normal kernel pgd.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/mm/pageattr.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index 6a8026918bf6..e9b9c5cedbb8 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -1996,6 +1996,9 @@ out:
>  void kernel_unmap_pages_in_pgd(pgd_t *root, unsigned long address,
>  			       unsigned numpages)
>  {
> +	/* Unmapping kernel entries from init_mm's pgd is not allowed. */
> +	WARN_ON(root == init_mm.pgd);

We can also return and not do the unmapping:

	if (WARN_ON(root == init_mm.pgd))
		return;

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 05/13] mm: Move memcg stack accounting to account_kernel_stack
  2016-06-17 20:00 ` [PATCH v2 05/13] mm: Move memcg stack accounting to account_kernel_stack Andy Lutomirski
@ 2016-06-17 20:55   ` Josh Poimboeuf
  2016-06-17 22:18     ` Andy Lutomirski
  2016-06-20 13:02   ` Michal Hocko
  1 sibling, 1 reply; 28+ messages in thread
From: Josh Poimboeuf @ 2016-06-17 20:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Jann Horn, Heiko Carstens, Vladimir Davydov, Johannes Weiner,
	Michal Hocko, linux-mm

On Fri, Jun 17, 2016 at 01:00:41PM -0700, Andy Lutomirski wrote:
> We should account for stacks regardless of stack size.  Move it into
> account_kernel_stack.
> 
> Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  kernel/fork.c | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index be7f006af727..cd2abe6e4e41 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -165,20 +165,12 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>  	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
>  						  THREAD_SIZE_ORDER);
>  
> -	if (page)
> -		memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
> -					    1 << THREAD_SIZE_ORDER);
> -
>  	return page ? page_address(page) : NULL;
>  }
>  
>  static inline void free_thread_info(struct thread_info *ti)
>  {
> -	struct page *page = virt_to_page(ti);
> -
> -	memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
> -				    -(1 << THREAD_SIZE_ORDER));
> -	__free_kmem_pages(page, THREAD_SIZE_ORDER);
> +	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
>  }
>  # else
>  static struct kmem_cache *thread_info_cache;
> @@ -227,6 +219,11 @@ static void account_kernel_stack(struct thread_info *ti, int account)
>  
>  	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
>  			    THREAD_SIZE / 1024 * account);
> +
> +	/* All stack pages belong to the same memcg. */
> +	memcg_kmem_update_page_stat(
> +		virt_to_page(ti), MEMCG_KERNEL_STACK,
> +		account * (THREAD_SIZE / PAGE_SIZE));
>  }

Won't this be broken in the case where THREAD_SIZE < PAGE_SIZE?

-- 
Josh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 06/13] fork: Add generic vmalloced stack support
  2016-06-17 20:00 ` [PATCH v2 06/13] fork: Add generic vmalloced stack support Andy Lutomirski
@ 2016-06-17 20:57   ` Josh Poimboeuf
  2016-06-17 22:18     ` Andy Lutomirski
  2016-06-20 13:36   ` Michal Hocko
  1 sibling, 1 reply; 28+ messages in thread
From: Josh Poimboeuf @ 2016-06-17 20:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Jann Horn, Heiko Carstens

On Fri, Jun 17, 2016 at 01:00:42PM -0700, Andy Lutomirski wrote:
> @@ -213,24 +232,46 @@ struct kmem_cache *vm_area_cachep;
>  /* SLAB cache for mm_struct structures (tsk->mm) */
>  static struct kmem_cache *mm_cachep;
>  
> -static void account_kernel_stack(struct thread_info *ti, int account)
> +static void account_kernel_stack(struct task_struct *tsk, int account)
>  {
> -	struct zone *zone = page_zone(virt_to_page(ti));
> +	struct zone *zone;
> +	struct thread_info *ti = task_thread_info(tsk);
> +	struct vm_struct *vm = task_stack_vm_area(tsk);
> +
> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
> +
> +	if (vm) {
> +		int i;
>  
> -	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
> -			    THREAD_SIZE / 1024 * account);
> +		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
>  
> -	/* All stack pages belong to the same memcg. */
> -	memcg_kmem_update_page_stat(
> -		virt_to_page(ti), MEMCG_KERNEL_STACK,
> -		account * (THREAD_SIZE / PAGE_SIZE));
> +		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
> +			mod_zone_page_state(page_zone(vm->pages[i]),
> +					    1, PAGE_SIZE / 1024 * account);

Shouldn't the second argument be NR_KERNEL_STACK_KB instead of 1?

-- 
Josh

* Re: [PATCH v2 05/13] mm: Move memcg stack accounting to account_kernel_stack
  2016-06-17 20:55   ` Josh Poimboeuf
@ 2016-06-17 22:18     ` Andy Lutomirski
  0 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 22:18 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Jann Horn, Heiko Carstens,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, linux-mm

On Fri, Jun 17, 2016 at 1:55 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Jun 17, 2016 at 01:00:41PM -0700, Andy Lutomirski wrote:
>> We should account for stacks regardless of stack size.  Move it into
>> account_kernel_stack.
>>
>> Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
>> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: linux-mm@kvack.org
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  kernel/fork.c | 15 ++++++---------
>>  1 file changed, 6 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index be7f006af727..cd2abe6e4e41 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -165,20 +165,12 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>>       struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
>>                                                 THREAD_SIZE_ORDER);
>>
>> -     if (page)
>> -             memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
>> -                                         1 << THREAD_SIZE_ORDER);
>> -
>>       return page ? page_address(page) : NULL;
>>  }
>>
>>  static inline void free_thread_info(struct thread_info *ti)
>>  {
>> -     struct page *page = virt_to_page(ti);
>> -
>> -     memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
>> -                                 -(1 << THREAD_SIZE_ORDER));
>> -     __free_kmem_pages(page, THREAD_SIZE_ORDER);
>> +     free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
>>  }
>>  # else
>>  static struct kmem_cache *thread_info_cache;
>> @@ -227,6 +219,11 @@ static void account_kernel_stack(struct thread_info *ti, int account)
>>
>>       mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
>>                           THREAD_SIZE / 1024 * account);
>> +
>> +     /* All stack pages belong to the same memcg. */
>> +     memcg_kmem_update_page_stat(
>> +             virt_to_page(ti), MEMCG_KERNEL_STACK,
>> +             account * (THREAD_SIZE / PAGE_SIZE));
>>  }
>
> Won't this be broken in the case where THREAD_SIZE < PAGE_SIZE?

In my defense, it was broken before this change, too.  Sigh.  I'll
change this to count in KiB too.

As far as I can tell, this thing is used for display only.
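
Roughly, a sketch of the KiB-based version (MEMCG_KERNEL_STACK_KB here is a
placeholder stat name; nothing in this series defines it yet):

	static void account_kernel_stack(struct thread_info *ti, int account)
	{
		struct zone *zone = page_zone(virt_to_page(ti));

		mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
				    THREAD_SIZE / 1024 * account);

		/*
		 * Sketch only: count the memcg stat in KiB as well so the
		 * value stays meaningful even when THREAD_SIZE < PAGE_SIZE.
		 * MEMCG_KERNEL_STACK_KB is assumed, not yet defined.
		 */
		memcg_kmem_update_page_stat(virt_to_page(ti),
					    MEMCG_KERNEL_STACK_KB,
					    account * (THREAD_SIZE / 1024));
	}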

--Andy

* Re: [PATCH v2 06/13] fork: Add generic vmalloced stack support
  2016-06-17 20:57   ` Josh Poimboeuf
@ 2016-06-17 22:18     ` Andy Lutomirski
  0 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-17 22:18 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Jann Horn, Heiko Carstens

On Fri, Jun 17, 2016 at 1:57 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Jun 17, 2016 at 01:00:42PM -0700, Andy Lutomirski wrote:
>> @@ -213,24 +232,46 @@ struct kmem_cache *vm_area_cachep;
>>  /* SLAB cache for mm_struct structures (tsk->mm) */
>>  static struct kmem_cache *mm_cachep;
>>
>> -static void account_kernel_stack(struct thread_info *ti, int account)
>> +static void account_kernel_stack(struct task_struct *tsk, int account)
>>  {
>> -     struct zone *zone = page_zone(virt_to_page(ti));
>> +     struct zone *zone;
>> +     struct thread_info *ti = task_thread_info(tsk);
>> +     struct vm_struct *vm = task_stack_vm_area(tsk);
>> +
>> +     BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
>> +
>> +     if (vm) {
>> +             int i;
>>
>> -     mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
>> -                         THREAD_SIZE / 1024 * account);
>> +             BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
>>
>> -     /* All stack pages belong to the same memcg. */
>> -     memcg_kmem_update_page_stat(
>> -             virt_to_page(ti), MEMCG_KERNEL_STACK,
>> -             account * (THREAD_SIZE / PAGE_SIZE));
>> +             for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
>> +                     mod_zone_page_state(page_zone(vm->pages[i]),
>> +                                         1, PAGE_SIZE / 1024 * account);
>
> Shouldn't the second argument be NR_KERNEL_STACK_KB instead of 1?

Indeed. Queued for v3.

--Andy

* Re: [PATCH v2 03/13] x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately
  2016-06-17 20:30   ` Borislav Petkov
@ 2016-06-18 10:29     ` Andy Lutomirski
  2016-06-18 10:37       ` Borislav Petkov
  0 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-18 10:29 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, linux-kernel, linux-arch, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Matt Fleming,
	linux-efi

On Fri, Jun 17, 2016 at 1:30 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jun 17, 2016 at 01:00:39PM -0700, Andy Lutomirski wrote:
>> It's currently only used in the EFI code, which is safe AFAICT.
>
> "It is basically useful for a pagetable hierarchy which is not init_mm."
>
>> Warn if anyone tries to use it on the normal kernel pgd.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/mm/pageattr.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
>> index 6a8026918bf6..e9b9c5cedbb8 100644
>> --- a/arch/x86/mm/pageattr.c
>> +++ b/arch/x86/mm/pageattr.c
>> @@ -1996,6 +1996,9 @@ out:
>>  void kernel_unmap_pages_in_pgd(pgd_t *root, unsigned long address,
>>                              unsigned numpages)
>>  {
>> +     /* Unmapping kernel entries from init_mm's pgd is not allowed. */
>> +     WARN_ON(root == init_mm.pgd);
>
> We can also return and not do the unmapping:
>
>         if (WARN_ON(root == init_mm.pgd))
>                 return;

I'll do one better: the only function that calls this function is
unused.  I'll just delete it.

--Andy

* Re: [PATCH v2 03/13] x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately
  2016-06-18 10:29     ` Andy Lutomirski
@ 2016-06-18 10:37       ` Borislav Petkov
  0 siblings, 0 replies; 28+ messages in thread
From: Borislav Petkov @ 2016-06-18 10:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, linux-kernel, linux-arch, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Matt Fleming,
	linux-efi

On Sat, Jun 18, 2016 at 03:29:01AM -0700, Andy Lutomirski wrote:
> I'll do one better: the only function that calls this function is
> unused.  I'll just delete it.

Fair enough - Matt is on CC.

Btw, normally I'm almost never talking to you at that time of the day,
what's up? Can't sleep?

:-))

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

* Re: [PATCH v2 05/13] mm: Move memcg stack accounting to account_kernel_stack
  2016-06-17 20:00 ` [PATCH v2 05/13] mm: Move memcg stack accounting to account_kernel_stack Andy Lutomirski
  2016-06-17 20:55   ` Josh Poimboeuf
@ 2016-06-20 13:02   ` Michal Hocko
  2016-06-20 16:05     ` Andy Lutomirski
  1 sibling, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 13:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Vladimir Davydov,
	Johannes Weiner, linux-mm

On Fri 17-06-16 13:00:41, Andy Lutomirski wrote:
> We should account for stacks regardless of stack size.  Move it into
> account_kernel_stack.
> 
> Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  kernel/fork.c | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index be7f006af727..cd2abe6e4e41 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -165,20 +165,12 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>  	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
>  						  THREAD_SIZE_ORDER);
>  
> -	if (page)
> -		memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
> -					    1 << THREAD_SIZE_ORDER);
> -
>  	return page ? page_address(page) : NULL;
>  }
>  
>  static inline void free_thread_info(struct thread_info *ti)
>  {
> -	struct page *page = virt_to_page(ti);
> -
> -	memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
> -				    -(1 << THREAD_SIZE_ORDER));
> -	__free_kmem_pages(page, THREAD_SIZE_ORDER);
> +	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
>  }
>  # else
>  static struct kmem_cache *thread_info_cache;
> @@ -227,6 +219,11 @@ static void account_kernel_stack(struct thread_info *ti, int account)
>  
>  	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
>  			    THREAD_SIZE / 1024 * account);
> +
> +	/* All stack pages belong to the same memcg. */
> +	memcg_kmem_update_page_stat(
> +		virt_to_page(ti), MEMCG_KERNEL_STACK,
> +		account * (THREAD_SIZE / PAGE_SIZE));
>  }
>  
>  void free_task(struct task_struct *tsk)
> -- 
> 2.5.5

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2 04/13] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
  2016-06-17 20:00 ` [PATCH v2 04/13] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks Andy Lutomirski
@ 2016-06-20 13:16   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 13:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Vladimir Davydov,
	Johannes Weiner, linux-mm

On Fri 17-06-16 13:00:40, Andy Lutomirski wrote:
> Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
> zone.  This only makes sense if each kernel stack exists entirely in
> one zone, and allowing vmapped stacks could break this assumption.
> 
> Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
> allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on
> all architectures.  Keep it simple and use KiB.
> 
> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Makes sense to me.
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  drivers/base/node.c    | 3 +--
>  fs/proc/meminfo.c      | 2 +-
>  include/linux/mmzone.h | 2 +-
>  kernel/fork.c          | 3 ++-
>  mm/page_alloc.c        | 3 +--
>  5 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 560751bad294..27dc68a0ed2d 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -121,8 +121,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
>  		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
>  		       nid, K(i.sharedram),
> -		       nid, node_page_state(nid, NR_KERNEL_STACK) *
> -				THREAD_SIZE / 1024,
> +		       nid, node_page_state(nid, NR_KERNEL_STACK_KB),
>  		       nid, K(node_page_state(nid, NR_PAGETABLE)),
>  		       nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
>  		       nid, K(node_page_state(nid, NR_BOUNCE)),
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 83720460c5bc..239b5a06cee0 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -145,7 +145,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  				global_page_state(NR_SLAB_UNRECLAIMABLE)),
>  		K(global_page_state(NR_SLAB_RECLAIMABLE)),
>  		K(global_page_state(NR_SLAB_UNRECLAIMABLE)),
> -		global_page_state(NR_KERNEL_STACK) * THREAD_SIZE / 1024,
> +		global_page_state(NR_KERNEL_STACK_KB),
>  		K(global_page_state(NR_PAGETABLE)),
>  #ifdef CONFIG_QUICKLIST
>  		K(quicklist_total_size()),
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 02069c23486d..63f05a7efb54 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -127,7 +127,7 @@ enum zone_stat_item {
>  	NR_SLAB_RECLAIMABLE,
>  	NR_SLAB_UNRECLAIMABLE,
>  	NR_PAGETABLE,		/* used for pagetables */
> -	NR_KERNEL_STACK,
> +	NR_KERNEL_STACK_KB,	/* measured in KiB */
>  	/* Second 128 byte cacheline */
>  	NR_UNSTABLE_NFS,	/* NFS unstable pages */
>  	NR_BOUNCE,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5c2c355aa97f..be7f006af727 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -225,7 +225,8 @@ static void account_kernel_stack(struct thread_info *ti, int account)
>  {
>  	struct zone *zone = page_zone(virt_to_page(ti));
>  
> -	mod_zone_page_state(zone, NR_KERNEL_STACK, account);
> +	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
> +			    THREAD_SIZE / 1024 * account);
>  }
>  
>  void free_task(struct task_struct *tsk)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6903b695ebae..a277dea926c9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4457,8 +4457,7 @@ void show_free_areas(unsigned int filter)
>  			K(zone_page_state(zone, NR_SHMEM)),
>  			K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
>  			K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
> -			zone_page_state(zone, NR_KERNEL_STACK) *
> -				THREAD_SIZE / 1024,
> +			zone_page_state(zone, NR_KERNEL_STACK_KB),
>  			K(zone_page_state(zone, NR_PAGETABLE)),
>  			K(zone_page_state(zone, NR_UNSTABLE_NFS)),
>  			K(zone_page_state(zone, NR_BOUNCE)),
> -- 
> 2.5.5

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2 06/13] fork: Add generic vmalloced stack support
  2016-06-17 20:00 ` [PATCH v2 06/13] fork: Add generic vmalloced stack support Andy Lutomirski
  2016-06-17 20:57   ` Josh Poimboeuf
@ 2016-06-20 13:36   ` Michal Hocko
  2016-06-20 16:13     ` Andy Lutomirski
  1 sibling, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-06-20 13:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Fri 17-06-16 13:00:42, Andy Lutomirski wrote:
> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
> vmalloc_node.

I like this! It also reduces demand for higher order (order-2) pages
considerably which is a great plus on its own. I would be little bit
worried about the performance because vmalloc wasn't the fastest one
AFAIR. Have you tried to measure that?

From a quick glance the patch looks OK; I will have to look closer,
though.

> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/Kconfig                        | 29 +++++++++++++
>  arch/ia64/include/asm/thread_info.h |  2 +-
>  include/linux/sched.h               | 15 +++++++
>  kernel/fork.c                       | 81 +++++++++++++++++++++++++++++--------
>  4 files changed, 109 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index d794384a0404..a71e6e7195e6 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -658,4 +658,33 @@ config ARCH_NO_COHERENT_DMA_MMAP
>  config CPU_NO_EFFICIENT_FFS
>  	def_bool n
>  
> +config HAVE_ARCH_VMAP_STACK
> +	def_bool n
> +	help
> +	  An arch should select this symbol if it can support kernel stacks
> +	  in vmalloc space.  This means:
> +
> +	  - vmalloc space must be large enough to hold many kernel stacks.
> +	    This may rule out many 32-bit architectures.
> +
> +	  - Stacks in vmalloc space need to work reliably.  For example, if
> +	    vmap page tables are created on demand, either this mechanism
> +	    needs to work while the stack points to a virtual address with
> +	    unpopulated page tables or arch code (switch_to and switch_mm,
> +	    most likely) needs to ensure that the stack's page table entries
> +	    are populated before running on a possibly unpopulated stack.
> +
> +	  - If the stack overflows into a guard page, something reasonable
> +	    should happen.  The definition of "reasonable" is flexible, but
> +	    instantly rebooting without logging anything would be unfriendly.
> +
> +config VMAP_STACK
> +	bool "Use a virtually-mapped stack"
> +	depends on HAVE_ARCH_VMAP_STACK
> +	---help---
> +	  Enable this if you want to use virtually-mapped kernel stacks
> +	  with guard pages.  This causes kernel stack overflows to be
> +	  caught immediately rather than causing difficult-to-diagnose
> +	  corruption.
> +
>  source "kernel/gcov/Kconfig"
> diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
> index aa995b67c3f5..d13edda6e09c 100644
> --- a/arch/ia64/include/asm/thread_info.h
> +++ b/arch/ia64/include/asm/thread_info.h
> @@ -56,7 +56,7 @@ struct thread_info {
>  #define alloc_thread_info_node(tsk, node)	((struct thread_info *) 0)
>  #define task_thread_info(tsk)	((struct thread_info *) 0)
>  #endif
> -#define free_thread_info(ti)	/* nothing */
> +#define free_thread_info(tsk)	/* nothing */
>  #define task_stack_page(tsk)	((void *)(tsk))
>  
>  #define __HAVE_THREAD_FUNCTIONS
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6e42ada26345..a37c3b790309 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1918,6 +1918,9 @@ struct task_struct {
>  #ifdef CONFIG_MMU
>  	struct task_struct *oom_reaper_list;
>  #endif
> +#ifdef CONFIG_VMAP_STACK
> +	struct vm_struct *stack_vm_area;
> +#endif
>  /* CPU-specific state of this task */
>  	struct thread_struct thread;
>  /*
> @@ -1934,6 +1937,18 @@ extern int arch_task_struct_size __read_mostly;
>  # define arch_task_struct_size (sizeof(struct task_struct))
>  #endif
>  
> +#ifdef CONFIG_VMAP_STACK
> +static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
> +{
> +	return t->stack_vm_area;
> +}
> +#else
> +static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
> +{
> +	return NULL;
> +}
> +#endif
> +
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
>  #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cd2abe6e4e41..ad77a6b07708 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -158,19 +158,38 @@ void __weak arch_release_thread_info(struct thread_info *ti)
>   * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
>   * kmemcache based allocator.
>   */
> -# if THREAD_SIZE >= PAGE_SIZE
> +# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
>  static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>  						  int node)
>  {
> +#ifdef CONFIG_VMAP_STACK
> +	struct thread_info *ti = __vmalloc_node_range(
> +		THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
> +		THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
> +		0, node, __builtin_return_address(0));
> +
> +	/*
> +	 * We can't call find_vm_area() in interrupt context, and
> +	 * free_thread_info can be called in interrupt context, so cache
> +	 * the vm_struct.
> +	 */
> +	if (ti)
> +		tsk->stack_vm_area = find_vm_area(ti);
> +	return ti;
> +#else
>  	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
>  						  THREAD_SIZE_ORDER);
>  
>  	return page ? page_address(page) : NULL;
> +#endif
>  }
>  
> -static inline void free_thread_info(struct thread_info *ti)
> +static inline void free_thread_info(struct task_struct *tsk)
>  {
> -	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
> +	if (task_stack_vm_area(tsk))
> +		vfree(tsk->stack);
> +	else
> +		free_kmem_pages((unsigned long)tsk->stack, THREAD_SIZE_ORDER);
>  }
>  # else
>  static struct kmem_cache *thread_info_cache;
> @@ -181,9 +200,9 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>  	return kmem_cache_alloc_node(thread_info_cache, THREADINFO_GFP, node);
>  }
>  
> -static void free_thread_info(struct thread_info *ti)
> +static void free_thread_info(struct task_struct *tsk)
>  {
> -	kmem_cache_free(thread_info_cache, ti);
> +	kmem_cache_free(thread_info_cache, tsk->stack);
>  }
>  
>  void thread_info_cache_init(void)
> @@ -213,24 +232,46 @@ struct kmem_cache *vm_area_cachep;
>  /* SLAB cache for mm_struct structures (tsk->mm) */
>  static struct kmem_cache *mm_cachep;
>  
> -static void account_kernel_stack(struct thread_info *ti, int account)
> +static void account_kernel_stack(struct task_struct *tsk, int account)
>  {
> -	struct zone *zone = page_zone(virt_to_page(ti));
> +	struct zone *zone;
> +	struct thread_info *ti = task_thread_info(tsk);
> +	struct vm_struct *vm = task_stack_vm_area(tsk);
> +
> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
> +
> +	if (vm) {
> +		int i;
>  
> -	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
> -			    THREAD_SIZE / 1024 * account);
> +		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
>  
> -	/* All stack pages belong to the same memcg. */
> -	memcg_kmem_update_page_stat(
> -		virt_to_page(ti), MEMCG_KERNEL_STACK,
> -		account * (THREAD_SIZE / PAGE_SIZE));
> +		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
> +			mod_zone_page_state(page_zone(vm->pages[i]),
> +					    1, PAGE_SIZE / 1024 * account);
> +		}
> +
> +		/* All stack pages belong to the same memcg. */
> +		memcg_kmem_update_page_stat(
> +			vm->pages[0], MEMCG_KERNEL_STACK,
> +			account * (THREAD_SIZE / PAGE_SIZE));
> +	} else {
> +		zone = page_zone(virt_to_page(ti));
> +
> +		mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
> +				    THREAD_SIZE / 1024 * account);
> +
> +		/* All stack pages belong to the same memcg. */
> +		memcg_kmem_update_page_stat(
> +			virt_to_page(ti), MEMCG_KERNEL_STACK,
> +			account * (THREAD_SIZE / PAGE_SIZE));
> +	}
>  }
>  
>  void free_task(struct task_struct *tsk)
>  {
> -	account_kernel_stack(tsk->stack, -1);
> +	account_kernel_stack(tsk, -1);
>  	arch_release_thread_info(tsk->stack);
> -	free_thread_info(tsk->stack);
> +	free_thread_info(tsk);
>  	rt_mutex_debug_task_free(tsk);
>  	ftrace_graph_exit_task(tsk);
>  	put_seccomp_filter(tsk);
> @@ -342,6 +383,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  {
>  	struct task_struct *tsk;
>  	struct thread_info *ti;
> +	struct vm_struct *stack_vm_area;
>  	int err;
>  
>  	if (node == NUMA_NO_NODE)
> @@ -354,11 +396,16 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  	if (!ti)
>  		goto free_tsk;
>  
> +	stack_vm_area = task_stack_vm_area(tsk);
> +
>  	err = arch_dup_task_struct(tsk, orig);
>  	if (err)
>  		goto free_ti;
>  
>  	tsk->stack = ti;
> +#ifdef CONFIG_VMAP_STACK
> +	tsk->stack_vm_area = stack_vm_area;
> +#endif
>  #ifdef CONFIG_SECCOMP
>  	/*
>  	 * We must handle setting up seccomp filters once we're under
> @@ -390,14 +437,14 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  	tsk->task_frag.page = NULL;
>  	tsk->wake_q.next = NULL;
>  
> -	account_kernel_stack(ti, 1);
> +	account_kernel_stack(tsk, 1);
>  
>  	kcov_task_init(tsk);
>  
>  	return tsk;
>  
>  free_ti:
> -	free_thread_info(ti);
> +	free_thread_info(tsk);
>  free_tsk:
>  	free_task_struct(tsk);
>  	return NULL;
> -- 
> 2.5.5

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2 05/13] mm: Move memcg stack accounting to account_kernel_stack
  2016-06-20 13:02   ` Michal Hocko
@ 2016-06-20 16:05     ` Andy Lutomirski
  0 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-20 16:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-arch, Josh Poimboeuf, Borislav Petkov, Heiko Carstens,
	linux-mm, Brian Gerst, Johannes Weiner, Linus Torvalds,
	Nadav Amit, Kees Cook, linux-kernel, Jann Horn, Vladimir Davydov,
	kernel-hardening, X86 ML

On Jun 20, 2016 6:02 AM, "Michal Hocko" <mhocko@kernel.org> wrote:
>
> On Fri 17-06-16 13:00:41, Andy Lutomirski wrote:
> > We should account for stacks regardless of stack size.  Move it into
> > account_kernel_stack.
> >
> > Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
> > Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: linux-mm@kvack.org
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>

This needs the same kilobyte treatment as the other accounting patch,
so I'm going to send v3 without your ack.

--Andy

* Re: [PATCH v2 06/13] fork: Add generic vmalloced stack support
  2016-06-20 13:36   ` Michal Hocko
@ 2016-06-20 16:13     ` Andy Lutomirski
  2016-06-21  8:46       ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-20 16:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andy Lutomirski, X86 ML, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens

On Mon, Jun 20, 2016 at 6:36 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 17-06-16 13:00:42, Andy Lutomirski wrote:
>> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
>> vmalloc_node.
>
> I like this! It also reduces demand for higher order (order-2) pages
> considerably which is a great plus on its own. I would be little bit
> worried about the performance because vmalloc wasn't the fastest one
> AFAIR. Have you tried to measure that?

It seems to add about 1.5µs to pthread_create+join on my laptop.  (On
an unmodified, stripped-down kernel, it took about 7µs before.  On a
Fedora system, the baseline is much worse.)  I think that most of the
overhead is because vmalloc allocates one page at a time, which means
that it won't use a higher order page even if one is sitting on a
freelist.

I can imagine better integration with the page allocator in which
higher order pages are used if readily available.  Similarly, vfree
could free pages that happen to be aligned and consecutive as a unit
to avoid the overhead of merging them back together one at a time.

But I'm not planning on doing any of this myself any time soon.  I
just want to get the code working and merged.
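
For reference, a microbenchmark along those lines is easy to write; the
following sketch (not the exact harness behind the numbers above) times
pthread_create()+pthread_join() pairs:

	#include <pthread.h>
	#include <stdio.h>
	#include <time.h>

	static void *noop(void *arg) { return arg; }

	int main(void)
	{
		enum { N = 100000 };
		struct timespec t0, t1;
		pthread_t th;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (int i = 0; i < N; i++) {
			if (pthread_create(&th, NULL, noop, NULL))
				return 1;
			pthread_join(th, NULL);
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
		printf("%.2f us per create+join\n", ns / N / 1e3);
		return 0;
	}

Build with gcc -O2 -pthread and compare kernels with and without
CONFIG_VMAP_STACK.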

--Andy

* Re: [PATCH v2 06/13] fork: Add generic vmalloced stack support
  2016-06-20 16:13     ` Andy Lutomirski
@ 2016-06-21  8:46       ` Michal Hocko
  2016-06-21 17:01         ` Andy Lutomirski
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-06-21  8:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens

On Mon 20-06-16 09:13:55, Andy Lutomirski wrote:
> On Mon, Jun 20, 2016 at 6:36 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Fri 17-06-16 13:00:42, Andy Lutomirski wrote:
> >> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
> >> vmalloc_node.
> >
> > I like this! It also reduces demand for higher order (order-2) pages
> > considerably which is a great plus on its own. I would be little bit
> > worried about the performance because vmalloc wasn't the fastest one
> > AFAIR. Have you tried to measure that?
> 
> It seems to add about 1.5µs to pthread_create+join on my laptop.  (On
> an unmodified, stripped-down kernel, it took about 7µs before.  On a
> Fedora system, the baseline is much worse.)  I think that most of the
> overhead is because vmalloc allocates one page at a time, which means
> that it won't use a higher order page even if one is sitting on a
> freelist.

I guess a less artificial test case which would generate a lot of
tasks and some memory pressure would be more representative (e.g.
kernbench). The thing is that even order-2 pages might get quite
expensive when the memory is fragmented.

> I can imagine better integration with the page allocator in which
> higher order pages are used if readily available.  Similarly, vfree
> could free pages that happen to be aligned and consecutive as a unit
> to avoid the overhead of merging them back together one at a time.
> 
> But I'm not planning on doing any of this myself any time soon.  I
> just want to get the code working and merged.

I agree, there is a room for improvement but no necessarily as a part of
this series.

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2 06/13] fork: Add generic vmalloced stack support
  2016-06-21  8:46       ` Michal Hocko
@ 2016-06-21 17:01         ` Andy Lutomirski
  0 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-21 17:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andy Lutomirski, X86 ML, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens

On Tue, Jun 21, 2016 at 1:46 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 20-06-16 09:13:55, Andy Lutomirski wrote:
>> On Mon, Jun 20, 2016 at 6:36 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Fri 17-06-16 13:00:42, Andy Lutomirski wrote:
>> >> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
>> >> vmalloc_node.
>> >
>> > I like this! It also reduces demand for higher order (order-2) pages
>> > considerably which is a great plus on its own. I would be little bit
>> > worried about the performance because vmalloc wasn't the fastest one
>> > AFAIR. Have you tried to measure that?
>>
>> It seems to add about 1.5µs to pthread_create+join on my laptop.  (On
>> an unmodified, stripped-down kernel, it took about 7µs before.  On a
>> Fedora system, the baseline is much worse.)  I think that most of the
>> overhead is because vmalloc allocates one page at a time, which means
>> that it won't use a higher order page even if one is sitting on a
>> freelist.
>
> I guess a less artificial test case which would generate a lot of
> tasks and some memory pressure would be more representative (e.g.
> kernbench). The thing is that even order-2 pages might get quite
> expensive when the memory is fragmented.
>
>> I can imagine better integration with the page allocator in which
>> higher order pages are used if readily available.  Similarly, vfree
>> could free pages that happen to be aligned and consecutive as a unit
>> to avoid the overhead of merging them back together one at a time.
>>
>> But I'm not planning on doing any of this myself any time soon.  I
>> just want to get the code working and merged.
>
> I agree, there is a room for improvement but no necessarily as a part of
> this series.
>

Agreed.  My goal is to get this good enough for upstream, and we can
make it even better down the road.

That being said, I think I will implement Linus' suggestion of a tiny
percpu cache.
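
A rough sketch of what such a cache could look like (the names, the cache
size, and the exact hook points are all hypothetical):

	/* Keep a couple of freed vmapped stacks per CPU for reuse. */
	#define NR_CACHED_STACKS 2
	static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);

	static void *try_get_cached_stack(struct task_struct *tsk)
	{
		int i;

		for (i = 0; i < NR_CACHED_STACKS; i++) {
			struct vm_struct *s = this_cpu_xchg(cached_stacks[i], NULL);

			if (!s)
				continue;
			tsk->stack_vm_area = s;
			return s->addr;		/* still mapped; skips vmalloc entirely */
		}
		return NULL;			/* caller falls back to __vmalloc_node_range() */
	}

	static bool try_put_cached_stack(struct vm_struct *vm)
	{
		int i;

		for (i = 0; i < NR_CACHED_STACKS; i++) {
			if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) == NULL)
				return true;	/* parked for later reuse */
		}
		return false;			/* cache full; caller should vfree() */
	}

The alloc path would try the cache first and fall back to vmalloc; the free
path would try to park the stack before calling vfree().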

--Andy

end of thread

Thread overview: 28+ messages
2016-06-17 20:00 [PATCH v2 00/13] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
2016-06-17 20:00 ` [PATCH v2 01/13] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Andy Lutomirski
2016-06-17 20:00 ` [PATCH v2 02/13] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated Andy Lutomirski
2016-06-17 20:00 ` [PATCH v2 03/13] x86/cpa: Warn if kernel_unmap_pages_in_pgd is used inappropriately Andy Lutomirski
2016-06-17 20:30   ` Borislav Petkov
2016-06-18 10:29     ` Andy Lutomirski
2016-06-18 10:37       ` Borislav Petkov
2016-06-17 20:00 ` [PATCH v2 04/13] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks Andy Lutomirski
2016-06-20 13:16   ` Michal Hocko
2016-06-17 20:00 ` [PATCH v2 05/13] mm: Move memcg stack accounting to account_kernel_stack Andy Lutomirski
2016-06-17 20:55   ` Josh Poimboeuf
2016-06-17 22:18     ` Andy Lutomirski
2016-06-20 13:02   ` Michal Hocko
2016-06-20 16:05     ` Andy Lutomirski
2016-06-17 20:00 ` [PATCH v2 06/13] fork: Add generic vmalloced stack support Andy Lutomirski
2016-06-17 20:57   ` Josh Poimboeuf
2016-06-17 22:18     ` Andy Lutomirski
2016-06-20 13:36   ` Michal Hocko
2016-06-20 16:13     ` Andy Lutomirski
2016-06-21  8:46       ` Michal Hocko
2016-06-21 17:01         ` Andy Lutomirski
2016-06-17 20:00 ` [PATCH v2 07/13] x86/die: Don't try to recover from an OOPS on a non-default stack Andy Lutomirski
2016-06-17 20:00 ` [PATCH v2 08/13] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
2016-06-17 20:00 ` [PATCH v2 09/13] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
2016-06-17 20:00 ` [PATCH v2 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
2016-06-17 20:00 ` [PATCH v2 11/13] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
2016-06-17 20:00 ` [PATCH v2 12/13] x86/mm/64: Enable vmapped stacks Andy Lutomirski
2016-06-17 20:00 ` [PATCH v2 13/13] x86/mm: Improve stack-overflow #PF handling Andy Lutomirski
