* [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

Since the dawn of time, a kernel stack overflow has been a real PITA
to debug, has caused nondeterministic crashes some time after the
actual overflow, and has generally been easy to exploit for root.

With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
that enable it (just x86 for now) get virtually mapped stacks with
guard pages.  This causes reliable faults when the stack overflows.

If the arch implements it well, we get a nice OOPS on stack overflow
(as opposed to panicking directly or otherwise exploding badly).  On
x86, the OOPS is nice, has a usable call trace, and the overflowing
task is killed cleanly.

On my laptop, this adds about 1.5µs of overhead to task creation,
which seems to be mainly caused by vmalloc inefficiently allocating
individual pages even when a higher-order page is available on the
freelist.

This does not address interrupt stacks.  It also does not address
the possibility of privilege escalation by a controlled stack
overflow that overwrites thread_info without hitting the guard page.
I'll send patches to address the latter issue once this series
lands.

It's worth noting that s390 has an arch-specific gcc feature that
detects stack overflows by adjusting function prologues.  Arches
with features like that may wish to avoid using vmapped stacks to
minimize the performance hit.

Ingo, would it make sense to throw it into a separate branch in
-tip?  I wouldn't mind seeing some -next testing to give people a
chance to shake out problems.  I'm particularly interested in
whether there are any drivers that expect virt_to_phys to work on
stack addresses.  (I know that virtio-net used to, but I fixed that
a while back.)

Changes from v2:
 - Delete kernel_unmap_pages_in_pgd rather than hardening it (Borislav)
 - Fix sub-page stack accounting better (Josh)

Changes from v1:
 - Fix rewind_stack_and_do_exit (Josh)
 - Fix deadlock under load
 - Clean up generic stack vmalloc code
 - Many other minor fixes

Andy Lutomirski (12):
  x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
  x86/mm: Remove kernel_unmap_pages_in_pgd() and
    efi_cleanup_page_tables()
  mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
  mm: Fix memcg stack accounting for sub-page stacks
  fork: Add generic vmalloced stack support
  x86/die: Don't try to recover from an OOPS on a non-default stack
  x86/dumpstack: When OOPSing, rewind the stack before do_exit
  x86/dumpstack: When dumping stack bytes due to OOPS, start with
    regs->sp
  x86/dumpstack: Try harder to get a call trace on stack overflow
  x86/dumpstack/64: Handle faults when printing the "Stack:" part of an
    OOPS
  x86/mm/64: Enable vmapped stacks
  x86/mm: Improve stack-overflow #PF handling

Ingo Molnar (1):
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()

 arch/Kconfig                         | 29 ++++++++++++
 arch/ia64/include/asm/thread_info.h  |  2 +-
 arch/x86/Kconfig                     |  1 +
 arch/x86/entry/entry_32.S            | 11 +++++
 arch/x86/entry/entry_64.S            | 11 +++++
 arch/x86/include/asm/efi.h           |  1 -
 arch/x86/include/asm/pgtable_types.h |  2 -
 arch/x86/include/asm/switch_to.h     | 28 +++++++++++-
 arch/x86/include/asm/traps.h         |  6 +++
 arch/x86/kernel/dumpstack.c          | 19 +++++++-
 arch/x86/kernel/dumpstack_32.c       |  4 +-
 arch/x86/kernel/dumpstack_64.c       | 16 +++++--
 arch/x86/kernel/traps.c              | 32 ++++++++++++++
 arch/x86/mm/fault.c                  | 39 ++++++++++++++++
 arch/x86/mm/init_64.c                | 27 -----------
 arch/x86/mm/pageattr.c               | 32 ++------------
 arch/x86/mm/tlb.c                    | 15 +++++++
 arch/x86/platform/efi/efi.c          |  2 -
 arch/x86/platform/efi/efi_32.c       |  3 --
 arch/x86/platform/efi/efi_64.c       |  5 ---
 drivers/base/node.c                  |  3 +-
 fs/proc/meminfo.c                    |  2 +-
 include/linux/memcontrol.h           |  2 +-
 include/linux/mmzone.h               |  2 +-
 include/linux/sched.h                | 15 +++++++
 kernel/fork.c                        | 86 +++++++++++++++++++++++++++---------
 mm/memcontrol.c                      |  2 +-
 mm/page_alloc.c                      |  3 +-
 28 files changed, 295 insertions(+), 105 deletions(-)

-- 
2.5.5


* [PATCH v3 01/13] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Ingo Molnar, Andrew Morton, Andy Lutomirski,
	Denys Vlasenko, H. Peter Anvin, Oleg Nesterov, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Waiman Long, linux-mm

From: Ingo Molnar <mingo@kernel.org>

When memory hotplug removes a piece of physical memory from the
pagetable mappings, it also frees the underlying PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory hotplugged.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Message-Id: <1442903021-3893-4-git-send-email-mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bce2e5d9edd4..c7465453d64e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -702,27 +702,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 	spin_unlock(&init_mm.page_table_lock);
 }
 
-/* Return true if pgd is changed, otherwise return false. */
-static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
-{
-	pud_t *pud;
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++) {
-		pud = pud_start + i;
-		if (pud_val(*pud))
-			return false;
-	}
-
-	/* free a pud table */
-	free_pagetable(pgd_page(*pgd), 0);
-	spin_lock(&init_mm.page_table_lock);
-	pgd_clear(pgd);
-	spin_unlock(&init_mm.page_table_lock);
-
-	return true;
-}
-
 static void __meminit
 remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 		 bool direct)
@@ -913,7 +892,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	unsigned long addr;
 	pgd_t *pgd;
 	pud_t *pud;
-	bool pgd_changed = false;
 
 	for (addr = start; addr < end; addr = next) {
 		next = pgd_addr_end(addr, end);
@@ -924,13 +902,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 
 		pud = (pud_t *)pgd_page_vaddr(*pgd);
 		remove_pud_table(pud, addr, next, direct);
-		if (free_pud_table(pud, pgd))
-			pgd_changed = true;
 	}
 
-	if (pgd_changed)
-		sync_global_pgds(start, end - 1, 1);
-
 	flush_tlb_all();
 }
 
-- 
2.5.5


* [PATCH v3 02/13] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

This avoids pointless races in which another CPU or task might see a
partially populated global pgd entry.  These races should normally
be harmless, but, if another CPU propagates the entry via
vmalloc_fault and then populate_pgd fails (due to memory allocation
failure, for example), this prevents a use-after-free of the pgd
entry.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/pageattr.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 7a1f7bbf4105..6a8026918bf6 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1113,7 +1113,9 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr)
 
 	ret = populate_pud(cpa, addr, pgd_entry, pgprot);
 	if (ret < 0) {
-		unmap_pgd_range(cpa->pgd, addr,
+		if (pud)
+			free_page((unsigned long)pud);
+		unmap_pud_range(pgd_entry, addr,
 				addr + (cpa->numpages << PAGE_SHIFT));
 		return ret;
 	}
-- 
2.5.5


* [PATCH v3 03/13] x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables()
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski, Matt Fleming, linux-efi

kernel_unmap_pages_in_pgd() is dangerous: if a pgd entry in
init_mm.pgd were to be cleared, callers would need to ensure that
the pgd entry hadn't been propagated to any other pgd.

Its only caller was efi_cleanup_page_tables(), and that, in turn,
was unused, so just delete both functions.  This leaves a couple of
other helpers unused, so delete them, too.

Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/efi.h           |  1 -
 arch/x86/include/asm/pgtable_types.h |  2 --
 arch/x86/mm/pageattr.c               | 28 ----------------------------
 arch/x86/platform/efi/efi.c          |  2 --
 arch/x86/platform/efi/efi_32.c       |  3 ---
 arch/x86/platform/efi/efi_64.c       |  5 -----
 6 files changed, 41 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 78d1e7467eae..45ea38df86d4 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -125,7 +125,6 @@ extern void __init efi_map_region_fixed(efi_memory_desc_t *md);
 extern void efi_sync_low_kernel_mappings(void);
 extern int __init efi_alloc_page_tables(void);
 extern int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages);
-extern void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages);
 extern void __init old_map_region(efi_memory_desc_t *md);
 extern void __init runtime_code_page_mkexec(void);
 extern void __init efi_runtime_update_mappings(void);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 7b5efe264eff..0b9f58ad10c8 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -475,8 +475,6 @@ extern pmd_t *lookup_pmd_address(unsigned long address);
 extern phys_addr_t slow_virt_to_phys(void *__address);
 extern int kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
 				   unsigned numpages, unsigned long page_flags);
-void kernel_unmap_pages_in_pgd(pgd_t *root, unsigned long address,
-			       unsigned numpages);
 #endif	/* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_DEFS_H */
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6a8026918bf6..762162af3662 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -746,18 +746,6 @@ static bool try_to_free_pmd_page(pmd_t *pmd)
 	return true;
 }
 
-static bool try_to_free_pud_page(pud_t *pud)
-{
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++)
-		if (!pud_none(pud[i]))
-			return false;
-
-	free_page((unsigned long)pud);
-	return true;
-}
-
 static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end)
 {
 	pte_t *pte = pte_offset_kernel(pmd, start);
@@ -871,16 +859,6 @@ static void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end)
 	 */
 }
 
-static void unmap_pgd_range(pgd_t *root, unsigned long addr, unsigned long end)
-{
-	pgd_t *pgd_entry = root + pgd_index(addr);
-
-	unmap_pud_range(pgd_entry, addr, end);
-
-	if (try_to_free_pud_page((pud_t *)pgd_page_vaddr(*pgd_entry)))
-		pgd_clear(pgd_entry);
-}
-
 static int alloc_pte_page(pmd_t *pmd)
 {
 	pte_t *pte = (pte_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK);
@@ -1993,12 +1971,6 @@ out:
 	return retval;
 }
 
-void kernel_unmap_pages_in_pgd(pgd_t *root, unsigned long address,
-			       unsigned numpages)
-{
-	unmap_pgd_range(root, address, address + (numpages << PAGE_SHIFT));
-}
-
 /*
  * The testcases use internal knowledge of the implementation that shouldn't
  * be exposed to the rest of the kernel. Include these directly here.
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index f93545e7dc54..62986e5fbdba 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -978,8 +978,6 @@ static void __init __efi_enter_virtual_mode(void)
 	 * EFI mixed mode we need all of memory to be accessible when
 	 * we pass parameters to the EFI runtime services in the
 	 * thunking code.
-	 *
-	 * efi_cleanup_page_tables(__pa(new_memmap), 1 << pg_shift);
 	 */
 	free_pages((unsigned long)new_memmap, pg_shift);
 
diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
index 338402b91d2e..cef39b097649 100644
--- a/arch/x86/platform/efi/efi_32.c
+++ b/arch/x86/platform/efi/efi_32.c
@@ -49,9 +49,6 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
 	return 0;
 }
-void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages)
-{
-}
 
 void __init efi_map_region(efi_memory_desc_t *md)
 {
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 6e7242be1c87..5ab219c2ba43 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -285,11 +285,6 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	return 0;
 }
 
-void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages)
-{
-	kernel_unmap_pages_in_pgd(efi_pgd, pa_memmap, num_pages);
-}
-
 static void __init __map_region(efi_memory_desc_t *md, u64 va)
 {
 	unsigned long flags = _PAGE_RW;
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [PATCH v3 04/13] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
  2016-06-20 23:43 ` Andy Lutomirski
  (?)
  (?)
@ 2016-06-20 23:43   ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, linux-mm

Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
zone.  This only makes sense if each kernel stack exists entirely in
one zone, and allowing vmapped stacks could break this assumption.

Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on
all architectures.  Keep it simple and use KiB.

Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/base/node.c    | 3 +--
 fs/proc/meminfo.c      | 2 +-
 include/linux/mmzone.h | 2 +-
 kernel/fork.c          | 3 ++-
 mm/page_alloc.c        | 3 +--
 5 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..27dc68a0ed2d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -121,8 +121,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
 		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
 		       nid, K(i.sharedram),
-		       nid, node_page_state(nid, NR_KERNEL_STACK) *
-				THREAD_SIZE / 1024,
+		       nid, node_page_state(nid, NR_KERNEL_STACK_KB),
 		       nid, K(node_page_state(nid, NR_PAGETABLE)),
 		       nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
 		       nid, K(node_page_state(nid, NR_BOUNCE)),
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 83720460c5bc..239b5a06cee0 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -145,7 +145,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 				global_page_state(NR_SLAB_UNRECLAIMABLE)),
 		K(global_page_state(NR_SLAB_RECLAIMABLE)),
 		K(global_page_state(NR_SLAB_UNRECLAIMABLE)),
-		global_page_state(NR_KERNEL_STACK) * THREAD_SIZE / 1024,
+		global_page_state(NR_KERNEL_STACK_KB),
 		K(global_page_state(NR_PAGETABLE)),
 #ifdef CONFIG_QUICKLIST
 		K(quicklist_total_size()),
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02069c23486d..63f05a7efb54 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -127,7 +127,7 @@ enum zone_stat_item {
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
 	NR_PAGETABLE,		/* used for pagetables */
-	NR_KERNEL_STACK,
+	NR_KERNEL_STACK_KB,	/* measured in KiB */
 	/* Second 128 byte cacheline */
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
diff --git a/kernel/fork.c b/kernel/fork.c
index 5c2c355aa97f..be7f006af727 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -225,7 +225,8 @@ static void account_kernel_stack(struct thread_info *ti, int account)
 {
 	struct zone *zone = page_zone(virt_to_page(ti));
 
-	mod_zone_page_state(zone, NR_KERNEL_STACK, account);
+	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
+			    THREAD_SIZE / 1024 * account);
 }
 
 void free_task(struct task_struct *tsk)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6903b695ebae..a277dea926c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4457,8 +4457,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_SHMEM)),
 			K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
 			K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
-			zone_page_state(zone, NR_KERNEL_STACK) *
-				THREAD_SIZE / 1024,
+			zone_page_state(zone, NR_KERNEL_STACK_KB),
 			K(zone_page_state(zone, NR_PAGETABLE)),
 			K(zone_page_state(zone, NR_UNSTABLE_NFS)),
 			K(zone_page_state(zone, NR_BOUNCE)),
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [PATCH v3 05/13] mm: Fix memcg stack accounting for sub-page stacks
  2016-06-20 23:43 ` Andy Lutomirski
@ 2016-06-20 23:43   ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, linux-mm

We should account for stacks regardless of stack size, and we need
to account in sub-page units if THREAD_SIZE < PAGE_SIZE.  Change the
units to kilobytes and move the accounting into account_kernel_stack().

Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/memcontrol.h |  2 +-
 kernel/fork.c              | 15 ++++++---------
 mm/memcontrol.c            |  2 +-
 3 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a805474df4ab..3b653b86bb8f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -52,7 +52,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,
 	/* default hierarchy stats */
-	MEMCG_KERNEL_STACK = MEM_CGROUP_STAT_NSTATS,
+	MEMCG_KERNEL_STACK_KB = MEM_CGROUP_STAT_NSTATS,
 	MEMCG_SLAB_RECLAIMABLE,
 	MEMCG_SLAB_UNRECLAIMABLE,
 	MEMCG_SOCK,
diff --git a/kernel/fork.c b/kernel/fork.c
index be7f006af727..ff3c41c2ba96 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -165,20 +165,12 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
 						  THREAD_SIZE_ORDER);
 
-	if (page)
-		memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
-					    1 << THREAD_SIZE_ORDER);
-
 	return page ? page_address(page) : NULL;
 }
 
 static inline void free_thread_info(struct thread_info *ti)
 {
-	struct page *page = virt_to_page(ti);
-
-	memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
-				    -(1 << THREAD_SIZE_ORDER));
-	__free_kmem_pages(page, THREAD_SIZE_ORDER);
+	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
@@ -227,6 +219,11 @@ static void account_kernel_stack(struct thread_info *ti, int account)
 
 	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
 			    THREAD_SIZE / 1024 * account);
+
+	/* All stack pages belong to the same memcg. */
+	memcg_kmem_update_page_stat(
+		virt_to_page(ti), MEMCG_KERNEL_STACK_KB,
+		account * (THREAD_SIZE / 1024));
 }
 
 void free_task(struct task_struct *tsk)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75e74408cc8f..8e13a2419dad 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5133,7 +5133,7 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	seq_printf(m, "file %llu\n",
 		   (u64)stat[MEM_CGROUP_STAT_CACHE] * PAGE_SIZE);
 	seq_printf(m, "kernel_stack %llu\n",
-		   (u64)stat[MEMCG_KERNEL_STACK] * PAGE_SIZE);
+		   (u64)stat[MEMCG_KERNEL_STACK_KB] * 1024);
 	seq_printf(m, "slab %llu\n",
 		   (u64)(stat[MEMCG_SLAB_RECLAIMABLE] +
 			 stat[MEMCG_SLAB_UNRECLAIMABLE]) * PAGE_SIZE);
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread


* [PATCH v3 06/13] fork: Add generic vmalloced stack support
  2016-06-20 23:43 ` Andy Lutomirski
@ 2016-06-20 23:43   ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If CONFIG_VMAP_STACK is selected, kernel stacks are allocated from
the vmalloc area with __vmalloc_node_range().

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/Kconfig                        | 29 +++++++++++++
 arch/ia64/include/asm/thread_info.h |  2 +-
 include/linux/sched.h               | 15 +++++++
 kernel/fork.c                       | 82 +++++++++++++++++++++++++++++--------
 4 files changed, 110 insertions(+), 18 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d794384a0404..a71e6e7195e6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -658,4 +658,33 @@ config ARCH_NO_COHERENT_DMA_MMAP
 config CPU_NO_EFFICIENT_FFS
 	def_bool n
 
+config HAVE_ARCH_VMAP_STACK
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stacks
+	  in vmalloc space.  This means:
+
+	  - vmalloc space must be large enough to hold many kernel stacks.
+	    This may rule out many 32-bit architectures.
+
+	  - Stacks in vmalloc space need to work reliably.  For example, if
+	    vmap page tables are created on demand, either this mechanism
+	    needs to work while the stack points to a virtual address with
+	    unpopulated page tables or arch code (switch_to and switch_mm,
+	    most likely) needs to ensure that the stack's page table entries
+	    are populated before running on a possibly unpopulated stack.
+
+	  - If the stack overflows into a guard page, something reasonable
+	    should happen.  The definition of "reasonable" is flexible, but
+	    instantly rebooting without logging anything would be unfriendly.
+
+config VMAP_STACK
+	bool "Use a virtually-mapped stack"
+	depends on HAVE_ARCH_VMAP_STACK
+	---help---
+	  Enable this if you want to use virtually-mapped kernel stacks
+	  with guard pages.  This causes kernel stack overflows to be
+	  caught immediately rather than causing difficult-to-diagnose
+	  corruption.
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index aa995b67c3f5..d13edda6e09c 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -56,7 +56,7 @@ struct thread_info {
 #define alloc_thread_info_node(tsk, node)	((struct thread_info *) 0)
 #define task_thread_info(tsk)	((struct thread_info *) 0)
 #endif
-#define free_thread_info(ti)	/* nothing */
+#define free_thread_info(tsk)	/* nothing */
 #define task_stack_page(tsk)	((void *)(tsk))
 
 #define __HAVE_THREAD_FUNCTIONS
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada26345..a37c3b790309 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1918,6 +1918,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_VMAP_STACK
+	struct vm_struct *stack_vm_area;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -1934,6 +1937,18 @@ extern int arch_task_struct_size __read_mostly;
 # define arch_task_struct_size (sizeof(struct task_struct))
 #endif
 
+#ifdef CONFIG_VMAP_STACK
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return t->stack_vm_area;
+}
+#else
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return NULL;
+}
+#endif
+
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
diff --git a/kernel/fork.c b/kernel/fork.c
index ff3c41c2ba96..fe1c785e5f8c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -158,19 +158,38 @@ void __weak arch_release_thread_info(struct thread_info *ti)
  * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
  * kmemcache based allocator.
  */
-# if THREAD_SIZE >= PAGE_SIZE
+# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
 static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 						  int node)
 {
+#ifdef CONFIG_VMAP_STACK
+	struct thread_info *ti = __vmalloc_node_range(
+		THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
+		THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
+		0, node, __builtin_return_address(0));
+
+	/*
+	 * We can't call find_vm_area() in interrupt context, and
+	 * free_thread_info can be called in interrupt context, so cache
+	 * the vm_struct.
+	 */
+	if (ti)
+		tsk->stack_vm_area = find_vm_area(ti);
+	return ti;
+#else
 	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
 						  THREAD_SIZE_ORDER);
 
 	return page ? page_address(page) : NULL;
+#endif
 }
 
-static inline void free_thread_info(struct thread_info *ti)
+static inline void free_thread_info(struct task_struct *tsk)
 {
-	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+	if (task_stack_vm_area(tsk))
+		vfree(tsk->stack);
+	else
+		free_kmem_pages((unsigned long)tsk->stack, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
@@ -181,9 +200,9 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 	return kmem_cache_alloc_node(thread_info_cache, THREADINFO_GFP, node);
 }
 
-static void free_thread_info(struct thread_info *ti)
+static void free_thread_info(struct task_struct *tsk)
 {
-	kmem_cache_free(thread_info_cache, ti);
+	kmem_cache_free(thread_info_cache, tsk->stack);
 }
 
 void thread_info_cache_init(void)
@@ -213,24 +232,47 @@ struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-static void account_kernel_stack(struct thread_info *ti, int account)
+static void account_kernel_stack(struct task_struct *tsk, int account)
 {
-	struct zone *zone = page_zone(virt_to_page(ti));
+	struct zone *zone;
+	struct thread_info *ti = task_thread_info(tsk);
+	struct vm_struct *vm = task_stack_vm_area(tsk);
+
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
+
+	if (vm) {
+		int i;
 
-	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
-			    THREAD_SIZE / 1024 * account);
+		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
 
-	/* All stack pages belong to the same memcg. */
-	memcg_kmem_update_page_stat(
-		virt_to_page(ti), MEMCG_KERNEL_STACK_KB,
-		account * (THREAD_SIZE / 1024));
+		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+			mod_zone_page_state(page_zone(vm->pages[i]),
+					    NR_KERNEL_STACK_KB,
+					    PAGE_SIZE / 1024 * account);
+		}
+
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(
+			vm->pages[0], MEMCG_KERNEL_STACK_KB,
+			account * (THREAD_SIZE / 1024));
+	} else {
+		zone = page_zone(virt_to_page(ti));
+
+		mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
+				    THREAD_SIZE / 1024 * account);
+
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(
+			virt_to_page(ti), MEMCG_KERNEL_STACK_KB,
+			account * (THREAD_SIZE / 1024));
+	}
 }
 
 void free_task(struct task_struct *tsk)
 {
-	account_kernel_stack(tsk->stack, -1);
+	account_kernel_stack(tsk, -1);
 	arch_release_thread_info(tsk->stack);
-	free_thread_info(tsk->stack);
+	free_thread_info(tsk);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
 	put_seccomp_filter(tsk);
@@ -342,6 +384,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 {
 	struct task_struct *tsk;
 	struct thread_info *ti;
+	struct vm_struct *stack_vm_area;
 	int err;
 
 	if (node == NUMA_NO_NODE)
@@ -354,11 +397,16 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (!ti)
 		goto free_tsk;
 
+	stack_vm_area = task_stack_vm_area(tsk);
+
 	err = arch_dup_task_struct(tsk, orig);
 	if (err)
 		goto free_ti;
 
 	tsk->stack = ti;
+#ifdef CONFIG_VMAP_STACK
+	tsk->stack_vm_area = stack_vm_area;
+#endif
 #ifdef CONFIG_SECCOMP
 	/*
 	 * We must handle setting up seccomp filters once we're under
@@ -390,14 +438,14 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->task_frag.page = NULL;
 	tsk->wake_q.next = NULL;
 
-	account_kernel_stack(ti, 1);
+	account_kernel_stack(tsk, 1);
 
 	kcov_task_init(tsk);
 
 	return tsk;
 
 free_ti:
-	free_thread_info(ti);
+	free_thread_info(tsk);
 free_tsk:
 	free_task_struct(tsk);
 	return NULL;
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [PATCH v3 06/13] fork: Add generic vmalloced stack support
@ 2016-06-20 23:43   ` Andy Lutomirski
  0 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
vmalloc_node.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/Kconfig                        | 29 +++++++++++++
 arch/ia64/include/asm/thread_info.h |  2 +-
 include/linux/sched.h               | 15 +++++++
 kernel/fork.c                       | 82 +++++++++++++++++++++++++++++--------
 4 files changed, 110 insertions(+), 18 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d794384a0404..a71e6e7195e6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -658,4 +658,33 @@ config ARCH_NO_COHERENT_DMA_MMAP
 config CPU_NO_EFFICIENT_FFS
 	def_bool n
 
+config HAVE_ARCH_VMAP_STACK
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stacks
+	  in vmalloc space.  This means:
+
+	  - vmalloc space must be large enough to hold many kernel stacks.
+	    This may rule out many 32-bit architectures.
+
+	  - Stacks in vmalloc space need to work reliably.  For example, if
+	    vmap page tables are created on demand, either this mechanism
+	    needs to work while the stack points to a virtual address with
+	    unpopulated page tables or arch code (switch_to and switch_mm,
+	    most likely) needs to ensure that the stack's page table entries
+	    are populated before running on a possibly unpopulated stack.
+
+	  - If the stack overflows into a guard page, something reasonable
+	    should happen.  The definition of "reasonable" is flexible, but
+	    instantly rebooting without logging anything would be unfriendly.
+
+config VMAP_STACK
+	bool "Use a virtually-mapped stack"
+	depends on HAVE_ARCH_VMAP_STACK
+	---help---
+	  Enable this if you want the use virtually-mapped kernel stacks
+	  with guard pages.  This causes kernel stack overflows to be
+	  caught immediately rather than causing difficult-to-diagnose
+	  corruption.
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index aa995b67c3f5..d13edda6e09c 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -56,7 +56,7 @@ struct thread_info {
 #define alloc_thread_info_node(tsk, node)	((struct thread_info *) 0)
 #define task_thread_info(tsk)	((struct thread_info *) 0)
 #endif
-#define free_thread_info(ti)	/* nothing */
+#define free_thread_info(tsk)	/* nothing */
 #define task_stack_page(tsk)	((void *)(tsk))
 
 #define __HAVE_THREAD_FUNCTIONS
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada26345..a37c3b790309 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1918,6 +1918,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_VMAP_STACK
+	struct vm_struct *stack_vm_area;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -1934,6 +1937,18 @@ extern int arch_task_struct_size __read_mostly;
 # define arch_task_struct_size (sizeof(struct task_struct))
 #endif
 
+#ifdef CONFIG_VMAP_STACK
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return t->stack_vm_area;
+}
+#else
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return NULL;
+}
+#endif
+
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
diff --git a/kernel/fork.c b/kernel/fork.c
index ff3c41c2ba96..fe1c785e5f8c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -158,19 +158,38 @@ void __weak arch_release_thread_info(struct thread_info *ti)
  * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
  * kmemcache based allocator.
  */
-# if THREAD_SIZE >= PAGE_SIZE
+# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
 static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 						  int node)
 {
+#ifdef CONFIG_VMAP_STACK
+	struct thread_info *ti = __vmalloc_node_range(
+		THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
+		THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
+		0, node, __builtin_return_address(0));
+
+	/*
+	 * We can't call find_vm_area() in interrupt context, and
+	 * free_thread_info can be called in interrupt context, so cache
+	 * the vm_struct.
+	 */
+	if (ti)
+		tsk->stack_vm_area = find_vm_area(ti);
+	return ti;
+#else
 	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
 						  THREAD_SIZE_ORDER);
 
 	return page ? page_address(page) : NULL;
+#endif
 }
 
-static inline void free_thread_info(struct thread_info *ti)
+static inline void free_thread_info(struct task_struct *tsk)
 {
-	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+	if (task_stack_vm_area(tsk))
+		vfree(tsk->stack);
+	else
+		free_kmem_pages((unsigned long)tsk->stack, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
@@ -181,9 +200,9 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 	return kmem_cache_alloc_node(thread_info_cache, THREADINFO_GFP, node);
 }
 
-static void free_thread_info(struct thread_info *ti)
+static void free_thread_info(struct task_struct *tsk)
 {
-	kmem_cache_free(thread_info_cache, ti);
+	kmem_cache_free(thread_info_cache, tsk->stack);
 }
 
 void thread_info_cache_init(void)
@@ -213,24 +232,47 @@ struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-static void account_kernel_stack(struct thread_info *ti, int account)
+static void account_kernel_stack(struct task_struct *tsk, int account)
 {
-	struct zone *zone = page_zone(virt_to_page(ti));
+	struct zone *zone;
+	struct thread_info *ti = task_thread_info(tsk);
+	struct vm_struct *vm = task_stack_vm_area(tsk);
+
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
+
+	if (vm) {
+		int i;
 
-	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
-			    THREAD_SIZE / 1024 * account);
+		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
 
-	/* All stack pages belong to the same memcg. */
-	memcg_kmem_update_page_stat(
-		virt_to_page(ti), MEMCG_KERNEL_STACK_KB,
-		account * (THREAD_SIZE / 1024));
+		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+			mod_zone_page_state(page_zone(vm->pages[i]),
+					    NR_KERNEL_STACK_KB,
+					    PAGE_SIZE / 1024 * account);
+		}
+
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(
+			vm->pages[0], MEMCG_KERNEL_STACK_KB,
+			account * (THREAD_SIZE / 1024));
+	} else {
+		zone = page_zone(virt_to_page(ti));
+
+		mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
+				    THREAD_SIZE / 1024 * account);
+
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(
+			virt_to_page(ti), MEMCG_KERNEL_STACK_KB,
+			account * (THREAD_SIZE / 1024));
+	}
 }
 
 void free_task(struct task_struct *tsk)
 {
-	account_kernel_stack(tsk->stack, -1);
+	account_kernel_stack(tsk, -1);
 	arch_release_thread_info(tsk->stack);
-	free_thread_info(tsk->stack);
+	free_thread_info(tsk);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
 	put_seccomp_filter(tsk);
@@ -342,6 +384,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 {
 	struct task_struct *tsk;
 	struct thread_info *ti;
+	struct vm_struct *stack_vm_area;
 	int err;
 
 	if (node == NUMA_NO_NODE)
@@ -354,11 +397,16 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (!ti)
 		goto free_tsk;
 
+	stack_vm_area = task_stack_vm_area(tsk);
+
 	err = arch_dup_task_struct(tsk, orig);
 	if (err)
 		goto free_ti;
 
 	tsk->stack = ti;
+#ifdef CONFIG_VMAP_STACK
+	tsk->stack_vm_area = stack_vm_area;
+#endif
 #ifdef CONFIG_SECCOMP
 	/*
 	 * We must handle setting up seccomp filters once we're under
@@ -390,14 +438,14 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->task_frag.page = NULL;
 	tsk->wake_q.next = NULL;
 
-	account_kernel_stack(ti, 1);
+	account_kernel_stack(tsk, 1);
 
 	kcov_task_init(tsk);
 
 	return tsk;
 
 free_ti:
-	free_thread_info(ti);
+	free_thread_info(tsk);
 free_tsk:
 	free_task_struct(tsk);
 	return NULL;
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [PATCH v3 07/13] x86/die: Don't try to recover from an OOPS on a non-default stack
  2016-06-20 23:43 ` Andy Lutomirski
  (?)
@ 2016-06-20 23:43   ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

It's not going to work, because the scheduler will explode if we try
to schedule while running on an IST stack or similar special stack.

This will matter when we let kernel stack overflows (which arrive as
double faults, #DF) call die().

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 2bb25c3fe2e8..36effb39c9c9 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -247,6 +247,9 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 		return;
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
+	if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
+	     & ~(THREAD_SIZE - 1)) != 0)
+		panic("Fatal exception on special stack");
 	if (panic_on_oops)
 		panic("Fatal exception");
 	do_exit(signr);
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [PATCH v3 08/13] x86/dumpstack: When OOPSing, rewind the stack before do_exit
  2016-06-20 23:43 ` Andy Lutomirski
  (?)
@ 2016-06-20 23:43   ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we call do_exit with a clean stack, we greatly reduce the risk of
recursive oopses due to stack overflow in do_exit, and we allow
do_exit to work even if we OOPS from an IST stack.  The latter gives
us a much better chance of surviving long enough after we detect a
stack overflow to write out our logs.

I intentionally separated this from the preceding patch that
disables do_exit-on-OOPS on IST stacks.  This way, if we need to
revert this patch, we still end up in an acceptable state with
respect to stack overflow handling.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_32.S   | 11 +++++++++++
 arch/x86/entry/entry_64.S   | 11 +++++++++++
 arch/x86/kernel/dumpstack.c | 13 +++++++++----
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 983e5d3a0d27..0b56666e6039 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1153,3 +1153,14 @@ ENTRY(async_page_fault)
 	jmp	error_code
 END(async_page_fault)
 #endif
+
+ENTRY(rewind_stack_do_exit)
+	/* Prevent any naive code from trying to unwind to our caller. */
+	xorl	%ebp, %ebp
+
+	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esi
+	leal	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
+
+	call	do_exit
+1:	jmp 1b
+END(rewind_stack_do_exit)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1807ed..b846875aeea6 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1423,3 +1423,14 @@ ENTRY(ignore_sysret)
 	mov	$-ENOSYS, %eax
 	sysret
 END(ignore_sysret)
+
+ENTRY(rewind_stack_do_exit)
+	/* Prevent any naive code from trying to unwind to our caller. */
+	xorl	%ebp, %ebp
+
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rax
+	leaq	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%rax), %rsp
+
+	call	do_exit
+1:	jmp 1b
+END(rewind_stack_do_exit)
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 36effb39c9c9..d4d085e27d04 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -228,6 +228,8 @@ unsigned long oops_begin(void)
 EXPORT_SYMBOL_GPL(oops_begin);
 NOKPROBE_SYMBOL(oops_begin);
 
+extern void __noreturn rewind_stack_do_exit(int signr);
+
 void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 {
 	if (regs && kexec_should_crash(current))
@@ -247,12 +249,15 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 		return;
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
-	if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
-	     & ~(THREAD_SIZE - 1)) != 0)
-		panic("Fatal exception on special stack");
 	if (panic_on_oops)
 		panic("Fatal exception");
-	do_exit(signr);
+
+	/*
+	 * We're not going to return, but we might be on an IST stack or
+	 * have very little stack space left.  Rewind the stack and kill
+	 * the task.
+	 */
+	rewind_stack_do_exit(signr);
 }
 NOKPROBE_SYMBOL(oops_end);
 
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [PATCH v3 09/13] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp
  2016-06-20 23:43 ` Andy Lutomirski
  (?)
@ 2016-06-20 23:43   ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

The comment suggests that show_stack(NULL, NULL) should backtrace
the current context, but the code doesn't match the comment.  If
regs are given, start the "Stack:" hexdump at regs->sp.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack_32.c | 4 +++-
 arch/x86/kernel/dumpstack_64.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index 464ffd69b92e..91069ebe3c87 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -98,7 +98,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	int i;
 
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 5f1c6266eb30..603356a5597a 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -266,7 +266,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * back trace for this cpu:
 	 */
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [PATCH v3 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow
  2016-06-20 23:43 ` Andy Lutomirski
  (?)
@ 2016-06-20 23:43   ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we overflow the stack, print_context_stack will abort.  Detect
this case and rewind into the valid part of the stack so that we
can trace it.  Also loosen valid_stack_ptr's lower bound (p > t
becomes p >= t) so that the very bottom of the stack is treated as
valid.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index d4d085e27d04..9cdf05d768cf 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -89,7 +89,7 @@ static inline int valid_stack_ptr(struct thread_info *tinfo,
 		else
 			return 0;
 	}
-	return p > t && p < t + THREAD_SIZE - size;
+	return p >= t && p < t + THREAD_SIZE - size;
 }
 
 unsigned long
@@ -100,6 +100,13 @@ print_context_stack(struct thread_info *tinfo,
 {
 	struct stack_frame *frame = (struct stack_frame *)bp;
 
+	/*
+	 * If we overflowed the stack into a guard page, jump back to the
+	 * bottom of the usable stack.
+	 */
+	if ((unsigned long)tinfo - (unsigned long)stack < PAGE_SIZE)
+		stack = (unsigned long *)tinfo;
+
 	while (valid_stack_ptr(tinfo, stack, sizeof(*stack), end)) {
 		unsigned long addr;
 
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [kernel-hardening] [PATCH v3 10/13] x86/dumpstack: Try harder to get a call trace on stack overflow
@ 2016-06-20 23:43   ` Andy Lutomirski
  0 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we overflow the stack, print_context_stack will abort.  Detect
this case and rewind back into the valid part of the stack so that
we can trace it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index d4d085e27d04..9cdf05d768cf 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -89,7 +89,7 @@ static inline int valid_stack_ptr(struct thread_info *tinfo,
 		else
 			return 0;
 	}
-	return p > t && p < t + THREAD_SIZE - size;
+	return p >= t && p < t + THREAD_SIZE - size;
 }
 
 unsigned long
@@ -100,6 +100,13 @@ print_context_stack(struct thread_info *tinfo,
 {
 	struct stack_frame *frame = (struct stack_frame *)bp;
 
+	/*
+	 * If we overflowed the stack into a guard page, jump back to the
+	 * bottom of the usable stack.
+	 */
+	if ((unsigned long)tinfo - (unsigned long)stack < PAGE_SIZE)
+		stack = (unsigned long *)tinfo;
+
 	while (valid_stack_ptr(tinfo, stack, sizeof(*stack), end)) {
 		unsigned long addr;
 
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [PATCH v3 11/13] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS
  2016-06-20 23:43 ` Andy Lutomirski
  (?)
@ 2016-06-20 23:43   ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we overflow the stack into a guard page, we'll recursively fault
when trying to dump the contents of the guard page.  Use
probe_kernel_address() so that we can recover if this happens.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack_64.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 603356a5597a..5e298638c790 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -276,6 +276,8 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 	stack = sp;
 	for (i = 0; i < kstack_depth_to_print; i++) {
+		unsigned long word;
+
 		if (stack >= irq_stack && stack <= irq_stack_end) {
 			if (stack == irq_stack_end) {
 				stack = (unsigned long *) (irq_stack_end[-1]);
@@ -285,12 +287,18 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		if (kstack_end(stack))
 			break;
 		}
+
+		if (probe_kernel_address(stack, word))
+			break;
+
 		if ((i % STACKSLOTS_PER_LINE) == 0) {
 			if (i != 0)
 				pr_cont("\n");
-			printk("%s %016lx", log_lvl, *stack++);
+			printk("%s %016lx", log_lvl, word);
 		} else
-			pr_cont(" %016lx", *stack++);
+			pr_cont(" %016lx", word);
+
+		stack++;
 		touch_nmi_watchdog();
 	}
 	preempt_enable();
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [PATCH v3 12/13] x86/mm/64: Enable vmapped stacks
  2016-06-20 23:43 ` Andy Lutomirski
  (?)
@ 2016-06-20 23:43   ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

This allows x86_64 kernels to enable vmapped stacks.  There are a
couple of interesting bits.

First, x86 lazily faults in top-level paging entries for the vmalloc
area.  This won't work if we get a page fault while trying to access
the stack: the CPU will promote it to a double-fault and we'll die.
To avoid this problem, probe the new stack when switching stacks and
forcibly populate the pgd entry for the stack when switching mms.

Second, once we have guard pages around the stack, we'll want to
detect and handle stack overflow.

I didn't enable it on x86_32.  We'd need to rework the double-fault
code a bit and I'm concerned about running out of vmalloc virtual
addresses under some workloads.

This patch, by itself, will behave somewhat erratically when the
stack overflows while RSP is still more than a few tens of bytes
above the bottom of the stack.  Specifically, we'll get #PF and make
it to no_context and an oops without triggering a double-fault, and
no_context doesn't know about stack overflows.  The next patch will
improve that case.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++++++++++++-
 arch/x86/kernel/traps.c          | 32 ++++++++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                | 15 +++++++++++++++
 4 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0a7b885964ba..b624b24d1dc1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -92,6 +92,7 @@ config X86
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_EBPF_JIT			if X86_64
+	select HAVE_ARCH_VMAP_STACK		if X86_64
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8f321a1b03a1..14e4b20f0aaf 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,28 @@ struct tss_struct;
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss);
 
 +/* This runs on the previous thread's stack. */
+static inline void prepare_switch_to(struct task_struct *prev,
+				     struct task_struct *next)
+{
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we switch to a stack that has a top-level paging entry
 +	 * that is not present in the current mm, the resulting #PF
 +	 * will be promoted to a double-fault and we'll panic.  Probe
+	 * the new stack now so that vmalloc_fault can fix up the page
+	 * tables if needed.  This can only happen if we use a stack
+	 * in vmap space.
+	 *
+	 * We assume that the stack is aligned so that it never spans
+	 * more than one top-level paging entry.
+	 *
+	 * To minimize cache pollution, just follow the stack pointer.
+	 */
+	READ_ONCE(*(unsigned char *)next->thread.sp);
+#endif
+}
+
 #ifdef CONFIG_X86_32
 
 #ifdef CONFIG_CC_STACKPROTECTOR
@@ -39,6 +61,8 @@ do {									\
 	 */								\
 	unsigned long ebx, ecx, edx, esi, edi;				\
 									\
+	prepare_switch_to(prev, next);					\
+									\
 	asm volatile("pushl %%ebp\n\t"		/* save    EBP   */	\
 		     "movl %%esp,%[prev_sp]\n\t"	/* save    ESP   */ \
 		     "movl %[next_sp],%%esp\n\t"	/* restore ESP   */ \
@@ -103,7 +127,9 @@ do {									\
  * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
  * has no effect.
  */
-#define switch_to(prev, next, last) \
+#define switch_to(prev, next, last)					  \
+	prepare_switch_to(prev, next);					  \
+									  \
 	asm volatile(SAVE_CONTEXT					  \
 	     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */	  \
 	     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */	  \
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 00f03d82e69a..9cb7ea781176 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present",	segment_not_present)
 DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
+#ifdef CONFIG_VMAP_STACK
+static void __noreturn handle_stack_overflow(const char *message,
+					     struct pt_regs *regs,
+					     unsigned long fault_address)
+{
+	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
+		 (void *)fault_address, current->stack,
+		 (char *)current->stack + THREAD_SIZE - 1);
+	die(message, regs, 0);
+
+	/* Be absolutely certain we don't return. */
+	panic(message);
+}
+#endif
+
 #ifdef CONFIG_X86_64
 /* Runs on IST stack */
 dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 {
 	static const char str[] = "double fault";
 	struct task_struct *tsk = current;
+#ifdef CONFIG_VMAP_STACK
+	unsigned long cr2;
+#endif
 
 #ifdef CONFIG_X86_ESPFIX64
 	extern unsigned char native_irq_return_iret[];
@@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_nr = X86_TRAP_DF;
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we overflow the stack into a guard page, the CPU will fail
+	 * to deliver #PF and will send #DF instead.  CR2 will contain
+	 * the linear address of the second fault, which will be in the
+	 * guard page below the bottom of the stack.
+	 */
+	cr2 = read_cr2();
+	if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
+		handle_stack_overflow(
+			"kernel stack overflow (double-fault)",
+			regs, cr2);
+#endif
+
 #ifdef CONFIG_DOUBLEFAULT
 	df_debug(regs, error_code);
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..fbf036ae72ac 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -77,10 +77,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	unsigned cpu = smp_processor_id();
 
 	if (likely(prev != next)) {
+		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+			/*
+			 * If our current stack is in vmalloc space and isn't
+			 * mapped in the new pgd, we'll double-fault.  Forcibly
+			 * map it.
+			 */
+			unsigned int stack_pgd_index =
+				pgd_index(current_stack_pointer());
+			pgd_t *pgd = next->pgd + stack_pgd_index;
+
+			if (unlikely(pgd_none(*pgd)))
+				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+		}
+
 #ifdef CONFIG_SMP
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		this_cpu_write(cpu_tlbstate.active_mm, next);
 #endif
+
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
 		/*
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [PATCH v3 13/13] x86/mm: Improve stack-overflow #PF handling
  2016-06-20 23:43 ` Andy Lutomirski
  (?)
@ 2016-06-20 23:43   ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-20 23:43 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we get a page fault indicating kernel stack overflow, invoke
handle_stack_overflow().  To prevent us from overflowing the stack
again while handling the overflow (because we are likely to have
very little stack space left), call handle_stack_overflow() on the
double-fault stack.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/traps.h |  6 ++++++
 arch/x86/kernel/traps.c      |  6 +++---
 arch/x86/mm/fault.c          | 39 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c3496619740a..01fd0a7f48cd 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -117,6 +117,12 @@ extern void ist_exit(struct pt_regs *regs);
 extern void ist_begin_non_atomic(struct pt_regs *regs);
 extern void ist_end_non_atomic(void);
 
+#ifdef CONFIG_VMAP_STACK
+void __noreturn handle_stack_overflow(const char *message,
+				      struct pt_regs *regs,
+				      unsigned long fault_address);
+#endif
+
 /* Interrupts/Exceptions */
 enum {
 	X86_TRAP_DE = 0,	/*  0, Divide-by-zero */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9cb7ea781176..b389c0539eb9 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -293,9 +293,9 @@ DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
 #ifdef CONFIG_VMAP_STACK
-static void __noreturn handle_stack_overflow(const char *message,
-					     struct pt_regs *regs,
-					     unsigned long fault_address)
+__visible void __noreturn handle_stack_overflow(const char *message,
+						struct pt_regs *regs,
+						unsigned long fault_address)
 {
 	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
 		 (void *)fault_address, current->stack,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7d1fa7cd2374..c68b81f5659f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -753,6 +753,45 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * Stack overflow?  During boot, we can fault near the initial
+	 * stack in the direct map, but that's not an overflow -- check
+	 * that we're in vmalloc space to avoid this.
+	 *
 +	 * Check this after trying fixup_exception, since there are a handful
+	 * of kernel code paths that wander off the top of the stack but
+	 * handle any faults that occur.  Once those are fixed, we can
+	 * move this above fixup_exception.
+	 */
+	if (is_vmalloc_addr((void *)address) &&
+	    (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
+	     address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
+		register void *__sp asm("rsp");
+		unsigned long stack =
+			this_cpu_read(orig_ist.ist[DOUBLEFAULT_STACK]) -
+			sizeof(void *);
+		/*
+		 * We're likely to be running with very little stack space
+		 * left.  It's plausible that we'd hit this condition but
+		 * double-fault even before we get this far, in which case
+		 * we're fine: the double-fault handler will deal with it.
+		 *
+		 * We don't want to make it all the way into the oops code
+		 * and then double-fault, though, because we're likely to
+		 * break the console driver and lose most of the stack dump.
+		 */
+		asm volatile ("movq %[stack], %%rsp\n\t"
+			      "call handle_stack_overflow\n\t"
+			      "1: jmp 1b"
+			      : "+r" (__sp)
+			      : "D" ("kernel stack overflow (page fault)"),
+				"S" (regs), "d" (address),
+				[stack] "rm" (stack));
+		unreachable();
+	}
+#endif
+
 	/*
 	 * 32-bit:
 	 *
-- 
2.5.5

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-20 23:43 ` Andy Lutomirski
@ 2016-06-21  4:01   ` Linus Torvalds
  -1 siblings, 0 replies; 269+ messages in thread
From: Linus Torvalds @ 2016-06-21  4:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Mon, Jun 20, 2016 at 4:43 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> On my laptop, this adds about 1.5µs of overhead to task creation,
> which seems to be mainly caused by vmalloc inefficiently allocating
> individual pages even when a higher-order page is available on the
> freelist.

I really think that problem needs to be fixed before this should be merged.

The easy fix may be to just have a very limited re-use of these stacks
in generic code, rather than try to do anything fancy with multi-page
allocations. Just a few of these allocations held in reserve (perhaps
make the allocations percpu to avoid new locks).

It won't help for a thundering herd problem where you start tons of
new threads, but those don't tend to be short-lived ones anyway. In
contrast, I think one common case is the "run shell scripts" that runs
tons and tons of short-lived processes, and having a small "stack of
stacks" would probably catch that case very nicely. Even a
single-entry cache might be ok, but I see no reason to not make it be
perhaps three or four stacks per CPU.

Make the "thread create/exit" sequence go really fast by avoiding the
allocation/deallocation, and hopefully catching a hot cache and TLB
line too.

Performance is not something that we add later. If the first version
of the patch series doesn't perform well, it should not be considered
ready.

            Linus


* Re: [PATCH v3 06/13] fork: Add generic vmalloced stack support
  2016-06-20 23:43   ` Andy Lutomirski
@ 2016-06-21  7:30     ` Jann Horn
  -1 siblings, 0 replies; 269+ messages in thread
From: Jann Horn @ 2016-06-21  7:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 21, 2016 at 1:43 AM, Andy Lutomirski <luto@kernel.org> wrote:
> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
> vmalloc_node.
[...]
>  static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>                                                   int node)
>  {
> +#ifdef CONFIG_VMAP_STACK
> +       struct thread_info *ti = __vmalloc_node_range(
> +               THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
> +               THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
> +               0, node, __builtin_return_address(0));
> +

After spender gave some hints on IRC about the guard pages not working
reliably, I decided to have a closer look at this. As far as I can
tell, the idea is that __vmalloc_node_range() automatically adds guard
pages unless the VM_NO_GUARD flag is specified. However, those guard
pages are *behind* allocations, not in front of them, while a stack
guard primarily needs to be in front of the allocation. This wouldn't
matter if all allocations in the vmalloc area had guard pages behind
them, but if someone first does some data allocation with VM_NO_GUARD
and then a stack allocation directly behind that, there won't be a
guard between the data allocation and the stack allocation.

(I might be wrong though; this is only from looking at the code, not
from testing it.)


* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-20 23:43 ` Andy Lutomirski
@ 2016-06-21  9:24   ` Arnd Bergmann
  -1 siblings, 0 replies; 269+ messages in thread
From: Arnd Bergmann @ 2016-06-21  9:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Monday, June 20, 2016 4:43:30 PM CEST Andy Lutomirski wrote:
> 
> On my laptop, this adds about 1.5µs of overhead to task creation,
> which seems to be mainly caused by vmalloc inefficiently allocating
> individual pages even when a higher-order page is available on the
> freelist.

Would it help to have a fixed virtual address for the stack instead
and map the current stack to that during a task switch, similar to
how we handle fixmap pages?

That would of course trade the allocation overhead for a task switch
overhead, which may be better or worse. It would also give "current"
a constant address, which may give a small performance advantage
but may also introduce a new attack vector unless we randomize it
again.

	Arnd


* Re: [PATCH v3 04/13] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
  2016-06-20 23:43   ` Andy Lutomirski
@ 2016-06-21  9:46     ` Vladimir Davydov
  -1 siblings, 0 replies; 269+ messages in thread
From: Vladimir Davydov @ 2016-06-21  9:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Johannes Weiner,
	Michal Hocko, linux-mm

On Mon, Jun 20, 2016 at 04:43:34PM -0700, Andy Lutomirski wrote:
> Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
> zone.  This only makes sense if each kernel stack exists entirely in
> one zone, and allowing vmapped stacks could break this assumption.
> 
> Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
> allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on
> all architectures.  Keep it simple and use KiB.
> 
> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>


* Re: [PATCH v3 03/13] x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables()
  2016-06-20 23:43   ` Andy Lutomirski
@ 2016-06-21  9:53     ` Matt Fleming
  -1 siblings, 0 replies; 269+ messages in thread
From: Matt Fleming @ 2016-06-21  9:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, linux-efi

On Mon, 20 Jun, at 04:43:33PM, Andy Lutomirski wrote:
> kernel_unmap_pages_in_pgd() is dangerous: if a pgd entry in
> init_mm.pgd were to be cleared, callers would need to ensure that
> the pgd entry hadn't been propagated to any other pgd.
> 
> Its only caller was efi_cleanup_page_tables(), and that, in turn,
> was unused, so just delete both functions.  This leaves a couple of
> other helpers unused, so delete them, too.
> 
> Cc: Matt Fleming <matt@codeblueprint.co.uk>
> Cc: linux-efi@vger.kernel.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/include/asm/efi.h           |  1 -
>  arch/x86/include/asm/pgtable_types.h |  2 --
>  arch/x86/mm/pageattr.c               | 28 ----------------------------
>  arch/x86/platform/efi/efi.c          |  2 --
>  arch/x86/platform/efi/efi_32.c       |  3 ---
>  arch/x86/platform/efi/efi_64.c       |  5 -----
>  6 files changed, 41 deletions(-)

Looks fine.

Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>


* [kernel-hardening] Re: [PATCH v3 03/13] x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables()
@ 2016-06-21  9:53     ` Matt Fleming
  0 siblings, 0 replies; 269+ messages in thread
From: Matt Fleming @ 2016-06-21  9:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, linux-efi

On Mon, 20 Jun, at 04:43:33PM, Andy Lutomirski wrote:
> kernel_unmap_pages_in_pgd() is dangerous: if a pgd entry in
> init_mm.pgd were to be cleared, callers would need to ensure that
> the pgd entry hadn't been propagated to any other pgd.
> 
> Its only caller was efi_cleanup_page_tables(), and that, in turn,
> was unused, so just delete both functions.  This leaves a couple of
> other helpers unused, so delete them, too.
> 
> Cc: Matt Fleming <matt@codeblueprint.co.uk>
> Cc: linux-efi@vger.kernel.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/include/asm/efi.h           |  1 -
>  arch/x86/include/asm/pgtable_types.h |  2 --
>  arch/x86/mm/pageattr.c               | 28 ----------------------------
>  arch/x86/platform/efi/efi.c          |  2 --
>  arch/x86/platform/efi/efi_32.c       |  3 ---
>  arch/x86/platform/efi/efi_64.c       |  5 -----
>  6 files changed, 41 deletions(-)

Looks fine.

Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 05/13] mm: Fix memcg stack accounting for sub-page stacks
  2016-06-20 23:43   ` Andy Lutomirski
@ 2016-06-21  9:54     ` Vladimir Davydov
  -1 siblings, 0 replies; 269+ messages in thread
From: Vladimir Davydov @ 2016-06-21  9:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Johannes Weiner,
	Michal Hocko, linux-mm, Andrew Morton

On Mon, Jun 20, 2016 at 04:43:35PM -0700, Andy Lutomirski wrote:
> We should account for stacks regardless of stack size, and we need
> to account in sub-page units if THREAD_SIZE < PAGE_SIZE.  Change the
> units to kilobytes and move it into account_kernel_stack().
> 
> Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

This patch is going to have a minor conflict with recent changes in
mmotm, where {alloc,free}_kmem_pages were dropped. The conflict should
be trivial to resolve - we only need to replace {alloc,free}_kmem_pages
with {alloc,free}_pages in this patch.

> ---
>  include/linux/memcontrol.h |  2 +-
>  kernel/fork.c              | 15 ++++++---------
>  mm/memcontrol.c            |  2 +-
>  3 files changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a805474df4ab..3b653b86bb8f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -52,7 +52,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
>  	MEM_CGROUP_STAT_NSTATS,
>  	/* default hierarchy stats */
> -	MEMCG_KERNEL_STACK = MEM_CGROUP_STAT_NSTATS,
> +	MEMCG_KERNEL_STACK_KB = MEM_CGROUP_STAT_NSTATS,
>  	MEMCG_SLAB_RECLAIMABLE,
>  	MEMCG_SLAB_UNRECLAIMABLE,
>  	MEMCG_SOCK,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index be7f006af727..ff3c41c2ba96 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -165,20 +165,12 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>  	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
>  						  THREAD_SIZE_ORDER);
>  
> -	if (page)
> -		memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
> -					    1 << THREAD_SIZE_ORDER);
> -
>  	return page ? page_address(page) : NULL;
>  }
>  
>  static inline void free_thread_info(struct thread_info *ti)
>  {
> -	struct page *page = virt_to_page(ti);
> -
> -	memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
> -				    -(1 << THREAD_SIZE_ORDER));
> -	__free_kmem_pages(page, THREAD_SIZE_ORDER);
> +	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
>  }
>  # else
>  static struct kmem_cache *thread_info_cache;
> @@ -227,6 +219,11 @@ static void account_kernel_stack(struct thread_info *ti, int account)
>  
>  	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
>  			    THREAD_SIZE / 1024 * account);
> +
> +	/* All stack pages belong to the same memcg. */
> +	memcg_kmem_update_page_stat(
> +		virt_to_page(ti), MEMCG_KERNEL_STACK_KB,
> +		account * (THREAD_SIZE / 1024));
>  }
>  
>  void free_task(struct task_struct *tsk)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75e74408cc8f..8e13a2419dad 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5133,7 +5133,7 @@ static int memory_stat_show(struct seq_file *m, void *v)
>  	seq_printf(m, "file %llu\n",
>  		   (u64)stat[MEM_CGROUP_STAT_CACHE] * PAGE_SIZE);
>  	seq_printf(m, "kernel_stack %llu\n",
> -		   (u64)stat[MEMCG_KERNEL_STACK] * PAGE_SIZE);
> +		   (u64)stat[MEMCG_KERNEL_STACK_KB] * 1024);
>  	seq_printf(m, "slab %llu\n",
>  		   (u64)(stat[MEMCG_SLAB_RECLAIMABLE] +
>  			 stat[MEMCG_SLAB_UNRECLAIMABLE]) * PAGE_SIZE);

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-21  4:01   ` Linus Torvalds
@ 2016-06-21 16:45     ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-21 16:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Mon, Jun 20, 2016 at 9:01 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jun 20, 2016 at 4:43 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On my laptop, this adds about 1.5µs of overhead to task creation,
>> which seems to be mainly caused by vmalloc inefficiently allocating
>> individual pages even when a higher-order page is available on the
>> freelist.
>
> I really think that problem needs to be fixed before this should be merged.
>
> The easy fix may be to just have a very limited re-use of these stacks
> in generic code, rather than try to do anything fancy with multi-page
> allocations. Just a few of these allocations held in reserve (perhaps
> make the allocations percpu to avoid new locks).
>
> It won't help for a thundering herd problem where you start tons of
> new threads, but those don't tend to be short-lived ones anyway. In
> contrast, I think one common case is the "run shell scripts" that runs
> tons and tons of short-lived processes, and having a small "stack of
> stacks" would probably catch that case very nicely. Even a
> single-entry cache might be ok, but I see no reason to not make it be
> perhaps three or four stacks per CPU.
>
> Make the "thread create/exit" sequence go really fast by avoiding the
> allocation/deallocation, and hopefully catching a hot cache and TLB
> line too.

To put the numbers in perspective: we'll pay the 1.5µs every time we
do any kind of clone(), but I think that many of the interesting cases
may be so far dominated by other costs that this is lost in the noise.
For scripts, execve() and all the dynamic linking overhead is so much
larger that no one will ever notice this:

time for i in `seq 1000`; do /bin/true; done

real    0m2.641s
user    0m0.058s
sys    0m0.107s

That's over 2ms per /bin/true invocation, so we're talking about less
than a 0.1% slowdown.  For fork() (i.e. !CLONE_VM), we'll have the
full cost of copying the mm.  And for anything with a thundering herd,
there will be lots of context switches, and just the context switches
are likely to swamp the task creation time.

On the flip side, on workloads where higher-order page allocation
requires any sort of compaction, using vmalloc should be much faster.

So I'm leaning toward fewer cache entries per cpu, maybe just one.
I'm all for making it a bit faster, but I think we should weigh that
against increasing memory usage too much and thus scaring away the
embedded folks.

--Andy

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 06/13] fork: Add generic vmalloced stack support
  2016-06-21  7:30     ` Jann Horn
@ 2016-06-21 16:59       ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-21 16:59 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, X86 ML, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens

On Tue, Jun 21, 2016 at 12:30 AM, Jann Horn <jannh@google.com> wrote:
> On Tue, Jun 21, 2016 at 1:43 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
>> vmalloc_node.
> [...]
>>  static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>>                                                   int node)
>>  {
>> +#ifdef CONFIG_VMAP_STACK
>> +       struct thread_info *ti = __vmalloc_node_range(
>> +               THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
>> +               THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
>> +               0, node, __builtin_return_address(0));
>> +
>
> After spender gave some hints on IRC about the guard pages not working
> reliably, I decided to have a closer look at this. As far as I can
> tell, the idea is that __vmalloc_node_range() automatically adds guard
> pages unless the VM_NO_GUARD flag is specified. However, those guard
> pages are *behind* allocations, not in front of them, while a stack
> guard primarily needs to be in front of the allocation. This wouldn't
> matter if all allocations in the vmalloc area had guard pages behind
> them, but if someone first does some data allocation with VM_NO_GUARD
> and then a stack allocation directly behind that, there won't be a
> guard between the data allocation and the stack allocation.

I'm tempted to explicitly disallow VM_NO_GUARD in the vmalloc range.
It has no in-tree users for non-fixed addresses right now.

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 06/13] fork: Add generic vmalloced stack support
  2016-06-21 16:59       ` Andy Lutomirski
  (?)
@ 2016-06-21 17:13         ` Kees Cook
  -1 siblings, 0 replies; 269+ messages in thread
From: Kees Cook @ 2016-06-21 17:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Andy Lutomirski, X86 ML, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 21, 2016 at 9:59 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Jun 21, 2016 at 12:30 AM, Jann Horn <jannh@google.com> wrote:
>> On Tue, Jun 21, 2016 at 1:43 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
>>> vmalloc_node.
>> [...]
>>>  static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>>>                                                   int node)
>>>  {
>>> +#ifdef CONFIG_VMAP_STACK
>>> +       struct thread_info *ti = __vmalloc_node_range(
>>> +               THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
>>> +               THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
>>> +               0, node, __builtin_return_address(0));
>>> +
>>
>> After spender gave some hints on IRC about the guard pages not working
>> reliably, I decided to have a closer look at this. As far as I can
>> tell, the idea is that __vmalloc_node_range() automatically adds guard
>> pages unless the VM_NO_GUARD flag is specified. However, those guard
>> pages are *behind* allocations, not in front of them, while a stack
>> guard primarily needs to be in front of the allocation. This wouldn't
>> matter if all allocations in the vmalloc area had guard pages behind
>> them, but if someone first does some data allocation with VM_NO_GUARD
>> and then a stack allocation directly behind that, there won't be a
>> guard between the data allocation and the stack allocation.
>
> I'm tempted to explicitly disallow VM_NO_GUARD in the vmalloc range.
> It has no in-tree users for non-fixed addresses right now.

What about the lack of pre-range guard page? That seems like a
critical feature for this. :)

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-21  9:24   ` Arnd Bergmann
  (?)
@ 2016-06-21 17:16     ` Kees Cook
  -1 siblings, 0 replies; 269+ messages in thread
From: Kees Cook @ 2016-06-21 17:16 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andy Lutomirski, x86, LKML, linux-arch, Borislav Petkov,
	Nadav Amit, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 21, 2016 at 2:24 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday, June 20, 2016 4:43:30 PM CEST Andy Lutomirski wrote:
>>
>> On my laptop, this adds about 1.5µs of overhead to task creation,
>> which seems to be mainly caused by vmalloc inefficiently allocating
>> individual pages even when a higher-order page is available on the
>> freelist.
>
> Would it help to have a fixed virtual address for the stack instead
> and map the current stack to that during a task switch, similar to
> how we handle fixmap pages?
>
> That would of course trade the allocation overhead for a task switch
> overhead, which may be better or worse. It would also give "current"
> a constant address, which may give a small performance advantage
> but may also introduce a new attack vector unless we randomize it
> again.

Right: we don't want a fixed address. That makes attacks WAY easier.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-21 16:45     ` Andy Lutomirski
  (?)
@ 2016-06-21 17:16       ` Linus Torvalds
  -1 siblings, 0 replies; 269+ messages in thread
From: Linus Torvalds @ 2016-06-21 17:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 21, 2016 at 9:45 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> So I'm leaning toward fewer cache entries per cpu, maybe just one.
> I'm all for making it a bit faster, but I think we should weigh that
> against increasing memory usage too much and thus scaring away the
> embedded folks.

I don't think the embedded folks will be scared by a per-cpu cache, if
it's just one or two entries.  And I really do think that even just
one or two entries will indeed catch a lot of the cases.

And yes, fork+execve() is too damn expensive in page table build-up
and tear-down. I'm not sure why bash doesn't do vfork+exec for when it
has to wait for the process anyway, but it doesn't seem to do that.

                 Linus

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-21 17:16       ` Linus Torvalds
  (?)
@ 2016-06-21 17:27         ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-21 17:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 21, 2016 at 10:16 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jun 21, 2016 at 9:45 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> So I'm leaning toward fewer cache entries per cpu, maybe just one.
>> I'm all for making it a bit faster, but I think we should weigh that
>> against increasing memory usage too much and thus scaring away the
>> embedded folks.
>
> I don't think the embedded folks will be scared by a per-cpu cache, if
> it's just one or two entries.  And I really do think that even just
> one or two entries will indeed catch a lot of the cases.
>
> And yes, fork+execve() is too damn expensive in page table build-up
> and tear-down. I'm not sure why bash doesn't do vfork+exec for when it
> has to wait for the process anyway, but it doesn't seem to do that.
>

I don't know about bash, but glibc very recently fixed a long-standing
bug in posix_spawn and started using clone() in a sensible manner for
this.

FWIW, it may be a while before this can be enabled in distro kernels.
There are some code paths (*cough* crypto users *cough*) that think
that calling sg_init_one with a stack address is a reasonable thing to
do, and it doesn't work with a vmalloced stack.  grsecurity works
around this by using a real lowmem higher-order stack, aliasing it
into vmalloc space, and arranging for virt_to_phys to backtrack the
alias, but eww.  I think I'd rather find and fix the bugs, assuming
they're straightforward.

--Andy

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 06/13] fork: Add generic vmalloced stack support
  2016-06-21 17:13         ` Kees Cook
  (?)
@ 2016-06-21 17:28           ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-21 17:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: Jann Horn, Andy Lutomirski, X86 ML, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 21, 2016 at 10:13 AM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Jun 21, 2016 at 9:59 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, Jun 21, 2016 at 12:30 AM, Jann Horn <jannh@google.com> wrote:
>>> On Tue, Jun 21, 2016 at 1:43 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>>> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
>>>> vmalloc_node.
>>> [...]
>>>>  static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>>>>                                                   int node)
>>>>  {
>>>> +#ifdef CONFIG_VMAP_STACK
>>>> +       struct thread_info *ti = __vmalloc_node_range(
>>>> +               THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
>>>> +               THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
>>>> +               0, node, __builtin_return_address(0));
>>>> +
>>>
>>> After spender gave some hints on IRC about the guard pages not working
>>> reliably, I decided to have a closer look at this. As far as I can
>>> tell, the idea is that __vmalloc_node_range() automatically adds guard
>>> pages unless the VM_NO_GUARD flag is specified. However, those guard
>>> pages are *behind* allocations, not in front of them, while a stack
>>> guard primarily needs to be in front of the allocation. This wouldn't
>>> matter if all allocations in the vmalloc area had guard pages behind
>>> them, but if someone first does some data allocation with VM_NO_GUARD
>>> and then a stack allocation directly behind that, there won't be a
>>> guard between the data allocation and the stack allocation.
>>
>> I'm tempted to explicitly disallow VM_NO_GUARD in the vmalloc range.
>> It has no in-tree users for non-fixed addresses right now.
>
> What about the lack of pre-range guard page? That seems like a
> critical feature for this. :)
>

Agreed.  There's a big va hole there on x86_64, but I don't know about
other arches.  It might pay to add something to the vmalloc core code.
Any volunteers?

> -Kees
>
> --
> Kees Cook
> Chrome OS & Brillo Security



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [kernel-hardening] Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-21 17:16     ` Kees Cook
@ 2016-06-21 18:02       ` Rik van Riel
  -1 siblings, 0 replies; 269+ messages in thread
From: Rik van Riel @ 2016-06-21 18:02 UTC (permalink / raw)
  To: kernel-hardening, Arnd Bergmann
  Cc: Andy Lutomirski, x86, LKML, linux-arch, Borislav Petkov,
	Nadav Amit, Brian Gerst, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens

On Tue, 2016-06-21 at 10:16 -0700, Kees Cook wrote:
> On Tue, Jun 21, 2016 at 2:24 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > 
> > On Monday, June 20, 2016 4:43:30 PM CEST Andy Lutomirski wrote:
> > > 
> > > 
> > > On my laptop, this adds about 1.5µs of overhead to task creation,
> > > which seems to be mainly caused by vmalloc inefficiently
> > > allocating
> > > individual pages even when a higher-order page is available on
> > > the
> > > freelist.
> > Would it help to have a fixed virtual address for the stack instead
> > and map the current stack to that during a task switch, similar to
> > how we handle fixmap pages?
> > 
> > That would of course trade the allocation overhead for a task
> > switch
> > overhead, which may be better or worse. It would also give
> > "current"
> > a constant address, which may give a small performance advantage
> > but may also introduce a new attack vector unless we randomize it
> > again.
> Right: we don't want a fixed address. That makes attacks WAY easier.

Does that imply we might want the per-cpu cache of
these stacks to be larger than one, in order to
introduce some more randomness after an attacker
crashed an ASLRed program looking for ROP gadgets,
and the next one is spawned? :)

-- 
All Rights Reversed.



^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [kernel-hardening] Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-21 18:02       ` Rik van Riel
  (?)
@ 2016-06-21 18:05         ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-21 18:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kernel-hardening, Arnd Bergmann, Andy Lutomirski, x86, LKML,
	linux-arch, Borislav Petkov, Nadav Amit, Brian Gerst,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 21, 2016 at 11:02 AM, Rik van Riel <riel@redhat.com> wrote:
> On Tue, 2016-06-21 at 10:16 -0700, Kees Cook wrote:
>> On Tue, Jun 21, 2016 at 2:24 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> >
>> > On Monday, June 20, 2016 4:43:30 PM CEST Andy Lutomirski wrote:
>> > >
>> > >
>> > > On my laptop, this adds about 1.5µs of overhead to task creation,
>> > > which seems to be mainly caused by vmalloc inefficiently
>> > > allocating
>> > > individual pages even when a higher-order page is available on
>> > > the
>> > > freelist.
>> > Would it help to have a fixed virtual address for the stack instead
>> > and map the current stack to that during a task switch, similar to
>> > how we handle fixmap pages?
>> >
>> > That would of course trade the allocation overhead for a task
>> > switch
>> > overhead, which may be better or worse. It would also give
>> > "current"
>> > a constant address, which may give a small performance advantage
>> > but may also introduce a new attack vector unless we randomize it
>> > again.
>> Right: we don't want a fixed address. That makes attacks WAY easier.
>
> Does that imply we might want the per-cpu cache of
> these stacks to be larger than one, in order to
> introduce some more randomness after an attacker
> crashed an ASLRed program looking for ROP gadgets,
> and the next one is spawned? :)

This is the kernel stack, so this only really matters if there's some
attack in which you OOPS but learn the kernel stack address in the
process and then reuse that stack.  So... maybe?

--Andy

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-21 17:27         ` Andy Lutomirski
  (?)
@ 2016-06-21 18:12           ` Kees Cook
  -1 siblings, 0 replies; 269+ messages in thread
From: Kees Cook @ 2016-06-21 18:12 UTC (permalink / raw)
  To: Andy Lutomirski, Herbert Xu
  Cc: Linus Torvalds, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Brian Gerst, kernel-hardening, Josh Poimboeuf,
	Jann Horn, Heiko Carstens

On Tue, Jun 21, 2016 at 10:27 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Jun 21, 2016 at 10:16 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Tue, Jun 21, 2016 at 9:45 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>
>>> So I'm leaning toward fewer cache entries per cpu, maybe just one.
>>> I'm all for making it a bit faster, but I think we should weigh that
>>> against increasing memory usage too much and thus scaring away the
>>> embedded folks.
>>
>> I don't think the embedded folks will be scared by a per-cpu cache, if
>> it's just one or two entries.  And I really do think that even just
>> one or two entries will indeed catch a lot of the cases.
>>
>> And yes, fork+execve() is too damn expensive in page table build-up
>> and tear-down. I'm not sure why bash doesn't do vfork+exec for when it
>> has to wait for the process anyway, but it doesn't seem to do that.
>>
>
> I don't know about bash, but glibc very recently fixed a long-standing
> bug in posix_spawn and started using clone() in a sensible manner for
> this.
>
> FWIW, it may be a while before this can be enabled in distro kernels.
> There are some code paths (*cough* crypto users *cough*) that think
> that calling sg_init_one with a stack address is a reasonable thing to
> do, and it doesn't work with a vmalloced stack.  grsecurity works

... O_o ...

Why does it not work on a vmalloced stack??

> around this by using a real lowmem higher-order stack, aliasing it
> into vmalloc space, and arranging for virt_to_phys to backtrack the
> alias, but eww.  I think I'd rather find and fix the bugs, assuming
> they're straightforward.

Yeah. That's ugly.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [kernel-hardening] Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-21 18:12           ` Kees Cook
@ 2016-06-21 18:19             ` Rik van Riel
  -1 siblings, 0 replies; 269+ messages in thread
From: Rik van Riel @ 2016-06-21 18:19 UTC (permalink / raw)
  To: kernel-hardening, Andy Lutomirski, Herbert Xu
  Cc: Linus Torvalds, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Brian Gerst, Josh Poimboeuf, Jann Horn,
	Heiko Carstens

On Tue, 2016-06-21 at 11:12 -0700, Kees Cook wrote:
> On Tue, Jun 21, 2016 at 10:27 AM, Andy Lutomirski
> <luto@amacapital.net> wrote:
> > FWIW, it may be a while before this can be enabled in distro
> > kernels.
> > There are some code paths (*cough* crypto users *cough*) that think
> > that calling sg_init_one with a stack address is a reasonable thing
> > to
> > do, and it doesn't work with a vmalloced stack.  grsecurity works
> ... O_o ...
> 
> Why does it not work on a vmalloced stack??

Because virt_to_page() does not work on vmalloced
memory.

-- 
All Rights Reversed.



^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [kernel-hardening] Re: [PATCH v3 06/13] fork: Add generic vmalloced stack support
  2016-06-21 17:13         ` Kees Cook
@ 2016-06-21 18:32           ` Rik van Riel
  -1 siblings, 0 replies; 269+ messages in thread
From: Rik van Riel @ 2016-06-21 18:32 UTC (permalink / raw)
  To: kernel-hardening, Andy Lutomirski
  Cc: Jann Horn, Andy Lutomirski, X86 ML, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Brian Gerst, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, 2016-06-21 at 10:13 -0700, Kees Cook wrote:
> On Tue, Jun 21, 2016 at 9:59 AM, Andy Lutomirski <luto@amacapital.net
> > wrote:
> > 
> > I'm tempted to explicitly disallow VM_NO_GUARD in the vmalloc
> > range.
> > It has no in-tree users for non-fixed addresses right now.
> What about the lack of pre-range guard page? That seems like a
> critical feature for this. :)

If VM_NO_GUARD is disallowed, and every vmalloc area has
a guard area behind it, then every subsequent vmalloc area
will have a guard page ahead of it.

I think disallowing VM_NO_GUARD will be all that is required.

The only thing we may want to verify on the architectures that
we care about is that there is nothing mapped immediately before
the start of the vmalloc range, otherwise the first vmalloced
area will not have a guard page below it.

I suspect all the 64 bit architectures are fine in that regard,
with enormous gaps between kernel memory ranges.

-- 
All Rights Reversed.



^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [kernel-hardening] Re: [PATCH v3 06/13] fork: Add generic vmalloced stack support
  2016-06-21 19:44             ` Arnd Bergmann
  (?)
@ 2016-06-21 19:43               ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-21 19:43 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Rik van Riel, kernel-hardening, Jann Horn, Andy Lutomirski,
	X86 ML, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Brian Gerst, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens

On Tue, Jun 21, 2016 at 12:44 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday, June 21, 2016 2:32:28 PM CEST Rik van Riel wrote:
>> On Tue, 2016-06-21 at 10:13 -0700, Kees Cook wrote:
>> > On Tue, Jun 21, 2016 at 9:59 AM, Andy Lutomirski <luto@amacapital.net
>> > > wrote:
>> > >
>> > > I'm tempted to explicitly disallow VM_NO_GUARD in the vmalloc
>> > > range.
>> > > It has no in-tree users for non-fixed addresses right now.
>> > What about the lack of pre-range guard page? That seems like a
>> > critical feature for this.
>>
>> If VM_NO_GUARD is disallowed, and every vmalloc area has
>> a guard area behind it, then every subsequent vmalloc area
>> will have a guard page ahead of it.
>>
>> I think disallowing VM_NO_GUARD will be all that is required.
>>
>> The only thing we may want to verify on the architectures that
>> we care about is that there is nothing mapped immediately before
>> the start of the vmalloc range, otherwise the first vmalloced
>> area will not have a guard page below it.
>
> FWIW, ARM has an 8MB guard area between the linear mapping of
> physical memory and the start of the vmalloc area. I have not
> checked any of the other architectures though.

If we start banning VM_NO_GUARD in the vmalloc area, we could also
explicitly prevent use of the bottom page of the vmalloc area.

>
>         Arnd



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [kernel-hardening] Re: [PATCH v3 06/13] fork: Add generic vmalloced stack support
  2016-06-21 18:32           ` Rik van Riel
@ 2016-06-21 19:44             ` Arnd Bergmann
  -1 siblings, 0 replies; 269+ messages in thread
From: Arnd Bergmann @ 2016-06-21 19:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kernel-hardening, Andy Lutomirski, Jann Horn, Andy Lutomirski,
	X86 ML, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Brian Gerst, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens

On Tuesday, June 21, 2016 2:32:28 PM CEST Rik van Riel wrote:
> On Tue, 2016-06-21 at 10:13 -0700, Kees Cook wrote:
> > On Tue, Jun 21, 2016 at 9:59 AM, Andy Lutomirski <luto@amacapital.net
> > > wrote:
> > > 
> > > I'm tempted to explicitly disallow VM_NO_GUARD in the vmalloc
> > > range.
> > > It has no in-tree users for non-fixed addresses right now.
> > What about the lack of pre-range guard page? That seems like a
> > critical feature for this. 
> 
> If VM_NO_GUARD is disallowed, and every vmalloc area has
> a guard area behind it, then every subsequent vmalloc area
> will have a guard page ahead of it.
> 
> I think disallowing VM_NO_GUARD will be all that is required.
> 
> The only thing we may want to verify on the architectures that
> we care about is that there is nothing mapped immediately before
> the start of the vmalloc range, otherwise the first vmalloced
> area will not have a guard page below it.

FWIW, ARM has an 8MB guard area between the linear mapping of
physical memory and the start of the vmalloc area. I have not
checked any of the other architectures though.

	Arnd

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-21 19:47       ` Arnd Bergmann
  (?)
@ 2016-06-21 19:47         ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-21 19:47 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Kees Cook, Andy Lutomirski, x86, LKML, linux-arch,
	Borislav Petkov, Nadav Amit, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 21, 2016 at 12:47 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday, June 21, 2016 10:16:21 AM CEST Kees Cook wrote:
>> On Tue, Jun 21, 2016 at 2:24 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> > On Monday, June 20, 2016 4:43:30 PM CEST Andy Lutomirski wrote:
>> >>
>> >> On my laptop, this adds about 1.5µs of overhead to task creation,
>> >> which seems to be mainly caused by vmalloc inefficiently allocating
>> >> individual pages even when a higher-order page is available on the
>> >> freelist.
>> >
>> > Would it help to have a fixed virtual address for the stack instead
>> > and map the current stack to that during a task switch, similar to
>> > how we handle fixmap pages?
>> >
>> > That would of course trade the allocation overhead for a task switch
>> > overhead, which may be better or worse. It would also give "current"
>> > a constant address, which may give a small performance advantage
>> > but may also introduce a new attack vector unless we randomize it
>> > again.
>>
>> Right: we don't want a fixed address. That makes attacks WAY easier.
>
> Do we care about making the address more random then? When I look
> at /proc/vmallocinfo, I see that allocations are all using
> consecutive addresses, so if you can figure out the virtual
> address of the stack for one process that would give you a good
> chance of guessing the address for the next pid.

Quite possibly.  We should seriously consider at least randomizing the
*start* of the vmalloc area, at least on 64-bit architectures.

--Andy

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-21 17:16     ` Kees Cook
@ 2016-06-21 19:47       ` Arnd Bergmann
  -1 siblings, 0 replies; 269+ messages in thread
From: Arnd Bergmann @ 2016-06-21 19:47 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, x86, LKML, linux-arch, Borislav Petkov,
	Nadav Amit, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tuesday, June 21, 2016 10:16:21 AM CEST Kees Cook wrote:
> On Tue, Jun 21, 2016 at 2:24 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Monday, June 20, 2016 4:43:30 PM CEST Andy Lutomirski wrote:
> >>
> >> On my laptop, this adds about 1.5µs of overhead to task creation,
> >> which seems to be mainly caused by vmalloc inefficiently allocating
> >> individual pages even when a higher-order page is available on the
> >> freelist.
> >
> > Would it help to have a fixed virtual address for the stack instead
> > and map the current stack to that during a task switch, similar to
> > how we handle fixmap pages?
> >
> > That would of course trade the allocation overhead for a task switch
> > overhead, which may be better or worse. It would also give "current"
> > a constant address, which may give a small performance advantage
> > but may also introduce a new attack vector unless we randomize it
> > again.
> 
> Right: we don't want a fixed address. That makes attacks WAY easier.

Do we care about making the address more random then? When I look
at /proc/vmallocinfo, I see that allocations are all using
consecutive addresses, so if you can figure out the virtual
address of the stack for one process that would give you a good
chance of guessing the address for the next pid.

	Arnd

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-21 19:47         ` Andy Lutomirski
@ 2016-06-21 20:18           ` Kees Cook
  -1 siblings, 0 replies; 269+ messages in thread
From: Kees Cook @ 2016-06-21 20:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Arnd Bergmann, Andy Lutomirski, x86, LKML, linux-arch,
	Borislav Petkov, Nadav Amit, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 21, 2016 at 12:47 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Jun 21, 2016 at 12:47 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>> On Tuesday, June 21, 2016 10:16:21 AM CEST Kees Cook wrote:
>>> On Tue, Jun 21, 2016 at 2:24 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>>> > On Monday, June 20, 2016 4:43:30 PM CEST Andy Lutomirski wrote:
>>> >>
>>> >> On my laptop, this adds about 1.5µs of overhead to task creation,
>>> >> which seems to be mainly caused by vmalloc inefficiently allocating
>>> >> individual pages even when a higher-order page is available on the
>>> >> freelist.
>>> >
>>> > Would it help to have a fixed virtual address for the stack instead
>>> > and map the current stack to that during a task switch, similar to
>>> > how we handle fixmap pages?
>>> >
>>> > That would of course trade the allocation overhead for a task switch
>>> > overhead, which may be better or worse. It would also give "current"
>>> > a constant address, which may give a small performance advantage
>>> > but may also introduce a new attack vector unless we randomize it
>>> > again.
>>>
>>> Right: we don't want a fixed address. That makes attacks WAY easier.
>>
>> Do we care about making the address more random then? When I look
>> at /proc/vmallocinfo, I see that allocations are all using
>> consecutive addresses, so if you can figure out the virtual
>> address of the stack for one process that would give you a good
>> chance of guessing the address for the next pid.
>
> Quite possibly.  We should seriously consider at least randomizing the
> *start* of the vmalloc area, at least on 64-bit architectures.

Yup, this is already under way for x86. Thomas Garnier has a series
that he's been working on:

http://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=kaslr/memory

I'd love to see similar for other architectures too.

Thomas just sent me an updated series I'll be putting up for review later today.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 04/13] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
  2016-06-20 23:43   ` Andy Lutomirski
@ 2016-06-22  7:35     ` Michal Hocko
  -1 siblings, 0 replies; 269+ messages in thread
From: Michal Hocko @ 2016-06-22  7:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Vladimir Davydov,
	Johannes Weiner, linux-mm

On Mon 20-06-16 16:43:34, Andy Lutomirski wrote:
> Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
> zone.  This only makes sense if each kernel stack exists entirely in
> one zone, and allowing vmapped stacks could break this assumption.
> 
> Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
> allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on
> all architectures.  Keep it simple and use KiB.
> 
> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  drivers/base/node.c    | 3 +--
>  fs/proc/meminfo.c      | 2 +-
>  include/linux/mmzone.h | 2 +-
>  kernel/fork.c          | 3 ++-
>  mm/page_alloc.c        | 3 +--
>  5 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 560751bad294..27dc68a0ed2d 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -121,8 +121,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
>  		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
>  		       nid, K(i.sharedram),
> -		       nid, node_page_state(nid, NR_KERNEL_STACK) *
> -				THREAD_SIZE / 1024,
> +		       nid, node_page_state(nid, NR_KERNEL_STACK_KB),
>  		       nid, K(node_page_state(nid, NR_PAGETABLE)),
>  		       nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
>  		       nid, K(node_page_state(nid, NR_BOUNCE)),
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 83720460c5bc..239b5a06cee0 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -145,7 +145,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  				global_page_state(NR_SLAB_UNRECLAIMABLE)),
>  		K(global_page_state(NR_SLAB_RECLAIMABLE)),
>  		K(global_page_state(NR_SLAB_UNRECLAIMABLE)),
> -		global_page_state(NR_KERNEL_STACK) * THREAD_SIZE / 1024,
> +		global_page_state(NR_KERNEL_STACK_KB),
>  		K(global_page_state(NR_PAGETABLE)),
>  #ifdef CONFIG_QUICKLIST
>  		K(quicklist_total_size()),
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 02069c23486d..63f05a7efb54 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -127,7 +127,7 @@ enum zone_stat_item {
>  	NR_SLAB_RECLAIMABLE,
>  	NR_SLAB_UNRECLAIMABLE,
>  	NR_PAGETABLE,		/* used for pagetables */
> -	NR_KERNEL_STACK,
> +	NR_KERNEL_STACK_KB,	/* measured in KiB */
>  	/* Second 128 byte cacheline */
>  	NR_UNSTABLE_NFS,	/* NFS unstable pages */
>  	NR_BOUNCE,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5c2c355aa97f..be7f006af727 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -225,7 +225,8 @@ static void account_kernel_stack(struct thread_info *ti, int account)
>  {
>  	struct zone *zone = page_zone(virt_to_page(ti));
>  
> -	mod_zone_page_state(zone, NR_KERNEL_STACK, account);
> +	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
> +			    THREAD_SIZE / 1024 * account);
>  }
>  
>  void free_task(struct task_struct *tsk)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6903b695ebae..a277dea926c9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4457,8 +4457,7 @@ void show_free_areas(unsigned int filter)
>  			K(zone_page_state(zone, NR_SHMEM)),
>  			K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
>  			K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
> -			zone_page_state(zone, NR_KERNEL_STACK) *
> -				THREAD_SIZE / 1024,
> +			zone_page_state(zone, NR_KERNEL_STACK_KB),
>  			K(zone_page_state(zone, NR_PAGETABLE)),
>  			K(zone_page_state(zone, NR_UNSTABLE_NFS)),
>  			K(zone_page_state(zone, NR_BOUNCE)),
> -- 
> 2.5.5

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 05/13] mm: Fix memcg stack accounting for sub-page stacks
  2016-06-20 23:43   ` Andy Lutomirski
@ 2016-06-22  7:38     ` Michal Hocko
  -1 siblings, 0 replies; 269+ messages in thread
From: Michal Hocko @ 2016-06-22  7:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Vladimir Davydov,
	Johannes Weiner, linux-mm

On Mon 20-06-16 16:43:35, Andy Lutomirski wrote:
> We should account for stacks regardless of stack size, and we need
> to account in sub-page units if THREAD_SIZE < PAGE_SIZE.  Change the
> units to kilobytes and move it into account_kernel_stack().
> 
> Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h |  2 +-
>  kernel/fork.c              | 15 ++++++---------
>  mm/memcontrol.c            |  2 +-
>  3 files changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a805474df4ab..3b653b86bb8f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -52,7 +52,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
>  	MEM_CGROUP_STAT_NSTATS,
>  	/* default hierarchy stats */
> -	MEMCG_KERNEL_STACK = MEM_CGROUP_STAT_NSTATS,
> +	MEMCG_KERNEL_STACK_KB = MEM_CGROUP_STAT_NSTATS,
>  	MEMCG_SLAB_RECLAIMABLE,
>  	MEMCG_SLAB_UNRECLAIMABLE,
>  	MEMCG_SOCK,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index be7f006af727..ff3c41c2ba96 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -165,20 +165,12 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>  	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
>  						  THREAD_SIZE_ORDER);
>  
> -	if (page)
> -		memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
> -					    1 << THREAD_SIZE_ORDER);
> -
>  	return page ? page_address(page) : NULL;
>  }
>  
>  static inline void free_thread_info(struct thread_info *ti)
>  {
> -	struct page *page = virt_to_page(ti);
> -
> -	memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
> -				    -(1 << THREAD_SIZE_ORDER));
> -	__free_kmem_pages(page, THREAD_SIZE_ORDER);
> +	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
>  }
>  # else
>  static struct kmem_cache *thread_info_cache;
> @@ -227,6 +219,11 @@ static void account_kernel_stack(struct thread_info *ti, int account)
>  
>  	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
>  			    THREAD_SIZE / 1024 * account);
> +
> +	/* All stack pages belong to the same memcg. */
> +	memcg_kmem_update_page_stat(
> +		virt_to_page(ti), MEMCG_KERNEL_STACK_KB,
> +		account * (THREAD_SIZE / 1024));
>  }
>  
>  void free_task(struct task_struct *tsk)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75e74408cc8f..8e13a2419dad 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5133,7 +5133,7 @@ static int memory_stat_show(struct seq_file *m, void *v)
>  	seq_printf(m, "file %llu\n",
>  		   (u64)stat[MEM_CGROUP_STAT_CACHE] * PAGE_SIZE);
>  	seq_printf(m, "kernel_stack %llu\n",
> -		   (u64)stat[MEMCG_KERNEL_STACK] * PAGE_SIZE);
> +		   (u64)stat[MEMCG_KERNEL_STACK_KB] * 1024);
>  	seq_printf(m, "slab %llu\n",
>  		   (u64)(stat[MEMCG_SLAB_RECLAIMABLE] +
>  			 stat[MEMCG_SLAB_UNRECLAIMABLE]) * PAGE_SIZE);
> -- 
> 2.5.5

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-21  4:01   ` Linus Torvalds
@ 2016-06-23  1:22     ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-23  1:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Mon, Jun 20, 2016 at 9:01 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jun 20, 2016 at 4:43 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On my laptop, this adds about 1.5µs of overhead to task creation,
>> which seems to be mainly caused by vmalloc inefficiently allocating
>> individual pages even when a higher-order page is available on the
>> freelist.
>
> I really think that problem needs to be fixed before this should be merged.
>
> The easy fix may be to just have a very limited re-use of these stacks
> in generic code, rather than try to do anything fancy with multi-page
> allocations. Just a few of these allocations held in reserve (perhaps
> make the allocations percpu to avoid new locks).

I implemented a percpu cache, and it's useless.

When a task goes away, one reference is held until the next RCU grace
period so that task_struct can be used under RCU (look for
delayed_put_task_struct).  This means that free_task gets called in
giant batches under heavy clone() load, which is the only time that
any of this matters, which means that we only get to refill the cache
once per RCU batch, which means that there's very little benefit.

Once thread_info stops living in the stack, we could, in principle,
exempt the stack itself from RCU protection, thus saving a bit of
memory under load and making the cache work.  I've started working on
(optionally, per-arch) getting rid of on-stack thread_info, but that's
not ready yet.

FWIW, the same issue quite possibly hurts non-vmap-stack performance
as well, as it makes it much less likely that a cache-hot stack gets
immediately reused under heavy fork load.

So may I skip this for now?  I think that the performance hit is
unlikely to matter on most workloads, and I also expect the speedup
from not using higher-order allocations to be a decent win on some
workloads.

--Andy

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23  1:22     ` Andy Lutomirski
@ 2016-06-23  6:02       ` Linus Torvalds
  -1 siblings, 0 replies; 269+ messages in thread
From: Linus Torvalds @ 2016-06-23  6:02 UTC (permalink / raw)
  To: Andy Lutomirski, Oleg Nesterov
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Wed, Jun 22, 2016 at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I implemented a percpu cache, and it's useless.
>
> When a task goes away, one reference is held until the next RCU grace
> period so that task_struct can be used under RCU (look for
> delayed_put_task_struct).

Yeah, that RCU batching will screw the cache idea.

But isn't it only the "task_struct" that needs that? That's a separate
allocation from the stack, which contains the "thread_info".

I think that what we *could* do is re-use the thread-info within the
RCU grace period, as long as we delay freeing the task_struct.

Yes, yes, we currently tie the task_struct and thread_info lifetimes
together very tightly, but that's a historical thing rather than a
requirement. We do the

        account_kernel_stack(tsk->stack, -1);
        arch_release_thread_info(tsk->stack);
        free_thread_info(tsk->stack);

in free_task(), but I could imagine doing it earlier, and
independently of the RCU-delayed free.

In fact, I think we could just do that at exit() time synchronously. The
reference counting of the task_struct() is because a lot of other
threads can have references to the exiting thread (and we have the
tasklist and thread lists that are RCU-traversed), but none of those
other references should ever look at the stack. Or even the
thread-info.

Hmm. I bet it would show some problems, but not be technically
impossible. Especially if we make the thread-info rules be like the
SLAB_DESTROY_BY_RCU semantics - the allocation may be re-used during
the RCU grace period, but it is going to still exist and be of the
same type.

This sounds very much like something for Oleg Nesterov.

Oleg, what do you think? Would it be reasonable to free the stack and
thread_info synchronously at exit time, clear the pointer (to catch
any odd use), and only RCU-delay the task_struct itself?

That is, after all, what we already do with the VM, semaphores, files,
fs info etc. There's no real reason I see to keep the stack around.

(Obviously, we can't release it in do_exit() itself like we do some of
the other state - it would need to be released after we've scheduled
away to another process' stack, but we already have that TASK_DEAD
handling in finish_task_switch for this exact reason).

                 Linus

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
@ 2016-06-23  6:02       ` Linus Torvalds
  0 siblings, 0 replies; 269+ messages in thread
From: Linus Torvalds @ 2016-06-23  6:02 UTC (permalink / raw)
  To: Andy Lutomirski, Oleg Nesterov
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Wed, Jun 22, 2016 at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I implemented a percpu cache, and it's useless.
>
> When a task goes away, one reference is held until the next RCU grace
> period so that task_struct can be used under RCU (look for
> delayed_put_task_struct).

Yeah, that RCU batching will screw the cache idea.

But isn't it only the "task_struct" that needs that? That's a separate
allocation from the stack, which contains the "thread_info".

I think that what we *could* do is re-use the tread-info within the
RCU grace period, as long as we delay freeing the task_struct.

Yes, yes, we currently tie the task_struct and thread_info lifetimes
together very tightly, but that's a historical thing rather than a
requirement. We do the

        account_kernel_stack(tsk->stack, -1);
        arch_release_thread_info(tsk->stack);
        free_thread_info(tsk->stack);

in free_task(), but I could imagine doing it earlier, and
independently of the RCU-delayed free.

In fact, I think we just do that at exit() time synchronously. The
reference counting of the task_struct() is because a lot of other
threads can have references to the exiting thread (and we have the
tasklist and thread lists that are RCU-traversed), but none of those
other references should ever look at the stack. Or even the
thread-info.

Hmm. I bet it would show some problems, but not be technically
impossible. Especially if we make the thread-info rules be like the
SLAB_DESTROY_BY_RCU semantics - the allocation may be re-used during
the RCU grace period, but it is going to still exists and be of the
same type.

This sounds very much like something for Oleg Nesterov.

Oleg, what do you think? Would it be reasonable to free the stack and
thread_info synchronously at exit time, clear the pointer (to catch
any odd use), and only RCU-delay the task_struct itself?

That is, after all, what we already do with the VM, semaphores, files,
fs info etc. There's no real reason I see to keep the stack around.

(Obviously, we can't release it in do_exit() itself like we do some of
the other state - it would need to be released after we've scheduled
away to another process' stack, but we already have that TASK_DEAD
handling in finish_task_switch for this exact reason).

                 Linus

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23  6:02       ` Linus Torvalds
  (?)
@ 2016-06-23 14:31         ` Oleg Nesterov
  -1 siblings, 0 replies; 269+ messages in thread
From: Oleg Nesterov @ 2016-06-23 14:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On 06/22, Linus Torvalds wrote:
>
> Oleg, what do you think? Would it be reasonable to free the stack and
> thread_info synchronously at exit time, clear the pointer (to catch
> any odd use), and only RCU-delay the task_struct itself?

I didn't see the patches yet, quite possibly I misunderstood... But no,
I don't think we can do this (if we are not going to move ti->flags to
task_struct at least).

> (Obviously, we can't release it in do_exit() itself like we do some of
> the other state - it would need to be released after we've scheduled
> away to another process' stack, but we already have that TASK_DEAD
> handling in finish_task_switch for this exact reason).

Yes, but the problem is that a zombie thread can do its last schedule
before it is reaped.

Just for example, syscall_regfunc() does

		read_lock(&tasklist_lock);
		for_each_process_thread(p, t) {
			set_tsk_thread_flag(t, TIF_SYSCALL_TRACEPOINT);
		}
		read_unlock(&tasklist_lock);

and this can easily hit a TASK_DEAD thread with ->stack == NULL.

And we can't free/nullify it when the parent/debugger reaps a zombie;
say, mark_oom_victim() expects that get_task_struct() protects
thread_info as well.

Oleg.

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 14:31         ` Oleg Nesterov
  (?)
@ 2016-06-23 16:30           ` Linus Torvalds
  -1 siblings, 0 replies; 269+ messages in thread
From: Linus Torvalds @ 2016-06-23 16:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andy Lutomirski, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 7:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I didn't see the patches yet, quite possibly I misunderstood... But no,
> I don't think we can do this (if we are not going to move ti->flags to
> task_struct at least).

Argh. Yes, ti->flags is used by others. Everything else should be
thread-synchronous, but there's ti->flags.

(And if we get scheduled, the thread-synchronous things will matter, of course):

> Yes, but the problem is that a zombie thread can do its last schedule
> before it is reaped.

Worse, the wait sequence will definitely look at it.

But that does bring up another possibility: do it at wait() time, when
we do release_thread(). That's when we *used* to synchronously free
it, before we did the lockless RCU walks.

At that point, it has been removed from all the thread lists. So the
only way to find it is through the RCU walks. Do any of *those* touch
ti->flags? I'm not seeing it, and it sounds fixable if any do.

If we could release the thread stack in release_thread(), that would be good.

Andy - I bet you can at least test it.

            Linus

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 16:30           ` Linus Torvalds
  (?)
@ 2016-06-23 16:41             ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-23 16:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 9:30 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jun 23, 2016 at 7:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> I didn't see the patches yet, quite possibly I misunderstood... But no,
>> I don't think we can do this (if we are not going to move ti->flags to
>> task_struct at least).
>
> Argh. Yes, ti->flags is used by others. Everything else should be
> thread-synchronous, but there's ti->flags.
>
> (And if we get scheduled, the thread-synchronous things will matter, of course):
>
>> Yes, but the problem is that a zombie thread can do its last schedule
>> before it is reaped.
>
> Worse, the wait sequence will definitely look at it.
>
> But that does bring up another possibility: do it at wait() time, when
> we do release_thread(). That's when we *used* to synchronously free
> it, before we did the lockless RCU walks.
>
> At that point, it has been removed from all the thread lists. So the
> only way to find it is through the RCU walks. Do any of *those* touch
> ti->flags? I'm not seeing it, and it sounds fixable if any do.
>
> If we could release the thread stack in release_thread(), that would be good.
>
> Andy - I bet you can at least test it.

That sounds a bit more fragile than I'm really comfortable with,
although it'll at least oops reliably if we get it wrong.

But I'm planning on moving ti->flags (and the rest of thread_info,
either piecemeal or as a unit) into task_struct on architectures that
opt in, which, as a practical matter, hopefully means everyone who
opts in to virtual stacks.  So I'm more inclined to make all the changes
in a different order:

1. Virtually mapped stacks (off by default but merged for testing,
possibly with a warning that distros shouldn't enable it yet.)

2. thread_info cleanup (which I want to do *anyway* because it's
critical to get the full hardening benefit)

3. Free stacks immediately and cache them (really easy).

This has the benefit of being much less dependent on who accesses what
field when, and it should perform well with no churn.  I'm hoping to
have the thread_info stuff done in time for 4.8, too.

--Andy

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 16:30           ` Linus Torvalds
  (?)
@ 2016-06-23 17:03             ` Oleg Nesterov
  -1 siblings, 0 replies; 269+ messages in thread
From: Oleg Nesterov @ 2016-06-23 17:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On 06/23, Linus Torvalds wrote:
>
> But that does bring up another possibility: do it at wait() time, when
> we do release_thread(). That's when we *used* to synchronously free
> it, before we did the lockless RCU walks.

Let me quote my previous email ;)

	And we can't free/nullify it when the parent/debugger reaps a zombie;
	say, mark_oom_victim() expects that get_task_struct() protects
	thread_info as well.

probably we can fix all such users though...

> At that point, it has been removed from all the thread lists. So the
> only way to find it is through the RCU walks. Do any of *those* touch
> ti->flags? I'm not seeing it,

Neither me, although I didn't try to grep too much.

> and it sounds fixable if any do

probably yes, but this would mean that tasklist_lock protects task->stack,
which doesn't look really nice...

Oleg.

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 16:41             ` Andy Lutomirski
  (?)
@ 2016-06-23 17:10               ` Oleg Nesterov
  -1 siblings, 0 replies; 269+ messages in thread
From: Oleg Nesterov @ 2016-06-23 17:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On 06/23, Andy Lutomirski wrote:
>
> That sounds a bit more fragile than I'm really comfortable with,
> although it'll at least oops reliably if we get it wrong.
>
> But I'm planning on moving ti->flags (and the rest of thread_info,
> either piecemeal or as a unit) into task_struct on architectures that
> opt in,

I agree, this looks better. Probably it should not be that hard to fix
GET_THREAD_INFO/etc, but you know this much better than me.

Oleg.

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 17:03             ` Oleg Nesterov
  (?)
@ 2016-06-23 17:44               ` Linus Torvalds
  -1 siblings, 0 replies; 269+ messages in thread
From: Linus Torvalds @ 2016-06-23 17:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andy Lutomirski, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 10:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Let me quote my previous email ;)
>
>         And we can't free/nullify it when the parent/debuger reaps a zombie,
>         say, mark_oom_victim() expects that get_task_struct() protects
>         thread_info as well.
>
> probably we can fix all such users though...

TIF_MEMDIE is indeed a potential problem, but I don't think
mark_oom_victim() is actually problematic.

mark_oom_victim() is called with either "current", or with a victim
that still has its mm and signal pointer (and the task is locked). So
the lifetime is already guaranteed - or that code is already very very
buggy, since it follows tsk->signal and tsk->mm

So as far as I can tell, that's all fine.

That said, by now it would actually in many ways be great if we could
get rid of thread_info entirely. The historical reasons for
thread_info have almost all been subsumed by the percpu area.

The reason for thread_info originally was

 - we used to find the task_struct by just masking the stack pointer
(long, long ago). When the task struct grew too big, we kept just the
critical pieces and some arch-specific stuff, called it
"thread_info", and moved the rest to an external allocation and added
a pointer to it.

 - the really critical stuff we didn't want to follow a pointer for,
so things like preempt_count etc were in thread_info

 - but they were *so* critical that PeterZ (at my prodding) moved
those things to percpu caches that get updated at schedule time
instead

so these days, thread_info has almost nothing really critical in it
any more. There's the thread-local flags, yes, but they could stay or
easily be moved to the task_struct or get similar per-cpu fixup as
preempt_count did a couple of years ago. The only annoyance is the few
remaining entry code assembly sequences, but I suspect they would
actually become simpler with a per-cpu thing, and with Andy's cleanups
they are pretty insignificant these days. There seems to be exactly
two uses of ASM_THREAD_INFO(TI_flags, ...) left.

So I suspect that:

 (a) it would already be possible to just free the stack and thread_info
at release time, because any RCU users will already be doing task_lock()
and checking mm etc.

 (b) it probably would be a nice cleanup to try to make it even more
obviously safe by just shrinking thread_info more (or even getting rid
of it entirely, but that may be too painful because there are other
architectures that may depend on it more).

I dunno. Looking at what remains of thread_info, it really doesn't
seem very critical.

The thread_info->tsk pointer, that was one of the most critical issues
and the main raison d'être of the thread_info, has been replaced on
x86 by just using the per-cpu "current_task". Yes, there are probably
more than a few "ti->task" users left for legacy reasons, harking back
to when the thread-info was cheaper to access, but it shouldn't be a
big deal.

                  Linus

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 17:44               ` Linus Torvalds
  (?)
@ 2016-06-23 17:52                 ` Linus Torvalds
  -1 siblings, 0 replies; 269+ messages in thread
From: Linus Torvalds @ 2016-06-23 17:52 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Andy Lutomirski, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

[-- Attachment #1: Type: text/plain, Size: 1003 bytes --]

On Thu, Jun 23, 2016 at 10:44 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The thread_info->tsk pointer, that was one of the most critical issues
> and the main raison d'être of the thread_info, has been replaced on
> x86 by just using the per-cpu "current_task". Yes, there are probably
> more than a few "ti->task" users left for legacy reasons, harking back
> to when the thread-info was cheaper to access, but it shouldn't be a
> big deal.

Ugh. Looking around at this, it turns out that a great example of this
kind of legacy issue is the debug_mutex stuff.

It uses "struct thread_info *" as the owner pointer, and there is _no_
existing reason for it. In fact, in every single place it actually
wants the task_struct, and it does task_thread_info(task) just to
convert it to the thread-info, and then converts it back with
"ti->task".

So the attached patch seems to be the right thing to do regardless of
this whole discussion.

                   Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 3823 bytes --]

 kernel/locking/mutex-debug.c | 12 ++++++------
 kernel/locking/mutex-debug.h |  4 ++--
 kernel/locking/mutex.c       |  6 +++---
 kernel/locking/mutex.h       |  2 +-
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 3ef3736002d8..9c951fade415 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -49,21 +49,21 @@ void debug_mutex_free_waiter(struct mutex_waiter *waiter)
 }
 
 void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
-			    struct thread_info *ti)
+			    struct task_struct *task)
 {
 	SMP_DEBUG_LOCKS_WARN_ON(!spin_is_locked(&lock->wait_lock));
 
 	/* Mark the current thread as blocked on the lock: */
-	ti->task->blocked_on = waiter;
+	task->blocked_on = waiter;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
-			 struct thread_info *ti)
+			 struct task_struct *task)
 {
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
-	DEBUG_LOCKS_WARN_ON(waiter->task != ti->task);
-	DEBUG_LOCKS_WARN_ON(ti->task->blocked_on != waiter);
-	ti->task->blocked_on = NULL;
+	DEBUG_LOCKS_WARN_ON(waiter->task != task);
+	DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
+	task->blocked_on = NULL;
 
 	list_del_init(&waiter->list);
 	waiter->task = NULL;
diff --git a/kernel/locking/mutex-debug.h b/kernel/locking/mutex-debug.h
index 0799fd3e4cfa..d06ae3bb46c5 100644
--- a/kernel/locking/mutex-debug.h
+++ b/kernel/locking/mutex-debug.h
@@ -20,9 +20,9 @@ extern void debug_mutex_wake_waiter(struct mutex *lock,
 extern void debug_mutex_free_waiter(struct mutex_waiter *waiter);
 extern void debug_mutex_add_waiter(struct mutex *lock,
 				   struct mutex_waiter *waiter,
-				   struct thread_info *ti);
+				   struct task_struct *task);
 extern void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
-				struct thread_info *ti);
+				struct task_struct *task);
 extern void debug_mutex_unlock(struct mutex *lock);
 extern void debug_mutex_init(struct mutex *lock, const char *name,
 			     struct lock_class_key *key);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 79d2d765a75f..a70b90db3909 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -537,7 +537,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		goto skip_wait;
 
 	debug_mutex_lock_common(lock, &waiter);
-	debug_mutex_add_waiter(lock, &waiter, task_thread_info(task));
+	debug_mutex_add_waiter(lock, &waiter, task);
 
 	/* add waiting tasks to the end of the waitqueue (FIFO): */
 	list_add_tail(&waiter.list, &lock->wait_list);
@@ -584,7 +584,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	}
 	__set_task_state(task, TASK_RUNNING);
 
-	mutex_remove_waiter(lock, &waiter, current_thread_info());
+	mutex_remove_waiter(lock, &waiter, task);
 	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
 		atomic_set(&lock->count, 0);
@@ -605,7 +605,7 @@ skip_wait:
 	return 0;
 
 err:
-	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
+	mutex_remove_waiter(lock, &waiter, task);
 	spin_unlock_mutex(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, 1, ip);
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 5cda397607f2..a68bae5e852a 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -13,7 +13,7 @@
 		do { spin_lock(lock); (void)(flags); } while (0)
 #define spin_unlock_mutex(lock, flags) \
 		do { spin_unlock(lock); (void)(flags); } while (0)
-#define mutex_remove_waiter(lock, waiter, ti) \
+#define mutex_remove_waiter(lock, waiter, task) \
 		__list_del((waiter)->list.prev, (waiter)->list.next)
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 17:52                 ` Linus Torvalds
  (?)
@ 2016-06-23 18:00                   ` Kees Cook
  -1 siblings, 0 replies; 269+ messages in thread
From: Kees Cook @ 2016-06-23 18:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Peter Zijlstra, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 10:52 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jun 23, 2016 at 10:44 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> The thread_info->tsk pointer, that was one of the most critical issues
>> and the main raison d'être of the thread_info, has been replaced on
>> x86 by just using the per-cpu "current_task". Yes, there are probably
>> more than a few "ti->task" users left for legacy reasons, harking back
>> to when the thread-info was cheaper to access, but it shouldn't be a
>> big deal.
>
> Ugh. Looking around at this, it turns out that a great example of this
> kind of legacy issue is the debug_mutex stuff.
>
> It uses "struct thread_info *" as the owner pointer, and there is _no_
> existing reason for it. In fact, in every single place it actually
> wants the task_struct, and it does task_thread_info(task) just to
> convert it to the thread-info, and then converts it back with
> "ti->task".

Heh, yeah, that looks like a nice clean-up.

> So the attached patch seems to be the right thing to do regardless of
> this whole discussion.

Why does __mutex_lock_common() have "task" as a stack variable? It's
only assigned at the start, and is always "current". (I only noticed
from the patch changing "current_thread_info()" and
"task_thread_info(task)" both to "task".)

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 17:52                 ` Linus Torvalds
  (?)
@ 2016-06-23 18:12                   ` Oleg Nesterov
  -1 siblings, 0 replies; 269+ messages in thread
From: Oleg Nesterov @ 2016-06-23 18:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On 06/23, Linus Torvalds wrote:
>
> Ugh. Looking around at this, it turns out that a great example of this
> kind of legacy issue is the debug_mutex stuff.

Heh ;) I am looking at it too.

> It uses "struct thread_info *" as the owner pointer, and there is _no_
> existing reason for it. In fact, in every single place it actually
> wants the task_struct, and it does task_thread_info(task) just to
> convert it to the thread-info, and then converts it back with
> "ti->task".

Even worse, this task is always "current" afaics, so

> So the attached patch seems to be the right thing to do regardless of
> this whole discussion.

I think we should simply remove this argument.

And probably kill task_struct->blocked_on? I do not see the point of
this task->blocked_on != waiter check.

Oleg.

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 17:52                 ` Linus Torvalds
@ 2016-06-23 18:46                   ` Linus Torvalds
  -1 siblings, 0 replies; 269+ messages in thread
From: Linus Torvalds @ 2016-06-23 18:46 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Andy Lutomirski, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

[-- Attachment #1: Type: text/plain, Size: 791 bytes --]

On Thu, Jun 23, 2016 at 10:52 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ugh. Looking around at this, it turns out that a great example of this
> kind of legacy issue is the debug_mutex stuff.

Interestingly, the *only* other user of ti->task for a full
allmodconfig build of x86-64 seems to be

  arch/x86/kernel/dumpstack.c

with the print_context_stack() -> print_ftrace_graph_addr() -> task =
tinfo->task chain.

And that doesn't really seem to want thread_info either. The callers
all have 'task', and have to generate thread_info from that anyway.

So this attached patch (which includes the previous one) seems to
build. I didn't actually boot it, but there should be no users left
unless there is some asm code that has hardcoded offsets..

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 11470 bytes --]

 arch/x86/include/asm/stacktrace.h  |  6 +++---
 arch/x86/include/asm/thread_info.h |  4 +---
 arch/x86/kernel/dumpstack.c        | 22 ++++++++++------------
 arch/x86/kernel/dumpstack_64.c     |  8 +++-----
 include/linux/sched.h              |  1 -
 kernel/locking/mutex-debug.c       | 12 ++++++------
 kernel/locking/mutex-debug.h       |  4 ++--
 kernel/locking/mutex.c             |  6 +++---
 kernel/locking/mutex.h             |  2 +-
 9 files changed, 29 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 7c247e7404be..0944218af9e2 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -14,7 +14,7 @@ extern int kstack_depth_to_print;
 struct thread_info;
 struct stacktrace_ops;
 
-typedef unsigned long (*walk_stack_t)(struct thread_info *tinfo,
+typedef unsigned long (*walk_stack_t)(struct task_struct *task,
 				      unsigned long *stack,
 				      unsigned long bp,
 				      const struct stacktrace_ops *ops,
@@ -23,13 +23,13 @@ typedef unsigned long (*walk_stack_t)(struct thread_info *tinfo,
 				      int *graph);
 
 extern unsigned long
-print_context_stack(struct thread_info *tinfo,
+print_context_stack(struct task_struct *task,
 		    unsigned long *stack, unsigned long bp,
 		    const struct stacktrace_ops *ops, void *data,
 		    unsigned long *end, int *graph);
 
 extern unsigned long
-print_context_stack_bp(struct thread_info *tinfo,
+print_context_stack_bp(struct task_struct *task,
 		       unsigned long *stack, unsigned long bp,
 		       const struct stacktrace_ops *ops, void *data,
 		       unsigned long *end, int *graph);
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 30c133ac05cd..420acbf477ff 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -53,18 +53,16 @@ struct task_struct;
 #include <linux/atomic.h>
 
 struct thread_info {
-	struct task_struct	*task;		/* main task structure */
 	__u32			flags;		/* low level flags */
 	__u32			status;		/* thread synchronous flags */
 	__u32			cpu;		/* current CPU */
-	mm_segment_t		addr_limit;
 	unsigned int		sig_on_uaccess_error:1;
 	unsigned int		uaccess_err:1;	/* uaccess failed */
+	mm_segment_t		addr_limit;
 };
 
 #define INIT_THREAD_INFO(tsk)			\
 {						\
-	.task		= &tsk,			\
 	.flags		= 0,			\
 	.cpu		= 0,			\
 	.addr_limit	= KERNEL_DS,		\
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 2bb25c3fe2e8..d6209f3a69cb 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -42,16 +42,14 @@ void printk_address(unsigned long address)
 static void
 print_ftrace_graph_addr(unsigned long addr, void *data,
 			const struct stacktrace_ops *ops,
-			struct thread_info *tinfo, int *graph)
+			struct task_struct *task, int *graph)
 {
-	struct task_struct *task;
 	unsigned long ret_addr;
 	int index;
 
 	if (addr != (unsigned long)return_to_handler)
 		return;
 
-	task = tinfo->task;
 	index = task->curr_ret_stack;
 
 	if (!task->ret_stack || index < *graph)
@@ -68,7 +66,7 @@ print_ftrace_graph_addr(unsigned long addr, void *data,
 static inline void
 print_ftrace_graph_addr(unsigned long addr, void *data,
 			const struct stacktrace_ops *ops,
-			struct thread_info *tinfo, int *graph)
+			struct task_struct *task, int *graph)
 { }
 #endif
 
@@ -79,10 +77,10 @@ print_ftrace_graph_addr(unsigned long addr, void *data,
  * severe exception (double fault, nmi, stack fault, debug, mce) hardware stack
  */
 
-static inline int valid_stack_ptr(struct thread_info *tinfo,
+static inline int valid_stack_ptr(struct task_struct *task,
 			void *p, unsigned int size, void *end)
 {
-	void *t = tinfo;
+	void *t = task_thread_info(task);
 	if (end) {
 		if (p < end && p >= (end-THREAD_SIZE))
 			return 1;
@@ -93,14 +91,14 @@ static inline int valid_stack_ptr(struct thread_info *tinfo,
 }
 
 unsigned long
-print_context_stack(struct thread_info *tinfo,
+print_context_stack(struct task_struct *task,
 		unsigned long *stack, unsigned long bp,
 		const struct stacktrace_ops *ops, void *data,
 		unsigned long *end, int *graph)
 {
 	struct stack_frame *frame = (struct stack_frame *)bp;
 
-	while (valid_stack_ptr(tinfo, stack, sizeof(*stack), end)) {
+	while (valid_stack_ptr(task, stack, sizeof(*stack), end)) {
 		unsigned long addr;
 
 		addr = *stack;
@@ -112,7 +110,7 @@ print_context_stack(struct thread_info *tinfo,
 			} else {
 				ops->address(data, addr, 0);
 			}
-			print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
+			print_ftrace_graph_addr(addr, data, ops, task, graph);
 		}
 		stack++;
 	}
@@ -121,7 +119,7 @@ print_context_stack(struct thread_info *tinfo,
 EXPORT_SYMBOL_GPL(print_context_stack);
 
 unsigned long
-print_context_stack_bp(struct thread_info *tinfo,
+print_context_stack_bp(struct task_struct *task,
 		       unsigned long *stack, unsigned long bp,
 		       const struct stacktrace_ops *ops, void *data,
 		       unsigned long *end, int *graph)
@@ -129,7 +127,7 @@ print_context_stack_bp(struct thread_info *tinfo,
 	struct stack_frame *frame = (struct stack_frame *)bp;
 	unsigned long *ret_addr = &frame->return_address;
 
-	while (valid_stack_ptr(tinfo, ret_addr, sizeof(*ret_addr), end)) {
+	while (valid_stack_ptr(task, ret_addr, sizeof(*ret_addr), end)) {
 		unsigned long addr = *ret_addr;
 
 		if (!__kernel_text_address(addr))
@@ -139,7 +137,7 @@ print_context_stack_bp(struct thread_info *tinfo,
 			break;
 		frame = frame->next_frame;
 		ret_addr = &frame->return_address;
-		print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
+		print_ftrace_graph_addr(addr, data, ops, task, graph);
 	}
 
 	return (unsigned long)frame;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 5f1c6266eb30..d558a8a49016 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -153,7 +153,6 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 		const struct stacktrace_ops *ops, void *data)
 {
 	const unsigned cpu = get_cpu();
-	struct thread_info *tinfo;
 	unsigned long *irq_stack = (unsigned long *)per_cpu(irq_stack_ptr, cpu);
 	unsigned long dummy;
 	unsigned used = 0;
@@ -179,7 +178,6 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	 * current stack address. If the stacks consist of nested
 	 * exceptions
 	 */
-	tinfo = task_thread_info(task);
 	while (!done) {
 		unsigned long *stack_end;
 		enum stack_type stype;
@@ -202,7 +200,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 			if (ops->stack(data, id) < 0)
 				break;
 
-			bp = ops->walk_stack(tinfo, stack, bp, ops,
+			bp = ops->walk_stack(task, stack, bp, ops,
 					     data, stack_end, &graph);
 			ops->stack(data, "<EOE>");
 			/*
@@ -218,7 +216,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 
 			if (ops->stack(data, "IRQ") < 0)
 				break;
-			bp = ops->walk_stack(tinfo, stack, bp,
+			bp = ops->walk_stack(task, stack, bp,
 				     ops, data, stack_end, &graph);
 			/*
 			 * We link to the next stack (which would be
@@ -240,7 +238,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	/*
 	 * This handles the process stack:
 	 */
-	bp = ops->walk_stack(tinfo, stack, bp, ops, data, NULL, &graph);
+	bp = ops->walk_stack(task, stack, bp, ops, data, NULL, &graph);
 	put_cpu();
 }
 EXPORT_SYMBOL(dump_trace);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada26345..17be3f2507f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2975,7 +2975,6 @@ static inline void threadgroup_change_end(struct task_struct *tsk)
 static inline void setup_thread_stack(struct task_struct *p, struct task_struct *org)
 {
 	*task_thread_info(p) = *task_thread_info(org);
-	task_thread_info(p)->task = p;
 }
 
 /*
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 3ef3736002d8..9c951fade415 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -49,21 +49,21 @@ void debug_mutex_free_waiter(struct mutex_waiter *waiter)
 }
 
 void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
-			    struct thread_info *ti)
+			    struct task_struct *task)
 {
 	SMP_DEBUG_LOCKS_WARN_ON(!spin_is_locked(&lock->wait_lock));
 
 	/* Mark the current thread as blocked on the lock: */
-	ti->task->blocked_on = waiter;
+	task->blocked_on = waiter;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
-			 struct thread_info *ti)
+			 struct task_struct *task)
 {
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
-	DEBUG_LOCKS_WARN_ON(waiter->task != ti->task);
-	DEBUG_LOCKS_WARN_ON(ti->task->blocked_on != waiter);
-	ti->task->blocked_on = NULL;
+	DEBUG_LOCKS_WARN_ON(waiter->task != task);
+	DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
+	task->blocked_on = NULL;
 
 	list_del_init(&waiter->list);
 	waiter->task = NULL;
diff --git a/kernel/locking/mutex-debug.h b/kernel/locking/mutex-debug.h
index 0799fd3e4cfa..d06ae3bb46c5 100644
--- a/kernel/locking/mutex-debug.h
+++ b/kernel/locking/mutex-debug.h
@@ -20,9 +20,9 @@ extern void debug_mutex_wake_waiter(struct mutex *lock,
 extern void debug_mutex_free_waiter(struct mutex_waiter *waiter);
 extern void debug_mutex_add_waiter(struct mutex *lock,
 				   struct mutex_waiter *waiter,
-				   struct thread_info *ti);
+				   struct task_struct *task);
 extern void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
-				struct thread_info *ti);
+				struct task_struct *task);
 extern void debug_mutex_unlock(struct mutex *lock);
 extern void debug_mutex_init(struct mutex *lock, const char *name,
 			     struct lock_class_key *key);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 79d2d765a75f..a70b90db3909 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -537,7 +537,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		goto skip_wait;
 
 	debug_mutex_lock_common(lock, &waiter);
-	debug_mutex_add_waiter(lock, &waiter, task_thread_info(task));
+	debug_mutex_add_waiter(lock, &waiter, task);
 
 	/* add waiting tasks to the end of the waitqueue (FIFO): */
 	list_add_tail(&waiter.list, &lock->wait_list);
@@ -584,7 +584,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	}
 	__set_task_state(task, TASK_RUNNING);
 
-	mutex_remove_waiter(lock, &waiter, current_thread_info());
+	mutex_remove_waiter(lock, &waiter, task);
 	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
 		atomic_set(&lock->count, 0);
@@ -605,7 +605,7 @@ skip_wait:
 	return 0;
 
 err:
-	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
+	mutex_remove_waiter(lock, &waiter, task);
 	spin_unlock_mutex(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, 1, ip);
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 5cda397607f2..a68bae5e852a 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -13,7 +13,7 @@
 		do { spin_lock(lock); (void)(flags); } while (0)
 #define spin_unlock_mutex(lock, flags) \
 		do { spin_unlock(lock); (void)(flags); } while (0)
-#define mutex_remove_waiter(lock, waiter, ti) \
+#define mutex_remove_waiter(lock, waiter, task) \
 		__list_del((waiter)->list.prev, (waiter)->list.next)
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER

^ permalink raw reply	[flat|nested] 269+ messages in thread

 			break;
 		frame = frame->next_frame;
 		ret_addr = &frame->return_address;
-		print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
+		print_ftrace_graph_addr(addr, data, ops, task, graph);
 	}
 
 	return (unsigned long)frame;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 5f1c6266eb30..d558a8a49016 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -153,7 +153,6 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 		const struct stacktrace_ops *ops, void *data)
 {
 	const unsigned cpu = get_cpu();
-	struct thread_info *tinfo;
 	unsigned long *irq_stack = (unsigned long *)per_cpu(irq_stack_ptr, cpu);
 	unsigned long dummy;
 	unsigned used = 0;
@@ -179,7 +178,6 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	 * current stack address. If the stacks consist of nested
 	 * exceptions
 	 */
-	tinfo = task_thread_info(task);
 	while (!done) {
 		unsigned long *stack_end;
 		enum stack_type stype;
@@ -202,7 +200,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 			if (ops->stack(data, id) < 0)
 				break;
 
-			bp = ops->walk_stack(tinfo, stack, bp, ops,
+			bp = ops->walk_stack(task, stack, bp, ops,
 					     data, stack_end, &graph);
 			ops->stack(data, "<EOE>");
 			/*
@@ -218,7 +216,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 
 			if (ops->stack(data, "IRQ") < 0)
 				break;
-			bp = ops->walk_stack(tinfo, stack, bp,
+			bp = ops->walk_stack(task, stack, bp,
 				     ops, data, stack_end, &graph);
 			/*
 			 * We link to the next stack (which would be
@@ -240,7 +238,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	/*
 	 * This handles the process stack:
 	 */
-	bp = ops->walk_stack(tinfo, stack, bp, ops, data, NULL, &graph);
+	bp = ops->walk_stack(task, stack, bp, ops, data, NULL, &graph);
 	put_cpu();
 }
 EXPORT_SYMBOL(dump_trace);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada26345..17be3f2507f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2975,7 +2975,6 @@ static inline void threadgroup_change_end(struct task_struct *tsk)
 static inline void setup_thread_stack(struct task_struct *p, struct task_struct *org)
 {
 	*task_thread_info(p) = *task_thread_info(org);
-	task_thread_info(p)->task = p;
 }
 
 /*
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 3ef3736002d8..9c951fade415 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -49,21 +49,21 @@ void debug_mutex_free_waiter(struct mutex_waiter *waiter)
 }
 
 void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
-			    struct thread_info *ti)
+			    struct task_struct *task)
 {
 	SMP_DEBUG_LOCKS_WARN_ON(!spin_is_locked(&lock->wait_lock));
 
 	/* Mark the current thread as blocked on the lock: */
-	ti->task->blocked_on = waiter;
+	task->blocked_on = waiter;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
-			 struct thread_info *ti)
+			 struct task_struct *task)
 {
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
-	DEBUG_LOCKS_WARN_ON(waiter->task != ti->task);
-	DEBUG_LOCKS_WARN_ON(ti->task->blocked_on != waiter);
-	ti->task->blocked_on = NULL;
+	DEBUG_LOCKS_WARN_ON(waiter->task != task);
+	DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
+	task->blocked_on = NULL;
 
 	list_del_init(&waiter->list);
 	waiter->task = NULL;
diff --git a/kernel/locking/mutex-debug.h b/kernel/locking/mutex-debug.h
index 0799fd3e4cfa..d06ae3bb46c5 100644
--- a/kernel/locking/mutex-debug.h
+++ b/kernel/locking/mutex-debug.h
@@ -20,9 +20,9 @@ extern void debug_mutex_wake_waiter(struct mutex *lock,
 extern void debug_mutex_free_waiter(struct mutex_waiter *waiter);
 extern void debug_mutex_add_waiter(struct mutex *lock,
 				   struct mutex_waiter *waiter,
-				   struct thread_info *ti);
+				   struct task_struct *task);
 extern void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
-				struct thread_info *ti);
+				struct task_struct *task);
 extern void debug_mutex_unlock(struct mutex *lock);
 extern void debug_mutex_init(struct mutex *lock, const char *name,
 			     struct lock_class_key *key);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 79d2d765a75f..a70b90db3909 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -537,7 +537,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		goto skip_wait;
 
 	debug_mutex_lock_common(lock, &waiter);
-	debug_mutex_add_waiter(lock, &waiter, task_thread_info(task));
+	debug_mutex_add_waiter(lock, &waiter, task);
 
 	/* add waiting tasks to the end of the waitqueue (FIFO): */
 	list_add_tail(&waiter.list, &lock->wait_list);
@@ -584,7 +584,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	}
 	__set_task_state(task, TASK_RUNNING);
 
-	mutex_remove_waiter(lock, &waiter, current_thread_info());
+	mutex_remove_waiter(lock, &waiter, task);
 	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
 		atomic_set(&lock->count, 0);
@@ -605,7 +605,7 @@ skip_wait:
 	return 0;
 
 err:
-	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
+	mutex_remove_waiter(lock, &waiter, task);
 	spin_unlock_mutex(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, 1, ip);
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 5cda397607f2..a68bae5e852a 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -13,7 +13,7 @@
 		do { spin_lock(lock); (void)(flags); } while (0)
 #define spin_unlock_mutex(lock, flags) \
 		do { spin_unlock(lock); (void)(flags); } while (0)
-#define mutex_remove_waiter(lock, waiter, ti) \
+#define mutex_remove_waiter(lock, waiter, task) \
 		__list_del((waiter)->list.prev, (waiter)->list.next)
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 17:44               ` Linus Torvalds
@ 2016-06-23 18:52                 ` Oleg Nesterov
  -1 siblings, 0 replies; 269+ messages in thread
From: Oleg Nesterov @ 2016-06-23 18:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On 06/23, Linus Torvalds wrote:
>
> On Thu, Jun 23, 2016 at 10:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Let me quote my previous email ;)
> >
> >         And we can't free/nullify it when the parent/debugger reaps a zombie,
> >         say, mark_oom_victim() expects that get_task_struct() protects
> >         thread_info as well.
> >
> > probably we can fix all such users though...
>
> TIF_MEMDIE is indeed a potential problem, but I don't think
> mark_oom_victim() is actually problematic.
>
> mark_oom_victim() is called with either "current",

This is no longer true in -mm tree.

But I agree, this is fixable (and in fact I still hope TIF_MEMDIE will die,
at least in its current form).

But I am afraid we can have more users which assume that thread_info can't
go away if you have a reference to task_struct.

And yes, we have users which rely on RCU, say show_state_filter(), which
walks the task under rcu_read_lock() and calls sched_show_task() which prints
task_thread_info(p)->flags.

Yes this is fixable too, but

> so these days, thread_info has almost nothing really critical in it
> any more. There's the thread-local flags, yes, but they could stay or
> easily be moved to the task_struct or get similar per-cpu fixup as
> preempt_count did a couple of years ago. The only annoyance is the few
> remaining entry code assembly sequences, but I suspect they would
> actually become simpler with a per-cpu thing, and with Andy's cleanups
> they are pretty insignificant these days. There seems to be exactly
> two uses of ASM_THREAD_INFO(TI_flags,.. left.

So perhaps on x86_64 we should move thread_info from thread_union to
task_struct->thread as Andy suggests.


And just in case, even if we move thread_info, of course we will need
to change dump_trace/etc which reads ->stack. Again, show_state_filter()
relies on RCU, proc_pid_stack() on get_task_struct(). They need to pin
task->stack somehow, but this is clear.

Oleg.


* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 17:52                 ` Linus Torvalds
@ 2016-06-23 18:53                   ` Peter Zijlstra
  -1 siblings, 0 replies; 269+ messages in thread
From: Peter Zijlstra @ 2016-06-23 18:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 10:52:58AM -0700, Linus Torvalds wrote:
> Ugh. Looking around at this, it turns out that a great example of this
> kind of legacy issue is the debug_mutex stuff.
> 
> It uses "struct thread_info *" as the owner pointer, and there is _no_
> existing reason for it. In fact, in every single place it actually
> wants the task_struct, and it does task_thread_info(task) just to
> convert it to the thread-info, and then converts it back with
> "ti->task".
> 
> So the attached patch seems to be the right thing to do regardless of
> this whole discussion.

Yeah, that looks fine. Want me to take it or will you just commit?


* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 18:00                   ` Kees Cook
@ 2016-06-23 18:54                     ` Peter Zijlstra
  -1 siblings, 0 replies; 269+ messages in thread
From: Peter Zijlstra @ 2016-06-23 18:54 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linus Torvalds, Oleg Nesterov, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Brian Gerst, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 11:00:08AM -0700, Kees Cook wrote:
> 
> Why does __mutex_lock_common() have "task" as a stack variable?

That's actually a fairly common thing to do. The reason is that
'current' is far more expensive to evaluate than a local variable.


* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 18:12                   ` Oleg Nesterov
@ 2016-06-23 18:55                     ` Peter Zijlstra
  -1 siblings, 0 replies; 269+ messages in thread
From: Peter Zijlstra @ 2016-06-23 18:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 08:12:16PM +0200, Oleg Nesterov wrote:
> 
> And probably kill task_struct->blocked_on? I do not see the point of
> this task->blocked_on != waiter check.

I think that came about because of PI and/or deadlock detection. Of
course, the current mutex code doesn't have anything like that these
days, and rt_mutex has task_struct::pi_blocked_on.


* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 18:46                   ` Linus Torvalds
@ 2016-06-23 19:08                     ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-23 19:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Peter Zijlstra, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 11:46 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jun 23, 2016 at 10:52 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Ugh. Looking around at this, it turns out that a great example of this
>> kind of legacy issue is the debug_mutex stuff.
>
> Interestingly, the *only* other user of ti->task for a full
> allmodconfig build of x86-64 seems to be
>
>   arch/x86/kernel/dumpstack.c
>
> with the print_context_stack() -> print_ftrace_graph_addr() -> task =
> tinfo->task chain.
>
> And that doesn't really seem to want thread_info either. The callers
> all have 'task', and have to generate thread_info from that anyway.
>
> So this attached patch (which includes the previous one) seems to
> build. I didn't actually boot it, but there should be no users left
> unless there is some asm code that has hardcoded offsets..

I think you'll break some architectures when you remove the
initialization of ti->task.  That either needs to be pushed down into
arch code in unicore32, openrisc, microblaze, powerpc, xtensa, sparc,
parisc, arm, mips, s390, and whatever I missed, or you should leave
the field initialized and existing and wait for my patch to
conditionally remove/embed thread_info to get rid of the
initialization part.

On the C side, there's:

arm's contextidr_notifier (easily fixable)

sh's irqctx->tinfo.task = curctx->task; (probably useless) and
print_ftrace_graph_addr

cris's ugdb_trap_user (probably easily fixable)

sparc's arch_trigger_all_cpu_backtrace (possibly quite hard to fix)

sparc's flush_thread (trivial)

sparc's __save_stack_trace (not sure)

unicore's __die (probably easy)

metag's do_softirq_own_stack (not sure if it's useful)


I found these with this coccinelle script:

@@
struct thread_info *ti;
@@

* ti->task

$ spatch --sp-file titask.cocci --dir .


* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 18:53                   ` Peter Zijlstra
@ 2016-06-23 19:09                     ` Andy Lutomirski
  -1 siblings, 0 replies; 269+ messages in thread
From: Andy Lutomirski @ 2016-06-23 19:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 11:53 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jun 23, 2016 at 10:52:58AM -0700, Linus Torvalds wrote:
>> Ugh. Looking around at this, it turns out that a great example of this
>> kind of legacy issue is the debug_mutex stuff.
>>
>> It uses "struct thread_info *" as the owner pointer, and there is _no_
>> existing reason for it. In fact, in every single place it actually
>> wants the task_struct, and it does task_thread_info(task) just to
>> convert it to the thread-info, and then converts it back with
>> "ti->task".
>>
>> So the attached patch seems to be the right thing to do regardless of
>> this whole discussion.
>
> Yeah, that looks fine. Want me to take it or will you just commit?

PeterZ, mind if I split it into a couple of patches, test it, and add
it to my series?

--Andy

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 14:31         ` Oleg Nesterov
@ 2016-06-23 19:11           ` Peter Zijlstra
  -1 siblings, 0 replies; 269+ messages in thread
From: Peter Zijlstra @ 2016-06-23 19:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 04:31:26PM +0200, Oleg Nesterov wrote:
> On 06/22, Linus Torvalds wrote:
> >
> > Oleg, what do you think? Would it be reasonable to free the stack and
> > thread_info synchronously at exit time, clear the pointer (to catch
> > any odd use), and only RCU-delay the task_struct itself?
> 
> I didn't see the patches yet, quite possibly I misunderstood... But no,
> I don't think we can do this (if we are not going to move ti->flags to
> task_struct at least).

Didn't we talk about using SLAB_DESTROY_BY_RCU for task_struct before?
If that is possible, a reuse in per-cpu cache is equally possible.

All we really want to guarantee is that the memory remains a
task_struct, it need not remain the same task, right?

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 19:09                     ` Andy Lutomirski
@ 2016-06-23 19:13                       ` Peter Zijlstra
  -1 siblings, 0 replies; 269+ messages in thread
From: Peter Zijlstra @ 2016-06-23 19:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Oleg Nesterov, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 12:09:53PM -0700, Andy Lutomirski wrote:
> On Thu, Jun 23, 2016 at 11:53 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Yeah, that looks fine. Want me to take it or will you just commit?
> 
> PeterZ, mind if I split it into a couple of patches, test it, and add
> it to my series?

Not at all, keep me on Cc?

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 18:53                   ` Peter Zijlstra
@ 2016-06-23 19:17                     ` Linus Torvalds
  -1 siblings, 0 replies; 269+ messages in thread
From: Linus Torvalds @ 2016-06-23 19:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 11:53 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> So the attached patch seems to be the right thing to do regardless of
>> this whole discussion.
>
> Yeah, that looks fine. Want me to take it or will you just commit?

I'm committing these trivial non-semantic patches, I'm actually
running the kernel without any ti->task pointer now (the previous
patch I sent out).

So I'll do the mutex debug patch and the stack dump patch as just the
obvious cleanup patches.

Those are the "purely legacy reasons for a bad calling convention",
and I'm ok with those during the rc series to make it easier for
people to play around with this.

With the goal being that I'm hoping that we can then actually get rid
of this (at least on x86-64, even if we leave it in some other
architectures) in 4.8.

                    Linus

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 19:11           ` Peter Zijlstra
@ 2016-06-23 19:34             ` Linus Torvalds
  -1 siblings, 0 replies; 269+ messages in thread
From: Linus Torvalds @ 2016-06-23 19:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 12:11 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Didn't we talk about using SLAB_DESTROY_BY_RCU for task_struct before?
> If that is possible, a reuse in per-cpu cache is equally possible.
>
> All we really want to guarantee is that the memory remains a
> task_struct, it need not remain the same task, right?

No, we can't do SLAB_DESTROY_BY_RCU for the task_struct itself,
because the RCU list traversal does expect that the thread and task
lists are stable even if it walks into a "stale" struct task_struct.

If we re-use the task-struct before the RCU grace period is over, then
the list walker might end up walking into the wrong thread group
(bad!) or seeing tasks twice on the task list (also bad, although
perhaps not _as_ bad).

The _other_ fields might be ok, but updating the very list fields that
we walk with RCU is a no-no.

Basically, SLAB_DESTROY_BY_RCU is fine only for things where the RCU
field use is idempotent. So for things where the RCU walker only looks
at entries that don't matter semantically, or where it does things
like "lock/unlock" on a lock that is still valid.

It's actually fairly rare that we can use SLAB_DESTROY_BY_RCU. We have
that sighand thing, and there's a couple of networking uses for the
request_sock and socket slabs. And I sincerely hope the socket slab
RCU lists are safe, because it's dangerous.

               Linus

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 19:34             ` Linus Torvalds
@ 2016-06-23 19:46               ` Peter Zijlstra
  -1 siblings, 0 replies; 269+ messages in thread
From: Peter Zijlstra @ 2016-06-23 19:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 12:34:24PM -0700, Linus Torvalds wrote:
> On Thu, Jun 23, 2016 at 12:11 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Didn't we talk about using SLAB_DESTROY_BY_RCU for task_struct before?
> > If that is possible, a reuse in per-cpu cache is equally possible.
> >
> > All we really want to guarantee is that the memory remains a
> > task_struct, it need not remain the same task, right?
> 
> No, we can't do SLAB_DESTROY_BY_RCU for the task_struct itself,
> because the RCU list traversal does expect that the thread and task
> lists are stable even if it walks into a "stale" struct task_struct.

Indeed.

OK, so the situation we talked about before is different, we wanted to
do SLAB_DESTROY_BY_RCU on top of the existing delayed_put_task_struct()
to get a double grace period.

The problem was for things like rq->curr, which isn't RCU managed as
such, we could still do:

	rcu_read_lock();
	task = rq->curr;

and rely on task being _a_ task_struct, even though it might not be the
self-same task we thought we had.

So yes, not an option and I was stitching together two half remembered
situations to create utter nonsense.

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 19:17                     ` Linus Torvalds
@ 2016-06-24  6:17                       ` Linus Torvalds
  -1 siblings, 0 replies; 269+ messages in thread
From: Linus Torvalds @ 2016-06-24  6:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

[-- Attachment #1: Type: text/plain, Size: 2297 bytes --]

On Thu, Jun 23, 2016 at 12:17 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> With the goal being that I'm hoping that we can then actually get rid
> of this (at least on x86-64, even if we leave it in some other
> architectures) in 4.8.

The context here was that we could almost get rid of thread-info
entirely, at least for x86-64, by moving it into struct task_struct.

It turns out that we're not *that* far off after the obvious cleanups
I already committed, but I couldn't get things quite to work.

I'm attaching a patch that I wrote today that doesn't boot, but "looks
right". The reason I'm attaching it is because I'm hoping somebody
wants to take a look and maybe see what else I missed, but mostly
because I think the patch is interesting in a couple of cases where we
just do incredibly ugly things.

First off, some code that Andy wrote when he re-organized the entry path.

Oh Gods, Andy. That pt_regs_to_thread_info() thing made me want to do
unspeakable acts on a poor innocent wax figure that looked _exactly_
like you.

I just got rid of pt_regs_to_thread_info() entirely, and just replaced
it with current_thread_info().  I'm not at all convinced that trying
to be that clever was really a good idea.

Secondly, the x86-64 ret_from_fork calling convention was documented
wrongly. It says %rdi contains the previous task pointer. Yes it does,
but it doesn't mention that %r8 is supposed to contain the new
thread_info. That was fun to find.

And thirdly, the stack size games that asm/kprobes.h plays are just
disgusting. I stared at that code for much too long. I may in fact be
going blind as a result.

The rest was fairly straightforward, although since the end result
doesn't actually work, that "straightforward" may be broken too. But
the basic approach _looks_ sane.

Comments? Anybody want to play with this and see where I went wrong?

(Note - this patch was written on top of the two thread-info removal
patches I committed in

   da01e18a37a5 x86: avoid avoid passing around 'thread_info' in stack
dumping code
   6720a305df74 locking: avoid passing around 'thread_info' in mutex
debugging code

and depends on them, since "ti->task" no longer exists with
CONFIG_THREAD_INFO_IN_TASK. "ti" and "task" will have the same value).

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 17448 bytes --]

This is a non-working attempt at moving the thread_info into the
task_struct

 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 21 +++++++--------------
 arch/x86/entry/entry_64.S          |  9 ++++++---
 arch/x86/include/asm/kprobes.h     | 12 ++++++------
 arch/x86/include/asm/switch_to.h   |  6 ++----
 arch/x86/include/asm/thread_info.h | 38 ++++----------------------------------
 arch/x86/kernel/dumpstack.c        |  2 +-
 arch/x86/kernel/irq_32.c           |  2 --
 arch/x86/kernel/irq_64.c           |  3 +--
 arch/x86/kernel/process.c          |  6 ++----
 arch/x86/um/ptrace_32.c            |  8 ++++----
 include/linux/init_task.h          |  9 +++++++++
 include/linux/sched.h              | 14 +++++++++++++-
 init/Kconfig                       |  3 +++
 init/init_task.c                   |  7 +++++--
 15 files changed, 64 insertions(+), 77 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d9a94da0c29f..f33bc80577c5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -154,6 +154,7 @@ config X86
 	select SPARSE_IRQ
 	select SRCU
 	select SYSCTL_EXCEPTION_TRACE
+	select THREAD_INFO_IN_TASK
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_DEV_DMA_OPS			if X86_64
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ec138e538c44..d5feac5f252d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -31,13 +31,6 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
-static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
-{
-	unsigned long top_of_stack =
-		(unsigned long)(regs + 1) + TOP_OF_KERNEL_STACK_PADDING;
-	return (struct thread_info *)(top_of_stack - THREAD_SIZE);
-}
-
 #ifdef CONFIG_CONTEXT_TRACKING
 /* Called on entry from user mode with IRQs off. */
 __visible void enter_from_user_mode(void)
@@ -78,7 +71,7 @@ static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
  */
 unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	unsigned long ret = 0;
 	u32 work;
 
@@ -156,7 +149,7 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
 				unsigned long phase1_result)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	long ret = 0;
 	u32 work = ACCESS_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;
 
@@ -239,7 +232,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
-		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
+		cached_flags = READ_ONCE(current_thread_info()->flags);
 
 		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
 			break;
@@ -250,7 +243,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 /* Called with IRQs disabled. */
 __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	u32 cached_flags;
 
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING) && WARN_ON(!irqs_disabled()))
@@ -309,7 +302,7 @@ static void syscall_slow_exit_work(struct pt_regs *regs, u32 cached_flags)
  */
 __visible inline void syscall_return_slowpath(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	u32 cached_flags = READ_ONCE(ti->flags);
 
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
@@ -332,7 +325,7 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
 #ifdef CONFIG_X86_64
 __visible void do_syscall_64(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	unsigned long nr = regs->orig_ax;
 
 	enter_from_user_mode();
@@ -365,7 +358,7 @@ __visible void do_syscall_64(struct pt_regs *regs)
  */
 static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	unsigned int nr = (unsigned int)regs->orig_ax;
 
 #ifdef CONFIG_IA32_EMULATION
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1807ed..f49742de2c65 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -179,7 +179,8 @@ GLOBAL(entry_SYSCALL_64_after_swapgs)
 	 * If we need to do entry work or if we guess we'll need to do
 	 * exit work, go straight to the slow path.
 	 */
-	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	GET_THREAD_INFO(%r11)
+	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TI_flags(%r11)
 	jnz	entry_SYSCALL64_slow_path
 
 entry_SYSCALL_64_fastpath:
@@ -217,7 +218,8 @@ entry_SYSCALL_64_fastpath:
 	 */
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl	$_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	GET_THREAD_INFO(%r11)
+	testl	$_TIF_ALLWORK_MASK, TI_flags(%r11)
 	jnz	1f
 
 	LOCKDEP_SYS_EXIT
@@ -368,9 +370,10 @@ END(ptregs_\func)
  * A newly forked process directly context switches into this address.
  *
  * rdi: prev task we switched from
+ * rsi: task we're switching to
  */
 ENTRY(ret_from_fork)
-	LOCK ; btr $TIF_FORK, TI_flags(%r8)
+	LOCK ; btr $TIF_FORK, TI_flags(%rsi)	/* rsi: this newly forked task */
 
 	call	schedule_tail			/* rdi: 'prev' task parameter */
 
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 4421b5da409d..1d2997e74b08 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -38,12 +38,12 @@ typedef u8 kprobe_opcode_t;
 #define RELATIVECALL_OPCODE 0xe8
 #define RELATIVE_ADDR_SIZE 4
 #define MAX_STACK_SIZE 64
-#define MIN_STACK_SIZE(ADDR)					       \
-	(((MAX_STACK_SIZE) < (((unsigned long)current_thread_info()) + \
-			      THREAD_SIZE - (unsigned long)(ADDR)))    \
-	 ? (MAX_STACK_SIZE)					       \
-	 : (((unsigned long)current_thread_info()) +		       \
-	    THREAD_SIZE - (unsigned long)(ADDR)))
+
+#define current_stack_top() ((unsigned long)task_stack_page(current)+THREAD_SIZE)
+#define current_stack_size(ADDR) (current_stack_top() - (unsigned long)(ADDR))
+
+#define MIN_STACK_SIZE(ADDR) \
+	(MAX_STACK_SIZE < current_stack_size(ADDR) ? MAX_STACK_SIZE : current_stack_size(ADDR))
 
 #define flush_insn_slot(p)	do { } while (0)
 
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8f321a1b03a1..ae0aa0612c67 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -110,18 +110,16 @@ do {									\
 	     "call __switch_to\n\t"					  \
 	     "movq "__percpu_arg([current_task])",%%rsi\n\t"		  \
 	     __switch_canary						  \
-	     "movq %P[thread_info](%%rsi),%%r8\n\t"			  \
 	     "movq %%rax,%%rdi\n\t" 					  \
-	     "testl  %[_tif_fork],%P[ti_flags](%%r8)\n\t"		  \
+	     "testl  %[_tif_fork],%P[ti_flags](%%rsi)\n\t"		  \
 	     "jnz   ret_from_fork\n\t"					  \
 	     RESTORE_CONTEXT						  \
 	     : "=a" (last)					  	  \
 	       __switch_canary_oparam					  \
 	     : [next] "S" (next), [prev] "D" (prev),			  \
 	       [threadrsp] "i" (offsetof(struct task_struct, thread.sp)), \
-	       [ti_flags] "i" (offsetof(struct thread_info, flags)),	  \
+	       [ti_flags] "i" (offsetof(struct task_struct, thread_info.flags)),	  \
 	       [_tif_fork] "i" (_TIF_FORK),			  	  \
-	       [thread_info] "i" (offsetof(struct task_struct, stack)),   \
 	       [current_task] "m" (current_task)			  \
 	       __switch_canary_iparam					  \
 	     : "memory", "cc" __EXTRA_CLOBBER)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 30c133ac05cd..eef687fdc90d 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -53,24 +53,22 @@ struct task_struct;
 #include <linux/atomic.h>
 
 struct thread_info {
-	struct task_struct	*task;		/* main task structure */
 	__u32			flags;		/* low level flags */
 	__u32			status;		/* thread synchronous flags */
 	__u32			cpu;		/* current CPU */
-	mm_segment_t		addr_limit;
 	unsigned int		sig_on_uaccess_error:1;
 	unsigned int		uaccess_err:1;	/* uaccess failed */
+	mm_segment_t		addr_limit;
 };
 
 #define INIT_THREAD_INFO(tsk)			\
 {						\
-	.task		= &tsk,			\
 	.flags		= 0,			\
 	.cpu		= 0,			\
 	.addr_limit	= KERNEL_DS,		\
 }
 
-#define init_thread_info	(init_thread_union.thread_info)
+#define init_thread_info	(init_task.thread_info)
 #define init_stack		(init_thread_union.stack)
 
 #else /* !__ASSEMBLY__ */
@@ -166,7 +164,7 @@ struct thread_info {
 
 static inline struct thread_info *current_thread_info(void)
 {
-	return (struct thread_info *)(current_top_of_stack() - THREAD_SIZE);
+	return (struct thread_info *)current;
 }
 
 static inline unsigned long current_stack_pointer(void)
@@ -188,35 +186,7 @@ static inline unsigned long current_stack_pointer(void)
 
 /* Load thread_info address into "reg" */
 #define GET_THREAD_INFO(reg) \
-	_ASM_MOV PER_CPU_VAR(cpu_current_top_of_stack),reg ; \
-	_ASM_SUB $(THREAD_SIZE),reg ;
-
-/*
- * ASM operand which evaluates to a 'thread_info' address of
- * the current task, if it is known that "reg" is exactly "off"
- * bytes below the top of the stack currently.
- *
- * ( The kernel stack's size is known at build time, it is usually
- *   2 or 4 pages, and the bottom  of the kernel stack contains
- *   the thread_info structure. So to access the thread_info very
- *   quickly from assembly code we can calculate down from the
- *   top of the kernel stack to the bottom, using constant,
- *   build-time calculations only. )
- *
- * For example, to fetch the current thread_info->flags value into %eax
- * on x86-64 defconfig kernels, in syscall entry code where RSP is
- * currently at exactly SIZEOF_PTREGS bytes away from the top of the
- * stack:
- *
- *      mov ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS), %eax
- *
- * will translate to:
- *
- *      8b 84 24 b8 c0 ff ff      mov    -0x3f48(%rsp), %eax
- *
- * which is below the current RSP by almost 16K.
- */
-#define ASM_THREAD_INFO(field, reg, off) ((field)+(off)-THREAD_SIZE)(reg)
+	_ASM_MOV PER_CPU_VAR(current_task),reg
 
 #endif
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index d6209f3a69cb..ef8017ca5ba9 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -80,7 +80,7 @@ print_ftrace_graph_addr(unsigned long addr, void *data,
 static inline int valid_stack_ptr(struct task_struct *task,
 			void *p, unsigned int size, void *end)
 {
-	void *t = task_thread_info(task);
+	void *t = task_stack_page(task);
 	if (end) {
 		if (p < end && p >= (end-THREAD_SIZE))
 			return 1;
diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 38da8f29a9c8..c627bf8d98ad 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -130,11 +130,9 @@ void irq_ctx_init(int cpu)
 
 void do_softirq_own_stack(void)
 {
-	struct thread_info *curstk;
 	struct irq_stack *irqstk;
 	u32 *isp, *prev_esp;
 
-	curstk = current_stack();
 	irqstk = __this_cpu_read(softirq_stack);
 
 	/* build the stack frame on the softirq stack */
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 206d0b90a3ab..38f9f5678dc8 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -41,8 +41,7 @@ static inline void stack_overflow_check(struct pt_regs *regs)
 	if (user_mode(regs))
 		return;
 
-	if (regs->sp >= curbase + sizeof(struct thread_info) +
-				  sizeof(struct pt_regs) + STACK_TOP_MARGIN &&
+	if (regs->sp >= curbase + sizeof(struct pt_regs) + STACK_TOP_MARGIN &&
 	    regs->sp <= curbase + THREAD_SIZE)
 		return;
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 96becbbb52e0..8f60f810a9e7 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -536,9 +536,7 @@ unsigned long get_wchan(struct task_struct *p)
 	 * PADDING
 	 * ----------- top = topmax - TOP_OF_KERNEL_STACK_PADDING
 	 * stack
-	 * ----------- bottom = start + sizeof(thread_info)
-	 * thread_info
-	 * ----------- start
+	 * ----------- bottom = start
 	 *
 	 * The tasks stack pointer points at the location where the
 	 * framepointer is stored. The data on the stack is:
@@ -549,7 +547,7 @@ unsigned long get_wchan(struct task_struct *p)
 	 */
 	top = start + THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
 	top -= 2 * sizeof(unsigned long);
-	bottom = start + sizeof(struct thread_info);
+	bottom = start;
 
 	sp = READ_ONCE(p->thread.sp);
 	if (sp < bottom || sp > top)
diff --git a/arch/x86/um/ptrace_32.c b/arch/x86/um/ptrace_32.c
index ebd4dd6ef73b..14e8f6a628c2 100644
--- a/arch/x86/um/ptrace_32.c
+++ b/arch/x86/um/ptrace_32.c
@@ -191,7 +191,7 @@ int peek_user(struct task_struct *child, long addr, long data)
 
 static int get_fpregs(struct user_i387_struct __user *buf, struct task_struct *child)
 {
-	int err, n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int err, n, cpu = task_thread_info(child)->cpu;
 	struct user_i387_struct fpregs;
 
 	err = save_i387_registers(userspace_pid[cpu],
@@ -208,7 +208,7 @@ static int get_fpregs(struct user_i387_struct __user *buf, struct task_struct *c
 
 static int set_fpregs(struct user_i387_struct __user *buf, struct task_struct *child)
 {
-	int n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int n, cpu = task_thread_info(child)->cpu;
 	struct user_i387_struct fpregs;
 
 	n = copy_from_user(&fpregs, buf, sizeof(fpregs));
@@ -221,7 +221,7 @@ static int set_fpregs(struct user_i387_struct __user *buf, struct task_struct *c
 
 static int get_fpxregs(struct user_fxsr_struct __user *buf, struct task_struct *child)
 {
-	int err, n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int err, n, cpu = task_thread_info(child)->cpu;
 	struct user_fxsr_struct fpregs;
 
 	err = save_fpx_registers(userspace_pid[cpu], (unsigned long *) &fpregs);
@@ -237,7 +237,7 @@ static int get_fpxregs(struct user_fxsr_struct __user *buf, struct task_struct *
 
 static int set_fpxregs(struct user_fxsr_struct __user *buf, struct task_struct *child)
 {
-	int n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int n, cpu = task_thread_info(child)->cpu;
 	struct user_fxsr_struct fpregs;
 
 	n = copy_from_user(&fpregs, buf, sizeof(fpregs));
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f2cb8d45513d..a00f53b64c09 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,8 @@
 #include <net/net_namespace.h>
 #include <linux/sched/rt.h>
 
+#include <asm/thread_info.h>
+
 #ifdef CONFIG_SMP
 # define INIT_PUSHABLE_TASKS(tsk)					\
 	.pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO),
@@ -183,12 +185,19 @@ extern struct task_group root_task_group;
 # define INIT_KASAN(tsk)
 #endif
 
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+# define INIT_TASK_TI(tsk) .thread_info = INIT_THREAD_INFO(tsk),
+#else
+# define INIT_TASK_TI(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
  */
 #define INIT_TASK(tsk)	\
 {									\
+	INIT_TASK_TI(tsk)						\
 	.state		= 0,						\
 	.stack		= &init_thread_info,				\
 	.usage		= ATOMIC_INIT(2),				\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada26345..06236a36ba17 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1456,6 +1456,9 @@ struct tlbflush_unmap_batch {
 };
 
 struct task_struct {
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	struct thread_info thread_info;
+#endif
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
 	atomic_t usage;
@@ -2539,7 +2542,9 @@ extern void set_curr_task(int cpu, struct task_struct *p);
 void yield(void);
 
 union thread_union {
+#ifndef CONFIG_THREAD_INFO_IN_TASK
 	struct thread_info thread_info;
+#endif
 	unsigned long stack[THREAD_SIZE/sizeof(long)];
 };
 
@@ -2967,7 +2972,14 @@ static inline void threadgroup_change_end(struct task_struct *tsk)
 	cgroup_threadgroup_change_end(tsk);
 }
 
-#ifndef __HAVE_THREAD_FUNCTIONS
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+
+#define task_thread_info(task)		(&(task)->thread_info)
+#define task_stack_page(task)		((task)->stack)
+#define setup_thread_stack(new,old)	do { } while(0)
+#define end_of_stack(task)		((unsigned long *)task_stack_page(task))
+
+#elif !defined(__HAVE_THREAD_FUNCTIONS)
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
 #define task_stack_page(task)	((task)->stack)
diff --git a/init/Kconfig b/init/Kconfig
index f755a602d4a1..0c83af6d3753 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -26,6 +26,9 @@ config IRQ_WORK
 config BUILDTIME_EXTABLE_SORT
 	bool
 
+config THREAD_INFO_IN_TASK
+	bool
+
 menu "General setup"
 
 config BROKEN
diff --git a/init/init_task.c b/init/init_task.c
index ba0a7f362d9e..11f83be1fa79 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -22,5 +22,8 @@ EXPORT_SYMBOL(init_task);
  * Initial thread structure. Alignment of this is handled by a special
  * linker map entry.
  */
-union thread_union init_thread_union __init_task_data =
-	{ INIT_THREAD_INFO(init_task) };
+union thread_union init_thread_union __init_task_data = {
+#ifndef CONFIG_THREAD_INFO_IN_TASK
+	INIT_THREAD_INFO(init_task)
+#endif
+};

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
@ 2016-06-24  6:17                       ` Linus Torvalds
  0 siblings, 0 replies; 269+ messages in thread
From: Linus Torvalds @ 2016-06-24  6:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

[-- Attachment #1: Type: text/plain, Size: 2297 bytes --]

On Thu, Jun 23, 2016 at 12:17 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> With the goal being that I'm hoping that we can then actually get rid
> of this (at least on x86-64, even if we leave it in some other
> architectures) in 4.8.

The context here was that we could almost get rid of thread-info
entirely, at least for x86-64, by moving it into struct task_struct.

It turns out that we're not *that* far off after the obvious cleanups
I already committed, but I couldn't get things quite to work.

I'm attaching a patch that I wrote today that doesn't boot, but "looks
right". I'm attaching it partly in the hope that somebody will take a
look and see what else I missed, but mostly because I think the patch
is interesting in a couple of cases where we just do incredibly ugly
things.

First off, some code that Andy wrote when he re-organized the entry path.

Oh Gods, Andy. That pt_regs_to_thread_info() thing made me want to do
unspeakable acts on a poor innocent wax figure that looked _exactly_
like you.

I got rid of pt_regs_to_thread_info() entirely, and just replaced
it with current_thread_info().  I'm not at all convinced that trying
to be that clever was really a good idea.

Secondly, the x86-64 ret_from_fork calling convention was documented
wrongly. It says %rdi contains the previous task pointer. Yes it does,
but it doesn't mention that %r8 is supposed to contain the new
thread_info. That was fun to find.

And thirdly, the stack size games that asm/kprobes.h plays are just
disgusting. I stared at that code for much too long. I may in fact be
going blind as a result.

The rest was fairly straightforward, although since the end result
doesn't actually work, that "straightforward" may be broken too. But
the basic approach _looks_ sane.

Comments? Anybody want to play with this and see where I went wrong?

(Note - this patch was written on top of the two thread-info removal
patches I committed in

   da01e18a37a5 x86: avoid passing around 'thread_info' in stack dumping code
   6720a305df74 locking: avoid passing around 'thread_info' in mutex debugging code

and depends on them, since "ti->task" no longer exists with
CONFIG_THREAD_INFO_IN_TASK; "ti" and "task" will have the same value.)

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 17448 bytes --]

This is a non-working attempt at moving the thread_info into the
task_struct

 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 21 +++++++--------------
 arch/x86/entry/entry_64.S          |  9 ++++++---
 arch/x86/include/asm/kprobes.h     | 12 ++++++------
 arch/x86/include/asm/switch_to.h   |  6 ++----
 arch/x86/include/asm/thread_info.h | 38 ++++----------------------------------
 arch/x86/kernel/dumpstack.c        |  2 +-
 	cgroup_threadgroup_change_end(tsk);
 }
 
-#ifndef __HAVE_THREAD_FUNCTIONS
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+
+#define task_thread_info(task)		(&(task)->thread_info)
+#define task_stack_page(task)		((task)->stack)
+#define setup_thread_stack(new,old)	do { } while(0)
+#define end_of_stack(task)		((unsigned long *)task_stack_page(task))
+
+#elif !defined(__HAVE_THREAD_FUNCTIONS)
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
 #define task_stack_page(task)	((task)->stack)
diff --git a/init/Kconfig b/init/Kconfig
index f755a602d4a1..0c83af6d3753 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -26,6 +26,9 @@ config IRQ_WORK
 config BUILDTIME_EXTABLE_SORT
 	bool
 
+config THREAD_INFO_IN_TASK
+	bool
+
 menu "General setup"
 
 config BROKEN
diff --git a/init/init_task.c b/init/init_task.c
index ba0a7f362d9e..11f83be1fa79 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -22,5 +22,8 @@ EXPORT_SYMBOL(init_task);
  * Initial thread structure. Alignment of this is handled by a special
  * linker map entry.
  */
-union thread_union init_thread_union __init_task_data =
-	{ INIT_THREAD_INFO(init_task) };
+union thread_union init_thread_union __init_task_data = {
+#ifndef CONFIG_THREAD_INFO_IN_TASK
+	INIT_THREAD_INFO(init_task)
+#endif
+};

^ permalink raw reply	[flat|nested] 269+ messages in thread

* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-24  6:17                       ` Linus Torvalds
@ 2016-06-24 12:25                         ` Brian Gerst
  -1 siblings, 0 replies; 269+ messages in thread
From: Brian Gerst @ 2016-06-24 12:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Oleg Nesterov, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Fri, Jun 24, 2016 at 2:17 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jun 23, 2016 at 12:17 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> With the goal being that I'm hoping that we can then actually get rid
>> of this (at least on x86-64, even if we leave it in some other
>> architectures) in 4.8.
>
> The context here was that we could almost get rid of thread-info
> entirely, at least for x86-64, by moving it into struct task_struct.
>
> It turns out that we're not *that* far off after the obvious cleanups
> I already committed, but I couldn't get things quite to work.
>
> I'm attaching a patch that I wrote today that doesn't boot, but "looks
> right". The reason I'm attaching it is because I'm hoping somebody
> wants to take a look and maybe see what else I missed, but mostly
> because I think the patch is interesting in a couple of cases where we
> just do incredibly ugly things.
>
> First off, some code that Andy wrote when he re-organized the entry path.
>
> Oh Gods, Andy. That pt_regs_to_thread_info() thing made me want to do
> unspeakable acts on a poor innocent wax figure that looked _exactly_
> like you.
>
> I just got rid of pt_regs_to_thread_info() entirely, and just replaced
> it with current_thread_info().  I'm not at all convinced that trying
> to be that clever was really a good idea.
>
> Secondly, the x86-64 ret_from_fork calling convention was documented
> wrongly. It says %rdi contains the previous task pointer. Yes it does,
> but it doesn't mention that %r8 is supposed to contain the new
> thread_info. That was fun to find.
>
> And thirdly, the stack size games that asm/kprobes.h plays are just
> disgusting. I stared at that code for much too long. I may in fact be
> going blind as a result.
>
> The rest was fairly straightforward, although since the end result
> doesn't actually work, that "straightforward" may be broken too. But
> the basic approach _looks_ sane.
>
> Comments? Anybody want to play with this and see where I went wrong?
>
> (Note - this patch was written on top of the two thread-info removal
> patches I committed in
>
>    da01e18a37a5 x86: avoid passing around 'thread_info' in stack dumping code
>    6720a305df74 locking: avoid passing around 'thread_info' in mutex debugging code
>
> and depends on them, since "ti->task" no longer exists with
> CONFIG_THREAD_INFO_IN_TASK. "ti" and "task" will have the same value).
>
>                  Linus

  * A newly forked process directly context switches into this address.
  *
  * rdi: prev task we switched from
+ * rsi: task we're switching to
  */
 ENTRY(ret_from_fork)
-    LOCK ; btr $TIF_FORK, TI_flags(%r8)
+    LOCK ; btr $TIF_FORK, TI_flags(%rsi)    /* rsi: this newly forked task */

     call    schedule_tail            /* rdi: 'prev' task parameter */

I think you forgot GET_THREAD_INFO() here.  RSI is the task, not the
thread_info.  FYI, this goes away with my switch_to() rewrite, which
removes TIF_FORK.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [kernel-hardening] Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
@ 2016-06-24 12:25                         ` Brian Gerst
  0 siblings, 0 replies; 269+ messages in thread
From: Brian Gerst @ 2016-06-24 12:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Oleg Nesterov, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, kernel-hardening,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Fri, Jun 24, 2016 at 2:17 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jun 23, 2016 at 12:17 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> With the goal being that I'm hoping that we can then actually get rid
>> of this (at least on x86-64, even if we leave it in some other
>> architectures) in 4.8.
>
> The context here was that we could almost get rid of thread-info
> entirely, at least for x86-64, by moving it into struct task_struct.
>
> It turns out that we're not *that* far off after the obvious cleanups
> I already committed, but I couldn't get things quite to work.
>
> I'm attaching a patch that I wrote today that doesn't boot, but "looks
> right". The reason I'm attaching it is because I'm hoping somebody
> wants to take a look and maybe see what else I missed, but mostly
> because I think the patch is interesting in a couple of cases where we
> just do incredibly ugly things.
>
> First off, some code that Andy wrote when he re-organized the entry path.
>
> Oh Gods, Andy. That pt_regs_to_thread_info() thing made me want to do
> unspeakable acts on a poor innocent wax figure that looked _exactly_
> like you.
>
> I just got rid of pt_regs_to_thread_info() entirely, and just replaced
> it with current_thread_info().  I'm not at all convinced that trying
> to be that clever was really a good idea.
>
> Secondly, the x86-64 ret_from_fork calling convention was documented
> wrongly. It says %rdi contains the previous task pointer. Yes it does,
> but it doesn't mention that %r8 is supposed to contain the new
> thread_info. That was fun to find.
>
> And thirdly, the stack size games that asm/kprobes.h plays are just
> disgusting. I stared at that code for much too long. I may in fact be
> going blind as a result.
>
> The rest was fairly straightforward, although since the end result
> doesn't actually work, that "straightforward" may be broken too. But
> the basic approach _looks_ sane.
>
> Comments? Anybody want to play with this and see where I went wrong?
>
> (Note - this patch was written on top of the two thread-info removal
> patches I committed in
>
>    da01e18a37a5 x86: avoid passing around 'thread_info' in stack
> dumping code
>    6720a305df74 locking: avoid passing around 'thread_info' in mutex
> debugging code
>
> and depends on them, since "ti->task" no longer exists with
> CONFIG_THREAD_INFO_IN_TASK. "ti" and "task" will have the same value).
>
>                  Linus

  * A newly forked process directly context switches into this address.
  *
  * rdi: prev task we switched from
+ * rsi: task we're switching to
  */
 ENTRY(ret_from_fork)
-    LOCK ; btr $TIF_FORK, TI_flags(%r8)
+    LOCK ; btr $TIF_FORK, TI_flags(%rsi)    /* rsi: this newly forked task */

     call    schedule_tail            /* rdi: 'prev' task parameter */

I think you forgot GET_THREAD_INFO() here.  RSI is the task, not the
thread_info.  FYI, this goes away with my switch_to() rewrite, which
removes TIF_FORK.

--
Brian Gerst


* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-23 18:52                 ` Oleg Nesterov
@ 2016-06-24 14:05                   ` Michal Hocko
  -1 siblings, 0 replies; 269+ messages in thread
From: Michal Hocko @ 2016-06-24 14:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Thu 23-06-16 20:52:21, Oleg Nesterov wrote:
> On 06/23, Linus Torvalds wrote:
> >
> > On Thu, Jun 23, 2016 at 10:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > Let me quote my previous email ;)
> > >
> > >         And we can't free/nullify it when the parent/debugger reaps a zombie,
> > >         say, mark_oom_victim() expects that get_task_struct() protects
> > >         thread_info as well.
> > >
> > > probably we can fix all such users though...
> >
> > TIF_MEMDIE is indeed a potential problem, but I don't think
> > mark_oom_victim() is actually problematic.
> >
> > mark_oom_victim() is called with either "current",
> 
> This is no longer true in -mm tree.
> 
> But I agree, this is fixable (and in fact I still hope TIF_MEMDIE will die,
> at least in its current form).

We can move the flag into the task_struct; there are still some spare
bits left there. The change would be trivial, so the OOM usage wouldn't
stand in the way.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
  2016-06-24 14:05                   ` Michal Hocko
@ 2016-06-24 15:06                     ` Michal Hocko
  -1 siblings, 0 replies; 269+ messages in thread
From: Michal Hocko @ 2016-06-24 15:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Fri 24-06-16 16:05:58, Michal Hocko wrote:
> On Thu 23-06-16 20:52:21, Oleg Nesterov wrote:
> > On 06/23, Linus Torvalds wrote:
> > >
> > > On Thu, Jun 23, 2016 at 10:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > Let me quote my previous email ;)
> > > >
> > > >         And we can't free/nullify it when the parent/debugger reaps a zombie,
> > > >         say, mark_oom_victim() expects that get_task_struct() protects
> > > >         thread_info as well.
> > > >
> > > > probably we can fix all such users though...
> > >
> > > TIF_MEMDIE is indeed a potential problem, but I don't think
> > > mark_oom_victim() is actually problematic.
> > >
> > > mark_oom_victim() is called with either "current",
> > 
> > This is no longer true in -mm tree.
> > 
> > But I agree, this is fixable (and in fact I still hope TIF_MEMDIE will die,
> > at least in its current form).
> 
> We can move the flag to the task_struct. There are still some bits left
> there. This would be trivial so that the oom usage doesn't stay in the
> way.

Here is the patch. I've found two bugs where TIF_MEMDIE was checked on
current rather than on the given task. I will split those out into their
own patches (I was just too lazy to do that now). If the approach looks
reasonable then I will repost next week.
---
From 1baaa1f8f9568f95d8feccb28cf1994f8ca0df9f Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 24 Jun 2016 16:46:18 +0200
Subject: [PATCH] mm, oom: move TIF_MEMDIE to the task_struct

There is interest in dropping thread_info->flags usage for further
cleanups. TIF_MEMDIE stands in the way, so let's move it out of
thread_info and into the task_struct. We cannot reuse task_struct::flags
because the oom killer sets the flag on a !current task without any
locking, so let's add a new task_struct::memdie field. It has to be
atomic so that it can be updated safely from outside the task's own
context.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/alpha/include/asm/thread_info.h      |  1 -
 arch/arc/include/asm/thread_info.h        |  2 --
 arch/arm/include/asm/thread_info.h        |  1 -
 arch/arm64/include/asm/thread_info.h      |  1 -
 arch/avr32/include/asm/thread_info.h      |  2 --
 arch/blackfin/include/asm/thread_info.h   |  1 -
 arch/c6x/include/asm/thread_info.h        |  1 -
 arch/cris/include/asm/thread_info.h       |  1 -
 arch/frv/include/asm/thread_info.h        |  1 -
 arch/h8300/include/asm/thread_info.h      |  1 -
 arch/hexagon/include/asm/thread_info.h    |  1 -
 arch/ia64/include/asm/thread_info.h       |  1 -
 arch/m32r/include/asm/thread_info.h       |  1 -
 arch/m68k/include/asm/thread_info.h       |  1 -
 arch/metag/include/asm/thread_info.h      |  1 -
 arch/microblaze/include/asm/thread_info.h |  1 -
 arch/mips/include/asm/thread_info.h       |  1 -
 arch/mn10300/include/asm/thread_info.h    |  1 -
 arch/nios2/include/asm/thread_info.h      |  1 -
 arch/openrisc/include/asm/thread_info.h   |  1 -
 arch/parisc/include/asm/thread_info.h     |  1 -
 arch/powerpc/include/asm/thread_info.h    |  1 -
 arch/s390/include/asm/thread_info.h       |  1 -
 arch/score/include/asm/thread_info.h      |  1 -
 arch/sh/include/asm/thread_info.h         |  1 -
 arch/sparc/include/asm/thread_info_32.h   |  1 -
 arch/sparc/include/asm/thread_info_64.h   |  1 -
 arch/tile/include/asm/thread_info.h       |  2 --
 arch/um/include/asm/thread_info.h         |  2 --
 arch/unicore32/include/asm/thread_info.h  |  1 -
 arch/x86/include/asm/thread_info.h        |  1 -
 arch/xtensa/include/asm/thread_info.h     |  1 -
 drivers/staging/android/lowmemorykiller.c |  2 +-
 fs/ext4/mballoc.c                         |  2 +-
 include/linux/sched.h                     |  2 ++
 kernel/cpuset.c                           | 12 ++++++------
 kernel/exit.c                             |  2 +-
 kernel/freezer.c                          |  2 +-
 mm/ksm.c                                  |  4 ++--
 mm/memcontrol.c                           |  2 +-
 mm/oom_kill.c                             | 20 ++++++++++----------
 mm/page_alloc.c                           |  6 +++---
 42 files changed, 28 insertions(+), 62 deletions(-)

diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 32e920a83ae5..126eaaf6559d 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -65,7 +65,6 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SYSCALL_AUDIT	4	/* syscall audit active */
 #define TIF_DIE_IF_KERNEL	9	/* dik recursion lock */
-#define TIF_MEMDIE		13	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	14	/* idle is polling for TIF_NEED_RESCHED */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
diff --git a/arch/arc/include/asm/thread_info.h b/arch/arc/include/asm/thread_info.h
index 3af67455659a..46d1fc1a073d 100644
--- a/arch/arc/include/asm/thread_info.h
+++ b/arch/arc/include/asm/thread_info.h
@@ -88,14 +88,12 @@ static inline __attribute_const__ struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACE	15	/* syscall trace active */
 
 /* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE		16
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
 #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
-#define _TIF_MEMDIE		(1<<TIF_MEMDIE)
 
 /* work to do on interrupt/exception return */
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index 776757d1604a..6277e56f15fd 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -146,7 +146,6 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
 
 #define TIF_NOHZ		12	/* in adaptive nohz mode */
 #define TIF_USING_IWMMXT	17
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	20
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index abd64bd1f6d9..d78b3b2945a9 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -114,7 +114,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_AUDIT	9
 #define TIF_SYSCALL_TRACEPOINT	10
 #define TIF_SECCOMP		11
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_FREEZE		19
 #define TIF_RESTORE_SIGMASK	20
 #define TIF_SINGLESTEP		21
diff --git a/arch/avr32/include/asm/thread_info.h b/arch/avr32/include/asm/thread_info.h
index d4d3079541ea..680be13234ab 100644
--- a/arch/avr32/include/asm/thread_info.h
+++ b/arch/avr32/include/asm/thread_info.h
@@ -70,7 +70,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED        2       /* rescheduling necessary */
 #define TIF_BREAKPOINT		4	/* enter monitor mode on return */
 #define TIF_SINGLE_STEP		5	/* single step in progress */
-#define TIF_MEMDIE		6	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	7	/* restore signal mask in do_signal */
 #define TIF_CPU_GOING_TO_SLEEP	8	/* CPU is entering sleep 0 mode */
 #define TIF_NOTIFY_RESUME	9	/* callback before returning to user */
@@ -82,7 +81,6 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_BREAKPOINT		(1 << TIF_BREAKPOINT)
 #define _TIF_SINGLE_STEP	(1 << TIF_SINGLE_STEP)
-#define _TIF_MEMDIE		(1 << TIF_MEMDIE)
 #define _TIF_CPU_GOING_TO_SLEEP (1 << TIF_CPU_GOING_TO_SLEEP)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 
diff --git a/arch/blackfin/include/asm/thread_info.h b/arch/blackfin/include/asm/thread_info.h
index 2966b93850a1..a45ff075ab6a 100644
--- a/arch/blackfin/include/asm/thread_info.h
+++ b/arch/blackfin/include/asm/thread_info.h
@@ -79,7 +79,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACE	0	/* syscall trace active */
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
-#define TIF_MEMDIE		4	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
 #define TIF_IRQ_SYNC		7	/* sync pipeline stage */
 #define TIF_NOTIFY_RESUME	8	/* callback before returning to user */
diff --git a/arch/c6x/include/asm/thread_info.h b/arch/c6x/include/asm/thread_info.h
index acc70c135ab8..22ff7b03641d 100644
--- a/arch/c6x/include/asm/thread_info.h
+++ b/arch/c6x/include/asm/thread_info.h
@@ -89,7 +89,6 @@ struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_RESTORE_SIGMASK	4	/* restore signal mask in do_signal() */
 
-#define TIF_MEMDIE		17	/* OOM killer killed process */
 
 #define TIF_WORK_MASK		0x00007FFE /* work on irq/exception return */
 #define TIF_ALLWORK_MASK	0x00007FFF /* work on any return to u-space */
diff --git a/arch/cris/include/asm/thread_info.h b/arch/cris/include/asm/thread_info.h
index 4ead1b40d2d7..79ebddc22aa3 100644
--- a/arch/cris/include/asm/thread_info.h
+++ b/arch/cris/include/asm/thread_info.h
@@ -70,7 +70,6 @@ struct thread_info {
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
diff --git a/arch/frv/include/asm/thread_info.h b/arch/frv/include/asm/thread_info.h
index ccba3b6ce918..993930f59d8e 100644
--- a/arch/frv/include/asm/thread_info.h
+++ b/arch/frv/include/asm/thread_info.h
@@ -86,7 +86,6 @@ register struct thread_info *__current_thread_info asm("gr15");
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		7	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
diff --git a/arch/h8300/include/asm/thread_info.h b/arch/h8300/include/asm/thread_info.h
index b408fe660cf8..68c10bce921e 100644
--- a/arch/h8300/include/asm/thread_info.h
+++ b/arch/h8300/include/asm/thread_info.h
@@ -73,7 +73,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_SINGLESTEP		3	/* singlestepping active */
-#define TIF_MEMDIE		4	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
 #define TIF_NOTIFY_RESUME	6	/* callback before returning to user */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
diff --git a/arch/hexagon/include/asm/thread_info.h b/arch/hexagon/include/asm/thread_info.h
index b80fe1db7b64..e55c7d0a1755 100644
--- a/arch/hexagon/include/asm/thread_info.h
+++ b/arch/hexagon/include/asm/thread_info.h
@@ -112,7 +112,6 @@ register struct thread_info *__current_thread_info asm(QUOTED_THREADINFO_REG);
 #define TIF_SINGLESTEP          4       /* restore ss @ return to usr mode */
 #define TIF_RESTORE_SIGMASK     6       /* restore sig mask in do_signal() */
 /* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE              17      /* OOM killer killed process */
 
 #define _TIF_SYSCALL_TRACE      (1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME      (1 << TIF_NOTIFY_RESUME)
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index aa995b67c3f5..77064b1d188a 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -97,7 +97,6 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	3	/* syscall auditing active */
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_NOTIFY_RESUME	6	/* resumption notification requested */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #define TIF_MCA_INIT		18	/* this task is processing MCA or INIT */
 #define TIF_DB_DISABLED		19	/* debug trap disabled for fsyscall */
 #define TIF_RESTORE_RSE		21	/* user RBS is newer than kernel RBS */
diff --git a/arch/m32r/include/asm/thread_info.h b/arch/m32r/include/asm/thread_info.h
index f630d9c30b28..bc54a574fad0 100644
--- a/arch/m32r/include/asm/thread_info.h
+++ b/arch/m32r/include/asm/thread_info.h
@@ -102,7 +102,6 @@ static inline unsigned int get_thread_fault_code(void)
 #define TIF_NOTIFY_RESUME	5	/* callback before returning to user */
 #define TIF_RESTORE_SIGMASK	8	/* restore signal mask in do_signal() */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
diff --git a/arch/m68k/include/asm/thread_info.h b/arch/m68k/include/asm/thread_info.h
index cee13c2e5161..ed497d31ea5d 100644
--- a/arch/m68k/include/asm/thread_info.h
+++ b/arch/m68k/include/asm/thread_info.h
@@ -68,7 +68,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	7	/* rescheduling necessary */
 #define TIF_DELAYED_TRACE	14	/* single step a syscall */
 #define TIF_SYSCALL_TRACE	15	/* syscall trace active */
-#define TIF_MEMDIE		16	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	18	/* restore signal mask in do_signal */
 
 #endif	/* _ASM_M68K_THREAD_INFO_H */
diff --git a/arch/metag/include/asm/thread_info.h b/arch/metag/include/asm/thread_info.h
index 32677cc278aa..c506e5a61714 100644
--- a/arch/metag/include/asm/thread_info.h
+++ b/arch/metag/include/asm/thread_info.h
@@ -111,7 +111,6 @@ static inline int kstack_end(void *addr)
 #define TIF_SECCOMP		5	/* secure computing */
 #define TIF_RESTORE_SIGMASK	6	/* restore signal mask in do_signal() */
 #define TIF_NOTIFY_RESUME	7	/* callback before returning to user */
-#define TIF_MEMDIE		8	/* is terminating due to OOM killer */
 #define TIF_SYSCALL_TRACEPOINT	9	/* syscall tracepoint instrumentation */
 
 
diff --git a/arch/microblaze/include/asm/thread_info.h b/arch/microblaze/include/asm/thread_info.h
index 383f387b4eee..281a365bec48 100644
--- a/arch/microblaze/include/asm/thread_info.h
+++ b/arch/microblaze/include/asm/thread_info.h
@@ -113,7 +113,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	3 /* rescheduling necessary */
 /* restore singlestep on return to user mode */
 #define TIF_SINGLESTEP		4
-#define TIF_MEMDIE		6	/* is terminating due to OOM killer */
 #define TIF_SYSCALL_AUDIT	9       /* syscall auditing active */
 #define TIF_SECCOMP		10      /* secure computing */
 
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index e309d8fcb516..3dd906330867 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -102,7 +102,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_UPROBE		6	/* breakpointed or singlestepping */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_FIXADE		20	/* Fix address errors in software */
 #define TIF_LOGADE		21	/* Log address errors to syslog */
diff --git a/arch/mn10300/include/asm/thread_info.h b/arch/mn10300/include/asm/thread_info.h
index 4861a78c7160..1dd24f251a98 100644
--- a/arch/mn10300/include/asm/thread_info.h
+++ b/arch/mn10300/include/asm/thread_info.h
@@ -145,7 +145,6 @@ void arch_release_thread_info(struct thread_info *ti);
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	+(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	+(1 << TIF_NOTIFY_RESUME)
diff --git a/arch/nios2/include/asm/thread_info.h b/arch/nios2/include/asm/thread_info.h
index d69c338bd19c..bf7d38c1c6e2 100644
--- a/arch/nios2/include/asm/thread_info.h
+++ b/arch/nios2/include/asm/thread_info.h
@@ -86,7 +86,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOTIFY_RESUME	1	/* resumption notification requested */
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
-#define TIF_MEMDIE		4	/* is terminating due to OOM killer */
 #define TIF_SECCOMP		5	/* secure computing */
 #define TIF_SYSCALL_AUDIT	6	/* syscall auditing active */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
diff --git a/arch/openrisc/include/asm/thread_info.h b/arch/openrisc/include/asm/thread_info.h
index 6e619a79a401..7678a1b2dc64 100644
--- a/arch/openrisc/include/asm/thread_info.h
+++ b/arch/openrisc/include/asm/thread_info.h
@@ -108,7 +108,6 @@ register struct thread_info *current_thread_info_reg asm("r10");
 #define TIF_RESTORE_SIGMASK     9
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling						 * TIF_NEED_RESCHED
 					 */
-#define TIF_MEMDIE              17
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
diff --git a/arch/parisc/include/asm/thread_info.h b/arch/parisc/include/asm/thread_info.h
index e96e693fd58c..bcebec0b9418 100644
--- a/arch/parisc/include/asm/thread_info.h
+++ b/arch/parisc/include/asm/thread_info.h
@@ -48,7 +48,6 @@ struct thread_info {
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	3	/* true if poll_idle() is polling TIF_NEED_RESCHED */
 #define TIF_32BIT               4       /* 32 bit binary */
-#define TIF_MEMDIE		5	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	6	/* restore saved signal mask */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_NOTIFY_RESUME	8	/* callback before returning to user */
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 7efee4a3240b..d744fa455dd2 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -97,7 +97,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACEPOINT	15	/* syscall tracepoint instrumentation */
 #define TIF_EMULATE_STACK_STORE	16	/* Is an instruction emulation
 						for stack store? */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #if defined(CONFIG_PPC64)
 #define TIF_ELF2ABI		18	/* function descriptors must die! */
 #endif
diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index 2fffc2c27581..8fc2704dd263 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -79,7 +79,6 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_SYSCALL_TRACEPOINT	6	/* syscall tracepoint instrumentation */
 #define TIF_UPROBE		7	/* breakpointed or single-stepping */
 #define TIF_31BIT		16	/* 32bit process */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	18	/* restore signal mask in do_signal() */
 #define TIF_SINGLE_STEP		19	/* This task is single stepped */
 #define TIF_BLOCK_STEP		20	/* This task is block stepped */
diff --git a/arch/score/include/asm/thread_info.h b/arch/score/include/asm/thread_info.h
index 7d9ffb15c477..f6e1cc89cef9 100644
--- a/arch/score/include/asm/thread_info.h
+++ b/arch/score/include/asm/thread_info.h
@@ -78,7 +78,6 @@ register struct thread_info *__current_thread_info __asm__("r28");
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_NOTIFY_RESUME	5	/* callback before returning to user */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
diff --git a/arch/sh/include/asm/thread_info.h b/arch/sh/include/asm/thread_info.h
index 2afa321157be..017f3993f384 100644
--- a/arch/sh/include/asm/thread_info.h
+++ b/arch/sh/include/asm/thread_info.h
@@ -117,7 +117,6 @@ extern void init_thread_xstate(void);
 #define TIF_NOTIFY_RESUME	7	/* callback before returning to user */
 #define TIF_SYSCALL_TRACEPOINT	8	/* for ftrace syscall instrumentation */
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
diff --git a/arch/sparc/include/asm/thread_info_32.h b/arch/sparc/include/asm/thread_info_32.h
index 229475f0d7ce..bcf81999db0b 100644
--- a/arch/sparc/include/asm/thread_info_32.h
+++ b/arch/sparc/include/asm/thread_info_32.h
@@ -110,7 +110,6 @@ register struct thread_info *current_thread_info_reg asm("g6");
 					 * this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	9	/* true if poll_idle() is polling
 					 * TIF_NEED_RESCHED */
-#define TIF_MEMDIE		10	/* is terminating due to OOM killer */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
diff --git a/arch/sparc/include/asm/thread_info_64.h b/arch/sparc/include/asm/thread_info_64.h
index bde59825d06c..63b3285b7d46 100644
--- a/arch/sparc/include/asm/thread_info_64.h
+++ b/arch/sparc/include/asm/thread_info_64.h
@@ -191,7 +191,6 @@ register struct thread_info *current_thread_info_reg asm("g6");
  *       an immediate value in instructions such as andcc.
  */
 /* flag bit 12 is available */
-#define TIF_MEMDIE		13	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	14
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index 4b7cef9e94e0..734d53f4b435 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -121,7 +121,6 @@ extern void _cpu_idle(void);
 #define TIF_SYSCALL_TRACE	4	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	5	/* syscall auditing active */
 #define TIF_SECCOMP		6	/* secure computing */
-#define TIF_MEMDIE		7	/* OOM killer at work */
 #define TIF_NOTIFY_RESUME	8	/* callback before returning to user */
 #define TIF_SYSCALL_TRACEPOINT	9	/* syscall tracepoint instrumentation */
 #define TIF_POLLING_NRFLAG	10	/* idle is polling for TIF_NEED_RESCHED */
@@ -134,7 +133,6 @@ extern void _cpu_idle(void);
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1<<TIF_SECCOMP)
-#define _TIF_MEMDIE		(1<<TIF_MEMDIE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
diff --git a/arch/um/include/asm/thread_info.h b/arch/um/include/asm/thread_info.h
index 053baff03674..b13047eeaede 100644
--- a/arch/um/include/asm/thread_info.h
+++ b/arch/um/include/asm/thread_info.h
@@ -58,7 +58,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_RESTART_BLOCK	4
-#define TIF_MEMDIE		5	/* is terminating due to OOM killer */
 #define TIF_SYSCALL_AUDIT	6
 #define TIF_RESTORE_SIGMASK	7
 #define TIF_NOTIFY_RESUME	8
@@ -67,7 +66,6 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
-#define _TIF_MEMDIE		(1 << TIF_MEMDIE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 
diff --git a/arch/unicore32/include/asm/thread_info.h b/arch/unicore32/include/asm/thread_info.h
index e79ad6d5b5b2..2487cf9dd41e 100644
--- a/arch/unicore32/include/asm/thread_info.h
+++ b/arch/unicore32/include/asm/thread_info.h
@@ -121,7 +121,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	1
 #define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
 #define TIF_SYSCALL_TRACE	8
-#define TIF_MEMDIE		18
 #define TIF_RESTORE_SIGMASK	20
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index ffae84df8a93..79a4b75e814c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -101,7 +101,6 @@ struct thread_info {
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_FORK		18	/* ret_from_fork */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
-#define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
diff --git a/arch/xtensa/include/asm/thread_info.h b/arch/xtensa/include/asm/thread_info.h
index 7be2400f745a..791a0a0b5827 100644
--- a/arch/xtensa/include/asm/thread_info.h
+++ b/arch/xtensa/include/asm/thread_info.h
@@ -108,7 +108,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_SINGLESTEP		3	/* restore singlestep on return to user mode */
-#define TIF_MEMDIE		5	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	6	/* restore signal mask in do_signal() */
 #define TIF_NOTIFY_RESUME	7	/* callback before returning to user */
 #define TIF_DB_DISABLED		8	/* debug trap disabled for syscall */
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 2509e5df7244..55d02f2376ab 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -131,7 +131,7 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 		if (!p)
 			continue;
 
-		if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
+		if (atomic_read(&p->memdie) &&
 		    time_before_eq(jiffies, lowmem_deathpending_timeout)) {
 			task_unlock(p);
 			rcu_read_unlock();
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c1ab3ec30423..ddc12f571c50 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4815,7 +4815,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 #endif
 	trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);
 
-	/* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
+	/* __GFP_NOFAIL: retry infinitely, ignore memdie tasks and memcg limit. */
 	err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
 				     GFP_NOFS|__GFP_NOFAIL);
 	if (err)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d81a1eb974a..4c91fc0c2e8e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1856,6 +1856,8 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+	/* oom victim - give it access to memory reserves */
+	atomic_t	memdie;
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 73e93e53884d..857fac0b973d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1038,9 +1038,9 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+	if (unlikely(atomic_read(&tsk->memdie)))
 		return;
-	if (current->flags & PF_EXITING) /* Let dying task have memory */
+	if (tsk->flags & PF_EXITING) /* Let dying task have memory */
 		return;
 
 	task_lock(tsk);
@@ -2496,12 +2496,12 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * If we're in interrupt, yes, we can always allocate.  If @node is set in
  * current's mems_allowed, yes.  If it's not a __GFP_HARDWALL request and this
  * node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
- * yes.  If current has access to memory reserves due to TIF_MEMDIE, yes.
+ * yes.  If current has access to memory reserves due to memdie, yes.
  * Otherwise, no.
  *
  * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
  * and do not allow allocations outside the current tasks cpuset
- * unless the task has been OOM killed as is marked TIF_MEMDIE.
+ * unless the task has been OOM killed as is marked memdie.
  * GFP_KERNEL allocations are not so marked, so can escape to the
  * nearest enclosing hardwalled ancestor cpuset.
  *
@@ -2524,7 +2524,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * affect that:
  *	in_interrupt - any node ok (current task context irrelevant)
  *	GFP_ATOMIC   - any node ok
- *	TIF_MEMDIE   - any node ok
+ *	memdie       - any node ok
  *	GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
  *	GFP_USER     - only nodes in current tasks mems allowed ok.
  */
@@ -2542,7 +2542,7 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+	if (unlikely(atomic_read(&current->memdie)))
 		return true;
 	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
 		return false;
diff --git a/kernel/exit.c b/kernel/exit.c
index 9e6e1356e6bb..8bfdda9bc99a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -434,7 +434,7 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	if (test_thread_flag(TIF_MEMDIE))
+	if (atomic_read(&current->memdie))
 		exit_oom_victim(tsk);
 }
 
diff --git a/kernel/freezer.c b/kernel/freezer.c
index a8900a3bc27a..e1bd9f2780fe 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -42,7 +42,7 @@ bool freezing_slow_path(struct task_struct *p)
 	if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
 		return false;
 
-	if (test_thread_flag(TIF_MEMDIE))
+	if (atomic_read(&p->memdie))
 		return false;
 
 	if (pm_nosig_freezing || cgroup_freezing(p))
diff --git a/mm/ksm.c b/mm/ksm.c
index 73d43bafd9fb..8d5a295fb955 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -396,11 +396,11 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 	 *
 	 * VM_FAULT_OOM: at the time of writing (late July 2009), setting
 	 * aside mem_cgroup limits, VM_FAULT_OOM would only be set if the
-	 * current task has TIF_MEMDIE set, and will be OOM killed on return
+	 * current task has memdie set, and will be OOM killed on return
 	 * to user; and ksmd, having no mm, would never be chosen for that.
 	 *
 	 * But if the mm is in a limited mem_cgroup, then the fault may fail
-	 * with VM_FAULT_OOM even if the current task is not TIF_MEMDIE; and
+	 * with VM_FAULT_OOM even if the current task is not memdie; and
 	 * even ksmd can fail in this way - though it's usually breaking ksm
 	 * just to undo a merge it made a moment before, so unlikely to oom.
 	 *
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e8f9e5e9291..df411de17a75 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1987,7 +1987,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * bypass the last charges so that they can exit quickly and
 	 * free their memory.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
+	if (unlikely(atomic_read(&current->memdie) ||
 		     fatal_signal_pending(current) ||
 		     current->flags & PF_EXITING))
 		goto force;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f744daa6..9d24007cdb82 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -473,7 +473,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 *				[...]
 	 *				out_of_memory
 	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
+	 *				    # no memdie task selects new victim
 	 *  unmap_page_range # frees some memory
 	 */
 	mutex_lock(&oom_lock);
@@ -593,7 +593,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	}
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
+	 * Clear memdie because the task shouldn't be sitting on a
 	 * reasonably reclaimable memory anymore or it is not a good candidate
 	 * for the oom victim right now because it cannot release its memory
 	 * itself nor by the oom reaper.
@@ -669,14 +669,14 @@ void mark_oom_victim(struct task_struct *tsk)
 {
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
-	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+	if (!atomic_add_unless(&tsk->memdie, 1, 1))
 		return;
 	atomic_inc(&tsk->signal->oom_victims);
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
 	 * any memory and livelock. freezing_slow_path will tell the freezer
-	 * that TIF_MEMDIE tasks should be ignored.
+	 * that memdie tasks should be ignored.
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
@@ -687,7 +687,7 @@ void mark_oom_victim(struct task_struct *tsk)
  */
 void exit_oom_victim(struct task_struct *tsk)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+	if (!atomic_add_unless(&tsk->memdie, -1, 0))
 		return;
 	atomic_dec(&tsk->signal->oom_victims);
 
@@ -771,7 +771,7 @@ bool task_will_free_mem(struct task_struct *task)
 	 * If the process has passed exit_mm we have to skip it because
 	 * we have lost a link to other tasks sharing this mm, we do not
 	 * have anything to reap and the task might then get stuck waiting
-	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 * for parent as zombie and we do not want it to hold memdie
 	 */
 	p = find_lock_task_mm(task);
 	if (!p)
@@ -836,7 +836,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 * its children or threads, just set memdie so it can die quickly
 	 */
 	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
@@ -893,7 +893,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
 	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
+	 * We should send SIGKILL before setting memdie in order to prevent
 	 * the OOM victim from depleting the memory reserves from the user
 	 * space under its control.
 	 */
@@ -1016,7 +1016,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * quickly exit and free its memory.
 	 *
 	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
+	 * memdie flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
 	if (current->mm && task_will_free_mem(current)) {
 		mark_oom_victim(current);
@@ -1096,7 +1096,7 @@ void pagefault_out_of_memory(void)
 		 * be a racing OOM victim for which oom_killer_disable()
 		 * is waiting for.
 		 */
-		WARN_ON(test_thread_flag(TIF_MEMDIE));
+		WARN_ON(atomic_read(&current->memdie));
 	}
 
 	mutex_unlock(&oom_lock);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89128d64d662..6c550afde6a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3050,7 +3050,7 @@ void warn_alloc_failed(gfp_t gfp_mask, unsigned int order, const char *fmt, ...)
 	 * of allowed nodes.
 	 */
 	if (!(gfp_mask & __GFP_NOMEMALLOC))
-		if (test_thread_flag(TIF_MEMDIE) ||
+		if (atomic_read(&current->memdie) ||
 		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
 			filter &= ~SHOW_MEM_FILTER_NODES;
 	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -3428,7 +3428,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (!in_interrupt() &&
 				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
+				 unlikely(atomic_read(&current->memdie))))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA
@@ -3637,7 +3637,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+	if (atomic_read(&current->memdie) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;
 
 	/*
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
@ 2016-06-24 15:06                     ` Michal Hocko
  0 siblings, 0 replies; 269+ messages in thread
From: Michal Hocko @ 2016-06-24 15:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Fri 24-06-16 16:05:58, Michal Hocko wrote:
> On Thu 23-06-16 20:52:21, Oleg Nesterov wrote:
> > On 06/23, Linus Torvalds wrote:
> > >
> > > On Thu, Jun 23, 2016 at 10:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > Let me quote my previous email ;)
> > > >
> > > >         And we can't free/nullify it when the parent/debuger reaps a zombie,
> > > >         say, mark_oom_victim() expects that get_task_struct() protects
> > > >         thread_info as well.
> > > >
> > > > probably we can fix all such users though...
> > >
> > > TIF_MEMDIE is indeed a potential problem, but I don't think
> > > mark_oom_victim() is actually problematic.
> > >
> > > mark_oom_victim() is called with either "current",
> > 
> > This is no longer true in -mm tree.
> > 
> > But I agree, this is fixable (and in fact I still hope TIF_MEMDIE will die,
> > at least in its current form).
> 
> We can move the flag to the task_struct. There are still some bits left
> there. This would be trivial so that the oom usage doesn't stay in the
> way.

Here is the patch. While preparing it I found two bugs where TIF_MEMDIE was
checked on current rather than on the given task. I will separate those into
their own patches (was just too lazy to do that now). If the approach looks
reasonable then I will repost next week.
---
From 1baaa1f8f9568f95d8feccb28cf1994f8ca0df9f Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 24 Jun 2016 16:46:18 +0200
Subject: [PATCH] mm, oom: move TIF_MEMDIE to the task_struct

There is interest in dropping thread_info->flags usage to allow further clean
ups. TIF_MEMDIE stands in the way, so let's move it out of thread_info and
into task_struct. We cannot reuse task_struct::flags because the oom killer
sets the bit on a !current task without any locking, so add a dedicated
task_struct::memdie instead. It has to be an atomic_t because it is set and
cleared concurrently without any serializing lock.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/alpha/include/asm/thread_info.h      |  1 -
 arch/arc/include/asm/thread_info.h        |  2 --
 arch/arm/include/asm/thread_info.h        |  1 -
 arch/arm64/include/asm/thread_info.h      |  1 -
 arch/avr32/include/asm/thread_info.h      |  2 --
 arch/blackfin/include/asm/thread_info.h   |  1 -
 arch/c6x/include/asm/thread_info.h        |  1 -
 arch/cris/include/asm/thread_info.h       |  1 -
 arch/frv/include/asm/thread_info.h        |  1 -
 arch/h8300/include/asm/thread_info.h      |  1 -
 arch/hexagon/include/asm/thread_info.h    |  1 -
 arch/ia64/include/asm/thread_info.h       |  1 -
 arch/m32r/include/asm/thread_info.h       |  1 -
 arch/m68k/include/asm/thread_info.h       |  1 -
 arch/metag/include/asm/thread_info.h      |  1 -
 arch/microblaze/include/asm/thread_info.h |  1 -
 arch/mips/include/asm/thread_info.h       |  1 -
 arch/mn10300/include/asm/thread_info.h    |  1 -
 arch/nios2/include/asm/thread_info.h      |  1 -
 arch/openrisc/include/asm/thread_info.h   |  1 -
 arch/parisc/include/asm/thread_info.h     |  1 -
 arch/powerpc/include/asm/thread_info.h    |  1 -
 arch/s390/include/asm/thread_info.h       |  1 -
 arch/score/include/asm/thread_info.h      |  1 -
 arch/sh/include/asm/thread_info.h         |  1 -
 arch/sparc/include/asm/thread_info_32.h   |  1 -
 arch/sparc/include/asm/thread_info_64.h   |  1 -
 arch/tile/include/asm/thread_info.h       |  2 --
 arch/um/include/asm/thread_info.h         |  2 --
 arch/unicore32/include/asm/thread_info.h  |  1 -
 arch/x86/include/asm/thread_info.h        |  1 -
 arch/xtensa/include/asm/thread_info.h     |  1 -
 drivers/staging/android/lowmemorykiller.c |  2 +-
 fs/ext4/mballoc.c                         |  2 +-
 include/linux/sched.h                     |  2 ++
 kernel/cpuset.c                           | 12 ++++++------
 kernel/exit.c                             |  2 +-
 kernel/freezer.c                          |  2 +-
 mm/ksm.c                                  |  4 ++--
 mm/memcontrol.c                           |  2 +-
 mm/oom_kill.c                             | 20 ++++++++++----------
 mm/page_alloc.c                           |  6 +++---
 42 files changed, 28 insertions(+), 62 deletions(-)

diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 32e920a83ae5..126eaaf6559d 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -65,7 +65,6 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SYSCALL_AUDIT	4	/* syscall audit active */
 #define TIF_DIE_IF_KERNEL	9	/* dik recursion lock */
-#define TIF_MEMDIE		13	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	14	/* idle is polling for TIF_NEED_RESCHED */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
diff --git a/arch/arc/include/asm/thread_info.h b/arch/arc/include/asm/thread_info.h
index 3af67455659a..46d1fc1a073d 100644
--- a/arch/arc/include/asm/thread_info.h
+++ b/arch/arc/include/asm/thread_info.h
@@ -88,14 +88,12 @@ static inline __attribute_const__ struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACE	15	/* syscall trace active */
 
 /* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE		16
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
 #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
-#define _TIF_MEMDIE		(1<<TIF_MEMDIE)
 
 /* work to do on interrupt/exception return */
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index 776757d1604a..6277e56f15fd 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -146,7 +146,6 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
 
 #define TIF_NOHZ		12	/* in adaptive nohz mode */
 #define TIF_USING_IWMMXT	17
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	20
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index abd64bd1f6d9..d78b3b2945a9 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -114,7 +114,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_AUDIT	9
 #define TIF_SYSCALL_TRACEPOINT	10
 #define TIF_SECCOMP		11
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_FREEZE		19
 #define TIF_RESTORE_SIGMASK	20
 #define TIF_SINGLESTEP		21
diff --git a/arch/avr32/include/asm/thread_info.h b/arch/avr32/include/asm/thread_info.h
index d4d3079541ea..680be13234ab 100644
--- a/arch/avr32/include/asm/thread_info.h
+++ b/arch/avr32/include/asm/thread_info.h
@@ -70,7 +70,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED        2       /* rescheduling necessary */
 #define TIF_BREAKPOINT		4	/* enter monitor mode on return */
 #define TIF_SINGLE_STEP		5	/* single step in progress */
-#define TIF_MEMDIE		6	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	7	/* restore signal mask in do_signal */
 #define TIF_CPU_GOING_TO_SLEEP	8	/* CPU is entering sleep 0 mode */
 #define TIF_NOTIFY_RESUME	9	/* callback before returning to user */
@@ -82,7 +81,6 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_BREAKPOINT		(1 << TIF_BREAKPOINT)
 #define _TIF_SINGLE_STEP	(1 << TIF_SINGLE_STEP)
-#define _TIF_MEMDIE		(1 << TIF_MEMDIE)
 #define _TIF_CPU_GOING_TO_SLEEP (1 << TIF_CPU_GOING_TO_SLEEP)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 
diff --git a/arch/blackfin/include/asm/thread_info.h b/arch/blackfin/include/asm/thread_info.h
index 2966b93850a1..a45ff075ab6a 100644
--- a/arch/blackfin/include/asm/thread_info.h
+++ b/arch/blackfin/include/asm/thread_info.h
@@ -79,7 +79,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACE	0	/* syscall trace active */
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
-#define TIF_MEMDIE		4	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
 #define TIF_IRQ_SYNC		7	/* sync pipeline stage */
 #define TIF_NOTIFY_RESUME	8	/* callback before returning to user */
diff --git a/arch/c6x/include/asm/thread_info.h b/arch/c6x/include/asm/thread_info.h
index acc70c135ab8..22ff7b03641d 100644
--- a/arch/c6x/include/asm/thread_info.h
+++ b/arch/c6x/include/asm/thread_info.h
@@ -89,7 +89,6 @@ struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_RESTORE_SIGMASK	4	/* restore signal mask in do_signal() */
 
-#define TIF_MEMDIE		17	/* OOM killer killed process */
 
 #define TIF_WORK_MASK		0x00007FFE /* work on irq/exception return */
 #define TIF_ALLWORK_MASK	0x00007FFF /* work on any return to u-space */
diff --git a/arch/cris/include/asm/thread_info.h b/arch/cris/include/asm/thread_info.h
index 4ead1b40d2d7..79ebddc22aa3 100644
--- a/arch/cris/include/asm/thread_info.h
+++ b/arch/cris/include/asm/thread_info.h
@@ -70,7 +70,6 @@ struct thread_info {
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
diff --git a/arch/frv/include/asm/thread_info.h b/arch/frv/include/asm/thread_info.h
index ccba3b6ce918..993930f59d8e 100644
--- a/arch/frv/include/asm/thread_info.h
+++ b/arch/frv/include/asm/thread_info.h
@@ -86,7 +86,6 @@ register struct thread_info *__current_thread_info asm("gr15");
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		7	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
diff --git a/arch/h8300/include/asm/thread_info.h b/arch/h8300/include/asm/thread_info.h
index b408fe660cf8..68c10bce921e 100644
--- a/arch/h8300/include/asm/thread_info.h
+++ b/arch/h8300/include/asm/thread_info.h
@@ -73,7 +73,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_SINGLESTEP		3	/* singlestepping active */
-#define TIF_MEMDIE		4	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
 #define TIF_NOTIFY_RESUME	6	/* callback before returning to user */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
diff --git a/arch/hexagon/include/asm/thread_info.h b/arch/hexagon/include/asm/thread_info.h
index b80fe1db7b64..e55c7d0a1755 100644
--- a/arch/hexagon/include/asm/thread_info.h
+++ b/arch/hexagon/include/asm/thread_info.h
@@ -112,7 +112,6 @@ register struct thread_info *__current_thread_info asm(QUOTED_THREADINFO_REG);
 #define TIF_SINGLESTEP          4       /* restore ss @ return to usr mode */
 #define TIF_RESTORE_SIGMASK     6       /* restore sig mask in do_signal() */
 /* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE              17      /* OOM killer killed process */
 
 #define _TIF_SYSCALL_TRACE      (1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME      (1 << TIF_NOTIFY_RESUME)
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index aa995b67c3f5..77064b1d188a 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -97,7 +97,6 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	3	/* syscall auditing active */
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_NOTIFY_RESUME	6	/* resumption notification requested */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #define TIF_MCA_INIT		18	/* this task is processing MCA or INIT */
 #define TIF_DB_DISABLED		19	/* debug trap disabled for fsyscall */
 #define TIF_RESTORE_RSE		21	/* user RBS is newer than kernel RBS */
diff --git a/arch/m32r/include/asm/thread_info.h b/arch/m32r/include/asm/thread_info.h
index f630d9c30b28..bc54a574fad0 100644
--- a/arch/m32r/include/asm/thread_info.h
+++ b/arch/m32r/include/asm/thread_info.h
@@ -102,7 +102,6 @@ static inline unsigned int get_thread_fault_code(void)
 #define TIF_NOTIFY_RESUME	5	/* callback before returning to user */
 #define TIF_RESTORE_SIGMASK	8	/* restore signal mask in do_signal() */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
diff --git a/arch/m68k/include/asm/thread_info.h b/arch/m68k/include/asm/thread_info.h
index cee13c2e5161..ed497d31ea5d 100644
--- a/arch/m68k/include/asm/thread_info.h
+++ b/arch/m68k/include/asm/thread_info.h
@@ -68,7 +68,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	7	/* rescheduling necessary */
 #define TIF_DELAYED_TRACE	14	/* single step a syscall */
 #define TIF_SYSCALL_TRACE	15	/* syscall trace active */
-#define TIF_MEMDIE		16	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	18	/* restore signal mask in do_signal */
 
 #endif	/* _ASM_M68K_THREAD_INFO_H */
diff --git a/arch/metag/include/asm/thread_info.h b/arch/metag/include/asm/thread_info.h
index 32677cc278aa..c506e5a61714 100644
--- a/arch/metag/include/asm/thread_info.h
+++ b/arch/metag/include/asm/thread_info.h
@@ -111,7 +111,6 @@ static inline int kstack_end(void *addr)
 #define TIF_SECCOMP		5	/* secure computing */
 #define TIF_RESTORE_SIGMASK	6	/* restore signal mask in do_signal() */
 #define TIF_NOTIFY_RESUME	7	/* callback before returning to user */
-#define TIF_MEMDIE		8	/* is terminating due to OOM killer */
 #define TIF_SYSCALL_TRACEPOINT	9	/* syscall tracepoint instrumentation */
 
 
diff --git a/arch/microblaze/include/asm/thread_info.h b/arch/microblaze/include/asm/thread_info.h
index 383f387b4eee..281a365bec48 100644
--- a/arch/microblaze/include/asm/thread_info.h
+++ b/arch/microblaze/include/asm/thread_info.h
@@ -113,7 +113,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	3 /* rescheduling necessary */
 /* restore singlestep on return to user mode */
 #define TIF_SINGLESTEP		4
-#define TIF_MEMDIE		6	/* is terminating due to OOM killer */
 #define TIF_SYSCALL_AUDIT	9       /* syscall auditing active */
 #define TIF_SECCOMP		10      /* secure computing */
 
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index e309d8fcb516..3dd906330867 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -102,7 +102,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_UPROBE		6	/* breakpointed or singlestepping */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_FIXADE		20	/* Fix address errors in software */
 #define TIF_LOGADE		21	/* Log address errors to syslog */
diff --git a/arch/mn10300/include/asm/thread_info.h b/arch/mn10300/include/asm/thread_info.h
index 4861a78c7160..1dd24f251a98 100644
--- a/arch/mn10300/include/asm/thread_info.h
+++ b/arch/mn10300/include/asm/thread_info.h
@@ -145,7 +145,6 @@ void arch_release_thread_info(struct thread_info *ti);
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	+(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	+(1 << TIF_NOTIFY_RESUME)
diff --git a/arch/nios2/include/asm/thread_info.h b/arch/nios2/include/asm/thread_info.h
index d69c338bd19c..bf7d38c1c6e2 100644
--- a/arch/nios2/include/asm/thread_info.h
+++ b/arch/nios2/include/asm/thread_info.h
@@ -86,7 +86,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOTIFY_RESUME	1	/* resumption notification requested */
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
-#define TIF_MEMDIE		4	/* is terminating due to OOM killer */
 #define TIF_SECCOMP		5	/* secure computing */
 #define TIF_SYSCALL_AUDIT	6	/* syscall auditing active */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
diff --git a/arch/openrisc/include/asm/thread_info.h b/arch/openrisc/include/asm/thread_info.h
index 6e619a79a401..7678a1b2dc64 100644
--- a/arch/openrisc/include/asm/thread_info.h
+++ b/arch/openrisc/include/asm/thread_info.h
@@ -108,7 +108,6 @@ register struct thread_info *current_thread_info_reg asm("r10");
 #define TIF_RESTORE_SIGMASK     9
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling						 * TIF_NEED_RESCHED
 					 */
-#define TIF_MEMDIE              17
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
diff --git a/arch/parisc/include/asm/thread_info.h b/arch/parisc/include/asm/thread_info.h
index e96e693fd58c..bcebec0b9418 100644
--- a/arch/parisc/include/asm/thread_info.h
+++ b/arch/parisc/include/asm/thread_info.h
@@ -48,7 +48,6 @@ struct thread_info {
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	3	/* true if poll_idle() is polling TIF_NEED_RESCHED */
 #define TIF_32BIT               4       /* 32 bit binary */
-#define TIF_MEMDIE		5	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	6	/* restore saved signal mask */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_NOTIFY_RESUME	8	/* callback before returning to user */
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 7efee4a3240b..d744fa455dd2 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -97,7 +97,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACEPOINT	15	/* syscall tracepoint instrumentation */
 #define TIF_EMULATE_STACK_STORE	16	/* Is an instruction emulation
 						for stack store? */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #if defined(CONFIG_PPC64)
 #define TIF_ELF2ABI		18	/* function descriptors must die! */
 #endif
diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index 2fffc2c27581..8fc2704dd263 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -79,7 +79,6 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_SYSCALL_TRACEPOINT	6	/* syscall tracepoint instrumentation */
 #define TIF_UPROBE		7	/* breakpointed or single-stepping */
 #define TIF_31BIT		16	/* 32bit process */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	18	/* restore signal mask in do_signal() */
 #define TIF_SINGLE_STEP		19	/* This task is single stepped */
 #define TIF_BLOCK_STEP		20	/* This task is block stepped */
diff --git a/arch/score/include/asm/thread_info.h b/arch/score/include/asm/thread_info.h
index 7d9ffb15c477..f6e1cc89cef9 100644
--- a/arch/score/include/asm/thread_info.h
+++ b/arch/score/include/asm/thread_info.h
@@ -78,7 +78,6 @@ register struct thread_info *__current_thread_info __asm__("r28");
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_NOTIFY_RESUME	5	/* callback before returning to user */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
diff --git a/arch/sh/include/asm/thread_info.h b/arch/sh/include/asm/thread_info.h
index 2afa321157be..017f3993f384 100644
--- a/arch/sh/include/asm/thread_info.h
+++ b/arch/sh/include/asm/thread_info.h
@@ -117,7 +117,6 @@ extern void init_thread_xstate(void);
 #define TIF_NOTIFY_RESUME	7	/* callback before returning to user */
 #define TIF_SYSCALL_TRACEPOINT	8	/* for ftrace syscall instrumentation */
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
diff --git a/arch/sparc/include/asm/thread_info_32.h b/arch/sparc/include/asm/thread_info_32.h
index 229475f0d7ce..bcf81999db0b 100644
--- a/arch/sparc/include/asm/thread_info_32.h
+++ b/arch/sparc/include/asm/thread_info_32.h
@@ -110,7 +110,6 @@ register struct thread_info *current_thread_info_reg asm("g6");
 					 * this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	9	/* true if poll_idle() is polling
 					 * TIF_NEED_RESCHED */
-#define TIF_MEMDIE		10	/* is terminating due to OOM killer */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
diff --git a/arch/sparc/include/asm/thread_info_64.h b/arch/sparc/include/asm/thread_info_64.h
index bde59825d06c..63b3285b7d46 100644
--- a/arch/sparc/include/asm/thread_info_64.h
+++ b/arch/sparc/include/asm/thread_info_64.h
@@ -191,7 +191,6 @@ register struct thread_info *current_thread_info_reg asm("g6");
  *       an immediate value in instructions such as andcc.
  */
 /* flag bit 12 is available */
-#define TIF_MEMDIE		13	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	14
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index 4b7cef9e94e0..734d53f4b435 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -121,7 +121,6 @@ extern void _cpu_idle(void);
 #define TIF_SYSCALL_TRACE	4	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	5	/* syscall auditing active */
 #define TIF_SECCOMP		6	/* secure computing */
-#define TIF_MEMDIE		7	/* OOM killer at work */
 #define TIF_NOTIFY_RESUME	8	/* callback before returning to user */
 #define TIF_SYSCALL_TRACEPOINT	9	/* syscall tracepoint instrumentation */
 #define TIF_POLLING_NRFLAG	10	/* idle is polling for TIF_NEED_RESCHED */
@@ -134,7 +133,6 @@ extern void _cpu_idle(void);
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1<<TIF_SECCOMP)
-#define _TIF_MEMDIE		(1<<TIF_MEMDIE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
diff --git a/arch/um/include/asm/thread_info.h b/arch/um/include/asm/thread_info.h
index 053baff03674..b13047eeaede 100644
--- a/arch/um/include/asm/thread_info.h
+++ b/arch/um/include/asm/thread_info.h
@@ -58,7 +58,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_RESTART_BLOCK	4
-#define TIF_MEMDIE		5	/* is terminating due to OOM killer */
 #define TIF_SYSCALL_AUDIT	6
 #define TIF_RESTORE_SIGMASK	7
 #define TIF_NOTIFY_RESUME	8
@@ -67,7 +66,6 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
-#define _TIF_MEMDIE		(1 << TIF_MEMDIE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 
diff --git a/arch/unicore32/include/asm/thread_info.h b/arch/unicore32/include/asm/thread_info.h
index e79ad6d5b5b2..2487cf9dd41e 100644
--- a/arch/unicore32/include/asm/thread_info.h
+++ b/arch/unicore32/include/asm/thread_info.h
@@ -121,7 +121,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	1
 #define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
 #define TIF_SYSCALL_TRACE	8
-#define TIF_MEMDIE		18
 #define TIF_RESTORE_SIGMASK	20
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index ffae84df8a93..79a4b75e814c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -101,7 +101,6 @@ struct thread_info {
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_FORK		18	/* ret_from_fork */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
-#define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
diff --git a/arch/xtensa/include/asm/thread_info.h b/arch/xtensa/include/asm/thread_info.h
index 7be2400f745a..791a0a0b5827 100644
--- a/arch/xtensa/include/asm/thread_info.h
+++ b/arch/xtensa/include/asm/thread_info.h
@@ -108,7 +108,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_SINGLESTEP		3	/* restore singlestep on return to user mode */
-#define TIF_MEMDIE		5	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	6	/* restore signal mask in do_signal() */
 #define TIF_NOTIFY_RESUME	7	/* callback before returning to user */
 #define TIF_DB_DISABLED		8	/* debug trap disabled for syscall */
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 2509e5df7244..55d02f2376ab 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -131,7 +131,7 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 		if (!p)
 			continue;
 
-		if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
+		if (atomic_read(&p->memdie) &&
 		    time_before_eq(jiffies, lowmem_deathpending_timeout)) {
 			task_unlock(p);
 			rcu_read_unlock();
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c1ab3ec30423..ddc12f571c50 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4815,7 +4815,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 #endif
 	trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);
 
-	/* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
+	/* __GFP_NOFAIL: retry infinitely, ignore memdie tasks and memcg limit. */
 	err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
 				     GFP_NOFS|__GFP_NOFAIL);
 	if (err)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d81a1eb974a..4c91fc0c2e8e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1856,6 +1856,8 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+	/* oom victim - give it access to memory reserves */
+	atomic_t	memdie;
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 73e93e53884d..857fac0b973d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1038,9 +1038,9 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+	if (unlikely(atomic_read(&tsk->memdie)))
 		return;
-	if (current->flags & PF_EXITING) /* Let dying task have memory */
+	if (tsk->flags & PF_EXITING) /* Let dying task have memory */
 		return;
 
 	task_lock(tsk);
@@ -2496,12 +2496,12 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * If we're in interrupt, yes, we can always allocate.  If @node is set in
  * current's mems_allowed, yes.  If it's not a __GFP_HARDWALL request and this
  * node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
- * yes.  If current has access to memory reserves due to TIF_MEMDIE, yes.
+ * yes.  If current has access to memory reserves due to memdie, yes.
  * Otherwise, no.
  *
  * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
  * and do not allow allocations outside the current tasks cpuset
- * unless the task has been OOM killed as is marked TIF_MEMDIE.
+ * unless the task has been OOM killed as is marked memdie.
  * GFP_KERNEL allocations are not so marked, so can escape to the
  * nearest enclosing hardwalled ancestor cpuset.
  *
@@ -2524,7 +2524,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * affect that:
  *	in_interrupt - any node ok (current task context irrelevant)
  *	GFP_ATOMIC   - any node ok
- *	TIF_MEMDIE   - any node ok
+ *	memdie       - any node ok
  *	GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
  *	GFP_USER     - only nodes in current tasks mems allowed ok.
  */
@@ -2542,7 +2542,7 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+	if (unlikely(atomic_read(&current->memdie)))
 		return true;
 	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
 		return false;
diff --git a/kernel/exit.c b/kernel/exit.c
index 9e6e1356e6bb..8bfdda9bc99a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -434,7 +434,7 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	if (test_thread_flag(TIF_MEMDIE))
+	if (atomic_read(&current->memdie))
 		exit_oom_victim(tsk);
 }
 
diff --git a/kernel/freezer.c b/kernel/freezer.c
index a8900a3bc27a..e1bd9f2780fe 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -42,7 +42,7 @@ bool freezing_slow_path(struct task_struct *p)
 	if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
 		return false;
 
-	if (test_thread_flag(TIF_MEMDIE))
+	if (atomic_read(&p->memdie))
 		return false;
 
 	if (pm_nosig_freezing || cgroup_freezing(p))
diff --git a/mm/ksm.c b/mm/ksm.c
index 73d43bafd9fb..8d5a295fb955 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -396,11 +396,11 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 	 *
 	 * VM_FAULT_OOM: at the time of writing (late July 2009), setting
 	 * aside mem_cgroup limits, VM_FAULT_OOM would only be set if the
-	 * current task has TIF_MEMDIE set, and will be OOM killed on return
+	 * current task has memdie set, and will be OOM killed on return
 	 * to user; and ksmd, having no mm, would never be chosen for that.
 	 *
 	 * But if the mm is in a limited mem_cgroup, then the fault may fail
-	 * with VM_FAULT_OOM even if the current task is not TIF_MEMDIE; and
+	 * with VM_FAULT_OOM even if the current task is not memdie; and
 	 * even ksmd can fail in this way - though it's usually breaking ksm
 	 * just to undo a merge it made a moment before, so unlikely to oom.
 	 *
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e8f9e5e9291..df411de17a75 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1987,7 +1987,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * bypass the last charges so that they can exit quickly and
 	 * free their memory.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
+	if (unlikely(atomic_read(&current->memdie) ||
 		     fatal_signal_pending(current) ||
 		     current->flags & PF_EXITING))
 		goto force;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f744daa6..9d24007cdb82 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -473,7 +473,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 *				[...]
 	 *				out_of_memory
 	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
+	 *				    # no memdie task selects new victim
 	 *  unmap_page_range # frees some memory
 	 */
 	mutex_lock(&oom_lock);
@@ -593,7 +593,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	}
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
+	 * Clear memdie because the task shouldn't be sitting on a
 	 * reasonably reclaimable memory anymore or it is not a good candidate
 	 * for the oom victim right now because it cannot release its memory
 	 * itself nor by the oom reaper.
@@ -669,14 +669,14 @@ void mark_oom_victim(struct task_struct *tsk)
 {
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
-	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+	if (!atomic_add_unless(&tsk->memdie, 1, 1))
 		return;
 	atomic_inc(&tsk->signal->oom_victims);
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
 	 * any memory and livelock. freezing_slow_path will tell the freezer
-	 * that TIF_MEMDIE tasks should be ignored.
+	 * that memdie tasks should be ignored.
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
@@ -687,7 +687,7 @@ void mark_oom_victim(struct task_struct *tsk)
  */
 void exit_oom_victim(struct task_struct *tsk)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+	if (!atomic_add_unless(&tsk->memdie, -1, 0))
 		return;
 	atomic_dec(&tsk->signal->oom_victims);
 
@@ -771,7 +771,7 @@ bool task_will_free_mem(struct task_struct *task)
 	 * If the process has passed exit_mm we have to skip it because
 	 * we have lost a link to other tasks sharing this mm, we do not
 	 * have anything to reap and the task might then get stuck waiting
-	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 * for parent as zombie and we do not want it to hold memdie
 	 */
 	p = find_lock_task_mm(task);
 	if (!p)
@@ -836,7 +836,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 * its children or threads, just set memdie so it can die quickly
 	 */
 	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
@@ -893,7 +893,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
 	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
+	 * We should send SIGKILL before setting memdie in order to prevent
 	 * the OOM victim from depleting the memory reserves from the user
 	 * space under its control.
 	 */
@@ -1016,7 +1016,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * quickly exit and free its memory.
 	 *
 	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
+	 * memdie flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
 	if (current->mm && task_will_free_mem(current)) {
 		mark_oom_victim(current);
@@ -1096,7 +1096,7 @@ void pagefault_out_of_memory(void)
 		 * be a racing OOM victim for which oom_killer_disable()
 		 * is waiting for.
 		 */
-		WARN_ON(test_thread_flag(TIF_MEMDIE));
+		WARN_ON(atomic_read(&current->memdie));
 	}
 
 	mutex_unlock(&oom_lock);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89128d64d662..6c550afde6a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3050,7 +3050,7 @@ void warn_alloc_failed(gfp_t gfp_mask, unsigned int order, const char *fmt, ...)
 	 * of allowed nodes.
 	 */
 	if (!(gfp_mask & __GFP_NOMEMALLOC))
-		if (test_thread_flag(TIF_MEMDIE) ||
+		if (atomic_read(&current->memdie) ||
 		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
 			filter &= ~SHOW_MEM_FILTER_NODES;
 	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -3428,7 +3428,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (!in_interrupt() &&
 				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
+				 unlikely(atomic_read(&current->memdie))))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA
@@ -3637,7 +3637,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+	if (atomic_read(&current->memdie) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;
 
 	/*
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 269+ messages in thread

* [kernel-hardening] Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
@ 2016-06-24 15:06                     ` Michal Hocko
  0 siblings, 0 replies; 269+ messages in thread
From: Michal Hocko @ 2016-06-24 15:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andy Lutomirski, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Fri 24-06-16 16:05:58, Michal Hocko wrote:
> On Thu 23-06-16 20:52:21, Oleg Nesterov wrote:
> > On 06/23, Linus Torvalds wrote:
> > >
> > > On Thu, Jun 23, 2016 at 10:03 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > Let me quote my previous email ;)
> > > >
> > > >         And we can't free/nullify it when the parent/debugger reaps a zombie,
> > > >         say, mark_oom_victim() expects that get_task_struct() protects
> > > >         thread_info as well.
> > > >
> > > > probably we can fix all such users though...
> > >
> > > TIF_MEMDIE is indeed a potential problem, but I don't think
> > > mark_oom_victim() is actually problematic.
> > >
> > > mark_oom_victim() is called with either "current",
> > 
> > This is no longer true in -mm tree.
> > 
> > But I agree, this is fixable (and in fact I still hope TIF_MEMDIE will die,
> > at least in its current form).
> 
> We can move the flag to the task_struct. There are still some bits left
> there. This would be trivial so that the oom usage doesn't stay in the
> way.

Here is the patch. I've found two bugs where TIF_MEMDIE was checked
on current rather than the given task. I will separate them into their
own patches (I was just too lazy to do that now). If the approach looks
reasonable then I will repost next week.
---
From 1baaa1f8f9568f95d8feccb28cf1994f8ca0df9f Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 24 Jun 2016 16:46:18 +0200
Subject: [PATCH] mm, oom: move TIF_MEMDIE to the task_struct

There is interest in dropping thread_info->flags usage for further
cleanups. TIF_MEMDIE stands in the way, so let's move it out of
thread_info and into the task_struct. We cannot reuse task_struct::flags
because the oom killer sets the bit on a !current task without any
locking, so add an atomic_t task_struct::memdie instead; atomic
operations let the flag be set and cleared concurrently without
additional locking.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/alpha/include/asm/thread_info.h      |  1 -
 arch/arc/include/asm/thread_info.h        |  2 --
 arch/arm/include/asm/thread_info.h        |  1 -
 arch/arm64/include/asm/thread_info.h      |  1 -
 arch/avr32/include/asm/thread_info.h      |  2 --
 arch/blackfin/include/asm/thread_info.h   |  1 -
 arch/c6x/include/asm/thread_info.h        |  1 -
 arch/cris/include/asm/thread_info.h       |  1 -
 arch/frv/include/asm/thread_info.h        |  1 -
 arch/h8300/include/asm/thread_info.h      |  1 -
 arch/hexagon/include/asm/thread_info.h    |  1 -
 arch/ia64/include/asm/thread_info.h       |  1 -
 arch/m32r/include/asm/thread_info.h       |  1 -
 arch/m68k/include/asm/thread_info.h       |  1 -
 arch/metag/include/asm/thread_info.h      |  1 -
 arch/microblaze/include/asm/thread_info.h |  1 -
 arch/mips/include/asm/thread_info.h       |  1 -
 arch/mn10300/include/asm/thread_info.h    |  1 -
 arch/nios2/include/asm/thread_info.h      |  1 -
 arch/openrisc/include/asm/thread_info.h   |  1 -
 arch/parisc/include/asm/thread_info.h     |  1 -
 arch/powerpc/include/asm/thread_info.h    |  1 -
 arch/s390/include/asm/thread_info.h       |  1 -
 arch/score/include/asm/thread_info.h      |  1 -
 arch/sh/include/asm/thread_info.h         |  1 -
 arch/sparc/include/asm/thread_info_32.h   |  1 -
 arch/sparc/include/asm/thread_info_64.h   |  1 -
 arch/tile/include/asm/thread_info.h       |  2 --
 arch/um/include/asm/thread_info.h         |  2 --
 arch/unicore32/include/asm/thread_info.h  |  1 -
 arch/x86/include/asm/thread_info.h        |  1 -
 arch/xtensa/include/asm/thread_info.h     |  1 -
 drivers/staging/android/lowmemorykiller.c |  2 +-
 fs/ext4/mballoc.c                         |  2 +-
 include/linux/sched.h                     |  2 ++
 kernel/cpuset.c                           | 12 ++++++------
 kernel/exit.c                             |  2 +-
 kernel/freezer.c                          |  2 +-
 mm/ksm.c                                  |  4 ++--
 mm/memcontrol.c                           |  2 +-
 mm/oom_kill.c                             | 20 ++++++++++----------
 mm/page_alloc.c                           |  6 +++---
 42 files changed, 28 insertions(+), 62 deletions(-)

diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 32e920a83ae5..126eaaf6559d 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -65,7 +65,6 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SYSCALL_AUDIT	4	/* syscall audit active */
 #define TIF_DIE_IF_KERNEL	9	/* dik recursion lock */
-#define TIF_MEMDIE		13	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	14	/* idle is polling for TIF_NEED_RESCHED */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
diff --git a/arch/arc/include/asm/thread_info.h b/arch/arc/include/asm/thread_info.h
index 3af67455659a..46d1fc1a073d 100644
--- a/arch/arc/include/asm/thread_info.h
+++ b/arch/arc/include/asm/thread_info.h
@@ -88,14 +88,12 @@ static inline __attribute_const__ struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACE	15	/* syscall trace active */
 
 /* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE		16
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
 #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
-#define _TIF_MEMDIE		(1<<TIF_MEMDIE)
 
 /* work to do on interrupt/exception return */
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index 776757d1604a..6277e56f15fd 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -146,7 +146,6 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
 
 #define TIF_NOHZ		12	/* in adaptive nohz mode */
 #define TIF_USING_IWMMXT	17
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	20
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index abd64bd1f6d9..d78b3b2945a9 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -114,7 +114,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_AUDIT	9
 #define TIF_SYSCALL_TRACEPOINT	10
 #define TIF_SECCOMP		11
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_FREEZE		19
 #define TIF_RESTORE_SIGMASK	20
 #define TIF_SINGLESTEP		21
diff --git a/arch/avr32/include/asm/thread_info.h b/arch/avr32/include/asm/thread_info.h
index d4d3079541ea..680be13234ab 100644
--- a/arch/avr32/include/asm/thread_info.h
+++ b/arch/avr32/include/asm/thread_info.h
@@ -70,7 +70,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED        2       /* rescheduling necessary */
 #define TIF_BREAKPOINT		4	/* enter monitor mode on return */
 #define TIF_SINGLE_STEP		5	/* single step in progress */
-#define TIF_MEMDIE		6	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	7	/* restore signal mask in do_signal */
 #define TIF_CPU_GOING_TO_SLEEP	8	/* CPU is entering sleep 0 mode */
 #define TIF_NOTIFY_RESUME	9	/* callback before returning to user */
@@ -82,7 +81,6 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_BREAKPOINT		(1 << TIF_BREAKPOINT)
 #define _TIF_SINGLE_STEP	(1 << TIF_SINGLE_STEP)
-#define _TIF_MEMDIE		(1 << TIF_MEMDIE)
 #define _TIF_CPU_GOING_TO_SLEEP (1 << TIF_CPU_GOING_TO_SLEEP)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 
diff --git a/arch/blackfin/include/asm/thread_info.h b/arch/blackfin/include/asm/thread_info.h
index 2966b93850a1..a45ff075ab6a 100644
--- a/arch/blackfin/include/asm/thread_info.h
+++ b/arch/blackfin/include/asm/thread_info.h
@@ -79,7 +79,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACE	0	/* syscall trace active */
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
-#define TIF_MEMDIE		4	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
 #define TIF_IRQ_SYNC		7	/* sync pipeline stage */
 #define TIF_NOTIFY_RESUME	8	/* callback before returning to user */
diff --git a/arch/c6x/include/asm/thread_info.h b/arch/c6x/include/asm/thread_info.h
index acc70c135ab8..22ff7b03641d 100644
--- a/arch/c6x/include/asm/thread_info.h
+++ b/arch/c6x/include/asm/thread_info.h
@@ -89,7 +89,6 @@ struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_RESTORE_SIGMASK	4	/* restore signal mask in do_signal() */
 
-#define TIF_MEMDIE		17	/* OOM killer killed process */
 
 #define TIF_WORK_MASK		0x00007FFE /* work on irq/exception return */
 #define TIF_ALLWORK_MASK	0x00007FFF /* work on any return to u-space */
diff --git a/arch/cris/include/asm/thread_info.h b/arch/cris/include/asm/thread_info.h
index 4ead1b40d2d7..79ebddc22aa3 100644
--- a/arch/cris/include/asm/thread_info.h
+++ b/arch/cris/include/asm/thread_info.h
@@ -70,7 +70,6 @@ struct thread_info {
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
diff --git a/arch/frv/include/asm/thread_info.h b/arch/frv/include/asm/thread_info.h
index ccba3b6ce918..993930f59d8e 100644
--- a/arch/frv/include/asm/thread_info.h
+++ b/arch/frv/include/asm/thread_info.h
@@ -86,7 +86,6 @@ register struct thread_info *__current_thread_info asm("gr15");
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		7	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
diff --git a/arch/h8300/include/asm/thread_info.h b/arch/h8300/include/asm/thread_info.h
index b408fe660cf8..68c10bce921e 100644
--- a/arch/h8300/include/asm/thread_info.h
+++ b/arch/h8300/include/asm/thread_info.h
@@ -73,7 +73,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_SINGLESTEP		3	/* singlestepping active */
-#define TIF_MEMDIE		4	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
 #define TIF_NOTIFY_RESUME	6	/* callback before returning to user */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
diff --git a/arch/hexagon/include/asm/thread_info.h b/arch/hexagon/include/asm/thread_info.h
index b80fe1db7b64..e55c7d0a1755 100644
--- a/arch/hexagon/include/asm/thread_info.h
+++ b/arch/hexagon/include/asm/thread_info.h
@@ -112,7 +112,6 @@ register struct thread_info *__current_thread_info asm(QUOTED_THREADINFO_REG);
 #define TIF_SINGLESTEP          4       /* restore ss @ return to usr mode */
 #define TIF_RESTORE_SIGMASK     6       /* restore sig mask in do_signal() */
 /* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE              17      /* OOM killer killed process */
 
 #define _TIF_SYSCALL_TRACE      (1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME      (1 << TIF_NOTIFY_RESUME)
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index aa995b67c3f5..77064b1d188a 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -97,7 +97,6 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	3	/* syscall auditing active */
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_NOTIFY_RESUME	6	/* resumption notification requested */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #define TIF_MCA_INIT		18	/* this task is processing MCA or INIT */
 #define TIF_DB_DISABLED		19	/* debug trap disabled for fsyscall */
 #define TIF_RESTORE_RSE		21	/* user RBS is newer than kernel RBS */
diff --git a/arch/m32r/include/asm/thread_info.h b/arch/m32r/include/asm/thread_info.h
index f630d9c30b28..bc54a574fad0 100644
--- a/arch/m32r/include/asm/thread_info.h
+++ b/arch/m32r/include/asm/thread_info.h
@@ -102,7 +102,6 @@ static inline unsigned int get_thread_fault_code(void)
 #define TIF_NOTIFY_RESUME	5	/* callback before returning to user */
 #define TIF_RESTORE_SIGMASK	8	/* restore signal mask in do_signal() */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
diff --git a/arch/m68k/include/asm/thread_info.h b/arch/m68k/include/asm/thread_info.h
index cee13c2e5161..ed497d31ea5d 100644
--- a/arch/m68k/include/asm/thread_info.h
+++ b/arch/m68k/include/asm/thread_info.h
@@ -68,7 +68,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	7	/* rescheduling necessary */
 #define TIF_DELAYED_TRACE	14	/* single step a syscall */
 #define TIF_SYSCALL_TRACE	15	/* syscall trace active */
-#define TIF_MEMDIE		16	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	18	/* restore signal mask in do_signal */
 
 #endif	/* _ASM_M68K_THREAD_INFO_H */
diff --git a/arch/metag/include/asm/thread_info.h b/arch/metag/include/asm/thread_info.h
index 32677cc278aa..c506e5a61714 100644
--- a/arch/metag/include/asm/thread_info.h
+++ b/arch/metag/include/asm/thread_info.h
@@ -111,7 +111,6 @@ static inline int kstack_end(void *addr)
 #define TIF_SECCOMP		5	/* secure computing */
 #define TIF_RESTORE_SIGMASK	6	/* restore signal mask in do_signal() */
 #define TIF_NOTIFY_RESUME	7	/* callback before returning to user */
-#define TIF_MEMDIE		8	/* is terminating due to OOM killer */
 #define TIF_SYSCALL_TRACEPOINT	9	/* syscall tracepoint instrumentation */
 
 
diff --git a/arch/microblaze/include/asm/thread_info.h b/arch/microblaze/include/asm/thread_info.h
index 383f387b4eee..281a365bec48 100644
--- a/arch/microblaze/include/asm/thread_info.h
+++ b/arch/microblaze/include/asm/thread_info.h
@@ -113,7 +113,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	3 /* rescheduling necessary */
 /* restore singlestep on return to user mode */
 #define TIF_SINGLESTEP		4
-#define TIF_MEMDIE		6	/* is terminating due to OOM killer */
 #define TIF_SYSCALL_AUDIT	9       /* syscall auditing active */
 #define TIF_SECCOMP		10      /* secure computing */
 
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index e309d8fcb516..3dd906330867 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -102,7 +102,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_UPROBE		6	/* breakpointed or singlestepping */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_FIXADE		20	/* Fix address errors in software */
 #define TIF_LOGADE		21	/* Log address errors to syslog */
diff --git a/arch/mn10300/include/asm/thread_info.h b/arch/mn10300/include/asm/thread_info.h
index 4861a78c7160..1dd24f251a98 100644
--- a/arch/mn10300/include/asm/thread_info.h
+++ b/arch/mn10300/include/asm/thread_info.h
@@ -145,7 +145,6 @@ void arch_release_thread_info(struct thread_info *ti);
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	+(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	+(1 << TIF_NOTIFY_RESUME)
diff --git a/arch/nios2/include/asm/thread_info.h b/arch/nios2/include/asm/thread_info.h
index d69c338bd19c..bf7d38c1c6e2 100644
--- a/arch/nios2/include/asm/thread_info.h
+++ b/arch/nios2/include/asm/thread_info.h
@@ -86,7 +86,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOTIFY_RESUME	1	/* resumption notification requested */
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
-#define TIF_MEMDIE		4	/* is terminating due to OOM killer */
 #define TIF_SECCOMP		5	/* secure computing */
 #define TIF_SYSCALL_AUDIT	6	/* syscall auditing active */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
diff --git a/arch/openrisc/include/asm/thread_info.h b/arch/openrisc/include/asm/thread_info.h
index 6e619a79a401..7678a1b2dc64 100644
--- a/arch/openrisc/include/asm/thread_info.h
+++ b/arch/openrisc/include/asm/thread_info.h
@@ -108,7 +108,6 @@ register struct thread_info *current_thread_info_reg asm("r10");
 #define TIF_RESTORE_SIGMASK     9
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling						 * TIF_NEED_RESCHED
 					 */
-#define TIF_MEMDIE              17
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
diff --git a/arch/parisc/include/asm/thread_info.h b/arch/parisc/include/asm/thread_info.h
index e96e693fd58c..bcebec0b9418 100644
--- a/arch/parisc/include/asm/thread_info.h
+++ b/arch/parisc/include/asm/thread_info.h
@@ -48,7 +48,6 @@ struct thread_info {
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	3	/* true if poll_idle() is polling TIF_NEED_RESCHED */
 #define TIF_32BIT               4       /* 32 bit binary */
-#define TIF_MEMDIE		5	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	6	/* restore saved signal mask */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_NOTIFY_RESUME	8	/* callback before returning to user */
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 7efee4a3240b..d744fa455dd2 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -97,7 +97,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACEPOINT	15	/* syscall tracepoint instrumentation */
 #define TIF_EMULATE_STACK_STORE	16	/* Is an instruction emulation
 						for stack store? */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #if defined(CONFIG_PPC64)
 #define TIF_ELF2ABI		18	/* function descriptors must die! */
 #endif
diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index 2fffc2c27581..8fc2704dd263 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -79,7 +79,6 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_SYSCALL_TRACEPOINT	6	/* syscall tracepoint instrumentation */
 #define TIF_UPROBE		7	/* breakpointed or single-stepping */
 #define TIF_31BIT		16	/* 32bit process */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	18	/* restore signal mask in do_signal() */
 #define TIF_SINGLE_STEP		19	/* This task is single stepped */
 #define TIF_BLOCK_STEP		20	/* This task is block stepped */
diff --git a/arch/score/include/asm/thread_info.h b/arch/score/include/asm/thread_info.h
index 7d9ffb15c477..f6e1cc89cef9 100644
--- a/arch/score/include/asm/thread_info.h
+++ b/arch/score/include/asm/thread_info.h
@@ -78,7 +78,6 @@ register struct thread_info *__current_thread_info __asm__("r28");
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_NOTIFY_RESUME	5	/* callback before returning to user */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
diff --git a/arch/sh/include/asm/thread_info.h b/arch/sh/include/asm/thread_info.h
index 2afa321157be..017f3993f384 100644
--- a/arch/sh/include/asm/thread_info.h
+++ b/arch/sh/include/asm/thread_info.h
@@ -117,7 +117,6 @@ extern void init_thread_xstate(void);