* [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

Hi all-

Since the dawn of time, a kernel stack overflow has been a real PITA
to debug, has caused nondeterministic crashes some time after the
actual overflow, and has generally been easy to exploit for root.

With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
that enable it (just x86 for now) get virtually mapped stacks with
guard pages.  This causes reliable faults when the stack overflows.

If the arch implements it well, we get a nice OOPS on stack overflow
(as opposed to panicking directly or otherwise exploding badly).  On
x86, the OOPS is nice, has a usable call trace, and the overflowing
task is killed cleanly.

This series (starting with this version, v4) also extensively cleans
up thread_info.  thread_info has been partially redundant with
thread_struct for a long time -- both are places for arch code to
add additional per-task variables.  thread_struct is much cleaner:
it's always in task_struct, and there's nothing particularly magical
about it.  So this series contains a bunch of cleanups on x86 to
move almost everything from thread_info to thread_struct (which,
even by itself, deletes more code than it adds) and to remove x86's
dependence on thread_info's position on the stack.  Then it opts x86
into a new config option THREAD_INFO_IN_TASK to get rid of
arch-specific thread_info entirely and simply embed a defanged
thread_info (containing only flags) and 'int cpu' into task_struct.
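
As a rough sketch of the end state (field names and layout simplified
here for illustration; the real patches differ in detail):

	struct thread_info {
		unsigned long	flags;		/* low-level flags */
	};

	struct task_struct {
	#ifdef CONFIG_THREAD_INFO_IN_TASK
		/*
		 * Keep this first so the thread_info can be found by
		 * simply casting the task_struct pointer.
		 */
		struct thread_info	thread_info;
		int			cpu;	/* current CPU */
	#endif
		/* ... everything else ... */
	};

With that layout, current_thread_info() no longer depends on where the
stack is; it is essentially just a cast of 'current'.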

Once thread_info stops being magical, there's another benefit: we
can free the thread stack as soon as the task is dead (without
waiting for RCU) and then, if vmapped stacks are in use, cache the
entire stack for reuse on the same cpu.

This seems to be an overall speedup of about 0.5-1 µs per
pthread_create/join in a simple test -- a percpu cache of vmalloced
stacks appears to be a bit faster than a high-order stack
allocation, at least when the cache hits.  (I expect that workloads
with a low cache hit rate are likely to be dominated by other
effects anyway.)
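
For reference, the per-cpu cache is conceptually along these lines (a
sketch only; the helper names here are invented for illustration and
the real patch differs in detail):

	#include <linux/percpu.h>
	#include <linux/vmalloc.h>

	#define NR_CACHED_STACKS 2
	static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);

	/* On allocation: try to reuse a recently freed vmap stack. */
	static void *try_get_cached_stack(struct task_struct *tsk)
	{
		int i;

		for (i = 0; i < NR_CACHED_STACKS; i++) {
			struct vm_struct *s;

			s = this_cpu_xchg(cached_stacks[i], NULL);
			if (!s)
				continue;
			tsk->stack_vm_area = s;
			return s->addr;
		}
		return NULL;	/* fall back to __vmalloc_node_range() */
	}

	/* On free: park the stack in the cache instead of vfree()ing it. */
	static bool try_cache_stack(struct vm_struct *vm)
	{
		int i;

		for (i = 0; i < NR_CACHED_STACKS; i++) {
			if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) == NULL)
				return true;
		}
		return false;	/* cache is full; the caller vfree()s it */
	}

A cache hit skips both the vmalloc and the underlying page allocations,
which is presumably where most of the pthread_create/join win comes from.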

This does not address interrupt stacks.

It's worth noting that s390 has an arch-specific gcc feature that
detects stack overflows by adjusting function prologues.  Arches
with features like that may wish to avoid using vmapped stacks to
minimize the performance hit.

Known issues:
 - tcp md5, virtio_net, and virtio_console will have issues.  Eric Dumazet
   has a patch for tcp md5, and Michael Tsirkin says he'll fix virtio_net
   and virtio_console.

Changes from v3:
 - Minor cleanups
 - Rebased onto Linus' tree
 - All the thread_info stuff is new

Changes from v2:
 - Delete kernel_unmap_pages_in_pgd() rather than hardening it (Borislav)
 - Fix sub-page stack accounting better (Josh)

Changes from v1:
 - Fix rewind_stack_and_do_exit (Josh)
 - Fix deadlock under load
 - Clean up generic stack vmalloc code
 - Many other minor fixes

Andy Lutomirski (25):
  bluetooth: Switch SMP to crypto_cipher_encrypt_one()
  x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
  x86/mm: Remove kernel_unmap_pages_in_pgd() and
    efi_cleanup_page_tables()
  mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
  mm: Fix memcg stack accounting for sub-page stacks
  dma-api: Teach the "DMA-from-stack" check about vmapped stacks
  fork: Add generic vmalloced stack support
  x86/die: Don't try to recover from an OOPS on a non-default stack
  x86/dumpstack: When OOPSing, rewind the stack before do_exit
  x86/dumpstack: When dumping stack bytes due to OOPS, start with
    regs->sp
  x86/dumpstack: Try harder to get a call trace on stack overflow
  x86/dumpstack/64: Handle faults when printing the "Stack:" part of an
    OOPS
  x86/mm/64: Enable vmapped stacks
  x86/mm: Improve stack-overflow #PF handling
  x86: Move uaccess_err and sig_on_uaccess_err to thread_struct
  x86: Move addr_limit to thread_struct
  signal: Consolidate {TS,TLF}_RESTORE_SIGMASK code
  x86/smp: Remove stack_smp_processor_id()
  x86/smp: Remove unnecessary initialization of thread_info::cpu
  x86/asm: Move 'status' from struct thread_info to struct thread_struct
  kdb: Use task_cpu() instead of task_thread_info()->cpu
  sched: Allow putting thread_info into task_struct
  x86: Move thread_info into task_struct
  sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
  fork: Cache two thread stacks per cpu if CONFIG_VMAP_STACK is set

Herbert Xu (1):
  rxrpc: Avoid using stack memory in SG lists in rxkad

Ingo Molnar (1):
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()

Linus Torvalds (2):
  x86/entry: Get rid of pt_regs_to_thread_info()
  um: Stop conflating task_struct::stack with thread_info

 arch/Kconfig                              |  29 ++++++
 arch/alpha/include/asm/thread_info.h      |  27 -----
 arch/ia64/include/asm/thread_info.h       |  30 +-----
 arch/microblaze/include/asm/thread_info.h |  27 -----
 arch/powerpc/include/asm/thread_info.h    |  25 -----
 arch/sh/include/asm/thread_info.h         |  26 -----
 arch/sparc/include/asm/thread_info_64.h   |  24 -----
 arch/tile/include/asm/thread_info.h       |  27 -----
 arch/x86/Kconfig                          |   2 +
 arch/x86/entry/common.c                   |  25 ++---
 arch/x86/entry/entry_32.S                 |  11 +++
 arch/x86/entry/entry_64.S                 |  20 +++-
 arch/x86/entry/vsyscall/vsyscall_64.c     |   6 +-
 arch/x86/include/asm/checksum_32.h        |   3 +-
 arch/x86/include/asm/cpu.h                |   1 -
 arch/x86/include/asm/efi.h                |   1 -
 arch/x86/include/asm/pgtable_types.h      |   2 -
 arch/x86/include/asm/processor.h          |  32 ++++--
 arch/x86/include/asm/smp.h                |   6 --
 arch/x86/include/asm/switch_to.h          |  34 ++++++-
 arch/x86/include/asm/syscall.h            |  23 +----
 arch/x86/include/asm/thread_info.h        | 102 +------------------
 arch/x86/include/asm/traps.h              |   6 ++
 arch/x86/include/asm/uaccess.h            |  10 +-
 arch/x86/kernel/asm-offsets.c             |   5 +-
 arch/x86/kernel/cpu/common.c              |   2 +-
 arch/x86/kernel/dumpstack.c               |  20 +++-
 arch/x86/kernel/dumpstack_32.c            |   4 +-
 arch/x86/kernel/dumpstack_64.c            |  16 ++-
 arch/x86/kernel/fpu/init.c                |   1 -
 arch/x86/kernel/irq_64.c                  |   3 +-
 arch/x86/kernel/process.c                 |   6 +-
 arch/x86/kernel/process_64.c              |   4 +-
 arch/x86/kernel/ptrace.c                  |   2 +-
 arch/x86/kernel/smpboot.c                 |   1 -
 arch/x86/kernel/traps.c                   |  32 ++++++
 arch/x86/lib/copy_user_64.S               |   8 +-
 arch/x86/lib/csum-wrappers_64.c           |   1 +
 arch/x86/lib/getuser.S                    |  20 ++--
 arch/x86/lib/putuser.S                    |  10 +-
 arch/x86/lib/usercopy_64.c                |   2 +-
 arch/x86/mm/extable.c                     |   2 +-
 arch/x86/mm/fault.c                       |  41 +++++++-
 arch/x86/mm/init_64.c                     |  27 -----
 arch/x86/mm/pageattr.c                    |  32 +-----
 arch/x86/mm/tlb.c                         |  15 +++
 arch/x86/platform/efi/efi.c               |   2 -
 arch/x86/platform/efi/efi_32.c            |   3 -
 arch/x86/platform/efi/efi_64.c            |   5 -
 arch/x86/um/ptrace_32.c                   |   8 +-
 drivers/base/node.c                       |   3 +-
 drivers/pnp/isapnp/proc.c                 |   2 +-
 fs/proc/meminfo.c                         |   2 +-
 include/linux/init_task.h                 |   9 ++
 include/linux/kdb.h                       |   2 +-
 include/linux/memcontrol.h                |   2 +-
 include/linux/mmzone.h                    |   2 +-
 include/linux/sched.h                     | 115 +++++++++++++++++++++-
 include/linux/thread_info.h               |  56 +++--------
 init/Kconfig                              |   3 +
 init/init_task.c                          |   7 +-
 kernel/fork.c                             | 158 +++++++++++++++++++++++++-----
 kernel/sched/core.c                       |   9 ++
 kernel/sched/sched.h                      |   4 +
 lib/bitmap.c                              |   2 +-
 lib/dma-debug.c                           |  39 ++++++--
 mm/memcontrol.c                           |   2 +-
 mm/page_alloc.c                           |   3 +-
 net/bluetooth/smp.c                       |  67 ++++++-------
 net/rxrpc/ar-internal.h                   |   1 +
 net/rxrpc/rxkad.c                         | 103 ++++++++-----------
 71 files changed, 714 insertions(+), 648 deletions(-)

-- 
2.7.4


* [PATCH v4 01/29] bluetooth: Switch SMP to crypto_cipher_encrypt_one()
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski, Marcel Holtmann,
	Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth,
	netdev

SMP does ECB crypto on stack buffers.  This is complicated and
fragile, and it will not work if the stack is virtually allocated.

Switch to the crypto_cipher interface, which is simpler and safer.

Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-bluetooth@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-and-tested-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 net/bluetooth/smp.c | 67 ++++++++++++++++++++++-------------------------------
 1 file changed, 28 insertions(+), 39 deletions(-)

diff --git a/net/bluetooth/smp.c b/net/bluetooth/smp.c
index 50976a6481f3..4c1a16a96ae5 100644
--- a/net/bluetooth/smp.c
+++ b/net/bluetooth/smp.c
@@ -22,9 +22,9 @@
 
 #include <linux/debugfs.h>
 #include <linux/scatterlist.h>
+#include <linux/crypto.h>
 #include <crypto/b128ops.h>
 #include <crypto/hash.h>
-#include <crypto/skcipher.h>
 
 #include <net/bluetooth/bluetooth.h>
 #include <net/bluetooth/hci_core.h>
@@ -88,7 +88,7 @@ struct smp_dev {
 	u8			min_key_size;
 	u8			max_key_size;
 
-	struct crypto_skcipher	*tfm_aes;
+	struct crypto_cipher	*tfm_aes;
 	struct crypto_shash	*tfm_cmac;
 };
 
@@ -127,7 +127,7 @@ struct smp_chan {
 	u8			dhkey[32];
 	u8			mackey[16];
 
-	struct crypto_skcipher	*tfm_aes;
+	struct crypto_cipher	*tfm_aes;
 	struct crypto_shash	*tfm_cmac;
 };
 
@@ -361,10 +361,8 @@ static int smp_h6(struct crypto_shash *tfm_cmac, const u8 w[16],
  * s1 and ah.
  */
 
-static int smp_e(struct crypto_skcipher *tfm, const u8 *k, u8 *r)
+static int smp_e(struct crypto_cipher *tfm, const u8 *k, u8 *r)
 {
-	SKCIPHER_REQUEST_ON_STACK(req, tfm);
-	struct scatterlist sg;
 	uint8_t tmp[16], data[16];
 	int err;
 
@@ -378,7 +376,7 @@ static int smp_e(struct crypto_skcipher *tfm, const u8 *k, u8 *r)
 	/* The most significant octet of key corresponds to k[0] */
 	swap_buf(k, tmp, 16);
 
-	err = crypto_skcipher_setkey(tfm, tmp, 16);
+	err = crypto_cipher_setkey(tfm, tmp, 16);
 	if (err) {
 		BT_ERR("cipher setkey failed: %d", err);
 		return err;
@@ -387,16 +385,7 @@ static int smp_e(struct crypto_skcipher *tfm, const u8 *k, u8 *r)
 	/* Most significant octet of plaintextData corresponds to data[0] */
 	swap_buf(r, data, 16);
 
-	sg_init_one(&sg, data, 16);
-
-	skcipher_request_set_tfm(req, tfm);
-	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg, &sg, 16, NULL);
-
-	err = crypto_skcipher_encrypt(req);
-	skcipher_request_zero(req);
-	if (err)
-		BT_ERR("Encrypt data error %d", err);
+	crypto_cipher_encrypt_one(tfm, data, data);
 
 	/* Most significant octet of encryptedData corresponds to data[0] */
 	swap_buf(data, r, 16);
@@ -406,7 +395,7 @@ static int smp_e(struct crypto_skcipher *tfm, const u8 *k, u8 *r)
 	return err;
 }
 
-static int smp_c1(struct crypto_skcipher *tfm_aes, const u8 k[16],
+static int smp_c1(struct crypto_cipher *tfm_aes, const u8 k[16],
 		  const u8 r[16], const u8 preq[7], const u8 pres[7], u8 _iat,
 		  const bdaddr_t *ia, u8 _rat, const bdaddr_t *ra, u8 res[16])
 {
@@ -455,7 +444,7 @@ static int smp_c1(struct crypto_skcipher *tfm_aes, const u8 k[16],
 	return err;
 }
 
-static int smp_s1(struct crypto_skcipher *tfm_aes, const u8 k[16],
+static int smp_s1(struct crypto_cipher *tfm_aes, const u8 k[16],
 		  const u8 r1[16], const u8 r2[16], u8 _r[16])
 {
 	int err;
@@ -471,7 +460,7 @@ static int smp_s1(struct crypto_skcipher *tfm_aes, const u8 k[16],
 	return err;
 }
 
-static int smp_ah(struct crypto_skcipher *tfm, const u8 irk[16],
+static int smp_ah(struct crypto_cipher *tfm, const u8 irk[16],
 		  const u8 r[3], u8 res[3])
 {
 	u8 _res[16];
@@ -759,7 +748,7 @@ static void smp_chan_destroy(struct l2cap_conn *conn)
 	kzfree(smp->slave_csrk);
 	kzfree(smp->link_key);
 
-	crypto_free_skcipher(smp->tfm_aes);
+	crypto_free_cipher(smp->tfm_aes);
 	crypto_free_shash(smp->tfm_cmac);
 
 	/* Ensure that we don't leave any debug key around if debug key
@@ -1359,9 +1348,9 @@ static struct smp_chan *smp_chan_create(struct l2cap_conn *conn)
 	if (!smp)
 		return NULL;
 
-	smp->tfm_aes = crypto_alloc_skcipher("ecb(aes)", 0, CRYPTO_ALG_ASYNC);
+	smp->tfm_aes = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(smp->tfm_aes)) {
-		BT_ERR("Unable to create ECB crypto context");
+		BT_ERR("Unable to create AES crypto context");
 		kzfree(smp);
 		return NULL;
 	}
@@ -1369,7 +1358,7 @@ static struct smp_chan *smp_chan_create(struct l2cap_conn *conn)
 	smp->tfm_cmac = crypto_alloc_shash("cmac(aes)", 0, 0);
 	if (IS_ERR(smp->tfm_cmac)) {
 		BT_ERR("Unable to create CMAC crypto context");
-		crypto_free_skcipher(smp->tfm_aes);
+		crypto_free_cipher(smp->tfm_aes);
 		kzfree(smp);
 		return NULL;
 	}
@@ -3120,7 +3109,7 @@ static struct l2cap_chan *smp_add_cid(struct hci_dev *hdev, u16 cid)
 {
 	struct l2cap_chan *chan;
 	struct smp_dev *smp;
-	struct crypto_skcipher *tfm_aes;
+	struct crypto_cipher *tfm_aes;
 	struct crypto_shash *tfm_cmac;
 
 	if (cid == L2CAP_CID_SMP_BREDR) {
@@ -3132,9 +3121,9 @@ static struct l2cap_chan *smp_add_cid(struct hci_dev *hdev, u16 cid)
 	if (!smp)
 		return ERR_PTR(-ENOMEM);
 
-	tfm_aes = crypto_alloc_skcipher("ecb(aes)", 0, CRYPTO_ALG_ASYNC);
+	tfm_aes = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(tfm_aes)) {
-		BT_ERR("Unable to create ECB crypto context");
+		BT_ERR("Unable to create AES crypto context");
 		kzfree(smp);
 		return ERR_CAST(tfm_aes);
 	}
@@ -3142,7 +3131,7 @@ static struct l2cap_chan *smp_add_cid(struct hci_dev *hdev, u16 cid)
 	tfm_cmac = crypto_alloc_shash("cmac(aes)", 0, 0);
 	if (IS_ERR(tfm_cmac)) {
 		BT_ERR("Unable to create CMAC crypto context");
-		crypto_free_skcipher(tfm_aes);
+		crypto_free_cipher(tfm_aes);
 		kzfree(smp);
 		return ERR_CAST(tfm_cmac);
 	}
@@ -3156,7 +3145,7 @@ create_chan:
 	chan = l2cap_chan_create();
 	if (!chan) {
 		if (smp) {
-			crypto_free_skcipher(smp->tfm_aes);
+			crypto_free_cipher(smp->tfm_aes);
 			crypto_free_shash(smp->tfm_cmac);
 			kzfree(smp);
 		}
@@ -3203,7 +3192,7 @@ static void smp_del_chan(struct l2cap_chan *chan)
 	smp = chan->data;
 	if (smp) {
 		chan->data = NULL;
-		crypto_free_skcipher(smp->tfm_aes);
+		crypto_free_cipher(smp->tfm_aes);
 		crypto_free_shash(smp->tfm_cmac);
 		kzfree(smp);
 	}
@@ -3440,7 +3429,7 @@ void smp_unregister(struct hci_dev *hdev)
 
 #if IS_ENABLED(CONFIG_BT_SELFTEST_SMP)
 
-static int __init test_ah(struct crypto_skcipher *tfm_aes)
+static int __init test_ah(struct crypto_cipher *tfm_aes)
 {
 	const u8 irk[16] = {
 			0x9b, 0x7d, 0x39, 0x0a, 0xa6, 0x10, 0x10, 0x34,
@@ -3460,7 +3449,7 @@ static int __init test_ah(struct crypto_skcipher *tfm_aes)
 	return 0;
 }
 
-static int __init test_c1(struct crypto_skcipher *tfm_aes)
+static int __init test_c1(struct crypto_cipher *tfm_aes)
 {
 	const u8 k[16] = {
 			0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
@@ -3490,7 +3479,7 @@ static int __init test_c1(struct crypto_skcipher *tfm_aes)
 	return 0;
 }
 
-static int __init test_s1(struct crypto_skcipher *tfm_aes)
+static int __init test_s1(struct crypto_cipher *tfm_aes)
 {
 	const u8 k[16] = {
 			0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
@@ -3686,7 +3675,7 @@ static const struct file_operations test_smp_fops = {
 	.llseek		= default_llseek,
 };
 
-static int __init run_selftests(struct crypto_skcipher *tfm_aes,
+static int __init run_selftests(struct crypto_cipher *tfm_aes,
 				struct crypto_shash *tfm_cmac)
 {
 	ktime_t calltime, delta, rettime;
@@ -3764,27 +3753,27 @@ done:
 
 int __init bt_selftest_smp(void)
 {
-	struct crypto_skcipher *tfm_aes;
+	struct crypto_cipher *tfm_aes;
 	struct crypto_shash *tfm_cmac;
 	int err;
 
-	tfm_aes = crypto_alloc_skcipher("ecb(aes)", 0, CRYPTO_ALG_ASYNC);
+	tfm_aes = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(tfm_aes)) {
-		BT_ERR("Unable to create ECB crypto context");
+		BT_ERR("Unable to create AES crypto context");
 		return PTR_ERR(tfm_aes);
 	}
 
 	tfm_cmac = crypto_alloc_shash("cmac(aes)", 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(tfm_cmac)) {
 		BT_ERR("Unable to create CMAC crypto context");
-		crypto_free_skcipher(tfm_aes);
+		crypto_free_cipher(tfm_aes);
 		return PTR_ERR(tfm_cmac);
 	}
 
 	err = run_selftests(tfm_aes, tfm_cmac);
 
 	crypto_free_shash(tfm_cmac);
-	crypto_free_skcipher(tfm_aes);
+	crypto_free_cipher(tfm_aes);
 
 	return err;
 }
-- 
2.7.4


* [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Herbert Xu, Andy Lutomirski

From: Herbert Xu <herbert@gondor.apana.org.au>

rxkad uses stack memory in SG lists which would not work if stacks
were allocated from vmalloc memory.  In fact, in most cases this
isn't even necessary as the stack memory ends up getting copied
over to kmalloc memory.

This patch eliminates all the unnecessary stack memory uses by
supplying the final destination directly to the crypto API.  In
two instances where a temporary buffer is actually needed we also
switch to using the skb->cb area instead of the stack.

Finally, there is no need to split a buffer that straddles a page
boundary into two SG entries, so the code dealing with that has been
removed.

Message-Id: <20160623064137.GA8958@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 net/rxrpc/ar-internal.h |   1 +
 net/rxrpc/rxkad.c       | 103 ++++++++++++++++++++----------------------------
 2 files changed, 44 insertions(+), 60 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index f0b807a163fa..8ee5933982f3 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -277,6 +277,7 @@ struct rxrpc_connection {
 	struct key		*key;		/* security for this connection (client) */
 	struct key		*server_key;	/* security for this service */
 	struct crypto_skcipher	*cipher;	/* encryption handle */
+	struct rxrpc_crypt	csum_iv_head;	/* leading block for csum_iv */
 	struct rxrpc_crypt	csum_iv;	/* packet checksum base */
 	unsigned long		events;
 #define RXRPC_CONN_CHALLENGE	0		/* send challenge packet */
diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index bab56ed649ba..a28a3c6fdf1d 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -105,11 +105,9 @@ static void rxkad_prime_packet_security(struct rxrpc_connection *conn)
 {
 	struct rxrpc_key_token *token;
 	SKCIPHER_REQUEST_ON_STACK(req, conn->cipher);
-	struct scatterlist sg[2];
+	struct rxrpc_crypt *csum_iv;
+	struct scatterlist sg;
 	struct rxrpc_crypt iv;
-	struct {
-		__be32 x[4];
-	} tmpbuf __attribute__((aligned(16))); /* must all be in same page */
 
 	_enter("");
 
@@ -119,24 +117,21 @@ static void rxkad_prime_packet_security(struct rxrpc_connection *conn)
 	token = conn->key->payload.data[0];
 	memcpy(&iv, token->kad->session_key, sizeof(iv));
 
-	tmpbuf.x[0] = htonl(conn->epoch);
-	tmpbuf.x[1] = htonl(conn->cid);
-	tmpbuf.x[2] = 0;
-	tmpbuf.x[3] = htonl(conn->security_ix);
+	csum_iv = &conn->csum_iv_head;
+	csum_iv[0].x[0] = htonl(conn->epoch);
+	csum_iv[0].x[1] = htonl(conn->cid);
+	csum_iv[1].x[0] = 0;
+	csum_iv[1].x[1] = htonl(conn->security_ix);
 
-	sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-	sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+	sg_init_one(&sg, csum_iv, 16);
 
 	skcipher_request_set_tfm(req, conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
+	skcipher_request_set_crypt(req, &sg, &sg, 16, iv.x);
 
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 
-	memcpy(&conn->csum_iv, &tmpbuf.x[2], sizeof(conn->csum_iv));
-	ASSERTCMP((u32 __force)conn->csum_iv.n[0], ==, (u32 __force)tmpbuf.x[2]);
-
 	_leave("");
 }
 
@@ -150,12 +145,9 @@ static int rxkad_secure_packet_auth(const struct rxrpc_call *call,
 {
 	struct rxrpc_skb_priv *sp;
 	SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
+	struct rxkad_level1_hdr hdr;
 	struct rxrpc_crypt iv;
-	struct scatterlist sg[2];
-	struct {
-		struct rxkad_level1_hdr hdr;
-		__be32	first;	/* first four bytes of data and padding */
-	} tmpbuf __attribute__((aligned(8))); /* must all be in same page */
+	struct scatterlist sg;
 	u16 check;
 
 	sp = rxrpc_skb(skb);
@@ -165,24 +157,21 @@ static int rxkad_secure_packet_auth(const struct rxrpc_call *call,
 	check = sp->hdr.seq ^ sp->hdr.callNumber;
 	data_size |= (u32)check << 16;
 
-	tmpbuf.hdr.data_size = htonl(data_size);
-	memcpy(&tmpbuf.first, sechdr + 4, sizeof(tmpbuf.first));
+	hdr.data_size = htonl(data_size);
+	memcpy(sechdr, &hdr, sizeof(hdr));
 
 	/* start the encryption afresh */
 	memset(&iv, 0, sizeof(iv));
 
-	sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-	sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+	sg_init_one(&sg, sechdr, 8);
 
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
+	skcipher_request_set_crypt(req, &sg, &sg, 8, iv.x);
 
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 
-	memcpy(sechdr, &tmpbuf, sizeof(tmpbuf));
-
 	_leave(" = 0");
 	return 0;
 }
@@ -196,8 +185,7 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 				       void *sechdr)
 {
 	const struct rxrpc_key_token *token;
-	struct rxkad_level2_hdr rxkhdr
-		__attribute__((aligned(8))); /* must be all on one page */
+	struct rxkad_level2_hdr rxkhdr;
 	struct rxrpc_skb_priv *sp;
 	SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
 	struct rxrpc_crypt iv;
@@ -216,17 +204,17 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 
 	rxkhdr.data_size = htonl(data_size | (u32)check << 16);
 	rxkhdr.checksum = 0;
+	memcpy(sechdr, &rxkhdr, sizeof(rxkhdr));
 
 	/* encrypt from the session key */
 	token = call->conn->key->payload.data[0];
 	memcpy(&iv, token->kad->session_key, sizeof(iv));
 
 	sg_init_one(&sg[0], sechdr, sizeof(rxkhdr));
-	sg_init_one(&sg[1], &rxkhdr, sizeof(rxkhdr));
 
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(rxkhdr), iv.x);
+	skcipher_request_set_crypt(req, &sg[0], &sg[0], sizeof(rxkhdr), iv.x);
 
 	crypto_skcipher_encrypt(req);
 
@@ -265,10 +253,11 @@ static int rxkad_secure_packet(const struct rxrpc_call *call,
 	struct rxrpc_skb_priv *sp;
 	SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
 	struct rxrpc_crypt iv;
-	struct scatterlist sg[2];
-	struct {
+	struct scatterlist sg;
+	union {
 		__be32 x[2];
-	} tmpbuf __attribute__((aligned(8))); /* must all be in same page */
+		__be64 xl;
+	} tmpbuf;
 	u32 x, y;
 	int ret;
 
@@ -294,16 +283,19 @@ static int rxkad_secure_packet(const struct rxrpc_call *call,
 	tmpbuf.x[0] = htonl(sp->hdr.callNumber);
 	tmpbuf.x[1] = htonl(x);
 
-	sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-	sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+	swap(tmpbuf.xl, *(__be64 *)sp);
+
+	sg_init_one(&sg, sp, sizeof(tmpbuf));
 
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
+	skcipher_request_set_crypt(req, &sg, &sg, sizeof(tmpbuf), iv.x);
 
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 
+	swap(tmpbuf.xl, *(__be64 *)sp);
+
 	y = ntohl(tmpbuf.x[1]);
 	y = (y >> 16) & 0xffff;
 	if (y == 0)
@@ -503,10 +495,11 @@ static int rxkad_verify_packet(const struct rxrpc_call *call,
 	SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
 	struct rxrpc_skb_priv *sp;
 	struct rxrpc_crypt iv;
-	struct scatterlist sg[2];
-	struct {
+	struct scatterlist sg;
+	union {
 		__be32 x[2];
-	} tmpbuf __attribute__((aligned(8))); /* must all be in same page */
+		__be64 xl;
+	} tmpbuf;
 	u16 cksum;
 	u32 x, y;
 	int ret;
@@ -534,16 +527,19 @@ static int rxkad_verify_packet(const struct rxrpc_call *call,
 	tmpbuf.x[0] = htonl(call->call_id);
 	tmpbuf.x[1] = htonl(x);
 
-	sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-	sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+	swap(tmpbuf.xl, *(__be64 *)sp);
+
+	sg_init_one(&sg, sp, sizeof(tmpbuf));
 
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
+	skcipher_request_set_crypt(req, &sg, &sg, sizeof(tmpbuf), iv.x);
 
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 
+	swap(tmpbuf.xl, *(__be64 *)sp);
+
 	y = ntohl(tmpbuf.x[1]);
 	cksum = (y >> 16) & 0xffff;
 	if (cksum == 0)
@@ -708,26 +704,13 @@ static void rxkad_calc_response_checksum(struct rxkad_response *response)
 }
 
 /*
- * load a scatterlist with a potentially split-page buffer
+ * load a scatterlist
  */
-static void rxkad_sg_set_buf2(struct scatterlist sg[2],
+static void rxkad_sg_set_buf2(struct scatterlist sg[1],
 			      void *buf, size_t buflen)
 {
-	int nsg = 1;
-
-	sg_init_table(sg, 2);
-
+	sg_init_table(sg, 1);
 	sg_set_buf(&sg[0], buf, buflen);
-	if (sg[0].offset + buflen > PAGE_SIZE) {
-		/* the buffer was split over two pages */
-		sg[0].length = PAGE_SIZE - sg[0].offset;
-		sg_set_buf(&sg[1], buf + sg[0].length, buflen - sg[0].length);
-		nsg++;
-	}
-
-	sg_mark_end(&sg[nsg - 1]);
-
-	ASSERTCMP(sg[0].length + sg[1].length, ==, buflen);
 }
 
 /*
@@ -739,7 +722,7 @@ static void rxkad_encrypt_response(struct rxrpc_connection *conn,
 {
 	SKCIPHER_REQUEST_ON_STACK(req, conn->cipher);
 	struct rxrpc_crypt iv;
-	struct scatterlist sg[2];
+	struct scatterlist sg[1];
 
 	/* continue encrypting from where we left off */
 	memcpy(&iv, s2->session_key, sizeof(iv));
@@ -999,7 +982,7 @@ static void rxkad_decrypt_response(struct rxrpc_connection *conn,
 				   const struct rxrpc_crypt *session_key)
 {
 	SKCIPHER_REQUEST_ON_STACK(req, rxkad_ci);
-	struct scatterlist sg[2];
+	struct scatterlist sg[1];
 	struct rxrpc_crypt iv;
 
 	_enter(",,%08x%08x",
-- 
2.7.4


* [PATCH v4 03/29] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Ingo Molnar, Andrew Morton,
	Andy Lutomirski, Denys Vlasenko, H . Peter Anvin, Oleg Nesterov,
	Peter Zijlstra, Rik van Riel, Thomas Gleixner, Waiman Long,
	linux-mm

From: Ingo Molnar <mingo@kernel.org>

So when memory hotplug removes a piece of physical memory from pagetable
mappings, it also frees the underlying PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory hotplugged.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Message-Id: <1442903021-3893-4-git-send-email-mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bce2e5d9edd4..c7465453d64e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -702,27 +702,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 	spin_unlock(&init_mm.page_table_lock);
 }
 
-/* Return true if pgd is changed, otherwise return false. */
-static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
-{
-	pud_t *pud;
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++) {
-		pud = pud_start + i;
-		if (pud_val(*pud))
-			return false;
-	}
-
-	/* free a pud table */
-	free_pagetable(pgd_page(*pgd), 0);
-	spin_lock(&init_mm.page_table_lock);
-	pgd_clear(pgd);
-	spin_unlock(&init_mm.page_table_lock);
-
-	return true;
-}
-
 static void __meminit
 remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 		 bool direct)
@@ -913,7 +892,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	unsigned long addr;
 	pgd_t *pgd;
 	pud_t *pud;
-	bool pgd_changed = false;
 
 	for (addr = start; addr < end; addr = next) {
 		next = pgd_addr_end(addr, end);
@@ -924,13 +902,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 
 		pud = (pud_t *)pgd_page_vaddr(*pgd);
 		remove_pud_table(pud, addr, next, direct);
-		if (free_pud_table(pud, pgd))
-			pgd_changed = true;
 	}
 
-	if (pgd_changed)
-		sync_global_pgds(start, end - 1, 1);
-
 	flush_tlb_all();
 }
 
-- 
2.7.4


* [PATCH v4 04/29] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

This avoids pointless races in which another CPU or task might see a
partially populated global pgd entry.  These races should normally be
harmless, but, if another CPU propagates the entry via vmalloc_fault()
and populate_pgd() then fails (due to a memory allocation failure, for
example), deferring the pgd write prevents a use-after-free of the pgd
entry.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/pageattr.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 7a1f7bbf4105..6a8026918bf6 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1113,7 +1113,9 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr)
 
 	ret = populate_pud(cpa, addr, pgd_entry, pgprot);
 	if (ret < 0) {
-		unmap_pgd_range(cpa->pgd, addr,
+		if (pud)
+			free_page((unsigned long)pud);
+		unmap_pud_range(pgd_entry, addr,
 				addr + (cpa->numpages << PAGE_SHIFT));
 		return ret;
 	}
-- 
2.7.4


* [PATCH v4 05/29] x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables()
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski, Matt Fleming,
	linux-efi

kernel_unmap_pages_in_pgd() is dangerous: if a pgd entry in
init_mm.pgd were to be cleared, callers would need to ensure that
the pgd entry hadn't been propagated to any other pgd.

Its only caller was efi_cleanup_page_tables(), and that, in turn,
was unused, so just delete both functions.  This leaves a couple of
other helpers unused, so delete them, too.

Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: linux-efi@vger.kernel.org
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/efi.h           |  1 -
 arch/x86/include/asm/pgtable_types.h |  2 --
 arch/x86/mm/pageattr.c               | 28 ----------------------------
 arch/x86/platform/efi/efi.c          |  2 --
 arch/x86/platform/efi/efi_32.c       |  3 ---
 arch/x86/platform/efi/efi_64.c       |  5 -----
 6 files changed, 41 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 78d1e7467eae..45ea38df86d4 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -125,7 +125,6 @@ extern void __init efi_map_region_fixed(efi_memory_desc_t *md);
 extern void efi_sync_low_kernel_mappings(void);
 extern int __init efi_alloc_page_tables(void);
 extern int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages);
-extern void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages);
 extern void __init old_map_region(efi_memory_desc_t *md);
 extern void __init runtime_code_page_mkexec(void);
 extern void __init efi_runtime_update_mappings(void);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 7b5efe264eff..0b9f58ad10c8 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -475,8 +475,6 @@ extern pmd_t *lookup_pmd_address(unsigned long address);
 extern phys_addr_t slow_virt_to_phys(void *__address);
 extern int kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
 				   unsigned numpages, unsigned long page_flags);
-void kernel_unmap_pages_in_pgd(pgd_t *root, unsigned long address,
-			       unsigned numpages);
 #endif	/* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_DEFS_H */
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6a8026918bf6..762162af3662 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -746,18 +746,6 @@ static bool try_to_free_pmd_page(pmd_t *pmd)
 	return true;
 }
 
-static bool try_to_free_pud_page(pud_t *pud)
-{
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++)
-		if (!pud_none(pud[i]))
-			return false;
-
-	free_page((unsigned long)pud);
-	return true;
-}
-
 static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end)
 {
 	pte_t *pte = pte_offset_kernel(pmd, start);
@@ -871,16 +859,6 @@ static void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end)
 	 */
 }
 
-static void unmap_pgd_range(pgd_t *root, unsigned long addr, unsigned long end)
-{
-	pgd_t *pgd_entry = root + pgd_index(addr);
-
-	unmap_pud_range(pgd_entry, addr, end);
-
-	if (try_to_free_pud_page((pud_t *)pgd_page_vaddr(*pgd_entry)))
-		pgd_clear(pgd_entry);
-}
-
 static int alloc_pte_page(pmd_t *pmd)
 {
 	pte_t *pte = (pte_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK);
@@ -1993,12 +1971,6 @@ out:
 	return retval;
 }
 
-void kernel_unmap_pages_in_pgd(pgd_t *root, unsigned long address,
-			       unsigned numpages)
-{
-	unmap_pgd_range(root, address, address + (numpages << PAGE_SHIFT));
-}
-
 /*
  * The testcases use internal knowledge of the implementation that shouldn't
  * be exposed to the rest of the kernel. Include these directly here.
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index f93545e7dc54..62986e5fbdba 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -978,8 +978,6 @@ static void __init __efi_enter_virtual_mode(void)
 	 * EFI mixed mode we need all of memory to be accessible when
 	 * we pass parameters to the EFI runtime services in the
 	 * thunking code.
-	 *
-	 * efi_cleanup_page_tables(__pa(new_memmap), 1 << pg_shift);
 	 */
 	free_pages((unsigned long)new_memmap, pg_shift);
 
diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
index 338402b91d2e..cef39b097649 100644
--- a/arch/x86/platform/efi/efi_32.c
+++ b/arch/x86/platform/efi/efi_32.c
@@ -49,9 +49,6 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
 	return 0;
 }
-void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages)
-{
-}
 
 void __init efi_map_region(efi_memory_desc_t *md)
 {
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index b226b3f497f1..d288dcea1ffe 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -285,11 +285,6 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	return 0;
 }
 
-void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages)
-{
-	kernel_unmap_pages_in_pgd(efi_pgd, pa_memmap, num_pages);
-}
-
 static void __init __map_region(efi_memory_desc_t *md, u64 va)
 {
 	unsigned long flags = _PAGE_RW;
-- 
2.7.4


* [PATCH v4 06/29] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, linux-mm

Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
zone.  This only makes sense if each kernel stack exists entirely in
one zone, and allowing vmapped stacks could break this assumption.

Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on
all architectures.  Keep it simple and use KiB.

Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/base/node.c    | 3 +--
 fs/proc/meminfo.c      | 2 +-
 include/linux/mmzone.h | 2 +-
 kernel/fork.c          | 3 ++-
 mm/page_alloc.c        | 3 +--
 5 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..27dc68a0ed2d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -121,8 +121,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
 		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
 		       nid, K(i.sharedram),
-		       nid, node_page_state(nid, NR_KERNEL_STACK) *
-				THREAD_SIZE / 1024,
+		       nid, node_page_state(nid, NR_KERNEL_STACK_KB),
 		       nid, K(node_page_state(nid, NR_PAGETABLE)),
 		       nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
 		       nid, K(node_page_state(nid, NR_BOUNCE)),
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 83720460c5bc..239b5a06cee0 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -145,7 +145,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 				global_page_state(NR_SLAB_UNRECLAIMABLE)),
 		K(global_page_state(NR_SLAB_RECLAIMABLE)),
 		K(global_page_state(NR_SLAB_UNRECLAIMABLE)),
-		global_page_state(NR_KERNEL_STACK) * THREAD_SIZE / 1024,
+		global_page_state(NR_KERNEL_STACK_KB),
 		K(global_page_state(NR_PAGETABLE)),
 #ifdef CONFIG_QUICKLIST
 		K(quicklist_total_size()),
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02069c23486d..63f05a7efb54 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -127,7 +127,7 @@ enum zone_stat_item {
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
 	NR_PAGETABLE,		/* used for pagetables */
-	NR_KERNEL_STACK,
+	NR_KERNEL_STACK_KB,	/* measured in KiB */
 	/* Second 128 byte cacheline */
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
diff --git a/kernel/fork.c b/kernel/fork.c
index 4a7ec0c6c88c..466ba8febe3b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -225,7 +225,8 @@ static void account_kernel_stack(unsigned long *stack, int account)
 {
 	struct zone *zone = page_zone(virt_to_page(stack));
 
-	mod_zone_page_state(zone, NR_KERNEL_STACK, account);
+	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
+			    THREAD_SIZE / 1024 * account);
 }
 
 void free_task(struct task_struct *tsk)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6903b695ebae..a277dea926c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4457,8 +4457,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_SHMEM)),
 			K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
 			K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
-			zone_page_state(zone, NR_KERNEL_STACK) *
-				THREAD_SIZE / 1024,
+			zone_page_state(zone, NR_KERNEL_STACK_KB),
 			K(zone_page_state(zone, NR_PAGETABLE)),
 			K(zone_page_state(zone, NR_UNSTABLE_NFS)),
 			K(zone_page_state(zone, NR_BOUNCE)),
-- 
2.7.4


* [PATCH v4 07/29] mm: Fix memcg stack accounting for sub-page stacks
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, linux-mm

We should account for stacks regardless of stack size, and we need
to account in sub-page units if THREAD_SIZE < PAGE_SIZE.  Change the
units to kilobytes and move the accounting into account_kernel_stack().

Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/memcontrol.h |  2 +-
 kernel/fork.c              | 19 ++++++++-----------
 mm/memcontrol.c            |  2 +-
 3 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a805474df4ab..3b653b86bb8f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -52,7 +52,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,
 	/* default hierarchy stats */
-	MEMCG_KERNEL_STACK = MEM_CGROUP_STAT_NSTATS,
+	MEMCG_KERNEL_STACK_KB = MEM_CGROUP_STAT_NSTATS,
 	MEMCG_SLAB_RECLAIMABLE,
 	MEMCG_SLAB_UNRECLAIMABLE,
 	MEMCG_SOCK,
diff --git a/kernel/fork.c b/kernel/fork.c
index 466ba8febe3b..146c9840c079 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -165,20 +165,12 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
 	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
 						  THREAD_SIZE_ORDER);
 
-	if (page)
-		memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
-					    1 << THREAD_SIZE_ORDER);
-
 	return page ? page_address(page) : NULL;
 }
 
 static inline void free_thread_stack(unsigned long *stack)
 {
-	struct page *page = virt_to_page(stack);
-
-	memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
-				    -(1 << THREAD_SIZE_ORDER));
-	__free_kmem_pages(page, THREAD_SIZE_ORDER);
+	free_kmem_pages((unsigned long)stack, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_stack_cache;
@@ -223,10 +215,15 @@ static struct kmem_cache *mm_cachep;
 
 static void account_kernel_stack(unsigned long *stack, int account)
 {
-	struct zone *zone = page_zone(virt_to_page(stack));
+	/* All stack pages are in the same zone and belong to the same memcg. */
+	struct page *first_page = virt_to_page(stack);
 
-	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
+	mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
 			    THREAD_SIZE / 1024 * account);
+
+	memcg_kmem_update_page_stat(
+		first_page, MEMCG_KERNEL_STACK_KB,
+		account * (THREAD_SIZE / 1024));
 }
 
 void free_task(struct task_struct *tsk)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ac8664db3823..ee44afc1f2d0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5133,7 +5133,7 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	seq_printf(m, "file %llu\n",
 		   (u64)stat[MEM_CGROUP_STAT_CACHE] * PAGE_SIZE);
 	seq_printf(m, "kernel_stack %llu\n",
-		   (u64)stat[MEMCG_KERNEL_STACK] * PAGE_SIZE);
+		   (u64)stat[MEMCG_KERNEL_STACK_KB] * 1024);
 	seq_printf(m, "slab %llu\n",
 		   (u64)(stat[MEMCG_SLAB_RECLAIMABLE] +
 			 stat[MEMCG_SLAB_UNRECLAIMABLE]) * PAGE_SIZE);
-- 
2.7.4


* [PATCH v4 08/29] dma-api: Teach the "DMA-from-stack" check about vmapped stacks
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski, Andrew Morton,
	Arnd Bergmann

If we're using CONFIG_VMAP_STACK and we manage to point an sg entry
at the stack, then either the sg page will be in highmem or sg_virt
will return the direct-map alias.  In neither case will the existing
check_for_stack() implementation realize that it's a stack page.

Fix it by explicitly checking for stack pages.

This has no effect by itself.  It's broken out for ease of review.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 lib/dma-debug.c | 39 +++++++++++++++++++++++++++++++++------
 1 file changed, 33 insertions(+), 6 deletions(-)

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index 51a76af25c66..5b2e63cba90e 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -22,6 +22,7 @@
 #include <linux/stacktrace.h>
 #include <linux/dma-debug.h>
 #include <linux/spinlock.h>
+#include <linux/vmalloc.h>
 #include <linux/debugfs.h>
 #include <linux/uaccess.h>
 #include <linux/export.h>
@@ -1162,11 +1163,35 @@ static void check_unmap(struct dma_debug_entry *ref)
 	put_hash_bucket(bucket, &flags);
 }
 
-static void check_for_stack(struct device *dev, void *addr)
+static void check_for_stack(struct device *dev,
+			    struct page *page, size_t offset)
 {
-	if (object_is_on_stack(addr))
-		err_printk(dev, NULL, "DMA-API: device driver maps memory from "
-				"stack [addr=%p]\n", addr);
+	void *addr;
+	struct vm_struct *stack_vm_area = task_stack_vm_area(current);
+
+	if (!stack_vm_area) {
+		/* Stack is direct-mapped. */
+		if (PageHighMem(page))
+			return;
+		addr = page_address(page) + offset;
+		if (object_is_on_stack(addr))
+			err_printk(dev, NULL, "DMA-API: device driver maps memory from stack [addr=%p]\n",
+				   addr);
+	} else {
+		/* Stack is vmalloced. */
+		int i;
+
+		for (i = 0; i < stack_vm_area->nr_pages; i++) {
+			if (page != stack_vm_area->pages[i])
+				continue;
+
+			addr = (u8 *)current->stack + i * PAGE_SIZE +
+				offset;
+			err_printk(dev, NULL, "DMA-API: device driver maps memory from stack [probable addr=%p]\n",
+				   addr);
+			break;
+		}
+	}
 }
 
 static inline bool overlap(void *addr, unsigned long len, void *start, void *end)
@@ -1289,10 +1314,11 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
 	if (map_single)
 		entry->type = dma_debug_single;
 
+	check_for_stack(dev, page, offset);
+
 	if (!PageHighMem(page)) {
 		void *addr = page_address(page) + offset;
 
-		check_for_stack(dev, addr);
 		check_for_illegal_area(dev, addr, size);
 	}
 
@@ -1384,8 +1410,9 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		entry->sg_call_ents   = nents;
 		entry->sg_mapped_ents = mapped_ents;
 
+		check_for_stack(dev, sg_page(s), s->offset);
+
 		if (!PageHighMem(sg_page(s))) {
-			check_for_stack(dev, sg_virt(s));
 			check_for_illegal_area(dev, sg_virt(s), sg_dma_len(s));
 		}
 
-- 
2.7.4


* [PATCH v4 09/29] fork: Add generic vmalloced stack support
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski, Oleg Nesterov

If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
vmalloc_node.

grsecurity has had a similar feature (called
GRKERNSEC_KSTACKOVERFLOW) for a long time.

Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/Kconfig                        | 29 +++++++++++++
 arch/ia64/include/asm/thread_info.h |  2 +-
 include/linux/sched.h               | 15 +++++++
 kernel/fork.c                       | 87 +++++++++++++++++++++++++++++--------
 4 files changed, 113 insertions(+), 20 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 15996290fed4..18a2c3a7b460 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -661,4 +661,33 @@ config ARCH_NO_COHERENT_DMA_MMAP
 config CPU_NO_EFFICIENT_FFS
 	def_bool n
 
+config HAVE_ARCH_VMAP_STACK
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stacks
+	  in vmalloc space.  This means:
+
+	  - vmalloc space must be large enough to hold many kernel stacks.
+	    This may rule out many 32-bit architectures.
+
+	  - Stacks in vmalloc space need to work reliably.  For example, if
+	    vmap page tables are created on demand, either this mechanism
+	    needs to work while the stack points to a virtual address with
+	    unpopulated page tables or arch code (switch_to and switch_mm,
+	    most likely) needs to ensure that the stack's page table entries
+	    are populated before running on a possibly unpopulated stack.
+
+	  - If the stack overflows into a guard page, something reasonable
+	    should happen.  The definition of "reasonable" is flexible, but
+	    instantly rebooting without logging anything would be unfriendly.
+
+config VMAP_STACK
+	bool "Use a virtually-mapped stack"
+	depends on HAVE_ARCH_VMAP_STACK
+	---help---
+	  Enable this if you want to use virtually-mapped kernel stacks
+	  with guard pages.  This causes kernel stack overflows to be
+	  caught immediately rather than causing difficult-to-diagnose
+	  corruption.
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index d1212b84fb83..f0a72e98e5a4 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -56,7 +56,7 @@ struct thread_info {
 #define alloc_thread_stack_node(tsk, node)	((unsigned long *) 0)
 #define task_thread_info(tsk)	((struct thread_info *) 0)
 #endif
-#define free_thread_stack(ti)	/* nothing */
+#define free_thread_stack(tsk)	/* nothing */
 #define task_stack_page(tsk)	((void *)(tsk))
 
 #define __HAVE_THREAD_FUNCTIONS
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 253538f29ade..26869dba21f1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1918,6 +1918,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_VMAP_STACK
+	struct vm_struct *stack_vm_area;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -1934,6 +1937,18 @@ extern int arch_task_struct_size __read_mostly;
 # define arch_task_struct_size (sizeof(struct task_struct))
 #endif
 
+#ifdef CONFIG_VMAP_STACK
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return t->stack_vm_area;
+}
+#else
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return NULL;
+}
+#endif
+
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 146c9840c079..06761de69360 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -158,19 +158,37 @@ void __weak arch_release_thread_stack(unsigned long *stack)
  * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
  * kmemcache based allocator.
  */
-# if THREAD_SIZE >= PAGE_SIZE
-static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
-						  int node)
+# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
+static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
+#ifdef CONFIG_VMAP_STACK
+	void *stack = __vmalloc_node_range(
+		THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
+		THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
+		0, node, __builtin_return_address(0));
+
+	/*
+	 * We can't call find_vm_area() in interrupt context, and
+	 * free_thread_stack() can be called in interrupt context, so cache
+	 * the vm_struct.
+	 */
+	if (stack)
+		tsk->stack_vm_area = find_vm_area(stack);
+	return stack;
+#else
 	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
 						  THREAD_SIZE_ORDER);
 
 	return page ? page_address(page) : NULL;
+#endif
 }
 
-static inline void free_thread_stack(unsigned long *stack)
+static inline void free_thread_stack(struct task_struct *tsk)
 {
-	free_kmem_pages((unsigned long)stack, THREAD_SIZE_ORDER);
+	if (task_stack_vm_area(tsk))
+		vfree(tsk->stack);
+	else
+		free_kmem_pages((unsigned long)tsk->stack, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_stack_cache;
@@ -181,9 +199,9 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
 	return kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
 }
 
-static void free_thread_stack(unsigned long *stack)
+static void free_thread_stack(struct task_struct *tsk)
 {
-	kmem_cache_free(thread_stack_cache, stack);
+	kmem_cache_free(thread_stack_cache, tsk->stack);
 }
 
 void thread_stack_cache_init(void)
@@ -213,24 +231,49 @@ struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-static void account_kernel_stack(unsigned long *stack, int account)
+static void account_kernel_stack(struct task_struct *tsk, int account)
 {
-	/* All stack pages are in the same zone and belong to the same memcg. */
-	struct page *first_page = virt_to_page(stack);
+	void *stack = task_stack_page(tsk);
+	struct vm_struct *vm = task_stack_vm_area(tsk);
+
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
+
+	if (vm) {
+		int i;
 
-	mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
-			    THREAD_SIZE / 1024 * account);
+		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
 
-	memcg_kmem_update_page_stat(
-		first_page, MEMCG_KERNEL_STACK_KB,
-		account * (THREAD_SIZE / 1024));
+		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+			mod_zone_page_state(page_zone(vm->pages[i]),
+					    NR_KERNEL_STACK_KB,
+					    PAGE_SIZE / 1024 * account);
+		}
+
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(
+			vm->pages[0], MEMCG_KERNEL_STACK_KB,
+			account * (THREAD_SIZE / 1024));
+	} else {
+		/*
+		 * All stack pages are in the same zone and belong to the
+		 * same memcg.
+		 */
+		struct page *first_page = virt_to_page(stack);
+
+		mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
+				    THREAD_SIZE / 1024 * account);
+
+		memcg_kmem_update_page_stat(
+			first_page, MEMCG_KERNEL_STACK_KB,
+			account * (THREAD_SIZE / 1024));
+	}
 }
 
 void free_task(struct task_struct *tsk)
 {
-	account_kernel_stack(tsk->stack, -1);
+	account_kernel_stack(tsk, -1);
 	arch_release_thread_stack(tsk->stack);
-	free_thread_stack(tsk->stack);
+	free_thread_stack(tsk);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
 	put_seccomp_filter(tsk);
@@ -342,6 +385,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 {
 	struct task_struct *tsk;
 	unsigned long *stack;
+	struct vm_struct *stack_vm_area;
 	int err;
 
 	if (node == NUMA_NO_NODE)
@@ -354,11 +398,16 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (!stack)
 		goto free_tsk;
 
+	stack_vm_area = task_stack_vm_area(tsk);
+
 	err = arch_dup_task_struct(tsk, orig);
 	if (err)
 		goto free_stack;
 
 	tsk->stack = stack;
+#ifdef CONFIG_VMAP_STACK
+	tsk->stack_vm_area = stack_vm_area;
+#endif
 #ifdef CONFIG_SECCOMP
 	/*
 	 * We must handle setting up seccomp filters once we're under
@@ -390,14 +439,14 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->task_frag.page = NULL;
 	tsk->wake_q.next = NULL;
 
-	account_kernel_stack(stack, 1);
+	account_kernel_stack(tsk, 1);
 
 	kcov_task_init(tsk);
 
 	return tsk;
 
 free_stack:
-	free_thread_stack(stack);
+	free_thread_stack(tsk);
 free_tsk:
 	free_task_struct(tsk);
 	return NULL;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 10/29] x86/die: Don't try to recover from an OOPS on a non-default stack
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (8 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 09/29] fork: Add generic vmalloced stack support Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-07-02 17:24   ` Borislav Petkov
  2016-06-26 21:55 ` [PATCH v4 11/29] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
                   ` (22 subsequent siblings)
  32 siblings, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

It's not going to work, because the scheduler will explode if we try
to schedule when running on an IST stack or similar.

This will matter when we let kernel stack overflows (which are #DF)
call die().
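
The check added below relies on the task stack being THREAD_SIZE
aligned: XORing the current stack pointer with (top of stack - 1) and
masking off the low bits is zero only when both addresses lie in the
same THREAD_SIZE-sized block.  A standalone illustration (helper name,
THREAD_SIZE value and addresses are made up for the example):

    #include <stdio.h>

    #define THREAD_SIZE 0x4000UL   /* illustrative */

    static int on_task_stack(unsigned long sp, unsigned long top)
    {
        /* Zero iff sp and (top - 1) share one THREAD_SIZE-aligned block. */
        return ((sp ^ (top - 1)) & ~(THREAD_SIZE - 1)) == 0;
    }

    int main(void)
    {
        unsigned long top = 0xffff880000010000UL;   /* hypothetical stack top */

        printf("%d\n", on_task_stack(top - 0x100, top));            /* 1: default stack */
        printf("%d\n", on_task_stack(0xffffc90000001000UL, top));   /* 0: IST or other stack */
        return 0;
    }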

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index ef8017ca5ba9..352f022cfd5b 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -245,6 +245,9 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 		return;
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
+	if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
+	     & ~(THREAD_SIZE - 1)) != 0)
+		panic("Fatal exception on special stack");
 	if (panic_on_oops)
 		panic("Fatal exception");
 	do_exit(signr);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 11/29] x86/dumpstack: When OOPSing, rewind the stack before do_exit
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (9 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 10/29] x86/die: Don't try to recover from an OOPS on a non-default stack Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-07-04 18:45   ` Borislav Petkov
  2016-06-26 21:55 ` [PATCH v4 12/29] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
                   ` (21 subsequent siblings)
  32 siblings, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

If we call do_exit with a clean stack, we greatly reduce the risk of
recursive oopses due to stack overflow in do_exit, and we allow
do_exit to work even if we OOPS from an IST stack.  The latter gives
us a much better chance of surviving long enough after we detect a
stack overflow to write out our logs.

I intentionally separated this from the preceding patch that
disables do_exit-on-OOPS on IST stacks.  This way, if we need to
revert this patch, we still end up in an acceptable state wrt stack
overflow handling.
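
The same principle shows up in userspace: a handler for a
stack-overflow SIGSEGV can only do useful work if it runs on an
alternate stack, so it is switched to one before the exit logic runs.
A minimal userspace analogy (not kernel code; names and sizes chosen
for the example):

    #include <signal.h>
    #include <unistd.h>

    static void handler(int sig)
    {
        /* Running on the alternate stack: the overflowed stack is unusable. */
        static const char msg[] = "stack overflow caught, exiting cleanly\n";

        (void)sig;
        write(2, msg, sizeof(msg) - 1);
        _exit(1);
    }

    static int recurse(int depth)
    {
        volatile char pad[1024];

        pad[0] = (char)depth;
        return recurse(depth + 1) + pad[0];   /* defeat tail-call optimization */
    }

    int main(void)
    {
        static char altstack[64 * 1024];
        stack_t ss = { .ss_sp = altstack, .ss_size = sizeof(altstack) };
        struct sigaction sa = { .sa_handler = handler, .sa_flags = SA_ONSTACK };

        sigemptyset(&sa.sa_mask);
        sigaltstack(&ss, NULL);
        sigaction(SIGSEGV, &sa, NULL);
        return recurse(0);
    }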

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_32.S   | 11 +++++++++++
 arch/x86/entry/entry_64.S   | 11 +++++++++++
 arch/x86/kernel/dumpstack.c | 13 +++++++++----
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 983e5d3a0d27..0b56666e6039 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1153,3 +1153,14 @@ ENTRY(async_page_fault)
 	jmp	error_code
 END(async_page_fault)
 #endif
+
+ENTRY(rewind_stack_do_exit)
+	/* Prevent any naive code from trying to unwind to our caller. */
+	xorl	%ebp, %ebp
+
+	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esi
+	leal	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
+
+	call	do_exit
+1:	jmp 1b
+END(rewind_stack_do_exit)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1807ed..b846875aeea6 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1423,3 +1423,14 @@ ENTRY(ignore_sysret)
 	mov	$-ENOSYS, %eax
 	sysret
 END(ignore_sysret)
+
+ENTRY(rewind_stack_do_exit)
+	/* Prevent any naive code from trying to unwind to our caller. */
+	xorl	%ebp, %ebp
+
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rax
+	leaq	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%rax), %rsp
+
+	call	do_exit
+1:	jmp 1b
+END(rewind_stack_do_exit)
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 352f022cfd5b..0d05f113805e 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -226,6 +226,8 @@ unsigned long oops_begin(void)
 EXPORT_SYMBOL_GPL(oops_begin);
 NOKPROBE_SYMBOL(oops_begin);
 
+extern void __noreturn rewind_stack_do_exit(int signr);
+
 void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 {
 	if (regs && kexec_should_crash(current))
@@ -245,12 +247,15 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 		return;
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
-	if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
-	     & ~(THREAD_SIZE - 1)) != 0)
-		panic("Fatal exception on special stack");
 	if (panic_on_oops)
 		panic("Fatal exception");
-	do_exit(signr);
+
+	/*
+	 * We're not going to return, but we might be on an IST stack or
+	 * have very little stack space left.  Rewind the stack and kill
+	 * the task.
+	 */
+	rewind_stack_do_exit(signr);
 }
 NOKPROBE_SYMBOL(oops_end);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 12/29] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (10 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 11/29] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 13/29] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

The comment suggests that show_stack(NULL, NULL) should backtrace
the current context, but the code doesn't match the comment.  If
regs are given, start the "Stack:" hexdump at regs->sp.

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack_32.c | 4 +++-
 arch/x86/kernel/dumpstack_64.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index fef917e79b9d..948d77da3881 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -96,7 +96,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	int i;
 
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index d558a8a49016..a81e1ef73bf2 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -264,7 +264,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * back trace for this cpu:
 	 */
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 13/29] x86/dumpstack: Try harder to get a call trace on stack overflow
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (11 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 12/29] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 14/29] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

If we overflow the stack, print_context_stack will abort.  Detect
this case and rewind back into the valid part of the stack so that
we can trace it.
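
The rewind treats any stack pointer less than one page below the base
of the task stack as having overflowed into the guard page and clamps
it back to the base, so the unwinder starts from mapped memory.  A
standalone illustration (addresses made up for the example):

    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    int main(void)
    {
        unsigned long base = 0xffffc90000004000UL;  /* hypothetical task_stack_page() */
        unsigned long sp   = base - 64;             /* overflowed into the guard page */

        /* base - sp underflows to a huge value for normal (sp >= base)
         * stack pointers, so only guard-page hits get clamped. */
        if (base - sp < PAGE_SIZE)
            sp = base;

        printf("unwind from %#lx\n", sp);
        return 0;
    }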

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 0d05f113805e..8a2113dfc154 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -87,7 +87,7 @@ static inline int valid_stack_ptr(struct task_struct *task,
 		else
 			return 0;
 	}
-	return p > t && p < t + THREAD_SIZE - size;
+	return p >= t && p < t + THREAD_SIZE - size;
 }
 
 unsigned long
@@ -98,6 +98,14 @@ print_context_stack(struct task_struct *task,
 {
 	struct stack_frame *frame = (struct stack_frame *)bp;
 
+	/*
+	 * If we overflowed the stack into a guard page, jump back to the
+	 * bottom of the usable stack.
+	 */
+	if ((unsigned long)task_stack_page(task) - (unsigned long)stack <
+	    PAGE_SIZE)
+		stack = (unsigned long *)task_stack_page(task);
+
 	while (valid_stack_ptr(task, stack, sizeof(*stack), end)) {
 		unsigned long addr;
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 14/29] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (12 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 13/29] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks Andy Lutomirski
                   ` (18 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

If we overflow the stack into a guard page, we'll recursively fault
when trying to dump the contents of the guard page.  Use
probe_kernel_address so we can recover if this happens.
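
probe_kernel_address() copies one word through the exception-fixup
machinery, so a fault (for instance on the unmapped guard page) comes
back as -EFAULT instead of recursing into another OOPS.  A condensed,
kernel-context sketch of the loop structure the hunk below adopts (the
helper name here is made up):

    #include <linux/uaccess.h>
    #include <linux/printk.h>

    /* Sketch only: dump up to n words, stopping at the first unreadable one. */
    static void dump_stack_words(unsigned long *stack, int n)
    {
        int i;

        for (i = 0; i < n; i++, stack++) {
            unsigned long word;

            if (probe_kernel_address(stack, word))
                break;          /* unmapped (guard page?): stop cleanly */
            pr_cont(" %016lx", word);
        }
    }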

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack_64.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index a81e1ef73bf2..6dede08dd98b 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -274,6 +274,8 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 	stack = sp;
 	for (i = 0; i < kstack_depth_to_print; i++) {
+		unsigned long word;
+
 		if (stack >= irq_stack && stack <= irq_stack_end) {
 			if (stack == irq_stack_end) {
 				stack = (unsigned long *) (irq_stack_end[-1]);
@@ -283,12 +285,18 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		if (kstack_end(stack))
 			break;
 		}
+
+		if (probe_kernel_address(stack, word))
+			break;
+
 		if ((i % STACKSLOTS_PER_LINE) == 0) {
 			if (i != 0)
 				pr_cont("\n");
-			printk("%s %016lx", log_lvl, *stack++);
+			printk("%s %016lx", log_lvl, word);
 		} else
-			pr_cont(" %016lx", *stack++);
+			pr_cont(" %016lx", word);
+
+		stack++;
 		touch_nmi_watchdog();
 	}
 	preempt_enable();
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (13 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 14/29] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-27 15:01   ` Brian Gerst
  2016-06-26 21:55 ` [PATCH v4 16/29] x86/mm: Improve stack-overflow #PF handling Andy Lutomirski
                   ` (17 subsequent siblings)
  32 siblings, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

This allows x86_64 kernels to enable vmapped stacks.  There are a
couple of interesting bits.

First, x86 lazily faults in top-level paging entries for the vmalloc
area.  This won't work if we get a page fault while trying to access
the stack: the CPU will promote it to a double-fault and we'll die.
To avoid this problem, probe the new stack when switching stacks and
forcibly populate the pgd entry for the stack when switching mms.

Second, once we have guard pages around the stack, we'll want to
detect and handle stack overflow.

I didn't enable it on x86_32.  We'd need to rework the double-fault
code a bit and I'm concerned about running out of vmalloc virtual
addresses under some workloads.

This patch, by itself, will behave somewhat erratically when the
stack overflows while RSP is still more than a few tens of bytes
above the bottom of the stack.  Specifically, we'll get #PF and make
it to no_context and an oops without triggering a double-fault, and
no_context doesn't know about stack overflows.  The next patch will
improve that case.
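
Condensed, the two fixes amount to probing the new stack before the
switch and syncing the stack's pgd entry on mm switch.  Roughly (a
sketch mirroring the switch_to.h and tlb.c hunks below, not the exact
code; helper names are made up):

    #include <linux/mm.h>
    #include <asm/pgtable.h>

    /* Touch the new stack while still on the old one, so a missing vmalloc
     * mapping is faulted in now instead of mid context switch. */
    static inline void probe_new_stack(unsigned long new_sp)
    {
        READ_ONCE(*(unsigned char *)new_sp);
    }

    /* Before running on a vmalloc-space stack under a new mm, make sure the
     * new page tables contain the top-level entry covering that stack. */
    static void sync_stack_pgd(struct mm_struct *next, unsigned long sp)
    {
        unsigned int idx = pgd_index(sp);
        pgd_t *pgd = next->pgd + idx;

        if (pgd_none(*pgd))
            set_pgd(pgd, init_mm.pgd[idx]);
    }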

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++++++++++++-
 arch/x86/kernel/traps.c          | 32 ++++++++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                | 15 +++++++++++++++
 4 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d9a94da0c29f..afdcf96ef109 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -92,6 +92,7 @@ config X86
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_EBPF_JIT			if X86_64
+	select HAVE_ARCH_VMAP_STACK		if X86_64
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8f321a1b03a1..14e4b20f0aaf 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,28 @@ struct tss_struct;
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss);
 
+/* This runs on the previous thread's stack. */
+static inline void prepare_switch_to(struct task_struct *prev,
+				     struct task_struct *next)
+{
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we switch to a stack that has a top-level paging entry
+	 * that is not present in the current mm, the resulting #PF
+	 * will be promoted to a double-fault and we'll panic.  Probe
+	 * the new stack now so that vmalloc_fault can fix up the page
+	 * tables if needed.  This can only happen if we use a stack
+	 * in vmap space.
+	 *
+	 * We assume that the stack is aligned so that it never spans
+	 * more than one top-level paging entry.
+	 *
+	 * To minimize cache pollution, just follow the stack pointer.
+	 */
+	READ_ONCE(*(unsigned char *)next->thread.sp);
+#endif
+}
+
 #ifdef CONFIG_X86_32
 
 #ifdef CONFIG_CC_STACKPROTECTOR
@@ -39,6 +61,8 @@ do {									\
 	 */								\
 	unsigned long ebx, ecx, edx, esi, edi;				\
 									\
+	prepare_switch_to(prev, next);					\
+									\
 	asm volatile("pushl %%ebp\n\t"		/* save    EBP   */	\
 		     "movl %%esp,%[prev_sp]\n\t"	/* save    ESP   */ \
 		     "movl %[next_sp],%%esp\n\t"	/* restore ESP   */ \
@@ -103,7 +127,9 @@ do {									\
  * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
  * has no effect.
  */
-#define switch_to(prev, next, last) \
+#define switch_to(prev, next, last)					  \
+	prepare_switch_to(prev, next);					  \
+									  \
 	asm volatile(SAVE_CONTEXT					  \
 	     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */	  \
 	     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */	  \
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 00f03d82e69a..9cb7ea781176 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present",	segment_not_present)
 DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
+#ifdef CONFIG_VMAP_STACK
+static void __noreturn handle_stack_overflow(const char *message,
+					     struct pt_regs *regs,
+					     unsigned long fault_address)
+{
+	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
+		 (void *)fault_address, current->stack,
+		 (char *)current->stack + THREAD_SIZE - 1);
+	die(message, regs, 0);
+
+	/* Be absolutely certain we don't return. */
+	panic(message);
+}
+#endif
+
 #ifdef CONFIG_X86_64
 /* Runs on IST stack */
 dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 {
 	static const char str[] = "double fault";
 	struct task_struct *tsk = current;
+#ifdef CONFIG_VMAP_STACK
+	unsigned long cr2;
+#endif
 
 #ifdef CONFIG_X86_ESPFIX64
 	extern unsigned char native_irq_return_iret[];
@@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_nr = X86_TRAP_DF;
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we overflow the stack into a guard page, the CPU will fail
+	 * to deliver #PF and will send #DF instead.  CR2 will contain
+	 * the linear address of the second fault, which will be in the
+	 * guard page below the bottom of the stack.
+	 */
+	cr2 = read_cr2();
+	if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
+		handle_stack_overflow(
+			"kernel stack overflow (double-fault)",
+			regs, cr2);
+#endif
+
 #ifdef CONFIG_DOUBLEFAULT
 	df_debug(regs, error_code);
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..fbf036ae72ac 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -77,10 +77,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	unsigned cpu = smp_processor_id();
 
 	if (likely(prev != next)) {
+		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+			/*
+			 * If our current stack is in vmalloc space and isn't
+			 * mapped in the new pgd, we'll double-fault.  Forcibly
+			 * map it.
+			 */
+			unsigned int stack_pgd_index =
+				pgd_index(current_stack_pointer());
+			pgd_t *pgd = next->pgd + stack_pgd_index;
+
+			if (unlikely(pgd_none(*pgd)))
+				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+		}
+
 #ifdef CONFIG_SMP
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		this_cpu_write(cpu_tlbstate.active_mm, next);
 #endif
+
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
 		/*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 16/29] x86/mm: Improve stack-overflow #PF handling
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (14 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 17/29] x86: Move uaccess_err and sig_on_uaccess_err to thread_struct Andy Lutomirski
                   ` (16 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

If we get a page fault indicating kernel stack overflow, invoke
handle_stack_overflow().  To prevent us from overflowing the stack
again while handling the overflow (because we are likely to have
very little stack space left), call handle_stack_overflow() on the
double-fault stack.
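
The fault handler decides "this is a stack overflow" when the faulting
address sits in the page just below or just above the task stack;
unsigned arithmetic keeps both comparisons cheap.  A standalone
illustration (helper name, sizes and addresses made up for the
example):

    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_SIZE   4096UL
    #define THREAD_SIZE (4 * PAGE_SIZE)   /* illustrative */

    /* Is addr in the guard page just below or just above the stack? */
    static bool hit_stack_guard(unsigned long stack, unsigned long addr)
    {
        return (stack - 1 - addr < PAGE_SIZE) ||
               (addr - (stack + THREAD_SIZE) < PAGE_SIZE);
    }

    int main(void)
    {
        unsigned long stack = 0xffffc90000004000UL;   /* hypothetical stack base */

        printf("%d\n", hit_stack_guard(stack, stack - 8));                /* 1: just below */
        printf("%d\n", hit_stack_guard(stack, stack + THREAD_SIZE + 8));  /* 1: just above */
        printf("%d\n", hit_stack_guard(stack, stack + 0x100));            /* 0: inside */
        return 0;
    }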

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/traps.h |  6 ++++++
 arch/x86/kernel/traps.c      |  6 +++---
 arch/x86/mm/fault.c          | 39 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c3496619740a..01fd0a7f48cd 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -117,6 +117,12 @@ extern void ist_exit(struct pt_regs *regs);
 extern void ist_begin_non_atomic(struct pt_regs *regs);
 extern void ist_end_non_atomic(void);
 
+#ifdef CONFIG_VMAP_STACK
+void __noreturn handle_stack_overflow(const char *message,
+				      struct pt_regs *regs,
+				      unsigned long fault_address);
+#endif
+
 /* Interrupts/Exceptions */
 enum {
 	X86_TRAP_DE = 0,	/*  0, Divide-by-zero */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9cb7ea781176..b389c0539eb9 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -293,9 +293,9 @@ DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
 #ifdef CONFIG_VMAP_STACK
-static void __noreturn handle_stack_overflow(const char *message,
-					     struct pt_regs *regs,
-					     unsigned long fault_address)
+__visible void __noreturn handle_stack_overflow(const char *message,
+						struct pt_regs *regs,
+						unsigned long fault_address)
 {
 	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
 		 (void *)fault_address, current->stack,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7d1fa7cd2374..c68b81f5659f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -753,6 +753,45 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * Stack overflow?  During boot, we can fault near the initial
+	 * stack in the direct map, but that's not an overflow -- check
+	 * that we're in vmalloc space to avoid this.
+	 *
+	 * Check this after trying fixup_exception, since there are a handful
+	 * of kernel code paths that wander off the top of the stack but
+	 * handle any faults that occur.  Once those are fixed, we can
+	 * move this above fixup_exception.
+	 */
+	if (is_vmalloc_addr((void *)address) &&
+	    (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
+	     address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
+		register void *__sp asm("rsp");
+		unsigned long stack =
+			this_cpu_read(orig_ist.ist[DOUBLEFAULT_STACK]) -
+			sizeof(void *);
+		/*
+		 * We're likely to be running with very little stack space
+		 * left.  It's plausible that we'd hit this condition but
+		 * double-fault even before we get this far, in which case
+		 * we're fine: the double-fault handler will deal with it.
+		 *
+		 * We don't want to make it all the way into the oops code
+		 * and then double-fault, though, because we're likely to
+		 * break the console driver and lose most of the stack dump.
+		 */
+		asm volatile ("movq %[stack], %%rsp\n\t"
+			      "call handle_stack_overflow\n\t"
+			      "1: jmp 1b"
+			      : "+r" (__sp)
+			      : "D" ("kernel stack overflow (page fault)"),
+				"S" (regs), "d" (address),
+				[stack] "rm" (stack));
+		unreachable();
+	}
+#endif
+
 	/*
 	 * 32-bit:
 	 *
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 17/29] x86: Move uaccess_err and sig_on_uaccess_err to thread_struct
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (15 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 16/29] x86/mm: Improve stack-overflow #PF handling Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 18/29] x86: Move addr_limit " Andy Lutomirski
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

thread_info is a legacy mess.  To prepare for its partial removal,
move the uaccess control fields out -- they're straightforward.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/vsyscall/vsyscall_64.c | 6 +++---
 arch/x86/include/asm/processor.h      | 3 +++
 arch/x86/include/asm/thread_info.h    | 2 --
 arch/x86/include/asm/uaccess.h        | 4 ++--
 arch/x86/mm/extable.c                 | 2 +-
 arch/x86/mm/fault.c                   | 2 +-
 6 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index 174c2549939d..3aba2b043050 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -221,8 +221,8 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
 	 * With a real vsyscall, page faults cause SIGSEGV.  We want to
 	 * preserve that behavior to make writing exploits harder.
 	 */
-	prev_sig_on_uaccess_error = current_thread_info()->sig_on_uaccess_error;
-	current_thread_info()->sig_on_uaccess_error = 1;
+	prev_sig_on_uaccess_error = current->thread.sig_on_uaccess_error;
+	current->thread.sig_on_uaccess_error = 1;
 
 	ret = -EFAULT;
 	switch (vsyscall_nr) {
@@ -243,7 +243,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
 		break;
 	}
 
-	current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error;
+	current->thread.sig_on_uaccess_error = prev_sig_on_uaccess_error;
 
 check_fault:
 	if (ret == -EFAULT) {
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 62c6cc3cc5d3..f53ae57bd985 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -419,6 +419,9 @@ struct thread_struct {
 	/* Max allowed port in the bitmap, in bytes: */
 	unsigned		io_bitmap_max;
 
+	unsigned int		sig_on_uaccess_error:1;
+	unsigned int		uaccess_err:1;	/* uaccess failed */
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 30c133ac05cd..7c47bb659ecd 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -58,8 +58,6 @@ struct thread_info {
 	__u32			status;		/* thread synchronous flags */
 	__u32			cpu;		/* current CPU */
 	mm_segment_t		addr_limit;
-	unsigned int		sig_on_uaccess_error:1;
-	unsigned int		uaccess_err:1;	/* uaccess failed */
 };
 
 #define INIT_THREAD_INFO(tsk)			\
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 2982387ba817..4d2a726e8e6d 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -468,13 +468,13 @@ struct __large_struct { unsigned long buf[100]; };
  * uaccess_try and catch
  */
 #define uaccess_try	do {						\
-	current_thread_info()->uaccess_err = 0;				\
+	current->thread.uaccess_err = 0;				\
 	__uaccess_begin();						\
 	barrier();
 
 #define uaccess_catch(err)						\
 	__uaccess_end();						\
-	(err) |= (current_thread_info()->uaccess_err ? -EFAULT : 0);	\
+	(err) |= (current->thread.uaccess_err ? -EFAULT : 0);		\
 } while (0)
 
 /**
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 4bb53b89f3c5..0f90cc218d04 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -37,7 +37,7 @@ bool ex_handler_ext(const struct exception_table_entry *fixup,
 		   struct pt_regs *regs, int trapnr)
 {
 	/* Special hack for uaccess_err */
-	current_thread_info()->uaccess_err = 1;
+	current->thread.uaccess_err = 1;
 	regs->ip = ex_fixup_addr(fixup);
 	return true;
 }
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index c68b81f5659f..d34e21e81b07 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -737,7 +737,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 		 * In this case we need to make sure we're not recursively
 		 * faulting through the emulate_vsyscall() logic.
 		 */
-		if (current_thread_info()->sig_on_uaccess_error && signal) {
+		if (current->thread.sig_on_uaccess_error && signal) {
 			tsk->thread.trap_nr = X86_TRAP_PF;
 			tsk->thread.error_code = error_code | PF_USER;
 			tsk->thread.cr2 = address;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 18/29] x86: Move addr_limit to thread_struct
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (16 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 17/29] x86: Move uaccess_err and sig_on_uaccess_err to thread_struct Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 19/29] signal: Consolidate {TS,TLF}_RESTORE_SIGMASK code Andy Lutomirski
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

thread_info is a legacy mess.  To prepare for its partial removal,
move addr_limit out.

As an added benefit, this way is simpler.
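
For context, the limit test that the asm stubs below perform against
addr_limit is the usual overflow-aware range check.  In C it is
roughly the following (a sketch of the idea, not the kernel's
access_ok() implementation; the helper name and limit value are made
up):

    #include <stdbool.h>
    #include <stdio.h>

    /* Sketch: does [addr, addr + size) fit below the address limit? */
    static bool range_ok(unsigned long addr, unsigned long size,
                         unsigned long limit)
    {
        unsigned long end = addr + size;

        if (end < addr)         /* wrapped: the "jc bad_*_user" case */
            return false;
        return end <= limit;    /* the "cmp TASK_addr_limit" case */
    }

    int main(void)
    {
        unsigned long limit = 0x00007ffffffff000UL;   /* illustrative USER_DS-style limit */

        printf("%d\n", range_ok(0x400000, 4096, limit));   /* 1 */
        printf("%d\n", range_ok(limit - 8, 64, limit));    /* 0: runs past the limit */
        printf("%d\n", range_ok(~0UL - 8, 64, limit));     /* 0: wraps */
        return 0;
    }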

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/checksum_32.h |  3 +--
 arch/x86/include/asm/processor.h   | 17 ++++++++++-------
 arch/x86/include/asm/thread_info.h |  7 -------
 arch/x86/include/asm/uaccess.h     |  6 +++---
 arch/x86/kernel/asm-offsets.c      |  4 +++-
 arch/x86/lib/copy_user_64.S        |  8 ++++----
 arch/x86/lib/csum-wrappers_64.c    |  1 +
 arch/x86/lib/getuser.S             | 20 ++++++++++----------
 arch/x86/lib/putuser.S             | 10 +++++-----
 arch/x86/lib/usercopy_64.c         |  2 +-
 drivers/pnp/isapnp/proc.c          |  2 +-
 lib/bitmap.c                       |  2 +-
 12 files changed, 40 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/checksum_32.h b/arch/x86/include/asm/checksum_32.h
index 532f85e6651f..7b53743ed267 100644
--- a/arch/x86/include/asm/checksum_32.h
+++ b/arch/x86/include/asm/checksum_32.h
@@ -2,8 +2,7 @@
 #define _ASM_X86_CHECKSUM_32_H
 
 #include <linux/in6.h>
-
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 
 /*
  * computes the checksum of a memory block at buff, length len,
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f53ae57bd985..a2e20d6d01fe 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -371,6 +371,10 @@ extern unsigned int xstate_size;
 
 struct perf_event;
 
+typedef struct {
+	unsigned long		seg;
+} mm_segment_t;
+
 struct thread_struct {
 	/* Cached TLS descriptors: */
 	struct desc_struct	tls_array[GDT_ENTRY_TLS_ENTRIES];
@@ -419,6 +423,8 @@ struct thread_struct {
 	/* Max allowed port in the bitmap, in bytes: */
 	unsigned		io_bitmap_max;
 
+	mm_segment_t		addr_limit;
+
 	unsigned int		sig_on_uaccess_error:1;
 	unsigned int		uaccess_err:1;	/* uaccess failed */
 
@@ -493,11 +499,6 @@ static inline void load_sp0(struct tss_struct *tss,
 #define set_iopl_mask native_set_iopl_mask
 #endif /* CONFIG_PARAVIRT */
 
-typedef struct {
-	unsigned long		seg;
-} mm_segment_t;
-
-
 /* Free all resources held by a thread. */
 extern void release_thread(struct task_struct *);
 
@@ -719,6 +720,7 @@ static inline void spin_lock_prefetch(const void *x)
 	.sp0			= TOP_OF_INIT_STACK,			  \
 	.sysenter_cs		= __KERNEL_CS,				  \
 	.io_bitmap_ptr		= NULL,					  \
+	.addr_limit		= KERNEL_DS,				  \
 }
 
 extern unsigned long thread_saved_pc(struct task_struct *tsk);
@@ -768,8 +770,9 @@ extern unsigned long thread_saved_pc(struct task_struct *tsk);
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
-#define INIT_THREAD  { \
-	.sp0 = TOP_OF_INIT_STACK \
+#define INIT_THREAD  {						\
+	.sp0			= TOP_OF_INIT_STACK,		\
+	.addr_limit		= KERNEL_DS,			\
 }
 
 /*
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 7c47bb659ecd..89bff044a6f5 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -57,7 +57,6 @@ struct thread_info {
 	__u32			flags;		/* low level flags */
 	__u32			status;		/* thread synchronous flags */
 	__u32			cpu;		/* current CPU */
-	mm_segment_t		addr_limit;
 };
 
 #define INIT_THREAD_INFO(tsk)			\
@@ -65,7 +64,6 @@ struct thread_info {
 	.task		= &tsk,			\
 	.flags		= 0,			\
 	.cpu		= 0,			\
-	.addr_limit	= KERNEL_DS,		\
 }
 
 #define init_thread_info	(init_thread_union.thread_info)
@@ -184,11 +182,6 @@ static inline unsigned long current_stack_pointer(void)
 # define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
 #endif
 
-/* Load thread_info address into "reg" */
-#define GET_THREAD_INFO(reg) \
-	_ASM_MOV PER_CPU_VAR(cpu_current_top_of_stack),reg ; \
-	_ASM_SUB $(THREAD_SIZE),reg ;
-
 /*
  * ASM operand which evaluates to a 'thread_info' address of
  * the current task, if it is known that "reg" is exactly "off"
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 4d2a726e8e6d..0eb18b4ac492 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -29,12 +29,12 @@
 #define USER_DS 	MAKE_MM_SEG(TASK_SIZE_MAX)
 
 #define get_ds()	(KERNEL_DS)
-#define get_fs()	(current_thread_info()->addr_limit)
-#define set_fs(x)	(current_thread_info()->addr_limit = (x))
+#define get_fs()	(current->thread.addr_limit)
+#define set_fs(x)	(current->thread.addr_limit = (x))
 
 #define segment_eq(a, b)	((a).seg == (b).seg)
 
-#define user_addr_max() (current_thread_info()->addr_limit.seg)
+#define user_addr_max() (current->thread.addr_limit.seg)
 #define __addr_ok(addr) 	\
 	((unsigned long __force)(addr) < user_addr_max())
 
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 674134e9f5e5..2bd5c6ff7ee7 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -31,7 +31,9 @@ void common(void) {
 	BLANK();
 	OFFSET(TI_flags, thread_info, flags);
 	OFFSET(TI_status, thread_info, status);
-	OFFSET(TI_addr_limit, thread_info, addr_limit);
+
+	BLANK();
+	OFFSET(TASK_addr_limit, task_struct, thread.addr_limit);
 
 	BLANK();
 	OFFSET(crypto_tfm_ctx_offset, crypto_tfm, __crt_ctx);
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 2b0ef26da0bd..bf603ebbfd8e 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -17,11 +17,11 @@
 
 /* Standard copy_to_user with segment limit checking */
 ENTRY(_copy_to_user)
-	GET_THREAD_INFO(%rax)
+	mov PER_CPU_VAR(current_task), %rax
 	movq %rdi,%rcx
 	addq %rdx,%rcx
 	jc bad_to_user
-	cmpq TI_addr_limit(%rax),%rcx
+	cmpq TASK_addr_limit(%rax),%rcx
 	ja bad_to_user
 	ALTERNATIVE_2 "jmp copy_user_generic_unrolled",		\
 		      "jmp copy_user_generic_string",		\
@@ -32,11 +32,11 @@ ENDPROC(_copy_to_user)
 
 /* Standard copy_from_user with segment limit checking */
 ENTRY(_copy_from_user)
-	GET_THREAD_INFO(%rax)
+	mov PER_CPU_VAR(current_task), %rax
 	movq %rsi,%rcx
 	addq %rdx,%rcx
 	jc bad_from_user
-	cmpq TI_addr_limit(%rax),%rcx
+	cmpq TASK_addr_limit(%rax),%rcx
 	ja bad_from_user
 	ALTERNATIVE_2 "jmp copy_user_generic_unrolled",		\
 		      "jmp copy_user_generic_string",		\
diff --git a/arch/x86/lib/csum-wrappers_64.c b/arch/x86/lib/csum-wrappers_64.c
index 28a6654f0d08..b6fcb9a9ddbc 100644
--- a/arch/x86/lib/csum-wrappers_64.c
+++ b/arch/x86/lib/csum-wrappers_64.c
@@ -6,6 +6,7 @@
  */
 #include <asm/checksum.h>
 #include <linux/module.h>
+#include <linux/uaccess.h>
 #include <asm/smap.h>
 
 /**
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index 46668cda4ffd..0ef5128c2de8 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -35,8 +35,8 @@
 
 	.text
 ENTRY(__get_user_1)
-	GET_THREAD_INFO(%_ASM_DX)
-	cmp TI_addr_limit(%_ASM_DX),%_ASM_AX
+	mov PER_CPU_VAR(current_task), %_ASM_DX
+	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
 	jae bad_get_user
 	ASM_STAC
 1:	movzbl (%_ASM_AX),%edx
@@ -48,8 +48,8 @@ ENDPROC(__get_user_1)
 ENTRY(__get_user_2)
 	add $1,%_ASM_AX
 	jc bad_get_user
-	GET_THREAD_INFO(%_ASM_DX)
-	cmp TI_addr_limit(%_ASM_DX),%_ASM_AX
+	mov PER_CPU_VAR(current_task), %_ASM_DX
+	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
 	jae bad_get_user
 	ASM_STAC
 2:	movzwl -1(%_ASM_AX),%edx
@@ -61,8 +61,8 @@ ENDPROC(__get_user_2)
 ENTRY(__get_user_4)
 	add $3,%_ASM_AX
 	jc bad_get_user
-	GET_THREAD_INFO(%_ASM_DX)
-	cmp TI_addr_limit(%_ASM_DX),%_ASM_AX
+	mov PER_CPU_VAR(current_task), %_ASM_DX
+	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
 	jae bad_get_user
 	ASM_STAC
 3:	movl -3(%_ASM_AX),%edx
@@ -75,8 +75,8 @@ ENTRY(__get_user_8)
 #ifdef CONFIG_X86_64
 	add $7,%_ASM_AX
 	jc bad_get_user
-	GET_THREAD_INFO(%_ASM_DX)
-	cmp TI_addr_limit(%_ASM_DX),%_ASM_AX
+	mov PER_CPU_VAR(current_task), %_ASM_DX
+	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
 	jae bad_get_user
 	ASM_STAC
 4:	movq -7(%_ASM_AX),%rdx
@@ -86,8 +86,8 @@ ENTRY(__get_user_8)
 #else
 	add $7,%_ASM_AX
 	jc bad_get_user_8
-	GET_THREAD_INFO(%_ASM_DX)
-	cmp TI_addr_limit(%_ASM_DX),%_ASM_AX
+	mov PER_CPU_VAR(current_task), %_ASM_DX
+	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
 	jae bad_get_user_8
 	ASM_STAC
 4:	movl -7(%_ASM_AX),%edx
diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S
index e0817a12d323..c891ece81e5b 100644
--- a/arch/x86/lib/putuser.S
+++ b/arch/x86/lib/putuser.S
@@ -29,14 +29,14 @@
  * as they get called from within inline assembly.
  */
 
-#define ENTER	GET_THREAD_INFO(%_ASM_BX)
+#define ENTER	mov PER_CPU_VAR(current_task), %_ASM_BX
 #define EXIT	ASM_CLAC ;	\
 		ret
 
 .text
 ENTRY(__put_user_1)
 	ENTER
-	cmp TI_addr_limit(%_ASM_BX),%_ASM_CX
+	cmp TASK_addr_limit(%_ASM_BX),%_ASM_CX
 	jae bad_put_user
 	ASM_STAC
 1:	movb %al,(%_ASM_CX)
@@ -46,7 +46,7 @@ ENDPROC(__put_user_1)
 
 ENTRY(__put_user_2)
 	ENTER
-	mov TI_addr_limit(%_ASM_BX),%_ASM_BX
+	mov TASK_addr_limit(%_ASM_BX),%_ASM_BX
 	sub $1,%_ASM_BX
 	cmp %_ASM_BX,%_ASM_CX
 	jae bad_put_user
@@ -58,7 +58,7 @@ ENDPROC(__put_user_2)
 
 ENTRY(__put_user_4)
 	ENTER
-	mov TI_addr_limit(%_ASM_BX),%_ASM_BX
+	mov TASK_addr_limit(%_ASM_BX),%_ASM_BX
 	sub $3,%_ASM_BX
 	cmp %_ASM_BX,%_ASM_CX
 	jae bad_put_user
@@ -70,7 +70,7 @@ ENDPROC(__put_user_4)
 
 ENTRY(__put_user_8)
 	ENTER
-	mov TI_addr_limit(%_ASM_BX),%_ASM_BX
+	mov TASK_addr_limit(%_ASM_BX),%_ASM_BX
 	sub $7,%_ASM_BX
 	cmp %_ASM_BX,%_ASM_CX
 	jae bad_put_user
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0a42327a59d7..9f760cdcaf40 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -6,7 +6,7 @@
  * Copyright 2002 Andi Kleen <ak@suse.de>
  */
 #include <linux/module.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 
 /*
  * Zero Userspace
diff --git a/drivers/pnp/isapnp/proc.c b/drivers/pnp/isapnp/proc.c
index 5edee645d890..262285e48a09 100644
--- a/drivers/pnp/isapnp/proc.c
+++ b/drivers/pnp/isapnp/proc.c
@@ -21,7 +21,7 @@
 #include <linux/isapnp.h>
 #include <linux/proc_fs.h>
 #include <linux/init.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 
 extern struct pnp_protocol isapnp_protocol;
 
diff --git a/lib/bitmap.c b/lib/bitmap.c
index c66da508cbf7..eca88087fa8a 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -14,9 +14,9 @@
 #include <linux/bug.h>
 #include <linux/kernel.h>
 #include <linux/string.h>
+#include <linux/uaccess.h>
 
 #include <asm/page.h>
-#include <asm/uaccess.h>
 
 /*
  * bitmaps provide an array of bits, implemented using an an
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 19/29] signal: Consolidate {TS,TLF}_RESTORE_SIGMASK code
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (17 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 18/29] x86: Move addr_limit " Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 20/29] x86/smp: Remove stack_smp_processor_id() Andy Lutomirski
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Tony Luck, Fenghua Yu,
	Michal Simek, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Peter Zijlstra, Borislav Petkov, Dmitry Safonov,
	Andrew Morton, linux-alpha, linux-ia64, linuxppc-dev, linux-sh,
	sparclinux

In general, there's no need for the "restore sigmask" flag to live in
ti->flags.  alpha, ia64, microblaze, powerpc, sh, sparc (64-bit only),
tile, and x86 use essentially identical alternative implementations,
placing the flag in ti->status.

Replace those optimized implementations with an equally good common
implementation that stores it in a bitfield in struct task_struct
and drop the custom implementations.

Additional architectures can opt in by removing their
TIF_RESTORE_SIGMASK defines.
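
For reference, the common implementation added to include/linux/sched.h
(in the hunk further down in this patch) boils down to a one-bit field
in task_struct plus four trivial helpers, roughly:

    /* Condensed sketch of the generic helpers; see the sched.h hunk for
     * the exact definitions.  The flag is a task_struct bitfield,
     * e.g. "unsigned restore_sigmask:1;". */
    static inline void set_restore_sigmask(void)
    {
        current->restore_sigmask = true;
        WARN_ON(!test_thread_flag(TIF_SIGPENDING));
    }

    static inline void clear_restore_sigmask(void)
    {
        current->restore_sigmask = false;
    }

    static inline bool test_restore_sigmask(void)
    {
        return current->restore_sigmask;
    }

    static inline bool test_and_clear_restore_sigmask(void)
    {
        if (!current->restore_sigmask)
            return false;
        current->restore_sigmask = false;
        return true;
    }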

Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: x86@kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-alpha@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/alpha/include/asm/thread_info.h      | 27 -------------
 arch/ia64/include/asm/thread_info.h       | 28 --------------
 arch/microblaze/include/asm/thread_info.h | 27 -------------
 arch/powerpc/include/asm/thread_info.h    | 25 ------------
 arch/sh/include/asm/thread_info.h         | 26 -------------
 arch/sparc/include/asm/thread_info_64.h   | 24 ------------
 arch/tile/include/asm/thread_info.h       | 27 -------------
 arch/x86/include/asm/thread_info.h        | 24 ------------
 include/linux/sched.h                     | 63 +++++++++++++++++++++++++++++++
 include/linux/thread_info.h               | 41 --------------------
 10 files changed, 63 insertions(+), 249 deletions(-)

diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 32e920a83ae5..e9e90bfa2b50 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -86,33 +86,6 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define TS_UAC_NOPRINT		0x0001	/* ! Preserve the following three */
 #define TS_UAC_NOFIX		0x0002	/* ! flags as they match          */
 #define TS_UAC_SIGBUS		0x0004	/* ! userspace part of 'osf_sysinfo' */
-#define TS_RESTORE_SIGMASK	0x0008	/* restore signal mask in do_signal() */
-
-#ifndef __ASSEMBLY__
-#define HAVE_SET_RESTORE_SIGMASK	1
-static inline void set_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	ti->status |= TS_RESTORE_SIGMASK;
-	WARN_ON(!test_bit(TIF_SIGPENDING, (unsigned long *)&ti->flags));
-}
-static inline void clear_restore_sigmask(void)
-{
-	current_thread_info()->status &= ~TS_RESTORE_SIGMASK;
-}
-static inline bool test_restore_sigmask(void)
-{
-	return current_thread_info()->status & TS_RESTORE_SIGMASK;
-}
-static inline bool test_and_clear_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	if (!(ti->status & TS_RESTORE_SIGMASK))
-		return false;
-	ti->status &= ~TS_RESTORE_SIGMASK;
-	return true;
-}
-#endif
 
 #define SET_UNALIGN_CTL(task,value)	({				\
 	__u32 status = task_thread_info(task)->status & ~UAC_BITMASK;	\
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index f0a72e98e5a4..c7026429816b 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -121,32 +121,4 @@ struct thread_info {
 /* like TIF_ALLWORK_BITS but sans TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT */
 #define TIF_WORK_MASK		(TIF_ALLWORK_MASK&~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT))
 
-#define TS_RESTORE_SIGMASK	2	/* restore signal mask in do_signal() */
-
-#ifndef __ASSEMBLY__
-#define HAVE_SET_RESTORE_SIGMASK	1
-static inline void set_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	ti->status |= TS_RESTORE_SIGMASK;
-	WARN_ON(!test_bit(TIF_SIGPENDING, &ti->flags));
-}
-static inline void clear_restore_sigmask(void)
-{
-	current_thread_info()->status &= ~TS_RESTORE_SIGMASK;
-}
-static inline bool test_restore_sigmask(void)
-{
-	return current_thread_info()->status & TS_RESTORE_SIGMASK;
-}
-static inline bool test_and_clear_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	if (!(ti->status & TS_RESTORE_SIGMASK))
-		return false;
-	ti->status &= ~TS_RESTORE_SIGMASK;
-	return true;
-}
-#endif	/* !__ASSEMBLY__ */
-
 #endif /* _ASM_IA64_THREAD_INFO_H */
diff --git a/arch/microblaze/include/asm/thread_info.h b/arch/microblaze/include/asm/thread_info.h
index 383f387b4eee..e7e8954e9815 100644
--- a/arch/microblaze/include/asm/thread_info.h
+++ b/arch/microblaze/include/asm/thread_info.h
@@ -148,33 +148,6 @@ static inline struct thread_info *current_thread_info(void)
  */
 /* FPU was used by this task this quantum (SMP) */
 #define TS_USEDFPU		0x0001
-#define TS_RESTORE_SIGMASK	0x0002
-
-#ifndef __ASSEMBLY__
-#define HAVE_SET_RESTORE_SIGMASK 1
-static inline void set_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	ti->status |= TS_RESTORE_SIGMASK;
-	WARN_ON(!test_bit(TIF_SIGPENDING, (unsigned long *)&ti->flags));
-}
-static inline void clear_restore_sigmask(void)
-{
-	current_thread_info()->status &= ~TS_RESTORE_SIGMASK;
-}
-static inline bool test_restore_sigmask(void)
-{
-	return current_thread_info()->status & TS_RESTORE_SIGMASK;
-}
-static inline bool test_and_clear_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	if (!(ti->status & TS_RESTORE_SIGMASK))
-		return false;
-	ti->status &= ~TS_RESTORE_SIGMASK;
-	return true;
-}
-#endif
 
 #endif /* __KERNEL__ */
 #endif /* _ASM_MICROBLAZE_THREAD_INFO_H */
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 8febc3f66d53..cfc35195f95e 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -134,40 +134,15 @@ static inline struct thread_info *current_thread_info(void)
 /* Don't move TLF_NAPPING without adjusting the code in entry_32.S */
 #define TLF_NAPPING		0	/* idle thread enabled NAP mode */
 #define TLF_SLEEPING		1	/* suspend code enabled SLEEP mode */
-#define TLF_RESTORE_SIGMASK	2	/* Restore signal mask in do_signal */
 #define TLF_LAZY_MMU		3	/* tlb_batch is active */
 #define TLF_RUNLATCH		4	/* Is the runlatch enabled? */
 
 #define _TLF_NAPPING		(1 << TLF_NAPPING)
 #define _TLF_SLEEPING		(1 << TLF_SLEEPING)
-#define _TLF_RESTORE_SIGMASK	(1 << TLF_RESTORE_SIGMASK)
 #define _TLF_LAZY_MMU		(1 << TLF_LAZY_MMU)
 #define _TLF_RUNLATCH		(1 << TLF_RUNLATCH)
 
 #ifndef __ASSEMBLY__
-#define HAVE_SET_RESTORE_SIGMASK	1
-static inline void set_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	ti->local_flags |= _TLF_RESTORE_SIGMASK;
-	WARN_ON(!test_bit(TIF_SIGPENDING, &ti->flags));
-}
-static inline void clear_restore_sigmask(void)
-{
-	current_thread_info()->local_flags &= ~_TLF_RESTORE_SIGMASK;
-}
-static inline bool test_restore_sigmask(void)
-{
-	return current_thread_info()->local_flags & _TLF_RESTORE_SIGMASK;
-}
-static inline bool test_and_clear_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	if (!(ti->local_flags & _TLF_RESTORE_SIGMASK))
-		return false;
-	ti->local_flags &= ~_TLF_RESTORE_SIGMASK;
-	return true;
-}
 
 static inline bool test_thread_local_flags(unsigned int flags)
 {
diff --git a/arch/sh/include/asm/thread_info.h b/arch/sh/include/asm/thread_info.h
index 2afa321157be..6c65dcd470ab 100644
--- a/arch/sh/include/asm/thread_info.h
+++ b/arch/sh/include/asm/thread_info.h
@@ -151,19 +151,10 @@ extern void init_thread_xstate(void);
  * ever touches our thread-synchronous status, so we don't
  * have to worry about atomic accesses.
  */
-#define TS_RESTORE_SIGMASK	0x0001	/* restore signal mask in do_signal() */
 #define TS_USEDFPU		0x0002	/* FPU used by this task this quantum */
 
 #ifndef __ASSEMBLY__
 
-#define HAVE_SET_RESTORE_SIGMASK	1
-static inline void set_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	ti->status |= TS_RESTORE_SIGMASK;
-	WARN_ON(!test_bit(TIF_SIGPENDING, (unsigned long *)&ti->flags));
-}
-
 #define TI_FLAG_FAULT_CODE_SHIFT	24
 
 /*
@@ -182,23 +173,6 @@ static inline unsigned int get_thread_fault_code(void)
 	return ti->flags >> TI_FLAG_FAULT_CODE_SHIFT;
 }
 
-static inline void clear_restore_sigmask(void)
-{
-	current_thread_info()->status &= ~TS_RESTORE_SIGMASK;
-}
-static inline bool test_restore_sigmask(void)
-{
-	return current_thread_info()->status & TS_RESTORE_SIGMASK;
-}
-static inline bool test_and_clear_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	if (!(ti->status & TS_RESTORE_SIGMASK))
-		return false;
-	ti->status &= ~TS_RESTORE_SIGMASK;
-	return true;
-}
-
 #endif	/* !__ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/sparc/include/asm/thread_info_64.h b/arch/sparc/include/asm/thread_info_64.h
index bde59825d06c..3d7b925f6516 100644
--- a/arch/sparc/include/asm/thread_info_64.h
+++ b/arch/sparc/include/asm/thread_info_64.h
@@ -222,32 +222,8 @@ register struct thread_info *current_thread_info_reg asm("g6");
  *
  * Note that there are only 8 bits available.
  */
-#define TS_RESTORE_SIGMASK	0x0001	/* restore signal mask in do_signal() */
 
 #ifndef __ASSEMBLY__
-#define HAVE_SET_RESTORE_SIGMASK	1
-static inline void set_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	ti->status |= TS_RESTORE_SIGMASK;
-	WARN_ON(!test_bit(TIF_SIGPENDING, &ti->flags));
-}
-static inline void clear_restore_sigmask(void)
-{
-	current_thread_info()->status &= ~TS_RESTORE_SIGMASK;
-}
-static inline bool test_restore_sigmask(void)
-{
-	return current_thread_info()->status & TS_RESTORE_SIGMASK;
-}
-static inline bool test_and_clear_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	if (!(ti->status & TS_RESTORE_SIGMASK))
-		return false;
-	ti->status &= ~TS_RESTORE_SIGMASK;
-	return true;
-}
 
 #define thread32_stack_is_64bit(__SP) (((__SP) & 0x1) != 0)
 #define test_thread_64bit_stack(__SP) \
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index c1467ac59ce6..b7659b8f1117 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -166,32 +166,5 @@ extern void _cpu_idle(void);
 #ifdef __tilegx__
 #define TS_COMPAT		0x0001	/* 32-bit compatibility mode */
 #endif
-#define TS_RESTORE_SIGMASK	0x0008	/* restore signal mask in do_signal */
-
-#ifndef __ASSEMBLY__
-#define HAVE_SET_RESTORE_SIGMASK	1
-static inline void set_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	ti->status |= TS_RESTORE_SIGMASK;
-	WARN_ON(!test_bit(TIF_SIGPENDING, &ti->flags));
-}
-static inline void clear_restore_sigmask(void)
-{
-	current_thread_info()->status &= ~TS_RESTORE_SIGMASK;
-}
-static inline bool test_restore_sigmask(void)
-{
-	return current_thread_info()->status & TS_RESTORE_SIGMASK;
-}
-static inline bool test_and_clear_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	if (!(ti->status & TS_RESTORE_SIGMASK))
-		return false;
-	ti->status &= ~TS_RESTORE_SIGMASK;
-	return true;
-}
-#endif	/* !__ASSEMBLY__ */
 
 #endif /* _ASM_TILE_THREAD_INFO_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 89bff044a6f5..b45ffdda3549 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -219,32 +219,8 @@ static inline unsigned long current_stack_pointer(void)
  * have to worry about atomic accesses.
  */
 #define TS_COMPAT		0x0002	/* 32bit syscall active (64BIT)*/
-#define TS_RESTORE_SIGMASK	0x0008	/* restore signal mask in do_signal() */
 
 #ifndef __ASSEMBLY__
-#define HAVE_SET_RESTORE_SIGMASK	1
-static inline void set_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	ti->status |= TS_RESTORE_SIGMASK;
-	WARN_ON(!test_bit(TIF_SIGPENDING, (unsigned long *)&ti->flags));
-}
-static inline void clear_restore_sigmask(void)
-{
-	current_thread_info()->status &= ~TS_RESTORE_SIGMASK;
-}
-static inline bool test_restore_sigmask(void)
-{
-	return current_thread_info()->status & TS_RESTORE_SIGMASK;
-}
-static inline bool test_and_clear_restore_sigmask(void)
-{
-	struct thread_info *ti = current_thread_info();
-	if (!(ti->status & TS_RESTORE_SIGMASK))
-		return false;
-	ti->status &= ~TS_RESTORE_SIGMASK;
-	return true;
-}
 
 static inline bool in_ia32_syscall(void)
 {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26869dba21f1..569df670407a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1545,6 +1545,9 @@ struct task_struct {
 	/* unserialized, strictly 'current' */
 	unsigned in_execve:1; /* bit to tell LSMs we're in execve */
 	unsigned in_iowait:1;
+#if !defined(TIF_RESTORE_SIGMASK)
+	unsigned restore_sigmask:1;
+#endif
 #ifdef CONFIG_MEMCG
 	unsigned memcg_may_oom:1;
 #ifndef CONFIG_SLOB
@@ -2664,6 +2667,66 @@ extern void sigqueue_free(struct sigqueue *);
 extern int send_sigqueue(struct sigqueue *,  struct task_struct *, int group);
 extern int do_sigaction(int, struct k_sigaction *, struct k_sigaction *);
 
+#ifdef TIF_RESTORE_SIGMASK
+/*
+ * Legacy restore_sigmask accessors.  These are inefficient on
+ * SMP architectures because they require atomic operations.
+ */
+
+/**
+ * set_restore_sigmask() - make sure saved_sigmask processing gets done
+ *
+ * This sets TIF_RESTORE_SIGMASK and ensures that the arch signal code
+ * will run before returning to user mode, to process the flag.  For
+ * all callers, TIF_SIGPENDING is already set or it's no harm to set
+ * it.  TIF_RESTORE_SIGMASK need not be in the set of bits that the
+ * arch code will notice on return to user mode, in case those bits
+ * are scarce.  We set TIF_SIGPENDING here to ensure that the arch
+ * signal code always gets run when TIF_RESTORE_SIGMASK is set.
+ */
+static inline void set_restore_sigmask(void)
+{
+	set_thread_flag(TIF_RESTORE_SIGMASK);
+	WARN_ON(!test_thread_flag(TIF_SIGPENDING));
+}
+static inline void clear_restore_sigmask(void)
+{
+	clear_thread_flag(TIF_RESTORE_SIGMASK);
+}
+static inline bool test_restore_sigmask(void)
+{
+	return test_thread_flag(TIF_RESTORE_SIGMASK);
+}
+static inline bool test_and_clear_restore_sigmask(void)
+{
+	return test_and_clear_thread_flag(TIF_RESTORE_SIGMASK);
+}
+
+#else	/* TIF_RESTORE_SIGMASK */
+
+/* Higher-quality implementation, used if TIF_RESTORE_SIGMASK doesn't exist. */
+static inline void set_restore_sigmask(void)
+{
+	current->restore_sigmask = true;
+	WARN_ON(!test_thread_flag(TIF_SIGPENDING));
+}
+static inline void clear_restore_sigmask(void)
+{
+	current->restore_sigmask = false;
+}
+static inline bool test_restore_sigmask(void)
+{
+	return current->restore_sigmask;
+}
+static inline bool test_and_clear_restore_sigmask(void)
+{
+	if (!current->restore_sigmask)
+		return false;
+	current->restore_sigmask = false;
+	return true;
+}
+#endif
+
 static inline void restore_saved_sigmask(void)
 {
 	if (test_and_clear_restore_sigmask())
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index b4c2a485b28a..352b1542f5cc 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -105,47 +105,6 @@ static inline int test_ti_thread_flag(struct thread_info *ti, int flag)
 
 #define tif_need_resched() test_thread_flag(TIF_NEED_RESCHED)
 
-#if defined TIF_RESTORE_SIGMASK && !defined HAVE_SET_RESTORE_SIGMASK
-/*
- * An arch can define its own version of set_restore_sigmask() to get the
- * job done however works, with or without TIF_RESTORE_SIGMASK.
- */
-#define HAVE_SET_RESTORE_SIGMASK	1
-
-/**
- * set_restore_sigmask() - make sure saved_sigmask processing gets done
- *
- * This sets TIF_RESTORE_SIGMASK and ensures that the arch signal code
- * will run before returning to user mode, to process the flag.  For
- * all callers, TIF_SIGPENDING is already set or it's no harm to set
- * it.  TIF_RESTORE_SIGMASK need not be in the set of bits that the
- * arch code will notice on return to user mode, in case those bits
- * are scarce.  We set TIF_SIGPENDING here to ensure that the arch
- * signal code always gets run when TIF_RESTORE_SIGMASK is set.
- */
-static inline void set_restore_sigmask(void)
-{
-	set_thread_flag(TIF_RESTORE_SIGMASK);
-	WARN_ON(!test_thread_flag(TIF_SIGPENDING));
-}
-static inline void clear_restore_sigmask(void)
-{
-	clear_thread_flag(TIF_RESTORE_SIGMASK);
-}
-static inline bool test_restore_sigmask(void)
-{
-	return test_thread_flag(TIF_RESTORE_SIGMASK);
-}
-static inline bool test_and_clear_restore_sigmask(void)
-{
-	return test_and_clear_thread_flag(TIF_RESTORE_SIGMASK);
-}
-#endif	/* TIF_RESTORE_SIGMASK && !HAVE_SET_RESTORE_SIGMASK */
-
-#ifndef HAVE_SET_RESTORE_SIGMASK
-#error "no set_restore_sigmask() provided and default one won't work"
-#endif
-
 #endif	/* __KERNEL__ */
 
 #endif /* _LINUX_THREAD_INFO_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 20/29] x86/smp: Remove stack_smp_processor_id()
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (18 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 19/29] signal: Consolidate {TS,TLF}_RESTORE_SIGMASK code Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 21/29] x86/smp: Remove unnecessary initialization of thread_info::cpu Andy Lutomirski
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

It serves no purpose -- raw_smp_processor_id() works fine.  This
change will be needed to move thread_info off the stack.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/cpu.h   | 1 -
 arch/x86/include/asm/smp.h   | 6 ------
 arch/x86/kernel/cpu/common.c | 2 +-
 3 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index 678637ad7476..59d34c521d96 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -17,7 +17,6 @@ static inline void prefill_possible_map(void) {}
 
 #define cpu_physical_id(cpu)			boot_cpu_physical_apicid
 #define safe_smp_processor_id()			0
-#define stack_smp_processor_id()		0
 
 #endif /* CONFIG_SMP */
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 66b057306f40..0576b6157f3a 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -172,12 +172,6 @@ extern int safe_smp_processor_id(void);
 #elif defined(CONFIG_X86_64_SMP)
 #define raw_smp_processor_id() (this_cpu_read(cpu_number))
 
-#define stack_smp_processor_id()					\
-({								\
-	struct thread_info *ti;						\
-	__asm__("andq %%rsp,%0; ":"=r" (ti) : "0" (CURRENT_MASK));	\
-	ti->cpu;							\
-})
 #define safe_smp_processor_id()		smp_processor_id()
 
 #endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 0fe6953f421c..d22a7b9c4f0e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1452,7 +1452,7 @@ void cpu_init(void)
 	struct task_struct *me;
 	struct tss_struct *t;
 	unsigned long v;
-	int cpu = stack_smp_processor_id();
+	int cpu = raw_smp_processor_id();
 	int i;
 
 	wait_for_master_cpu(cpu);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 21/29] x86/smp: Remove unnecessary initialization of thread_info::cpu
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (19 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 20/29] x86/smp: Remove stack_smp_processor_id() Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 22/29] x86/asm: Move 'status' from struct thread_info to struct thread_struct Andy Lutomirski
                   ` (11 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

It's statically initialized to zero -- no need to dynamically
initialize it to zero as well.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/smpboot.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index fafe8b923cac..0e91dbeca2fd 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1285,7 +1285,6 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 	cpumask_copy(cpu_callin_mask, cpumask_of(0));
 	mb();
 
-	current_thread_info()->cpu = 0;  /* needed? */
 	for_each_possible_cpu(i) {
 		zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 22/29] x86/asm: Move 'status' from struct thread_info to struct thread_struct
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (20 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 21/29] x86/smp: Remove unnecessary initialization of thread_info::cpu Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 23:55   ` Brian Gerst
  2016-06-26 21:55 ` [PATCH v4 23/29] kdb: Use task_cpu() instead of task_thread_info()->cpu Andy Lutomirski
                   ` (10 subsequent siblings)
  32 siblings, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

Because sched.h and thread_info.h are a tangled mess, I turned
in_compat_syscall into a macro.  If we had current_thread_struct()
or similar and we could use it from thread_info.h, then this would
be a bit cleaner.
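
(Illustrative sketch only, not part of the patch: assuming the header
ordering could be untangled enough for thread_info.h to see struct
task_struct, the hypothetical helper might look like this, with the
64-bit in_ia32_syscall() written as an inline instead of the macro
added below.)

static inline struct thread_struct *current_thread_struct(void)
{
	return &current->thread;
}

static inline bool in_ia32_syscall(void)
{
	return IS_ENABLED(CONFIG_IA32_EMULATION) &&
	       (current_thread_struct()->status & TS_COMPAT);
}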

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c            |  4 ++--
 arch/x86/include/asm/processor.h   | 12 ++++++++++++
 arch/x86/include/asm/syscall.h     | 23 +++++------------------
 arch/x86/include/asm/thread_info.h | 23 ++++-------------------
 arch/x86/kernel/asm-offsets.c      |  1 -
 arch/x86/kernel/fpu/init.c         |  1 -
 arch/x86/kernel/process_64.c       |  4 ++--
 arch/x86/kernel/ptrace.c           |  2 +-
 8 files changed, 26 insertions(+), 44 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ec138e538c44..c4150bec7982 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -271,7 +271,7 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	 * syscalls.  The fixup is exercised by the ptrace_syscall_32
 	 * selftest.
 	 */
-	ti->status &= ~TS_COMPAT;
+	current->thread.status &= ~TS_COMPAT;
 #endif
 
 	user_enter();
@@ -369,7 +369,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
 	unsigned int nr = (unsigned int)regs->orig_ax;
 
 #ifdef CONFIG_IA32_EMULATION
-	ti->status |= TS_COMPAT;
+	current->thread.status |= TS_COMPAT;
 #endif
 
 	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a2e20d6d01fe..a75e720f6402 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -388,6 +388,9 @@ struct thread_struct {
 	unsigned short		fsindex;
 	unsigned short		gsindex;
 #endif
+
+	u32			status;		/* thread synchronous flags */
+
 #ifdef CONFIG_X86_32
 	unsigned long		ip;
 #endif
@@ -437,6 +440,15 @@ struct thread_struct {
 };
 
 /*
+ * Thread-synchronous status.
+ *
+ * This is different from the flags in that nobody else
+ * ever touches our thread-synchronous status, so we don't
+ * have to worry about atomic accesses.
+ */
+#define TS_COMPAT		0x0002	/* 32bit syscall active (64BIT)*/
+
+/*
  * Set IOPL bits in EFLAGS from given mask
  */
 static inline void native_set_iopl_mask(unsigned mask)
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 999b7cd2e78c..17229e7e2a1c 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -60,7 +60,7 @@ static inline long syscall_get_error(struct task_struct *task,
 	 * TS_COMPAT is set for 32-bit syscall entries and then
 	 * remains set until we return to user mode.
 	 */
-	if (task_thread_info(task)->status & TS_COMPAT)
+	if (task->thread.status & TS_COMPAT)
 		/*
 		 * Sign-extend the value so (int)-EFOO becomes (long)-EFOO
 		 * and will match correctly in comparisons.
@@ -116,7 +116,7 @@ static inline void syscall_get_arguments(struct task_struct *task,
 					 unsigned long *args)
 {
 # ifdef CONFIG_IA32_EMULATION
-	if (task_thread_info(task)->status & TS_COMPAT)
+	if (task->thread.status & TS_COMPAT)
 		switch (i) {
 		case 0:
 			if (!n--) break;
@@ -177,7 +177,7 @@ static inline void syscall_set_arguments(struct task_struct *task,
 					 const unsigned long *args)
 {
 # ifdef CONFIG_IA32_EMULATION
-	if (task_thread_info(task)->status & TS_COMPAT)
+	if (task->thread.status & TS_COMPAT)
 		switch (i) {
 		case 0:
 			if (!n--) break;
@@ -234,21 +234,8 @@ static inline void syscall_set_arguments(struct task_struct *task,
 
 static inline int syscall_get_arch(void)
 {
-#ifdef CONFIG_IA32_EMULATION
-	/*
-	 * TS_COMPAT is set for 32-bit syscall entry and then
-	 * remains set until we return to user mode.
-	 *
-	 * TIF_IA32 tasks should always have TS_COMPAT set at
-	 * system call time.
-	 *
-	 * x32 tasks should be considered AUDIT_ARCH_X86_64.
-	 */
-	if (task_thread_info(current)->status & TS_COMPAT)
-		return AUDIT_ARCH_I386;
-#endif
-	/* Both x32 and x86_64 are considered "64-bit". */
-	return AUDIT_ARCH_X86_64;
+	/* x32 tasks should be considered AUDIT_ARCH_X86_64. */
+	return in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 }
 #endif	/* CONFIG_X86_32 */
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index b45ffdda3549..7b42c1e462ac 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -55,7 +55,6 @@ struct task_struct;
 struct thread_info {
 	struct task_struct	*task;		/* main task structure */
 	__u32			flags;		/* low level flags */
-	__u32			status;		/* thread synchronous flags */
 	__u32			cpu;		/* current CPU */
 };
 
@@ -211,28 +210,14 @@ static inline unsigned long current_stack_pointer(void)
 
 #endif
 
-/*
- * Thread-synchronous status.
- *
- * This is different from the flags in that nobody else
- * ever touches our thread-synchronous status, so we don't
- * have to worry about atomic accesses.
- */
-#define TS_COMPAT		0x0002	/* 32bit syscall active (64BIT)*/
-
 #ifndef __ASSEMBLY__
 
-static inline bool in_ia32_syscall(void)
-{
 #ifdef CONFIG_X86_32
-	return true;
-#endif
-#ifdef CONFIG_IA32_EMULATION
-	if (current_thread_info()->status & TS_COMPAT)
-		return true;
+#define in_ia32_syscall() true
+#else
+#define in_ia32_syscall() (IS_ENABLED(CONFIG_IA32_EMULATION) && \
+			   current->thread.status & TS_COMPAT)
 #endif
-	return false;
-}
 
 /*
  * Force syscall return via IRET by making it look as if there was
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 2bd5c6ff7ee7..a91a6ead24a2 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -30,7 +30,6 @@
 void common(void) {
 	BLANK();
 	OFFSET(TI_flags, thread_info, flags);
-	OFFSET(TI_status, thread_info, status);
 
 	BLANK();
 	OFFSET(TASK_addr_limit, task_struct, thread.addr_limit);
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index aacfd7a82cec..4579c1544ed1 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -327,7 +327,6 @@ static void __init fpu__init_system_ctx_switch(void)
 	on_boot_cpu = 0;
 
 	WARN_ON_FPU(current->thread.fpu.fpstate_active);
-	current_thread_info()->status = 0;
 
 	if (boot_cpu_has(X86_FEATURE_XSAVEOPT) && eagerfpu != DISABLE)
 		eagerfpu = ENABLE;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 6e789ca1f841..e6d53b641ef0 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -511,7 +511,7 @@ void set_personality_ia32(bool x32)
 		current->personality &= ~READ_IMPLIES_EXEC;
 		/* in_compat_syscall() uses the presence of the x32
 		   syscall bit flag to determine compat status */
-		current_thread_info()->status &= ~TS_COMPAT;
+		current->thread.status &= ~TS_COMPAT;
 	} else {
 		set_thread_flag(TIF_IA32);
 		clear_thread_flag(TIF_X32);
@@ -519,7 +519,7 @@ void set_personality_ia32(bool x32)
 			current->mm->context.ia32_compat = TIF_IA32;
 		current->personality |= force_personality32;
 		/* Prepare the first "return" to user space */
-		current_thread_info()->status |= TS_COMPAT;
+		current->thread.status |= TS_COMPAT;
 	}
 }
 EXPORT_SYMBOL_GPL(set_personality_ia32);
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 600edd225e81..4f03301ecfbe 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -931,7 +931,7 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
 		 */
 		regs->orig_ax = value;
 		if (syscall_get_nr(child, regs) >= 0)
-			task_thread_info(child)->status |= TS_COMPAT;
+			child->thread.status |= TS_COMPAT;
 		break;
 
 	case offsetof(struct user32, regs.eflags):
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 23/29] kdb: Use task_cpu() instead of task_thread_info()->cpu
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (21 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 22/29] x86/asm: Move 'status' from struct thread_info to struct thread_struct Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 24/29] x86/entry: Get rid of pt_regs_to_thread_info() Andy Lutomirski
                   ` (9 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski, Jason Wessel

We'll need this cleanup to make the cpu field in thread_info
optional.

Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/kdb.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/kdb.h b/include/linux/kdb.h
index a19bcf9e762e..410decacff8f 100644
--- a/include/linux/kdb.h
+++ b/include/linux/kdb.h
@@ -177,7 +177,7 @@ extern int kdb_get_kbd_char(void);
 static inline
 int kdb_process_cpu(const struct task_struct *p)
 {
-	unsigned int cpu = task_thread_info(p)->cpu;
+	unsigned int cpu = task_cpu(p);
 	if (cpu > num_possible_cpus())
 		cpu = 0;
 	return cpu;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 24/29] x86/entry: Get rid of pt_regs_to_thread_info()
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (22 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 23/29] kdb: Use task_cpu() instead of task_thread_info()->cpu Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 25/29] um: Stop conflating task_struct::stack with thread_info Andy Lutomirski
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

From: Linus Torvalds <torvalds@linux-foundation.org>

It was a nice optimization while it lasted, but thread_info is moving
and this optimization will no longer work.

Quoting Linus:

    Oh Gods, Andy. That pt_regs_to_thread_info() thing made me want
    to do unspeakable acts on a poor innocent wax figure that looked
    _exactly_ like you.

[changelog written by Andy]
Message-Id: <CA+55aFxvZhBu9U1cqpVm4frv0p5mqu=0TxsSqE-=95ft8HvCVA@mail.gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index c4150bec7982..804487f126cb 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -31,13 +31,6 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
-static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
-{
-	unsigned long top_of_stack =
-		(unsigned long)(regs + 1) + TOP_OF_KERNEL_STACK_PADDING;
-	return (struct thread_info *)(top_of_stack - THREAD_SIZE);
-}
-
 #ifdef CONFIG_CONTEXT_TRACKING
 /* Called on entry from user mode with IRQs off. */
 __visible void enter_from_user_mode(void)
@@ -78,7 +71,7 @@ static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
  */
 unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	unsigned long ret = 0;
 	u32 work;
 
@@ -156,7 +149,7 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
 				unsigned long phase1_result)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	long ret = 0;
 	u32 work = ACCESS_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;
 
@@ -239,7 +232,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
-		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
+		cached_flags = READ_ONCE(current_thread_info()->flags);
 
 		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
 			break;
@@ -250,7 +243,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 /* Called with IRQs disabled. */
 __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	u32 cached_flags;
 
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING) && WARN_ON(!irqs_disabled()))
@@ -309,7 +302,7 @@ static void syscall_slow_exit_work(struct pt_regs *regs, u32 cached_flags)
  */
 __visible inline void syscall_return_slowpath(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	u32 cached_flags = READ_ONCE(ti->flags);
 
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
@@ -332,7 +325,7 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
 #ifdef CONFIG_X86_64
 __visible void do_syscall_64(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	unsigned long nr = regs->orig_ax;
 
 	enter_from_user_mode();
@@ -365,7 +358,7 @@ __visible void do_syscall_64(struct pt_regs *regs)
  */
 static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	unsigned int nr = (unsigned int)regs->orig_ax;
 
 #ifdef CONFIG_IA32_EMULATION
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 25/29] um: Stop conflating task_struct::stack with thread_info
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (23 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 24/29] x86/entry: Get rid of pt_regs_to_thread_info() Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 23:40   ` Brian Gerst
  2016-06-26 21:55 ` [PATCH v4 26/29] sched: Allow putting thread_info into task_struct Andy Lutomirski
                   ` (7 subsequent siblings)
  32 siblings, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

From: Linus Torvalds <torvalds@linux-foundation.org>

thread_info may move in the future, so use the accessors.

[changelog written by Andy]
Message-Id: <CA+55aFxvZhBu9U1cqpVm4frv0p5mqu=0TxsSqE-=95ft8HvCVA@mail.gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/um/ptrace_32.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/um/ptrace_32.c b/arch/x86/um/ptrace_32.c
index ebd4dd6ef73b..14e8f6a628c2 100644
--- a/arch/x86/um/ptrace_32.c
+++ b/arch/x86/um/ptrace_32.c
@@ -191,7 +191,7 @@ int peek_user(struct task_struct *child, long addr, long data)
 
 static int get_fpregs(struct user_i387_struct __user *buf, struct task_struct *child)
 {
-	int err, n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int err, n, cpu = task_thread_info(child)->cpu;
 	struct user_i387_struct fpregs;
 
 	err = save_i387_registers(userspace_pid[cpu],
@@ -208,7 +208,7 @@ static int get_fpregs(struct user_i387_struct __user *buf, struct task_struct *c
 
 static int set_fpregs(struct user_i387_struct __user *buf, struct task_struct *child)
 {
-	int n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int n, cpu = task_thread_info(child)->cpu;
 	struct user_i387_struct fpregs;
 
 	n = copy_from_user(&fpregs, buf, sizeof(fpregs));
@@ -221,7 +221,7 @@ static int set_fpregs(struct user_i387_struct __user *buf, struct task_struct *c
 
 static int get_fpxregs(struct user_fxsr_struct __user *buf, struct task_struct *child)
 {
-	int err, n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int err, n, cpu = task_thread_info(child)->cpu;
 	struct user_fxsr_struct fpregs;
 
 	err = save_fpx_registers(userspace_pid[cpu], (unsigned long *) &fpregs);
@@ -237,7 +237,7 @@ static int get_fpxregs(struct user_fxsr_struct __user *buf, struct task_struct *
 
 static int set_fpxregs(struct user_fxsr_struct __user *buf, struct task_struct *child)
 {
-	int n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int n, cpu = task_thread_info(child)->cpu;
 	struct user_fxsr_struct fpregs;
 
 	n = copy_from_user(&fpregs, buf, sizeof(fpregs));
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 26/29] sched: Allow putting thread_info into task_struct
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (24 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 25/29] um: Stop conflating task_struct::stack with thread_info Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-07-11 10:08   ` [kernel-hardening] " Mark Rutland
  2016-06-26 21:55 ` [PATCH v4 27/29] x86: Move " Andy Lutomirski
                   ` (6 subsequent siblings)
  32 siblings, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK,
then thread_info is defined as a single 'u32 flags' and is the first
entry of task_struct.  thread_info::task is removed (it serves no
purpose if thread_info is embedded in task_struct), and
thread_info::cpu gets its own slot in task_struct.
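
(Sketch of the resulting layout, for illustration; the real hunks are
below.  Keeping thread_info as the first member is what makes the
current_thread_info() cast work.)

struct task_struct {
	struct thread_info thread_info;	/* must stay first */
	volatile long state;
	void *stack;
	/* ... */
};

#define current_thread_info() ((struct thread_info *)current)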

This is heavily based on a patch written by Linus.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/init_task.h   |  9 +++++++++
 include/linux/sched.h       | 36 ++++++++++++++++++++++++++++++++++--
 include/linux/thread_info.h | 15 +++++++++++++++
 init/Kconfig                |  3 +++
 init/init_task.c            |  7 +++++--
 kernel/sched/sched.h        |  4 ++++
 6 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f8834f820ec2..9c04d44eeb3c 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,8 @@
 #include <net/net_namespace.h>
 #include <linux/sched/rt.h>
 
+#include <asm/thread_info.h>
+
 #ifdef CONFIG_SMP
 # define INIT_PUSHABLE_TASKS(tsk)					\
 	.pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO),
@@ -183,12 +185,19 @@ extern struct task_group root_task_group;
 # define INIT_KASAN(tsk)
 #endif
 
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+# define INIT_TASK_TI(tsk) .thread_info = INIT_THREAD_INFO(tsk),
+#else
+# define INIT_TASK_TI(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
  */
 #define INIT_TASK(tsk)	\
 {									\
+	INIT_TASK_TI(tsk)						\
 	.state		= 0,						\
 	.stack		= init_stack,					\
 	.usage		= ATOMIC_INIT(2),				\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 569df670407a..4108b4880b86 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1456,6 +1456,13 @@ struct tlbflush_unmap_batch {
 };
 
 struct task_struct {
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	/*
+	 * For reasons of header soup (see current_thread_info()), this
+	 * must be the first element of task_struct.
+	 */
+	struct thread_info thread_info;
+#endif
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
 	atomic_t usage;
@@ -1465,6 +1472,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	unsigned int cpu;	/* current CPU */
+#endif
 	unsigned int wakee_flips;
 	unsigned long wakee_flip_decay_ts;
 	struct task_struct *last_wakee;
@@ -2557,7 +2567,9 @@ extern void set_curr_task(int cpu, struct task_struct *p);
 void yield(void);
 
 union thread_union {
+#ifndef CONFIG_THREAD_INFO_IN_TASK
 	struct thread_info thread_info;
+#endif
 	unsigned long stack[THREAD_SIZE/sizeof(long)];
 };
 
@@ -3045,10 +3057,26 @@ static inline void threadgroup_change_end(struct task_struct *tsk)
 	cgroup_threadgroup_change_end(tsk);
 }
 
-#ifndef __HAVE_THREAD_FUNCTIONS
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+
+static inline struct thread_info *task_thread_info(struct task_struct *task)
+{
+	return &task->thread_info;
+}
+static inline void *task_stack_page(const struct task_struct *task)
+{
+	return task->stack;
+}
+#define setup_thread_stack(new,old)	do { } while(0)
+static inline unsigned long *end_of_stack(const struct task_struct *task)
+{
+	return task->stack;
+}
+
+#elif !defined(__HAVE_THREAD_FUNCTIONS)
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
-#define task_stack_page(task)	((task)->stack)
+#define task_stack_page(task)	((void *)(task)->stack)
 
 static inline void setup_thread_stack(struct task_struct *p, struct task_struct *org)
 {
@@ -3348,7 +3376,11 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
 
 static inline unsigned int task_cpu(const struct task_struct *p)
 {
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	return p->cpu;
+#else
 	return task_thread_info(p)->cpu;
+#endif
 }
 
 static inline int task_node(const struct task_struct *p)
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 352b1542f5cc..b2b32d63bc8e 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -13,6 +13,21 @@
 struct timespec;
 struct compat_timespec;
 
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+struct thread_info {
+	u32			flags;		/* low level flags */
+};
+
+#define INIT_THREAD_INFO(tsk)			\
+{						\
+	.flags		= 0,			\
+}
+#endif
+
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+#define current_thread_info() ((struct thread_info *)current)
+#endif
+
 /*
  * System call restart block.
  */
diff --git a/init/Kconfig b/init/Kconfig
index f755a602d4a1..0c83af6d3753 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -26,6 +26,9 @@ config IRQ_WORK
 config BUILDTIME_EXTABLE_SORT
 	bool
 
+config THREAD_INFO_IN_TASK
+	bool
+
 menu "General setup"
 
 config BROKEN
diff --git a/init/init_task.c b/init/init_task.c
index ba0a7f362d9e..11f83be1fa79 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -22,5 +22,8 @@ EXPORT_SYMBOL(init_task);
  * Initial thread structure. Alignment of this is handled by a special
  * linker map entry.
  */
-union thread_union init_thread_union __init_task_data =
-	{ INIT_THREAD_INFO(init_task) };
+union thread_union init_thread_union __init_task_data = {
+#ifndef CONFIG_THREAD_INFO_IN_TASK
+	INIT_THREAD_INFO(init_task)
+#endif
+};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7cbeb92a1cb9..a1cabcea4c54 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -999,7 +999,11 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 	 * per-task data have been completed by this moment.
 	 */
 	smp_wmb();
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	p->cpu = cpu;
+#else
 	task_thread_info(p)->cpu = cpu;
+#endif
 	p->wake_cpu = cpu;
 #endif
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 27/29] x86: Move thread_info into task_struct
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (25 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 26/29] sched: Allow putting thread_info into task_struct Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 28/29] sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK Andy Lutomirski
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

Now that most of the thread_info users have been cleaned up,
this is straightforward.
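
(Rough before/after of how the entry code now reaches the flags word,
drawn from the hunks below: offset from the current task pointer
instead of from the stack pointer.)

	/* before */
	testl	$_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)

	/* after */
	movq	%gs:current_task, %r11
	testl	$_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)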

Most of this code was written by Linus.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/entry_64.S          |  9 +++++---
 arch/x86/include/asm/switch_to.h   |  6 ++---
 arch/x86/include/asm/thread_info.h | 46 --------------------------------------
 arch/x86/kernel/asm-offsets.c      |  4 +---
 arch/x86/kernel/irq_64.c           |  3 +--
 arch/x86/kernel/process.c          |  6 ++---
 7 files changed, 13 insertions(+), 62 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index afdcf96ef109..b3002c8efde2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -155,6 +155,7 @@ config X86
 	select SPARSE_IRQ
 	select SRCU
 	select SYSCTL_EXCEPTION_TRACE
+	select THREAD_INFO_IN_TASK
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_DEV_DMA_OPS			if X86_64
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index b846875aeea6..efeb1c9f64f4 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -179,7 +179,8 @@ GLOBAL(entry_SYSCALL_64_after_swapgs)
 	 * If we need to do entry work or if we guess we'll need to do
 	 * exit work, go straight to the slow path.
 	 */
-	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	movq	%gs:current_task, %r11
+	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
 	jnz	entry_SYSCALL64_slow_path
 
 entry_SYSCALL_64_fastpath:
@@ -217,7 +218,8 @@ entry_SYSCALL_64_fastpath:
 	 */
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl	$_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	movq	%gs:current_task, %r11
+	testl	$_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
 	jnz	1f
 
 	LOCKDEP_SYS_EXIT
@@ -368,9 +370,10 @@ END(ptregs_\func)
  * A newly forked process directly context switches into this address.
  *
  * rdi: prev task we switched from
+ * rsi: task we're switching to
  */
 ENTRY(ret_from_fork)
-	LOCK ; btr $TIF_FORK, TI_flags(%r8)
+	LOCK ; btr $TIF_FORK, TASK_TI_flags(%rsi)
 
 	call	schedule_tail			/* rdi: 'prev' task parameter */
 
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 14e4b20f0aaf..5194f4a680ab 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -136,18 +136,16 @@ do {									\
 	     "call __switch_to\n\t"					  \
 	     "movq "__percpu_arg([current_task])",%%rsi\n\t"		  \
 	     __switch_canary						  \
-	     "movq %P[thread_info](%%rsi),%%r8\n\t"			  \
 	     "movq %%rax,%%rdi\n\t" 					  \
-	     "testl  %[_tif_fork],%P[ti_flags](%%r8)\n\t"		  \
+	     "testl  %[_tif_fork],%P[ti_flags](%%rsi)\n\t"		  \
 	     "jnz   ret_from_fork\n\t"					  \
 	     RESTORE_CONTEXT						  \
 	     : "=a" (last)					  	  \
 	       __switch_canary_oparam					  \
 	     : [next] "S" (next), [prev] "D" (prev),			  \
 	       [threadrsp] "i" (offsetof(struct task_struct, thread.sp)), \
-	       [ti_flags] "i" (offsetof(struct thread_info, flags)),	  \
+	       [ti_flags] "i" (offsetof(struct task_struct, thread_info.flags)),	  \
 	       [_tif_fork] "i" (_TIF_FORK),			  	  \
-	       [thread_info] "i" (offsetof(struct task_struct, stack)),   \
 	       [current_task] "m" (current_task)			  \
 	       __switch_canary_iparam					  \
 	     : "memory", "cc" __EXTRA_CLOBBER)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 7b42c1e462ac..0afc37654ad1 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -52,20 +52,6 @@ struct task_struct;
 #include <asm/cpufeature.h>
 #include <linux/atomic.h>
 
-struct thread_info {
-	struct task_struct	*task;		/* main task structure */
-	__u32			flags;		/* low level flags */
-	__u32			cpu;		/* current CPU */
-};
-
-#define INIT_THREAD_INFO(tsk)			\
-{						\
-	.task		= &tsk,			\
-	.flags		= 0,			\
-	.cpu		= 0,			\
-}
-
-#define init_thread_info	(init_thread_union.thread_info)
 #define init_stack		(init_thread_union.stack)
 
 #else /* !__ASSEMBLY__ */
@@ -159,11 +145,6 @@ struct thread_info {
  */
 #ifndef __ASSEMBLY__
 
-static inline struct thread_info *current_thread_info(void)
-{
-	return (struct thread_info *)(current_top_of_stack() - THREAD_SIZE);
-}
-
 static inline unsigned long current_stack_pointer(void)
 {
 	unsigned long sp;
@@ -181,33 +162,6 @@ static inline unsigned long current_stack_pointer(void)
 # define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
 #endif
 
-/*
- * ASM operand which evaluates to a 'thread_info' address of
- * the current task, if it is known that "reg" is exactly "off"
- * bytes below the top of the stack currently.
- *
- * ( The kernel stack's size is known at build time, it is usually
- *   2 or 4 pages, and the bottom  of the kernel stack contains
- *   the thread_info structure. So to access the thread_info very
- *   quickly from assembly code we can calculate down from the
- *   top of the kernel stack to the bottom, using constant,
- *   build-time calculations only. )
- *
- * For example, to fetch the current thread_info->flags value into %eax
- * on x86-64 defconfig kernels, in syscall entry code where RSP is
- * currently at exactly SIZEOF_PTREGS bytes away from the top of the
- * stack:
- *
- *      mov ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS), %eax
- *
- * will translate to:
- *
- *      8b 84 24 b8 c0 ff ff      mov    -0x3f48(%rsp), %eax
- *
- * which is below the current RSP by almost 16K.
- */
-#define ASM_THREAD_INFO(field, reg, off) ((field)+(off)-THREAD_SIZE)(reg)
-
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index a91a6ead24a2..e900f5e13f22 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -29,9 +29,7 @@
 
 void common(void) {
 	BLANK();
-	OFFSET(TI_flags, thread_info, flags);
-
-	BLANK();
+	OFFSET(TASK_TI_flags, task_struct, thread_info.flags);
 	OFFSET(TASK_addr_limit, task_struct, thread.addr_limit);
 
 	BLANK();
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 206d0b90a3ab..38f9f5678dc8 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -41,8 +41,7 @@ static inline void stack_overflow_check(struct pt_regs *regs)
 	if (user_mode(regs))
 		return;
 
-	if (regs->sp >= curbase + sizeof(struct thread_info) +
-				  sizeof(struct pt_regs) + STACK_TOP_MARGIN &&
+	if (regs->sp >= curbase + sizeof(struct pt_regs) + STACK_TOP_MARGIN &&
 	    regs->sp <= curbase + THREAD_SIZE)
 		return;
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 96becbbb52e0..8f60f810a9e7 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -536,9 +536,7 @@ unsigned long get_wchan(struct task_struct *p)
 	 * PADDING
 	 * ----------- top = topmax - TOP_OF_KERNEL_STACK_PADDING
 	 * stack
-	 * ----------- bottom = start + sizeof(thread_info)
-	 * thread_info
-	 * ----------- start
+	 * ----------- bottom = start
 	 *
 	 * The tasks stack pointer points at the location where the
 	 * framepointer is stored. The data on the stack is:
@@ -549,7 +547,7 @@ unsigned long get_wchan(struct task_struct *p)
 	 */
 	top = start + THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
 	top -= 2 * sizeof(unsigned long);
-	bottom = start + sizeof(struct thread_info);
+	bottom = start;
 
 	sp = READ_ONCE(p->thread.sp);
 	if (sp < bottom || sp > top)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 28/29] sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (26 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 27/29] x86: Move " Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-27  2:35   ` Andy Lutomirski
  2016-06-26 21:55 ` [PATCH v4 29/29] fork: Cache two thread stacks per cpu if CONFIG_VMAP_STACK is set Andy Lutomirski
                   ` (4 subsequent siblings)
  32 siblings, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski, Oleg Nesterov,
	Peter Zijlstra

We currently keep every task's stack around until the task_struct
itself is freed.  This means that we keep the stack allocation alive
for longer than necessary and that, under load, we free stacks in
big batches whenever RCU drops the last task reference.  Neither of
these is good for reuse of cache-hot memory, and freeing in batches
prevents us from usefully caching small numbers of vmalloced stacks.

On architectures that have thread_info on the stack, we can't easily
change this, but on architectures that set THREAD_INFO_IN_TASK, we
can free it as soon as the task is dead.
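
(Rough sketch of the resulting lifetime, not code from the patch:)

/*
 * finish_task_switch(prev)              prev is dead at this point
 *   -> release_task_stack(prev)         stack freed (or cached) right away
 * ... last reference dropped via RCU ...
 * free_task(prev)
 *   -> WARN_ON_ONCE(tsk->stack)         stack must already be gone
 */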

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/sched.h |  1 +
 kernel/fork.c         | 23 ++++++++++++++++++++++-
 kernel/sched/core.c   |  9 +++++++++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4108b4880b86..0b9486826d62 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2659,6 +2659,7 @@ static inline void kernel_signal_stop(void)
 }
 
 extern void release_task(struct task_struct * p);
+extern void release_task_stack(struct task_struct *tsk);
 extern int send_sig_info(int, struct siginfo *, struct task_struct *);
 extern int force_sigsegv(int, struct task_struct *);
 extern int force_sig_info(int, struct siginfo *, struct task_struct *);
diff --git a/kernel/fork.c b/kernel/fork.c
index 06761de69360..8dd1329e1bf8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -269,11 +269,32 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
 	}
 }
 
-void free_task(struct task_struct *tsk)
+void release_task_stack(struct task_struct *tsk)
 {
 	account_kernel_stack(tsk, -1);
 	arch_release_thread_stack(tsk->stack);
 	free_thread_stack(tsk);
+	tsk->stack = NULL;
+#ifdef CONFIG_VMAP_STACK
+	tsk->stack_vm_area = NULL;
+#endif
+}
+
+void free_task(struct task_struct *tsk)
+{
+#ifndef CONFIG_THREAD_INFO_IN_TASK
+	/*
+	 * The task is finally done with both the stack and thread_info,
+	 * so free both.
+	 */
+	release_task_stack(tsk);
+#else
+	/*
+	 * If the task had a separate stack allocation, it should be gone
+	 * by now.
+	 */
+	WARN_ON_ONCE(tsk->stack);
+#endif
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
 	put_seccomp_filter(tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51d7105f529a..00c9ba5cf605 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2742,6 +2742,15 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		 * task and put them back on the free list.
 		 */
 		kprobe_flush_task(prev);
+
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+		/*
+		 * If thread_info is in task_struct, then the dead task no
+		 * longer needs its stack.  Free it right away.
+		 */
+		release_task_stack(prev);
+#endif
+
 		put_task_struct(prev);
 	}
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 29/29] fork: Cache two thread stacks per cpu if CONFIG_VMAP_STACK is set
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (27 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 28/29] sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK Andy Lutomirski
@ 2016-06-26 21:55 ` Andy Lutomirski
  2016-06-28  7:32 ` [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad David Howells
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 21:55 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

vmalloc is a bit slow, and pounding vmalloc/vfree will eventually
force a global TLB flush.

To reduce pressure on them, if CONFIG_VMAP_STACK, cache two thread
stacks per cpu.  This will let us quickly allocate a hopefully
cache-hot, TLB-hot stack under heavy forking workloads (shell script
style).

On my silly pthread_create benchmark, it saves about 2 µs per
pthread_create+join with CONFIG_VMAP_STACK=y.
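
(The benchmark itself isn't part of this posting; a minimal userspace
loop along these lines, built with "gcc -O2 -pthread", measures the
same quantity.)

#include <pthread.h>
#include <stdio.h>
#include <time.h>

static void *idle_thread(void *arg)
{
	return arg;
}

int main(void)
{
	enum { ITERS = 100000 };
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (int i = 0; i < ITERS; i++) {
		pthread_t t;

		if (pthread_create(&t, NULL, idle_thread, NULL))
			return 1;
		pthread_join(t, NULL);
	}
	clock_gettime(CLOCK_MONOTONIC, &b);

	printf("%.3f us per create+join\n",
	       ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) /
	       (1e3 * ITERS));
	return 0;
}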

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/fork.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 50 insertions(+), 4 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 8dd1329e1bf8..4b8ea904e47b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -159,10 +159,37 @@ void __weak arch_release_thread_stack(unsigned long *stack)
  * kmemcache based allocator.
  */
 # if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
+
+#ifdef CONFIG_VMAP_STACK
+/*
+ * vmalloc is a bit slow, and calling vfree enough times will force a TLB
+ * flush.  Try to minimize the number of calls by caching stacks.
+ */
+#define NR_CACHED_STACKS 2
+static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
+#endif
+
 static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 #ifdef CONFIG_VMAP_STACK
-	void *stack = __vmalloc_node_range(
+	void *stack;
+	int i;
+
+	local_irq_disable();
+	for (i = 0; i < NR_CACHED_STACKS; i++) {
+		struct vm_struct *s = this_cpu_read(cached_stacks[i]);
+
+		if (!s)
+			continue;
+		this_cpu_write(cached_stacks[i], NULL);
+
+		tsk->stack_vm_area = s;
+		local_irq_enable();
+		return s->addr;
+	}
+	local_irq_enable();
+
+	stack = __vmalloc_node_range(
 		THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
 		THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
 		0, node, __builtin_return_address(0));
@@ -185,10 +212,29 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 
 static inline void free_thread_stack(struct task_struct *tsk)
 {
-	if (task_stack_vm_area(tsk))
+#ifdef CONFIG_VMAP_STACK
+	if (task_stack_vm_area(tsk)) {
+		unsigned long flags;
+		int i;
+
+		local_irq_save(flags);
+		for (i = 0; i < NR_CACHED_STACKS; i++) {
+			if (this_cpu_read(cached_stacks[i]))
+				continue;
+
+			this_cpu_write(cached_stacks[i], tsk->stack_vm_area);
+			goto done;
+		}
+
 		vfree(tsk->stack);
-	else
-		free_kmem_pages((unsigned long)tsk->stack, THREAD_SIZE_ORDER);
+
+done:
+		local_irq_restore(flags);
+		return;
+	}
+#endif
+
+	free_kmem_pages((unsigned long)tsk->stack, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_stack_cache;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 25/29] um: Stop conflating task_struct::stack with thread_info
  2016-06-26 21:55 ` [PATCH v4 25/29] um: Stop conflating task_struct::stack with thread_info Andy Lutomirski
@ 2016-06-26 23:40   ` Brian Gerst
  2016-06-26 23:49     ` Andy Lutomirski
  0 siblings, 1 reply; 84+ messages in thread
From: Brian Gerst @ 2016-06-26 23:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
>
> thread_info may move in the future, so use the accessors.
>
> [changelog written by Andy]
> Message-Id: <CA+55aFxvZhBu9U1cqpVm4frv0p5mqu=0TxsSqE-=95ft8HvCVA@mail.gmail.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/um/ptrace_32.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/um/ptrace_32.c b/arch/x86/um/ptrace_32.c
> index ebd4dd6ef73b..14e8f6a628c2 100644
> --- a/arch/x86/um/ptrace_32.c
> +++ b/arch/x86/um/ptrace_32.c
> @@ -191,7 +191,7 @@ int peek_user(struct task_struct *child, long addr, long data)
>
>  static int get_fpregs(struct user_i387_struct __user *buf, struct task_struct *child)
>  {
> -       int err, n, cpu = ((struct thread_info *) child->stack)->cpu;
> +       int err, n, cpu = task_thread_info(child)->cpu;

Shouldn't this use task_cpu() like in patch 23?

--
Brian Gerst

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 25/29] um: Stop conflating task_struct::stack with thread_info
  2016-06-26 23:40   ` Brian Gerst
@ 2016-06-26 23:49     ` Andy Lutomirski
  0 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-26 23:49 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Sun, Jun 26, 2016 at 4:40 PM, Brian Gerst <brgerst@gmail.com> wrote:
> On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> From: Linus Torvalds <torvalds@linux-foundation.org>
>>
>> thread_info may move in the future, so use the accessors.
>>
>> [changelog written by Andy]
>> Message-Id: <CA+55aFxvZhBu9U1cqpVm4frv0p5mqu=0TxsSqE-=95ft8HvCVA@mail.gmail.com>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/um/ptrace_32.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/um/ptrace_32.c b/arch/x86/um/ptrace_32.c
>> index ebd4dd6ef73b..14e8f6a628c2 100644
>> --- a/arch/x86/um/ptrace_32.c
>> +++ b/arch/x86/um/ptrace_32.c
>> @@ -191,7 +191,7 @@ int peek_user(struct task_struct *child, long addr, long data)
>>
>>  static int get_fpregs(struct user_i387_struct __user *buf, struct task_struct *child)
>>  {
>> -       int err, n, cpu = ((struct thread_info *) child->stack)->cpu;
>> +       int err, n, cpu = task_thread_info(child)->cpu;
>
> Shouldn't this use task_cpu() like in patch 23?
>

Indeed.  But blame Linus -- he wrote it :)

If I send a v5, I'll fix this in this patch.  Otherwise I'll send a
followup patch when this lands.  It doesn't matter immediately because
I'm not opting um into THREAD_INFO_IN_TASK.

--Andy

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 22/29] x86/asm: Move 'status' from struct thread_info to struct thread_struct
  2016-06-26 21:55 ` [PATCH v4 22/29] x86/asm: Move 'status' from struct thread_info to struct thread_struct Andy Lutomirski
@ 2016-06-26 23:55   ` Brian Gerst
  2016-06-27  0:23     ` Andy Lutomirski
  0 siblings, 1 reply; 84+ messages in thread
From: Brian Gerst @ 2016-06-26 23:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
> Because sched.h and thread_info.h are a tangled mess, I turned
> in_compat_syscall into a macro.  If we had current_thread_struct()
> or similar and we could use it from thread_info.h, then this would
> be a bit cleaner.
>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/common.c            |  4 ++--
>  arch/x86/include/asm/processor.h   | 12 ++++++++++++
>  arch/x86/include/asm/syscall.h     | 23 +++++------------------
>  arch/x86/include/asm/thread_info.h | 23 ++++-------------------
>  arch/x86/kernel/asm-offsets.c      |  1 -
>  arch/x86/kernel/fpu/init.c         |  1 -
>  arch/x86/kernel/process_64.c       |  4 ++--
>  arch/x86/kernel/ptrace.c           |  2 +-
>  8 files changed, 26 insertions(+), 44 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index ec138e538c44..c4150bec7982 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -271,7 +271,7 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
>          * syscalls.  The fixup is exercised by the ptrace_syscall_32
>          * selftest.
>          */
> -       ti->status &= ~TS_COMPAT;
> +       current->thread.status &= ~TS_COMPAT;
>  #endif
>
>         user_enter();
> @@ -369,7 +369,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
>         unsigned int nr = (unsigned int)regs->orig_ax;
>
>  #ifdef CONFIG_IA32_EMULATION
> -       ti->status |= TS_COMPAT;
> +       current->thread.status |= TS_COMPAT;
>  #endif
>
>         if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index a2e20d6d01fe..a75e720f6402 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -388,6 +388,9 @@ struct thread_struct {
>         unsigned short          fsindex;
>         unsigned short          gsindex;
>  #endif
> +
> +       u32                     status;         /* thread synchronous flags */
> +
>  #ifdef CONFIG_X86_32
>         unsigned long           ip;
>  #endif
> @@ -437,6 +440,15 @@ struct thread_struct {
>  };
>
>  /*
> + * Thread-synchronous status.
> + *
> + * This is different from the flags in that nobody else
> + * ever touches our thread-synchronous status, so we don't
> + * have to worry about atomic accesses.
> + */
> +#define TS_COMPAT              0x0002  /* 32bit syscall active (64BIT)*/
> +
> +/*
>   * Set IOPL bits in EFLAGS from given mask
>   */
>  static inline void native_set_iopl_mask(unsigned mask)
> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
> index 999b7cd2e78c..17229e7e2a1c 100644
> --- a/arch/x86/include/asm/syscall.h
> +++ b/arch/x86/include/asm/syscall.h
> @@ -60,7 +60,7 @@ static inline long syscall_get_error(struct task_struct *task,
>          * TS_COMPAT is set for 32-bit syscall entries and then
>          * remains set until we return to user mode.
>          */
> -       if (task_thread_info(task)->status & TS_COMPAT)
> +       if (task->thread.status & TS_COMPAT)
>                 /*
>                  * Sign-extend the value so (int)-EFOO becomes (long)-EFOO
>                  * and will match correctly in comparisons.
> @@ -116,7 +116,7 @@ static inline void syscall_get_arguments(struct task_struct *task,
>                                          unsigned long *args)
>  {
>  # ifdef CONFIG_IA32_EMULATION
> -       if (task_thread_info(task)->status & TS_COMPAT)
> +       if (task->thread.status & TS_COMPAT)
>                 switch (i) {
>                 case 0:
>                         if (!n--) break;
> @@ -177,7 +177,7 @@ static inline void syscall_set_arguments(struct task_struct *task,
>                                          const unsigned long *args)
>  {
>  # ifdef CONFIG_IA32_EMULATION
> -       if (task_thread_info(task)->status & TS_COMPAT)
> +       if (task->thread.status & TS_COMPAT)
>                 switch (i) {
>                 case 0:
>                         if (!n--) break;
> @@ -234,21 +234,8 @@ static inline void syscall_set_arguments(struct task_struct *task,
>
>  static inline int syscall_get_arch(void)
>  {
> -#ifdef CONFIG_IA32_EMULATION
> -       /*
> -        * TS_COMPAT is set for 32-bit syscall entry and then
> -        * remains set until we return to user mode.
> -        *
> -        * TIF_IA32 tasks should always have TS_COMPAT set at
> -        * system call time.
> -        *
> -        * x32 tasks should be considered AUDIT_ARCH_X86_64.
> -        */
> -       if (task_thread_info(current)->status & TS_COMPAT)
> -               return AUDIT_ARCH_I386;
> -#endif
> -       /* Both x32 and x86_64 are considered "64-bit". */
> -       return AUDIT_ARCH_X86_64;
> +       /* x32 tasks should be considered AUDIT_ARCH_X86_64. */
> +       return in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
>  }
>  #endif /* CONFIG_X86_32 */
>
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index b45ffdda3549..7b42c1e462ac 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -55,7 +55,6 @@ struct task_struct;
>  struct thread_info {
>         struct task_struct      *task;          /* main task structure */
>         __u32                   flags;          /* low level flags */
> -       __u32                   status;         /* thread synchronous flags */
>         __u32                   cpu;            /* current CPU */
>  };
>
> @@ -211,28 +210,14 @@ static inline unsigned long current_stack_pointer(void)
>
>  #endif
>
> -/*
> - * Thread-synchronous status.
> - *
> - * This is different from the flags in that nobody else
> - * ever touches our thread-synchronous status, so we don't
> - * have to worry about atomic accesses.
> - */
> -#define TS_COMPAT              0x0002  /* 32bit syscall active (64BIT)*/
> -
>  #ifndef __ASSEMBLY__
>
> -static inline bool in_ia32_syscall(void)
> -{
>  #ifdef CONFIG_X86_32
> -       return true;
> -#endif
> -#ifdef CONFIG_IA32_EMULATION
> -       if (current_thread_info()->status & TS_COMPAT)
> -               return true;
> +#define in_ia32_syscall() true
> +#else
> +#define in_ia32_syscall() (IS_ENABLED(CONFIG_IA32_EMULATION) && \
> +                          current->thread.status & TS_COMPAT)
>  #endif
> -       return false;
> -}
>
>  /*
>   * Force syscall return via IRET by making it look as if there was
> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 2bd5c6ff7ee7..a91a6ead24a2 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -30,7 +30,6 @@
>  void common(void) {
>         BLANK();
>         OFFSET(TI_flags, thread_info, flags);
> -       OFFSET(TI_status, thread_info, status);

TI_status can be deleted.  Its last users were removed in commit ee08c6bd.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 22/29] x86/asm: Move 'status' from struct thread_info to struct thread_struct
  2016-06-26 23:55   ` Brian Gerst
@ 2016-06-27  0:23     ` Andy Lutomirski
  2016-06-27  0:36       ` Brian Gerst
  0 siblings, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-27  0:23 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Sun, Jun 26, 2016 at 4:55 PM, Brian Gerst <brgerst@gmail.com> wrote:
> On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> Because sched.h and thread_info.h are a tangled mess, I turned
>> in_compat_syscall into a macro.  If we had current_thread_struct()
>> or similar and we could use it from thread_info.h, then this would
>> be a bit cleaner.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/entry/common.c            |  4 ++--
>>  arch/x86/include/asm/processor.h   | 12 ++++++++++++
>>  arch/x86/include/asm/syscall.h     | 23 +++++------------------
>>  arch/x86/include/asm/thread_info.h | 23 ++++-------------------
>>  arch/x86/kernel/asm-offsets.c      |  1 -
>>  arch/x86/kernel/fpu/init.c         |  1 -
>>  arch/x86/kernel/process_64.c       |  4 ++--
>>  arch/x86/kernel/ptrace.c           |  2 +-
>>  8 files changed, 26 insertions(+), 44 deletions(-)
>>
>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
>> index ec138e538c44..c4150bec7982 100644
>> --- a/arch/x86/entry/common.c
>> +++ b/arch/x86/entry/common.c
>> @@ -271,7 +271,7 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
>>          * syscalls.  The fixup is exercised by the ptrace_syscall_32
>>          * selftest.
>>          */
>> -       ti->status &= ~TS_COMPAT;
>> +       current->thread.status &= ~TS_COMPAT;
>>  #endif
>>
>>         user_enter();
>> @@ -369,7 +369,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
>>         unsigned int nr = (unsigned int)regs->orig_ax;
>>
>>  #ifdef CONFIG_IA32_EMULATION
>> -       ti->status |= TS_COMPAT;
>> +       current->thread.status |= TS_COMPAT;
>>  #endif
>>
>>         if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
>> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>> index a2e20d6d01fe..a75e720f6402 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -388,6 +388,9 @@ struct thread_struct {
>>         unsigned short          fsindex;
>>         unsigned short          gsindex;
>>  #endif
>> +
>> +       u32                     status;         /* thread synchronous flags */
>> +
>>  #ifdef CONFIG_X86_32
>>         unsigned long           ip;
>>  #endif
>> @@ -437,6 +440,15 @@ struct thread_struct {
>>  };
>>
>>  /*
>> + * Thread-synchronous status.
>> + *
>> + * This is different from the flags in that nobody else
>> + * ever touches our thread-synchronous status, so we don't
>> + * have to worry about atomic accesses.
>> + */
>> +#define TS_COMPAT              0x0002  /* 32bit syscall active (64BIT)*/
>> +
>> +/*
>>   * Set IOPL bits in EFLAGS from given mask
>>   */
>>  static inline void native_set_iopl_mask(unsigned mask)
>> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
>> index 999b7cd2e78c..17229e7e2a1c 100644
>> --- a/arch/x86/include/asm/syscall.h
>> +++ b/arch/x86/include/asm/syscall.h
>> @@ -60,7 +60,7 @@ static inline long syscall_get_error(struct task_struct *task,
>>          * TS_COMPAT is set for 32-bit syscall entries and then
>>          * remains set until we return to user mode.
>>          */
>> -       if (task_thread_info(task)->status & TS_COMPAT)
>> +       if (task->thread.status & TS_COMPAT)
>>                 /*
>>                  * Sign-extend the value so (int)-EFOO becomes (long)-EFOO
>>                  * and will match correctly in comparisons.
>> @@ -116,7 +116,7 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>                                          unsigned long *args)
>>  {
>>  # ifdef CONFIG_IA32_EMULATION
>> -       if (task_thread_info(task)->status & TS_COMPAT)
>> +       if (task->thread.status & TS_COMPAT)
>>                 switch (i) {
>>                 case 0:
>>                         if (!n--) break;
>> @@ -177,7 +177,7 @@ static inline void syscall_set_arguments(struct task_struct *task,
>>                                          const unsigned long *args)
>>  {
>>  # ifdef CONFIG_IA32_EMULATION
>> -       if (task_thread_info(task)->status & TS_COMPAT)
>> +       if (task->thread.status & TS_COMPAT)
>>                 switch (i) {
>>                 case 0:
>>                         if (!n--) break;
>> @@ -234,21 +234,8 @@ static inline void syscall_set_arguments(struct task_struct *task,
>>
>>  static inline int syscall_get_arch(void)
>>  {
>> -#ifdef CONFIG_IA32_EMULATION
>> -       /*
>> -        * TS_COMPAT is set for 32-bit syscall entry and then
>> -        * remains set until we return to user mode.
>> -        *
>> -        * TIF_IA32 tasks should always have TS_COMPAT set at
>> -        * system call time.
>> -        *
>> -        * x32 tasks should be considered AUDIT_ARCH_X86_64.
>> -        */
>> -       if (task_thread_info(current)->status & TS_COMPAT)
>> -               return AUDIT_ARCH_I386;
>> -#endif
>> -       /* Both x32 and x86_64 are considered "64-bit". */
>> -       return AUDIT_ARCH_X86_64;
>> +       /* x32 tasks should be considered AUDIT_ARCH_X86_64. */
>> +       return in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
>>  }
>>  #endif /* CONFIG_X86_32 */
>>
>> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
>> index b45ffdda3549..7b42c1e462ac 100644
>> --- a/arch/x86/include/asm/thread_info.h
>> +++ b/arch/x86/include/asm/thread_info.h
>> @@ -55,7 +55,6 @@ struct task_struct;
>>  struct thread_info {
>>         struct task_struct      *task;          /* main task structure */
>>         __u32                   flags;          /* low level flags */
>> -       __u32                   status;         /* thread synchronous flags */
>>         __u32                   cpu;            /* current CPU */
>>  };
>>
>> @@ -211,28 +210,14 @@ static inline unsigned long current_stack_pointer(void)
>>
>>  #endif
>>
>> -/*
>> - * Thread-synchronous status.
>> - *
>> - * This is different from the flags in that nobody else
>> - * ever touches our thread-synchronous status, so we don't
>> - * have to worry about atomic accesses.
>> - */
>> -#define TS_COMPAT              0x0002  /* 32bit syscall active (64BIT)*/
>> -
>>  #ifndef __ASSEMBLY__
>>
>> -static inline bool in_ia32_syscall(void)
>> -{
>>  #ifdef CONFIG_X86_32
>> -       return true;
>> -#endif
>> -#ifdef CONFIG_IA32_EMULATION
>> -       if (current_thread_info()->status & TS_COMPAT)
>> -               return true;
>> +#define in_ia32_syscall() true
>> +#else
>> +#define in_ia32_syscall() (IS_ENABLED(CONFIG_IA32_EMULATION) && \
>> +                          current->thread.status & TS_COMPAT)
>>  #endif
>> -       return false;
>> -}
>>
>>  /*
>>   * Force syscall return via IRET by making it look as if there was
>> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
>> index 2bd5c6ff7ee7..a91a6ead24a2 100644
>> --- a/arch/x86/kernel/asm-offsets.c
>> +++ b/arch/x86/kernel/asm-offsets.c
>> @@ -30,7 +30,6 @@
>>  void common(void) {
>>         BLANK();
>>         OFFSET(TI_flags, thread_info, flags);
>> -       OFFSET(TI_status, thread_info, status);
>
> TI_status can be deleted.  Its last users were removed in commit ee08c6bd.

Indeed.

Just to double-check: are you saying that this patch is okay?

--Andy

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 22/29] x86/asm: Move 'status' from struct thread_info to struct thread_struct
  2016-06-27  0:23     ` Andy Lutomirski
@ 2016-06-27  0:36       ` Brian Gerst
  2016-06-27  0:40         ` Andy Lutomirski
  0 siblings, 1 reply; 84+ messages in thread
From: Brian Gerst @ 2016-06-27  0:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Sun, Jun 26, 2016 at 8:23 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Sun, Jun 26, 2016 at 4:55 PM, Brian Gerst <brgerst@gmail.com> wrote:
>> On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> Because sched.h and thread_info.h are a tangled mess, I turned
>>> in_compat_syscall into a macro.  If we had current_thread_struct()
>>> or similar and we could use it from thread_info.h, then this would
>>> be a bit cleaner.
>>>
>>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>>> ---
>>>  arch/x86/entry/common.c            |  4 ++--
>>>  arch/x86/include/asm/processor.h   | 12 ++++++++++++
>>>  arch/x86/include/asm/syscall.h     | 23 +++++------------------
>>>  arch/x86/include/asm/thread_info.h | 23 ++++-------------------
>>>  arch/x86/kernel/asm-offsets.c      |  1 -
>>>  arch/x86/kernel/fpu/init.c         |  1 -
>>>  arch/x86/kernel/process_64.c       |  4 ++--
>>>  arch/x86/kernel/ptrace.c           |  2 +-
>>>  8 files changed, 26 insertions(+), 44 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
>>> index ec138e538c44..c4150bec7982 100644
>>> --- a/arch/x86/entry/common.c
>>> +++ b/arch/x86/entry/common.c
>>> @@ -271,7 +271,7 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
>>>          * syscalls.  The fixup is exercised by the ptrace_syscall_32
>>>          * selftest.
>>>          */
>>> -       ti->status &= ~TS_COMPAT;
>>> +       current->thread.status &= ~TS_COMPAT;
>>>  #endif
>>>
>>>         user_enter();
>>> @@ -369,7 +369,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
>>>         unsigned int nr = (unsigned int)regs->orig_ax;
>>>
>>>  #ifdef CONFIG_IA32_EMULATION
>>> -       ti->status |= TS_COMPAT;
>>> +       current->thread.status |= TS_COMPAT;
>>>  #endif
>>>
>>>         if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
>>> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>>> index a2e20d6d01fe..a75e720f6402 100644
>>> --- a/arch/x86/include/asm/processor.h
>>> +++ b/arch/x86/include/asm/processor.h
>>> @@ -388,6 +388,9 @@ struct thread_struct {
>>>         unsigned short          fsindex;
>>>         unsigned short          gsindex;
>>>  #endif
>>> +
>>> +       u32                     status;         /* thread synchronous flags */
>>> +
>>>  #ifdef CONFIG_X86_32
>>>         unsigned long           ip;
>>>  #endif
>>> @@ -437,6 +440,15 @@ struct thread_struct {
>>>  };
>>>
>>>  /*
>>> + * Thread-synchronous status.
>>> + *
>>> + * This is different from the flags in that nobody else
>>> + * ever touches our thread-synchronous status, so we don't
>>> + * have to worry about atomic accesses.
>>> + */
>>> +#define TS_COMPAT              0x0002  /* 32bit syscall active (64BIT)*/
>>> +
>>> +/*
>>>   * Set IOPL bits in EFLAGS from given mask
>>>   */
>>>  static inline void native_set_iopl_mask(unsigned mask)
>>> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
>>> index 999b7cd2e78c..17229e7e2a1c 100644
>>> --- a/arch/x86/include/asm/syscall.h
>>> +++ b/arch/x86/include/asm/syscall.h
>>> @@ -60,7 +60,7 @@ static inline long syscall_get_error(struct task_struct *task,
>>>          * TS_COMPAT is set for 32-bit syscall entries and then
>>>          * remains set until we return to user mode.
>>>          */
>>> -       if (task_thread_info(task)->status & TS_COMPAT)
>>> +       if (task->thread.status & TS_COMPAT)
>>>                 /*
>>>                  * Sign-extend the value so (int)-EFOO becomes (long)-EFOO
>>>                  * and will match correctly in comparisons.
>>> @@ -116,7 +116,7 @@ static inline void syscall_get_arguments(struct task_struct *task,
>>>                                          unsigned long *args)
>>>  {
>>>  # ifdef CONFIG_IA32_EMULATION
>>> -       if (task_thread_info(task)->status & TS_COMPAT)
>>> +       if (task->thread.status & TS_COMPAT)
>>>                 switch (i) {
>>>                 case 0:
>>>                         if (!n--) break;
>>> @@ -177,7 +177,7 @@ static inline void syscall_set_arguments(struct task_struct *task,
>>>                                          const unsigned long *args)
>>>  {
>>>  # ifdef CONFIG_IA32_EMULATION
>>> -       if (task_thread_info(task)->status & TS_COMPAT)
>>> +       if (task->thread.status & TS_COMPAT)
>>>                 switch (i) {
>>>                 case 0:
>>>                         if (!n--) break;
>>> @@ -234,21 +234,8 @@ static inline void syscall_set_arguments(struct task_struct *task,
>>>
>>>  static inline int syscall_get_arch(void)
>>>  {
>>> -#ifdef CONFIG_IA32_EMULATION
>>> -       /*
>>> -        * TS_COMPAT is set for 32-bit syscall entry and then
>>> -        * remains set until we return to user mode.
>>> -        *
>>> -        * TIF_IA32 tasks should always have TS_COMPAT set at
>>> -        * system call time.
>>> -        *
>>> -        * x32 tasks should be considered AUDIT_ARCH_X86_64.
>>> -        */
>>> -       if (task_thread_info(current)->status & TS_COMPAT)
>>> -               return AUDIT_ARCH_I386;
>>> -#endif
>>> -       /* Both x32 and x86_64 are considered "64-bit". */
>>> -       return AUDIT_ARCH_X86_64;
>>> +       /* x32 tasks should be considered AUDIT_ARCH_X86_64. */
>>> +       return in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
>>>  }
>>>  #endif /* CONFIG_X86_32 */
>>>
>>> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
>>> index b45ffdda3549..7b42c1e462ac 100644
>>> --- a/arch/x86/include/asm/thread_info.h
>>> +++ b/arch/x86/include/asm/thread_info.h
>>> @@ -55,7 +55,6 @@ struct task_struct;
>>>  struct thread_info {
>>>         struct task_struct      *task;          /* main task structure */
>>>         __u32                   flags;          /* low level flags */
>>> -       __u32                   status;         /* thread synchronous flags */
>>>         __u32                   cpu;            /* current CPU */
>>>  };
>>>
>>> @@ -211,28 +210,14 @@ static inline unsigned long current_stack_pointer(void)
>>>
>>>  #endif
>>>
>>> -/*
>>> - * Thread-synchronous status.
>>> - *
>>> - * This is different from the flags in that nobody else
>>> - * ever touches our thread-synchronous status, so we don't
>>> - * have to worry about atomic accesses.
>>> - */
>>> -#define TS_COMPAT              0x0002  /* 32bit syscall active (64BIT)*/
>>> -
>>>  #ifndef __ASSEMBLY__
>>>
>>> -static inline bool in_ia32_syscall(void)
>>> -{
>>>  #ifdef CONFIG_X86_32
>>> -       return true;
>>> -#endif
>>> -#ifdef CONFIG_IA32_EMULATION
>>> -       if (current_thread_info()->status & TS_COMPAT)
>>> -               return true;
>>> +#define in_ia32_syscall() true
>>> +#else
>>> +#define in_ia32_syscall() (IS_ENABLED(CONFIG_IA32_EMULATION) && \
>>> +                          current->thread.status & TS_COMPAT)
>>>  #endif
>>> -       return false;
>>> -}
>>>
>>>  /*
>>>   * Force syscall return via IRET by making it look as if there was
>>> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
>>> index 2bd5c6ff7ee7..a91a6ead24a2 100644
>>> --- a/arch/x86/kernel/asm-offsets.c
>>> +++ b/arch/x86/kernel/asm-offsets.c
>>> @@ -30,7 +30,6 @@
>>>  void common(void) {
>>>         BLANK();
>>>         OFFSET(TI_flags, thread_info, flags);
>>> -       OFFSET(TI_status, thread_info, status);
>>
>> TI_status can be deleted.  Its last users were removed in commit ee08c6bd.
>
> Indeed.
>
> Just to double-check: are you saying that this patch is okay?

It looks OK to me, but I haven't tested it.  Another suggestion is to
change the compat flag to a bitfield, since there is only one TS_*
flag now and it's not referenced from asm.
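
A rough sketch of that variant, assuming TS_COMPAT is the only remaining
status bit and it is only ever touched from C (the field name here is made
up for illustration):

	/* in struct thread_struct, instead of the u32 status field: */
	unsigned int		in_compat_syscall:1;

	/* do_syscall_32_irqs_on(), under CONFIG_IA32_EMULATION: */
	current->thread.in_compat_syscall = 1;

	/* prepare_exit_to_usermode(): */
	current->thread.in_compat_syscall = 0;

	/* 64-bit in_ia32_syscall(): */
	#define in_ia32_syscall() (IS_ENABLED(CONFIG_IA32_EMULATION) && \
				   current->thread.in_compat_syscall)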

--
Brian Gerst

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 22/29] x86/asm: Move 'status' from struct thread_info to struct thread_struct
  2016-06-27  0:36       ` Brian Gerst
@ 2016-06-27  0:40         ` Andy Lutomirski
  0 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-27  0:40 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Sun, Jun 26, 2016 at 5:36 PM, Brian Gerst <brgerst@gmail.com> wrote:
> On Sun, Jun 26, 2016 at 8:23 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Sun, Jun 26, 2016 at 4:55 PM, Brian Gerst <brgerst@gmail.com> wrote:
>>> On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>> --- a/arch/x86/kernel/asm-offsets.c
>>>> +++ b/arch/x86/kernel/asm-offsets.c
>>>> @@ -30,7 +30,6 @@
>>>>  void common(void) {
>>>>         BLANK();
>>>>         OFFSET(TI_flags, thread_info, flags);
>>>> -       OFFSET(TI_status, thread_info, status);
>>>
>>> TI_status can be deleted.  Its last users were removed in commit ee08c6bd.
>>
>> Indeed.
>>
>> Just to double-check: are you saying that this patch is okay?
>
> It looks OK to me, but I haven't tested it.  Another suggestion is to
> change the compat flag to a bitfield, since there is only one TS_*
> flag now and it's not referenced from asm.

That could also work.

As a silly alternative thought: we just might be able to get away with
shoving the "is ia32" flag into one of the high bits of
pt_regs->orig_ax.  It wouldn't break any 32-bit ptrace users because
they can't see the high bits.  It wouldn't break most 64-bit ptrace
users because they use the silly PTRACE_GETREGSET API that doesn't
show the high bits if the tracee is "32-bit".  It would change
behavior when a 64-bit tracer traces a 64-bit process that does int
$0x80, but at least strace already gets that case completely wrong.

Of course, this proposal has all kinds of problems.
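
Purely as a hypothetical sketch of that idea (the bit position and the name
are invented here, and none of the ptrace interactions above are handled):

	#define ORIG_AX_IA32	(1UL << 62)	/* invented flag bit */

	/* 32-bit syscall entry would tag the frame: */
	regs->orig_ax |= ORIG_AX_IA32;

	/* and in_ia32_syscall() would test the frame instead of TS_COMPAT: */
	#define in_ia32_syscall() (IS_ENABLED(CONFIG_IA32_EMULATION) && \
				   (current_pt_regs()->orig_ax & ORIG_AX_IA32))

The syscall number itself is already read as (unsigned int)regs->orig_ax in
do_syscall_32_irqs_on(), so a high bit would not perturb dispatch.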

--Andy

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 28/29] sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
  2016-06-26 21:55 ` [PATCH v4 28/29] sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK Andy Lutomirski
@ 2016-06-27  2:35   ` Andy Lutomirski
  0 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-27  2:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Oleg Nesterov,
	Peter Zijlstra

On Sun, Jun 26, 2016 at 2:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
> We currently keep every task's stack around until the task_struct
> itself is freed.  This means that we keep the stack allocation alive
> for longer than necessary and that, under load, we free stacks in
> big batches whenever RCU drops the last task reference.  Neither of
> these is good for reuse of cache-hot memory, and freeing in batches
> prevents us from usefully caching small numbers of vmalloced stacks.
>
> On architectures that have thread_info on the stack, we can't easily
> change this, but on architectures that set THREAD_INFO_IN_TASK, we
> can free it as soon as the task is dead.

This is broken:

> -void free_task(struct task_struct *tsk)
> +void release_task_stack(struct task_struct *tsk)
>  {
>         account_kernel_stack(tsk, -1);
>         arch_release_thread_stack(tsk->stack);
>         free_thread_stack(tsk);
> +       tsk->stack = NULL;
> +#ifdef CONFIG_VMAP_STACK
> +       tsk->stack_vm_area = NULL;
> +#endif
> +}
> +
> +void free_task(struct task_struct *tsk)
> +{
> +#ifndef CONFIG_THREAD_INFO_IN_TASK
> +       /*
> +        * The task is finally done with both the stack and thread_info,
> +        * so free both.
> +        */
> +       release_task_stack(tsk);
> +#else
> +       /*
> +        * If the task had a separate stack allocation, it should be gone
> +        * by now.
> +        */
> +       WARN_ON_ONCE(tsk->stack);
> +#endif

We can get to free_task without first going through TASK_DEAD if we
fail to clone().  I'm inclined to make release_task_stack be safe to
call more than once and to call it unconditionally in free_task, since
doing it without branches (calling release_task_stack in the
copy_process failure path) will require more ifdeffery and sounds like
more trouble than it's worth.
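
A minimal sketch of that approach, reusing the names from the quoted hunk
(untested; accounting and arch hooks as in the patch):

	void release_task_stack(struct task_struct *tsk)
	{
		if (!tsk->stack)
			return;		/* already released, e.g. at TASK_DEAD */

		account_kernel_stack(tsk, -1);
		arch_release_thread_stack(tsk->stack);
		free_thread_stack(tsk);
		tsk->stack = NULL;
	#ifdef CONFIG_VMAP_STACK
		tsk->stack_vm_area = NULL;
	#endif
	}

	void free_task(struct task_struct *tsk)
	{
		/* Safe whether or not the stack was already freed at TASK_DEAD. */
		release_task_stack(tsk);
		/* ... rest of free_task() unchanged ... */
	}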

--Andy

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 01/29] bluetooth: Switch SMP to crypto_cipher_encrypt_one()
  2016-06-26 21:55 ` [PATCH v4 01/29] bluetooth: Switch SMP to crypto_cipher_encrypt_one() Andy Lutomirski
@ 2016-06-27  5:58   ` Marcel Holtmann
  2016-06-27  8:54     ` Ingo Molnar
  0 siblings, 1 reply; 84+ messages in thread
From: Marcel Holtmann @ 2016-06-27  5:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, LKML, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Gustavo F. Padovan, Johan Hedberg,
	David S. Miller, linux-bluetooth, netdev

Hi Andy,

> SMP does ECB crypto on stack buffers.  This is complicated and
> fragile, and it will not work if the stack is virtually allocated.
> 
> Switch to the crypto_cipher interface, which is simpler and safer.
> 
> Cc: Marcel Holtmann <marcel@holtmann.org>
> Cc: Gustavo Padovan <gustavo@padovan.org>
> Cc: Johan Hedberg <johan.hedberg@gmail.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: linux-bluetooth@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
> Acked-and-tested-by: Johan Hedberg <johan.hedberg@intel.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> net/bluetooth/smp.c | 67 ++++++++++++++++++++++-------------------------------
> 1 file changed, 28 insertions(+), 39 deletions(-)

patch has been applied to bluetooth-next tree.

Regards

Marcel
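
For reference, the crypto_cipher pattern the quoted patch moves to looks
roughly like this (a generic single-block AES example using the
<linux/crypto.h> cipher interface, not the exact smp.c code):

	struct crypto_cipher *tfm;
	u8 key[16], in[16], out[16];

	tfm = crypto_alloc_cipher("aes", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	crypto_cipher_setkey(tfm, key, sizeof(key));
	crypto_cipher_encrypt_one(tfm, out, in);	/* one ECB block, no stack scatterlists */
	crypto_free_cipher(tfm);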

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 05/29] x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables()
  2016-06-26 21:55 ` [PATCH v4 05/29] x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables() Andy Lutomirski
@ 2016-06-27  7:19   ` Borislav Petkov
  0 siblings, 0 replies; 84+ messages in thread
From: Borislav Petkov @ 2016-06-27  7:19 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: linux-kernel, linux-arch, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Matt Fleming, linux-efi

Andy Lutomirski <luto@kernel.org> wrote:

>kernel_unmap_pages_in_pgd() is dangerous: if a pgd entry in
>init_mm.pgd were to be cleared, callers would need to ensure that
>the pgd entry hadn't been propagated to any other pgd.
>
>Its only caller was efi_cleanup_page_tables(), and that, in turn,
>was unused, so just delete both functions.  This leaves a couple of
>other helpers unused, so delete them, too.
>
>Cc: Matt Fleming <matt@codeblueprint.co.uk>
>Cc: linux-efi@vger.kernel.org
>Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
>Signed-off-by: Andy Lutomirski <luto@kernel.org>
>---
> arch/x86/include/asm/efi.h           |  1 -
> arch/x86/include/asm/pgtable_types.h |  2 --
> arch/x86/mm/pageattr.c               | 28 ----------------------------
> arch/x86/platform/efi/efi.c          |  2 --
> arch/x86/platform/efi/efi_32.c       |  3 ---
> arch/x86/platform/efi/efi_64.c       |  5 -----
> 6 files changed, 41 deletions(-)

Acked-by: Borislav Petkov <bp@suse.de>

-- 
Sent from a small device: formatting sucks and brevity is inevitable. 

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 01/29] bluetooth: Switch SMP to crypto_cipher_encrypt_one()
  2016-06-27  5:58   ` Marcel Holtmann
@ 2016-06-27  8:54     ` Ingo Molnar
  2016-06-27 22:30       ` Marcel Holtmann
  0 siblings, 1 reply; 84+ messages in thread
From: Ingo Molnar @ 2016-06-27  8:54 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Andy Lutomirski, x86, LKML, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens,
	Gustavo F. Padovan, Johan Hedberg, David S. Miller,
	linux-bluetooth, netdev


* Marcel Holtmann <marcel@holtmann.org> wrote:

> Hi Andy,
> 
> > SMP does ECB crypto on stack buffers.  This is complicated and
> > fragile, and it will not work if the stack is virtually allocated.
> > 
> > Switch to the crypto_cipher interface, which is simpler and safer.
> > 
> > Cc: Marcel Holtmann <marcel@holtmann.org>
> > Cc: Gustavo Padovan <gustavo@padovan.org>
> > Cc: Johan Hedberg <johan.hedberg@gmail.com>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: linux-bluetooth@vger.kernel.org
> > Cc: netdev@vger.kernel.org
> > Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
> > Acked-and-tested-by: Johan Hedberg <johan.hedberg@intel.com>
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > ---
> > net/bluetooth/smp.c | 67 ++++++++++++++++++++++-------------------------------
> > 1 file changed, 28 insertions(+), 39 deletions(-)
> 
> patch has been applied to bluetooth-next tree.

Sadly carrying this separately will delay the virtual kernel stacks feature by a 
kernel cycle, because it's a must-have prerequisite.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks
  2016-06-26 21:55 ` [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks Andy Lutomirski
@ 2016-06-27 15:01   ` Brian Gerst
  2016-06-27 15:12     ` Brian Gerst
  0 siblings, 1 reply; 84+ messages in thread
From: Brian Gerst @ 2016-06-27 15:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
> This allows x86_64 kernels to enable vmapped stacks.  There are a
> couple of interesting bits.
>
> First, x86 lazily faults in top-level paging entries for the vmalloc
> area.  This won't work if we get a page fault while trying to access
> the stack: the CPU will promote it to a double-fault and we'll die.
> To avoid this problem, probe the new stack when switching stacks and
> forcibly populate the pgd entry for the stack when switching mms.
>
> Second, once we have guard pages around the stack, we'll want to
> detect and handle stack overflow.
>
> I didn't enable it on x86_32.  We'd need to rework the double-fault
> code a bit and I'm concerned about running out of vmalloc virtual
> addresses under some workloads.
>
> This patch, by itself, will behave somewhat erratically when the
> stack overflows while RSP is still more than a few tens of bytes
> above the bottom of the stack.  Specifically, we'll get #PF and make
> it to no_context and an oops without triggering a double-fault, and
> no_context doesn't know about stack overflows.  The next patch will
> improve that case.
>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/Kconfig                 |  1 +
>  arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++++++++++++-
>  arch/x86/kernel/traps.c          | 32 ++++++++++++++++++++++++++++++++
>  arch/x86/mm/tlb.c                | 15 +++++++++++++++
>  4 files changed, 75 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index d9a94da0c29f..afdcf96ef109 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -92,6 +92,7 @@ config X86
>         select HAVE_ARCH_TRACEHOOK
>         select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>         select HAVE_EBPF_JIT                    if X86_64
> +       select HAVE_ARCH_VMAP_STACK             if X86_64
>         select HAVE_CC_STACKPROTECTOR
>         select HAVE_CMPXCHG_DOUBLE
>         select HAVE_CMPXCHG_LOCAL
> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index 8f321a1b03a1..14e4b20f0aaf 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -8,6 +8,28 @@ struct tss_struct;
>  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>                       struct tss_struct *tss);
>
> +/* This runs on the previous thread's stack. */
> +static inline void prepare_switch_to(struct task_struct *prev,
> +                                    struct task_struct *next)
> +{
> +#ifdef CONFIG_VMAP_STACK
> +       /*
> +        * If we switch to a stack that has a top-level paging entry
> +        * that is not present in the current mm, the resulting #PF
> +        * will be promoted to a double-fault and we'll panic.  Probe
> +        * the new stack now so that vmalloc_fault can fix up the page
> +        * tables if needed.  This can only happen if we use a stack
> +        * in vmap space.
> +        *
> +        * We assume that the stack is aligned so that it never spans
> +        * more than one top-level paging entry.
> +        *
> +        * To minimize cache pollution, just follow the stack pointer.
> +        */
> +       READ_ONCE(*(unsigned char *)next->thread.sp);
> +#endif
> +}
> +
>  #ifdef CONFIG_X86_32
>
>  #ifdef CONFIG_CC_STACKPROTECTOR
> @@ -39,6 +61,8 @@ do {                                                                  \
>          */                                                             \
>         unsigned long ebx, ecx, edx, esi, edi;                          \
>                                                                         \
> +       prepare_switch_to(prev, next);                                  \
> +                                                                       \
>         asm volatile("pushl %%ebp\n\t"          /* save    EBP   */     \
>                      "movl %%esp,%[prev_sp]\n\t"        /* save    ESP   */ \
>                      "movl %[next_sp],%%esp\n\t"        /* restore ESP   */ \
> @@ -103,7 +127,9 @@ do {                                                                        \
>   * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
>   * has no effect.
>   */
> -#define switch_to(prev, next, last) \
> +#define switch_to(prev, next, last)                                      \
> +       prepare_switch_to(prev, next);                                    \
> +                                                                         \
>         asm volatile(SAVE_CONTEXT                                         \
>              "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */       \
>              "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */    \
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 00f03d82e69a..9cb7ea781176 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present", segment_not_present)
>  DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",            stack_segment)
>  DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",          alignment_check)
>
> +#ifdef CONFIG_VMAP_STACK
> +static void __noreturn handle_stack_overflow(const char *message,
> +                                            struct pt_regs *regs,
> +                                            unsigned long fault_address)
> +{
> +       printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
> +                (void *)fault_address, current->stack,
> +                (char *)current->stack + THREAD_SIZE - 1);
> +       die(message, regs, 0);
> +
> +       /* Be absolutely certain we don't return. */
> +       panic(message);
> +}
> +#endif
> +
>  #ifdef CONFIG_X86_64
>  /* Runs on IST stack */
>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>  {
>         static const char str[] = "double fault";
>         struct task_struct *tsk = current;
> +#ifdef CONFIG_VMAP_STACK
> +       unsigned long cr2;
> +#endif
>
>  #ifdef CONFIG_X86_ESPFIX64
>         extern unsigned char native_irq_return_iret[];
> @@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>         tsk->thread.error_code = error_code;
>         tsk->thread.trap_nr = X86_TRAP_DF;
>
> +#ifdef CONFIG_VMAP_STACK
> +       /*
> +        * If we overflow the stack into a guard page, the CPU will fail
> +        * to deliver #PF and will send #DF instead.  CR2 will contain
> +        * the linear address of the second fault, which will be in the
> +        * guard page below the bottom of the stack.
> +        */
> +       cr2 = read_cr2();
> +       if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
> +               handle_stack_overflow(
> +                       "kernel stack overflow (double-fault)",
> +                       regs, cr2);
> +#endif

Is there any other way to tell if this was from a page fault?  If it
wasn't a page fault then CR2 is undefined.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks
  2016-06-27 15:01   ` Brian Gerst
@ 2016-06-27 15:12     ` Brian Gerst
  2016-06-27 15:22       ` Andy Lutomirski
  0 siblings, 1 reply; 84+ messages in thread
From: Brian Gerst @ 2016-06-27 15:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Mon, Jun 27, 2016 at 11:01 AM, Brian Gerst <brgerst@gmail.com> wrote:
> On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> This allows x86_64 kernels to enable vmapped stacks.  There are a
>> couple of interesting bits.
>>
>> First, x86 lazily faults in top-level paging entries for the vmalloc
>> area.  This won't work if we get a page fault while trying to access
>> the stack: the CPU will promote it to a double-fault and we'll die.
>> To avoid this problem, probe the new stack when switching stacks and
>> forcibly populate the pgd entry for the stack when switching mms.
>>
>> Second, once we have guard pages around the stack, we'll want to
>> detect and handle stack overflow.
>>
>> I didn't enable it on x86_32.  We'd need to rework the double-fault
>> code a bit and I'm concerned about running out of vmalloc virtual
>> addresses under some workloads.
>>
>> This patch, by itself, will behave somewhat erratically when the
>> stack overflows while RSP is still more than a few tens of bytes
>> above the bottom of the stack.  Specifically, we'll get #PF and make
>> it to no_context and an oops without triggering a double-fault, and
>> no_context doesn't know about stack overflows.  The next patch will
>> improve that case.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/Kconfig                 |  1 +
>>  arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++++++++++++-
>>  arch/x86/kernel/traps.c          | 32 ++++++++++++++++++++++++++++++++
>>  arch/x86/mm/tlb.c                | 15 +++++++++++++++
>>  4 files changed, 75 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index d9a94da0c29f..afdcf96ef109 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -92,6 +92,7 @@ config X86
>>         select HAVE_ARCH_TRACEHOOK
>>         select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>>         select HAVE_EBPF_JIT                    if X86_64
>> +       select HAVE_ARCH_VMAP_STACK             if X86_64
>>         select HAVE_CC_STACKPROTECTOR
>>         select HAVE_CMPXCHG_DOUBLE
>>         select HAVE_CMPXCHG_LOCAL
>> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
>> index 8f321a1b03a1..14e4b20f0aaf 100644
>> --- a/arch/x86/include/asm/switch_to.h
>> +++ b/arch/x86/include/asm/switch_to.h
>> @@ -8,6 +8,28 @@ struct tss_struct;
>>  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>>                       struct tss_struct *tss);
>>
>> +/* This runs on the previous thread's stack. */
>> +static inline void prepare_switch_to(struct task_struct *prev,
>> +                                    struct task_struct *next)
>> +{
>> +#ifdef CONFIG_VMAP_STACK
>> +       /*
>> +        * If we switch to a stack that has a top-level paging entry
>> +        * that is not present in the current mm, the resulting #PF
>> +        * will be promoted to a double-fault and we'll panic.  Probe
>> +        * the new stack now so that vmalloc_fault can fix up the page
>> +        * tables if needed.  This can only happen if we use a stack
>> +        * in vmap space.
>> +        *
>> +        * We assume that the stack is aligned so that it never spans
>> +        * more than one top-level paging entry.
>> +        *
>> +        * To minimize cache pollution, just follow the stack pointer.
>> +        */
>> +       READ_ONCE(*(unsigned char *)next->thread.sp);
>> +#endif
>> +}
>> +
>>  #ifdef CONFIG_X86_32
>>
>>  #ifdef CONFIG_CC_STACKPROTECTOR
>> @@ -39,6 +61,8 @@ do {                                                                  \
>>          */                                                             \
>>         unsigned long ebx, ecx, edx, esi, edi;                          \
>>                                                                         \
>> +       prepare_switch_to(prev, next);                                  \
>> +                                                                       \
>>         asm volatile("pushl %%ebp\n\t"          /* save    EBP   */     \
>>                      "movl %%esp,%[prev_sp]\n\t"        /* save    ESP   */ \
>>                      "movl %[next_sp],%%esp\n\t"        /* restore ESP   */ \
>> @@ -103,7 +127,9 @@ do {                                                                        \
>>   * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
>>   * has no effect.
>>   */
>> -#define switch_to(prev, next, last) \
>> +#define switch_to(prev, next, last)                                      \
>> +       prepare_switch_to(prev, next);                                    \
>> +                                                                         \
>>         asm volatile(SAVE_CONTEXT                                         \
>>              "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */       \
>>              "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */    \
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index 00f03d82e69a..9cb7ea781176 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present", segment_not_present)
>>  DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",            stack_segment)
>>  DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",          alignment_check)
>>
>> +#ifdef CONFIG_VMAP_STACK
>> +static void __noreturn handle_stack_overflow(const char *message,
>> +                                            struct pt_regs *regs,
>> +                                            unsigned long fault_address)
>> +{
>> +       printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
>> +                (void *)fault_address, current->stack,
>> +                (char *)current->stack + THREAD_SIZE - 1);
>> +       die(message, regs, 0);
>> +
>> +       /* Be absolutely certain we don't return. */
>> +       panic(message);
>> +}
>> +#endif
>> +
>>  #ifdef CONFIG_X86_64
>>  /* Runs on IST stack */
>>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>  {
>>         static const char str[] = "double fault";
>>         struct task_struct *tsk = current;
>> +#ifdef CONFIG_VMAP_STACK
>> +       unsigned long cr2;
>> +#endif
>>
>>  #ifdef CONFIG_X86_ESPFIX64
>>         extern unsigned char native_irq_return_iret[];
>> @@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>         tsk->thread.error_code = error_code;
>>         tsk->thread.trap_nr = X86_TRAP_DF;
>>
>> +#ifdef CONFIG_VMAP_STACK
>> +       /*
>> +        * If we overflow the stack into a guard page, the CPU will fail
>> +        * to deliver #PF and will send #DF instead.  CR2 will contain
>> +        * the linear address of the second fault, which will be in the
>> +        * guard page below the bottom of the stack.
>> +        */
>> +       cr2 = read_cr2();
>> +       if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
>> +               handle_stack_overflow(
>> +                       "kernel stack overflow (double-fault)",
>> +                       regs, cr2);
>> +#endif
>
> Is there any other way to tell if this was from a page fault?  If it
> wasn't a page fault then CR2 is undefined.

I guess it doesn't really matter, since the fault is fatal either way.
The error message might be incorrect though.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks
  2016-06-27 15:12     ` Brian Gerst
@ 2016-06-27 15:22       ` Andy Lutomirski
  2016-06-27 15:54         ` Andy Lutomirski
  0 siblings, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-27 15:22 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Mon, Jun 27, 2016 at 8:12 AM, Brian Gerst <brgerst@gmail.com> wrote:
> On Mon, Jun 27, 2016 at 11:01 AM, Brian Gerst <brgerst@gmail.com> wrote:
>> On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> This allows x86_64 kernels to enable vmapped stacks.  There are a
>>> couple of interesting bits.
>>>
>>> First, x86 lazily faults in top-level paging entries for the vmalloc
>>> area.  This won't work if we get a page fault while trying to access
>>> the stack: the CPU will promote it to a double-fault and we'll die.
>>> To avoid this problem, probe the new stack when switching stacks and
>>> forcibly populate the pgd entry for the stack when switching mms.
>>>
>>> Second, once we have guard pages around the stack, we'll want to
>>> detect and handle stack overflow.
>>>
>>> I didn't enable it on x86_32.  We'd need to rework the double-fault
>>> code a bit and I'm concerned about running out of vmalloc virtual
>>> addresses under some workloads.
>>>
>>> This patch, by itself, will behave somewhat erratically when the
>>> stack overflows while RSP is still more than a few tens of bytes
>>> above the bottom of the stack.  Specifically, we'll get #PF and make
>>> it to no_context and an oops without triggering a double-fault, and
>>> no_context doesn't know about stack overflows.  The next patch will
>>> improve that case.
>>>
>>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>>> ---
>>>  arch/x86/Kconfig                 |  1 +
>>>  arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++++++++++++-
>>>  arch/x86/kernel/traps.c          | 32 ++++++++++++++++++++++++++++++++
>>>  arch/x86/mm/tlb.c                | 15 +++++++++++++++
>>>  4 files changed, 75 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index d9a94da0c29f..afdcf96ef109 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -92,6 +92,7 @@ config X86
>>>         select HAVE_ARCH_TRACEHOOK
>>>         select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>>>         select HAVE_EBPF_JIT                    if X86_64
>>> +       select HAVE_ARCH_VMAP_STACK             if X86_64
>>>         select HAVE_CC_STACKPROTECTOR
>>>         select HAVE_CMPXCHG_DOUBLE
>>>         select HAVE_CMPXCHG_LOCAL
>>> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
>>> index 8f321a1b03a1..14e4b20f0aaf 100644
>>> --- a/arch/x86/include/asm/switch_to.h
>>> +++ b/arch/x86/include/asm/switch_to.h
>>> @@ -8,6 +8,28 @@ struct tss_struct;
>>>  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>>>                       struct tss_struct *tss);
>>>
>>> +/* This runs on the previous thread's stack. */
>>> +static inline void prepare_switch_to(struct task_struct *prev,
>>> +                                    struct task_struct *next)
>>> +{
>>> +#ifdef CONFIG_VMAP_STACK
>>> +       /*
>>> +        * If we switch to a stack that has a top-level paging entry
>>> +        * that is not present in the current mm, the resulting #PF
>>> +        * will be promoted to a double-fault and we'll panic.  Probe
>>> +        * the new stack now so that vmalloc_fault can fix up the page
>>> +        * tables if needed.  This can only happen if we use a stack
>>> +        * in vmap space.
>>> +        *
>>> +        * We assume that the stack is aligned so that it never spans
>>> +        * more than one top-level paging entry.
>>> +        *
>>> +        * To minimize cache pollution, just follow the stack pointer.
>>> +        */
>>> +       READ_ONCE(*(unsigned char *)next->thread.sp);
>>> +#endif
>>> +}
>>> +
>>>  #ifdef CONFIG_X86_32
>>>
>>>  #ifdef CONFIG_CC_STACKPROTECTOR
>>> @@ -39,6 +61,8 @@ do {                                                                  \
>>>          */                                                             \
>>>         unsigned long ebx, ecx, edx, esi, edi;                          \
>>>                                                                         \
>>> +       prepare_switch_to(prev, next);                                  \
>>> +                                                                       \
>>>         asm volatile("pushl %%ebp\n\t"          /* save    EBP   */     \
>>>                      "movl %%esp,%[prev_sp]\n\t"        /* save    ESP   */ \
>>>                      "movl %[next_sp],%%esp\n\t"        /* restore ESP   */ \
>>> @@ -103,7 +127,9 @@ do {                                                                        \
>>>   * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
>>>   * has no effect.
>>>   */
>>> -#define switch_to(prev, next, last) \
>>> +#define switch_to(prev, next, last)                                      \
>>> +       prepare_switch_to(prev, next);                                    \
>>> +                                                                         \
>>>         asm volatile(SAVE_CONTEXT                                         \
>>>              "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */       \
>>>              "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */    \
>>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>>> index 00f03d82e69a..9cb7ea781176 100644
>>> --- a/arch/x86/kernel/traps.c
>>> +++ b/arch/x86/kernel/traps.c
>>> @@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present", segment_not_present)
>>>  DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",            stack_segment)
>>>  DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",          alignment_check)
>>>
>>> +#ifdef CONFIG_VMAP_STACK
>>> +static void __noreturn handle_stack_overflow(const char *message,
>>> +                                            struct pt_regs *regs,
>>> +                                            unsigned long fault_address)
>>> +{
>>> +       printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
>>> +                (void *)fault_address, current->stack,
>>> +                (char *)current->stack + THREAD_SIZE - 1);
>>> +       die(message, regs, 0);
>>> +
>>> +       /* Be absolutely certain we don't return. */
>>> +       panic(message);
>>> +}
>>> +#endif
>>> +
>>>  #ifdef CONFIG_X86_64
>>>  /* Runs on IST stack */
>>>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>  {
>>>         static const char str[] = "double fault";
>>>         struct task_struct *tsk = current;
>>> +#ifdef CONFIG_VMAP_STACK
>>> +       unsigned long cr2;
>>> +#endif
>>>
>>>  #ifdef CONFIG_X86_ESPFIX64
>>>         extern unsigned char native_irq_return_iret[];
>>> @@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>         tsk->thread.error_code = error_code;
>>>         tsk->thread.trap_nr = X86_TRAP_DF;
>>>
>>> +#ifdef CONFIG_VMAP_STACK
>>> +       /*
>>> +        * If we overflow the stack into a guard page, the CPU will fail
>>> +        * to deliver #PF and will send #DF instead.  CR2 will contain
>>> +        * the linear address of the second fault, which will be in the
>>> +        * guard page below the bottom of the stack.
>>> +        */
>>> +       cr2 = read_cr2();
>>> +       if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
>>> +               handle_stack_overflow(
>>> +                       "kernel stack overflow (double-fault)",
>>> +                       regs, cr2);
>>> +#endif
>>
>> Is there any other way to tell if this was from a page fault?  If it
>> wasn't a page fault then CR2 is undefined.
>
> I guess it doesn't really matter, since the fault is fatal either way.
> The error message might be incorrect though.
>

It's at least worth a comment, though.  Maybe I should check if
regs->rsp is within 40 bytes of the bottom of the stack, too, such
that delivery of an inner fault would have double-faulted assuming the
inner fault didn't use an IST vector.

--Andy

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks
  2016-06-27 15:22       ` Andy Lutomirski
@ 2016-06-27 15:54         ` Andy Lutomirski
  2016-06-27 16:17           ` Brian Gerst
  2016-06-27 17:28           ` Linus Torvalds
  0 siblings, 2 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-27 15:54 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Mon, Jun 27, 2016 at 8:22 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jun 27, 2016 at 8:12 AM, Brian Gerst <brgerst@gmail.com> wrote:
>> On Mon, Jun 27, 2016 at 11:01 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>> On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>  #ifdef CONFIG_X86_64
>>>>  /* Runs on IST stack */
>>>>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>>  {
>>>>         static const char str[] = "double fault";
>>>>         struct task_struct *tsk = current;
>>>> +#ifdef CONFIG_VMAP_STACK
>>>> +       unsigned long cr2;
>>>> +#endif
>>>>
>>>>  #ifdef CONFIG_X86_ESPFIX64
>>>>         extern unsigned char native_irq_return_iret[];
>>>> @@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>>         tsk->thread.error_code = error_code;
>>>>         tsk->thread.trap_nr = X86_TRAP_DF;
>>>>
>>>> +#ifdef CONFIG_VMAP_STACK
>>>> +       /*
>>>> +        * If we overflow the stack into a guard page, the CPU will fail
>>>> +        * to deliver #PF and will send #DF instead.  CR2 will contain
>>>> +        * the linear address of the second fault, which will be in the
>>>> +        * guard page below the bottom of the stack.
>>>> +        */
>>>> +       cr2 = read_cr2();
>>>> +       if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
>>>> +               handle_stack_overflow(
>>>> +                       "kernel stack overflow (double-fault)",
>>>> +                       regs, cr2);
>>>> +#endif
>>>
>>> Is there any other way to tell if this was from a page fault?  If it
>>> wasn't a page fault then CR2 is undefined.
>>
>> I guess it doesn't really matter, since the fault is fatal either way.
>> The error message might be incorrect though.
>>
>
> It's at least worth a comment, though.  Maybe I should check if
> regs->rsp is within 40 bytes of the bottom of the stack, too, such
> that delivery of an inner fault would have double-faulted assuming the
> inner fault didn't use an IST vector.
>

How about:

    /*
     * If we overflow the stack into a guard page, the CPU will fail
     * to deliver #PF and will send #DF instead.  CR2 will contain
     * the linear address of the second fault, which will be in the
     * guard page below the bottom of the stack.
     *
     * We're limited to using heuristics here, since the CPU does
     * not tell us what type of fault failed and, if the first fault
     * wasn't a page fault, CR2 may contain stale garbage.  To mostly
     * rule out garbage, we check if the saved RSP is close enough to
     * the bottom of the stack to cause exception delivery to fail.
     * The worst case is 7 stack slots: one for alignment, five for
     * SS..RIP, and one for the error code.
     */
    tsk_stack = (unsigned long)task_stack_page(tsk);
    if (regs->rsp <= tsk_stack + 7*8 && regs->rsp > tsk_stack - PAGE_SIZE) {
        /* A double-fault due to #PF delivery failure is plausible. */
        cr2 = read_cr2();
        if (tsk_stack - 1 - cr2 < PAGE_SIZE)
            handle_stack_overflow(
                "kernel stack overflow (double-fault)",
                regs, cr2);
    }

--Andy
-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks
  2016-06-27 15:54         ` Andy Lutomirski
@ 2016-06-27 16:17           ` Brian Gerst
  2016-06-27 16:35             ` Andy Lutomirski
  2016-06-27 17:28           ` Linus Torvalds
  1 sibling, 1 reply; 84+ messages in thread
From: Brian Gerst @ 2016-06-27 16:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Mon, Jun 27, 2016 at 11:54 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jun 27, 2016 at 8:22 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Jun 27, 2016 at 8:12 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>> On Mon, Jun 27, 2016 at 11:01 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>>> On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>>  #ifdef CONFIG_X86_64
>>>>>  /* Runs on IST stack */
>>>>>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>>>  {
>>>>>         static const char str[] = "double fault";
>>>>>         struct task_struct *tsk = current;
>>>>> +#ifdef CONFIG_VMAP_STACK
>>>>> +       unsigned long cr2;
>>>>> +#endif
>>>>>
>>>>>  #ifdef CONFIG_X86_ESPFIX64
>>>>>         extern unsigned char native_irq_return_iret[];
>>>>> @@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>>>         tsk->thread.error_code = error_code;
>>>>>         tsk->thread.trap_nr = X86_TRAP_DF;
>>>>>
>>>>> +#ifdef CONFIG_VMAP_STACK
>>>>> +       /*
>>>>> +        * If we overflow the stack into a guard page, the CPU will fail
>>>>> +        * to deliver #PF and will send #DF instead.  CR2 will contain
>>>>> +        * the linear address of the second fault, which will be in the
>>>>> +        * guard page below the bottom of the stack.
>>>>> +        */
>>>>> +       cr2 = read_cr2();
>>>>> +       if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
>>>>> +               handle_stack_overflow(
>>>>> +                       "kernel stack overflow (double-fault)",
>>>>> +                       regs, cr2);
>>>>> +#endif
>>>>
>>>> Is there any other way to tell if this was from a page fault?  If it
>>>> wasn't a page fault then CR2 is undefined.
>>>
>>> I guess it doesn't really matter, since the fault is fatal either way.
>>> The error message might be incorrect though.
>>>
>>
>> It's at least worth a comment, though.  Maybe I should check if
>> regs->rsp is within 40 bytes of the bottom of the stack, too, such
>> that delivery of an inner fault would have double-faulted assuming the
>> inner fault didn't use an IST vector.
>>
>
> How about:
>
>     /*
>      * If we overflow the stack into a guard page, the CPU will fail
>      * to deliver #PF and will send #DF instead.  CR2 will contain
>      * the linear address of the second fault, which will be in the
>      * guard page below the bottom of the stack.
>      *
>      * We're limited to using heuristics here, since the CPU does
>      * not tell us what type of fault failed and, if the first fault
>      * wasn't a page fault, CR2 may contain stale garbage.  To mostly
>      * rule out garbage, we check if the saved RSP is close enough to
>      * the bottom of the stack to cause exception delivery to fail.
>      * The worst case is 7 stack slots: one for alignment, five for
>      * SS..RIP, and one for the error code.
>      */
>     tsk_stack = (unsigned long)task_stack_page(tsk);
>     if (regs->rsp <= tsk_stack + 7*8 && regs->rsp > tsk_stack - PAGE_SIZE) {
>         /* A double-fault due to #PF delivery failure is plausible. */
>         cr2 = read_cr2();
>         if (tsk_stack - 1 - cr2 < PAGE_SIZE)
>             handle_stack_overflow(
>                 "kernel stack overflow (double-fault)",
>                 regs, cr2);
>     }

I think RSP anywhere in the guard page would be best, since it could
have been decremented by a function prologue into the guard page
before an access that triggers the page fault.
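
(A minimal sketch of the check being described here, reusing tsk_stack,
read_cr2() and handle_stack_overflow() from the snippets quoted above;
illustrative only, not code from the patch.  Note that struct pt_regs
spells the saved RSP "sp".)

    unsigned long tsk_stack = (unsigned long)task_stack_page(tsk);
    unsigned long cr2;

    /*
     * Treat the double fault as a stack overflow whenever the saved
     * stack pointer already sits inside the guard page directly below
     * the task stack.
     */
    if (regs->sp >= tsk_stack - PAGE_SIZE && regs->sp < tsk_stack) {
            cr2 = read_cr2();
            handle_stack_overflow("kernel stack overflow (double-fault)",
                                  regs, cr2);
    }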

--
Brian Gerst

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks
  2016-06-27 16:17           ` Brian Gerst
@ 2016-06-27 16:35             ` Andy Lutomirski
  2016-06-27 17:09               ` Brian Gerst
  0 siblings, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-27 16:35 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Mon, Jun 27, 2016 at 9:17 AM, Brian Gerst <brgerst@gmail.com> wrote:
> On Mon, Jun 27, 2016 at 11:54 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Jun 27, 2016 at 8:22 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Mon, Jun 27, 2016 at 8:12 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>>> On Mon, Jun 27, 2016 at 11:01 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>>>> On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>>>  #ifdef CONFIG_X86_64
>>>>>>  /* Runs on IST stack */
>>>>>>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>>>>  {
>>>>>>         static const char str[] = "double fault";
>>>>>>         struct task_struct *tsk = current;
>>>>>> +#ifdef CONFIG_VMAP_STACK
>>>>>> +       unsigned long cr2;
>>>>>> +#endif
>>>>>>
>>>>>>  #ifdef CONFIG_X86_ESPFIX64
>>>>>>         extern unsigned char native_irq_return_iret[];
>>>>>> @@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>>>>         tsk->thread.error_code = error_code;
>>>>>>         tsk->thread.trap_nr = X86_TRAP_DF;
>>>>>>
>>>>>> +#ifdef CONFIG_VMAP_STACK
>>>>>> +       /*
>>>>>> +        * If we overflow the stack into a guard page, the CPU will fail
>>>>>> +        * to deliver #PF and will send #DF instead.  CR2 will contain
>>>>>> +        * the linear address of the second fault, which will be in the
>>>>>> +        * guard page below the bottom of the stack.
>>>>>> +        */
>>>>>> +       cr2 = read_cr2();
>>>>>> +       if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
>>>>>> +               handle_stack_overflow(
>>>>>> +                       "kernel stack overflow (double-fault)",
>>>>>> +                       regs, cr2);
>>>>>> +#endif
>>>>>
>>>>> Is there any other way to tell if this was from a page fault?  If it
>>>>> wasn't a page fault then CR2 is undefined.
>>>>
>>>> I guess it doesn't really matter, since the fault is fatal either way.
>>>> The error message might be incorrect though.
>>>>
>>>
>>> It's at least worth a comment, though.  Maybe I should check if
>>> regs->rsp is within 40 bytes of the bottom of the stack, too, such
>>> that delivery of an inner fault would have double-faulted assuming the
>>> inner fault didn't use an IST vector.
>>>
>>
>> How about:
>>
>>     /*
>>      * If we overflow the stack into a guard page, the CPU will fail
>>      * to deliver #PF and will send #DF instead.  CR2 will contain
>>      * the linear address of the second fault, which will be in the
>>      * guard page below the bottom of the stack.
>>      *
>>      * We're limited to using heuristics here, since the CPU does
>>      * not tell us what type of fault failed and, if the first fault
>>      * wasn't a page fault, CR2 may contain stale garbage.  To mostly
>>      * rule out garbage, we check if the saved RSP is close enough to
>>      * the bottom of the stack to cause exception delivery to fail.
>>      * The worst case is 7 stack slots: one for alignment, five for
>>      * SS..RIP, and one for the error code.
>>      */
>>     tsk_stack = (unsigned long)task_stack_page(tsk);
>>     if (regs->rsp <= tsk_stack + 7*8 && regs->rsp > tsk_stack - PAGE_SIZE) {
>>         /* A double-fault due to #PF delivery failure is plausible. */
>>         cr2 = read_cr2();
>>         if (tsk_stack - 1 - cr2 < PAGE_SIZE)
>>             handle_stack_overflow(
>>                 "kernel stack overflow (double-fault)",
>>                 regs, cr2);
>>     }
>
> I think RSP anywhere in the guard page would be best, since it could
> have been decremented by a function prologue into the guard page
> before an access that triggers the page fault.
>

I think that can miss some stack overflows.  Suppose that RSP points
very close to the bottom of the stack and we take an unrelated fault.
The CPU can fail to deliver that fault and we get a double fault
instead.  But I counted wrong, too.  Do you like this version and its
explanation?

    /*
     * If we overflow the stack into a guard page, the CPU will fail
     * to deliver #PF and will send #DF instead.  Similarly, if we
     * take any non-IST exception while too close to the bottom of
     * the stack, the processor will get a page fault while
     * delivering the exception and will generate a double fault.
     *
     * According to the SDM (footnote in 6.15 under "Interrupt 14 -
     * Page-Fault Exception (#PF)"):
     *
     *   Processors update CR2 whenever a page fault is detected. If a
     *   second page fault occurs while an earlier page fault is being
     *   delivered, the faulting linear address of the second fault will
     *   overwrite the contents of CR2 (replacing the previous
     *   address). These updates to CR2 occur even if the page fault
     *   results in a double fault or occurs during the delivery of a
     *   double fault.
     *
     * However, if we got here due to a non-page-fault exception while
     * delivering a non-page-fault exception, CR2 may contain a
     * stale value.
     *
     * As a heuristic: we consider this double fault to be a stack
     * overflow if CR2 points to the guard page and RSP is either
     * in the guard page or close enough to the bottom of the stack.
     *
     * We're limited to using heuristics here, since the CPU does
     * not tell us what type of fault failed and, if the first fault
     * wasn't a page fault, CR2 may contain stale garbage.  To
     * mostly rule out garbage, we check if the saved RSP is close
     * enough to the bottom of the stack to cause exception delivery
     * to fail.  If RSP == tsk_stack + 48 and we take an exception,
     * the stack is already aligned and there will be enough room for
     * SS, RSP, RFLAGS, CS, RIP, and a possible error code.  With
     * any less space left, exception delivery could fail.
     */
    tsk_stack = (unsigned long)task_stack_page(tsk);
    if (regs->rsp < tsk_stack + 48 && regs->rsp > tsk_stack - PAGE_SIZE) {
        /* A double-fault due to #PF delivery failure is plausible. */
        cr2 = read_cr2();
        if (tsk_stack - 1 - cr2 < PAGE_SIZE)
            handle_stack_overflow(
                "kernel stack overflow (double-fault)",
                regs, cr2);
    }
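
(For reference, the 48 bytes above are the hardware exception frame the
CPU pushes for a non-IST exception that carries an error code; a sketch
of the arithmetic, not part of the proposed comment:)

    SS           8 bytes
    RSP          8 bytes
    RFLAGS       8 bytes
    CS           8 bytes
    RIP          8 bytes
    error code   8 bytes
    -------------------
    total       48 bytes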

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks
  2016-06-27 16:35             ` Andy Lutomirski
@ 2016-06-27 17:09               ` Brian Gerst
  2016-06-27 17:23                 ` Brian Gerst
  0 siblings, 1 reply; 84+ messages in thread
From: Brian Gerst @ 2016-06-27 17:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Mon, Jun 27, 2016 at 12:35 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jun 27, 2016 at 9:17 AM, Brian Gerst <brgerst@gmail.com> wrote:
>> On Mon, Jun 27, 2016 at 11:54 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Mon, Jun 27, 2016 at 8:22 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Mon, Jun 27, 2016 at 8:12 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>>>> On Mon, Jun 27, 2016 at 11:01 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>>>>> On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>>>>  #ifdef CONFIG_X86_64
>>>>>>>  /* Runs on IST stack */
>>>>>>>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>>>>>  {
>>>>>>>         static const char str[] = "double fault";
>>>>>>>         struct task_struct *tsk = current;
>>>>>>> +#ifdef CONFIG_VMAP_STACK
>>>>>>> +       unsigned long cr2;
>>>>>>> +#endif
>>>>>>>
>>>>>>>  #ifdef CONFIG_X86_ESPFIX64
>>>>>>>         extern unsigned char native_irq_return_iret[];
>>>>>>> @@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>>>>>         tsk->thread.error_code = error_code;
>>>>>>>         tsk->thread.trap_nr = X86_TRAP_DF;
>>>>>>>
>>>>>>> +#ifdef CONFIG_VMAP_STACK
>>>>>>> +       /*
>>>>>>> +        * If we overflow the stack into a guard page, the CPU will fail
>>>>>>> +        * to deliver #PF and will send #DF instead.  CR2 will contain
>>>>>>> +        * the linear address of the second fault, which will be in the
>>>>>>> +        * guard page below the bottom of the stack.
>>>>>>> +        */
>>>>>>> +       cr2 = read_cr2();
>>>>>>> +       if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
>>>>>>> +               handle_stack_overflow(
>>>>>>> +                       "kernel stack overflow (double-fault)",
>>>>>>> +                       regs, cr2);
>>>>>>> +#endif
>>>>>>
>>>>>> Is there any other way to tell if this was from a page fault?  If it
>>>>>> wasn't a page fault then CR2 is undefined.
>>>>>
>>>>> I guess it doesn't really matter, since the fault is fatal either way.
>>>>> The error message might be incorrect though.
>>>>>
>>>>
>>>> It's at least worth a comment, though.  Maybe I should check if
>>>> regs->rsp is within 40 bytes of the bottom of the stack, too, such
>>>> that delivery of an inner fault would have double-faulted assuming the
>>>> inner fault didn't use an IST vector.
>>>>
>>>
>>> How about:
>>>
>>>     /*
>>>      * If we overflow the stack into a guard page, the CPU will fail
>>>      * to deliver #PF and will send #DF instead.  CR2 will contain
>>>      * the linear address of the second fault, which will be in the
>>>      * guard page below the bottom of the stack.
>>>      *
>>>      * We're limited to using heuristics here, since the CPU does
>>>      * not tell us what type of fault failed and, if the first fault
>>>      * wasn't a page fault, CR2 may contain stale garbage.  To mostly
>>>      * rule out garbage, we check if the saved RSP is close enough to
>>>      * the bottom of the stack to cause exception delivery to fail.
>>>      * The worst case is 7 stack slots: one for alignment, five for
>>>      * SS..RIP, and one for the error code.
>>>      */
>>>     tsk_stack = (unsigned long)task_stack_page(tsk);
>>>     if (regs->rsp <= tsk_stack + 7*8 && regs->rsp > tsk_stack - PAGE_SIZE) {
>>>         /* A double-fault due to #PF delivery failure is plausible. */
>>>         cr2 = read_cr2();
>>>         if (tsk_stack - 1 - cr2 < PAGE_SIZE)
>>>             handle_stack_overflow(
>>>                 "kernel stack overflow (double-fault)",
>>>                 regs, cr2);
>>>     }
>>
>> I think RSP anywhere in the guard page would be best, since it could
>> have been decremented by a function prologue into the guard page
>> before an access that triggers the page fault.
>>
>
> I think that can miss some stack overflows.  Suppose that RSP points
> very close to the bottom of the stack and we take an unrelated fault.
> The CPU can fail to deliver that fault and we get a double fault
> instead.  But I counted wrong, too.  Do you like this version and its
> explanation?
>
>     /*
>      * If we overflow the stack into a guard page, the CPU will fail
>      * to deliver #PF and will send #DF instead.  Similarly, if we
>      * take any non-IST exception while too close to the bottom of
>      * the stack, the processor will get a page fault while
>      * delivering the exception and will generate a double fault.
>      *
>      * According to the SDM (footnote in 6.15 under "Interrupt 14 -
>      * Page-Fault Exception (#PF)"):
>      *
>      *   Processors update CR2 whenever a page fault is detected. If a
>      *   second page fault occurs while an earlier page fault is being
>      *   delivered, the faulting linear address of the second fault will
>      *   overwrite the contents of CR2 (replacing the previous
>      *   address). These updates to CR2 occur even if the page fault
>      *   results in a double fault or occurs during the delivery of a
>      *   double fault.
>      *
>      * However, if we got here due to a non-page-fault exception while
>      * delivering a non-page-fault exception, CR2 may contain a
>      * stale value.
>      *
>      * As a heuristic: we consider this double fault to be a stack
>      * overflow if CR2 points to the guard page and RSP is either
>      * in the guard page or close enough to the bottom of the stack.
>      *
>      * We're limited to using heuristics here, since the CPU does
>      * not tell us what type of fault failed and, if the first fault
>      * wasn't a page fault, CR2 may contain stale garbage.  To
>      * mostly rule out garbage, we check if the saved RSP is close
>      * enough to the bottom of the stack to cause exception delivery
>      * to fail.  If RSP == tsk_stack + 48 and we take an exception,
>      * the stack is already aligned and there will be enough room for
>      * SS, RSP, RFLAGS, CS, RIP, and a possible error code.  With
>      * any less space left, exception delivery could fail.
>      */
>     tsk_stack = (unsigned long)task_stack_page(tsk);
>     if (regs->rsp < tsk_stack + 48 && regs->rsp > tsk_stack - PAGE_SIZE) {
>         /* A double-fault due to #PF delivery failure is plausible. */
>         cr2 = read_cr2();
>         if (tsk_stack - 1 - cr2 < PAGE_SIZE)
>             handle_stack_overflow(
>                 "kernel stack overflow (double-fault)",
>                 regs, cr2);
>     }

Actually I think your quote from the SDM contradicts this.  The second
#PF (when trying to invoke the page fault handler) would update CR2
with an address in the guard page.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks
  2016-06-27 17:09               ` Brian Gerst
@ 2016-06-27 17:23                 ` Brian Gerst
  0 siblings, 0 replies; 84+ messages in thread
From: Brian Gerst @ 2016-06-27 17:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Mon, Jun 27, 2016 at 1:09 PM, Brian Gerst <brgerst@gmail.com> wrote:
> On Mon, Jun 27, 2016 at 12:35 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Jun 27, 2016 at 9:17 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>> On Mon, Jun 27, 2016 at 11:54 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Mon, Jun 27, 2016 at 8:22 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Mon, Jun 27, 2016 at 8:12 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>>>>> On Mon, Jun 27, 2016 at 11:01 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>>>>>> On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>>>>>  #ifdef CONFIG_X86_64
>>>>>>>>  /* Runs on IST stack */
>>>>>>>>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>>>>>>  {
>>>>>>>>         static const char str[] = "double fault";
>>>>>>>>         struct task_struct *tsk = current;
>>>>>>>> +#ifdef CONFIG_VMAP_STACK
>>>>>>>> +       unsigned long cr2;
>>>>>>>> +#endif
>>>>>>>>
>>>>>>>>  #ifdef CONFIG_X86_ESPFIX64
>>>>>>>>         extern unsigned char native_irq_return_iret[];
>>>>>>>> @@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>>>>>>         tsk->thread.error_code = error_code;
>>>>>>>>         tsk->thread.trap_nr = X86_TRAP_DF;
>>>>>>>>
>>>>>>>> +#ifdef CONFIG_VMAP_STACK
>>>>>>>> +       /*
>>>>>>>> +        * If we overflow the stack into a guard page, the CPU will fail
>>>>>>>> +        * to deliver #PF and will send #DF instead.  CR2 will contain
>>>>>>>> +        * the linear address of the second fault, which will be in the
>>>>>>>> +        * guard page below the bottom of the stack.
>>>>>>>> +        */
>>>>>>>> +       cr2 = read_cr2();
>>>>>>>> +       if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
>>>>>>>> +               handle_stack_overflow(
>>>>>>>> +                       "kernel stack overflow (double-fault)",
>>>>>>>> +                       regs, cr2);
>>>>>>>> +#endif
>>>>>>>
>>>>>>> Is there any other way to tell if this was from a page fault?  If it
>>>>>>> wasn't a page fault then CR2 is undefined.
>>>>>>
>>>>>> I guess it doesn't really matter, since the fault is fatal either way.
>>>>>> The error message might be incorrect though.
>>>>>>
>>>>>
>>>>> It's at least worth a comment, though.  Maybe I should check if
>>>>> regs->rsp is within 40 bytes of the bottom of the stack, too, such
>>>>> that delivery of an inner fault would have double-faulted assuming the
>>>>> inner fault didn't use an IST vector.
>>>>>
>>>>
>>>> How about:
>>>>
>>>>     /*
>>>>      * If we overflow the stack into a guard page, the CPU will fail
>>>>      * to deliver #PF and will send #DF instead.  CR2 will contain
>>>>      * the linear address of the second fault, which will be in the
>>>>      * guard page below the bottom of the stack.
>>>>      *
>>>>      * We're limited to using heuristics here, since the CPU does
>>>>      * not tell us what type of fault failed and, if the first fault
>>>>      * wasn't a page fault, CR2 may contain stale garbage.  To mostly
>>>>      * rule out garbage, we check if the saved RSP is close enough to
>>>>      * the bottom of the stack to cause exception delivery to fail.
>>>>      * The worst case is 7 stack slots: one for alignment, five for
>>>>      * SS..RIP, and one for the error code.
>>>>      */
>>>>     tsk_stack = (unsigned long)task_stack_page(tsk);
>>>>     if (regs->rsp <= tsk_stack + 7*8 && regs->rsp > tsk_stack - PAGE_SIZE) {
>>>>         /* A double-fault due to #PF delivery failure is plausible. */
>>>>         cr2 = read_cr2();
>>>>         if (tsk_stack - 1 - cr2 < PAGE_SIZE)
>>>>             handle_stack_overflow(
>>>>                 "kernel stack overflow (double-fault)",
>>>>                 regs, cr2);
>>>>     }
>>>
>>> I think RSP anywhere in the guard page would be best, since it could
>>> have been decremented by a function prologue into the guard page
>>> before an access that triggers the page fault.
>>>
>>
>> I think that can miss some stack overflows.  Suppose that RSP points
>> very close to the bottom of the stack and we take an unrelated fault.
>> The CPU can fail to deliver that fault and we get a double fault
>> instead.  But I counted wrong, too.  Do you like this version and its
>> explanation?
>>
>>     /*
>>      * If we overflow the stack into a guard page, the CPU will fail
>>      * to deliver #PF and will send #DF instead.  Similarly, if we
>>      * take any non-IST exception while too close to the bottom of
>>      * the stack, the processor will get a page fault while
>>      * delivering the exception and will generate a double fault.
>>      *
>>      * According to the SDM (footnote in 6.15 under "Interrupt 14 -
>>      * Page-Fault Exception (#PF)"):
>>      *
>>      *   Processors update CR2 whenever a page fault is detected. If a
>>      *   second page fault occurs while an earlier page fault is being
>>      *   delivered, the faulting linear address of the second fault will
>>      *   overwrite the contents of CR2 (replacing the previous
>>      *   address). These updates to CR2 occur even if the page fault
>>      *   results in a double fault or occurs during the delivery of a
>>      *   double fault.
>>      *
>>      * However, if we got here due to a non-page-fault exception while
>>      * delivering a non-page-fault exception, CR2 may contain a
>>      * stale value.
>>      *
>>      * As a heuristic: we consider this double fault to be a stack
>>      * overflow if CR2 points to the guard page and RSP is either
>>      * in the guard page or close enough to the bottom of the stack.
>>      *
>>      * We're limited to using heuristics here, since the CPU does
>>      * not tell us what type of fault failed and, if the first fault
>>      * wasn't a page fault, CR2 may contain stale garbage.  To
>>      * mostly rule out garbage, we check if the saved RSP is close
>>      * enough to the bottom of the stack to cause exception delivery
>>      * to fail.  If RSP == tsk_stack + 48 and we take an exception,
>>      * the stack is already aligned and there will be enough room for
>>      * SS, RSP, RFLAGS, CS, RIP, and a possible error code.  With
>>      * any less space left, exception delivery could fail.
>>      */
>>     tsk_stack = (unsigned long)task_stack_page(tsk);
>>     if (regs->rsp < tsk_stack + 48 && regs->rsp > tsk_stack - PAGE_SIZE) {
>>         /* A double-fault due to #PF delivery failure is plausible. */
>>         cr2 = read_cr2();
>>         if (tsk_stack - 1 - cr2 < PAGE_SIZE)
>>             handle_stack_overflow(
>>                 "kernel stack overflow (double-fault)",
>>                 regs, cr2);
>>     }
>
> Actually I think your quote from the SDM contradicts this.  The second
> #PF (when trying to invoke the page fault handler) would update CR2
> with an address in the guard page.

Never mind, I'm totally misunderstanding this.  I guess the real
question is when RSP is updated: during each push, or once before or
after the whole frame is pushed?  Checking for the bottom of the valid
stack covers the case where RSP isn't updated until after the whole
frame has been pushed successfully.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks
  2016-06-27 15:54         ` Andy Lutomirski
  2016-06-27 16:17           ` Brian Gerst
@ 2016-06-27 17:28           ` Linus Torvalds
  2016-06-27 17:30             ` Andy Lutomirski
  1 sibling, 1 reply; 84+ messages in thread
From: Linus Torvalds @ 2016-06-27 17:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Brian Gerst, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Josh Poimboeuf,
	Jann Horn, Heiko Carstens

On Mon, Jun 27, 2016 at 8:54 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> How about:
>
>     tsk_stack = (unsigned long)task_stack_page(tsk);
>     if (regs->rsp <= tsk_stack + 7*8 && regs->rsp > tsk_stack - PAGE_SIZE) {

I'm not at all convinced that regs->rsp will be all that reliable
under a double-fault scenario either. I'd be more inclined to trust
cr2 than the register state.

It's true that double faults can happen for *other* reasons entirely,
and as such it's not clear that %cr2 is reliable either, but since
this is all just about a printout, I'd rather go that way anyway.

                 Linus

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks
  2016-06-27 17:28           ` Linus Torvalds
@ 2016-06-27 17:30             ` Andy Lutomirski
  0 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-27 17:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Brian Gerst, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Josh Poimboeuf,
	Jann Horn, Heiko Carstens

On Mon, Jun 27, 2016 at 10:28 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jun 27, 2016 at 8:54 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> How about:
>>
>>     tsk_stack = (unsigned long)task_stack_page(tsk);
>>     if (regs->rsp <= tsk_stack + 7*8 && regs->rsp > tsk_stack - PAGE_SIZE) {
>
> I'm not at all convinced that regs->rsp will be all that reliable
> under a double-fault scenario either. I'd be more inclined to trusr
> cr2 than the register state.
>
> It's true that double faults can happen for *other* reasons entirely,
> and as such it's not clear that %cr2 is reliable either, but since
> this is all just about a printout, I'd rather go that way anyway.

Fair enough.  The chance that we get #GP-in-#GP or similar while CR2
coincidentally points to the guard page is quite low.  I'll add all
the details to the comment but I'll leave the code alone.

FWIW, the manual only says that CS and RIP are untrustworthy, not that
RSP is untrustworthy, but it doesn't specify *what* RSP would contain
anywhere I can find.  I don't think this is important enough to start
harassing the Intel and AMD folks over.

--Andy

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 01/29] bluetooth: Switch SMP to crypto_cipher_encrypt_one()
  2016-06-27  8:54     ` Ingo Molnar
@ 2016-06-27 22:30       ` Marcel Holtmann
  2016-06-27 22:33         ` Andy Lutomirski
  0 siblings, 1 reply; 84+ messages in thread
From: Marcel Holtmann @ 2016-06-27 22:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, x86, LKML, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens,
	Gustavo F. Padovan, Johan Hedberg, David S. Miller,
	linux-bluetooth, netdev

Hi Ingo,

>>> SMP does ECB crypto on stack buffers.  This is complicated and
>>> fragile, and it will not work if the stack is virtually allocated.
>>> 
>>> Switch to the crypto_cipher interface, which is simpler and safer.
>>> 
>>> Cc: Marcel Holtmann <marcel@holtmann.org>
>>> Cc: Gustavo Padovan <gustavo@padovan.org>
>>> Cc: Johan Hedberg <johan.hedberg@gmail.com>
>>> Cc: "David S. Miller" <davem@davemloft.net>
>>> Cc: linux-bluetooth@vger.kernel.org
>>> Cc: netdev@vger.kernel.org
>>> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
>>> Acked-and-tested-by: Johan Hedberg <johan.hedberg@intel.com>
>>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>>> ---
>>> net/bluetooth/smp.c | 67 ++++++++++++++++++++++-------------------------------
>>> 1 file changed, 28 insertions(+), 39 deletions(-)
>> 
>> patch has been applied to bluetooth-next tree.
> 
> Sadly carrying this separately will delay the virtual kernel stacks feature by a 
> kernel cycle, because it's a must-have prerequisite.

I can take it back out, but then I have the fear that the ECDH change to use KPP for SMP might be the one that has to wait a kernel cycle. Either way is fine with me, but I want to avoid nasty merge conflicts in the Bluetooth SMP code.

Regards

Marcel
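
(A minimal sketch of the crypto_cipher single-block interface the quoted
commit message switches SMP to: crypto_cipher works on plain buffers
rather than scatterlists, so stack buffers stay safe even with vmapped
stacks.  The function and buffer names below are made up for
illustration and are not taken from the SMP code.)

    #include <linux/crypto.h>
    #include <linux/err.h>
    #include <linux/types.h>

    /* Encrypt one 16-byte block with AES-ECB via the crypto_cipher API. */
    static int example_encrypt_one_block(const u8 key[16],
                                         const u8 in[16], u8 out[16])
    {
            struct crypto_cipher *tfm;
            int err;

            tfm = crypto_alloc_cipher("aes", 0, 0);
            if (IS_ERR(tfm))
                    return PTR_ERR(tfm);

            err = crypto_cipher_setkey(tfm, key, 16);
            if (!err)
                    crypto_cipher_encrypt_one(tfm, out, in);

            crypto_free_cipher(tfm);
            return err;
    }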

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 01/29] bluetooth: Switch SMP to crypto_cipher_encrypt_one()
  2016-06-27 22:30       ` Marcel Holtmann
@ 2016-06-27 22:33         ` Andy Lutomirski
  2016-07-04 17:56           ` Marcel Holtmann
  0 siblings, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-27 22:33 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Ingo Molnar, Andy Lutomirski, X86 ML, LKML, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Gustavo F. Padovan, Johan Hedberg,
	David S. Miller, linux-bluetooth, Network Development

On Mon, Jun 27, 2016 at 3:30 PM, Marcel Holtmann <marcel@holtmann.org> wrote:
> Hi Ingo,
>
>>>> SMP does ECB crypto on stack buffers.  This is complicated and
>>>> fragile, and it will not work if the stack is virtually allocated.
>>>>
>>>> Switch to the crypto_cipher interface, which is simpler and safer.
>>>>
>>>> Cc: Marcel Holtmann <marcel@holtmann.org>
>>>> Cc: Gustavo Padovan <gustavo@padovan.org>
>>>> Cc: Johan Hedberg <johan.hedberg@gmail.com>
>>>> Cc: "David S. Miller" <davem@davemloft.net>
>>>> Cc: linux-bluetooth@vger.kernel.org
>>>> Cc: netdev@vger.kernel.org
>>>> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
>>>> Acked-and-tested-by: Johan Hedberg <johan.hedberg@intel.com>
>>>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>>>> ---
>>>> net/bluetooth/smp.c | 67 ++++++++++++++++++++++-------------------------------
>>>> 1 file changed, 28 insertions(+), 39 deletions(-)
>>>
>>> patch has been applied to bluetooth-next tree.
>>
>> Sadly carrying this separately will delay the virtual kernel stacks feature by a
>> kernel cycle, because it's a must-have prerequisite.
>
> I can take it back out, but then I have the fear that the ECDH change to use KPP for SMP might be the one that has to wait a kernel cycle. Either way is fine with me, but I want to avoid nasty merge conflicts in the Bluetooth SMP code.

Nothing goes wrong if an identical patch is queued in both places,
right?  Or, if you prefer not to duplicate it, could one of you commit
it and the other one pull it?  Ingo, given that this is patch 1 in the
series and unlikely to change, if you want to make this whole thing
have a separate branch in -tip, this could live there for starters.
(But, if you do so, please make sure you base off a very new copy of
Linus' tree -- the series is heavily dependent on the thread_info
change he applied a few days ago.)

--Andy

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (28 preceding siblings ...)
  2016-06-26 21:55 ` [PATCH v4 29/29] fork: Cache two thread stacks per cpu if CONFIG_VMAP_STACK is set Andy Lutomirski
@ 2016-06-28  7:32 ` David Howells
  2016-06-28  7:37   ` Herbert Xu
  2016-06-28  9:07   ` David Howells
  2016-06-28  7:41 ` David Howells
                   ` (2 subsequent siblings)
  32 siblings, 2 replies; 84+ messages in thread
From: David Howells @ 2016-06-28  7:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, x86, linux-kernel, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens,
	Herbert Xu

Andy Lutomirski <luto@kernel.org> wrote:

> @@ -277,6 +277,7 @@ struct rxrpc_connection {
>  	struct key		*key;		/* security for this connection (client) */
>  	struct key		*server_key;	/* security for this service */
>  	struct crypto_skcipher	*cipher;	/* encryption handle */
> +	struct rxrpc_crypt	csum_iv_head;	/* leading block for csum_iv */
>  	struct rxrpc_crypt	csum_iv;	/* packet checksum base */
>  	unsigned long		events;
>  #define RXRPC_CONN_CHALLENGE	0		/* send challenge packet */

NAK.  This won't work.  csum_iv_head is per packet being processed, but you've
put it in rxrpc_connection which is shared amongst several creators/digestors
of packets.  Putting it in rxrpc_call won't work either since it's also needed
for connection level packets.

David

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad
  2016-06-28  7:32 ` [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad David Howells
@ 2016-06-28  7:37   ` Herbert Xu
  2016-06-28  9:07   ` David Howells
  1 sibling, 0 replies; 84+ messages in thread
From: Herbert Xu @ 2016-06-28  7:37 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, x86, linux-kernel, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 28, 2016 at 08:32:46AM +0100, David Howells wrote:
> Andy Lutomirski <luto@kernel.org> wrote:
> 
> > @@ -277,6 +277,7 @@ struct rxrpc_connection {
> >  	struct key		*key;		/* security for this connection (client) */
> >  	struct key		*server_key;	/* security for this service */
> >  	struct crypto_skcipher	*cipher;	/* encryption handle */
> > +	struct rxrpc_crypt	csum_iv_head;	/* leading block for csum_iv */
> >  	struct rxrpc_crypt	csum_iv;	/* packet checksum base */
> >  	unsigned long		events;
> >  #define RXRPC_CONN_CHALLENGE	0		/* send challenge packet */
> 
> NAK.  This won't work.  csum_iv_head is per packet being processed, but you've
> put it in rxrpc_connection which is shared amongst several creators/digestors
> of packets.  Putting it in rxrpc_call won't work either since it's also needed
> for connection level packets.

Huh? If you can't write to csum_iv_head without clobbering others
then by the same reasoning you can't write to csum_iv either.  So
unless you're saying the existing code is already broken, there
is nothing wrong with the patch.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (29 preceding siblings ...)
  2016-06-28  7:32 ` [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad David Howells
@ 2016-06-28  7:41 ` David Howells
  2016-06-28  7:52 ` David Howells
  2016-06-29  7:06 ` [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Mika Penttilä
  32 siblings, 0 replies; 84+ messages in thread
From: David Howells @ 2016-06-28  7:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, x86, linux-kernel, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens,
	Herbert Xu

You should also note there's a pile of rxrpc patches in net-next that might
cause your patch problems.

David

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (30 preceding siblings ...)
  2016-06-28  7:41 ` David Howells
@ 2016-06-28  7:52 ` David Howells
  2016-06-28  7:55   ` Herbert Xu
  2016-06-28  8:54   ` David Howells
  2016-06-29  7:06 ` [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Mika Penttilä
  32 siblings, 2 replies; 84+ messages in thread
From: David Howells @ 2016-06-28  7:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, x86, linux-kernel, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens,
	Herbert Xu

Andy Lutomirski <luto@kernel.org> wrote:

> -	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
> +	skcipher_request_set_crypt(req, &sg, &sg, sizeof(tmpbuf), iv.x);

Don't the sg's have to be different?  Aren't they both altered by the process
of reading/writing from them?

> 	struct rxrpc_skb_priv *sp;
> ...
> +	swap(tmpbuf.xl, *(__be64 *)sp);
> +
> +	sg_init_one(&sg, sp, sizeof(tmpbuf));

????  I assume you're assuming that the rxrpc_skb_priv struct contents can
be arbitrarily replaced temporarily...

And using an XCHG-equivalent instruction?  This won't work on a 32-bit arch
(apart from one that sports CMPXCHG8 or similar).

>  /*
> - * load a scatterlist with a potentially split-page buffer
> + * load a scatterlist
>   */
> -static void rxkad_sg_set_buf2(struct scatterlist sg[2],
> +static void rxkad_sg_set_buf2(struct scatterlist sg[1],
>  			      void *buf, size_t buflen)
>  {
> -	int nsg = 1;
> -
> -	sg_init_table(sg, 2);
> -
> +	sg_init_table(sg, 1);
>  	sg_set_buf(&sg[0], buf, buflen);
> -	if (sg[0].offset + buflen > PAGE_SIZE) {
> -		/* the buffer was split over two pages */
> -		sg[0].length = PAGE_SIZE - sg[0].offset;
> -		sg_set_buf(&sg[1], buf + sg[0].length, buflen - sg[0].length);
> -		nsg++;
> -	}
> -
> -	sg_mark_end(&sg[nsg - 1]);
> -
> -	ASSERTCMP(sg[0].length + sg[1].length, ==, buflen);
>  }

This should be a separate patch.

David

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad
  2016-06-28  7:52 ` David Howells
@ 2016-06-28  7:55   ` Herbert Xu
  2016-06-28  8:54   ` David Howells
  1 sibling, 0 replies; 84+ messages in thread
From: Herbert Xu @ 2016-06-28  7:55 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, x86, linux-kernel, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 28, 2016 at 08:52:20AM +0100, David Howells wrote:
> Andy Lutomirski <luto@kernel.org> wrote:
> 
> > -	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
> > +	skcipher_request_set_crypt(req, &sg, &sg, sizeof(tmpbuf), iv.x);
> 
> Don't the sg's have to be different?  Aren't they both altered by the process
> of reading/writing from them?

No they don't have to be different.

> > 	struct rxrpc_skb_priv *sp;
> > ...
> > +	swap(tmpbuf.xl, *(__be64 *)sp);
> > +
> > +	sg_init_one(&sg, sp, sizeof(tmpbuf));
> 
> ????  I assume you're assuming that the rxrpc_skb_priv struct contents can
> arbitrarily replaced temporarily...

Of course you can, it's per-skb state.

> And using an XCHG-equivalent instruction?  This won't work on a 32-bit arch
> (apart from one that sports CMPXCHG8 or similar).

No this is not using an atomic xchg, whatever gave you that idea?
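
(For reference, the swap() used in the patch is the plain helper macro
from include/linux/kernel.h, not an atomic exchange instruction;
roughly:)

    #define swap(a, b) \
            do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)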
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad
  2016-06-28  7:52 ` David Howells
  2016-06-28  7:55   ` Herbert Xu
@ 2016-06-28  8:54   ` David Howells
  2016-06-28  9:43     ` Herbert Xu
                       ` (2 more replies)
  1 sibling, 3 replies; 84+ messages in thread
From: David Howells @ 2016-06-28  8:54 UTC (permalink / raw)
  To: Herbert Xu
  Cc: dhowells, Andy Lutomirski, x86, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens

Herbert Xu <herbert@gondor.apana.org.au> wrote:

> > ????  I assume you're assuming that the rxrpc_skb_priv struct contents can
> > arbitrarily replaced temporarily...
> 
> Of course you can, it's per-skb state.

I'm using the per-skb state for my own purposes and might be looking at it
elsewhere at the same time.

David

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad
  2016-06-28  7:32 ` [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad David Howells
  2016-06-28  7:37   ` Herbert Xu
@ 2016-06-28  9:07   ` David Howells
  2016-06-28  9:45     ` Herbert Xu
  1 sibling, 1 reply; 84+ messages in thread
From: David Howells @ 2016-06-28  9:07 UTC (permalink / raw)
  To: Herbert Xu
  Cc: dhowells, Andy Lutomirski, x86, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens

Herbert Xu <herbert@gondor.apana.org.au> wrote:

> Huh? If you can't write to csum_iv_head without clobbering others
> then by the same reasoning you can't write to csum_iv either.  So
> unless you're saying the existing code is already broken then there
> is nothing wrong with the patch.

Ah, for some reason I read it as being in the normal packet processing.  Need
tea before I read security patches ;-)

Since it's (more or less) a one-off piece of memory, why not kmalloc it
temporarily rather than expanding the connection struct?  Also, the bit where
you put a second rxrpc_crypt in just so that it happens to give you a 16-byte
slot by adjacency is pretty icky.  It would be much better to use a union
instead:

	union {
		struct rxrpc_crypt	csum_iv; /* packet checksum base */
		__be32 tmpbuf[4];
	};

Note also that the above doesn't guarantee that the struct will be inside of a
single page.  It would need an alignment of 16 for that - but you only have
one sg.  Could that be a problem?

David

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad
  2016-06-28  8:54   ` David Howells
@ 2016-06-28  9:43     ` Herbert Xu
  2016-06-28 10:00     ` David Howells
  2016-06-28 13:23     ` David Howells
  2 siblings, 0 replies; 84+ messages in thread
From: Herbert Xu @ 2016-06-28  9:43 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, x86, linux-kernel, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 28, 2016 at 09:54:23AM +0100, David Howells wrote:
> 
> I'm using the per-skb state for my own purposes and might be looking at it
> elsewhere at the same time.

AFAICS this cannot happen for secure_packet/verify_packet.  In both
cases we have exclusive ownership of the skb.

But it's your code so feel free to send your own patch.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad
  2016-06-28  9:07   ` David Howells
@ 2016-06-28  9:45     ` Herbert Xu
  0 siblings, 0 replies; 84+ messages in thread
From: Herbert Xu @ 2016-06-28  9:45 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, x86, linux-kernel, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 28, 2016 at 10:07:44AM +0100, David Howells wrote:
>
> Since it's (more or less) a one off piece of memory, why not kmalloc it
> temporarily rather than expanding the connection struct?  Also, the bit where
> you put a second rxrpc_crypt in just so that it happens to give you a 16-byte
> slot by adjacency is pretty icky.  It would be much better to use a union
> instead:
> 
> 	union {
> 		struct rxrpc_crypt	csum_iv; /* packet checksum base */
> 		__be32 tmpbuf[4];
> 	};

Feel free to send your own patch to do this.

> Note also that the above doesn't guarantee that the struct will be inside of a
> single page.  It would need an alignment of 16 for that - but you only have
> one sg.  Could that be a problem?

No it's not a problem.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad
  2016-06-28  8:54   ` David Howells
  2016-06-28  9:43     ` Herbert Xu
@ 2016-06-28 10:00     ` David Howells
  2016-06-28 13:23     ` David Howells
  2 siblings, 0 replies; 84+ messages in thread
From: David Howells @ 2016-06-28 10:00 UTC (permalink / raw)
  To: Herbert Xu
  Cc: dhowells, Andy Lutomirski, x86, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens

Herbert Xu <herbert@gondor.apana.org.au> wrote:

> > I'm using the per-skb state for my own purposes and might be looking at it
> > elsewhere at the same time.
> 
> AFAICS this cannot happen for secure_packet/verify_packet.  In both
> cases we have exclusive ownership of the skb.

In code I'm busy working on, the packet I'm decrypting may be on the receive
queue several times.  rxrpc has a jumbo packet concept whereby a packet may be
constructed in such a way that it's actually several packets stitched together
- the idea being that a router can split it up (not that any actually do that,
as far as I know) - but each segment of the jumbo packet may be enqueued as a
separate entity.

> But it's your code so feel free to send your own patch.

I will apply something very similar to my tree.  Andy's patch does not apply
as-is due to conflicts.

David

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad
  2016-06-28  8:54   ` David Howells
  2016-06-28  9:43     ` Herbert Xu
  2016-06-28 10:00     ` David Howells
@ 2016-06-28 13:23     ` David Howells
  2 siblings, 0 replies; 84+ messages in thread
From: David Howells @ 2016-06-28 13:23 UTC (permalink / raw)
  To: Herbert Xu, Andy Lutomirski
  Cc: dhowells, x86, linux-kernel, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Josh Poimboeuf, Jann Horn, Heiko Carstens

I'm going to commit this patch to my tree.  Hopefully, this should appear in
net-next shortly.

David
---
commit 4da137ed8a467d01f87ac84ceb2a7af8719e0136
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Sun Jun 26 14:55:24 2016 -0700

    rxrpc: Avoid using stack memory in SG lists in rxkad
    
    rxkad uses stack memory in SG lists which would not work if stacks were
    allocated from vmalloc memory.  In fact, in most cases this isn't even
    necessary as the stack memory ends up getting copied over to kmalloc
    memory.
    
    This patch eliminates all the unnecessary stack memory uses by supplying
    the final destination directly to the crypto API.  In two instances where a
    temporary buffer is actually needed, we also switch to using a scratch area in
    the rxrpc_call struct (only one DATA packet will be being secured or
    verified at a time).
    
    Finally there is no need to split a split-page buffer into two SG entries
    so code dealing with that has been removed.
    
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    Signed-off-by: Andy Lutomirski <luto@kernel.org>
    Signed-off-by: David Howells <dhowells@redhat.com>

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 702db72196fb..796368d1fb25 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -141,17 +141,16 @@ struct rxrpc_security {
 	int (*init_connection_security)(struct rxrpc_connection *);
 
 	/* prime a connection's packet security */
-	void (*prime_packet_security)(struct rxrpc_connection *);
+	int (*prime_packet_security)(struct rxrpc_connection *);
 
 	/* impose security on a packet */
-	int (*secure_packet)(const struct rxrpc_call *,
+	int (*secure_packet)(struct rxrpc_call *,
 			     struct sk_buff *,
 			     size_t,
 			     void *);
 
 	/* verify the security on a received packet */
-	int (*verify_packet)(const struct rxrpc_call *, struct sk_buff *,
-			     u32 *);
+	int (*verify_packet)(struct rxrpc_call *, struct sk_buff *, u32 *);
 
 	/* issue a challenge */
 	int (*issue_challenge)(struct rxrpc_connection *);
@@ -399,6 +398,7 @@ struct rxrpc_call {
 	struct sk_buff_head	rx_oos_queue;	/* packets received out of sequence */
 	struct sk_buff		*tx_pending;	/* Tx socket buffer being filled */
 	wait_queue_head_t	tx_waitq;	/* wait for Tx window space to become available */
+	__be32			crypto_buf[2];	/* Temporary packet crypto buffer */
 	unsigned long		user_call_ID;	/* user-defined call ID */
 	unsigned long		creation_jif;	/* time of call creation */
 	unsigned long		flags;
diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
index bf6971555eac..6a3c96707831 100644
--- a/net/rxrpc/conn_event.c
+++ b/net/rxrpc/conn_event.c
@@ -188,7 +188,10 @@ static int rxrpc_process_event(struct rxrpc_connection *conn,
 		if (ret < 0)
 			return ret;
 
-		conn->security->prime_packet_security(conn);
+		ret = conn->security->prime_packet_security(conn);
+		if (ret < 0)
+			return ret;
+
 		read_lock_bh(&conn->lock);
 		spin_lock(&conn->state_lock);
 
diff --git a/net/rxrpc/conn_object.c b/net/rxrpc/conn_object.c
index 4bfad7cf96cb..35b36beb4684 100644
--- a/net/rxrpc/conn_object.c
+++ b/net/rxrpc/conn_object.c
@@ -138,7 +138,9 @@ rxrpc_alloc_client_connection(struct rxrpc_conn_parameters *cp, gfp_t gfp)
 	if (ret < 0)
 		goto error_1;
 
-	conn->security->prime_packet_security(conn);
+	ret = conn->security->prime_packet_security(conn);
+	if (ret < 0)
+		goto error_2;
 
 	write_lock(&rxrpc_connection_lock);
 	list_add_tail(&conn->link, &rxrpc_connections);
@@ -152,6 +154,8 @@ rxrpc_alloc_client_connection(struct rxrpc_conn_parameters *cp, gfp_t gfp)
 	_leave(" = %p", conn);
 	return conn;
 
+error_2:
+	conn->security->clear(conn);
 error_1:
 	rxrpc_put_client_connection_id(conn);
 error_0:
diff --git a/net/rxrpc/insecure.c b/net/rxrpc/insecure.c
index e571403613c1..c21ad213b337 100644
--- a/net/rxrpc/insecure.c
+++ b/net/rxrpc/insecure.c
@@ -17,11 +17,12 @@ static int none_init_connection_security(struct rxrpc_connection *conn)
 	return 0;
 }
 
-static void none_prime_packet_security(struct rxrpc_connection *conn)
+static int none_prime_packet_security(struct rxrpc_connection *conn)
 {
+	return 0;
 }
 
-static int none_secure_packet(const struct rxrpc_call *call,
+static int none_secure_packet(struct rxrpc_call *call,
 			       struct sk_buff *skb,
 			       size_t data_size,
 			       void *sechdr)
@@ -29,7 +30,7 @@ static int none_secure_packet(const struct rxrpc_call *call,
 	return 0;
 }
 
-static int none_verify_packet(const struct rxrpc_call *call,
+static int none_verify_packet(struct rxrpc_call *call,
 			       struct sk_buff *skb,
 			       u32 *_abort_code)
 {
diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index 23c05ec6fa28..3acc7c1241d4 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -103,43 +103,43 @@ error:
  * prime the encryption state with the invariant parts of a connection's
  * description
  */
-static void rxkad_prime_packet_security(struct rxrpc_connection *conn)
+static int rxkad_prime_packet_security(struct rxrpc_connection *conn)
 {
 	struct rxrpc_key_token *token;
 	SKCIPHER_REQUEST_ON_STACK(req, conn->cipher);
-	struct scatterlist sg[2];
+	struct scatterlist sg;
 	struct rxrpc_crypt iv;
-	struct {
-		__be32 x[4];
-	} tmpbuf __attribute__((aligned(16))); /* must all be in same page */
+	__be32 *tmpbuf;
+	size_t tmpsize = 4 * sizeof(__be32);
 
 	_enter("");
 
 	if (!conn->params.key)
-		return;
+		return 0;
+
+	tmpbuf = kmalloc(tmpsize, GFP_KERNEL);
+	if (!tmpbuf)
+		return -ENOMEM;
 
 	token = conn->params.key->payload.data[0];
 	memcpy(&iv, token->kad->session_key, sizeof(iv));
 
-	tmpbuf.x[0] = htonl(conn->proto.epoch);
-	tmpbuf.x[1] = htonl(conn->proto.cid);
-	tmpbuf.x[2] = 0;
-	tmpbuf.x[3] = htonl(conn->security_ix);
-
-	sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-	sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+	tmpbuf[0] = htonl(conn->proto.epoch);
+	tmpbuf[1] = htonl(conn->proto.cid);
+	tmpbuf[2] = 0;
+	tmpbuf[3] = htonl(conn->security_ix);
 
+	sg_init_one(&sg, tmpbuf, tmpsize);
 	skcipher_request_set_tfm(req, conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
-
+	skcipher_request_set_crypt(req, &sg, &sg, tmpsize, iv.x);
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 
-	memcpy(&conn->csum_iv, &tmpbuf.x[2], sizeof(conn->csum_iv));
-	ASSERTCMP((u32 __force)conn->csum_iv.n[0], ==, (u32 __force)tmpbuf.x[2]);
-
-	_leave("");
+	memcpy(&conn->csum_iv, tmpbuf + 2, sizeof(conn->csum_iv));
+	kfree(tmpbuf);
+	_leave(" = 0");
+	return 0;
 }
 
 /*
@@ -152,12 +152,9 @@ static int rxkad_secure_packet_auth(const struct rxrpc_call *call,
 {
 	struct rxrpc_skb_priv *sp;
 	SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
+	struct rxkad_level1_hdr hdr;
 	struct rxrpc_crypt iv;
-	struct scatterlist sg[2];
-	struct {
-		struct rxkad_level1_hdr hdr;
-		__be32	first;	/* first four bytes of data and padding */
-	} tmpbuf __attribute__((aligned(8))); /* must all be in same page */
+	struct scatterlist sg;
 	u16 check;
 
 	sp = rxrpc_skb(skb);
@@ -167,24 +164,19 @@ static int rxkad_secure_packet_auth(const struct rxrpc_call *call,
 	check = sp->hdr.seq ^ sp->hdr.callNumber;
 	data_size |= (u32)check << 16;
 
-	tmpbuf.hdr.data_size = htonl(data_size);
-	memcpy(&tmpbuf.first, sechdr + 4, sizeof(tmpbuf.first));
+	hdr.data_size = htonl(data_size);
+	memcpy(sechdr, &hdr, sizeof(hdr));
 
 	/* start the encryption afresh */
 	memset(&iv, 0, sizeof(iv));
 
-	sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-	sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
-
+	sg_init_one(&sg, sechdr, 8);
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
-
+	skcipher_request_set_crypt(req, &sg, &sg, 8, iv.x);
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 
-	memcpy(sechdr, &tmpbuf, sizeof(tmpbuf));
-
 	_leave(" = 0");
 	return 0;
 }
@@ -198,8 +190,7 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 				       void *sechdr)
 {
 	const struct rxrpc_key_token *token;
-	struct rxkad_level2_hdr rxkhdr
-		__attribute__((aligned(8))); /* must be all on one page */
+	struct rxkad_level2_hdr rxkhdr;
 	struct rxrpc_skb_priv *sp;
 	SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
 	struct rxrpc_crypt iv;
@@ -218,18 +209,16 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 
 	rxkhdr.data_size = htonl(data_size | (u32)check << 16);
 	rxkhdr.checksum = 0;
+	memcpy(sechdr, &rxkhdr, sizeof(rxkhdr));
 
 	/* encrypt from the session key */
 	token = call->conn->params.key->payload.data[0];
 	memcpy(&iv, token->kad->session_key, sizeof(iv));
 
 	sg_init_one(&sg[0], sechdr, sizeof(rxkhdr));
-	sg_init_one(&sg[1], &rxkhdr, sizeof(rxkhdr));
-
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(rxkhdr), iv.x);
-
+	skcipher_request_set_crypt(req, &sg[0], &sg[0], sizeof(rxkhdr), iv.x);
 	crypto_skcipher_encrypt(req);
 
 	/* we want to encrypt the skbuff in-place */
@@ -243,9 +232,7 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 
 	sg_init_table(sg, nsg);
 	skb_to_sgvec(skb, sg, 0, len);
-
 	skcipher_request_set_crypt(req, sg, sg, len, iv.x);
-
 	crypto_skcipher_encrypt(req);
 
 	_leave(" = 0");
@@ -259,7 +246,7 @@ out:
 /*
  * checksum an RxRPC packet header
  */
-static int rxkad_secure_packet(const struct rxrpc_call *call,
+static int rxkad_secure_packet(struct rxrpc_call *call,
 			       struct sk_buff *skb,
 			       size_t data_size,
 			       void *sechdr)
@@ -267,10 +254,7 @@ static int rxkad_secure_packet(const struct rxrpc_call *call,
 	struct rxrpc_skb_priv *sp;
 	SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
 	struct rxrpc_crypt iv;
-	struct scatterlist sg[2];
-	struct {
-		__be32 x[2];
-	} tmpbuf __attribute__((aligned(8))); /* must all be in same page */
+	struct scatterlist sg;
 	u32 x, y;
 	int ret;
 
@@ -293,20 +277,17 @@ static int rxkad_secure_packet(const struct rxrpc_call *call,
 	/* calculate the security checksum */
 	x = call->channel << (32 - RXRPC_CIDSHIFT);
 	x |= sp->hdr.seq & 0x3fffffff;
-	tmpbuf.x[0] = htonl(sp->hdr.callNumber);
-	tmpbuf.x[1] = htonl(x);
-
-	sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-	sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+	call->crypto_buf[0] = htonl(sp->hdr.callNumber);
+	call->crypto_buf[1] = htonl(x);
 
+	sg_init_one(&sg, call->crypto_buf, 8);
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
-
+	skcipher_request_set_crypt(req, &sg, &sg, 8, iv.x);
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 
-	y = ntohl(tmpbuf.x[1]);
+	y = ntohl(call->crypto_buf[1]);
 	y = (y >> 16) & 0xffff;
 	if (y == 0)
 		y = 1; /* zero checksums are not permitted */
@@ -367,7 +348,6 @@ static int rxkad_verify_packet_auth(const struct rxrpc_call *call,
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
 	skcipher_request_set_crypt(req, sg, sg, 8, iv.x);
-
 	crypto_skcipher_decrypt(req);
 	skcipher_request_zero(req);
 
@@ -452,7 +432,6 @@ static int rxkad_verify_packet_encrypt(const struct rxrpc_call *call,
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
 	skcipher_request_set_crypt(req, sg, sg, skb->len, iv.x);
-
 	crypto_skcipher_decrypt(req);
 	skcipher_request_zero(req);
 	if (sg != _sg)
@@ -498,17 +477,14 @@ nomem:
 /*
  * verify the security on a received packet
  */
-static int rxkad_verify_packet(const struct rxrpc_call *call,
+static int rxkad_verify_packet(struct rxrpc_call *call,
 			       struct sk_buff *skb,
 			       u32 *_abort_code)
 {
 	SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
 	struct rxrpc_skb_priv *sp;
 	struct rxrpc_crypt iv;
-	struct scatterlist sg[2];
-	struct {
-		__be32 x[2];
-	} tmpbuf __attribute__((aligned(8))); /* must all be in same page */
+	struct scatterlist sg;
 	u16 cksum;
 	u32 x, y;
 	int ret;
@@ -533,20 +509,17 @@ static int rxkad_verify_packet(const struct rxrpc_call *call,
 	/* validate the security checksum */
 	x = call->channel << (32 - RXRPC_CIDSHIFT);
 	x |= sp->hdr.seq & 0x3fffffff;
-	tmpbuf.x[0] = htonl(call->call_id);
-	tmpbuf.x[1] = htonl(x);
-
-	sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-	sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+	call->crypto_buf[0] = htonl(call->call_id);
+	call->crypto_buf[1] = htonl(x);
 
+	sg_init_one(&sg, call->crypto_buf, 8);
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
-
+	skcipher_request_set_crypt(req, &sg, &sg, 8, iv.x);
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 
-	y = ntohl(tmpbuf.x[1]);
+	y = ntohl(call->crypto_buf[1]);
 	cksum = (y >> 16) & 0xffff;
 	if (cksum == 0)
 		cksum = 1; /* zero checksums are not permitted */
@@ -710,29 +683,6 @@ static void rxkad_calc_response_checksum(struct rxkad_response *response)
 }
 
 /*
- * load a scatterlist with a potentially split-page buffer
- */
-static void rxkad_sg_set_buf2(struct scatterlist sg[2],
-			      void *buf, size_t buflen)
-{
-	int nsg = 1;
-
-	sg_init_table(sg, 2);
-
-	sg_set_buf(&sg[0], buf, buflen);
-	if (sg[0].offset + buflen > PAGE_SIZE) {
-		/* the buffer was split over two pages */
-		sg[0].length = PAGE_SIZE - sg[0].offset;
-		sg_set_buf(&sg[1], buf + sg[0].length, buflen - sg[0].length);
-		nsg++;
-	}
-
-	sg_mark_end(&sg[nsg - 1]);
-
-	ASSERTCMP(sg[0].length + sg[1].length, ==, buflen);
-}
-
-/*
  * encrypt the response packet
  */
 static void rxkad_encrypt_response(struct rxrpc_connection *conn,
@@ -741,17 +691,16 @@ static void rxkad_encrypt_response(struct rxrpc_connection *conn,
 {
 	SKCIPHER_REQUEST_ON_STACK(req, conn->cipher);
 	struct rxrpc_crypt iv;
-	struct scatterlist sg[2];
+	struct scatterlist sg[1];
 
 	/* continue encrypting from where we left off */
 	memcpy(&iv, s2->session_key, sizeof(iv));
 
-	rxkad_sg_set_buf2(sg, &resp->encrypted, sizeof(resp->encrypted));
-
+	sg_init_table(sg, 1);
+	sg_set_buf(sg, &resp->encrypted, sizeof(resp->encrypted));
 	skcipher_request_set_tfm(req, conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
 	skcipher_request_set_crypt(req, sg, sg, sizeof(resp->encrypted), iv.x);
-
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 }
@@ -887,10 +836,8 @@ static int rxkad_decrypt_ticket(struct rxrpc_connection *conn,
 	}
 
 	sg_init_one(&sg[0], ticket, ticket_len);
-
 	skcipher_request_set_callback(req, 0, NULL, NULL);
 	skcipher_request_set_crypt(req, sg, sg, ticket_len, iv.x);
-
 	crypto_skcipher_decrypt(req);
 	skcipher_request_free(req);
 
@@ -1001,7 +948,7 @@ static void rxkad_decrypt_response(struct rxrpc_connection *conn,
 				   const struct rxrpc_crypt *session_key)
 {
 	SKCIPHER_REQUEST_ON_STACK(req, rxkad_ci);
-	struct scatterlist sg[2];
+	struct scatterlist sg[1];
 	struct rxrpc_crypt iv;
 
 	_enter(",,%08x%08x",
@@ -1016,12 +963,11 @@ static void rxkad_decrypt_response(struct rxrpc_connection *conn,
 
 	memcpy(&iv, session_key, sizeof(iv));
 
-	rxkad_sg_set_buf2(sg, &resp->encrypted, sizeof(resp->encrypted));
-
+	sg_init_table(sg, 1);
+	sg_set_buf(sg, &resp->encrypted, sizeof(resp->encrypted));
 	skcipher_request_set_tfm(req, rxkad_ci);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
 	skcipher_request_set_crypt(req, sg, sg, sizeof(resp->encrypted), iv.x);
-
 	crypto_skcipher_decrypt(req);
 	skcipher_request_zero(req);
 

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 04/29] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
  2016-06-26 21:55 ` [PATCH v4 04/29] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated Andy Lutomirski
@ 2016-06-28 18:48   ` Borislav Petkov
  2016-06-28 19:07     ` Andy Lutomirski
  0 siblings, 1 reply; 84+ messages in thread
From: Borislav Petkov @ 2016-06-28 18:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens

On Sun, Jun 26, 2016 at 02:55:26PM -0700, Andy Lutomirski wrote:
> This avoids pointless races in which another CPU or task might see a
> partially populated global pgd entry.  These races should normally
> be harmless, but, if another CPU propagates the entry via
> vmalloc_fault and then populate_pgd fails (due to memory allocation
> failure, for example), this prevents a use-after-free of the pgd
> entry.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/mm/pageattr.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index 7a1f7bbf4105..6a8026918bf6 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -1113,7 +1113,9 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr)
>  
>  	ret = populate_pud(cpa, addr, pgd_entry, pgprot);
>  	if (ret < 0) {
> -		unmap_pgd_range(cpa->pgd, addr,
> +		if (pud)
> +			free_page((unsigned long)pud);
> +		unmap_pud_range(pgd_entry, addr,
>  				addr + (cpa->numpages << PAGE_SHIFT));
>  		return ret;
>  	}
> -- 

So something's amiss here. Subject says:

"x86/cpa: In populate_pgd, don't set the pgd entry until it's populated"

but you haven't moved

	set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE));

after populate_pud() succeeds... Which is a good catch but your patch
should do it too. :-)
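
In outline, what's being asked for is the usual populate-then-publish
ordering.  A very rough sketch (the helper names here are hypothetical,
and it glosses over the fact that populate_pud() would then need to be
handed the new pud directly rather than finding it through the pgd
entry):

	pud = alloc_pud_page();			/* hypothetical helper */
	if (!pud)
		return -ENOMEM;

	ret = populate_lower_levels(pud);	/* hypothetical helper */
	if (ret < 0) {
		free_pud_page(pud);		/* never became visible */
		return ret;
	}

	/* publish only after the lower levels are fully populated */
	set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE));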

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 04/29] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
  2016-06-28 18:48   ` Borislav Petkov
@ 2016-06-28 19:07     ` Andy Lutomirski
  0 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-06-28 19:07 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, linux-kernel, linux-arch, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens

On Tue, Jun 28, 2016 at 11:48 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Sun, Jun 26, 2016 at 02:55:26PM -0700, Andy Lutomirski wrote:
>> This avoids pointless races in which another CPU or task might see a
>> partially populated global pgd entry.  These races should normally
>> be harmless, but, if another CPU propagates the entry via
>> vmalloc_fault and then populate_pgd fails (due to memory allocation
>> failure, for example), this prevents a use-after-free of the pgd
>> entry.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/mm/pageattr.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
>> index 7a1f7bbf4105..6a8026918bf6 100644
>> --- a/arch/x86/mm/pageattr.c
>> +++ b/arch/x86/mm/pageattr.c
>> @@ -1113,7 +1113,9 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr)
>>
>>       ret = populate_pud(cpa, addr, pgd_entry, pgprot);
>>       if (ret < 0) {
>> -             unmap_pgd_range(cpa->pgd, addr,
>> +             if (pud)
>> +                     free_page((unsigned long)pud);
>> +             unmap_pud_range(pgd_entry, addr,
>>                               addr + (cpa->numpages << PAGE_SHIFT));
>>               return ret;
>>       }
>> --
>
> So something's amiss here. Subject says:
>
> "x86/cpa: In populate_pgd, don't set the pgd entry until it's populated"
>
> but you haven't moved
>
>         set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE));
>
> after populate_pud() succeeds... Which is a good catch but your patch
> should do it too. :-)

Good catch.  I'll fix this in the next version.

--Andy

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup
  2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
                   ` (31 preceding siblings ...)
  2016-06-28  7:52 ` David Howells
@ 2016-06-29  7:06 ` Mika Penttilä
  2016-06-29 17:24   ` Mika Penttilä
  32 siblings, 1 reply; 84+ messages in thread
From: Mika Penttilä @ 2016-06-29  7:06 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens

On 06/27/2016 12:55 AM, Andy Lutomirski wrote:
> Hi all-
> 

> 
> Known issues:
>  - tcp md5, virtio_net, and virtio_console will have issues.  Eric Dumazet
>    has a patch for tcp md5, and Michael Tsirkin says he'll fix virtio_net
>    and virtio_console.
> 
How about PTRACE_SETREGS, which uses the child stack's vmapped address to put regs?

--Mika

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup
  2016-06-29  7:06 ` [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Mika Penttilä
@ 2016-06-29 17:24   ` Mika Penttilä
  0 siblings, 0 replies; 84+ messages in thread
From: Mika Penttilä @ 2016-06-29 17:24 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: linux-kernel, linux-arch, Borislav Petkov, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens



On 29.06.2016 10:06, Mika Penttilä wrote:
> On 06/27/2016 12:55 AM, Andy Lutomirski wrote:
>> Hi all-
>>
>> Known issues:
>>  - tcp md5, virtio_net, and virtio_console will have issues.  Eric Dumazet
>>    has a patch for tcp md5, and Michael Tsirkin says he'll fix virtio_net
>>    and virtio_console.
>>
> How about PTRACE_SETREGS, which uses the child stack's vmapped address to put regs?
>
> --Mika
>
PTRACE_SETREGS is ok of course.

--Mika

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 08/29] dma-api: Teach the "DMA-from-stack" check about vmapped stacks
  2016-06-26 21:55 ` [PATCH v4 08/29] dma-api: Teach the "DMA-from-stack" check about vmapped stacks Andy Lutomirski
@ 2016-06-30 19:37   ` Borislav Petkov
  2016-07-06 13:20     ` Andy Lutomirski
  0 siblings, 1 reply; 84+ messages in thread
From: Borislav Petkov @ 2016-06-30 19:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andrew Morton, Arnd Bergmann

On Sun, Jun 26, 2016 at 02:55:30PM -0700, Andy Lutomirski wrote:
> If we're using CONFIG_VMAP_STACK and we manage to point an sg entry
> at the stack, then either the sg page will be in highmem or sg_virt
> will return the direct-map alias.  In neither case will the existing
> check_for_stack() implementation realize that it's a stack page.
> 
> Fix it by explicitly checking for stack pages.
> 
> This has no effect by itself.  It's broken out for ease of review.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  lib/dma-debug.c | 39 +++++++++++++++++++++++++++++++++------
>  1 file changed, 33 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/dma-debug.c b/lib/dma-debug.c
> index 51a76af25c66..5b2e63cba90e 100644
> --- a/lib/dma-debug.c
> +++ b/lib/dma-debug.c
> @@ -22,6 +22,7 @@
>  #include <linux/stacktrace.h>
>  #include <linux/dma-debug.h>
>  #include <linux/spinlock.h>
> +#include <linux/vmalloc.h>
>  #include <linux/debugfs.h>
>  #include <linux/uaccess.h>
>  #include <linux/export.h>
> @@ -1162,11 +1163,35 @@ static void check_unmap(struct dma_debug_entry *ref)
>  	put_hash_bucket(bucket, &flags);
>  }
>  
> -static void check_for_stack(struct device *dev, void *addr)
> +static void check_for_stack(struct device *dev,
> +			    struct page *page, size_t offset)
>  {
> -	if (object_is_on_stack(addr))
> -		err_printk(dev, NULL, "DMA-API: device driver maps memory from "
> -				"stack [addr=%p]\n", addr);
> +	void *addr;
> +	struct vm_struct *stack_vm_area = task_stack_vm_area(current);

lib/dma-debug.c: In function ‘check_for_stack’:
lib/dma-debug.c:1170:36: error: implicit declaration of function ‘task_stack_vm_area’ [-Werror=implicit-function-declaration]
  struct vm_struct *stack_vm_area = task_stack_vm_area(current);
                                    ^
lib/dma-debug.c:1170:36: warning: initialization makes pointer from integer without a cast [-Wint-conversion]
cc1: some warnings being treated as errors
make[1]: *** [lib/dma-debug.o] Error 1
make: *** [lib] Error 2
make: *** Waiting for unfinished jobs....

Probably reorder pieces from patch 9 to earlier ones...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 09/29] fork: Add generic vmalloced stack support
  2016-06-26 21:55 ` [PATCH v4 09/29] fork: Add generic vmalloced stack support Andy Lutomirski
@ 2016-07-01 14:59   ` Borislav Petkov
  2016-07-01 16:30     ` Andy Lutomirski
  0 siblings, 1 reply; 84+ messages in thread
From: Borislav Petkov @ 2016-07-01 14:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Oleg Nesterov

On Sun, Jun 26, 2016 at 02:55:31PM -0700, Andy Lutomirski wrote:
> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
> vmalloc_node.
> 
> grsecurity has had a similar feature (called
> GRKERNSEC_KSTACKOVERFLOW) for a long time.
> 
> Cc: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/Kconfig                        | 29 +++++++++++++
>  arch/ia64/include/asm/thread_info.h |  2 +-
>  include/linux/sched.h               | 15 +++++++
>  kernel/fork.c                       | 87 +++++++++++++++++++++++++++++--------
>  4 files changed, 113 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 15996290fed4..18a2c3a7b460 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -661,4 +661,33 @@ config ARCH_NO_COHERENT_DMA_MMAP
>  config CPU_NO_EFFICIENT_FFS
>  	def_bool n
>  
> +config HAVE_ARCH_VMAP_STACK
> +	def_bool n
> +	help
> +	  An arch should select this symbol if it can support kernel stacks
> +	  in vmalloc space.  This means:
> +
> +	  - vmalloc space must be large enough to hold many kernel stacks.
> +	    This may rule out many 32-bit architectures.
> +
> +	  - Stacks in vmalloc space need to work reliably.  For example, if
> +	    vmap page tables are created on demand, either this mechanism
> +	    needs to work while the stack points to a virtual address with
> +	    unpopulated page tables or arch code (switch_to and switch_mm,
> +	    most likely) needs to ensure that the stack's page table entries
> +	    are populated before running on a possibly unpopulated stack.
> +
> +	  - If the stack overflows into a guard page, something reasonable
> +	    should happen.  The definition of "reasonable" is flexible, but
> +	    instantly rebooting without logging anything would be unfriendly.

Nice, I wish more people would actually *explain* their Kconfig options
properly.

...

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 146c9840c079..06761de69360 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -158,19 +158,37 @@ void __weak arch_release_thread_stack(unsigned long *stack)
>   * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
>   * kmemcache based allocator.
>   */
> -# if THREAD_SIZE >= PAGE_SIZE
> -static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
> -						  int node)
> +# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
> +static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
>  {
> +#ifdef CONFIG_VMAP_STACK
> +	void *stack = __vmalloc_node_range(
> +		THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
> +		THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
> +		0, node, __builtin_return_address(0));

Reformat:

        void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
                                           VMALLOC_START, VMALLOC_END,
                                           THREADINFO_GFP | __GFP_HIGHMEM,
                                           PAGE_KERNEL,
                                           0, node, __builtin_return_address(0));


> +
> +	/*
> +	 * We can't call find_vm_area() in interrupt context, and
> +	 * free_thread_info can be called in interrupt context, so cache

free_thread_stack() ?

> +	 * the vm_struct.
> +	 */
> +	if (stack)
> +		tsk->stack_vm_area = find_vm_area(stack);
> +	return stack;
> +#else
>  	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
>  						  THREAD_SIZE_ORDER);
>  
>  	return page ? page_address(page) : NULL;
> +#endif
>  }
>  
> -static inline void free_thread_stack(unsigned long *stack)
> +static inline void free_thread_stack(struct task_struct *tsk)
>  {
> -	free_kmem_pages((unsigned long)stack, THREAD_SIZE_ORDER);
> +	if (task_stack_vm_area(tsk))
> +		vfree(tsk->stack);
> +	else
> +		free_kmem_pages((unsigned long)tsk->stack, THREAD_SIZE_ORDER);
>  }
>  # else
>  static struct kmem_cache *thread_stack_cache;
> @@ -181,9 +199,9 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
>  	return kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
>  }
>  
> -static void free_thread_stack(unsigned long *stack)
> +static void free_thread_stack(struct task_struct *tsk)
>  {
> -	kmem_cache_free(thread_stack_cache, stack);
> +	kmem_cache_free(thread_stack_cache, tsk->stack);
>  }
>  
>  void thread_stack_cache_init(void)
> @@ -213,24 +231,49 @@ struct kmem_cache *vm_area_cachep;
>  /* SLAB cache for mm_struct structures (tsk->mm) */
>  static struct kmem_cache *mm_cachep;
>  
> -static void account_kernel_stack(unsigned long *stack, int account)
> +static void account_kernel_stack(struct task_struct *tsk, int account)
>  {
> -	/* All stack pages are in the same zone and belong to the same memcg. */
> -	struct page *first_page = virt_to_page(stack);
> +	void *stack = task_stack_page(tsk);
> +	struct vm_struct *vm = task_stack_vm_area(tsk);
> +
> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
> +
> +	if (vm) {
> +		int i;
>  
> -	mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
> -			    THREAD_SIZE / 1024 * account);
> +		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
>  
> -	memcg_kmem_update_page_stat(
> -		first_page, MEMCG_KERNEL_STACK_KB,
> -		account * (THREAD_SIZE / 1024));
> +		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
> +			mod_zone_page_state(page_zone(vm->pages[i]),
> +					    NR_KERNEL_STACK_KB,
> +					    PAGE_SIZE / 1024 * account);
> +		}
> +
> +		/* All stack pages belong to the same memcg. */
> +		memcg_kmem_update_page_stat(
> +			vm->pages[0], MEMCG_KERNEL_STACK_KB,
> +			account * (THREAD_SIZE / 1024));

Formatting:

		function_name(arg0, arg1,
			      arg2, arg3, ...);

> +	} else {
> +		/*
> +		 * All stack pages are in the same zone and belong to the
> +		 * same memcg.
> +		 */
> +		struct page *first_page = virt_to_page(stack);
> +
> +		mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
> +				    THREAD_SIZE / 1024 * account);
> +
> +		memcg_kmem_update_page_stat(
> +			first_page, MEMCG_KERNEL_STACK_KB,
> +			account * (THREAD_SIZE / 1024));

Ditto.

> +	}
>  }
>  
>  void free_task(struct task_struct *tsk)
>  {
> -	account_kernel_stack(tsk->stack, -1);
> +	account_kernel_stack(tsk, -1);
>  	arch_release_thread_stack(tsk->stack);
> -	free_thread_stack(tsk->stack);
> +	free_thread_stack(tsk);
>  	rt_mutex_debug_task_free(tsk);
>  	ftrace_graph_exit_task(tsk);
>  	put_seccomp_filter(tsk);
-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 09/29] fork: Add generic vmalloced stack support
  2016-07-01 14:59   ` Borislav Petkov
@ 2016-07-01 16:30     ` Andy Lutomirski
  0 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-07-01 16:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, linux-kernel, linux-arch, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Oleg Nesterov

On Fri, Jul 1, 2016 at 7:59 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Sun, Jun 26, 2016 at 02:55:31PM -0700, Andy Lutomirski wrote:
>> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
>> vmalloc_node.
>>

All done.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 10/29] x86/die: Don't try to recover from an OOPS on a non-default stack
  2016-06-26 21:55 ` [PATCH v4 10/29] x86/die: Don't try to recover from an OOPS on a non-default stack Andy Lutomirski
@ 2016-07-02 17:24   ` Borislav Petkov
  2016-07-02 18:34     ` Josh Poimboeuf
  0 siblings, 1 reply; 84+ messages in thread
From: Borislav Petkov @ 2016-07-02 17:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens

On Sun, Jun 26, 2016 at 02:55:32PM -0700, Andy Lutomirski wrote:
> It's not going to work, because the scheduler will explode if we try
> to schedule when running on an IST stack or similar.
> 
> This will matter when we let kernel stack overflows (which are #DF)
> call die().
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/kernel/dumpstack.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index ef8017ca5ba9..352f022cfd5b 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -245,6 +245,9 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
>  		return;
>  	if (in_interrupt())
>  		panic("Fatal exception in interrupt");
> +	if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
> +	     & ~(THREAD_SIZE - 1)) != 0)

Ugh, that's hard to parse. You could remove the "!= 0" at least to
shorten it a bit and have one less braces level.

Or maybe even do something like that to make it a bit more readable:

        if ((current_stack_pointer() ^ (current_top_of_stack() - 1))
                        &
             ~(THREAD_SIZE - 1))
                panic("Fatal exception on non-default stack");

Meh.

> +		panic("Fatal exception on special stack");

			"Fatal exception on non-default stack"

maybe?

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 10/29] x86/die: Don't try to recover from an OOPS on a non-default stack
  2016-07-02 17:24   ` Borislav Petkov
@ 2016-07-02 18:34     ` Josh Poimboeuf
  2016-07-03  9:40       ` Borislav Petkov
  2016-07-03 14:25       ` Andy Lutomirski
  0 siblings, 2 replies; 84+ messages in thread
From: Josh Poimboeuf @ 2016-07-02 18:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, x86, linux-kernel, linux-arch, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Jann Horn, Heiko Carstens

On Sat, Jul 02, 2016 at 07:24:41PM +0200, Borislav Petkov wrote:
> On Sun, Jun 26, 2016 at 02:55:32PM -0700, Andy Lutomirski wrote:
> > It's not going to work, because the scheduler will explode if we try
> > to schedule when running on an IST stack or similar.
> > 
> > This will matter when we let kernel stack overflows (which are #DF)
> > call die().
> > 
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > ---
> >  arch/x86/kernel/dumpstack.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> > index ef8017ca5ba9..352f022cfd5b 100644
> > --- a/arch/x86/kernel/dumpstack.c
> > +++ b/arch/x86/kernel/dumpstack.c
> > @@ -245,6 +245,9 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
> >  		return;
> >  	if (in_interrupt())
> >  		panic("Fatal exception in interrupt");
> > +	if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
> > +	     & ~(THREAD_SIZE - 1)) != 0)
> 
> Ugh, that's hard to parse. You could remove the "!= 0" at least to
> shorten it a bit and have one less braces level.
> 
> Or maybe even do something like that to make it a bit more readable:
> 
>         if ((current_stack_pointer() ^ (current_top_of_stack() - 1))
>                         &
>              ~(THREAD_SIZE - 1))
>                 panic("Fatal exception on non-default stack");
> 
> Meh.

A helper function would be even better.

The existing 'object_is_on_stack()' can probably be used:

	if (!object_is_on_stack(current_top_of_stack()))
		panic("...");

Though that function isn't quite accurately named.  It should really
have 'task_stack' in its name, like 'object_is_on_task_stack()'.  Or
even better, something more concise like 'on_task_stack()'.
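
One way such a helper could look, purely as a sketch (the name and the
exact bounds check are illustrative, not an actual patch):

	static bool on_task_stack(unsigned long addr)
	{
		unsigned long stack = (unsigned long)task_stack_page(current);

		return addr >= stack && addr < stack + THREAD_SIZE;
	}

	...
		if (!on_task_stack(current_stack_pointer()))
			panic("Fatal exception on non-default stack");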

-- 
Josh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 10/29] x86/die: Don't try to recover from an OOPS on a non-default stack
  2016-07-02 18:34     ` Josh Poimboeuf
@ 2016-07-03  9:40       ` Borislav Petkov
  2016-07-03 14:25       ` Andy Lutomirski
  1 sibling, 0 replies; 84+ messages in thread
From: Borislav Petkov @ 2016-07-03  9:40 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, x86, linux-kernel, linux-arch, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Jann Horn, Heiko Carstens

On Sat, Jul 02, 2016 at 01:34:51PM -0500, Josh Poimboeuf wrote:
> The existing 'object_is_on_stack()' can probably be used:
> 
> 	if (!object_is_on_stack(current_top_of_stack()))
> 		panic("...");
> 
> Though that function isn't quite accurately named.  It should really
> have 'task_stack' in its name, like 'object_is_on_task_stack()'.  Or
> even better, something more concise like 'on_task_stack()'.

So I'm obviously missing something here:

object_is_on_stack() uses task_stack_page(current) -> task_struct.stack
while current_stack_pointer() reads %rsp directly.

I'm guessing %rsp and task_struct.stack are in sync?

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 10/29] x86/die: Don't try to recover from an OOPS on a non-default stack
  2016-07-02 18:34     ` Josh Poimboeuf
  2016-07-03  9:40       ` Borislav Petkov
@ 2016-07-03 14:25       ` Andy Lutomirski
  2016-07-03 18:42         ` Borislav Petkov
  1 sibling, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-07-03 14:25 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Borislav Petkov, Andy Lutomirski, X86 ML, linux-kernel,
	linux-arch, Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Jann Horn, Heiko Carstens

On Sat, Jul 2, 2016 at 11:34 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Sat, Jul 02, 2016 at 07:24:41PM +0200, Borislav Petkov wrote:
>> On Sun, Jun 26, 2016 at 02:55:32PM -0700, Andy Lutomirski wrote:
>> > It's not going to work, because the scheduler will explode if we try
>> > to schedule when running on an IST stack or similar.
>> >
>> > This will matter when we let kernel stack overflows (which are #DF)
>> > call die().
>> >
>> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> > ---
>> >  arch/x86/kernel/dumpstack.c | 3 +++
>> >  1 file changed, 3 insertions(+)
>> >
>> > diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
>> > index ef8017ca5ba9..352f022cfd5b 100644
>> > --- a/arch/x86/kernel/dumpstack.c
>> > +++ b/arch/x86/kernel/dumpstack.c
>> > @@ -245,6 +245,9 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
>> >             return;
>> >     if (in_interrupt())
>> >             panic("Fatal exception in interrupt");
>> > +   if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
>> > +        & ~(THREAD_SIZE - 1)) != 0)
>>
>> Ugh, that's hard to parse. You could remove the "!= 0" at least to
>> shorten it a bit and have one less braces level.
>>
>> Or maybe even do something like that to make it a bit more readable:
>>
>>         if ((current_stack_pointer() ^ (current_top_of_stack() - 1))
>>                         &
>>              ~(THREAD_SIZE - 1))
>>                 panic("Fatal exception on non-default stack");
>>
>> Meh.
>
> A helper function would be even better.
>
> The existing 'object_is_on_stack()' can probably be used:
>
>         if (!object_is_on_stack(current_top_of_stack()))
>                 panic("...");
>
> Though that function isn't quite accurately named.  It should really
> have 'task_stack' in its name, like 'object_is_on_task_stack()'.  Or
> even better, something more concise like 'on_task_stack()'.
>

Given that the very next patch deletes this code, I vote for leaving
it alone.  Or I could fold the patches together.

--Andy

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 10/29] x86/die: Don't try to recover from an OOPS on a non-default stack
  2016-07-03 14:25       ` Andy Lutomirski
@ 2016-07-03 18:42         ` Borislav Petkov
  0 siblings, 0 replies; 84+ messages in thread
From: Borislav Petkov @ 2016-07-03 18:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, Andy Lutomirski, X86 ML, linux-kernel,
	linux-arch, Nadav Amit, Kees Cook, Brian Gerst, kernel-hardening,
	Linus Torvalds, Jann Horn, Heiko Carstens

On Sun, Jul 03, 2016 at 07:25:05AM -0700, Andy Lutomirski wrote:
> Given that the very next patch deletes this code, I vote for leaving
> it alone.  Or I could fold the patches together.

Ah, true. Yes, please fold them together.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 01/29] bluetooth: Switch SMP to crypto_cipher_encrypt_one()
  2016-06-27 22:33         ` Andy Lutomirski
@ 2016-07-04 17:56           ` Marcel Holtmann
  2016-07-06 13:17             ` Andy Lutomirski
  0 siblings, 1 reply; 84+ messages in thread
From: Marcel Holtmann @ 2016-07-04 17:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Andy Lutomirski, X86 ML, LKML, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Gustavo F. Padovan, Johan Hedberg,
	David S. Miller, linux-bluetooth, Network Development

Hi Andy,

>>>>> SMP does ECB crypto on stack buffers.  This is complicated and
>>>>> fragile, and it will not work if the stack is virtually allocated.
>>>>> 
>>>>> Switch to the crypto_cipher interface, which is simpler and safer.
>>>>> 
>>>>> Cc: Marcel Holtmann <marcel@holtmann.org>
>>>>> Cc: Gustavo Padovan <gustavo@padovan.org>
>>>>> Cc: Johan Hedberg <johan.hedberg@gmail.com>
>>>>> Cc: "David S. Miller" <davem@davemloft.net>
>>>>> Cc: linux-bluetooth@vger.kernel.org
>>>>> Cc: netdev@vger.kernel.org
>>>>> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
>>>>> Acked-and-tested-by: Johan Hedberg <johan.hedberg@intel.com>
>>>>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>>>>> ---
>>>>> net/bluetooth/smp.c | 67 ++++++++++++++++++++++-------------------------------
>>>>> 1 file changed, 28 insertions(+), 39 deletions(-)
>>>> 
>>>> patch has been applied to bluetooth-next tree.
>>> 
>>> Sadly carrying this separately will delay the virtual kernel stacks feature by a
>>> kernel cycle, because it's a must-have prerequisite.
>> 
>> I can take it back out, but then I have the fear that the ECDH change to use KPP for SMP might be the one that has to wait a kernel cycle. Either way is fine with me, but I want to avoid nasty merge conflicts in the Bluetooth SMP code.
> 
> Nothing goes wrong if an identical patch is queued in both places,
> right?  Or, if you prefer not to duplicate it, could one of you commit
> it and the other one pull it?  Ingo, given that this is patch 1 in the
> series and unlikely to change, if you want to make this whole thing
> have a separate branch in -tip, this could live there for starters.
> (But, if you do so, please make sure you base off a very new copy of
> Linus' tree -- the series is heavily dependent on the thread_info
> change he applied a few days ago.)

so what are we doing now? Do I take this back out, or do we keep it in and let git deal with it when merging the trees?

Regards

Marcel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 11/29] x86/dumpstack: When OOPSing, rewind the stack before do_exit
  2016-06-26 21:55 ` [PATCH v4 11/29] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
@ 2016-07-04 18:45   ` Borislav Petkov
  0 siblings, 0 replies; 84+ messages in thread
From: Borislav Petkov @ 2016-07-04 18:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Nadav Amit, Kees Cook,
	Brian Gerst, kernel-hardening, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens

On Sun, Jun 26, 2016 at 02:55:33PM -0700, Andy Lutomirski wrote:
> If we call do_exit with a clean stack, we greatly reduce the risk of

Nits:	    do_exit()

> recursive oopses due to stack overflow in do_exit, and we allow

s/ in do_exit//

> do_exit to work even if we OOPS from an IST stack.  The latter gives

Append "()" to the function names.

> us a much better chance of surviving long enough after we detect a
> stack overflow to write out our logs.
> 
> I intentionally separated this from the preceding patch that
> disables do_exit-on-OOPS on IST stacks.  This way, if we need to
> revert this patch, we still end up in an acceptable state wrt stack
> overflow handling.
> 
> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---

...

> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index 352f022cfd5b..0d05f113805e 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -226,6 +226,8 @@ unsigned long oops_begin(void)
>  EXPORT_SYMBOL_GPL(oops_begin);
>  NOKPROBE_SYMBOL(oops_begin);
>  
> +extern void __noreturn rewind_stack_do_exit(int signr);

You don't need the "extern" here.

> +
>  void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
>  {
>  	if (regs && kexec_should_crash(current))
-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 01/29] bluetooth: Switch SMP to crypto_cipher_encrypt_one()
  2016-07-04 17:56           ` Marcel Holtmann
@ 2016-07-06 13:17             ` Andy Lutomirski
  0 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-07-06 13:17 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Ingo Molnar, Andy Lutomirski, X86 ML, LKML, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Gustavo F. Padovan, Johan Hedberg,
	David S. Miller, linux-bluetooth, Network Development

On Mon, Jul 4, 2016 at 10:56 AM, Marcel Holtmann <marcel@holtmann.org> wrote:
> Hi Andy,
>
>>>>>> SMP does ECB crypto on stack buffers.  This is complicated and
>>>>>> fragile, and it will not work if the stack is virtually allocated.
>>>>>>
>>>>>> Switch to the crypto_cipher interface, which is simpler and safer.
>>>>>>
>>>>>> Cc: Marcel Holtmann <marcel@holtmann.org>
>>>>>> Cc: Gustavo Padovan <gustavo@padovan.org>
>>>>>> Cc: Johan Hedberg <johan.hedberg@gmail.com>
>>>>>> Cc: "David S. Miller" <davem@davemloft.net>
>>>>>> Cc: linux-bluetooth@vger.kernel.org
>>>>>> Cc: netdev@vger.kernel.org
>>>>>> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
>>>>>> Acked-and-tested-by: Johan Hedberg <johan.hedberg@intel.com>
>>>>>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>>>>>> ---
>>>>>> net/bluetooth/smp.c | 67 ++++++++++++++++++++++-------------------------------
>>>>>> 1 file changed, 28 insertions(+), 39 deletions(-)
>>>>>
>>>>> patch has been applied to bluetooth-next tree.
>>>>
>>>> Sadly carrying this separately will delay the virtual kernel stacks feature by a
>>>> kernel cycle, because it's a must-have prerequisite.
>>>
>>> I can take it back out, but then I have the fear that the ECDH change to use KPP for SMP might be the one that has to wait a kernel cycle. Either way is fine with me, but I want to avoid nasty merge conflicts in the Bluetooth SMP code.
>>
>> Nothing goes wrong if an identical patch is queued in both places,
>> right?  Or, if you prefer not to duplicate it, could one of you commit
>> it and the other one pull it?  Ingo, given that this is patch 1 in the
>> series and unlikely to change, if you want to make this whole thing
>> have a separate branch in -tip, this could live there for starters.
>> (But, if you do so, please make sure you base off a very new copy of
>> Linus' tree -- the series is heavily dependent on the thread_info
>> change he applied a few days ago.)
>
> so what are we doing now? Do I take this back out, or do we keep it in and let git deal with it when merging the trees?
>

Unless Ingo says otherwise, let's let git deal with it.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 08/29] dma-api: Teach the "DMA-from-stack" check about vmapped stacks
  2016-06-30 19:37   ` Borislav Petkov
@ 2016-07-06 13:20     ` Andy Lutomirski
  0 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-07-06 13:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, linux-kernel, linux-arch, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Andrew Morton,
	Arnd Bergmann

On Thu, Jun 30, 2016 at 12:37 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Sun, Jun 26, 2016 at 02:55:30PM -0700, Andy Lutomirski wrote:
>> If we're using CONFIG_VMAP_STACK and we manage to point an sg entry
>> at the stack, then either the sg page will be in highmem or sg_virt
>> will return the direct-map alias.  In neither case will the existing
>> check_for_stack() implementation realize that it's a stack page.
>>
>> Fix it by explicitly checking for stack pages.
>>
>> This has no effect by itself.  It's broken out for ease of review.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  lib/dma-debug.c | 39 +++++++++++++++++++++++++++++++++------
>>  1 file changed, 33 insertions(+), 6 deletions(-)
>>
>> diff --git a/lib/dma-debug.c b/lib/dma-debug.c
>> index 51a76af25c66..5b2e63cba90e 100644
>> --- a/lib/dma-debug.c
>> +++ b/lib/dma-debug.c
>> @@ -22,6 +22,7 @@
>>  #include <linux/stacktrace.h>
>>  #include <linux/dma-debug.h>
>>  #include <linux/spinlock.h>
>> +#include <linux/vmalloc.h>
>>  #include <linux/debugfs.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/export.h>
>> @@ -1162,11 +1163,35 @@ static void check_unmap(struct dma_debug_entry *ref)
>>       put_hash_bucket(bucket, &flags);
>>  }
>>
>> -static void check_for_stack(struct device *dev, void *addr)
>> +static void check_for_stack(struct device *dev,
>> +                         struct page *page, size_t offset)
>>  {
>> -     if (object_is_on_stack(addr))
>> -             err_printk(dev, NULL, "DMA-API: device driver maps memory from "
>> -                             "stack [addr=%p]\n", addr);
>> +     void *addr;
>> +     struct vm_struct *stack_vm_area = task_stack_vm_area(current);
>
> lib/dma-debug.c: In function ‘check_for_stack’:
> lib/dma-debug.c:1170:36: error: implicit declaration of function ‘task_stack_vm_area’ [-Werror=implicit-function-declaration]
>   struct vm_struct *stack_vm_area = task_stack_vm_area(current);
>                                     ^
> lib/dma-debug.c:1170:36: warning: initialization makes pointer from integer without a cast [-Wint-conversion]
> cc1: some warnings being treated as errors
> make[1]: *** [lib/dma-debug.o] Error 1
> make: *** [lib] Error 2
> make: *** Waiting for unfinished jobs....
>
> Probably reorder pieces from patch 9 to earlier ones...

I'll address this by reordering it later in the series.  The temporary
loss of functionality will be unobservable.

--Andy

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [kernel-hardening] [PATCH v4 26/29] sched: Allow putting thread_info into task_struct
  2016-06-26 21:55 ` [PATCH v4 26/29] sched: Allow putting thread_info into task_struct Andy Lutomirski
@ 2016-07-11 10:08   ` Mark Rutland
  2016-07-11 14:55     ` Andy Lutomirski
  0 siblings, 1 reply; 84+ messages in thread
From: Mark Rutland @ 2016-07-11 10:08 UTC (permalink / raw)
  To: kernel-hardening
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, Linus Torvalds, Josh Poimboeuf,
	Jann Horn, Heiko Carstens, Andy Lutomirski

Hi,

On Sun, Jun 26, 2016 at 02:55:48PM -0700, Andy Lutomirski wrote:
> If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK_STRUCT,
> then thread_info is defined as a single 'u32 flags' and is the first
> entry of task_struct.  thread_info::task is removed (it serves no
> purpose if thread_info is embedded in task_struct), and
> thread_info::cpu gets its own slot in task_struct.
> 
> This is heavily based on a patch written by Linus.

I've been considering how we'd implement this for arm64, and I suspect
that we'll also need to fold our preempt_count into task_struct
(following from the style of asm-generic/preempt.h).

As far as I can see, we can't make our preempt-count a percpu variable
as with x86, as our percpu ops themselves are based on disabling
preemption.
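
For reference, the asm-generic/preempt.h accessors all go through
current_thread_info()->preempt_count; a task_struct-based variant along
the lines described above might look roughly like this sketch (the
preempt_count field in task_struct is hypothetical):

	static __always_inline int preempt_count(void)
	{
		return READ_ONCE(current->preempt_count);
	}

	static __always_inline void __preempt_count_add(int val)
	{
		current->preempt_count += val;
	}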

To that end, would it be possible to keep the thread_info definition per
arch, even with CONFIG_THREAD_INFO_IN_TASK?

Thanks,
Mark.

> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  include/linux/init_task.h   |  9 +++++++++
>  include/linux/sched.h       | 36 ++++++++++++++++++++++++++++++++++--
>  include/linux/thread_info.h | 15 +++++++++++++++
>  init/Kconfig                |  3 +++
>  init/init_task.c            |  7 +++++--
>  kernel/sched/sched.h        |  4 ++++
>  6 files changed, 70 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index f8834f820ec2..9c04d44eeb3c 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -15,6 +15,8 @@
>  #include <net/net_namespace.h>
>  #include <linux/sched/rt.h>
>  
> +#include <asm/thread_info.h>
> +
>  #ifdef CONFIG_SMP
>  # define INIT_PUSHABLE_TASKS(tsk)					\
>  	.pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO),
> @@ -183,12 +185,19 @@ extern struct task_group root_task_group;
>  # define INIT_KASAN(tsk)
>  #endif
>  
> +#ifdef CONFIG_THREAD_INFO_IN_TASK
> +# define INIT_TASK_TI(tsk) .thread_info = INIT_THREAD_INFO(tsk),
> +#else
> +# define INIT_TASK_TI(tsk)
> +#endif
> +
>  /*
>   *  INIT_TASK is used to set up the first task table, touch at
>   * your own risk!. Base=0, limit=0x1fffff (=2MB)
>   */
>  #define INIT_TASK(tsk)	\
>  {									\
> +	INIT_TASK_TI(tsk)						\
>  	.state		= 0,						\
>  	.stack		= init_stack,					\
>  	.usage		= ATOMIC_INIT(2),				\
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 569df670407a..4108b4880b86 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1456,6 +1456,13 @@ struct tlbflush_unmap_batch {
>  };
>  
>  struct task_struct {
> +#ifdef CONFIG_THREAD_INFO_IN_TASK
> +	/*
> +	 * For reasons of header soup (see current_thread_info()), this
> +	 * must be the first element of task_struct.
> +	 */
> +	struct thread_info thread_info;
> +#endif
>  	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
>  	void *stack;
>  	atomic_t usage;
> @@ -1465,6 +1472,9 @@ struct task_struct {
>  #ifdef CONFIG_SMP
>  	struct llist_node wake_entry;
>  	int on_cpu;
> +#ifdef CONFIG_THREAD_INFO_IN_TASK
> +	unsigned int cpu;	/* current CPU */
> +#endif
>  	unsigned int wakee_flips;
>  	unsigned long wakee_flip_decay_ts;
>  	struct task_struct *last_wakee;
> @@ -2557,7 +2567,9 @@ extern void set_curr_task(int cpu, struct task_struct *p);
>  void yield(void);
>  
>  union thread_union {
> +#ifndef CONFIG_THREAD_INFO_IN_TASK
>  	struct thread_info thread_info;
> +#endif
>  	unsigned long stack[THREAD_SIZE/sizeof(long)];
>  };
>  
> @@ -3045,10 +3057,26 @@ static inline void threadgroup_change_end(struct task_struct *tsk)
>  	cgroup_threadgroup_change_end(tsk);
>  }
>  
> -#ifndef __HAVE_THREAD_FUNCTIONS
> +#ifdef CONFIG_THREAD_INFO_IN_TASK
> +
> +static inline struct thread_info *task_thread_info(struct task_struct *task)
> +{
> +	return &task->thread_info;
> +}
> +static inline void *task_stack_page(const struct task_struct *task)
> +{
> +	return task->stack;
> +}
> +#define setup_thread_stack(new,old)	do { } while(0)
> +static inline unsigned long *end_of_stack(const struct task_struct *task)
> +{
> +	return task->stack;
> +}
> +
> +#elif !defined(__HAVE_THREAD_FUNCTIONS)
>  
>  #define task_thread_info(task)	((struct thread_info *)(task)->stack)
> -#define task_stack_page(task)	((task)->stack)
> +#define task_stack_page(task)	((void *)(task)->stack)
>  
>  static inline void setup_thread_stack(struct task_struct *p, struct task_struct *org)
>  {
> @@ -3348,7 +3376,11 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
>  
>  static inline unsigned int task_cpu(const struct task_struct *p)
>  {
> +#ifdef CONFIG_THREAD_INFO_IN_TASK
> +	return p->cpu;
> +#else
>  	return task_thread_info(p)->cpu;
> +#endif
>  }
>  
>  static inline int task_node(const struct task_struct *p)
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 352b1542f5cc..b2b32d63bc8e 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -13,6 +13,21 @@
>  struct timespec;
>  struct compat_timespec;
>  
> +#ifdef CONFIG_THREAD_INFO_IN_TASK
> +struct thread_info {
> +	u32			flags;		/* low level flags */
> +};
> +
> +#define INIT_THREAD_INFO(tsk)			\
> +{						\
> +	.flags		= 0,			\
> +}
> +#endif
> +
> +#ifdef CONFIG_THREAD_INFO_IN_TASK
> +#define current_thread_info() ((struct thread_info *)current)
> +#endif
> +
>  /*
>   * System call restart block.
>   */
> diff --git a/init/Kconfig b/init/Kconfig
> index f755a602d4a1..0c83af6d3753 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -26,6 +26,9 @@ config IRQ_WORK
>  config BUILDTIME_EXTABLE_SORT
>  	bool
>  
> +config THREAD_INFO_IN_TASK
> +	bool
> +
>  menu "General setup"
>  
>  config BROKEN
> diff --git a/init/init_task.c b/init/init_task.c
> index ba0a7f362d9e..11f83be1fa79 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -22,5 +22,8 @@ EXPORT_SYMBOL(init_task);
>   * Initial thread structure. Alignment of this is handled by a special
>   * linker map entry.
>   */
> -union thread_union init_thread_union __init_task_data =
> -	{ INIT_THREAD_INFO(init_task) };
> +union thread_union init_thread_union __init_task_data = {
> +#ifndef CONFIG_THREAD_INFO_IN_TASK
> +	INIT_THREAD_INFO(init_task)
> +#endif
> +};
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7cbeb92a1cb9..a1cabcea4c54 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -999,7 +999,11 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
>  	 * per-task data have been completed by this moment.
>  	 */
>  	smp_wmb();
> +#ifdef CONFIG_THREAD_INFO_IN_TASK
> +	p->cpu = cpu;
> +#else
>  	task_thread_info(p)->cpu = cpu;
> +#endif
>  	p->wake_cpu = cpu;
>  #endif
>  }
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [kernel-hardening] [PATCH v4 26/29] sched: Allow putting thread_info into task_struct
  2016-07-11 10:08   ` [kernel-hardening] " Mark Rutland
@ 2016-07-11 14:55     ` Andy Lutomirski
  2016-07-11 15:08       ` Mark Rutland
       [not found]       ` <CA+55aFy2Sno+bS0A2k0cMWpEJy-bpXufSAw3+ufrfQYbp9rcMQ@mail.gmail.com>
  0 siblings, 2 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-07-11 14:55 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Nadav Amit, linux-arch, Kees Cook, Josh Poimboeuf,
	Borislav Petkov, linux-kernel, Jann Horn, Heiko Carstens,
	kernel-hardening, Brian Gerst, X86 ML, Linus Torvalds

On Jul 11, 2016 3:08 AM, "Mark Rutland" <mark.rutland@arm.com> wrote:
>
> Hi,
>
> On Sun, Jun 26, 2016 at 02:55:48PM -0700, Andy Lutomirski wrote:
> > If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK,
> > then thread_info is defined as a single 'u32 flags' and is the first
> > entry of task_struct.  thread_info::task is removed (it serves no
> > purpose if thread_info is embedded in task_struct), and
> > thread_info::cpu gets its own slot in task_struct.
> >
> > This is heavily based on a patch written by Linus.
>
> I've been considering how we'd implement this for arm64, and I suspect
> that we'll also need to fold our preempt_count into task_struct
> (following from the style of asm-generic/preempt.h).
>
> As far as I can see, we can't make our preempt-count a percpu variable
> as with x86, as our percpu ops themselves are based on disabling
> preemption.

How do you intend to find 'current' to get to the preempt count
without first disabling preemption?

>
> To that end, would it be possible to keep the thread_info definition per
> arch, even with CONFIG_THREAD_INFO_IN_TASK?

In principle, yes, but could you alternatively put it in
thread_struct?  My goal here is to encourage people to clean up their
use of thread_info vs thread_struct at the same time.  For x86, that
cleanup was trivial -- most of the work was addressing relative to
current instead of the stack pointer, and that had to happen
regardless.

--Andy

^ permalink raw reply	[flat|nested] 84+ messages in thread
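
A rough sketch of the alternative Andy is proposing above: arch-private state (a preempt count, say) lives in thread_struct and is reached through current rather than through a thread_info on the stack. The preempt_count field and the helper below are hypothetical illustrations, not anything in the posted series:

struct thread_struct {
	/* ... existing arch state ... */
	int	preempt_count;		/* hypothetical arch-private field */
};

static inline int arch_preempt_count(void)
{
	/* addressed off 'current', not off the stack pointer */
	return current->thread.preempt_count;
}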

* Re: [kernel-hardening] [PATCH v4 26/29] sched: Allow putting thread_info into task_struct
  2016-07-11 14:55     ` Andy Lutomirski
@ 2016-07-11 15:08       ` Mark Rutland
       [not found]       ` <CA+55aFy2Sno+bS0A2k0cMWpEJy-bpXufSAw3+ufrfQYbp9rcMQ@mail.gmail.com>
  1 sibling, 0 replies; 84+ messages in thread
From: Mark Rutland @ 2016-07-11 15:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, linux-arch, Kees Cook, Josh Poimboeuf,
	Borislav Petkov, linux-kernel, Jann Horn, Heiko Carstens,
	kernel-hardening, Brian Gerst, X86 ML, Linus Torvalds

On Mon, Jul 11, 2016 at 07:55:17AM -0700, Andy Lutomirski wrote:
> On Jul 11, 2016 3:08 AM, "Mark Rutland" <mark.rutland@arm.com> wrote:
> >
> > Hi,
> >
> > On Sun, Jun 26, 2016 at 02:55:48PM -0700, Andy Lutomirski wrote:
> > > If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK,
> > > then thread_info is defined as a single 'u32 flags' and is the first
> > > entry of task_struct.  thread_info::task is removed (it serves no
> > > purpose if thread_info is embedded in task_struct), and
> > > thread_info::cpu gets its own slot in task_struct.
> > >
> > > This is heavily based on a patch written by Linus.
> >
> > I've been considering how we'd implement this for arm64, and I suspect
> > that we'll also need to fold our preempt_count into task_struct
> > (following from the style of asm-generic/preempt.h).
> >
> > As far as I can see, we can't make our preempt-count a percpu variable
> > as with x86, as our percpu ops themselves are based on disabling
> > preemption.
> 
> How do you intend to find 'current' to get to the preempt count
> without first disabling preemption?

Good point.

For some reason I had convinced myself that it only mattered for RMW
sequences, so evidently I hadn't considered things thoroughly enough. :(

> > To that end, would it be possible to keep the thread_info definition per
> > arch, even with CONFIG_THREAD_INFO_IN_TASK?
> 
> In principle, yes, but could you alternatively put it in
> thread_struct?  My goal here is to encourage people to clean up their
> use of thread_info vs thread_struct at the same time.  For x86, that
> cleanup was trivial -- most of the work was addressing relative to
> current instead of the stack pointer, and that had to happen
> regardless.

I'm more than happy to do that, modulo the above permitting.

Sorry for the noise!

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [kernel-hardening] [PATCH v4 26/29] sched: Allow putting thread_info into task_struct
       [not found]       ` <CA+55aFy2Sno+bS0A2k0cMWpEJy-bpXufSAw3+ufrfQYbp9rcMQ@mail.gmail.com>
@ 2016-07-11 16:31         ` Mark Rutland
  2016-07-11 16:42           ` Linus Torvalds
  0 siblings, 1 reply; 84+ messages in thread
From: Mark Rutland @ 2016-07-11 16:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, linux-arch, Nadav Amit, Kees Cook,
	Borislav Petkov, Josh Poimboeuf, X86 ML, Jann Horn,
	Heiko Carstens, Brian Gerst, kernel-hardening, linux-kernel

On Mon, Jul 11, 2016 at 09:06:58AM -0700, Linus Torvalds wrote:
> On Jul 11, 2016 7:55 AM, "Andy Lutomirski" <luto@amacapital.net> wrote:
> >
> > How do you intend to find 'current' to get to the preempt count
> > without first disabling preemption?
>
> Actually, that is the classic case of "not a problem".
>
> The thing is, it doesn't matter if you schedule away while looking up
> current or the preempt count - because both values are idempotent wrt
> scheduling.
>
> So until you do the write that actually disables preemption you can
> schedule away as much as you want, and after that write you no longer
> will.

I was assuming a percpu pointer to current (or preempt count).

The percpu offset might be stale at the point you try to dereference
that, even though current itself hasn't changed, and you may access the
wrong CPU's value.

> This is different wrt a per-cpu area - which is clearly not idempotent wrt
> scheduling.
>
> The reason per-cpu works on x86 is that we have an atomic rmw operation
> that is *also* atomic wrt the CPU lookup (thanks to the segment base)

Sure, understood.

Mark.

^ permalink raw reply	[flat|nested] 84+ messages in thread
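
Spelled out, the race Mark describes looks roughly like this, assuming a hypothetical per-cpu preempt count and a generic, multi-instruction this_cpu_ptr() implementation:

#include <linux/percpu.h>

DEFINE_PER_CPU(int, cpu_preempt_count);			/* hypothetical variable */

static inline int racy_preempt_count(void)
{
	int *p = this_cpu_ptr(&cpu_preempt_count);	/* computes CPU N's address */
	/* <-- preempted and migrated to CPU M here */
	return *p;					/* still reads CPU N's counter */
}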

* Re: [kernel-hardening] [PATCH v4 26/29] sched: Allow putting thread_info into task_struct
  2016-07-11 16:31         ` Mark Rutland
@ 2016-07-11 16:42           ` Linus Torvalds
  0 siblings, 0 replies; 84+ messages in thread
From: Linus Torvalds @ 2016-07-11 16:42 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Andy Lutomirski, linux-arch, Nadav Amit, Kees Cook,
	Borislav Petkov, Josh Poimboeuf, X86 ML, Jann Horn,
	Heiko Carstens, Brian Gerst, kernel-hardening, linux-kernel

On Mon, Jul 11, 2016 at 9:31 AM, Mark Rutland <mark.rutland@arm.com> wrote:
>>
>> So until you do the write that actually disables preemption you can
>> schedule away as much as you want, and after that write you no longer
>> will.
>
> I was assuming a percpu pointer to current (or preempt count).

So for the same reason that is ok *iff* you have

 - some kind of dedicated percpu register (or other base pointer - x86
has the segment thing) that gets updated when you schedule.

 - an instruction that can load 'current' directly off that register atomically.

But yes, percpu data in general is obviously not safe to access
without preemption.

         Linus

^ permalink raw reply	[flat|nested] 84+ messages in thread
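
Put differently, the per-cpu variant is only safe when the CPU lookup and the access collapse into a single step that migration cannot split. A sketch of the x86 case -- current_task and __preempt_count are the real x86 per-cpu variables, but the helpers here are illustrative, not the in-tree code:

static inline struct task_struct *sketch_current(void)
{
	/* a single %gs-relative load: lookup and read cannot be split by migration */
	return this_cpu_read_stable(current_task);
}

static inline void sketch_preempt_disable(void)
{
	/* likewise roughly one %gs-relative add */
	raw_cpu_add(__preempt_count, 1);
	barrier();
}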

end of thread, other threads:[~2016-07-11 16:42 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-26 21:55 [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 01/29] bluetooth: Switch SMP to crypto_cipher_encrypt_one() Andy Lutomirski
2016-06-27  5:58   ` Marcel Holtmann
2016-06-27  8:54     ` Ingo Molnar
2016-06-27 22:30       ` Marcel Holtmann
2016-06-27 22:33         ` Andy Lutomirski
2016-07-04 17:56           ` Marcel Holtmann
2016-07-06 13:17             ` Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 03/29] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 04/29] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated Andy Lutomirski
2016-06-28 18:48   ` Borislav Petkov
2016-06-28 19:07     ` Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 05/29] x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables() Andy Lutomirski
2016-06-27  7:19   ` Borislav Petkov
2016-06-26 21:55 ` [PATCH v4 06/29] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 07/29] mm: Fix memcg stack accounting for sub-page stacks Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 08/29] dma-api: Teach the "DMA-from-stack" check about vmapped stacks Andy Lutomirski
2016-06-30 19:37   ` Borislav Petkov
2016-07-06 13:20     ` Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 09/29] fork: Add generic vmalloced stack support Andy Lutomirski
2016-07-01 14:59   ` Borislav Petkov
2016-07-01 16:30     ` Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 10/29] x86/die: Don't try to recover from an OOPS on a non-default stack Andy Lutomirski
2016-07-02 17:24   ` Borislav Petkov
2016-07-02 18:34     ` Josh Poimboeuf
2016-07-03  9:40       ` Borislav Petkov
2016-07-03 14:25       ` Andy Lutomirski
2016-07-03 18:42         ` Borislav Petkov
2016-06-26 21:55 ` [PATCH v4 11/29] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
2016-07-04 18:45   ` Borislav Petkov
2016-06-26 21:55 ` [PATCH v4 12/29] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 13/29] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 14/29] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks Andy Lutomirski
2016-06-27 15:01   ` Brian Gerst
2016-06-27 15:12     ` Brian Gerst
2016-06-27 15:22       ` Andy Lutomirski
2016-06-27 15:54         ` Andy Lutomirski
2016-06-27 16:17           ` Brian Gerst
2016-06-27 16:35             ` Andy Lutomirski
2016-06-27 17:09               ` Brian Gerst
2016-06-27 17:23                 ` Brian Gerst
2016-06-27 17:28           ` Linus Torvalds
2016-06-27 17:30             ` Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 16/29] x86/mm: Improve stack-overflow #PF handling Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 17/29] x86: Move uaccess_err and sig_on_uaccess_err to thread_struct Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 18/29] x86: Move addr_limit " Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 19/29] signal: Consolidate {TS,TLF}_RESTORE_SIGMASK code Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 20/29] x86/smp: Remove stack_smp_processor_id() Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 21/29] x86/smp: Remove unnecessary initialization of thread_info::cpu Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 22/29] x86/asm: Move 'status' from struct thread_info to struct thread_struct Andy Lutomirski
2016-06-26 23:55   ` Brian Gerst
2016-06-27  0:23     ` Andy Lutomirski
2016-06-27  0:36       ` Brian Gerst
2016-06-27  0:40         ` Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 23/29] kdb: Use task_cpu() instead of task_thread_info()->cpu Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 24/29] x86/entry: Get rid of pt_regs_to_thread_info() Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 25/29] um: Stop conflating task_struct::stack with thread_info Andy Lutomirski
2016-06-26 23:40   ` Brian Gerst
2016-06-26 23:49     ` Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 26/29] sched: Allow putting thread_info into task_struct Andy Lutomirski
2016-07-11 10:08   ` [kernel-hardening] " Mark Rutland
2016-07-11 14:55     ` Andy Lutomirski
2016-07-11 15:08       ` Mark Rutland
     [not found]       ` <CA+55aFy2Sno+bS0A2k0cMWpEJy-bpXufSAw3+ufrfQYbp9rcMQ@mail.gmail.com>
2016-07-11 16:31         ` Mark Rutland
2016-07-11 16:42           ` Linus Torvalds
2016-06-26 21:55 ` [PATCH v4 27/29] x86: Move " Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 28/29] sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK Andy Lutomirski
2016-06-27  2:35   ` Andy Lutomirski
2016-06-26 21:55 ` [PATCH v4 29/29] fork: Cache two thread stacks per cpu if CONFIG_VMAP_STACK is set Andy Lutomirski
2016-06-28  7:32 ` [PATCH v4 02/29] rxrpc: Avoid using stack memory in SG lists in rxkad David Howells
2016-06-28  7:37   ` Herbert Xu
2016-06-28  9:07   ` David Howells
2016-06-28  9:45     ` Herbert Xu
2016-06-28  7:41 ` David Howells
2016-06-28  7:52 ` David Howells
2016-06-28  7:55   ` Herbert Xu
2016-06-28  8:54   ` David Howells
2016-06-28  9:43     ` Herbert Xu
2016-06-28 10:00     ` David Howells
2016-06-28 13:23     ` David Howells
2016-06-29  7:06 ` [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup Mika Penttilä
2016-06-29 17:24   ` Mika Penttilä

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).