linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core)
@ 2016-06-24  4:22 Andy Lutomirski
  2016-06-24  4:22 ` [PATCH v4 01/16] bluetooth: Switch SMP to crypto_cipher_encrypt_one() Andy Lutomirski
                   ` (15 more replies)
  0 siblings, 16 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:22 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

Since the dawn of time, a kernel stack overflow has been a real PITA
to debug, has caused nondeterministic crashes some time after the
actual overflow, and has generally been easy to exploit for root.

With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
that enable it (just x86 for now) get virtually mapped stacks with
guard pages.  This causes reliable faults when the stack overflows.

If the arch implements it well, we get a nice OOPS on stack overflow
(as opposed to panicking directly or otherwise exploding badly).  On
x86, the OOPS is nice, has a usable call trace, and the overflowing
task is killed cleanly.

On my laptop, this adds about 1.5µs of overhead to task creation,
which seems to be mainly caused by vmalloc inefficiently allocating
individual pages even when a higher-order page is available on the
freelist.

This does not address interrupt stacks.  It also does not address
the possibility of privilege escalation by a controlled stack
overflow that overwrites thread_info without hitting the guard page.
I'll send patches to address the latter issue once this series
lands.

It's worth noting that s390 has an arch-specific gcc feature that
detects stack overflows by adjusting function prologues.  Arches
with features like that may wish to avoid using vmapped stacks to
minimize the performance hit.

Ingo, would it make sense to throw it into a separate branch in
-tip?  I wouldn't mind seeing some -next testing to give people a
chance to shake out problems.  I'm particularly interested in
whether there are any drivers that expect virt_to_phys to work on
stack addresses.  (I know that virtio-net used to, but I fixed that
a while back.)
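
If it helps anyone auditing drivers for that, the pattern to look for
is roughly the following (a hypothetical example; the device type and
helper below are made up, not taken from any real driver):

	/* Hypothetical driver pattern that breaks once stacks come from vmalloc. */
	static int send_cmd(struct my_dev *mydev)
	{
		u8 cmd[16];		/* on-stack buffer */

		/*
		 * virt_to_phys() is only meaningful for direct-mapped
		 * addresses; with a vmalloc-backed stack it returns a
		 * bogus physical address (and DMAing to or from the
		 * stack is a bad idea regardless).
		 */
		return my_dev_submit(mydev, virt_to_phys(cmd), sizeof(cmd));
	}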

Once this lands in -tip, I'm planning on attacking thread_info.
Once thread_info is under control, we can start caching a couple of
stacks per cpu, and that should get us most of the performance back.

Changes from v3:
 - Fix rxrpc and bluetooth, which used scatterlists pointed at the stack
 - Add some acks and cc's

Changes from v2:
 - Delete kernel_unmap_pages_in_pgd rather than hardening it (Borislav)
 - Fix sub-page stack accounting more thoroughly (Josh)

Changes from v1:
 - Fix rewind_stack_and_do_exit (Josh)
 - Fix deadlock under load
 - Clean up generic stack vmalloc code
 - Many other minor fixes
 
Andy Lutomirski (14):
  bluetooth: Switch SMP to crypto_cipher_encrypt_one()
  x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
  x86/mm: Remove kernel_unmap_pages_in_pgd() and
    efi_cleanup_page_tables()
  mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
  mm: Fix memcg stack accounting for sub-page stacks
  dma-api: Teach the "DMA-from-stack" check about vmapped stacks
  fork: Add generic vmalloced stack support
  x86/die: Don't try to recover from an OOPS on a non-default stack
  x86/dumpstack: When OOPSing, rewind the stack before do_exit
  x86/dumpstack: When dumping stack bytes due to OOPS, start with
    regs->sp
  x86/dumpstack: Try harder to get a call trace on stack overflow
  x86/dumpstack/64: Handle faults when printing the "Stack:" part of an
    OOPS
  x86/mm/64: Enable vmapped stacks
  x86/mm: Improve stack-overflow #PF handling

Herbert Xu (1):
  rxrpc: Avoid using stack memory in SG lists in rxkad

Ingo Molnar (1):
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()

 arch/Kconfig                         |  29 ++++++++++
 arch/ia64/include/asm/thread_info.h  |   2 +-
 arch/x86/Kconfig                     |   1 +
 arch/x86/entry/entry_32.S            |  11 ++++
 arch/x86/entry/entry_64.S            |  11 ++++
 arch/x86/include/asm/efi.h           |   1 -
 arch/x86/include/asm/pgtable_types.h |   2 -
 arch/x86/include/asm/switch_to.h     |  28 +++++++++-
 arch/x86/include/asm/traps.h         |   6 ++
 arch/x86/kernel/dumpstack.c          |  19 ++++++-
 arch/x86/kernel/dumpstack_32.c       |   4 +-
 arch/x86/kernel/dumpstack_64.c       |  16 +++++-
 arch/x86/kernel/traps.c              |  32 +++++++++++
 arch/x86/mm/fault.c                  |  39 +++++++++++++
 arch/x86/mm/init_64.c                |  27 ---------
 arch/x86/mm/pageattr.c               |  32 +----------
 arch/x86/mm/tlb.c                    |  15 +++++
 arch/x86/platform/efi/efi.c          |   2 -
 arch/x86/platform/efi/efi_32.c       |   3 -
 arch/x86/platform/efi/efi_64.c       |   5 --
 drivers/base/node.c                  |   3 +-
 fs/proc/meminfo.c                    |   2 +-
 include/linux/memcontrol.h           |   2 +-
 include/linux/mmzone.h               |   2 +-
 include/linux/sched.h                |  15 +++++
 kernel/fork.c                        |  86 ++++++++++++++++++++++-------
 lib/dma-debug.c                      |  39 +++++++++++--
 mm/memcontrol.c                      |   2 +-
 mm/page_alloc.c                      |   3 +-
 net/bluetooth/smp.c                  |  67 ++++++++++-------------
 net/rxrpc/ar-internal.h              |   1 +
 net/rxrpc/rxkad.c                    | 103 +++++++++++++++--------------------
 32 files changed, 400 insertions(+), 210 deletions(-)

-- 
2.5.5


* [PATCH v4 01/16] bluetooth: Switch SMP to crypto_cipher_encrypt_one()
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
@ 2016-06-24  4:22 ` Andy Lutomirski
  2016-06-24  6:10   ` Herbert Xu
  2016-06-24  7:19   ` Johan Hedberg
  2016-06-24  4:22 ` [PATCH v4 02/16] rxrpc: Avoid using stack memory in SG lists in rxkad Andy Lutomirski
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:22 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski, Marcel Holtmann,
	Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth,
	Herbert Xu, netdev

SMP does ECB crypto on stack buffers.  This is complicated and
fragile, and it will not work if the stack is virtually allocated.

Switch to the crypto_cipher interface, which is simpler and safer.
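
For context, the single-block interface this moves to looks roughly
like the sketch below (illustrative only, not a hunk from the patch;
error handling trimmed):

	struct crypto_cipher *tfm;
	u8 key[16], block[16];
	int err;

	tfm = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_ASYNC);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	err = crypto_cipher_setkey(tfm, key, 16);
	if (!err)
		crypto_cipher_encrypt_one(tfm, block, block); /* in place, no SG list */

	crypto_free_cipher(tfm);

Since ECB of a single 16-byte block is just one AES block operation,
no scatterlist or request object is needed and nothing has to point
at the stack.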

Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-bluetooth@vger.kernel.org
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: netdev@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 net/bluetooth/smp.c | 67 ++++++++++++++++++++++-------------------------------
 1 file changed, 28 insertions(+), 39 deletions(-)

diff --git a/net/bluetooth/smp.c b/net/bluetooth/smp.c
index 50976a6481f3..4c1a16a96ae5 100644
--- a/net/bluetooth/smp.c
+++ b/net/bluetooth/smp.c
@@ -22,9 +22,9 @@
 
 #include <linux/debugfs.h>
 #include <linux/scatterlist.h>
+#include <linux/crypto.h>
 #include <crypto/b128ops.h>
 #include <crypto/hash.h>
-#include <crypto/skcipher.h>
 
 #include <net/bluetooth/bluetooth.h>
 #include <net/bluetooth/hci_core.h>
@@ -88,7 +88,7 @@ struct smp_dev {
 	u8			min_key_size;
 	u8			max_key_size;
 
-	struct crypto_skcipher	*tfm_aes;
+	struct crypto_cipher	*tfm_aes;
 	struct crypto_shash	*tfm_cmac;
 };
 
@@ -127,7 +127,7 @@ struct smp_chan {
 	u8			dhkey[32];
 	u8			mackey[16];
 
-	struct crypto_skcipher	*tfm_aes;
+	struct crypto_cipher	*tfm_aes;
 	struct crypto_shash	*tfm_cmac;
 };
 
@@ -361,10 +361,8 @@ static int smp_h6(struct crypto_shash *tfm_cmac, const u8 w[16],
  * s1 and ah.
  */
 
-static int smp_e(struct crypto_skcipher *tfm, const u8 *k, u8 *r)
+static int smp_e(struct crypto_cipher *tfm, const u8 *k, u8 *r)
 {
-	SKCIPHER_REQUEST_ON_STACK(req, tfm);
-	struct scatterlist sg;
 	uint8_t tmp[16], data[16];
 	int err;
 
@@ -378,7 +376,7 @@ static int smp_e(struct crypto_skcipher *tfm, const u8 *k, u8 *r)
 	/* The most significant octet of key corresponds to k[0] */
 	swap_buf(k, tmp, 16);
 
-	err = crypto_skcipher_setkey(tfm, tmp, 16);
+	err = crypto_cipher_setkey(tfm, tmp, 16);
 	if (err) {
 		BT_ERR("cipher setkey failed: %d", err);
 		return err;
@@ -387,16 +385,7 @@ static int smp_e(struct crypto_skcipher *tfm, const u8 *k, u8 *r)
 	/* Most significant octet of plaintextData corresponds to data[0] */
 	swap_buf(r, data, 16);
 
-	sg_init_one(&sg, data, 16);
-
-	skcipher_request_set_tfm(req, tfm);
-	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg, &sg, 16, NULL);
-
-	err = crypto_skcipher_encrypt(req);
-	skcipher_request_zero(req);
-	if (err)
-		BT_ERR("Encrypt data error %d", err);
+	crypto_cipher_encrypt_one(tfm, data, data);
 
 	/* Most significant octet of encryptedData corresponds to data[0] */
 	swap_buf(data, r, 16);
@@ -406,7 +395,7 @@ static int smp_e(struct crypto_skcipher *tfm, const u8 *k, u8 *r)
 	return err;
 }
 
-static int smp_c1(struct crypto_skcipher *tfm_aes, const u8 k[16],
+static int smp_c1(struct crypto_cipher *tfm_aes, const u8 k[16],
 		  const u8 r[16], const u8 preq[7], const u8 pres[7], u8 _iat,
 		  const bdaddr_t *ia, u8 _rat, const bdaddr_t *ra, u8 res[16])
 {
@@ -455,7 +444,7 @@ static int smp_c1(struct crypto_skcipher *tfm_aes, const u8 k[16],
 	return err;
 }
 
-static int smp_s1(struct crypto_skcipher *tfm_aes, const u8 k[16],
+static int smp_s1(struct crypto_cipher *tfm_aes, const u8 k[16],
 		  const u8 r1[16], const u8 r2[16], u8 _r[16])
 {
 	int err;
@@ -471,7 +460,7 @@ static int smp_s1(struct crypto_skcipher *tfm_aes, const u8 k[16],
 	return err;
 }
 
-static int smp_ah(struct crypto_skcipher *tfm, const u8 irk[16],
+static int smp_ah(struct crypto_cipher *tfm, const u8 irk[16],
 		  const u8 r[3], u8 res[3])
 {
 	u8 _res[16];
@@ -759,7 +748,7 @@ static void smp_chan_destroy(struct l2cap_conn *conn)
 	kzfree(smp->slave_csrk);
 	kzfree(smp->link_key);
 
-	crypto_free_skcipher(smp->tfm_aes);
+	crypto_free_cipher(smp->tfm_aes);
 	crypto_free_shash(smp->tfm_cmac);
 
 	/* Ensure that we don't leave any debug key around if debug key
@@ -1359,9 +1348,9 @@ static struct smp_chan *smp_chan_create(struct l2cap_conn *conn)
 	if (!smp)
 		return NULL;
 
-	smp->tfm_aes = crypto_alloc_skcipher("ecb(aes)", 0, CRYPTO_ALG_ASYNC);
+	smp->tfm_aes = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(smp->tfm_aes)) {
-		BT_ERR("Unable to create ECB crypto context");
+		BT_ERR("Unable to create AES crypto context");
 		kzfree(smp);
 		return NULL;
 	}
@@ -1369,7 +1358,7 @@ static struct smp_chan *smp_chan_create(struct l2cap_conn *conn)
 	smp->tfm_cmac = crypto_alloc_shash("cmac(aes)", 0, 0);
 	if (IS_ERR(smp->tfm_cmac)) {
 		BT_ERR("Unable to create CMAC crypto context");
-		crypto_free_skcipher(smp->tfm_aes);
+		crypto_free_cipher(smp->tfm_aes);
 		kzfree(smp);
 		return NULL;
 	}
@@ -3120,7 +3109,7 @@ static struct l2cap_chan *smp_add_cid(struct hci_dev *hdev, u16 cid)
 {
 	struct l2cap_chan *chan;
 	struct smp_dev *smp;
-	struct crypto_skcipher *tfm_aes;
+	struct crypto_cipher *tfm_aes;
 	struct crypto_shash *tfm_cmac;
 
 	if (cid == L2CAP_CID_SMP_BREDR) {
@@ -3132,9 +3121,9 @@ static struct l2cap_chan *smp_add_cid(struct hci_dev *hdev, u16 cid)
 	if (!smp)
 		return ERR_PTR(-ENOMEM);
 
-	tfm_aes = crypto_alloc_skcipher("ecb(aes)", 0, CRYPTO_ALG_ASYNC);
+	tfm_aes = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(tfm_aes)) {
-		BT_ERR("Unable to create ECB crypto context");
+		BT_ERR("Unable to create AES crypto context");
 		kzfree(smp);
 		return ERR_CAST(tfm_aes);
 	}
@@ -3142,7 +3131,7 @@ static struct l2cap_chan *smp_add_cid(struct hci_dev *hdev, u16 cid)
 	tfm_cmac = crypto_alloc_shash("cmac(aes)", 0, 0);
 	if (IS_ERR(tfm_cmac)) {
 		BT_ERR("Unable to create CMAC crypto context");
-		crypto_free_skcipher(tfm_aes);
+		crypto_free_cipher(tfm_aes);
 		kzfree(smp);
 		return ERR_CAST(tfm_cmac);
 	}
@@ -3156,7 +3145,7 @@ create_chan:
 	chan = l2cap_chan_create();
 	if (!chan) {
 		if (smp) {
-			crypto_free_skcipher(smp->tfm_aes);
+			crypto_free_cipher(smp->tfm_aes);
 			crypto_free_shash(smp->tfm_cmac);
 			kzfree(smp);
 		}
@@ -3203,7 +3192,7 @@ static void smp_del_chan(struct l2cap_chan *chan)
 	smp = chan->data;
 	if (smp) {
 		chan->data = NULL;
-		crypto_free_skcipher(smp->tfm_aes);
+		crypto_free_cipher(smp->tfm_aes);
 		crypto_free_shash(smp->tfm_cmac);
 		kzfree(smp);
 	}
@@ -3440,7 +3429,7 @@ void smp_unregister(struct hci_dev *hdev)
 
 #if IS_ENABLED(CONFIG_BT_SELFTEST_SMP)
 
-static int __init test_ah(struct crypto_skcipher *tfm_aes)
+static int __init test_ah(struct crypto_cipher *tfm_aes)
 {
 	const u8 irk[16] = {
 			0x9b, 0x7d, 0x39, 0x0a, 0xa6, 0x10, 0x10, 0x34,
@@ -3460,7 +3449,7 @@ static int __init test_ah(struct crypto_skcipher *tfm_aes)
 	return 0;
 }
 
-static int __init test_c1(struct crypto_skcipher *tfm_aes)
+static int __init test_c1(struct crypto_cipher *tfm_aes)
 {
 	const u8 k[16] = {
 			0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
@@ -3490,7 +3479,7 @@ static int __init test_c1(struct crypto_skcipher *tfm_aes)
 	return 0;
 }
 
-static int __init test_s1(struct crypto_skcipher *tfm_aes)
+static int __init test_s1(struct crypto_cipher *tfm_aes)
 {
 	const u8 k[16] = {
 			0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
@@ -3686,7 +3675,7 @@ static const struct file_operations test_smp_fops = {
 	.llseek		= default_llseek,
 };
 
-static int __init run_selftests(struct crypto_skcipher *tfm_aes,
+static int __init run_selftests(struct crypto_cipher *tfm_aes,
 				struct crypto_shash *tfm_cmac)
 {
 	ktime_t calltime, delta, rettime;
@@ -3764,27 +3753,27 @@ done:
 
 int __init bt_selftest_smp(void)
 {
-	struct crypto_skcipher *tfm_aes;
+	struct crypto_cipher *tfm_aes;
 	struct crypto_shash *tfm_cmac;
 	int err;
 
-	tfm_aes = crypto_alloc_skcipher("ecb(aes)", 0, CRYPTO_ALG_ASYNC);
+	tfm_aes = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(tfm_aes)) {
-		BT_ERR("Unable to create ECB crypto context");
+		BT_ERR("Unable to create AES crypto context");
 		return PTR_ERR(tfm_aes);
 	}
 
 	tfm_cmac = crypto_alloc_shash("cmac(aes)", 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(tfm_cmac)) {
 		BT_ERR("Unable to create CMAC crypto context");
-		crypto_free_skcipher(tfm_aes);
+		crypto_free_cipher(tfm_aes);
 		return PTR_ERR(tfm_cmac);
 	}
 
 	err = run_selftests(tfm_aes, tfm_cmac);
 
 	crypto_free_shash(tfm_cmac);
-	crypto_free_skcipher(tfm_aes);
+	crypto_free_cipher(tfm_aes);
 
 	return err;
 }
-- 
2.5.5


* [PATCH v4 02/16] rxrpc: Avoid using stack memory in SG lists in rxkad
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
  2016-06-24  4:22 ` [PATCH v4 01/16] bluetooth: Switch SMP to crypto_cipher_encrypt_one() Andy Lutomirski
@ 2016-06-24  4:22 ` Andy Lutomirski
  2016-06-24  4:22 ` [PATCH v4 03/16] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Andy Lutomirski
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:22 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Herbert Xu, Andy Lutomirski

From: Herbert Xu <herbert@gondor.apana.org.au>

rxkad uses stack memory in SG lists which would not work if stacks
were allocated from vmalloc memory.  In fact, in most cases this
isn't even necessary as the stack memory ends up getting copied
over to kmalloc memory.

This patch eliminates all the unnecessary stack memory uses by
supplying the final destination directly to the crypto API.  In
two instances where a temporary buffer is actually needed we also
switch to using the skb->cb area instead of the stack.

Finally, there is no longer any need to split a buffer that crosses a
page boundary into two SG entries, so the code dealing with that has
been removed.
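
The shape of the conversion for the level-1 header case, as a sketch
(illustrative; see the hunks below for the real code): the header is
built directly in its final destination and a single SG entry points
at that non-stack memory, so the crypto API never sees a stack
address.

	struct rxkad_level1_hdr hdr;
	struct scatterlist sg;

	hdr.data_size = htonl(data_size);
	memcpy(sechdr, &hdr, sizeof(hdr));	/* header lands in its final home */

	sg_init_one(&sg, sechdr, 8);		/* SG entry points at non-stack memory */
	skcipher_request_set_crypt(req, &sg, &sg, 8, iv.x);
	crypto_skcipher_encrypt(req);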

Message-Id: <20160623064137.GA8958@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 net/rxrpc/ar-internal.h |   1 +
 net/rxrpc/rxkad.c       | 103 ++++++++++++++++++++----------------------------
 2 files changed, 44 insertions(+), 60 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index f0b807a163fa..8ee5933982f3 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -277,6 +277,7 @@ struct rxrpc_connection {
 	struct key		*key;		/* security for this connection (client) */
 	struct key		*server_key;	/* security for this service */
 	struct crypto_skcipher	*cipher;	/* encryption handle */
+	struct rxrpc_crypt	csum_iv_head;	/* leading block for csum_iv */
 	struct rxrpc_crypt	csum_iv;	/* packet checksum base */
 	unsigned long		events;
 #define RXRPC_CONN_CHALLENGE	0		/* send challenge packet */
diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index bab56ed649ba..a28a3c6fdf1d 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -105,11 +105,9 @@ static void rxkad_prime_packet_security(struct rxrpc_connection *conn)
 {
 	struct rxrpc_key_token *token;
 	SKCIPHER_REQUEST_ON_STACK(req, conn->cipher);
-	struct scatterlist sg[2];
+	struct rxrpc_crypt *csum_iv;
+	struct scatterlist sg;
 	struct rxrpc_crypt iv;
-	struct {
-		__be32 x[4];
-	} tmpbuf __attribute__((aligned(16))); /* must all be in same page */
 
 	_enter("");
 
@@ -119,24 +117,21 @@ static void rxkad_prime_packet_security(struct rxrpc_connection *conn)
 	token = conn->key->payload.data[0];
 	memcpy(&iv, token->kad->session_key, sizeof(iv));
 
-	tmpbuf.x[0] = htonl(conn->epoch);
-	tmpbuf.x[1] = htonl(conn->cid);
-	tmpbuf.x[2] = 0;
-	tmpbuf.x[3] = htonl(conn->security_ix);
+	csum_iv = &conn->csum_iv_head;
+	csum_iv[0].x[0] = htonl(conn->epoch);
+	csum_iv[0].x[1] = htonl(conn->cid);
+	csum_iv[1].x[0] = 0;
+	csum_iv[1].x[1] = htonl(conn->security_ix);
 
-	sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-	sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+	sg_init_one(&sg, csum_iv, 16);
 
 	skcipher_request_set_tfm(req, conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
+	skcipher_request_set_crypt(req, &sg, &sg, 16, iv.x);
 
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 
-	memcpy(&conn->csum_iv, &tmpbuf.x[2], sizeof(conn->csum_iv));
-	ASSERTCMP((u32 __force)conn->csum_iv.n[0], ==, (u32 __force)tmpbuf.x[2]);
-
 	_leave("");
 }
 
@@ -150,12 +145,9 @@ static int rxkad_secure_packet_auth(const struct rxrpc_call *call,
 {
 	struct rxrpc_skb_priv *sp;
 	SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
+	struct rxkad_level1_hdr hdr;
 	struct rxrpc_crypt iv;
-	struct scatterlist sg[2];
-	struct {
-		struct rxkad_level1_hdr hdr;
-		__be32	first;	/* first four bytes of data and padding */
-	} tmpbuf __attribute__((aligned(8))); /* must all be in same page */
+	struct scatterlist sg;
 	u16 check;
 
 	sp = rxrpc_skb(skb);
@@ -165,24 +157,21 @@ static int rxkad_secure_packet_auth(const struct rxrpc_call *call,
 	check = sp->hdr.seq ^ sp->hdr.callNumber;
 	data_size |= (u32)check << 16;
 
-	tmpbuf.hdr.data_size = htonl(data_size);
-	memcpy(&tmpbuf.first, sechdr + 4, sizeof(tmpbuf.first));
+	hdr.data_size = htonl(data_size);
+	memcpy(sechdr, &hdr, sizeof(hdr));
 
 	/* start the encryption afresh */
 	memset(&iv, 0, sizeof(iv));
 
-	sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-	sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+	sg_init_one(&sg, sechdr, 8);
 
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
+	skcipher_request_set_crypt(req, &sg, &sg, 8, iv.x);
 
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 
-	memcpy(sechdr, &tmpbuf, sizeof(tmpbuf));
-
 	_leave(" = 0");
 	return 0;
 }
@@ -196,8 +185,7 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 				       void *sechdr)
 {
 	const struct rxrpc_key_token *token;
-	struct rxkad_level2_hdr rxkhdr
-		__attribute__((aligned(8))); /* must be all on one page */
+	struct rxkad_level2_hdr rxkhdr;
 	struct rxrpc_skb_priv *sp;
 	SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
 	struct rxrpc_crypt iv;
@@ -216,17 +204,17 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 
 	rxkhdr.data_size = htonl(data_size | (u32)check << 16);
 	rxkhdr.checksum = 0;
+	memcpy(sechdr, &rxkhdr, sizeof(rxkhdr));
 
 	/* encrypt from the session key */
 	token = call->conn->key->payload.data[0];
 	memcpy(&iv, token->kad->session_key, sizeof(iv));
 
 	sg_init_one(&sg[0], sechdr, sizeof(rxkhdr));
-	sg_init_one(&sg[1], &rxkhdr, sizeof(rxkhdr));
 
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(rxkhdr), iv.x);
+	skcipher_request_set_crypt(req, &sg[0], &sg[0], sizeof(rxkhdr), iv.x);
 
 	crypto_skcipher_encrypt(req);
 
@@ -265,10 +253,11 @@ static int rxkad_secure_packet(const struct rxrpc_call *call,
 	struct rxrpc_skb_priv *sp;
 	SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
 	struct rxrpc_crypt iv;
-	struct scatterlist sg[2];
-	struct {
+	struct scatterlist sg;
+	union {
 		__be32 x[2];
-	} tmpbuf __attribute__((aligned(8))); /* must all be in same page */
+		__be64 xl;
+	} tmpbuf;
 	u32 x, y;
 	int ret;
 
@@ -294,16 +283,19 @@ static int rxkad_secure_packet(const struct rxrpc_call *call,
 	tmpbuf.x[0] = htonl(sp->hdr.callNumber);
 	tmpbuf.x[1] = htonl(x);
 
-	sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-	sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+	swap(tmpbuf.xl, *(__be64 *)sp);
+
+	sg_init_one(&sg, sp, sizeof(tmpbuf));
 
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
+	skcipher_request_set_crypt(req, &sg, &sg, sizeof(tmpbuf), iv.x);
 
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 
+	swap(tmpbuf.xl, *(__be64 *)sp);
+
 	y = ntohl(tmpbuf.x[1]);
 	y = (y >> 16) & 0xffff;
 	if (y == 0)
@@ -503,10 +495,11 @@ static int rxkad_verify_packet(const struct rxrpc_call *call,
 	SKCIPHER_REQUEST_ON_STACK(req, call->conn->cipher);
 	struct rxrpc_skb_priv *sp;
 	struct rxrpc_crypt iv;
-	struct scatterlist sg[2];
-	struct {
+	struct scatterlist sg;
+	union {
 		__be32 x[2];
-	} tmpbuf __attribute__((aligned(8))); /* must all be in same page */
+		__be64 xl;
+	} tmpbuf;
 	u16 cksum;
 	u32 x, y;
 	int ret;
@@ -534,16 +527,19 @@ static int rxkad_verify_packet(const struct rxrpc_call *call,
 	tmpbuf.x[0] = htonl(call->call_id);
 	tmpbuf.x[1] = htonl(x);
 
-	sg_init_one(&sg[0], &tmpbuf, sizeof(tmpbuf));
-	sg_init_one(&sg[1], &tmpbuf, sizeof(tmpbuf));
+	swap(tmpbuf.xl, *(__be64 *)sp);
+
+	sg_init_one(&sg, sp, sizeof(tmpbuf));
 
 	skcipher_request_set_tfm(req, call->conn->cipher);
 	skcipher_request_set_callback(req, 0, NULL, NULL);
-	skcipher_request_set_crypt(req, &sg[1], &sg[0], sizeof(tmpbuf), iv.x);
+	skcipher_request_set_crypt(req, &sg, &sg, sizeof(tmpbuf), iv.x);
 
 	crypto_skcipher_encrypt(req);
 	skcipher_request_zero(req);
 
+	swap(tmpbuf.xl, *(__be64 *)sp);
+
 	y = ntohl(tmpbuf.x[1]);
 	cksum = (y >> 16) & 0xffff;
 	if (cksum == 0)
@@ -708,26 +704,13 @@ static void rxkad_calc_response_checksum(struct rxkad_response *response)
 }
 
 /*
- * load a scatterlist with a potentially split-page buffer
+ * load a scatterlist
  */
-static void rxkad_sg_set_buf2(struct scatterlist sg[2],
+static void rxkad_sg_set_buf2(struct scatterlist sg[1],
 			      void *buf, size_t buflen)
 {
-	int nsg = 1;
-
-	sg_init_table(sg, 2);
-
+	sg_init_table(sg, 1);
 	sg_set_buf(&sg[0], buf, buflen);
-	if (sg[0].offset + buflen > PAGE_SIZE) {
-		/* the buffer was split over two pages */
-		sg[0].length = PAGE_SIZE - sg[0].offset;
-		sg_set_buf(&sg[1], buf + sg[0].length, buflen - sg[0].length);
-		nsg++;
-	}
-
-	sg_mark_end(&sg[nsg - 1]);
-
-	ASSERTCMP(sg[0].length + sg[1].length, ==, buflen);
 }
 
 /*
@@ -739,7 +722,7 @@ static void rxkad_encrypt_response(struct rxrpc_connection *conn,
 {
 	SKCIPHER_REQUEST_ON_STACK(req, conn->cipher);
 	struct rxrpc_crypt iv;
-	struct scatterlist sg[2];
+	struct scatterlist sg[1];
 
 	/* continue encrypting from where we left off */
 	memcpy(&iv, s2->session_key, sizeof(iv));
@@ -999,7 +982,7 @@ static void rxkad_decrypt_response(struct rxrpc_connection *conn,
 				   const struct rxrpc_crypt *session_key)
 {
 	SKCIPHER_REQUEST_ON_STACK(req, rxkad_ci);
-	struct scatterlist sg[2];
+	struct scatterlist sg[1];
 	struct rxrpc_crypt iv;
 
 	_enter(",,%08x%08x",
-- 
2.5.5


* [PATCH v4 03/16] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
  2016-06-24  4:22 ` [PATCH v4 01/16] bluetooth: Switch SMP to crypto_cipher_encrypt_one() Andy Lutomirski
  2016-06-24  4:22 ` [PATCH v4 02/16] rxrpc: Avoid using stack memory in SG lists in rxkad Andy Lutomirski
@ 2016-06-24  4:22 ` Andy Lutomirski
  2016-06-24  4:22 ` [PATCH v4 04/16] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated Andy Lutomirski
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:22 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Ingo Molnar, Andrew Morton, Andy Lutomirski,
	Denys Vlasenko, H. Peter Anvin, Oleg Nesterov, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Waiman Long, linux-mm

From: Ingo Molnar <mingo@kernel.org>

So when memory hotplug removes a piece of physical memory from pagetable
mappings, it also frees the underlying PGD entry.

This complicates PGD management, so don't do this. We can keep the
PGD mapped and the PUD table all clear - it's only a single 4K page
per 512 GB of memory hotplugged.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Message-Id: <1442903021-3893-4-git-send-email-mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bce2e5d9edd4..c7465453d64e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -702,27 +702,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 	spin_unlock(&init_mm.page_table_lock);
 }
 
-/* Return true if pgd is changed, otherwise return false. */
-static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
-{
-	pud_t *pud;
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++) {
-		pud = pud_start + i;
-		if (pud_val(*pud))
-			return false;
-	}
-
-	/* free a pud table */
-	free_pagetable(pgd_page(*pgd), 0);
-	spin_lock(&init_mm.page_table_lock);
-	pgd_clear(pgd);
-	spin_unlock(&init_mm.page_table_lock);
-
-	return true;
-}
-
 static void __meminit
 remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 		 bool direct)
@@ -913,7 +892,6 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	unsigned long addr;
 	pgd_t *pgd;
 	pud_t *pud;
-	bool pgd_changed = false;
 
 	for (addr = start; addr < end; addr = next) {
 		next = pgd_addr_end(addr, end);
@@ -924,13 +902,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 
 		pud = (pud_t *)pgd_page_vaddr(*pgd);
 		remove_pud_table(pud, addr, next, direct);
-		if (free_pud_table(pud, pgd))
-			pgd_changed = true;
 	}
 
-	if (pgd_changed)
-		sync_global_pgds(start, end - 1, 1);
-
 	flush_tlb_all();
 }
 
-- 
2.5.5


* [PATCH v4 04/16] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (2 preceding siblings ...)
  2016-06-24  4:22 ` [PATCH v4 03/16] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Andy Lutomirski
@ 2016-06-24  4:22 ` Andy Lutomirski
  2016-06-24  4:23 ` [PATCH v4 05/16] x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables() Andy Lutomirski
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:22 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

This avoids pointless races in which another CPU or task might see a
partially populated global pgd entry.  These races should normally
be harmless, but if another CPU propagates the entry via
vmalloc_fault and populate_pgd then fails (due to a memory allocation
failure, for example), this ordering prevents a use-after-free of the
pgd entry.
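
The rule being enforced, as a generic sketch (illustrative only, not
the actual hunk): allocate and fully populate the new page-table page
first, and only install it in the shared PGD once nothing can fail
any more, so the error path frees a page that was never visible to
other CPUs.

	pud_t *pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
	if (!pud)
		return -1;

	if (fill_pud_mappings(pud, addr, pgprot) < 0) {	/* hypothetical helper */
		free_page((unsigned long)pud);		/* never published, no race */
		return -1;
	}

	set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE));	/* publish last */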

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/pageattr.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 7a1f7bbf4105..6a8026918bf6 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1113,7 +1113,9 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr)
 
 	ret = populate_pud(cpa, addr, pgd_entry, pgprot);
 	if (ret < 0) {
-		unmap_pgd_range(cpa->pgd, addr,
+		if (pud)
+			free_page((unsigned long)pud);
+		unmap_pud_range(pgd_entry, addr,
 				addr + (cpa->numpages << PAGE_SHIFT));
 		return ret;
 	}
-- 
2.5.5


* [PATCH v4 05/16] x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables()
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (3 preceding siblings ...)
  2016-06-24  4:22 ` [PATCH v4 04/16] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated Andy Lutomirski
@ 2016-06-24  4:23 ` Andy Lutomirski
  2016-06-24  4:23 ` [PATCH v4 06/16] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks Andy Lutomirski
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:23 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski, Matt Fleming, linux-efi

kernel_unmap_pages_in_pgd() is dangerous: if a pgd entry in
init_mm.pgd were to be cleared, callers would need to ensure that
the pgd entry hadn't been propagated to any other pgd.

Its only caller was efi_cleanup_page_tables(), and that, in turn,
was unused, so just delete both functions.  This leaves a couple of
other helpers unused, so delete them, too.

Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: linux-efi@vger.kernel.org
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/efi.h           |  1 -
 arch/x86/include/asm/pgtable_types.h |  2 --
 arch/x86/mm/pageattr.c               | 28 ----------------------------
 arch/x86/platform/efi/efi.c          |  2 --
 arch/x86/platform/efi/efi_32.c       |  3 ---
 arch/x86/platform/efi/efi_64.c       |  5 -----
 6 files changed, 41 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 78d1e7467eae..45ea38df86d4 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -125,7 +125,6 @@ extern void __init efi_map_region_fixed(efi_memory_desc_t *md);
 extern void efi_sync_low_kernel_mappings(void);
 extern int __init efi_alloc_page_tables(void);
 extern int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages);
-extern void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages);
 extern void __init old_map_region(efi_memory_desc_t *md);
 extern void __init runtime_code_page_mkexec(void);
 extern void __init efi_runtime_update_mappings(void);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 7b5efe264eff..0b9f58ad10c8 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -475,8 +475,6 @@ extern pmd_t *lookup_pmd_address(unsigned long address);
 extern phys_addr_t slow_virt_to_phys(void *__address);
 extern int kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
 				   unsigned numpages, unsigned long page_flags);
-void kernel_unmap_pages_in_pgd(pgd_t *root, unsigned long address,
-			       unsigned numpages);
 #endif	/* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_DEFS_H */
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6a8026918bf6..762162af3662 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -746,18 +746,6 @@ static bool try_to_free_pmd_page(pmd_t *pmd)
 	return true;
 }
 
-static bool try_to_free_pud_page(pud_t *pud)
-{
-	int i;
-
-	for (i = 0; i < PTRS_PER_PUD; i++)
-		if (!pud_none(pud[i]))
-			return false;
-
-	free_page((unsigned long)pud);
-	return true;
-}
-
 static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end)
 {
 	pte_t *pte = pte_offset_kernel(pmd, start);
@@ -871,16 +859,6 @@ static void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end)
 	 */
 }
 
-static void unmap_pgd_range(pgd_t *root, unsigned long addr, unsigned long end)
-{
-	pgd_t *pgd_entry = root + pgd_index(addr);
-
-	unmap_pud_range(pgd_entry, addr, end);
-
-	if (try_to_free_pud_page((pud_t *)pgd_page_vaddr(*pgd_entry)))
-		pgd_clear(pgd_entry);
-}
-
 static int alloc_pte_page(pmd_t *pmd)
 {
 	pte_t *pte = (pte_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK);
@@ -1993,12 +1971,6 @@ out:
 	return retval;
 }
 
-void kernel_unmap_pages_in_pgd(pgd_t *root, unsigned long address,
-			       unsigned numpages)
-{
-	unmap_pgd_range(root, address, address + (numpages << PAGE_SHIFT));
-}
-
 /*
  * The testcases use internal knowledge of the implementation that shouldn't
  * be exposed to the rest of the kernel. Include these directly here.
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index f93545e7dc54..62986e5fbdba 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -978,8 +978,6 @@ static void __init __efi_enter_virtual_mode(void)
 	 * EFI mixed mode we need all of memory to be accessible when
 	 * we pass parameters to the EFI runtime services in the
 	 * thunking code.
-	 *
-	 * efi_cleanup_page_tables(__pa(new_memmap), 1 << pg_shift);
 	 */
 	free_pages((unsigned long)new_memmap, pg_shift);
 
diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
index 338402b91d2e..cef39b097649 100644
--- a/arch/x86/platform/efi/efi_32.c
+++ b/arch/x86/platform/efi/efi_32.c
@@ -49,9 +49,6 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
 	return 0;
 }
-void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages)
-{
-}
 
 void __init efi_map_region(efi_memory_desc_t *md)
 {
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 6e7242be1c87..5ab219c2ba43 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -285,11 +285,6 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	return 0;
 }
 
-void __init efi_cleanup_page_tables(unsigned long pa_memmap, unsigned num_pages)
-{
-	kernel_unmap_pages_in_pgd(efi_pgd, pa_memmap, num_pages);
-}
-
 static void __init __map_region(efi_memory_desc_t *md, u64 va)
 {
 	unsigned long flags = _PAGE_RW;
-- 
2.5.5


* [PATCH v4 06/16] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (4 preceding siblings ...)
  2016-06-24  4:23 ` [PATCH v4 05/16] x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables() Andy Lutomirski
@ 2016-06-24  4:23 ` Andy Lutomirski
  2016-06-24 15:21   ` Josh Poimboeuf
  2016-06-24  4:23 ` [PATCH v4 07/16] mm: Fix memcg stack accounting for sub-page stacks Andy Lutomirski
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:23 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, linux-mm

Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
zone.  This only makes sense if each kernel stack exists entirely in
one zone, and allowing vmapped stacks could break this assumption.

Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on
all architectures.  Keep it simple and use KiB.
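
As a worked example (illustrative values): with KiB units, a vmapped
16 KiB stack whose four pages might sit in different zones can be
accounted 4 KiB per page, and a sub-page 8 KiB stack can be accounted
8 KiB; neither is expressible exactly as a whole number of stacks per
zone.

	/* Per-task accounting becomes a KiB delta (sketch of the new call): */
	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
			    THREAD_SIZE / 1024 * account);	/* account is +1 or -1 */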

Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/base/node.c    | 3 +--
 fs/proc/meminfo.c      | 2 +-
 include/linux/mmzone.h | 2 +-
 kernel/fork.c          | 3 ++-
 mm/page_alloc.c        | 3 +--
 5 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..27dc68a0ed2d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -121,8 +121,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
 		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
 		       nid, K(i.sharedram),
-		       nid, node_page_state(nid, NR_KERNEL_STACK) *
-				THREAD_SIZE / 1024,
+		       nid, node_page_state(nid, NR_KERNEL_STACK_KB),
 		       nid, K(node_page_state(nid, NR_PAGETABLE)),
 		       nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
 		       nid, K(node_page_state(nid, NR_BOUNCE)),
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 83720460c5bc..239b5a06cee0 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -145,7 +145,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 				global_page_state(NR_SLAB_UNRECLAIMABLE)),
 		K(global_page_state(NR_SLAB_RECLAIMABLE)),
 		K(global_page_state(NR_SLAB_UNRECLAIMABLE)),
-		global_page_state(NR_KERNEL_STACK) * THREAD_SIZE / 1024,
+		global_page_state(NR_KERNEL_STACK_KB),
 		K(global_page_state(NR_PAGETABLE)),
 #ifdef CONFIG_QUICKLIST
 		K(quicklist_total_size()),
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02069c23486d..63f05a7efb54 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -127,7 +127,7 @@ enum zone_stat_item {
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
 	NR_PAGETABLE,		/* used for pagetables */
-	NR_KERNEL_STACK,
+	NR_KERNEL_STACK_KB,	/* measured in KiB */
 	/* Second 128 byte cacheline */
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
diff --git a/kernel/fork.c b/kernel/fork.c
index 5c2c355aa97f..be7f006af727 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -225,7 +225,8 @@ static void account_kernel_stack(struct thread_info *ti, int account)
 {
 	struct zone *zone = page_zone(virt_to_page(ti));
 
-	mod_zone_page_state(zone, NR_KERNEL_STACK, account);
+	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
+			    THREAD_SIZE / 1024 * account);
 }
 
 void free_task(struct task_struct *tsk)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6903b695ebae..a277dea926c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4457,8 +4457,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_SHMEM)),
 			K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
 			K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
-			zone_page_state(zone, NR_KERNEL_STACK) *
-				THREAD_SIZE / 1024,
+			zone_page_state(zone, NR_KERNEL_STACK_KB),
 			K(zone_page_state(zone, NR_PAGETABLE)),
 			K(zone_page_state(zone, NR_UNSTABLE_NFS)),
 			K(zone_page_state(zone, NR_BOUNCE)),
-- 
2.5.5


* [PATCH v4 07/16] mm: Fix memcg stack accounting for sub-page stacks
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (5 preceding siblings ...)
  2016-06-24  4:23 ` [PATCH v4 06/16] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks Andy Lutomirski
@ 2016-06-24  4:23 ` Andy Lutomirski
  2016-06-24 15:22   ` Josh Poimboeuf
  2016-06-24  4:23 ` [PATCH v4 08/16] dma-api: Teach the "DMA-from-stack" check about vmapped stacks Andy Lutomirski
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:23 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, linux-mm

We should account for stacks regardless of stack size, and we need
to account in sub-page units if THREAD_SIZE < PAGE_SIZE.  Change the
units to kilobytes and move the accounting into account_kernel_stack().

Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/memcontrol.h |  2 +-
 kernel/fork.c              | 15 ++++++---------
 mm/memcontrol.c            |  2 +-
 3 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a805474df4ab..3b653b86bb8f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -52,7 +52,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,
 	/* default hierarchy stats */
-	MEMCG_KERNEL_STACK = MEM_CGROUP_STAT_NSTATS,
+	MEMCG_KERNEL_STACK_KB = MEM_CGROUP_STAT_NSTATS,
 	MEMCG_SLAB_RECLAIMABLE,
 	MEMCG_SLAB_UNRECLAIMABLE,
 	MEMCG_SOCK,
diff --git a/kernel/fork.c b/kernel/fork.c
index be7f006af727..ff3c41c2ba96 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -165,20 +165,12 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
 						  THREAD_SIZE_ORDER);
 
-	if (page)
-		memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
-					    1 << THREAD_SIZE_ORDER);
-
 	return page ? page_address(page) : NULL;
 }
 
 static inline void free_thread_info(struct thread_info *ti)
 {
-	struct page *page = virt_to_page(ti);
-
-	memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
-				    -(1 << THREAD_SIZE_ORDER));
-	__free_kmem_pages(page, THREAD_SIZE_ORDER);
+	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
@@ -227,6 +219,11 @@ static void account_kernel_stack(struct thread_info *ti, int account)
 
 	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
 			    THREAD_SIZE / 1024 * account);
+
+	/* All stack pages belong to the same memcg. */
+	memcg_kmem_update_page_stat(
+		virt_to_page(ti), MEMCG_KERNEL_STACK_KB,
+		account * (THREAD_SIZE / 1024));
 }
 
 void free_task(struct task_struct *tsk)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75e74408cc8f..8e13a2419dad 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5133,7 +5133,7 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	seq_printf(m, "file %llu\n",
 		   (u64)stat[MEM_CGROUP_STAT_CACHE] * PAGE_SIZE);
 	seq_printf(m, "kernel_stack %llu\n",
-		   (u64)stat[MEMCG_KERNEL_STACK] * PAGE_SIZE);
+		   (u64)stat[MEMCG_KERNEL_STACK_KB] * 1024);
 	seq_printf(m, "slab %llu\n",
 		   (u64)(stat[MEMCG_SLAB_RECLAIMABLE] +
 			 stat[MEMCG_SLAB_UNRECLAIMABLE]) * PAGE_SIZE);
-- 
2.5.5


* [PATCH v4 08/16] dma-api: Teach the "DMA-from-stack" check about vmapped stacks
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (6 preceding siblings ...)
  2016-06-24  4:23 ` [PATCH v4 07/16] mm: Fix memcg stack accounting for sub-page stacks Andy Lutomirski
@ 2016-06-24  4:23 ` Andy Lutomirski
  2016-06-24  4:23 ` [PATCH v4 09/16] fork: Add generic vmalloced stack support Andy Lutomirski
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:23 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski, Andrew Morton, Arnd Bergmann

If we're using CONFIG_VMAP_STACK and we manage to point an sg entry
at the stack, then either the sg page will be in highmem or sg_virt
will return the direct-map alias.  In neither case will the existing
check_for_stack() implementation realize that it's a stack page.

Fix it by explicitly checking for stack pages.

This has no effect by itself.  It's broken out for ease of review.
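
As an example of what the check should now catch (hypothetical driver
code, not part of the patch): mapping an on-stack buffer for DMA
triggers the "maps memory from stack" warning even when the stack
pages come from vmalloc space.

	u8 buf[64];			/* on the (possibly vmapped) stack */
	dma_addr_t handle;

	handle = dma_map_single(dev, buf, sizeof(buf), DMA_TO_DEVICE);
	/*
	 * debug_dma_map_page() -> check_for_stack() now compares the page
	 * against current's stack_vm_area pages and warns.
	 */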

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 lib/dma-debug.c | 39 +++++++++++++++++++++++++++++++++------
 1 file changed, 33 insertions(+), 6 deletions(-)

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index 51a76af25c66..5b2e63cba90e 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -22,6 +22,7 @@
 #include <linux/stacktrace.h>
 #include <linux/dma-debug.h>
 #include <linux/spinlock.h>
+#include <linux/vmalloc.h>
 #include <linux/debugfs.h>
 #include <linux/uaccess.h>
 #include <linux/export.h>
@@ -1162,11 +1163,35 @@ static void check_unmap(struct dma_debug_entry *ref)
 	put_hash_bucket(bucket, &flags);
 }
 
-static void check_for_stack(struct device *dev, void *addr)
+static void check_for_stack(struct device *dev,
+			    struct page *page, size_t offset)
 {
-	if (object_is_on_stack(addr))
-		err_printk(dev, NULL, "DMA-API: device driver maps memory from "
-				"stack [addr=%p]\n", addr);
+	void *addr;
+	struct vm_struct *stack_vm_area = task_stack_vm_area(current);
+
+	if (!stack_vm_area) {
+		/* Stack is direct-mapped. */
+		if (PageHighMem(page))
+			return;
+		addr = page_address(page) + offset;
+		if (object_is_on_stack(addr))
+			err_printk(dev, NULL, "DMA-API: device driver maps memory from stack [addr=%p]\n",
+				   addr);
+	} else {
+		/* Stack is vmalloced. */
+		int i;
+
+		for (i = 0; i < stack_vm_area->nr_pages; i++) {
+			if (page != stack_vm_area->pages[i])
+				continue;
+
+			addr = (u8 *)current->stack + i * PAGE_SIZE +
+				offset;
+			err_printk(dev, NULL, "DMA-API: device driver maps memory from stack [probable addr=%p]\n",
+				   addr);
+			break;
+		}
+	}
 }
 
 static inline bool overlap(void *addr, unsigned long len, void *start, void *end)
@@ -1289,10 +1314,11 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
 	if (map_single)
 		entry->type = dma_debug_single;
 
+	check_for_stack(dev, page, offset);
+
 	if (!PageHighMem(page)) {
 		void *addr = page_address(page) + offset;
 
-		check_for_stack(dev, addr);
 		check_for_illegal_area(dev, addr, size);
 	}
 
@@ -1384,8 +1410,9 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		entry->sg_call_ents   = nents;
 		entry->sg_mapped_ents = mapped_ents;
 
+		check_for_stack(dev, sg_page(s), s->offset);
+
 		if (!PageHighMem(sg_page(s))) {
-			check_for_stack(dev, sg_virt(s));
 			check_for_illegal_area(dev, sg_virt(s), sg_dma_len(s));
 		}
 
-- 
2.5.5


* [PATCH v4 09/16] fork: Add generic vmalloced stack support
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (7 preceding siblings ...)
  2016-06-24  4:23 ` [PATCH v4 08/16] dma-api: Teach the "DMA-from-stack" check about vmapped stacks Andy Lutomirski
@ 2016-06-24  4:23 ` Andy Lutomirski
  2016-06-24  4:23 ` [PATCH v4 10/16] x86/die: Don't try to recover from an OOPS on a non-default stack Andy Lutomirski
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:23 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
vmalloc_node.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/Kconfig                        | 29 +++++++++++++
 arch/ia64/include/asm/thread_info.h |  2 +-
 include/linux/sched.h               | 15 +++++++
 kernel/fork.c                       | 82 +++++++++++++++++++++++++++++--------
 4 files changed, 110 insertions(+), 18 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index e9734796531f..835eeef0f14d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -661,4 +661,33 @@ config ARCH_NO_COHERENT_DMA_MMAP
 config CPU_NO_EFFICIENT_FFS
 	def_bool n
 
+config HAVE_ARCH_VMAP_STACK
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stacks
+	  in vmalloc space.  This means:
+
+	  - vmalloc space must be large enough to hold many kernel stacks.
+	    This may rule out many 32-bit architectures.
+
+	  - Stacks in vmalloc space need to work reliably.  For example, if
+	    vmap page tables are created on demand, either this mechanism
+	    needs to work while the stack points to a virtual address with
+	    unpopulated page tables or arch code (switch_to and switch_mm,
+	    most likely) needs to ensure that the stack's page table entries
+	    are populated before running on a possibly unpopulated stack.
+
+	  - If the stack overflows into a guard page, something reasonable
+	    should happen.  The definition of "reasonable" is flexible, but
+	    instantly rebooting without logging anything would be unfriendly.
+
+config VMAP_STACK
+	bool "Use a virtually-mapped stack"
+	depends on HAVE_ARCH_VMAP_STACK
+	---help---
+	  Enable this if you want to use virtually-mapped kernel stacks
+	  with guard pages.  This causes kernel stack overflows to be
+	  caught immediately rather than causing difficult-to-diagnose
+	  corruption.
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index aa995b67c3f5..d13edda6e09c 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -56,7 +56,7 @@ struct thread_info {
 #define alloc_thread_info_node(tsk, node)	((struct thread_info *) 0)
 #define task_thread_info(tsk)	((struct thread_info *) 0)
 #endif
-#define free_thread_info(ti)	/* nothing */
+#define free_thread_info(tsk)	/* nothing */
 #define task_stack_page(tsk)	((void *)(tsk))
 
 #define __HAVE_THREAD_FUNCTIONS
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada26345..a37c3b790309 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1918,6 +1918,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_VMAP_STACK
+	struct vm_struct *stack_vm_area;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -1934,6 +1937,18 @@ extern int arch_task_struct_size __read_mostly;
 # define arch_task_struct_size (sizeof(struct task_struct))
 #endif
 
+#ifdef CONFIG_VMAP_STACK
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return t->stack_vm_area;
+}
+#else
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return NULL;
+}
+#endif
+
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
diff --git a/kernel/fork.c b/kernel/fork.c
index ff3c41c2ba96..fe1c785e5f8c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -158,19 +158,38 @@ void __weak arch_release_thread_info(struct thread_info *ti)
  * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
  * kmemcache based allocator.
  */
-# if THREAD_SIZE >= PAGE_SIZE
+# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
 static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 						  int node)
 {
+#ifdef CONFIG_VMAP_STACK
+	struct thread_info *ti = __vmalloc_node_range(
+		THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
+		THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
+		0, node, __builtin_return_address(0));
+
+	/*
+	 * We can't call find_vm_area() in interrupt context, and
+	 * free_thread_info can be called in interrupt context, so cache
+	 * the vm_struct.
+	 */
+	if (ti)
+		tsk->stack_vm_area = find_vm_area(ti);
+	return ti;
+#else
 	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
 						  THREAD_SIZE_ORDER);
 
 	return page ? page_address(page) : NULL;
+#endif
 }
 
-static inline void free_thread_info(struct thread_info *ti)
+static inline void free_thread_info(struct task_struct *tsk)
 {
-	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+	if (task_stack_vm_area(tsk))
+		vfree(tsk->stack);
+	else
+		free_kmem_pages((unsigned long)tsk->stack, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
@@ -181,9 +200,9 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 	return kmem_cache_alloc_node(thread_info_cache, THREADINFO_GFP, node);
 }
 
-static void free_thread_info(struct thread_info *ti)
+static void free_thread_info(struct task_struct *tsk)
 {
-	kmem_cache_free(thread_info_cache, ti);
+	kmem_cache_free(thread_info_cache, tsk->stack);
 }
 
 void thread_info_cache_init(void)
@@ -213,24 +232,47 @@ struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-static void account_kernel_stack(struct thread_info *ti, int account)
+static void account_kernel_stack(struct task_struct *tsk, int account)
 {
-	struct zone *zone = page_zone(virt_to_page(ti));
+	struct zone *zone;
+	struct thread_info *ti = task_thread_info(tsk);
+	struct vm_struct *vm = task_stack_vm_area(tsk);
+
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
+
+	if (vm) {
+		int i;
 
-	mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
-			    THREAD_SIZE / 1024 * account);
+		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
 
-	/* All stack pages belong to the same memcg. */
-	memcg_kmem_update_page_stat(
-		virt_to_page(ti), MEMCG_KERNEL_STACK_KB,
-		account * (THREAD_SIZE / 1024));
+		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+			mod_zone_page_state(page_zone(vm->pages[i]),
+					    NR_KERNEL_STACK_KB,
+					    PAGE_SIZE / 1024 * account);
+		}
+
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(
+			vm->pages[0], MEMCG_KERNEL_STACK_KB,
+			account * (THREAD_SIZE / 1024));
+	} else {
+		zone = page_zone(virt_to_page(ti));
+
+		mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
+				    THREAD_SIZE / 1024 * account);
+
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(
+			virt_to_page(ti), MEMCG_KERNEL_STACK_KB,
+			account * (THREAD_SIZE / 1024));
+	}
 }
 
 void free_task(struct task_struct *tsk)
 {
-	account_kernel_stack(tsk->stack, -1);
+	account_kernel_stack(tsk, -1);
 	arch_release_thread_info(tsk->stack);
-	free_thread_info(tsk->stack);
+	free_thread_info(tsk);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
 	put_seccomp_filter(tsk);
@@ -342,6 +384,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 {
 	struct task_struct *tsk;
 	struct thread_info *ti;
+	struct vm_struct *stack_vm_area;
 	int err;
 
 	if (node == NUMA_NO_NODE)
@@ -354,11 +397,16 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (!ti)
 		goto free_tsk;
 
+	stack_vm_area = task_stack_vm_area(tsk);
+
 	err = arch_dup_task_struct(tsk, orig);
 	if (err)
 		goto free_ti;
 
 	tsk->stack = ti;
+#ifdef CONFIG_VMAP_STACK
+	tsk->stack_vm_area = stack_vm_area;
+#endif
 #ifdef CONFIG_SECCOMP
 	/*
 	 * We must handle setting up seccomp filters once we're under
@@ -390,14 +438,14 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->task_frag.page = NULL;
 	tsk->wake_q.next = NULL;
 
-	account_kernel_stack(ti, 1);
+	account_kernel_stack(tsk, 1);
 
 	kcov_task_init(tsk);
 
 	return tsk;
 
 free_ti:
-	free_thread_info(ti);
+	free_thread_info(tsk);
 free_tsk:
 	free_task_struct(tsk);
 	return NULL;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 10/16] x86/die: Don't try to recover from an OOPS on a non-default stack
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (8 preceding siblings ...)
  2016-06-24  4:23 ` [PATCH v4 09/16] fork: Add generic vmalloced stack support Andy Lutomirski
@ 2016-06-24  4:23 ` Andy Lutomirski
  2016-06-24  4:23 ` [PATCH v4 11/16] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:23 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

It's not going to work, because the scheduler will explode if we try
to schedule when running on an IST stack or similar.

This will matter when we let kernel stack overflows (which are #DF)
call die().
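
As a rough illustration (not part of the patch), the check added below
relies on task stacks being THREAD_SIZE-aligned: the stack pointer and the
top of the current task's stack can differ only in the low
log2(THREAD_SIZE) bits when both lie in the same task stack, so masking
their XOR with ~(THREAD_SIZE - 1) yields zero in that case and nonzero on
an IST or other special stack.  A hedged C sketch of the same test, with a
hypothetical helper name:

	/*
	 * Illustrative only.  Assumes THREAD_SIZE is a power of two and
	 * task stacks are THREAD_SIZE-aligned, as on x86.  `top' is one
	 * byte past the end of the task stack, so `top - 1' is its last
	 * valid byte.
	 */
	static inline bool on_non_default_stack(unsigned long sp, unsigned long top)
	{
		return ((sp ^ (top - 1)) & ~(THREAD_SIZE - 1)) != 0;
	}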

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index d6209f3a69cb..70d5aae8b8f7 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -245,6 +245,9 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 		return;
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
+	if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
+	     & ~(THREAD_SIZE - 1)) != 0)
+		panic("Fatal exception on special stack");
 	if (panic_on_oops)
 		panic("Fatal exception");
 	do_exit(signr);
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 11/16] x86/dumpstack: When OOPSing, rewind the stack before do_exit
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (9 preceding siblings ...)
  2016-06-24  4:23 ` [PATCH v4 10/16] x86/die: Don't try to recover from an OOPS on a non-default stack Andy Lutomirski
@ 2016-06-24  4:23 ` Andy Lutomirski
  2016-06-24 15:30   ` Josh Poimboeuf
  2016-06-24  4:23 ` [PATCH v4 12/16] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:23 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we call do_exit with a clean stack, we greatly reduce the risk of
recursive oopses due to stack overflow in do_exit, and we allow
do_exit to work even if we OOPS from an IST stack.  The latter gives
us a much better chance of surviving long enough after we detect a
stack overflow to write out our logs.

I intentionally separated this from the preceding patch that
disables do_exit-on-OOPS on IST stacks.  This way, if we need to
revert this patch, we still end up in an acceptable state wrt stack
overflow handling.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_32.S   | 11 +++++++++++
 arch/x86/entry/entry_64.S   | 11 +++++++++++
 arch/x86/kernel/dumpstack.c | 13 +++++++++----
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 983e5d3a0d27..0b56666e6039 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1153,3 +1153,14 @@ ENTRY(async_page_fault)
 	jmp	error_code
 END(async_page_fault)
 #endif
+
+ENTRY(rewind_stack_do_exit)
+	/* Prevent any naive code from trying to unwind to our caller. */
+	xorl	%ebp, %ebp
+
+	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esi
+	leal	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
+
+	call	do_exit
+1:	jmp 1b
+END(rewind_stack_do_exit)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1807ed..b846875aeea6 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1423,3 +1423,14 @@ ENTRY(ignore_sysret)
 	mov	$-ENOSYS, %eax
 	sysret
 END(ignore_sysret)
+
+ENTRY(rewind_stack_do_exit)
+	/* Prevent any naive code from trying to unwind to our caller. */
+	xorl	%ebp, %ebp
+
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rax
+	leaq	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%rax), %rsp
+
+	call	do_exit
+1:	jmp 1b
+END(rewind_stack_do_exit)
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 70d5aae8b8f7..4592bc4ed3e1 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -226,6 +226,8 @@ unsigned long oops_begin(void)
 EXPORT_SYMBOL_GPL(oops_begin);
 NOKPROBE_SYMBOL(oops_begin);
 
+extern void __noreturn rewind_stack_do_exit(int signr);
+
 void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 {
 	if (regs && kexec_should_crash(current))
@@ -245,12 +247,15 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
 		return;
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
-	if (((current_stack_pointer() ^ (current_top_of_stack() - 1))
-	     & ~(THREAD_SIZE - 1)) != 0)
-		panic("Fatal exception on special stack");
 	if (panic_on_oops)
 		panic("Fatal exception");
-	do_exit(signr);
+
+	/*
+	 * We're not going to return, but we might be on an IST stack or
+	 * have very little stack space left.  Rewind the stack and kill
+	 * the task.
+	 */
+	rewind_stack_do_exit(signr);
 }
 NOKPROBE_SYMBOL(oops_end);
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 12/16] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (10 preceding siblings ...)
  2016-06-24  4:23 ` [PATCH v4 11/16] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
@ 2016-06-24  4:23 ` Andy Lutomirski
  2016-06-24 15:31   ` Josh Poimboeuf
  2016-06-24  4:23 ` [PATCH v4 13/16] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:23 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

The comment suggests that show_stack(NULL, NULL) should backtrace
the current context, but the code doesn't match the comment.  If
regs are given, start the "Stack:" hexdump at regs->sp.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack_32.c | 4 +++-
 arch/x86/kernel/dumpstack_64.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index fef917e79b9d..948d77da3881 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -96,7 +96,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	int i;
 
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index d558a8a49016..a81e1ef73bf2 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -264,7 +264,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * back trace for this cpu:
 	 */
 	if (sp == NULL) {
-		if (task)
+		if (regs)
+			sp = (unsigned long *)regs->sp;
+		else if (task)
 			sp = (unsigned long *)task->thread.sp;
 		else
 			sp = (unsigned long *)&sp;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 13/16] x86/dumpstack: Try harder to get a call trace on stack overflow
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (11 preceding siblings ...)
  2016-06-24  4:23 ` [PATCH v4 12/16] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
@ 2016-06-24  4:23 ` Andy Lutomirski
  2016-06-24 15:35   ` Josh Poimboeuf
  2016-06-24  4:23 ` [PATCH v4 14/16] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:23 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we overflow the stack, print_context_stack will abort.  Detect
this case and rewind back into the valid part of the stack so that
we can trace it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 4592bc4ed3e1..4538f7ca9072 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -87,7 +87,7 @@ static inline int valid_stack_ptr(struct task_struct *task,
 		else
 			return 0;
 	}
-	return p > t && p < t + THREAD_SIZE - size;
+	return p >= t && p < t + THREAD_SIZE - size;
 }
 
 unsigned long
@@ -98,6 +98,13 @@ print_context_stack(struct task_struct *task,
 {
 	struct stack_frame *frame = (struct stack_frame *)bp;
 
+	/*
+	 * If we overflowed the stack into a guard page, jump back to the
+	 * bottom of the usable stack.
+	 */
+	if ((unsigned long)task->stack - (unsigned long)stack < PAGE_SIZE)
+		stack = (unsigned long *)task->stack;
+
 	while (valid_stack_ptr(task, stack, sizeof(*stack), end)) {
 		unsigned long addr;
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 14/16] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (12 preceding siblings ...)
  2016-06-24  4:23 ` [PATCH v4 13/16] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
@ 2016-06-24  4:23 ` Andy Lutomirski
  2016-06-24 15:36   ` Josh Poimboeuf
  2016-06-24  4:23 ` [PATCH v4 15/16] x86/mm/64: Enable vmapped stacks Andy Lutomirski
  2016-06-24  4:23 ` [PATCH v4 16/16] x86/mm: Improve stack-overflow #PF handling Andy Lutomirski
  15 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:23 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we overflow the stack into a guard page, we'll recursively fault
when trying to dump the contents of the guard page.  Use
probe_kernel_address so we can recover if this happens.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack_64.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index a81e1ef73bf2..6dede08dd98b 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -274,6 +274,8 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 	stack = sp;
 	for (i = 0; i < kstack_depth_to_print; i++) {
+		unsigned long word;
+
 		if (stack >= irq_stack && stack <= irq_stack_end) {
 			if (stack == irq_stack_end) {
 				stack = (unsigned long *) (irq_stack_end[-1]);
@@ -283,12 +285,18 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		if (kstack_end(stack))
 			break;
 		}
+
+		if (probe_kernel_address(stack, word))
+			break;
+
 		if ((i % STACKSLOTS_PER_LINE) == 0) {
 			if (i != 0)
 				pr_cont("\n");
-			printk("%s %016lx", log_lvl, *stack++);
+			printk("%s %016lx", log_lvl, word);
 		} else
-			pr_cont(" %016lx", *stack++);
+			pr_cont(" %016lx", word);
+
+		stack++;
 		touch_nmi_watchdog();
 	}
 	preempt_enable();
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 15/16] x86/mm/64: Enable vmapped stacks
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (13 preceding siblings ...)
  2016-06-24  4:23 ` [PATCH v4 14/16] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
@ 2016-06-24  4:23 ` Andy Lutomirski
  2016-06-24  4:23 ` [PATCH v4 16/16] x86/mm: Improve stack-overflow #PF handling Andy Lutomirski
  15 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:23 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

This allows x86_64 kernels to enable vmapped stacks.  There are a
couple of interesting bits.

First, x86 lazily faults in top-level paging entries for the vmalloc
area.  This won't work if we get a page fault while trying to access
the stack: the CPU will promote it to a double-fault and we'll die.
To avoid this problem, probe the new stack when switching stacks and
forcibly populate the pgd entry for the stack when switching mms.

Second, once we have guard pages around the stack, we'll want to
detect and handle stack overflow.

I didn't enable it on x86_32.  We'd need to rework the double-fault
code a bit and I'm concerned about running out of vmalloc virtual
addresses under some workloads.

This patch, by itself, will behave somewhat erratically when the
stack overflows while RSP is still more than a few tens of bytes
above the bottom of the stack.  Specifically, we'll get #PF and make
it to no_context and an oops without triggering a double-fault, and
no_context doesn't know about stack overflows.  The next patch will
improve that case.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++++++++++++-
 arch/x86/kernel/traps.c          | 32 ++++++++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                | 15 +++++++++++++++
 4 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d9a94da0c29f..afdcf96ef109 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -92,6 +92,7 @@ config X86
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_EBPF_JIT			if X86_64
+	select HAVE_ARCH_VMAP_STACK		if X86_64
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8f321a1b03a1..14e4b20f0aaf 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,28 @@ struct tss_struct;
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss);
 
+/* This runs on the previous thread's stack. */
+static inline void prepare_switch_to(struct task_struct *prev,
+				     struct task_struct *next)
+{
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we switch to a stack that has a top-level paging entry
+	 * that is not present in the current mm, the resulting #PF
+	 * will be promoted to a double-fault and we'll panic.  Probe
+	 * the new stack now so that vmalloc_fault can fix up the page
+	 * tables if needed.  This can only happen if we use a stack
+	 * in vmap space.
+	 *
+	 * We assume that the stack is aligned so that it never spans
+	 * more than one top-level paging entry.
+	 *
+	 * To minimize cache pollution, just follow the stack pointer.
+	 */
+	READ_ONCE(*(unsigned char *)next->thread.sp);
+#endif
+}
+
 #ifdef CONFIG_X86_32
 
 #ifdef CONFIG_CC_STACKPROTECTOR
@@ -39,6 +61,8 @@ do {									\
 	 */								\
 	unsigned long ebx, ecx, edx, esi, edi;				\
 									\
+	prepare_switch_to(prev, next);					\
+									\
 	asm volatile("pushl %%ebp\n\t"		/* save    EBP   */	\
 		     "movl %%esp,%[prev_sp]\n\t"	/* save    ESP   */ \
 		     "movl %[next_sp],%%esp\n\t"	/* restore ESP   */ \
@@ -103,7 +127,9 @@ do {									\
  * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
  * has no effect.
  */
-#define switch_to(prev, next, last) \
+#define switch_to(prev, next, last)					  \
+	prepare_switch_to(prev, next);					  \
+									  \
 	asm volatile(SAVE_CONTEXT					  \
 	     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */	  \
 	     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */	  \
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 00f03d82e69a..9cb7ea781176 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present",	segment_not_present)
 DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
+#ifdef CONFIG_VMAP_STACK
+static void __noreturn handle_stack_overflow(const char *message,
+					     struct pt_regs *regs,
+					     unsigned long fault_address)
+{
+	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
+		 (void *)fault_address, current->stack,
+		 (char *)current->stack + THREAD_SIZE - 1);
+	die(message, regs, 0);
+
+	/* Be absolutely certain we don't return. */
+	panic(message);
+}
+#endif
+
 #ifdef CONFIG_X86_64
 /* Runs on IST stack */
 dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 {
 	static const char str[] = "double fault";
 	struct task_struct *tsk = current;
+#ifdef CONFIG_VMAP_STACK
+	unsigned long cr2;
+#endif
 
 #ifdef CONFIG_X86_ESPFIX64
 	extern unsigned char native_irq_return_iret[];
@@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_nr = X86_TRAP_DF;
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we overflow the stack into a guard page, the CPU will fail
+	 * to deliver #PF and will send #DF instead.  CR2 will contain
+	 * the linear address of the second fault, which will be in the
+	 * guard page below the bottom of the stack.
+	 */
+	cr2 = read_cr2();
+	if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
+		handle_stack_overflow(
+			"kernel stack overflow (double-fault)",
+			regs, cr2);
+#endif
+
 #ifdef CONFIG_DOUBLEFAULT
 	df_debug(regs, error_code);
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..fbf036ae72ac 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -77,10 +77,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	unsigned cpu = smp_processor_id();
 
 	if (likely(prev != next)) {
+		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+			/*
+			 * If our current stack is in vmalloc space and isn't
+			 * mapped in the new pgd, we'll double-fault.  Forcibly
+			 * map it.
+			 */
+			unsigned int stack_pgd_index =
+				pgd_index(current_stack_pointer());
+			pgd_t *pgd = next->pgd + stack_pgd_index;
+
+			if (unlikely(pgd_none(*pgd)))
+				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+		}
+
 #ifdef CONFIG_SMP
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		this_cpu_write(cpu_tlbstate.active_mm, next);
 #endif
+
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
 		/*
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 16/16] x86/mm: Improve stack-overflow #PF handling
  2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
                   ` (14 preceding siblings ...)
  2016-06-24  4:23 ` [PATCH v4 15/16] x86/mm/64: Enable vmapped stacks Andy Lutomirski
@ 2016-06-24  4:23 ` Andy Lutomirski
  15 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:23 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: linux-arch, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Josh Poimboeuf, Jann Horn,
	Heiko Carstens, Andy Lutomirski

If we get a page fault indicating kernel stack overflow, invoke
handle_stack_overflow().  To prevent us from overflowing the stack
again while handling the overflow (because we are likely to have
very little stack space left), call handle_stack_overflow() on the
double-fault stack.
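
As a side note (not part of the patch), the fault-address test in the hunk
below leans on unsigned wraparound so that each side of the stack needs only
a single range comparison.  A hedged, standalone C sketch with a
hypothetical helper name:

	/*
	 * Illustrative only.  `stack' is the lowest address of the task
	 * stack.  Unsigned subtraction wraps, so an address on the wrong
	 * side of either boundary produces a huge value and fails the
	 * `< PAGE_SIZE' comparison.
	 */
	static bool near_stack_guard(unsigned long addr, unsigned long stack)
	{
		return (stack - 1 - addr < PAGE_SIZE) ||		/* page below */
		       (addr - (stack + THREAD_SIZE) < PAGE_SIZE);	/* page above */
	}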

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/traps.h |  6 ++++++
 arch/x86/kernel/traps.c      |  6 +++---
 arch/x86/mm/fault.c          | 39 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c3496619740a..01fd0a7f48cd 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -117,6 +117,12 @@ extern void ist_exit(struct pt_regs *regs);
 extern void ist_begin_non_atomic(struct pt_regs *regs);
 extern void ist_end_non_atomic(void);
 
+#ifdef CONFIG_VMAP_STACK
+void __noreturn handle_stack_overflow(const char *message,
+				      struct pt_regs *regs,
+				      unsigned long fault_address);
+#endif
+
 /* Interrupts/Exceptions */
 enum {
 	X86_TRAP_DE = 0,	/*  0, Divide-by-zero */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9cb7ea781176..b389c0539eb9 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -293,9 +293,9 @@ DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
 #ifdef CONFIG_VMAP_STACK
-static void __noreturn handle_stack_overflow(const char *message,
-					     struct pt_regs *regs,
-					     unsigned long fault_address)
+__visible void __noreturn handle_stack_overflow(const char *message,
+						struct pt_regs *regs,
+						unsigned long fault_address)
 {
 	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
 		 (void *)fault_address, current->stack,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7d1fa7cd2374..c68b81f5659f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -753,6 +753,45 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * Stack overflow?  During boot, we can fault near the initial
+	 * stack in the direct map, but that's not an overflow -- check
+	 * that we're in vmalloc space to avoid this.
+	 *
+	 * Check this after trying fixup_exception, since there are a handful
+	 * of kernel code paths that wander off the top of the stack but
+	 * handle any faults that occur.  Once those are fixed, we can
+	 * move this above fixup_exception.
+	 */
+	if (is_vmalloc_addr((void *)address) &&
+	    (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
+	     address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
+		register void *__sp asm("rsp");
+		unsigned long stack =
+			this_cpu_read(orig_ist.ist[DOUBLEFAULT_STACK]) -
+			sizeof(void *);
+		/*
+		 * We're likely to be running with very little stack space
+		 * left.  It's plausible that we'd hit this condition but
+		 * double-fault even before we get this far, in which case
+		 * we're fine: the double-fault handler will deal with it.
+		 *
+		 * We don't want to make it all the way into the oops code
+		 * and then double-fault, though, because we're likely to
+		 * break the console driver and lose most of the stack dump.
+		 */
+		asm volatile ("movq %[stack], %%rsp\n\t"
+			      "call handle_stack_overflow\n\t"
+			      "1: jmp 1b"
+			      : "+r" (__sp)
+			      : "D" ("kernel stack overflow (page fault)"),
+				"S" (regs), "d" (address),
+				[stack] "rm" (stack));
+		unreachable();
+	}
+#endif
+
 	/*
 	 * 32-bit:
 	 *
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 01/16] bluetooth: Switch SMP to crypto_cipher_encrypt_one()
  2016-06-24  4:22 ` [PATCH v4 01/16] bluetooth: Switch SMP to crypto_cipher_encrypt_one() Andy Lutomirski
@ 2016-06-24  6:10   ` Herbert Xu
  2016-06-24  7:19   ` Johan Hedberg
  1 sibling, 0 replies; 28+ messages in thread
From: Herbert Xu @ 2016-06-24  6:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Marcel Holtmann,
	Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth,
	netdev

On Thu, Jun 23, 2016 at 09:22:56PM -0700, Andy Lutomirski wrote:
> SMP does ECB crypto on stack buffers.  This is complicated and
> fragile, and it will not work if the stack is virtually allocated.
> 
> Switch to the crypto_cipher interface, which is simpler and safer.
> 
> Cc: Marcel Holtmann <marcel@holtmann.org>
> Cc: Gustavo Padovan <gustavo@padovan.org>
> Cc: Johan Hedberg <johan.hedberg@gmail.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: linux-bluetooth@vger.kernel.org
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 01/16] bluetooth: Switch SMP to crypto_cipher_encrypt_one()
  2016-06-24  4:22 ` [PATCH v4 01/16] bluetooth: Switch SMP to crypto_cipher_encrypt_one() Andy Lutomirski
  2016-06-24  6:10   ` Herbert Xu
@ 2016-06-24  7:19   ` Johan Hedberg
  1 sibling, 0 replies; 28+ messages in thread
From: Johan Hedberg @ 2016-06-24  7:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Josh Poimboeuf, Jann Horn, Heiko Carstens, Marcel Holtmann,
	Gustavo Padovan, David S. Miller, linux-bluetooth, Herbert Xu,
	netdev

On Thu, Jun 23, 2016, Andy Lutomirski wrote:
> SMP does ECB crypto on stack buffers.  This is complicated and
> fragile, and it will not work if the stack is virtually allocated.
> 
> Switch to the crypto_cipher interface, which is simpler and safer.
> 
> Cc: Marcel Holtmann <marcel@holtmann.org>
> Cc: Gustavo Padovan <gustavo@padovan.org>
> Cc: Johan Hedberg <johan.hedberg@gmail.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: linux-bluetooth@vger.kernel.org
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  net/bluetooth/smp.c | 67 ++++++++++++++++++++++-------------------------------
>  1 file changed, 28 insertions(+), 39 deletions(-)

Acked-and-tested-by: Johan Hedberg <johan.hedberg@intel.com>

Johan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 06/16] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
  2016-06-24  4:23 ` [PATCH v4 06/16] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks Andy Lutomirski
@ 2016-06-24 15:21   ` Josh Poimboeuf
  0 siblings, 0 replies; 28+ messages in thread
From: Josh Poimboeuf @ 2016-06-24 15:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Jann Horn, Heiko Carstens, Vladimir Davydov, Johannes Weiner,
	Michal Hocko, linux-mm

On Thu, Jun 23, 2016 at 09:23:01PM -0700, Andy Lutomirski wrote:
> Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
> zone.  This only makes sense if each kernel stack exists entirely in
> one zone, and allowing vmapped stacks could break this assumption.
> 
> Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
> allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on
> all architectures.  Keep it simple and use KiB.
> 
> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: linux-mm@kvack.org
> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 07/16] mm: Fix memcg stack accounting for sub-page stacks
  2016-06-24  4:23 ` [PATCH v4 07/16] mm: Fix memcg stack accounting for sub-page stacks Andy Lutomirski
@ 2016-06-24 15:22   ` Josh Poimboeuf
  0 siblings, 0 replies; 28+ messages in thread
From: Josh Poimboeuf @ 2016-06-24 15:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Jann Horn, Heiko Carstens, Vladimir Davydov, Johannes Weiner,
	Michal Hocko, linux-mm

On Thu, Jun 23, 2016 at 09:23:02PM -0700, Andy Lutomirski wrote:
> We should account for stacks regardless of stack size, and we need
> to account in sub-page units if THREAD_SIZE < PAGE_SIZE.  Change the
> units to kilobytes and move it into account_kernel_stack().
> 
> Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: linux-mm@kvack.org
> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 11/16] x86/dumpstack: When OOPSing, rewind the stack before do_exit
  2016-06-24  4:23 ` [PATCH v4 11/16] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
@ 2016-06-24 15:30   ` Josh Poimboeuf
  2016-06-24 15:35     ` Brian Gerst
  0 siblings, 1 reply; 28+ messages in thread
From: Josh Poimboeuf @ 2016-06-24 15:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 09:23:06PM -0700, Andy Lutomirski wrote:
> If we call do_exit with a clean stack, we greatly reduce the risk of
> recursive oopses due to stack overflow in do_exit, and we allow
> do_exit to work even if we OOPS from an IST stack.  The latter gives
> us a much better chance of surviving long enough after we detect a
> stack overflow to write out our logs.
> 
> I intentionally separated this from the preceding patch that
> disables do_exit-on-OOPS on IST stacks.  This way, if we need to
> revert this patch, we still end up in an acceptable state wrt stack
> overflow handling.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/entry_32.S   | 11 +++++++++++
>  arch/x86/entry/entry_64.S   | 11 +++++++++++
>  arch/x86/kernel/dumpstack.c | 13 +++++++++----
>  3 files changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 983e5d3a0d27..0b56666e6039 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -1153,3 +1153,14 @@ ENTRY(async_page_fault)
>  	jmp	error_code
>  END(async_page_fault)
>  #endif
> +
> +ENTRY(rewind_stack_do_exit)
> +	/* Prevent any naive code from trying to unwind to our caller. */
> +	xorl	%ebp, %ebp
> +
> +	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esi
> +	leal	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
> +
> +	call	do_exit
> +1:	jmp 1b
> +END(rewind_stack_do_exit)
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 9ee0da1807ed..b846875aeea6 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1423,3 +1423,14 @@ ENTRY(ignore_sysret)
>  	mov	$-ENOSYS, %eax
>  	sysret
>  END(ignore_sysret)
> +
> +ENTRY(rewind_stack_do_exit)
> +	/* Prevent any naive code from trying to unwind to our caller. */
> +	xorl	%ebp, %ebp

s/ebp/rbp/g/ ?

-- 
Josh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 12/16] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp
  2016-06-24  4:23 ` [PATCH v4 12/16] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
@ 2016-06-24 15:31   ` Josh Poimboeuf
  0 siblings, 0 replies; 28+ messages in thread
From: Josh Poimboeuf @ 2016-06-24 15:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 09:23:07PM -0700, Andy Lutomirski wrote:
> The comment suggests that show_stack(NULL, NULL) should backtrace
> the current context, but the code doesn't match the comment.  If
> regs are given, start the "Stack:" hexdump at regs->sp.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 13/16] x86/dumpstack: Try harder to get a call trace on stack overflow
  2016-06-24  4:23 ` [PATCH v4 13/16] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
@ 2016-06-24 15:35   ` Josh Poimboeuf
  2016-06-26 16:59     ` Andy Lutomirski
  0 siblings, 1 reply; 28+ messages in thread
From: Josh Poimboeuf @ 2016-06-24 15:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 09:23:08PM -0700, Andy Lutomirski wrote:
> If we overflow the stack, print_context_stack will abort.  Detect
> this case and rewind back into the valid part of the stack so that
> we can trace it.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 11/16] x86/dumpstack: When OOPSing, rewind the stack before do_exit
  2016-06-24 15:30   ` Josh Poimboeuf
@ 2016-06-24 15:35     ` Brian Gerst
  2016-06-24 15:48       ` Josh Poimboeuf
  0 siblings, 1 reply; 28+ messages in thread
From: Brian Gerst @ 2016-06-24 15:35 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Linus Torvalds,
	Jann Horn, Heiko Carstens

On Fri, Jun 24, 2016 at 11:30 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Jun 23, 2016 at 09:23:06PM -0700, Andy Lutomirski wrote:
>> If we call do_exit with a clean stack, we greatly reduce the risk of
>> recursive oopses due to stack overflow in do_exit, and we allow
>> do_exit to work even if we OOPS from an IST stack.  The latter gives
>> us a much better chance of surviving long enough after we detect a
>> stack overflow to write out our logs.
>>
>> I intentionally separated this from the preceding patch that
>> disables do_exit-on-OOPS on IST stacks.  This way, if we need to
>> revert this patch, we still end up in an acceptable state wrt stack
>> overflow handling.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/entry/entry_32.S   | 11 +++++++++++
>>  arch/x86/entry/entry_64.S   | 11 +++++++++++
>>  arch/x86/kernel/dumpstack.c | 13 +++++++++----
>>  3 files changed, 31 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
>> index 983e5d3a0d27..0b56666e6039 100644
>> --- a/arch/x86/entry/entry_32.S
>> +++ b/arch/x86/entry/entry_32.S
>> @@ -1153,3 +1153,14 @@ ENTRY(async_page_fault)
>>       jmp     error_code
>>  END(async_page_fault)
>>  #endif
>> +
>> +ENTRY(rewind_stack_do_exit)
>> +     /* Prevent any naive code from trying to unwind to our caller. */
>> +     xorl    %ebp, %ebp
>> +
>> +     movl    PER_CPU_VAR(cpu_current_top_of_stack), %esi
>> +     leal    -TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
>> +
>> +     call    do_exit
>> +1:   jmp 1b
>> +END(rewind_stack_do_exit)
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index 9ee0da1807ed..b846875aeea6 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -1423,3 +1423,14 @@ ENTRY(ignore_sysret)
>>       mov     $-ENOSYS, %eax
>>       sysret
>>  END(ignore_sysret)
>> +
>> +ENTRY(rewind_stack_do_exit)
>> +     /* Prevent any naive code from trying to unwind to our caller. */
>> +     xorl    %ebp, %ebp
>
> s/ebp/rbp/g/ ?

No, this quirk of the x86-64 instruction set will zero-extend to
64-bits without needing a REX prefix.
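
For concreteness, the two encodings in question (standard x86-64 machine
code, shown only as an illustrative aside):

	31 ed		xorl %ebp, %ebp		# 2 bytes; a 32-bit write zero-extends into %rbp
	48 31 ed	xorq %rbp, %rbp		# 3 bytes; same effect, but needs the REX.W prefix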

--
Brian Gerst

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 14/16] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS
  2016-06-24  4:23 ` [PATCH v4 14/16] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
@ 2016-06-24 15:36   ` Josh Poimboeuf
  0 siblings, 0 replies; 28+ messages in thread
From: Josh Poimboeuf @ 2016-06-24 15:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, linux-arch, Borislav Petkov, Nadav Amit,
	Kees Cook, Brian Gerst, kernel-hardening, Linus Torvalds,
	Jann Horn, Heiko Carstens

On Thu, Jun 23, 2016 at 09:23:09PM -0700, Andy Lutomirski wrote:
> If we overflow the stack into a guard page, we'll recursively fault
> when trying to dump the contents of the guard page.  Use
> probe_kernel_address so we can recover if this happens.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 11/16] x86/dumpstack: When OOPSing, rewind the stack before do_exit
  2016-06-24 15:35     ` Brian Gerst
@ 2016-06-24 15:48       ` Josh Poimboeuf
  0 siblings, 0 replies; 28+ messages in thread
From: Josh Poimboeuf @ 2016-06-24 15:48 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-arch, Borislav Petkov,
	Nadav Amit, Kees Cook, kernel-hardening, Linus Torvalds,
	Jann Horn, Heiko Carstens

On Fri, Jun 24, 2016 at 11:35:13AM -0400, Brian Gerst wrote:
> On Fri, Jun 24, 2016 at 11:30 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Thu, Jun 23, 2016 at 09:23:06PM -0700, Andy Lutomirski wrote:
> >> If we call do_exit with a clean stack, we greatly reduce the risk of
> >> recursive oopses due to stack overflow in do_exit, and we allow
> >> do_exit to work even if we OOPS from an IST stack.  The latter gives
> >> us a much better chance of surviving long enough after we detect a
> >> stack overflow to write out our logs.
> >>
> >> I intentionally separated this from the preceding patch that
> >> disables do_exit-on-OOPS on IST stacks.  This way, if we need to
> >> revert this patch, we still end up in an acceptable state wrt stack
> >> overflow handling.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >> ---
> >>  arch/x86/entry/entry_32.S   | 11 +++++++++++
> >>  arch/x86/entry/entry_64.S   | 11 +++++++++++
> >>  arch/x86/kernel/dumpstack.c | 13 +++++++++----
> >>  3 files changed, 31 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> >> index 983e5d3a0d27..0b56666e6039 100644
> >> --- a/arch/x86/entry/entry_32.S
> >> +++ b/arch/x86/entry/entry_32.S
> >> @@ -1153,3 +1153,14 @@ ENTRY(async_page_fault)
> >>       jmp     error_code
> >>  END(async_page_fault)
> >>  #endif
> >> +
> >> +ENTRY(rewind_stack_do_exit)
> >> +     /* Prevent any naive code from trying to unwind to our caller. */
> >> +     xorl    %ebp, %ebp
> >> +
> >> +     movl    PER_CPU_VAR(cpu_current_top_of_stack), %esi
> >> +     leal    -TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
> >> +
> >> +     call    do_exit
> >> +1:   jmp 1b
> >> +END(rewind_stack_do_exit)
> >> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> >> index 9ee0da1807ed..b846875aeea6 100644
> >> --- a/arch/x86/entry/entry_64.S
> >> +++ b/arch/x86/entry/entry_64.S
> >> @@ -1423,3 +1423,14 @@ ENTRY(ignore_sysret)
> >>       mov     $-ENOSYS, %eax
> >>       sysret
> >>  END(ignore_sysret)
> >> +
> >> +ENTRY(rewind_stack_do_exit)
> >> +     /* Prevent any naive code from trying to unwind to our caller. */
> >> +     xorl    %ebp, %ebp
> >
> > s/ebp/rbp/g/ ?
> 
> No, this quirk of the x86-64 instruction set will zero-extend to
> 64-bits without needing a REX prefix.

Ah, so it makes the instruction smaller.  And I see that gcc also does
the same.  In that case:

Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

-- 
Josh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 13/16] x86/dumpstack: Try harder to get a call trace on stack overflow
  2016-06-24 15:35   ` Josh Poimboeuf
@ 2016-06-26 16:59     ` Andy Lutomirski
  0 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-06-26 16:59 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, linux-kernel, linux-arch,
	Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	kernel-hardening, Linus Torvalds, Jann Horn, Heiko Carstens

On Fri, Jun 24, 2016 at 8:35 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Jun 23, 2016 at 09:23:08PM -0700, Andy Lutomirski wrote:
>> If we overflow the stack, print_context_stack will abort.  Detect
>> this case and rewind back into the valid part of the stack so that
>> we can trace it.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>
> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
>

FWIW, I'm making a trivial change here for v4: task->stack ->
task_stack_page(task).  Since it seems inconsequential, I'm keeping
your reviewed-by.

> --
> Josh



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-06-26 17:00 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-24  4:22 [PATCH v4 00/16] Virtually mapped stacks with guard pages (x86, core) Andy Lutomirski
2016-06-24  4:22 ` [PATCH v4 01/16] bluetooth: Switch SMP to crypto_cipher_encrypt_one() Andy Lutomirski
2016-06-24  6:10   ` Herbert Xu
2016-06-24  7:19   ` Johan Hedberg
2016-06-24  4:22 ` [PATCH v4 02/16] rxrpc: Avoid using stack memory in SG lists in rxkad Andy Lutomirski
2016-06-24  4:22 ` [PATCH v4 03/16] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Andy Lutomirski
2016-06-24  4:22 ` [PATCH v4 04/16] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated Andy Lutomirski
2016-06-24  4:23 ` [PATCH v4 05/16] x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables() Andy Lutomirski
2016-06-24  4:23 ` [PATCH v4 06/16] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks Andy Lutomirski
2016-06-24 15:21   ` Josh Poimboeuf
2016-06-24  4:23 ` [PATCH v4 07/16] mm: Fix memcg stack accounting for sub-page stacks Andy Lutomirski
2016-06-24 15:22   ` Josh Poimboeuf
2016-06-24  4:23 ` [PATCH v4 08/16] dma-api: Teach the "DMA-from-stack" check about vmapped stacks Andy Lutomirski
2016-06-24  4:23 ` [PATCH v4 09/16] fork: Add generic vmalloced stack support Andy Lutomirski
2016-06-24  4:23 ` [PATCH v4 10/16] x86/die: Don't try to recover from an OOPS on a non-default stack Andy Lutomirski
2016-06-24  4:23 ` [PATCH v4 11/16] x86/dumpstack: When OOPSing, rewind the stack before do_exit Andy Lutomirski
2016-06-24 15:30   ` Josh Poimboeuf
2016-06-24 15:35     ` Brian Gerst
2016-06-24 15:48       ` Josh Poimboeuf
2016-06-24  4:23 ` [PATCH v4 12/16] x86/dumpstack: When dumping stack bytes due to OOPS, start with regs->sp Andy Lutomirski
2016-06-24 15:31   ` Josh Poimboeuf
2016-06-24  4:23 ` [PATCH v4 13/16] x86/dumpstack: Try harder to get a call trace on stack overflow Andy Lutomirski
2016-06-24 15:35   ` Josh Poimboeuf
2016-06-26 16:59     ` Andy Lutomirski
2016-06-24  4:23 ` [PATCH v4 14/16] x86/dumpstack/64: Handle faults when printing the "Stack:" part of an OOPS Andy Lutomirski
2016-06-24 15:36   ` Josh Poimboeuf
2016-06-24  4:23 ` [PATCH v4 15/16] x86/mm/64: Enable vmapped stacks Andy Lutomirski
2016-06-24  4:23 ` [PATCH v4 16/16] x86/mm: Improve stack-overflow #PF handling Andy Lutomirski
