* [RFC PATCH] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-02-26 14:21 ` Juerg Haefliger
  0 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-02-26 14:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm; +Cc: vpk, juerg.haefliger

This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userland, unless explicitly requested by the
kernel. Whenever a page destined for userland is allocated, it is
unmapped from physmap. When such a page is reclaimed from userland, it is
mapped back to physmap.

Mapping/unmapping from physmap is accomplished by modifying the PTE
permission bits to allow/disallow access to the page.

Additional fields are added to the page struct for XPFO housekeeping.
Specifically a flags field to distinguish user vs. kernel pages, a
reference counter to track physmap map/unmap operations and a lock to
protect the XPFO fields.

Known issues/limitations:
  - Only supported on x86-64.
  - Only supports 4k pages.
  - Adds additional data to the page struct.
  - There are most likely additional legitimate use cases where the
    kernel needs to access userspace pages. Those need to be identified
    and made XPFO-aware.
  - There's a performance impact if XPFO is turned on. Per the paper
    referenced below it's in the 1-3% ballpark. More performance testing
    wouldn't hurt. What tests to run though?

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Suggested-by: Vasileios P. Kemerlis <vpk@cs.brown.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 arch/x86/Kconfig         |   2 +-
 arch/x86/Kconfig.debug   |  17 +++++
 arch/x86/mm/Makefile     |   2 +
 arch/x86/mm/init.c       |   3 +-
 arch/x86/mm/xpfo.c       | 176 +++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-map.c          |   7 +-
 include/linux/highmem.h  |  23 +++++--
 include/linux/mm_types.h |   4 ++
 include/linux/xpfo.h     |  88 ++++++++++++++++++++++++
 lib/swiotlb.c            |   3 +-
 mm/page_alloc.c          |   7 +-
 11 files changed, 323 insertions(+), 9 deletions(-)
 create mode 100644 arch/x86/mm/xpfo.c
 create mode 100644 include/linux/xpfo.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c46662f..9d32b4a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1343,7 +1343,7 @@ config ARCH_DMA_ADDR_T_64BIT
 
 config X86_DIRECT_GBPAGES
 	def_bool y
-	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
+	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
 	---help---
 	  Certain kernel features effectively disable kernel
 	  linear 1 GB mappings (even if the CPU otherwise
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 9b18ed9..1331da5 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -5,6 +5,23 @@ config TRACE_IRQFLAGS_SUPPORT
 
 source "lib/Kconfig.debug"
 
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	default n
+	depends on DEBUG_KERNEL
+	depends on X86_64
+	select DEBUG_TLBFLUSH
+	---help---
+	  This option offers protection against 'ret2dir' (kernel) attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, whenever page frames are freed/reclaimed, they
+	  are mapped back to physmap. Special care is taken to minimize the
+	  impact on performance by reducing TLB shootdowns and unnecessary page
+	  zero fills.
+
+	  If in doubt, say "N".
+
 config X86_VERBOSE_BOOTUP
 	bool "Enable verbose x86 bootup info messages"
 	default y
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index f9d38a4..8bf52b6 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -34,3 +34,5 @@ obj-$(CONFIG_ACPI_NUMA)		+= srat.o
 obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
 
 obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
+
+obj-$(CONFIG_XPFO)		+= xpfo.o
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 493f541..27fc8a6 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -150,7 +150,8 @@ static int page_size_mask;
 
 static void __init probe_page_size_mask(void)
 {
-#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
+#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK) && \
+	!defined(CONFIG_XPFO)
 	/*
 	 * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
 	 * This will simplify cpa(), which otherwise needs to support splitting
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
new file mode 100644
index 0000000..6bc24d3
--- /dev/null
+++ b/arch/x86/mm/xpfo.c
@@ -0,0 +1,176 @@
+/*
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ *
+ * Authors:
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+
+#include <asm/pgtable.h>
+#include <asm/tlbflush.h>
+
+#define TEST_XPFO_FLAG(flag, page) \
+	test_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
+
+#define SET_XPFO_FLAG(flag, page)			\
+	__set_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
+
+#define CLEAR_XPFO_FLAG(flag, page)			\
+	__clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
+
+#define TEST_AND_CLEAR_XPFO_FLAG(flag, page)			\
+	__test_and_clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
+
+/*
+ * Update a single kernel page table entry
+ */
+static inline void set_kpte(struct page *page, unsigned long kaddr,
+			    pgprot_t prot) {
+	unsigned int level;
+	pte_t *kpte = lookup_address(kaddr, &level);
+
+	/* We only support 4k pages for now */
+	BUG_ON(!kpte || level != PG_LEVEL_4K);
+
+	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+}
+
+inline void xpfo_clear_zap(struct page *page, int order)
+{
+	int i;
+
+	for (i = 0; i < (1 << order); i++)
+		CLEAR_XPFO_FLAG(zap, page + i);
+}
+
+inline int xpfo_test_and_clear_zap(struct page *page)
+{
+	return TEST_AND_CLEAR_XPFO_FLAG(zap, page);
+}
+
+inline int xpfo_test_kernel(struct page *page)
+{
+	return TEST_XPFO_FLAG(kernel, page);
+}
+
+inline int xpfo_test_user(struct page *page)
+{
+	return TEST_XPFO_FLAG(user, page);
+}
+
+void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
+{
+	int i, tlb_shoot = 0;
+	unsigned long kaddr;
+
+	for (i = 0; i < (1 << order); i++)  {
+		WARN_ON(TEST_XPFO_FLAG(user_fp, page + i) ||
+			TEST_XPFO_FLAG(user, page + i));
+
+		if (gfp & GFP_HIGHUSER) {
+			/* Initialize the xpfo lock and map counter */
+			spin_lock_init(&(page + i)->xpfo.lock);
+			atomic_set(&(page + i)->xpfo.mapcount, 0);
+
+			/* Mark it as a user page */
+			SET_XPFO_FLAG(user_fp, page + i);
+
+			/*
+			 * Shoot the TLB if the page was previously allocated
+			 * to kernel space
+			 */
+			if (TEST_AND_CLEAR_XPFO_FLAG(kernel, page + i))
+				tlb_shoot = 1;
+		} else {
+			/* Mark it as a kernel page */
+			SET_XPFO_FLAG(kernel, page + i);
+		}
+	}
+
+	if (tlb_shoot) {
+		kaddr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
+				       PAGE_SIZE);
+	}
+}
+
+void xpfo_free_page(struct page *page, int order)
+{
+	int i;
+	unsigned long kaddr;
+
+	for (i = 0; i < (1 << order); i++) {
+
+		/* The page frame was previously allocated to user space */
+		if (TEST_AND_CLEAR_XPFO_FLAG(user, page + i)) {
+			kaddr = (unsigned long)page_address(page + i);
+
+			/* Clear the page and mark it accordingly */
+			clear_page((void *)kaddr);
+			SET_XPFO_FLAG(zap, page + i);
+
+			/* Map it back to kernel space */
+			set_kpte(page + i, kaddr, __pgprot(__PAGE_KERNEL));
+
+			/* No TLB update */
+		}
+
+		/* Clear the xpfo fast-path flag */
+		CLEAR_XPFO_FLAG(user_fp, page + i);
+	}
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+	unsigned long flags;
+
+	/* The page is allocated to kernel space, so nothing to do */
+	if (TEST_XPFO_FLAG(kernel, page))
+		return;
+
+	spin_lock_irqsave(&page->xpfo.lock, flags);
+
+	/*
+	 * The page was previously allocated to user space, so map it back
+	 * into the kernel. No TLB update required.
+	 */
+	if ((atomic_inc_return(&page->xpfo.mapcount) == 1) &&
+	    TEST_XPFO_FLAG(user, page))
+		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
+
+	spin_unlock_irqrestore(&page->xpfo.lock, flags);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	unsigned long flags;
+
+	/* The page is allocated to kernel space, so nothing to do */
+	if (TEST_XPFO_FLAG(kernel, page))
+		return;
+
+	spin_lock_irqsave(&page->xpfo.lock, flags);
+
+	/*
+	 * The page frame is to be allocated back to user space. So unmap it
+	 * from the kernel, update the TLB and mark it as a user page.
+	 */
+	if ((atomic_dec_return(&page->xpfo.mapcount) == 0) &&
+	    (TEST_XPFO_FLAG(user_fp, page) || TEST_XPFO_FLAG(user, page))) {
+		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
+		__flush_tlb_one((unsigned long)kaddr);
+		SET_XPFO_FLAG(user, page);
+	}
+
+	spin_unlock_irqrestore(&page->xpfo.lock, flags);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
diff --git a/block/blk-map.c b/block/blk-map.c
index f565e11..b7b8302 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -107,7 +107,12 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
 		prv.iov_len = iov.iov_len;
 	}
 
-	if (unaligned || (q->dma_pad_mask & iter->count) || map_data)
+	/*
+	 * juergh: Temporary hack to force the use of a bounce buffer if XPFO
+	 * is enabled. Results in an XPFO page fault otherwise.
+	 */
+	if (unaligned || (q->dma_pad_mask & iter->count) || map_data ||
+	    IS_ENABLED(CONFIG_XPFO))
 		bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
 	else
 		bio = bio_map_user_iov(q, iter, gfp_mask);
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f329..0ca9130 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -55,24 +55,37 @@ static inline struct page *kmap_to_page(void *addr)
 #ifndef ARCH_HAS_KMAP
 static inline void *kmap(struct page *page)
 {
+	void *kaddr;
+
 	might_sleep();
-	return page_address(page);
+
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 
 static inline void kunmap(struct page *page)
 {
+	xpfo_kunmap(page_address(page), page);
 }
 
 static inline void *kmap_atomic(struct page *page)
 {
+	void *kaddr;
+
 	preempt_disable();
 	pagefault_disable();
-	return page_address(page);
+
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
 
 static inline void __kunmap_atomic(void *addr)
 {
+	xpfo_kunmap(addr, virt_to_page(addr));
+
 	pagefault_enable();
 	preempt_enable();
 }
@@ -133,7 +146,8 @@ do {                                                            \
 static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 {
 	void *addr = kmap_atomic(page);
-	clear_user_page(addr, vaddr, page);
+	if (!xpfo_test_and_clear_zap(page))
+		clear_user_page(addr, vaddr, page);
 	kunmap_atomic(addr);
 }
 #endif
@@ -186,7 +200,8 @@ alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page);
-	clear_page(kaddr);
+	if (!xpfo_test_and_clear_zap(page))
+		clear_page(kaddr);
 	kunmap_atomic(kaddr);
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 624b78b..71c95aa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/cpumask.h>
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
+#include <linux/xpfo.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -215,6 +216,9 @@ struct page {
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	int _last_cpupid;
 #endif
+#ifdef CONFIG_XPFO
+	struct xpfo_info xpfo;
+#endif
 }
 /*
  * The struct page can be forced to be double word aligned so that atomic ops
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 0000000..c4f0871
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,88 @@
+/*
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ *
+ * Authors:
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#ifdef CONFIG_XPFO
+
+/*
+ * XPFO page flags:
+ *
+ * PG_XPFO_user_fp denotes that the page is allocated to user space. This flag
+ * is used in the fast path, where the page is marked accordingly but *not*
+ * unmapped from the kernel. In most cases, the kernel will need access to the
+ * page immediately after its acquisition so an unnecessary mapping operation
+ * is avoided.
+ *
+ * PG_XPFO_user denotes that the page is destined for user space. This flag is
+ * used in the slow path, where the page needs to be mapped/unmapped when the
+ * kernel wants to access it. If a page is deallocated and this flag is set,
+ * the page is cleared and mapped back into the kernel.
+ *
+ * PG_XPFO_kernel denotes a page that is destined for kernel space. This is used
+ * for identifying pages that are first assigned to kernel space and then freed
+ * and mapped to user space. In such cases, an expensive TLB shootdown is
+ * necessary. Pages allocated to user space, freed, and subsequently allocated
+ * to user space again, require only local TLB invalidation.
+ *
+ * PG_XPFO_zap indicates that the page has been zapped. This flag is used to
+ * avoid zapping pages multiple times. Whenever a page is freed and was
+ * previously mapped to user space, it needs to be zapped before being mapped
+ * back into the kernel.
+ */
+
+enum xpfo_pageflags {
+	PG_XPFO_user_fp,
+	PG_XPFO_user,
+	PG_XPFO_kernel,
+	PG_XPFO_zap,
+};
+
+struct xpfo_info {
+	unsigned long flags;	/* Flags for tracking the page's XPFO state */
+	atomic_t mapcount;	/* Counter for balancing page map/unmap
+				 * requests. Only the first map request maps
+				 * the page back to kernel space. Likewise,
+				 * only the last unmap request unmaps the page.
+				 */
+	spinlock_t lock;	/* Lock to serialize concurrent map/unmap
+				 * requests.
+				 */
+};
+
+extern void xpfo_clear_zap(struct page *page, int order);
+extern int xpfo_test_and_clear_zap(struct page *page);
+extern int xpfo_test_kernel(struct page *page);
+extern int xpfo_test_user(struct page *page);
+
+extern void xpfo_kmap(void *kaddr, struct page *page);
+extern void xpfo_kunmap(void *kaddr, struct page *page);
+extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
+extern void xpfo_free_page(struct page *page, int order);
+
+#else /* ifdef CONFIG_XPFO */
+
+static inline void xpfo_clear_zap(struct page *page, int order) { }
+static inline int xpfo_test_and_clear_zap(struct page *page) { return 0; }
+static inline int xpfo_test_kernel(struct page *page) { return 0; }
+static inline int xpfo_test_user(struct page *page) { return 0; }
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_page(struct page *page, int order) { }
+
+#endif /* ifdef CONFIG_XPFO */
+
+#endif /* ifndef _LINUX_XPFO_H */
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 76f29ec..cf57ee9 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 {
 	unsigned long pfn = PFN_DOWN(orig_addr);
 	unsigned char *vaddr = phys_to_virt(tlb_addr);
+	struct page *page = pfn_to_page(pfn);
 
-	if (PageHighMem(pfn_to_page(pfn))) {
+	if (PageHighMem(page) || xpfo_test_user(page)) {
 		/* The buffer does not have a mapping.  Map it in and copy */
 		unsigned int offset = orig_addr & ~PAGE_MASK;
 		char *buffer;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 838ca8bb..47b42a3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1003,6 +1003,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 	}
 	arch_free_page(page, order);
 	kernel_map_pages(page, 1 << order, 0);
+	xpfo_free_page(page, order);
 
 	return true;
 }
@@ -1398,10 +1399,13 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 	arch_alloc_page(page, order);
 	kernel_map_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_page(page, order, gfp_flags);
 
 	if (gfp_flags & __GFP_ZERO)
 		for (i = 0; i < (1 << order); i++)
 			clear_highpage(page + i);
+	else
+		xpfo_clear_zap(page, order);
 
 	if (order && (gfp_flags & __GFP_COMP))
 		prep_compound_page(page, order);
@@ -2072,10 +2076,11 @@ void free_hot_cold_page(struct page *page, bool cold)
 	}
 
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
-	if (!cold)
+	if (!cold && !xpfo_test_kernel(page))
 		list_add(&page->lru, &pcp->lists[migratetype]);
 	else
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
+
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
 		unsigned long batch = READ_ONCE(pcp->batch);
-- 
2.1.4

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 838ca8bb..47b42a3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1003,6 +1003,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 	}
 	arch_free_page(page, order);
 	kernel_map_pages(page, 1 << order, 0);
+	xpfo_free_page(page, order);
 
 	return true;
 }
@@ -1398,10 +1399,13 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 	arch_alloc_page(page, order);
 	kernel_map_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_page(page, order, gfp_flags);
 
 	if (gfp_flags & __GFP_ZERO)
 		for (i = 0; i < (1 << order); i++)
 			clear_highpage(page + i);
+	else
+		xpfo_clear_zap(page, order);
 
 	if (order && (gfp_flags & __GFP_COMP))
 		prep_compound_page(page, order);
@@ -2072,10 +2076,11 @@ void free_hot_cold_page(struct page *page, bool cold)
 	}
 
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
-	if (!cold)
+	if (!cold && !xpfo_test_kernel(page))
 		list_add(&page->lru, &pcp->lists[migratetype]);
 	else
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
+
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
 		unsigned long batch = READ_ONCE(pcp->batch);
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-02-26 14:21 ` Juerg Haefliger
@ 2016-03-01  1:31   ` Laura Abbott
  -1 siblings, 0 replies; 93+ messages in thread
From: Laura Abbott @ 2016-03-01  1:31 UTC (permalink / raw)
  To: Juerg Haefliger, linux-kernel, linux-mm; +Cc: vpk, Kees Cook

On 02/26/2016 06:21 AM, Juerg Haefliger wrote:
> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userland, unless explicitly requested by the
> kernel. Whenever a page destined for userland is allocated, it is
> unmapped from physmap. When such a page is reclaimed from userland, it is
> mapped back to physmap.
>
> Mapping/unmapping from physmap is accomplished by modifying the PTE
> permission bits to allow/disallow access to the page.
>
> Additional fields are added to the page struct for XPFO housekeeping.
> Specifically a flags field to distinguish user vs. kernel pages, a
> reference counter to track physmap map/unmap operations and a lock to
> protect the XPFO fields.
>
> Known issues/limitations:
>    - Only supported on x86-64.
>    - Only supports 4k pages.
>    - Adds additional data to the page struct.
>    - There are most likely some additional and legitimate uses cases where
>      the kernel needs to access userspace. Those need to be identified and
>      made XPFO-aware.
>    - There's a performance impact if XPFO is turned on. Per the paper
>      referenced below it's in the 1-3% ballpark. More performance testing
>      wouldn't hurt. What tests to run though?
>
> Reference paper by the original patch authors:
>    http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
>

General note: Make sure to cc the x86 maintainers on the next version of
the patch. I'd also recommend ccing the kernel hardening list (see the wiki
page http://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project for
details)

If you can find a way to break this up into x86 specific vs. generic patches
that would be better. Perhaps move the Kconfig for XPFO to the generic
Kconfig layer and make it depend on ARCH_HAS_XPFO? x86 can then select
ARCH_HAS_XPFO as the last option.
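A rough sketch of that split (Kconfig placement, option names, and help text here are illustrative assumptions, not a finished proposal):

```
# Generic layer (e.g. mm/Kconfig): the feature itself, gated on arch support
config ARCH_HAS_XPFO
	bool

config XPFO
	bool "Enable eXclusive Page Frame Ownership (XPFO)"
	depends on ARCH_HAS_XPFO
	select DEBUG_TLBFLUSH
	help
	  Protects against 'ret2dir' attacks by unmapping user page
	  frames from the kernel's direct mapping (physmap).

# arch/x86/Kconfig: the architecture only declares that it implements
# the hooks
config X86_64
	...
	select ARCH_HAS_XPFO
```

With this shape, another architecture (e.g. arm64) only has to provide its own set_kpte()/TLB-flush implementations and select ARCH_HAS_XPFO.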

There also isn't much that's actually x86 specific here except for
some of the page table manipulation functions and even those can probably
be abstracted away. It would be good to get more of this out of x86 to
let other arches take advantage of it. The arm64 implementation would
look pretty similar if you save the old kernel mapping and restore
it on free.

  
> Suggested-by: Vasileios P. Kemerlis <vpk@cs.brown.edu>
> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
> ---
>   arch/x86/Kconfig         |   2 +-
>   arch/x86/Kconfig.debug   |  17 +++++
>   arch/x86/mm/Makefile     |   2 +
>   arch/x86/mm/init.c       |   3 +-
>   arch/x86/mm/xpfo.c       | 176 +++++++++++++++++++++++++++++++++++++++++++++++
>   block/blk-map.c          |   7 +-
>   include/linux/highmem.h  |  23 +++++--
>   include/linux/mm_types.h |   4 ++
>   include/linux/xpfo.h     |  88 ++++++++++++++++++++++++
>   lib/swiotlb.c            |   3 +-
>   mm/page_alloc.c          |   7 +-
>   11 files changed, 323 insertions(+), 9 deletions(-)
>   create mode 100644 arch/x86/mm/xpfo.c
>   create mode 100644 include/linux/xpfo.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index c46662f..9d32b4a 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1343,7 +1343,7 @@ config ARCH_DMA_ADDR_T_64BIT
>
>   config X86_DIRECT_GBPAGES
>   	def_bool y
> -	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
> +	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>   	---help---
>   	  Certain kernel features effectively disable kernel
>   	  linear 1 GB mappings (even if the CPU otherwise
> diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
> index 9b18ed9..1331da5 100644
> --- a/arch/x86/Kconfig.debug
> +++ b/arch/x86/Kconfig.debug
> @@ -5,6 +5,23 @@ config TRACE_IRQFLAGS_SUPPORT
>
>   source "lib/Kconfig.debug"
>
> +config XPFO
> +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +	default n
> +	depends on DEBUG_KERNEL
> +	depends on X86_64
> +	select DEBUG_TLBFLUSH
> +	---help---
> +	  This option offers protection against 'ret2dir' (kernel) attacks.
> +	  When enabled, every time a page frame is allocated to user space, it
> +	  is unmapped from the direct mapped RAM region in kernel space
> +	  (physmap). Similarly, whenever page frames are freed/reclaimed, they
> +	  are mapped back to physmap. Special care is taken to minimize the
> +	  impact on performance by reducing TLB shootdowns and unnecessary page
> +	  zero fills.
> +
> +	  If in doubt, say "N".
> +
>   config X86_VERBOSE_BOOTUP
>   	bool "Enable verbose x86 bootup info messages"
>   	default y
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index f9d38a4..8bf52b6 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -34,3 +34,5 @@ obj-$(CONFIG_ACPI_NUMA)		+= srat.o
>   obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
>
>   obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
> +
> +obj-$(CONFIG_XPFO)		+= xpfo.o
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 493f541..27fc8a6 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -150,7 +150,8 @@ static int page_size_mask;
>
>   static void __init probe_page_size_mask(void)
>   {
> -#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
> +#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK) && \
> +	!defined(CONFIG_XPFO)
>   	/*
>   	 * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
>   	 * This will simplify cpa(), which otherwise needs to support splitting
> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
> new file mode 100644
> index 0000000..6bc24d3
> --- /dev/null
> +++ b/arch/x86/mm/xpfo.c
> @@ -0,0 +1,176 @@
> +/*
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + *
> + * Authors:
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +
> +#include <asm/pgtable.h>
> +#include <asm/tlbflush.h>
> +
> +#define TEST_XPFO_FLAG(flag, page) \
> +	test_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +#define SET_XPFO_FLAG(flag, page)			\
> +	__set_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +#define CLEAR_XPFO_FLAG(flag, page)			\
> +	__clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +#define TEST_AND_CLEAR_XPFO_FLAG(flag, page)			\
> +	__test_and_clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +/*
> + * Update a single kernel page table entry
> + */
> +static inline void set_kpte(struct page *page, unsigned long kaddr,
> +			    pgprot_t prot) {
> +	unsigned int level;
> +	pte_t *kpte = lookup_address(kaddr, &level);
> +
> +	/* We only support 4k pages for now */
> +	BUG_ON(!kpte || level != PG_LEVEL_4K);
> +
> +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> +}
> +
> +inline void xpfo_clear_zap(struct page *page, int order)
> +{
> +	int i;
> +
> +	for (i = 0; i < (1 << order); i++)
> +		CLEAR_XPFO_FLAG(zap, page + i);
> +}
> +
> +inline int xpfo_test_and_clear_zap(struct page *page)
> +{
> +	return TEST_AND_CLEAR_XPFO_FLAG(zap, page);
> +}
> +
> +inline int xpfo_test_kernel(struct page *page)
> +{
> +	return TEST_XPFO_FLAG(kernel, page);
> +}
> +
> +inline int xpfo_test_user(struct page *page)
> +{
> +	return TEST_XPFO_FLAG(user, page);
> +}
> +
> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> +{
> +	int i, tlb_shoot = 0;
> +	unsigned long kaddr;
> +
> +	for (i = 0; i < (1 << order); i++)  {
> +		WARN_ON(TEST_XPFO_FLAG(user_fp, page + i) ||
> +			TEST_XPFO_FLAG(user, page + i));
> +
> +		if (gfp & GFP_HIGHUSER) {

This check doesn't seem right. If the GFP flags have _any_ bit in common with
GFP_HIGHUSER the page will be marked as a user page, so GFP_KERNEL allocations
will be marked as well.

> +			/* Initialize the xpfo lock and map counter */
> +			spin_lock_init(&(page + i)->xpfo.lock);

This is initializing the spin_lock every time. That's not really necessary.

> +			atomic_set(&(page + i)->xpfo.mapcount, 0);
> +
> +			/* Mark it as a user page */
> +			SET_XPFO_FLAG(user_fp, page + i);
> +
> +			/*
> +			 * Shoot the TLB if the page was previously allocated
> +			 * to kernel space
> +			 */
> +			if (TEST_AND_CLEAR_XPFO_FLAG(kernel, page + i))
> +				tlb_shoot = 1;
> +		} else {
> +			/* Mark it as a kernel page */
> +			SET_XPFO_FLAG(kernel, page + i);
> +		}
> +	}
> +
> +	if (tlb_shoot) {
> +		kaddr = (unsigned long)page_address(page);
> +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> +				       PAGE_SIZE);
> +	}
> +}
> +
> +void xpfo_free_page(struct page *page, int order)
> +{
> +	int i;
> +	unsigned long kaddr;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +
> +		/* The page frame was previously allocated to user space */
> +		if (TEST_AND_CLEAR_XPFO_FLAG(user, page + i)) {
> +			kaddr = (unsigned long)page_address(page + i);
> +
> +			/* Clear the page and mark it accordingly */
> +			clear_page((void *)kaddr);

Clearing the page isn't related to XPFO. There's other work ongoing to
do clearing of the page on free.

> +			SET_XPFO_FLAG(zap, page + i);
> +
> +			/* Map it back to kernel space */
> +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> +
> +			/* No TLB update */
> +		}
> +
> +		/* Clear the xpfo fast-path flag */
> +		CLEAR_XPFO_FLAG(user_fp, page + i);
> +	}
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	unsigned long flags;
> +
> +	/* The page is allocated to kernel space, so nothing to do */
> +	if (TEST_XPFO_FLAG(kernel, page))
> +		return;
> +
> +	spin_lock_irqsave(&page->xpfo.lock, flags);
> +
> +	/*
> +	 * The page was previously allocated to user space, so map it back
> +	 * into the kernel. No TLB update required.
> +	 */
> +	if ((atomic_inc_return(&page->xpfo.mapcount) == 1) &&
> +	    TEST_XPFO_FLAG(user, page))
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +	spin_unlock_irqrestore(&page->xpfo.lock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	unsigned long flags;
> +
> +	/* The page is allocated to kernel space, so nothing to do */
> +	if (TEST_XPFO_FLAG(kernel, page))
> +		return;
> +
> +	spin_lock_irqsave(&page->xpfo.lock, flags);
> +
> +	/*
> +	 * The page frame is to be allocated back to user space. So unmap it
> +	 * from the kernel, update the TLB and mark it as a user page.
> +	 */
> +	if ((atomic_dec_return(&page->xpfo.mapcount) == 0) &&
> +	    (TEST_XPFO_FLAG(user_fp, page) || TEST_XPFO_FLAG(user, page))) {
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +		__flush_tlb_one((unsigned long)kaddr);
> +		SET_XPFO_FLAG(user, page);
> +	}
> +
> +	spin_unlock_irqrestore(&page->xpfo.lock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);

I'm confused by the checks in kmap/kunmap here. It looks like once the
page is allocated there is no changing of flags between user and
kernel mode so the checks for if the page is user seem redundant.

> diff --git a/block/blk-map.c b/block/blk-map.c
> index f565e11..b7b8302 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -107,7 +107,12 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
>   		prv.iov_len = iov.iov_len;
>   	}
>
> -	if (unaligned || (q->dma_pad_mask & iter->count) || map_data)
> +	/*
> +	 * juergh: Temporary hack to force the use of a bounce buffer if XPFO
> +	 * is enabled. Results in an XPFO page fault otherwise.
> +	 */
> +	if (unaligned || (q->dma_pad_mask & iter->count) || map_data ||
> +	    IS_ENABLED(CONFIG_XPFO))
>   		bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
>   	else
>   		bio = bio_map_user_iov(q, iter, gfp_mask);
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index bb3f329..0ca9130 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -55,24 +55,37 @@ static inline struct page *kmap_to_page(void *addr)
>   #ifndef ARCH_HAS_KMAP
>   static inline void *kmap(struct page *page)
>   {
> +	void *kaddr;
> +
>   	might_sleep();
> -	return page_address(page);
> +
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>   }
>
>   static inline void kunmap(struct page *page)
>   {
> +	xpfo_kunmap(page_address(page), page);
>   }
>
>   static inline void *kmap_atomic(struct page *page)
>   {
> +	void *kaddr;
> +
>   	preempt_disable();
>   	pagefault_disable();
> -	return page_address(page);
> +
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>   }
>   #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>
>   static inline void __kunmap_atomic(void *addr)
>   {
> +	xpfo_kunmap(addr, virt_to_page(addr));
> +
>   	pagefault_enable();
>   	preempt_enable();
>   }
> @@ -133,7 +146,8 @@ do {                                                            \
>   static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>   {
>   	void *addr = kmap_atomic(page);
> -	clear_user_page(addr, vaddr, page);
> +	if (!xpfo_test_and_clear_zap(page))
> +		clear_user_page(addr, vaddr, page);
>   	kunmap_atomic(addr);
>   }
>   #endif
> @@ -186,7 +200,8 @@ alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>   static inline void clear_highpage(struct page *page)
>   {
>   	void *kaddr = kmap_atomic(page);
> -	clear_page(kaddr);
> +	if (!xpfo_test_and_clear_zap(page))
> +		clear_page(kaddr);
>   	kunmap_atomic(kaddr);
>   }
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 624b78b..71c95aa 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -12,6 +12,7 @@
>   #include <linux/cpumask.h>
>   #include <linux/uprobes.h>
>   #include <linux/page-flags-layout.h>
> +#include <linux/xpfo.h>
>   #include <asm/page.h>
>   #include <asm/mmu.h>
>
> @@ -215,6 +216,9 @@ struct page {
>   #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>   	int _last_cpupid;
>   #endif
> +#ifdef CONFIG_XPFO
> +	struct xpfo_info xpfo;
> +#endif
>   }
>   /*
>    * The struct page can be forced to be double word aligned so that atomic ops
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> new file mode 100644
> index 0000000..c4f0871
> --- /dev/null
> +++ b/include/linux/xpfo.h
> @@ -0,0 +1,88 @@
> +/*
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + *
> + * Authors:
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_XPFO_H
> +#define _LINUX_XPFO_H
> +
> +#ifdef CONFIG_XPFO
> +
> +/*
> + * XPFO page flags:
> + *
> + * PG_XPFO_user_fp denotes that the page is allocated to user space. This flag
> + * is used in the fast path, where the page is marked accordingly but *not*
> + * unmapped from the kernel. In most cases, the kernel will need access to the
> + * page immediately after its acquisition so an unnecessary mapping operation
> + * is avoided.
> + *
> + * PG_XPFO_user denotes that the page is destined for user space. This flag is
> + * used in the slow path, where the page needs to be mapped/unmapped when the
> + * kernel wants to access it. If a page is deallocated and this flag is set,
> + * the page is cleared and mapped back into the kernel.
> + *
> + * PG_XPFO_kernel denotes a page that is destined for kernel space. This is used
> + * for identifying pages that are first assigned to kernel space and then freed
> + * and mapped to user space. In such cases, an expensive TLB shootdown is
> + * necessary. Pages allocated to user space, freed, and subsequently allocated
> + * to user space again, require only local TLB invalidation.
> + *
> + * PG_XPFO_zap indicates that the page has been zapped. This flag is used to
> + * avoid zapping pages multiple times. Whenever a page is freed and was
> + * previously mapped to user space, it needs to be zapped before being mapped
> + * back into the kernel.
> + */

'zap' doesn't really indicate what is actually happening with the page. Can you
be a bit more descriptive about what this actually does?

> +
> +enum xpfo_pageflags {
> +	PG_XPFO_user_fp,
> +	PG_XPFO_user,
> +	PG_XPFO_kernel,
> +	PG_XPFO_zap,
> +};
> +
> +struct xpfo_info {
> +	unsigned long flags;	/* Flags for tracking the page's XPFO state */
> +	atomic_t mapcount;	/* Counter for balancing page map/unmap
> +				 * requests. Only the first map request maps
> +				 * the page back to kernel space. Likewise,
> +				 * only the last unmap request unmaps the page.
> +				 */
> +	spinlock_t lock;	/* Lock to serialize concurrent map/unmap
> +				 * requests.
> +				 */
> +};

Can you change this to use the page_ext implementation? See what
mm/page_owner.c does. This might lessen the impact of the extra
page metadata. This metadata still feels like a copy of what
mm/highmem.c is trying to do though.
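A pseudocode sketch of what a page_ext-based version could look like, modeled loosely on mm/page_owner.c (hook names, field placement, and the lookup call shown are assumptions about that internal API, not a tested conversion):

```c
/* Pseudocode sketch only -- not compilable outside the kernel tree.
 *
 * The per-page XPFO state would move out of struct page and into
 * struct page_ext (include/linux/page_ext.h), e.g. fields guarded by
 * CONFIG_XPFO: xpfo_flags, xpfo_mapcount, xpfo_lock.
 */

static bool need_xpfo(void)
{
	return true;	/* could instead be gated on a boot parameter */
}

static void init_xpfo(void)
{
	/* one-time setup, e.g. initializing the locks for every entry */
}

struct page_ext_operations xpfo_ops = {
	.need = need_xpfo,
	.init = init_xpfo,
};

/* Accesses to page->xpfo.* would then become lookups, roughly: */
/*	struct page_ext *ext = lookup_page_ext(page); */
```

That keeps struct page untouched when XPFO is compiled out or disabled at boot, which should reduce the memory-overhead objection.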

> +
> +extern void xpfo_clear_zap(struct page *page, int order);
> +extern int xpfo_test_and_clear_zap(struct page *page);
> +extern int xpfo_test_kernel(struct page *page);
> +extern int xpfo_test_user(struct page *page);
> +
> +extern void xpfo_kmap(void *kaddr, struct page *page);
> +extern void xpfo_kunmap(void *kaddr, struct page *page);
> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> +extern void xpfo_free_page(struct page *page, int order);
> +
> +#else /* ifdef CONFIG_XPFO */
> +
> +static inline void xpfo_clear_zap(struct page *page, int order) { }
> +static inline int xpfo_test_and_clear_zap(struct page *page) { return 0; }
> +static inline int xpfo_test_kernel(struct page *page) { return 0; }
> +static inline int xpfo_test_user(struct page *page) { return 0; }
> +
> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> +static inline void xpfo_free_page(struct page *page, int order) { }
> +
> +#endif /* ifdef CONFIG_XPFO */
> +
> +#endif /* ifndef _LINUX_XPFO_H */
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 76f29ec..cf57ee9 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>   {
>   	unsigned long pfn = PFN_DOWN(orig_addr);
>   	unsigned char *vaddr = phys_to_virt(tlb_addr);
> +	struct page *page = pfn_to_page(pfn);
>
> -	if (PageHighMem(pfn_to_page(pfn))) {
> +	if (PageHighMem(page) || xpfo_test_user(page)) {
>   		/* The buffer does not have a mapping.  Map it in and copy */
>   		unsigned int offset = orig_addr & ~PAGE_MASK;
>   		char *buffer;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 838ca8bb..47b42a3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1003,6 +1003,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
>   	}
>   	arch_free_page(page, order);
>   	kernel_map_pages(page, 1 << order, 0);
> +	xpfo_free_page(page, order);
>
>   	return true;
>   }
> @@ -1398,10 +1399,13 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
>   	arch_alloc_page(page, order);
>   	kernel_map_pages(page, 1 << order, 1);
>   	kasan_alloc_pages(page, order);
> +	xpfo_alloc_page(page, order, gfp_flags);
>
>   	if (gfp_flags & __GFP_ZERO)
>   		for (i = 0; i < (1 << order); i++)
>   			clear_highpage(page + i);
> +	else
> +		xpfo_clear_zap(page, order);
>
>   	if (order && (gfp_flags & __GFP_COMP))
>   		prep_compound_page(page, order);
> @@ -2072,10 +2076,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>   	}
>
>   	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> -	if (!cold)
> +	if (!cold && !xpfo_test_kernel(page))
>   		list_add(&page->lru, &pcp->lists[migratetype]);
>   	else
>   		list_add_tail(&page->lru, &pcp->lists[migratetype]);
> +

What's the advantage of this?

>   	pcp->count++;
>   	if (pcp->count >= pcp->high) {
>   		unsigned long batch = READ_ONCE(pcp->batch);
>

Thanks,
Laura

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-03-01  1:31   ` Laura Abbott
  0 siblings, 0 replies; 93+ messages in thread
From: Laura Abbott @ 2016-03-01  1:31 UTC (permalink / raw)
  To: Juerg Haefliger, linux-kernel, linux-mm; +Cc: vpk, Kees Cook

On 02/26/2016 06:21 AM, Juerg Haefliger wrote:
> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userland, unless explicitly requested by the
> kernel. Whenever a page destined for userland is allocated, it is
> unmapped from physmap. When such a page is reclaimed from userland, it is
> mapped back to physmap.
>
> Mapping/unmapping from physmap is accomplished by modifying the PTE
> permission bits to allow/disallow access to the page.
>
> Additional fields are added to the page struct for XPFO housekeeping.
> Specifically a flags field to distinguish user vs. kernel pages, a
> reference counter to track physmap map/unmap operations and a lock to
> protect the XPFO fields.
>
> Known issues/limitations:
>    - Only supported on x86-64.
>    - Only supports 4k pages.
>    - Adds additional data to the page struct.
>    - There are most likely some additional and legitimate uses cases where
>      the kernel needs to access userspace. Those need to be identified and
>      made XPFO-aware.
>    - There's a performance impact if XPFO is turned on. Per the paper
>      referenced below it's in the 1-3% ballpark. More performance testing
>      wouldn't hurt. What tests to run though?
>
> Reference paper by the original patch authors:
>    http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
>

General note: Make sure to cc the x86 maintainers on the next version of
the patch. I'd also recommend ccing the kernel hardening list (see the wiki
page http://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project for
details)

If you can find a way to break this up into x86 specific vs. generic patches
that would be better. Perhaps move the Kconfig for XPFO to the generic
Kconfig layer and make it depend on ARCH_HAS_XPFO? x86 can then select
ARCH_HAS_XPFO as the last option.

There also isn't much that's actually x86 specific here except for
some of the page table manipulation functions and even those can probably
be abstracted away. It would be good to get more of this out of x86 to
let other arches take advantage of it. The arm64 implementation would
look pretty similar if you save the old kernel mapping and restore
it on free.

  
> Suggested-by: Vasileios P. Kemerlis <vpk@cs.brown.edu>
> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
> ---
>   arch/x86/Kconfig         |   2 +-
>   arch/x86/Kconfig.debug   |  17 +++++
>   arch/x86/mm/Makefile     |   2 +
>   arch/x86/mm/init.c       |   3 +-
>   arch/x86/mm/xpfo.c       | 176 +++++++++++++++++++++++++++++++++++++++++++++++
>   block/blk-map.c          |   7 +-
>   include/linux/highmem.h  |  23 +++++--
>   include/linux/mm_types.h |   4 ++
>   include/linux/xpfo.h     |  88 ++++++++++++++++++++++++
>   lib/swiotlb.c            |   3 +-
>   mm/page_alloc.c          |   7 +-
>   11 files changed, 323 insertions(+), 9 deletions(-)
>   create mode 100644 arch/x86/mm/xpfo.c
>   create mode 100644 include/linux/xpfo.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index c46662f..9d32b4a 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1343,7 +1343,7 @@ config ARCH_DMA_ADDR_T_64BIT
>
>   config X86_DIRECT_GBPAGES
>   	def_bool y
> -	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
> +	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>   	---help---
>   	  Certain kernel features effectively disable kernel
>   	  linear 1 GB mappings (even if the CPU otherwise
> diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
> index 9b18ed9..1331da5 100644
> --- a/arch/x86/Kconfig.debug
> +++ b/arch/x86/Kconfig.debug
> @@ -5,6 +5,23 @@ config TRACE_IRQFLAGS_SUPPORT
>
>   source "lib/Kconfig.debug"
>
> +config XPFO
> +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +	default n
> +	depends on DEBUG_KERNEL
> +	depends on X86_64
> +	select DEBUG_TLBFLUSH
> +	---help---
> +	  This option offers protection against 'ret2dir' (kernel) attacks.
> +	  When enabled, every time a page frame is allocated to user space, it
> +	  is unmapped from the direct mapped RAM region in kernel space
> +	  (physmap). Similarly, whenever page frames are freed/reclaimed, they
> +	  are mapped back to physmap. Special care is taken to minimize the
> +	  impact on performance by reducing TLB shootdowns and unnecessary page
> +	  zero fills.
> +
> +	  If in doubt, say "N".
> +
>   config X86_VERBOSE_BOOTUP
>   	bool "Enable verbose x86 bootup info messages"
>   	default y
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index f9d38a4..8bf52b6 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -34,3 +34,5 @@ obj-$(CONFIG_ACPI_NUMA)		+= srat.o
>   obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
>
>   obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
> +
> +obj-$(CONFIG_XPFO)		+= xpfo.o
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 493f541..27fc8a6 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -150,7 +150,8 @@ static int page_size_mask;
>
>   static void __init probe_page_size_mask(void)
>   {
> -#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
> +#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK) && \
> +	!defined(CONFIG_XPFO)
>   	/*
>   	 * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
>   	 * This will simplify cpa(), which otherwise needs to support splitting
> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
> new file mode 100644
> index 0000000..6bc24d3
> --- /dev/null
> +++ b/arch/x86/mm/xpfo.c
> @@ -0,0 +1,176 @@
> +/*
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + *
> + * Authors:
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +
> +#include <asm/pgtable.h>
> +#include <asm/tlbflush.h>
> +
> +#define TEST_XPFO_FLAG(flag, page) \
> +	test_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +#define SET_XPFO_FLAG(flag, page)			\
> +	__set_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +#define CLEAR_XPFO_FLAG(flag, page)			\
> +	__clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +#define TEST_AND_CLEAR_XPFO_FLAG(flag, page)			\
> +	__test_and_clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +/*
> + * Update a single kernel page table entry
> + */
> +static inline void set_kpte(struct page *page, unsigned long kaddr,
> +			    pgprot_t prot) {
> +	unsigned int level;
> +	pte_t *kpte = lookup_address(kaddr, &level);
> +
> +	/* We only support 4k pages for now */
> +	BUG_ON(!kpte || level != PG_LEVEL_4K);
> +
> +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> +}
> +
> +inline void xpfo_clear_zap(struct page *page, int order)
> +{
> +	int i;
> +
> +	for (i = 0; i < (1 << order); i++)
> +		CLEAR_XPFO_FLAG(zap, page + i);
> +}
> +
> +inline int xpfo_test_and_clear_zap(struct page *page)
> +{
> +	return TEST_AND_CLEAR_XPFO_FLAG(zap, page);
> +}
> +
> +inline int xpfo_test_kernel(struct page *page)
> +{
> +	return TEST_XPFO_FLAG(kernel, page);
> +}
> +
> +inline int xpfo_test_user(struct page *page)
> +{
> +	return TEST_XPFO_FLAG(user, page);
> +}
> +
> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> +{
> +	int i, tlb_shoot = 0;
> +	unsigned long kaddr;
> +
> +	for (i = 0; i < (1 << order); i++)  {
> +		WARN_ON(TEST_XPFO_FLAG(user_fp, page + i) ||
> +			TEST_XPFO_FLAG(user, page + i));
> +
> +		if (gfp & GFP_HIGHUSER) {

This check doesn't seem right: (gfp & GFP_HIGHUSER) is true whenever the GFP
flags share _any_ bit with GFP_HIGHUSER, so GFP_KERNEL allocations (which
share the reclaim/IO/FS bits) will be marked as user pages as well. Did you
mean ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER)?

> +			/* Initialize the xpfo lock and map counter */
> +			spin_lock_init(&(page + i)->xpfo.lock);

This reinitializes the spinlock on every allocation, which isn't really necessary; once would do.

> +			atomic_set(&(page + i)->xpfo.mapcount, 0);
> +
> +			/* Mark it as a user page */
> +			SET_XPFO_FLAG(user_fp, page + i);
> +
> +			/*
> +			 * Shoot the TLB if the page was previously allocated
> +			 * to kernel space
> +			 */
> +			if (TEST_AND_CLEAR_XPFO_FLAG(kernel, page + i))
> +				tlb_shoot = 1;
> +		} else {
> +			/* Mark it as a kernel page */
> +			SET_XPFO_FLAG(kernel, page + i);
> +		}
> +	}
> +
> +	if (tlb_shoot) {
> +		kaddr = (unsigned long)page_address(page);
> +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> +				       PAGE_SIZE);
> +	}
> +}
> +
> +void xpfo_free_page(struct page *page, int order)
> +{
> +	int i;
> +	unsigned long kaddr;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +
> +		/* The page frame was previously allocated to user space */
> +		if (TEST_AND_CLEAR_XPFO_FLAG(user, page + i)) {
> +			kaddr = (unsigned long)page_address(page + i);
> +
> +			/* Clear the page and mark it accordingly */
> +			clear_page((void *)kaddr);

Clearing the page isn't related to XPFO; there's separate work ongoing to
clear pages on free.

> +			SET_XPFO_FLAG(zap, page + i);
> +
> +			/* Map it back to kernel space */
> +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> +
> +			/* No TLB update */
> +		}
> +
> +		/* Clear the xpfo fast-path flag */
> +		CLEAR_XPFO_FLAG(user_fp, page + i);
> +	}
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	unsigned long flags;
> +
> +	/* The page is allocated to kernel space, so nothing to do */
> +	if (TEST_XPFO_FLAG(kernel, page))
> +		return;
> +
> +	spin_lock_irqsave(&page->xpfo.lock, flags);
> +
> +	/*
> +	 * The page was previously allocated to user space, so map it back
> +	 * into the kernel. No TLB update required.
> +	 */
> +	if ((atomic_inc_return(&page->xpfo.mapcount) == 1) &&
> +	    TEST_XPFO_FLAG(user, page))
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +	spin_unlock_irqrestore(&page->xpfo.lock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	unsigned long flags;
> +
> +	/* The page is allocated to kernel space, so nothing to do */
> +	if (TEST_XPFO_FLAG(kernel, page))
> +		return;
> +
> +	spin_lock_irqsave(&page->xpfo.lock, flags);
> +
> +	/*
> +	 * The page frame is to be allocated back to user space. So unmap it
> +	 * from the kernel, update the TLB and mark it as a user page.
> +	 */
> +	if ((atomic_dec_return(&page->xpfo.mapcount) == 0) &&
> +	    (TEST_XPFO_FLAG(user_fp, page) || TEST_XPFO_FLAG(user, page))) {
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +		__flush_tlb_one((unsigned long)kaddr);
> +		SET_XPFO_FLAG(user, page);
> +	}
> +
> +	spin_unlock_irqrestore(&page->xpfo.lock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);

I'm confused by the checks in kmap/kunmap here. It looks like once a page is
allocated, its user/kernel flags never change, so the checks for whether the
page is a user page seem redundant.

> diff --git a/block/blk-map.c b/block/blk-map.c
> index f565e11..b7b8302 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -107,7 +107,12 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
>   		prv.iov_len = iov.iov_len;
>   	}
>
> -	if (unaligned || (q->dma_pad_mask & iter->count) || map_data)
> +	/*
> +	 * juergh: Temporary hack to force the use of a bounce buffer if XPFO
> +	 * is enabled. Results in an XPFO page fault otherwise.
> +	 */
> +	if (unaligned || (q->dma_pad_mask & iter->count) || map_data ||
> +	    IS_ENABLED(CONFIG_XPFO))
>   		bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
>   	else
>   		bio = bio_map_user_iov(q, iter, gfp_mask);
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index bb3f329..0ca9130 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -55,24 +55,37 @@ static inline struct page *kmap_to_page(void *addr)
>   #ifndef ARCH_HAS_KMAP
>   static inline void *kmap(struct page *page)
>   {
> +	void *kaddr;
> +
>   	might_sleep();
> -	return page_address(page);
> +
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>   }
>
>   static inline void kunmap(struct page *page)
>   {
> +	xpfo_kunmap(page_address(page), page);
>   }
>
>   static inline void *kmap_atomic(struct page *page)
>   {
> +	void *kaddr;
> +
>   	preempt_disable();
>   	pagefault_disable();
> -	return page_address(page);
> +
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>   }
>   #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>
>   static inline void __kunmap_atomic(void *addr)
>   {
> +	xpfo_kunmap(addr, virt_to_page(addr));
> +
>   	pagefault_enable();
>   	preempt_enable();
>   }
> @@ -133,7 +146,8 @@ do {                                                            \
>   static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>   {
>   	void *addr = kmap_atomic(page);
> -	clear_user_page(addr, vaddr, page);
> +	if (!xpfo_test_and_clear_zap(page))
> +		clear_user_page(addr, vaddr, page);
>   	kunmap_atomic(addr);
>   }
>   #endif
> @@ -186,7 +200,8 @@ alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>   static inline void clear_highpage(struct page *page)
>   {
>   	void *kaddr = kmap_atomic(page);
> -	clear_page(kaddr);
> +	if (!xpfo_test_and_clear_zap(page))
> +		clear_page(kaddr);
>   	kunmap_atomic(kaddr);
>   }
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 624b78b..71c95aa 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -12,6 +12,7 @@
>   #include <linux/cpumask.h>
>   #include <linux/uprobes.h>
>   #include <linux/page-flags-layout.h>
> +#include <linux/xpfo.h>
>   #include <asm/page.h>
>   #include <asm/mmu.h>
>
> @@ -215,6 +216,9 @@ struct page {
>   #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>   	int _last_cpupid;
>   #endif
> +#ifdef CONFIG_XPFO
> +	struct xpfo_info xpfo;
> +#endif
>   }
>   /*
>    * The struct page can be forced to be double word aligned so that atomic ops
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> new file mode 100644
> index 0000000..c4f0871
> --- /dev/null
> +++ b/include/linux/xpfo.h
> @@ -0,0 +1,88 @@
> +/*
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + *
> + * Authors:
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_XPFO_H
> +#define _LINUX_XPFO_H
> +
> +#ifdef CONFIG_XPFO
> +
> +/*
> + * XPFO page flags:
> + *
> + * PG_XPFO_user_fp denotes that the page is allocated to user space. This flag
> + * is used in the fast path, where the page is marked accordingly but *not*
> + * unmapped from the kernel. In most cases, the kernel will need access to the
> + * page immediately after its acquisition so an unnecessary mapping operation
> + * is avoided.
> + *
> + * PG_XPFO_user denotes that the page is destined for user space. This flag is
> + * used in the slow path, where the page needs to be mapped/unmapped when the
> + * kernel wants to access it. If a page is deallocated and this flag is set,
> + * the page is cleared and mapped back into the kernel.
> + *
> + * PG_XPFO_kernel denotes a page that is destined to kernel space. This is used
> + * for identifying pages that are first assigned to kernel space and then freed
> + * and mapped to user space. In such cases, an expensive TLB shootdown is
> + * necessary. Pages allocated to user space, freed, and subsequently allocated
> + * to user space again, require only local TLB invalidation.
> + *
> + * PG_XPFO_zap indicates that the page has been zapped. This flag is used to
> + * avoid zapping pages multiple times. Whenever a page is freed and was
> + * previously mapped to user space, it needs to be zapped before mapped back
> + * in to the kernel.
> + */

'zap' doesn't really indicate what is happening to the page. Can you be a bit
more descriptive about what this flag actually tracks?

> +
> +enum xpfo_pageflags {
> +	PG_XPFO_user_fp,
> +	PG_XPFO_user,
> +	PG_XPFO_kernel,
> +	PG_XPFO_zap,
> +};
> +
> +struct xpfo_info {
> +	unsigned long flags;	/* Flags for tracking the page's XPFO state */
> +	atomic_t mapcount;	/* Counter for balancing page map/unmap
> +				 * requests. Only the first map request maps
> +				 * the page back to kernel space. Likewise,
> +				 * only the last unmap request unmaps the page.
> +				 */
> +	spinlock_t lock;	/* Lock to serialize concurrent map/unmap
> +				 * requests.
> +				 */
> +};

Can you change this to use the page_ext implementation? See what
mm/page_owner.c does. This might lessen the impact of the extra
page metadata. This metadata still feels like a copy of what
mm/highmem.c is trying to do though.

> +
> +extern void xpfo_clear_zap(struct page *page, int order);
> +extern int xpfo_test_and_clear_zap(struct page *page);
> +extern int xpfo_test_kernel(struct page *page);
> +extern int xpfo_test_user(struct page *page);
> +
> +extern void xpfo_kmap(void *kaddr, struct page *page);
> +extern void xpfo_kunmap(void *kaddr, struct page *page);
> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> +extern void xpfo_free_page(struct page *page, int order);
> +
> +#else /* ifdef CONFIG_XPFO */
> +
> +static inline void xpfo_clear_zap(struct page *page, int order) { }
> +static inline int xpfo_test_and_clear_zap(struct page *page) { return 0; }
> +static inline int xpfo_test_kernel(struct page *page) { return 0; }
> +static inline int xpfo_test_user(struct page *page) { return 0; }
> +
> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> +static inline void xpfo_free_page(struct page *page, int order) { }
> +
> +#endif /* ifdef CONFIG_XPFO */
> +
> +#endif /* ifndef _LINUX_XPFO_H */
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 76f29ec..cf57ee9 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>   {
>   	unsigned long pfn = PFN_DOWN(orig_addr);
>   	unsigned char *vaddr = phys_to_virt(tlb_addr);
> +	struct page *page = pfn_to_page(pfn);
>
> -	if (PageHighMem(pfn_to_page(pfn))) {
> +	if (PageHighMem(page) || xpfo_test_user(page)) {
>   		/* The buffer does not have a mapping.  Map it in and copy */
>   		unsigned int offset = orig_addr & ~PAGE_MASK;
>   		char *buffer;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 838ca8bb..47b42a3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1003,6 +1003,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
>   	}
>   	arch_free_page(page, order);
>   	kernel_map_pages(page, 1 << order, 0);
> +	xpfo_free_page(page, order);
>
>   	return true;
>   }
> @@ -1398,10 +1399,13 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
>   	arch_alloc_page(page, order);
>   	kernel_map_pages(page, 1 << order, 1);
>   	kasan_alloc_pages(page, order);
> +	xpfo_alloc_page(page, order, gfp_flags);
>
>   	if (gfp_flags & __GFP_ZERO)
>   		for (i = 0; i < (1 << order); i++)
>   			clear_highpage(page + i);
> +	else
> +		xpfo_clear_zap(page, order);
>
>   	if (order && (gfp_flags & __GFP_COMP))
>   		prep_compound_page(page, order);
> @@ -2072,10 +2076,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>   	}
>
>   	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> -	if (!cold)
> +	if (!cold && !xpfo_test_kernel(page))
>   		list_add(&page->lru, &pcp->lists[migratetype]);
>   	else
>   		list_add_tail(&page->lru, &pcp->lists[migratetype]);
> +

What's the advantage of this?

>   	pcp->count++;
>   	if (pcp->count >= pcp->high) {
>   		unsigned long batch = READ_ONCE(pcp->batch);
>

Thanks,
Laura


* Re: [RFC PATCH] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-02-26 14:21 ` Juerg Haefliger
@ 2016-03-01  2:10   ` Balbir Singh
  0 siblings, 0 replies; 93+ messages in thread
From: Balbir Singh @ 2016-03-01  2:10 UTC (permalink / raw)
  To: Juerg Haefliger, linux-kernel, linux-mm; +Cc: vpk



On 27/02/16 01:21, Juerg Haefliger wrote:
> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userland, unless explicitly requested by the
> kernel. Whenever a page destined for userland is allocated, it is
> unmapped from physmap. When such a page is reclaimed from userland, it is
> mapped back to physmap.
physmap == Xen physmap? Please clarify.
> Mapping/unmapping from physmap is accomplished by modifying the PTE
> permission bits to allow/disallow access to the page.
>
> Additional fields are added to the page struct for XPFO housekeeping.
> Specifically a flags field to distinguish user vs. kernel pages, a
> reference counter to track physmap map/unmap operations and a lock to
> protect the XPFO fields.
>
> Known issues/limitations:
>   - Only supported on x86-64.
Is it due to lack of porting or a design limitation?
>   - Only supports 4k pages.
>   - Adds additional data to the page struct.
>   - There are most likely some additional and legitimate uses cases where
>     the kernel needs to access userspace. Those need to be identified and
>     made XPFO-aware.
Why not build an audit mode for it?
>   - There's a performance impact if XPFO is turned on. Per the paper
>     referenced below it's in the 1-3% ballpark. More performance testing
>     wouldn't hurt. What tests to run though?
>
> Reference paper by the original patch authors:
>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
>
> Suggested-by: Vasileios P. Kemerlis <vpk@cs.brown.edu>
> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
This patch needs to be broken down into a series of smaller patches.
> ---
>  arch/x86/Kconfig         |   2 +-
>  arch/x86/Kconfig.debug   |  17 +++++
>  arch/x86/mm/Makefile     |   2 +
>  arch/x86/mm/init.c       |   3 +-
>  arch/x86/mm/xpfo.c       | 176 +++++++++++++++++++++++++++++++++++++++++++++++
>  block/blk-map.c          |   7 +-
>  include/linux/highmem.h  |  23 +++++--
>  include/linux/mm_types.h |   4 ++
>  include/linux/xpfo.h     |  88 ++++++++++++++++++++++++
>  lib/swiotlb.c            |   3 +-
>  mm/page_alloc.c          |   7 +-
>  11 files changed, 323 insertions(+), 9 deletions(-)
>  create mode 100644 arch/x86/mm/xpfo.c
>  create mode 100644 include/linux/xpfo.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index c46662f..9d32b4a 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1343,7 +1343,7 @@ config ARCH_DMA_ADDR_T_64BIT
>  
>  config X86_DIRECT_GBPAGES
>  	def_bool y
> -	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
> +	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>  	---help---
>  	  Certain kernel features effectively disable kernel
>  	  linear 1 GB mappings (even if the CPU otherwise
> diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
> index 9b18ed9..1331da5 100644
> --- a/arch/x86/Kconfig.debug
> +++ b/arch/x86/Kconfig.debug
> @@ -5,6 +5,23 @@ config TRACE_IRQFLAGS_SUPPORT
>  
>  source "lib/Kconfig.debug"
>  
> +config XPFO
> +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +	default n
> +	depends on DEBUG_KERNEL
> +	depends on X86_64
> +	select DEBUG_TLBFLUSH
> +	---help---
> +	  This option offers protection against 'ret2dir' (kernel) attacks.
> +	  When enabled, every time a page frame is allocated to user space, it
> +	  is unmapped from the direct mapped RAM region in kernel space
> +	  (physmap). Similarly, whenever page frames are freed/reclaimed, they
> +	  are mapped back to physmap. Special care is taken to minimize the
> +	  impact on performance by reducing TLB shootdowns and unnecessary page
> +	  zero fills.
> +
> +	  If in doubt, say "N".
> +
>  config X86_VERBOSE_BOOTUP
>  	bool "Enable verbose x86 bootup info messages"
>  	default y
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index f9d38a4..8bf52b6 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -34,3 +34,5 @@ obj-$(CONFIG_ACPI_NUMA)		+= srat.o
>  obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
>  
>  obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
> +
> +obj-$(CONFIG_XPFO)		+= xpfo.o
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 493f541..27fc8a6 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -150,7 +150,8 @@ static int page_size_mask;
>  
>  static void __init probe_page_size_mask(void)
>  {
> -#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
> +#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK) && \
> +	!defined(CONFIG_XPFO)
>  	/*
>  	 * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
>  	 * This will simplify cpa(), which otherwise needs to support splitting
> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
> new file mode 100644
> index 0000000..6bc24d3
> --- /dev/null
> +++ b/arch/x86/mm/xpfo.c
> @@ -0,0 +1,176 @@
> +/*
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + *
> + * Authors:
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +
> +#include <asm/pgtable.h>
> +#include <asm/tlbflush.h>
> +
> +#define TEST_XPFO_FLAG(flag, page) \
> +	test_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +#define SET_XPFO_FLAG(flag, page)			\
> +	__set_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +#define CLEAR_XPFO_FLAG(flag, page)			\
> +	__clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +#define TEST_AND_CLEAR_XPFO_FLAG(flag, page)			\
> +	__test_and_clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +/*
> + * Update a single kernel page table entry
> + */
> +static inline void set_kpte(struct page *page, unsigned long kaddr,
> +			    pgprot_t prot) {
> +	unsigned int level;
> +	pte_t *kpte = lookup_address(kaddr, &level);
> +
> +	/* We only support 4k pages for now */
> +	BUG_ON(!kpte || level != PG_LEVEL_4K);
> +
> +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> +}
> +
> +inline void xpfo_clear_zap(struct page *page, int order)
> +{
> +	int i;
> +
> +	for (i = 0; i < (1 << order); i++)
> +		CLEAR_XPFO_FLAG(zap, page + i);
> +}
> +
> +inline int xpfo_test_and_clear_zap(struct page *page)
> +{
> +	return TEST_AND_CLEAR_XPFO_FLAG(zap, page);
> +}
> +
> +inline int xpfo_test_kernel(struct page *page)
> +{
> +	return TEST_XPFO_FLAG(kernel, page);
> +}
> +
> +inline int xpfo_test_user(struct page *page)
> +{
> +	return TEST_XPFO_FLAG(user, page);
> +}
> +
> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> +{
> +	int i, tlb_shoot = 0;
> +	unsigned long kaddr;
> +
> +	for (i = 0; i < (1 << order); i++)  {
> +		WARN_ON(TEST_XPFO_FLAG(user_fp, page + i) ||
> +			TEST_XPFO_FLAG(user, page + i));
> +
> +		if (gfp & GFP_HIGHUSER) {
Why GFP_HIGHUSER?
> +			/* Initialize the xpfo lock and map counter */
> +			spin_lock_init(&(page + i)->xpfo.lock);
> +			atomic_set(&(page + i)->xpfo.mapcount, 0);
> +
> +			/* Mark it as a user page */
> +			SET_XPFO_FLAG(user_fp, page + i);
> +
> +			/*
> +			 * Shoot the TLB if the page was previously allocated
> +			 * to kernel space
> +			 */
> +			if (TEST_AND_CLEAR_XPFO_FLAG(kernel, page + i))
> +				tlb_shoot = 1;
> +		} else {
> +			/* Mark it as a kernel page */
> +			SET_XPFO_FLAG(kernel, page + i);
> +		}
> +	}
> +
> +	if (tlb_shoot) {
> +		kaddr = (unsigned long)page_address(page);
> +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> +				       PAGE_SIZE);
> +	}
> +}
> +
> +void xpfo_free_page(struct page *page, int order)
> +{
> +	int i;
> +	unsigned long kaddr;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +
> +		/* The page frame was previously allocated to user space */
> +		if (TEST_AND_CLEAR_XPFO_FLAG(user, page + i)) {
> +			kaddr = (unsigned long)page_address(page + i);
> +
> +			/* Clear the page and mark it accordingly */
> +			clear_page((void *)kaddr);
> +			SET_XPFO_FLAG(zap, page + i);
> +
> +			/* Map it back to kernel space */
> +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> +
> +			/* No TLB update */
> +		}
> +
> +		/* Clear the xpfo fast-path flag */
> +		CLEAR_XPFO_FLAG(user_fp, page + i);
> +	}
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	unsigned long flags;
> +
> +	/* The page is allocated to kernel space, so nothing to do */
> +	if (TEST_XPFO_FLAG(kernel, page))
> +		return;
> +
> +	spin_lock_irqsave(&page->xpfo.lock, flags);
> +
> +	/*
> +	 * The page was previously allocated to user space, so map it back
> +	 * into the kernel. No TLB update required.
> +	 */
> +	if ((atomic_inc_return(&page->xpfo.mapcount) == 1) &&
> +	    TEST_XPFO_FLAG(user, page))
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +	spin_unlock_irqrestore(&page->xpfo.lock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	unsigned long flags;
> +
> +	/* The page is allocated to kernel space, so nothing to do */
> +	if (TEST_XPFO_FLAG(kernel, page))
> +		return;
> +
> +	spin_lock_irqsave(&page->xpfo.lock, flags);
> +
> +	/*
> +	 * The page frame is to be allocated back to user space. So unmap it
> +	 * from the kernel, update the TLB and mark it as a user page.
> +	 */
> +	if ((atomic_dec_return(&page->xpfo.mapcount) == 0) &&
> +	    (TEST_XPFO_FLAG(user_fp, page) || TEST_XPFO_FLAG(user, page))) {
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +		__flush_tlb_one((unsigned long)kaddr);
> +		SET_XPFO_FLAG(user, page);
> +	}
> +
> +	spin_unlock_irqrestore(&page->xpfo.lock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);
> diff --git a/block/blk-map.c b/block/blk-map.c
> index f565e11..b7b8302 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -107,7 +107,12 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
>  		prv.iov_len = iov.iov_len;
>  	}
>  
> -	if (unaligned || (q->dma_pad_mask & iter->count) || map_data)
> +	/*
> +	 * juergh: Temporary hack to force the use of a bounce buffer if XPFO
> +	 * is enabled. Results in an XPFO page fault otherwise.
> +	 */
This does look like it might add a fair amount of overhead, since it forces a
data copy for every user I/O.
> +	if (unaligned || (q->dma_pad_mask & iter->count) || map_data ||
> +	    IS_ENABLED(CONFIG_XPFO))
>  		bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
>  	else
>  		bio = bio_map_user_iov(q, iter, gfp_mask);
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index bb3f329..0ca9130 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -55,24 +55,37 @@ static inline struct page *kmap_to_page(void *addr)
>  #ifndef ARCH_HAS_KMAP
>  static inline void *kmap(struct page *page)
>  {
> +	void *kaddr;
> +
>  	might_sleep();
> -	return page_address(page);
> +
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }
>  
>  static inline void kunmap(struct page *page)
>  {
> +	xpfo_kunmap(page_address(page), page);
>  }
>  
>  static inline void *kmap_atomic(struct page *page)
>  {
> +	void *kaddr;
> +
>  	preempt_disable();
>  	pagefault_disable();
> -	return page_address(page);
> +
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }
>  #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>  
>  static inline void __kunmap_atomic(void *addr)
>  {
> +	xpfo_kunmap(addr, virt_to_page(addr));
> +
>  	pagefault_enable();
>  	preempt_enable();
>  }
> @@ -133,7 +146,8 @@ do {                                                            \
>  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>  {
>  	void *addr = kmap_atomic(page);
> -	clear_user_page(addr, vaddr, page);
> +	if (!xpfo_test_and_clear_zap(page))
> +		clear_user_page(addr, vaddr, page);
>  	kunmap_atomic(addr);
>  }
>  #endif
> @@ -186,7 +200,8 @@ alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>  static inline void clear_highpage(struct page *page)
>  {
>  	void *kaddr = kmap_atomic(page);
> -	clear_page(kaddr);
> +	if (!xpfo_test_and_clear_zap(page))
> +		clear_page(kaddr);
>  	kunmap_atomic(kaddr);
>  }
>  
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 624b78b..71c95aa 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -12,6 +12,7 @@
>  #include <linux/cpumask.h>
>  #include <linux/uprobes.h>
>  #include <linux/page-flags-layout.h>
> +#include <linux/xpfo.h>
>  #include <asm/page.h>
>  #include <asm/mmu.h>
>  
> @@ -215,6 +216,9 @@ struct page {
>  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>  	int _last_cpupid;
>  #endif
> +#ifdef CONFIG_XPFO
> +	struct xpfo_info xpfo;
> +#endif
>  }
>  /*
>   * The struct page can be forced to be double word aligned so that atomic ops
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> new file mode 100644
> index 0000000..c4f0871
> --- /dev/null
> +++ b/include/linux/xpfo.h
> @@ -0,0 +1,88 @@
> +/*
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + *
> + * Authors:
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_XPFO_H
> +#define _LINUX_XPFO_H
> +
> +#ifdef CONFIG_XPFO
> +
> +/*
> + * XPFO page flags:
> + *
> + * PG_XPFO_user_fp denotes that the page is allocated to user space. This flag
> + * is used in the fast path, where the page is marked accordingly but *not*
> + * unmapped from the kernel. In most cases, the kernel will need access to the
> + * page immediately after its acquisition so an unnecessary mapping operation
> + * is avoided.
> + *
> + * PG_XPFO_user denotes that the page is destined for user space. This flag is
> + * used in the slow path, where the page needs to be mapped/unmapped when the
> + * kernel wants to access it. If a page is deallocated and this flag is set,
> + * the page is cleared and mapped back into the kernel.
> + *
> + * PG_XPFO_kernel denotes a page that is destined to kernel space. This is used
> + * for identifying pages that are first assigned to kernel space and then freed
> + * and mapped to user space. In such cases, an expensive TLB shootdown is
> + * necessary. Pages allocated to user space, freed, and subsequently allocated
> + * to user space again, require only local TLB invalidation.
> + *
> + * PG_XPFO_zap indicates that the page has been zapped. This flag is used to
> + * avoid zapping pages multiple times. Whenever a page is freed and was
> + * previously mapped to user space, it needs to be zapped before mapped back
> + * in to the kernel.
> + */
> +
> +enum xpfo_pageflags {
> +	PG_XPFO_user_fp,
> +	PG_XPFO_user,
> +	PG_XPFO_kernel,
> +	PG_XPFO_zap,
> +};
> +
> +struct xpfo_info {
> +	unsigned long flags;	/* Flags for tracking the page's XPFO state */
> +	atomic_t mapcount;	/* Counter for balancing page map/unmap
> +				 * requests. Only the first map request maps
> +				 * the page back to kernel space. Likewise,
> +				 * only the last unmap request unmaps the page.
> +				 */
> +	spinlock_t lock;	/* Lock to serialize concurrent map/unmap
> +				 * requests.
> +				 */
> +};
> +
> +extern void xpfo_clear_zap(struct page *page, int order);
> +extern int xpfo_test_and_clear_zap(struct page *page);
> +extern int xpfo_test_kernel(struct page *page);
> +extern int xpfo_test_user(struct page *page);
> +
> +extern void xpfo_kmap(void *kaddr, struct page *page);
> +extern void xpfo_kunmap(void *kaddr, struct page *page);
> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> +extern void xpfo_free_page(struct page *page, int order);
> +
> +#else /* ifdef CONFIG_XPFO */
> +
> +static inline void xpfo_clear_zap(struct page *page, int order) { }
> +static inline int xpfo_test_and_clear_zap(struct page *page) { return 0; }
> +static inline int xpfo_test_kernel(struct page *page) { return 0; }
> +static inline int xpfo_test_user(struct page *page) { return 0; }
> +
> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> +static inline void xpfo_free_page(struct page *page, int order) { }
> +
> +#endif /* ifdef CONFIG_XPFO */
> +
> +#endif /* ifndef _LINUX_XPFO_H */
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 76f29ec..cf57ee9 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>  {
>  	unsigned long pfn = PFN_DOWN(orig_addr);
>  	unsigned char *vaddr = phys_to_virt(tlb_addr);
> +	struct page *page = pfn_to_page(pfn);
>  
> -	if (PageHighMem(pfn_to_page(pfn))) {
> +	if (PageHighMem(page) || xpfo_test_user(page)) {
>  		/* The buffer does not have a mapping.  Map it in and copy */
>  		unsigned int offset = orig_addr & ~PAGE_MASK;
>  		char *buffer;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 838ca8bb..47b42a3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1003,6 +1003,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
>  	}
>  	arch_free_page(page, order);
>  	kernel_map_pages(page, 1 << order, 0);
> +	xpfo_free_page(page, order);
>  
>  	return true;
>  }
> @@ -1398,10 +1399,13 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
>  	arch_alloc_page(page, order);
>  	kernel_map_pages(page, 1 << order, 1);
>  	kasan_alloc_pages(page, order);
> +	xpfo_alloc_page(page, order, gfp_flags);
>  
>  	if (gfp_flags & __GFP_ZERO)
>  		for (i = 0; i < (1 << order); i++)
>  			clear_highpage(page + i);
> +	else
> +		xpfo_clear_zap(page, order);
>  
>  	if (order && (gfp_flags & __GFP_COMP))
>  		prep_compound_page(page, order);
> @@ -2072,10 +2076,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>  	}
>  
>  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> -	if (!cold)
> +	if (!cold && !xpfo_test_kernel(page))
>  		list_add(&page->lru, &pcp->lists[migratetype]);
>  	else
>  		list_add_tail(&page->lru, &pcp->lists[migratetype]);
> +
>  	pcp->count++;
>  	if (pcp->count >= pcp->high) {
>  		unsigned long batch = READ_ONCE(pcp->batch);


* Re: [RFC PATCH] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-03-01  2:10   ` Balbir Singh
  0 siblings, 0 replies; 93+ messages in thread
From: Balbir Singh @ 2016-03-01  2:10 UTC (permalink / raw)
  To: Juerg Haefliger, linux-kernel, linux-mm; +Cc: vpk



On 27/02/16 01:21, Juerg Haefliger wrote:
> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userland, unless explicitly requested by the
> kernel. Whenever a page destined for userland is allocated, it is
> unmapped from physmap. When such a page is reclaimed from userland, it is
> mapped back to physmap.
physmap == xen physmap? Please clarify
> Mapping/unmapping from physmap is accomplished by modifying the PTE
> permission bits to allow/disallow access to the page.
>
> Additional fields are added to the page struct for XPFO housekeeping.
> Specifically a flags field to distinguish user vs. kernel pages, a
> reference counter to track physmap map/unmap operations and a lock to
> protect the XPFO fields.
>
> Known issues/limitations:
>   - Only supported on x86-64.
Is it due to lack of porting or a design limitation?
>   - Only supports 4k pages.
>   - Adds additional data to the page struct.
>   - There are most likely some additional and legitimate use cases where
>     the kernel needs to access userspace. Those need to be identified and
>     made XPFO-aware.
Why not build an audit mode for it?
>   - There's a performance impact if XPFO is turned on. Per the paper
>     referenced below it's in the 1-3% ballpark. More performance testing
>     wouldn't hurt. What tests to run though?
>
> Reference paper by the original patch authors:
>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
>
> Suggested-by: Vasileios P. Kemerlis <vpk@cs.brown.edu>
> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
This patch needs to be broken down into smaller patches - a series
> ---
>  arch/x86/Kconfig         |   2 +-
>  arch/x86/Kconfig.debug   |  17 +++++
>  arch/x86/mm/Makefile     |   2 +
>  arch/x86/mm/init.c       |   3 +-
>  arch/x86/mm/xpfo.c       | 176 +++++++++++++++++++++++++++++++++++++++++++++++
>  block/blk-map.c          |   7 +-
>  include/linux/highmem.h  |  23 +++++--
>  include/linux/mm_types.h |   4 ++
>  include/linux/xpfo.h     |  88 ++++++++++++++++++++++++
>  lib/swiotlb.c            |   3 +-
>  mm/page_alloc.c          |   7 +-
>  11 files changed, 323 insertions(+), 9 deletions(-)
>  create mode 100644 arch/x86/mm/xpfo.c
>  create mode 100644 include/linux/xpfo.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index c46662f..9d32b4a 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1343,7 +1343,7 @@ config ARCH_DMA_ADDR_T_64BIT
>  
>  config X86_DIRECT_GBPAGES
>  	def_bool y
> -	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
> +	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>  	---help---
>  	  Certain kernel features effectively disable kernel
>  	  linear 1 GB mappings (even if the CPU otherwise
> diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
> index 9b18ed9..1331da5 100644
> --- a/arch/x86/Kconfig.debug
> +++ b/arch/x86/Kconfig.debug
> @@ -5,6 +5,23 @@ config TRACE_IRQFLAGS_SUPPORT
>  
>  source "lib/Kconfig.debug"
>  
> +config XPFO
> +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +	default n
> +	depends on DEBUG_KERNEL
> +	depends on X86_64
> +	select DEBUG_TLBFLUSH
> +	---help---
> +	  This option offers protection against 'ret2dir' (kernel) attacks.
> +	  When enabled, every time a page frame is allocated to user space, it
> +	  is unmapped from the direct mapped RAM region in kernel space
> +	  (physmap). Similarly, whenever page frames are freed/reclaimed, they
> +	  are mapped back to physmap. Special care is taken to minimize the
> +	  impact on performance by reducing TLB shootdowns and unnecessary page
> +	  zero fills.
> +
> +	  If in doubt, say "N".
> +
>  config X86_VERBOSE_BOOTUP
>  	bool "Enable verbose x86 bootup info messages"
>  	default y
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index f9d38a4..8bf52b6 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -34,3 +34,5 @@ obj-$(CONFIG_ACPI_NUMA)		+= srat.o
>  obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
>  
>  obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
> +
> +obj-$(CONFIG_XPFO)		+= xpfo.o
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 493f541..27fc8a6 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -150,7 +150,8 @@ static int page_size_mask;
>  
>  static void __init probe_page_size_mask(void)
>  {
> -#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
> +#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK) && \
> +	!defined(CONFIG_XPFO)
>  	/*
>  	 * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
>  	 * This will simplify cpa(), which otherwise needs to support splitting
> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
> new file mode 100644
> index 0000000..6bc24d3
> --- /dev/null
> +++ b/arch/x86/mm/xpfo.c
> @@ -0,0 +1,176 @@
> +/*
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + *
> + * Authors:
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +
> +#include <asm/pgtable.h>
> +#include <asm/tlbflush.h>
> +
> +#define TEST_XPFO_FLAG(flag, page) \
> +	test_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +#define SET_XPFO_FLAG(flag, page)			\
> +	__set_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +#define CLEAR_XPFO_FLAG(flag, page)			\
> +	__clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +#define TEST_AND_CLEAR_XPFO_FLAG(flag, page)			\
> +	__test_and_clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
> +
> +/*
> + * Update a single kernel page table entry
> + */
> +static inline void set_kpte(struct page *page, unsigned long kaddr,
> +			    pgprot_t prot) {
> +	unsigned int level;
> +	pte_t *kpte = lookup_address(kaddr, &level);
> +
> +	/* We only support 4k pages for now */
> +	BUG_ON(!kpte || level != PG_LEVEL_4K);
> +
> +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> +}
> +
> +inline void xpfo_clear_zap(struct page *page, int order)
> +{
> +	int i;
> +
> +	for (i = 0; i < (1 << order); i++)
> +		CLEAR_XPFO_FLAG(zap, page + i);
> +}
> +
> +inline int xpfo_test_and_clear_zap(struct page *page)
> +{
> +	return TEST_AND_CLEAR_XPFO_FLAG(zap, page);
> +}
> +
> +inline int xpfo_test_kernel(struct page *page)
> +{
> +	return TEST_XPFO_FLAG(kernel, page);
> +}
> +
> +inline int xpfo_test_user(struct page *page)
> +{
> +	return TEST_XPFO_FLAG(user, page);
> +}
> +
> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> +{
> +	int i, tlb_shoot = 0;
> +	unsigned long kaddr;
> +
> +	for (i = 0; i < (1 << order); i++)  {
> +		WARN_ON(TEST_XPFO_FLAG(user_fp, page + i) ||
> +			TEST_XPFO_FLAG(user, page + i));
> +
> +		if (gfp & GFP_HIGHUSER) {
Why GFP_HIGHUSER?
> +			/* Initialize the xpfo lock and map counter */
> +			spin_lock_init(&(page + i)->xpfo.lock);
> +			atomic_set(&(page + i)->xpfo.mapcount, 0);
> +
> +			/* Mark it as a user page */
> +			SET_XPFO_FLAG(user_fp, page + i);
> +
> +			/*
> +			 * Shoot the TLB if the page was previously allocated
> +			 * to kernel space
> +			 */
> +			if (TEST_AND_CLEAR_XPFO_FLAG(kernel, page + i))
> +				tlb_shoot = 1;
> +		} else {
> +			/* Mark it as a kernel page */
> +			SET_XPFO_FLAG(kernel, page + i);
> +		}
> +	}
> +
> +	if (tlb_shoot) {
> +		kaddr = (unsigned long)page_address(page);
> +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> +				       PAGE_SIZE);
> +	}
> +}
> +
> +void xpfo_free_page(struct page *page, int order)
> +{
> +	int i;
> +	unsigned long kaddr;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +
> +		/* The page frame was previously allocated to user space */
> +		if (TEST_AND_CLEAR_XPFO_FLAG(user, page + i)) {
> +			kaddr = (unsigned long)page_address(page + i);
> +
> +			/* Clear the page and mark it accordingly */
> +			clear_page((void *)kaddr);
> +			SET_XPFO_FLAG(zap, page + i);
> +
> +			/* Map it back to kernel space */
> +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> +
> +			/* No TLB update */
> +		}
> +
> +		/* Clear the xpfo fast-path flag */
> +		CLEAR_XPFO_FLAG(user_fp, page + i);
> +	}
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	unsigned long flags;
> +
> +	/* The page is allocated to kernel space, so nothing to do */
> +	if (TEST_XPFO_FLAG(kernel, page))
> +		return;
> +
> +	spin_lock_irqsave(&page->xpfo.lock, flags);
> +
> +	/*
> +	 * The page was previously allocated to user space, so map it back
> +	 * into the kernel. No TLB update required.
> +	 */
> +	if ((atomic_inc_return(&page->xpfo.mapcount) == 1) &&
> +	    TEST_XPFO_FLAG(user, page))
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +	spin_unlock_irqrestore(&page->xpfo.lock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	unsigned long flags;
> +
> +	/* The page is allocated to kernel space, so nothing to do */
> +	if (TEST_XPFO_FLAG(kernel, page))
> +		return;
> +
> +	spin_lock_irqsave(&page->xpfo.lock, flags);
> +
> +	/*
> +	 * The page frame is to be allocated back to user space. So unmap it
> +	 * from the kernel, update the TLB and mark it as a user page.
> +	 */
> +	if ((atomic_dec_return(&page->xpfo.mapcount) == 0) &&
> +	    (TEST_XPFO_FLAG(user_fp, page) || TEST_XPFO_FLAG(user, page))) {
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +		__flush_tlb_one((unsigned long)kaddr);
> +		SET_XPFO_FLAG(user, page);
> +	}
> +
> +	spin_unlock_irqrestore(&page->xpfo.lock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);
> diff --git a/block/blk-map.c b/block/blk-map.c
> index f565e11..b7b8302 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -107,7 +107,12 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
>  		prv.iov_len = iov.iov_len;
>  	}
>  
> -	if (unaligned || (q->dma_pad_mask & iter->count) || map_data)
> +	/*
> +	 * juergh: Temporary hack to force the use of a bounce buffer if XPFO
> +	 * is enabled. Results in an XPFO page fault otherwise.
> +	 */
This does look like it might add a bunch of overhead
> +	if (unaligned || (q->dma_pad_mask & iter->count) || map_data ||
> +	    IS_ENABLED(CONFIG_XPFO))
>  		bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
>  	else
>  		bio = bio_map_user_iov(q, iter, gfp_mask);
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index bb3f329..0ca9130 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -55,24 +55,37 @@ static inline struct page *kmap_to_page(void *addr)
>  #ifndef ARCH_HAS_KMAP
>  static inline void *kmap(struct page *page)
>  {
> +	void *kaddr;
> +
>  	might_sleep();
> -	return page_address(page);
> +
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }
>  
>  static inline void kunmap(struct page *page)
>  {
> +	xpfo_kunmap(page_address(page), page);
>  }
>  
>  static inline void *kmap_atomic(struct page *page)
>  {
> +	void *kaddr;
> +
>  	preempt_disable();
>  	pagefault_disable();
> -	return page_address(page);
> +
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }
>  #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>  
>  static inline void __kunmap_atomic(void *addr)
>  {
> +	xpfo_kunmap(addr, virt_to_page(addr));
> +
>  	pagefault_enable();
>  	preempt_enable();
>  }
> @@ -133,7 +146,8 @@ do {                                                            \
>  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>  {
>  	void *addr = kmap_atomic(page);
> -	clear_user_page(addr, vaddr, page);
> +	if (!xpfo_test_and_clear_zap(page))
> +		clear_user_page(addr, vaddr, page);
>  	kunmap_atomic(addr);
>  }
>  #endif
> @@ -186,7 +200,8 @@ alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>  static inline void clear_highpage(struct page *page)
>  {
>  	void *kaddr = kmap_atomic(page);
> -	clear_page(kaddr);
> +	if (!xpfo_test_and_clear_zap(page))
> +		clear_page(kaddr);
>  	kunmap_atomic(kaddr);
>  }
>  
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 624b78b..71c95aa 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -12,6 +12,7 @@
>  #include <linux/cpumask.h>
>  #include <linux/uprobes.h>
>  #include <linux/page-flags-layout.h>
> +#include <linux/xpfo.h>
>  #include <asm/page.h>
>  #include <asm/mmu.h>
>  
> @@ -215,6 +216,9 @@ struct page {
>  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>  	int _last_cpupid;
>  #endif
> +#ifdef CONFIG_XPFO
> +	struct xpfo_info xpfo;
> +#endif
>  }
>  /*
>   * The struct page can be forced to be double word aligned so that atomic ops
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> new file mode 100644
> index 0000000..c4f0871
> --- /dev/null
> +++ b/include/linux/xpfo.h
> @@ -0,0 +1,88 @@
> +/*
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + *
> + * Authors:
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_XPFO_H
> +#define _LINUX_XPFO_H
> +
> +#ifdef CONFIG_XPFO
> +
> +/*
> + * XPFO page flags:
> + *
> + * PG_XPFO_user_fp denotes that the page is allocated to user space. This flag
> + * is used in the fast path, where the page is marked accordingly but *not*
> + * unmapped from the kernel. In most cases, the kernel will need access to the
> + * page immediately after its acquisition so an unnecessary mapping operation
> + * is avoided.
> + *
> + * PG_XPFO_user denotes that the page is destined for user space. This flag is
> + * used in the slow path, where the page needs to be mapped/unmapped when the
> + * kernel wants to access it. If a page is deallocated and this flag is set,
> + * the page is cleared and mapped back into the kernel.
> + *
> + * PG_XPFO_kernel denotes a page that is destined for kernel space. This is used
> + * for identifying pages that are first assigned to kernel space and then freed
> + * and mapped to user space. In such cases, an expensive TLB shootdown is
> + * necessary. Pages allocated to user space, freed, and subsequently allocated
> + * to user space again, require only local TLB invalidation.
> + *
> + * PG_XPFO_zap indicates that the page has been zapped. This flag is used to
> + * avoid zapping pages multiple times. Whenever a page is freed and was
> + * previously mapped to user space, it needs to be zapped before being mapped
> + * back into the kernel.
> + */
> +
> +enum xpfo_pageflags {
> +	PG_XPFO_user_fp,
> +	PG_XPFO_user,
> +	PG_XPFO_kernel,
> +	PG_XPFO_zap,
> +};
> +
> +struct xpfo_info {
> +	unsigned long flags;	/* Flags for tracking the page's XPFO state */
> +	atomic_t mapcount;	/* Counter for balancing page map/unmap
> +				 * requests. Only the first map request maps
> +				 * the page back to kernel space. Likewise,
> +				 * only the last unmap request unmaps the page.
> +				 */
> +	spinlock_t lock;	/* Lock to serialize concurrent map/unmap
> +				 * requests.
> +				 */
> +};
> +
> +extern void xpfo_clear_zap(struct page *page, int order);
> +extern int xpfo_test_and_clear_zap(struct page *page);
> +extern int xpfo_test_kernel(struct page *page);
> +extern int xpfo_test_user(struct page *page);
> +
> +extern void xpfo_kmap(void *kaddr, struct page *page);
> +extern void xpfo_kunmap(void *kaddr, struct page *page);
> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> +extern void xpfo_free_page(struct page *page, int order);
> +
> +#else /* ifdef CONFIG_XPFO */
> +
> +static inline void xpfo_clear_zap(struct page *page, int order) { }
> +static inline int xpfo_test_and_clear_zap(struct page *page) { return 0; }
> +static inline int xpfo_test_kernel(struct page *page) { return 0; }
> +static inline int xpfo_test_user(struct page *page) { return 0; }
> +
> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> +static inline void xpfo_free_page(struct page *page, int order) { }
> +
> +#endif /* ifdef CONFIG_XPFO */
> +
> +#endif /* ifndef _LINUX_XPFO_H */
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 76f29ec..cf57ee9 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>  {
>  	unsigned long pfn = PFN_DOWN(orig_addr);
>  	unsigned char *vaddr = phys_to_virt(tlb_addr);
> +	struct page *page = pfn_to_page(pfn);
>  
> -	if (PageHighMem(pfn_to_page(pfn))) {
> +	if (PageHighMem(page) || xpfo_test_user(page)) {
>  		/* The buffer does not have a mapping.  Map it in and copy */
>  		unsigned int offset = orig_addr & ~PAGE_MASK;
>  		char *buffer;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 838ca8bb..47b42a3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1003,6 +1003,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
>  	}
>  	arch_free_page(page, order);
>  	kernel_map_pages(page, 1 << order, 0);
> +	xpfo_free_page(page, order);
>  
>  	return true;
>  }
> @@ -1398,10 +1399,13 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
>  	arch_alloc_page(page, order);
>  	kernel_map_pages(page, 1 << order, 1);
>  	kasan_alloc_pages(page, order);
> +	xpfo_alloc_page(page, order, gfp_flags);
>  
>  	if (gfp_flags & __GFP_ZERO)
>  		for (i = 0; i < (1 << order); i++)
>  			clear_highpage(page + i);
> +	else
> +		xpfo_clear_zap(page, order);
>  
>  	if (order && (gfp_flags & __GFP_COMP))
>  		prep_compound_page(page, order);
> @@ -2072,10 +2076,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>  	}
>  
>  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> -	if (!cold)
> +	if (!cold && !xpfo_test_kernel(page))
>  		list_add(&page->lru, &pcp->lists[migratetype]);
>  	else
>  		list_add_tail(&page->lru, &pcp->lists[migratetype]);
> +
>  	pcp->count++;
>  	if (pcp->count >= pcp->high) {
>  		unsigned long batch = READ_ONCE(pcp->batch);



* Re: [RFC PATCH] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-03-01  1:31   ` Laura Abbott
@ 2016-03-21  8:37     ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-03-21  8:37 UTC (permalink / raw)
  To: Laura Abbott, linux-kernel, linux-mm; +Cc: vpk, Kees Cook

Hi Laura,

Sorry for the late reply. I was on FTO and then traveling for the past couple of
days.


On 03/01/2016 02:31 AM, Laura Abbott wrote:
> On 02/26/2016 06:21 AM, Juerg Haefliger wrote:
>> This patch adds support for XPFO which protects against 'ret2dir' kernel
>> attacks. The basic idea is to enforce exclusive ownership of page frames
>> by either the kernel or userland, unless explicitly requested by the
>> kernel. Whenever a page destined for userland is allocated, it is
>> unmapped from physmap. When such a page is reclaimed from userland, it is
>> mapped back to physmap.
>>
>> Mapping/unmapping from physmap is accomplished by modifying the PTE
>> permission bits to allow/disallow access to the page.
>>
>> Additional fields are added to the page struct for XPFO housekeeping.
>> Specifically a flags field to distinguish user vs. kernel pages, a
>> reference counter to track physmap map/unmap operations and a lock to
>> protect the XPFO fields.
>>
>> Known issues/limitations:
>>    - Only supported on x86-64.
>>    - Only supports 4k pages.
>>    - Adds additional data to the page struct.
>>    - There are most likely some additional and legitimate use cases where
>>      the kernel needs to access userspace. Those need to be identified and
>>      made XPFO-aware.
>>    - There's a performance impact if XPFO is turned on. Per the paper
>>      referenced below it's in the 1-3% ballpark. More performance testing
>>      wouldn't hurt. What tests to run though?
>>
>> Reference paper by the original patch authors:
>>    http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
>>
> 
> General note: Make sure to cc the x86 maintainers on the next version of
> the patch. I'd also recommend ccing the kernel hardening list (see the wiki
> page http://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project for
> details)

Good idea. Thanks for the suggestion.


> If you can find a way to break this up into x86 specific vs. generic patches
> that would be better. Perhaps move the Kconfig for XPFO to the generic
> Kconfig layer and make it depend on ARCH_HAS_XPFO? x86 can then select
> ARCH_HAS_XPFO as the last option.

Good idea.


> There also isn't much that's actually x86 specific here except for
> some of the page table manipulation functions and even those can probably
> be abstracted away. It would be good to get more of this out of x86 to
> let other arches take advantage of it. The arm64 implementation would
> look pretty similar if you save the old kernel mapping and restore
> it on free.

OK. I need to familiarize myself with ARM to figure out which pieces can move
out of the arch subdir.


>  
>> Suggested-by: Vasileios P. Kemerlis <vpk@cs.brown.edu>
>> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
>> ---
>>   arch/x86/Kconfig         |   2 +-
>>   arch/x86/Kconfig.debug   |  17 +++++
>>   arch/x86/mm/Makefile     |   2 +
>>   arch/x86/mm/init.c       |   3 +-
>>   arch/x86/mm/xpfo.c       | 176 +++++++++++++++++++++++++++++++++++++++++++++++
>>   block/blk-map.c          |   7 +-
>>   include/linux/highmem.h  |  23 +++++--
>>   include/linux/mm_types.h |   4 ++
>>   include/linux/xpfo.h     |  88 ++++++++++++++++++++++++
>>   lib/swiotlb.c            |   3 +-
>>   mm/page_alloc.c          |   7 +-
>>   11 files changed, 323 insertions(+), 9 deletions(-)
>>   create mode 100644 arch/x86/mm/xpfo.c
>>   create mode 100644 include/linux/xpfo.h
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index c46662f..9d32b4a 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1343,7 +1343,7 @@ config ARCH_DMA_ADDR_T_64BIT
>>
>>   config X86_DIRECT_GBPAGES
>>       def_bool y
>> -    depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
>> +    depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>>       ---help---
>>         Certain kernel features effectively disable kernel
>>         linear 1 GB mappings (even if the CPU otherwise
>> diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
>> index 9b18ed9..1331da5 100644
>> --- a/arch/x86/Kconfig.debug
>> +++ b/arch/x86/Kconfig.debug
>> @@ -5,6 +5,23 @@ config TRACE_IRQFLAGS_SUPPORT
>>
>>   source "lib/Kconfig.debug"
>>
>> +config XPFO
>> +    bool "Enable eXclusive Page Frame Ownership (XPFO)"
>> +    default n
>> +    depends on DEBUG_KERNEL
>> +    depends on X86_64
>> +    select DEBUG_TLBFLUSH
>> +    ---help---
>> +      This option offers protection against 'ret2dir' (kernel) attacks.
>> +      When enabled, every time a page frame is allocated to user space, it
>> +      is unmapped from the direct mapped RAM region in kernel space
>> +      (physmap). Similarly, whenever page frames are freed/reclaimed, they
>> +      are mapped back to physmap. Special care is taken to minimize the
>> +      impact on performance by reducing TLB shootdowns and unnecessary page
>> +      zero fills.
>> +
>> +      If in doubt, say "N".
>> +
>>   config X86_VERBOSE_BOOTUP
>>       bool "Enable verbose x86 bootup info messages"
>>       default y
>> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
>> index f9d38a4..8bf52b6 100644
>> --- a/arch/x86/mm/Makefile
>> +++ b/arch/x86/mm/Makefile
>> @@ -34,3 +34,5 @@ obj-$(CONFIG_ACPI_NUMA)        += srat.o
>>   obj-$(CONFIG_NUMA_EMU)        += numa_emulation.o
>>
>>   obj-$(CONFIG_X86_INTEL_MPX)    += mpx.o
>> +
>> +obj-$(CONFIG_XPFO)        += xpfo.o
>> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>> index 493f541..27fc8a6 100644
>> --- a/arch/x86/mm/init.c
>> +++ b/arch/x86/mm/init.c
>> @@ -150,7 +150,8 @@ static int page_size_mask;
>>
>>   static void __init probe_page_size_mask(void)
>>   {
>> -#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
>> +#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK) && \
>> +    !defined(CONFIG_XPFO)
>>       /*
>>        * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
>>        * This will simplify cpa(), which otherwise needs to support splitting
>> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
>> new file mode 100644
>> index 0000000..6bc24d3
>> --- /dev/null
>> +++ b/arch/x86/mm/xpfo.c
>> @@ -0,0 +1,176 @@
>> +/*
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
>> + *
>> + * Authors:
>> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
>> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/module.h>
>> +
>> +#include <asm/pgtable.h>
>> +#include <asm/tlbflush.h>
>> +
>> +#define TEST_XPFO_FLAG(flag, page) \
>> +    test_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
>> +
>> +#define SET_XPFO_FLAG(flag, page)            \
>> +    __set_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
>> +
>> +#define CLEAR_XPFO_FLAG(flag, page)            \
>> +    __clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
>> +
>> +#define TEST_AND_CLEAR_XPFO_FLAG(flag, page)            \
>> +    __test_and_clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
>> +
>> +/*
>> + * Update a single kernel page table entry
>> + */
>> +static inline void set_kpte(struct page *page, unsigned long kaddr,
>> +                pgprot_t prot) {
>> +    unsigned int level;
>> +    pte_t *kpte = lookup_address(kaddr, &level);
>> +
>> +    /* We only support 4k pages for now */
>> +    BUG_ON(!kpte || level != PG_LEVEL_4K);
>> +
>> +    set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
>> +}
>> +
>> +inline void xpfo_clear_zap(struct page *page, int order)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < (1 << order); i++)
>> +        CLEAR_XPFO_FLAG(zap, page + i);
>> +}
>> +
>> +inline int xpfo_test_and_clear_zap(struct page *page)
>> +{
>> +    return TEST_AND_CLEAR_XPFO_FLAG(zap, page);
>> +}
>> +
>> +inline int xpfo_test_kernel(struct page *page)
>> +{
>> +    return TEST_XPFO_FLAG(kernel, page);
>> +}
>> +
>> +inline int xpfo_test_user(struct page *page)
>> +{
>> +    return TEST_XPFO_FLAG(user, page);
>> +}
>> +
>> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
>> +{
>> +    int i, tlb_shoot = 0;
>> +    unsigned long kaddr;
>> +
>> +    for (i = 0; i < (1 << order); i++)  {
>> +        WARN_ON(TEST_XPFO_FLAG(user_fp, page + i) ||
>> +            TEST_XPFO_FLAG(user, page + i));
>> +
>> +        if (gfp & GFP_HIGHUSER) {
> 
> This check doesn't seem right. If the GFP flags have _any_ in common with
> GFP_HIGHUSER it will be marked as a user page so GFP_KERNEL will be marked
> as well.

Duh. You're right. I broke this when I cleaned up the original patch. It should be:
(gfp & GFP_HIGHUSER) == GFP_HIGHUSER



>> +            /* Initialize the xpfo lock and map counter */
>> +            spin_lock_init(&(page + i)->xpfo.lock);
> 
> This is initializing the spin_lock every time. That's not really necessary.

Correct. The initialization should probably be done when the page struct is
first allocated. But I haven't been able to find that piece of code quickly.
Will look again.


>> +            atomic_set(&(page + i)->xpfo.mapcount, 0);
>> +
>> +            /* Mark it as a user page */
>> +            SET_XPFO_FLAG(user_fp, page + i);
>> +
>> +            /*
>> +             * Shoot the TLB if the page was previously allocated
>> +             * to kernel space
>> +             */
>> +            if (TEST_AND_CLEAR_XPFO_FLAG(kernel, page + i))
>> +                tlb_shoot = 1;
>> +        } else {
>> +            /* Mark it as a kernel page */
>> +            SET_XPFO_FLAG(kernel, page + i);
>> +        }
>> +    }
>> +
>> +    if (tlb_shoot) {
>> +        kaddr = (unsigned long)page_address(page);
>> +        flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
>> +                       PAGE_SIZE);
>> +    }
>> +}
>> +
>> +void xpfo_free_page(struct page *page, int order)
>> +{
>> +    int i;
>> +    unsigned long kaddr;
>> +
>> +    for (i = 0; i < (1 << order); i++) {
>> +
>> +        /* The page frame was previously allocated to user space */
>> +        if (TEST_AND_CLEAR_XPFO_FLAG(user, page + i)) {
>> +            kaddr = (unsigned long)page_address(page + i);
>> +
>> +            /* Clear the page and mark it accordingly */
>> +            clear_page((void *)kaddr);
> 
> Clearing the page isn't related to XPFO. There's other work ongoing to
> do clearing of the page on free.

It's not strictly related to XPFO but adds another layer of security. Do you
happen to have a pointer to the ongoing work that you mentioned?


>> +            SET_XPFO_FLAG(zap, page + i);
>> +
>> +            /* Map it back to kernel space */
>> +            set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
>> +
>> +            /* No TLB update */
>> +        }
>> +
>> +        /* Clear the xpfo fast-path flag */
>> +        CLEAR_XPFO_FLAG(user_fp, page + i);
>> +    }
>> +}
>> +
>> +void xpfo_kmap(void *kaddr, struct page *page)
>> +{
>> +    unsigned long flags;
>> +
>> +    /* The page is allocated to kernel space, so nothing to do */
>> +    if (TEST_XPFO_FLAG(kernel, page))
>> +        return;
>> +
>> +    spin_lock_irqsave(&page->xpfo.lock, flags);
>> +
>> +    /*
>> +     * The page was previously allocated to user space, so map it back
>> +     * into the kernel. No TLB update required.
>> +     */
>> +    if ((atomic_inc_return(&page->xpfo.mapcount) == 1) &&
>> +        TEST_XPFO_FLAG(user, page))
>> +        set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
>> +
>> +    spin_unlock_irqrestore(&page->xpfo.lock, flags);
>> +}
>> +EXPORT_SYMBOL(xpfo_kmap);
>> +
>> +void xpfo_kunmap(void *kaddr, struct page *page)
>> +{
>> +    unsigned long flags;
>> +
>> +    /* The page is allocated to kernel space, so nothing to do */
>> +    if (TEST_XPFO_FLAG(kernel, page))
>> +        return;
>> +
>> +    spin_lock_irqsave(&page->xpfo.lock, flags);
>> +
>> +    /*
>> +     * The page frame is to be allocated back to user space. So unmap it
>> +     * from the kernel, update the TLB and mark it as a user page.
>> +     */
>> +    if ((atomic_dec_return(&page->xpfo.mapcount) == 0) &&
>> +        (TEST_XPFO_FLAG(user_fp, page) || TEST_XPFO_FLAG(user, page))) {
>> +        set_kpte(page, (unsigned long)kaddr, __pgprot(0));
>> +        __flush_tlb_one((unsigned long)kaddr);
>> +        SET_XPFO_FLAG(user, page);
>> +    }
>> +
>> +    spin_unlock_irqrestore(&page->xpfo.lock, flags);
>> +}
>> +EXPORT_SYMBOL(xpfo_kunmap);
> 
> I'm confused by the checks in kmap/kunmap here. It looks like once the
> page is allocated there is no changing of flags between user and
> kernel mode so the checks for if the page is user seem redundant.

Hmm... I think you're partially right. In xpfo_kmap we need to distinguish
between user and user_fp, so the check for 'user' is necessary. However, in
kunmap we can drop the check for 'user' || 'user_fp'.


>> diff --git a/block/blk-map.c b/block/blk-map.c
>> index f565e11..b7b8302 100644
>> --- a/block/blk-map.c
>> +++ b/block/blk-map.c
>> @@ -107,7 +107,12 @@ int blk_rq_map_user_iov(struct request_queue *q, struct
>> request *rq,
>>           prv.iov_len = iov.iov_len;
>>       }
>>
>> -    if (unaligned || (q->dma_pad_mask & iter->count) || map_data)
>> +    /*
>> +     * juergh: Temporary hack to force the use of a bounce buffer if XPFO
>> +     * is enabled. Results in an XPFO page fault otherwise.
>> +     */
>> +    if (unaligned || (q->dma_pad_mask & iter->count) || map_data ||
>> +        IS_ENABLED(CONFIG_XPFO))
>>           bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
>>       else
>>           bio = bio_map_user_iov(q, iter, gfp_mask);
>> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>> index bb3f329..0ca9130 100644
>> --- a/include/linux/highmem.h
>> +++ b/include/linux/highmem.h
>> @@ -55,24 +55,37 @@ static inline struct page *kmap_to_page(void *addr)
>>   #ifndef ARCH_HAS_KMAP
>>   static inline void *kmap(struct page *page)
>>   {
>> +    void *kaddr;
>> +
>>       might_sleep();
>> -    return page_address(page);
>> +
>> +    kaddr = page_address(page);
>> +    xpfo_kmap(kaddr, page);
>> +    return kaddr;
>>   }
>>
>>   static inline void kunmap(struct page *page)
>>   {
>> +    xpfo_kunmap(page_address(page), page);
>>   }
>>
>>   static inline void *kmap_atomic(struct page *page)
>>   {
>> +    void *kaddr;
>> +
>>       preempt_disable();
>>       pagefault_disable();
>> -    return page_address(page);
>> +
>> +    kaddr = page_address(page);
>> +    xpfo_kmap(kaddr, page);
>> +    return kaddr;
>>   }
>>   #define kmap_atomic_prot(page, prot)    kmap_atomic(page)
>>
>>   static inline void __kunmap_atomic(void *addr)
>>   {
>> +    xpfo_kunmap(addr, virt_to_page(addr));
>> +
>>       pagefault_enable();
>>       preempt_enable();
>>   }
>> @@ -133,7 +146,8 @@ do
>> {                                                            \
>>   static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>>   {
>>       void *addr = kmap_atomic(page);
>> -    clear_user_page(addr, vaddr, page);
>> +    if (!xpfo_test_and_clear_zap(page))
>> +        clear_user_page(addr, vaddr, page);
>>       kunmap_atomic(addr);
>>   }
>>   #endif
>> @@ -186,7 +200,8 @@ alloc_zeroed_user_highpage_movable(struct vm_area_struct
>> *vma,
>>   static inline void clear_highpage(struct page *page)
>>   {
>>       void *kaddr = kmap_atomic(page);
>> -    clear_page(kaddr);
>> +    if (!xpfo_test_and_clear_zap(page))
>> +        clear_page(kaddr);
>>       kunmap_atomic(kaddr);
>>   }
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 624b78b..71c95aa 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -12,6 +12,7 @@
>>   #include <linux/cpumask.h>
>>   #include <linux/uprobes.h>
>>   #include <linux/page-flags-layout.h>
>> +#include <linux/xpfo.h>
>>   #include <asm/page.h>
>>   #include <asm/mmu.h>
>>
>> @@ -215,6 +216,9 @@ struct page {
>>   #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>>       int _last_cpupid;
>>   #endif
>> +#ifdef CONFIG_XPFO
>> +    struct xpfo_info xpfo;
>> +#endif
>>   }
>>   /*
>>    * The struct page can be forced to be double word aligned so that atomic ops
>> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
>> new file mode 100644
>> index 0000000..c4f0871
>> --- /dev/null
>> +++ b/include/linux/xpfo.h
>> @@ -0,0 +1,88 @@
>> +/*
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
>> + *
>> + * Authors:
>> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
>> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#ifndef _LINUX_XPFO_H
>> +#define _LINUX_XPFO_H
>> +
>> +#ifdef CONFIG_XPFO
>> +
>> +/*
>> + * XPFO page flags:
>> + *
>> + * PG_XPFO_user_fp denotes that the page is allocated to user space. This flag
>> + * is used in the fast path, where the page is marked accordingly but *not*
>> + * unmapped from the kernel. In most cases, the kernel will need access to the
>> + * page immediately after its acquisition so an unnecessary mapping operation
>> + * is avoided.
>> + *
>> + * PG_XPFO_user denotes that the page is destined for user space. This flag is
>> + * used in the slow path, where the page needs to be mapped/unmapped when the
>> + * kernel wants to access it. If a page is deallocated and this flag is set,
>> + * the page is cleared and mapped back into the kernel.
>> + *
>> + * PG_XPFO_kernel denotes a page that is destined to kernel space. This is used
>> + * for identifying pages that are first assigned to kernel space and then freed
>> + * and mapped to user space. In such cases, an expensive TLB shootdown is
>> + * necessary. Pages allocated to user space, freed, and subsequently allocated
>> + * to user space again, require only local TLB invalidation.
>> + *
>> + * PG_XPFO_zap indicates that the page has been zapped. This flag is used to
>> + * avoid zapping pages multiple times. Whenever a page is freed and was
>> + * previously mapped to user space, it needs to be zapped before mapped back
>> + * in to the kernel.
>> + */
> 
> 'zap' doesn't really indicate what is actually happening with the page. Can you
> be a bit more descriptive about what this actually does?

It means that the page was cleared at the time it was released back to the
free pool; the flag is there to prevent multiple expensive clearing
operations on the same page. But this might go away because of the ongoing
page-sanitization work that you mentioned.


>> +
>> +enum xpfo_pageflags {
>> +    PG_XPFO_user_fp,
>> +    PG_XPFO_user,
>> +    PG_XPFO_kernel,
>> +    PG_XPFO_zap,
>> +};
>> +
>> +struct xpfo_info {
>> +    unsigned long flags;    /* Flags for tracking the page's XPFO state */
>> +    atomic_t mapcount;    /* Counter for balancing page map/unmap
>> +                 * requests. Only the first map request maps
>> +                 * the page back to kernel space. Likewise,
>> +                 * only the last unmap request unmaps the page.
>> +                 */
>> +    spinlock_t lock;    /* Lock to serialize concurrent map/unmap
>> +                 * requests.
>> +                 */
>> +};
> 
> Can you change this to use the page_ext implementation? See what
> mm/page_owner.c does. This might lessen the impact of the extra
> page metadata. This metadata still feels like a copy of what
> mm/highmem.c is trying to do though.

I'll look into that, thanks for the pointer.


>> +
>> +extern void xpfo_clear_zap(struct page *page, int order);
>> +extern int xpfo_test_and_clear_zap(struct page *page);
>> +extern int xpfo_test_kernel(struct page *page);
>> +extern int xpfo_test_user(struct page *page);
>> +
>> +extern void xpfo_kmap(void *kaddr, struct page *page);
>> +extern void xpfo_kunmap(void *kaddr, struct page *page);
>> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
>> +extern void xpfo_free_page(struct page *page, int order);
>> +
>> +#else /* ifdef CONFIG_XPFO */
>> +
>> +static inline void xpfo_clear_zap(struct page *page, int order) { }
>> +static inline int xpfo_test_and_clear_zap(struct page *page) { return 0; }
>> +static inline int xpfo_test_kernel(struct page *page) { return 0; }
>> +static inline int xpfo_test_user(struct page *page) { return 0; }
>> +
>> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
>> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
>> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
>> +static inline void xpfo_free_page(struct page *page, int order) { }
>> +
>> +#endif /* ifdef CONFIG_XPFO */
>> +
>> +#endif /* ifndef _LINUX_XPFO_H */
>> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
>> index 76f29ec..cf57ee9 100644
>> --- a/lib/swiotlb.c
>> +++ b/lib/swiotlb.c
>> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr,
>> phys_addr_t tlb_addr,
>>   {
>>       unsigned long pfn = PFN_DOWN(orig_addr);
>>       unsigned char *vaddr = phys_to_virt(tlb_addr);
>> +    struct page *page = pfn_to_page(pfn);
>>
>> -    if (PageHighMem(pfn_to_page(pfn))) {
>> +    if (PageHighMem(page) || xpfo_test_user(page)) {
>>           /* The buffer does not have a mapping.  Map it in and copy */
>>           unsigned int offset = orig_addr & ~PAGE_MASK;
>>           char *buffer;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 838ca8bb..47b42a3 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1003,6 +1003,7 @@ static bool free_pages_prepare(struct page *page,
>> unsigned int order)
>>       }
>>       arch_free_page(page, order);
>>       kernel_map_pages(page, 1 << order, 0);
>> +    xpfo_free_page(page, order);
>>
>>       return true;
>>   }
>> @@ -1398,10 +1399,13 @@ static int prep_new_page(struct page *page, unsigned
>> int order, gfp_t gfp_flags,
>>       arch_alloc_page(page, order);
>>       kernel_map_pages(page, 1 << order, 1);
>>       kasan_alloc_pages(page, order);
>> +    xpfo_alloc_page(page, order, gfp_flags);
>>
>>       if (gfp_flags & __GFP_ZERO)
>>           for (i = 0; i < (1 << order); i++)
>>               clear_highpage(page + i);
>> +    else
>> +        xpfo_clear_zap(page, order);
>>
>>       if (order && (gfp_flags & __GFP_COMP))
>>           prep_compound_page(page, order);
>> @@ -2072,10 +2076,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>>       }
>>
>>       pcp = &this_cpu_ptr(zone->pageset)->pcp;
>> -    if (!cold)
>> +    if (!cold && !xpfo_test_kernel(page))
>>           list_add(&page->lru, &pcp->lists[migratetype]);
>>       else
>>           list_add_tail(&page->lru, &pcp->lists[migratetype]);
>> +
> 
> What's the advantage of this?

Allocating a page to userspace that was previously allocated to kernel space
requires an expensive TLB shootdown. The above puts previously
kernel-allocated pages at the cold end of the per-CPU free list, to postpone
their reallocation as long as possible and so minimize TLB shootdowns.


>>       pcp->count++;
>>       if (pcp->count >= pcp->high) {
>>           unsigned long batch = READ_ONCE(pcp->batch);
>>

Thanks for the review and comments! It's highly appreciated.

...Juerg


> Thanks,
> Laura

^ permalink raw reply	[flat|nested] 93+ messages in thread

>>               clear_highpage(page + i);
>> +    else
>> +        xpfo_clear_zap(page, order);
>>
>>       if (order && (gfp_flags & __GFP_COMP))
>>           prep_compound_page(page, order);
>> @@ -2072,10 +2076,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>>       }
>>
>>       pcp = &this_cpu_ptr(zone->pageset)->pcp;
>> -    if (!cold)
>> +    if (!cold && !xpfo_test_kernel(page))
>>           list_add(&page->lru, &pcp->lists[migratetype]);
>>       else
>>           list_add_tail(&page->lru, &pcp->lists[migratetype]);
>> +
> 
> What's the advantage of this?

Allocating a page to userspace that was previously allocated to kernel space
requires an expensive TLB shootdown. The hunk above puts previously
kernel-allocated pages at the cold end of the per-CPU free list to postpone
their re-allocation as long as possible, thereby minimizing TLB shootdowns.


>>       pcp->count++;
>>       if (pcp->count >= pcp->high) {
>>           unsigned long batch = READ_ONCE(pcp->batch);
>>

Thanks for the review and comments! It's highly appreciated.

...Juerg


> Thanks,
> Laura

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-03-01  2:10   ` Balbir Singh
@ 2016-03-21  8:44     ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-03-21  8:44 UTC (permalink / raw)
  To: Balbir Singh, linux-kernel, linux-mm; +Cc: vpk

Hi Balbir,

Apologies for the slow reply.


On 03/01/2016 03:10 AM, Balbir Singh wrote:
> 
> 
> On 27/02/16 01:21, Juerg Haefliger wrote:
>> This patch adds support for XPFO which protects against 'ret2dir' kernel
>> attacks. The basic idea is to enforce exclusive ownership of page frames
>> by either the kernel or userland, unless explicitly requested by the
>> kernel. Whenever a page destined for userland is allocated, it is
>> unmapped from physmap. When such a page is reclaimed from userland, it is
>> mapped back to physmap.
> physmap == xen physmap? Please clarify

No, it's not Xen-related. I might have the terminology wrong. Physmap is the
term the original authors use to describe <quote> a large, contiguous virtual
memory region inside kernel address space that contains a direct mapping of part
or all (depending on the architecture) physical memory. </quote>


>> Mapping/unmapping from physmap is accomplished by modifying the PTE
>> permission bits to allow/disallow access to the page.
>>
>> Additional fields are added to the page struct for XPFO housekeeping.
>> Specifically a flags field to distinguish user vs. kernel pages, a
>> reference counter to track physmap map/unmap operations and a lock to
>> protect the XPFO fields.
>>
>> Known issues/limitations:
>>   - Only supported on x86-64.
> Is it due to lack of porting or a design limitation?

Lack of porting. Support for other architectures will come later.


>>   - Only supports 4k pages.
>>   - Adds additional data to the page struct.
>>   - There are most likely some additional and legitimate use cases where
>>     the kernel needs to access userspace. Those need to be identified and
>>     made XPFO-aware.
> Why not build an audit mode for it?

Can you elaborate what you mean by this?


>>   - There's a performance impact if XPFO is turned on. Per the paper
>>     referenced below it's in the 1-3% ballpark. More performance testing
>>     wouldn't hurt. What tests to run though?
>>
>> Reference paper by the original patch authors:
>>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
>>
>> Suggested-by: Vasileios P. Kemerlis <vpk@cs.brown.edu>
>> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
> This patch needs to be broken down into smaller patches - a series

Agreed.


>> ---
>>  arch/x86/Kconfig         |   2 +-
>>  arch/x86/Kconfig.debug   |  17 +++++
>>  arch/x86/mm/Makefile     |   2 +
>>  arch/x86/mm/init.c       |   3 +-
>>  arch/x86/mm/xpfo.c       | 176 +++++++++++++++++++++++++++++++++++++++++++++++
>>  block/blk-map.c          |   7 +-
>>  include/linux/highmem.h  |  23 +++++--
>>  include/linux/mm_types.h |   4 ++
>>  include/linux/xpfo.h     |  88 ++++++++++++++++++++++++
>>  lib/swiotlb.c            |   3 +-
>>  mm/page_alloc.c          |   7 +-
>>  11 files changed, 323 insertions(+), 9 deletions(-)
>>  create mode 100644 arch/x86/mm/xpfo.c
>>  create mode 100644 include/linux/xpfo.h
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index c46662f..9d32b4a 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1343,7 +1343,7 @@ config ARCH_DMA_ADDR_T_64BIT
>>  
>>  config X86_DIRECT_GBPAGES
>>  	def_bool y
>> -	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
>> +	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>>  	---help---
>>  	  Certain kernel features effectively disable kernel
>>  	  linear 1 GB mappings (even if the CPU otherwise
>> diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
>> index 9b18ed9..1331da5 100644
>> --- a/arch/x86/Kconfig.debug
>> +++ b/arch/x86/Kconfig.debug
>> @@ -5,6 +5,23 @@ config TRACE_IRQFLAGS_SUPPORT
>>  
>>  source "lib/Kconfig.debug"
>>  
>> +config XPFO
>> +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
>> +	default n
>> +	depends on DEBUG_KERNEL
>> +	depends on X86_64
>> +	select DEBUG_TLBFLUSH
>> +	---help---
>> +	  This option offers protection against 'ret2dir' (kernel) attacks.
>> +	  When enabled, every time a page frame is allocated to user space, it
>> +	  is unmapped from the direct mapped RAM region in kernel space
>> +	  (physmap). Similarly, whenever page frames are freed/reclaimed, they
>> +	  are mapped back to physmap. Special care is taken to minimize the
>> +	  impact on performance by reducing TLB shootdowns and unnecessary page
>> +	  zero fills.
>> +
>> +	  If in doubt, say "N".
>> +
>>  config X86_VERBOSE_BOOTUP
>>  	bool "Enable verbose x86 bootup info messages"
>>  	default y
>> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
>> index f9d38a4..8bf52b6 100644
>> --- a/arch/x86/mm/Makefile
>> +++ b/arch/x86/mm/Makefile
>> @@ -34,3 +34,5 @@ obj-$(CONFIG_ACPI_NUMA)		+= srat.o
>>  obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
>>  
>>  obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
>> +
>> +obj-$(CONFIG_XPFO)		+= xpfo.o
>> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>> index 493f541..27fc8a6 100644
>> --- a/arch/x86/mm/init.c
>> +++ b/arch/x86/mm/init.c
>> @@ -150,7 +150,8 @@ static int page_size_mask;
>>  
>>  static void __init probe_page_size_mask(void)
>>  {
>> -#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
>> +#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK) && \
>> +	!defined(CONFIG_XPFO)
>>  	/*
>>  	 * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
>>  	 * This will simplify cpa(), which otherwise needs to support splitting
>> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
>> new file mode 100644
>> index 0000000..6bc24d3
>> --- /dev/null
>> +++ b/arch/x86/mm/xpfo.c
>> @@ -0,0 +1,176 @@
>> +/*
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
>> + *
>> + * Authors:
>> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
>> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/module.h>
>> +
>> +#include <asm/pgtable.h>
>> +#include <asm/tlbflush.h>
>> +
>> +#define TEST_XPFO_FLAG(flag, page) \
>> +	test_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
>> +
>> +#define SET_XPFO_FLAG(flag, page)			\
>> +	__set_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
>> +
>> +#define CLEAR_XPFO_FLAG(flag, page)			\
>> +	__clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
>> +
>> +#define TEST_AND_CLEAR_XPFO_FLAG(flag, page)			\
>> +	__test_and_clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
>> +
>> +/*
>> + * Update a single kernel page table entry
>> + */
>> +static inline void set_kpte(struct page *page, unsigned long kaddr,
>> +			    pgprot_t prot) {
>> +	unsigned int level;
>> +	pte_t *kpte = lookup_address(kaddr, &level);
>> +
>> +	/* We only support 4k pages for now */
>> +	BUG_ON(!kpte || level != PG_LEVEL_4K);
>> +
>> +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
>> +}
>> +
>> +inline void xpfo_clear_zap(struct page *page, int order)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < (1 << order); i++)
>> +		CLEAR_XPFO_FLAG(zap, page + i);
>> +}
>> +
>> +inline int xpfo_test_and_clear_zap(struct page *page)
>> +{
>> +	return TEST_AND_CLEAR_XPFO_FLAG(zap, page);
>> +}
>> +
>> +inline int xpfo_test_kernel(struct page *page)
>> +{
>> +	return TEST_XPFO_FLAG(kernel, page);
>> +}
>> +
>> +inline int xpfo_test_user(struct page *page)
>> +{
>> +	return TEST_XPFO_FLAG(user, page);
>> +}
>> +
>> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
>> +{
>> +	int i, tlb_shoot = 0;
>> +	unsigned long kaddr;
>> +
>> +	for (i = 0; i < (1 << order); i++)  {
>> +		WARN_ON(TEST_XPFO_FLAG(user_fp, page + i) ||
>> +			TEST_XPFO_FLAG(user, page + i));
>> +
>> +		if (gfp & GFP_HIGHUSER) {
> Why GFP_HIGHUSER?

The check is wrong. It should be ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER).

Thanks
...Juerg


>> +			/* Initialize the xpfo lock and map counter */
>> +			spin_lock_init(&(page + i)->xpfo.lock);
>> +			atomic_set(&(page + i)->xpfo.mapcount, 0);
>> +
>> +			/* Mark it as a user page */
>> +			SET_XPFO_FLAG(user_fp, page + i);
>> +
>> +			/*
>> +			 * Shoot the TLB if the page was previously allocated
>> +			 * to kernel space
>> +			 */
>> +			if (TEST_AND_CLEAR_XPFO_FLAG(kernel, page + i))
>> +				tlb_shoot = 1;
>> +		} else {
>> +			/* Mark it as a kernel page */
>> +			SET_XPFO_FLAG(kernel, page + i);
>> +		}
>> +	}
>> +
>> +	if (tlb_shoot) {
>> +		kaddr = (unsigned long)page_address(page);
>> +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
>> +				       PAGE_SIZE);
>> +	}
>> +}
>> +
>> +void xpfo_free_page(struct page *page, int order)
>> +{
>> +	int i;
>> +	unsigned long kaddr;
>> +
>> +	for (i = 0; i < (1 << order); i++) {
>> +
>> +		/* The page frame was previously allocated to user space */
>> +		if (TEST_AND_CLEAR_XPFO_FLAG(user, page + i)) {
>> +			kaddr = (unsigned long)page_address(page + i);
>> +
>> +			/* Clear the page and mark it accordingly */
>> +			clear_page((void *)kaddr);
>> +			SET_XPFO_FLAG(zap, page + i);
>> +
>> +			/* Map it back to kernel space */
>> +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
>> +
>> +			/* No TLB update */
>> +		}
>> +
>> +		/* Clear the xpfo fast-path flag */
>> +		CLEAR_XPFO_FLAG(user_fp, page + i);
>> +	}
>> +}
>> +
>> +void xpfo_kmap(void *kaddr, struct page *page)
>> +{
>> +	unsigned long flags;
>> +
>> +	/* The page is allocated to kernel space, so nothing to do */
>> +	if (TEST_XPFO_FLAG(kernel, page))
>> +		return;
>> +
>> +	spin_lock_irqsave(&page->xpfo.lock, flags);
>> +
>> +	/*
>> +	 * The page was previously allocated to user space, so map it back
>> +	 * into the kernel. No TLB update required.
>> +	 */
>> +	if ((atomic_inc_return(&page->xpfo.mapcount) == 1) &&
>> +	    TEST_XPFO_FLAG(user, page))
>> +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
>> +
>> +	spin_unlock_irqrestore(&page->xpfo.lock, flags);
>> +}
>> +EXPORT_SYMBOL(xpfo_kmap);
>> +
>> +void xpfo_kunmap(void *kaddr, struct page *page)
>> +{
>> +	unsigned long flags;
>> +
>> +	/* The page is allocated to kernel space, so nothing to do */
>> +	if (TEST_XPFO_FLAG(kernel, page))
>> +		return;
>> +
>> +	spin_lock_irqsave(&page->xpfo.lock, flags);
>> +
>> +	/*
>> +	 * The page frame is to be allocated back to user space. So unmap it
>> +	 * from the kernel, update the TLB and mark it as a user page.
>> +	 */
>> +	if ((atomic_dec_return(&page->xpfo.mapcount) == 0) &&
>> +	    (TEST_XPFO_FLAG(user_fp, page) || TEST_XPFO_FLAG(user, page))) {
>> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
>> +		__flush_tlb_one((unsigned long)kaddr);
>> +		SET_XPFO_FLAG(user, page);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&page->xpfo.lock, flags);
>> +}
>> +EXPORT_SYMBOL(xpfo_kunmap);
>> diff --git a/block/blk-map.c b/block/blk-map.c
>> index f565e11..b7b8302 100644
>> --- a/block/blk-map.c
>> +++ b/block/blk-map.c
>> @@ -107,7 +107,12 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
>>  		prv.iov_len = iov.iov_len;
>>  	}
>>  
>> -	if (unaligned || (q->dma_pad_mask & iter->count) || map_data)
>> +	/*
>> +	 * juergh: Temporary hack to force the use of a bounce buffer if XPFO
>> +	 * is enabled. Results in an XPFO page fault otherwise.
>> +	 */
> This does look like it might add a bunch of overhead
>> +	if (unaligned || (q->dma_pad_mask & iter->count) || map_data ||
>> +	    IS_ENABLED(CONFIG_XPFO))
>>  		bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
>>  	else
>>  		bio = bio_map_user_iov(q, iter, gfp_mask);
>> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>> index bb3f329..0ca9130 100644
>> --- a/include/linux/highmem.h
>> +++ b/include/linux/highmem.h
>> @@ -55,24 +55,37 @@ static inline struct page *kmap_to_page(void *addr)
>>  #ifndef ARCH_HAS_KMAP
>>  static inline void *kmap(struct page *page)
>>  {
>> +	void *kaddr;
>> +
>>  	might_sleep();
>> -	return page_address(page);
>> +
>> +	kaddr = page_address(page);
>> +	xpfo_kmap(kaddr, page);
>> +	return kaddr;
>>  }
>>  
>>  static inline void kunmap(struct page *page)
>>  {
>> +	xpfo_kunmap(page_address(page), page);
>>  }
>>  
>>  static inline void *kmap_atomic(struct page *page)
>>  {
>> +	void *kaddr;
>> +
>>  	preempt_disable();
>>  	pagefault_disable();
>> -	return page_address(page);
>> +
>> +	kaddr = page_address(page);
>> +	xpfo_kmap(kaddr, page);
>> +	return kaddr;
>>  }
>>  #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>>  
>>  static inline void __kunmap_atomic(void *addr)
>>  {
>> +	xpfo_kunmap(addr, virt_to_page(addr));
>> +
>>  	pagefault_enable();
>>  	preempt_enable();
>>  }
>> @@ -133,7 +146,8 @@ do {                                                            \
>>  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>>  {
>>  	void *addr = kmap_atomic(page);
>> -	clear_user_page(addr, vaddr, page);
>> +	if (!xpfo_test_and_clear_zap(page))
>> +		clear_user_page(addr, vaddr, page);
>>  	kunmap_atomic(addr);
>>  }
>>  #endif
>> @@ -186,7 +200,8 @@ alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>>  static inline void clear_highpage(struct page *page)
>>  {
>>  	void *kaddr = kmap_atomic(page);
>> -	clear_page(kaddr);
>> +	if (!xpfo_test_and_clear_zap(page))
>> +		clear_page(kaddr);
>>  	kunmap_atomic(kaddr);
>>  }
>>  
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 624b78b..71c95aa 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -12,6 +12,7 @@
>>  #include <linux/cpumask.h>
>>  #include <linux/uprobes.h>
>>  #include <linux/page-flags-layout.h>
>> +#include <linux/xpfo.h>
>>  #include <asm/page.h>
>>  #include <asm/mmu.h>
>>  
>> @@ -215,6 +216,9 @@ struct page {
>>  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>>  	int _last_cpupid;
>>  #endif
>> +#ifdef CONFIG_XPFO
>> +	struct xpfo_info xpfo;
>> +#endif
>>  }
>>  /*
>>   * The struct page can be forced to be double word aligned so that atomic ops
>> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
>> new file mode 100644
>> index 0000000..c4f0871
>> --- /dev/null
>> +++ b/include/linux/xpfo.h
>> @@ -0,0 +1,88 @@
>> +/*
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
>> + *
>> + * Authors:
>> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
>> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#ifndef _LINUX_XPFO_H
>> +#define _LINUX_XPFO_H
>> +
>> +#ifdef CONFIG_XPFO
>> +
>> +/*
>> + * XPFO page flags:
>> + *
>> + * PG_XPFO_user_fp denotes that the page is allocated to user space. This flag
>> + * is used in the fast path, where the page is marked accordingly but *not*
>> + * unmapped from the kernel. In most cases, the kernel will need access to the
>> + * page immediately after its acquisition so an unnecessary mapping operation
>> + * is avoided.
>> + *
>> + * PG_XPFO_user denotes that the page is destined for user space. This flag is
>> + * used in the slow path, where the page needs to be mapped/unmapped when the
>> + * kernel wants to access it. If a page is deallocated and this flag is set,
>> + * the page is cleared and mapped back into the kernel.
>> + *
>> + * PG_XPFO_kernel denotes a page that is destined to kernel space. This is used
>> + * for identifying pages that are first assigned to kernel space and then freed
>> + * and mapped to user space. In such cases, an expensive TLB shootdown is
>> + * necessary. Pages allocated to user space, freed, and subsequently allocated
>> + * to user space again, require only local TLB invalidation.
>> + *
>> + * PG_XPFO_zap indicates that the page has been zapped. This flag is used to
>> + * avoid zapping pages multiple times. Whenever a page is freed and was
>> + * previously mapped to user space, it needs to be zapped before being
>> + * mapped back into the kernel.
>> + */
>> +
>> +enum xpfo_pageflags {
>> +	PG_XPFO_user_fp,
>> +	PG_XPFO_user,
>> +	PG_XPFO_kernel,
>> +	PG_XPFO_zap,
>> +};
>> +
>> +struct xpfo_info {
>> +	unsigned long flags;	/* Flags for tracking the page's XPFO state */
>> +	atomic_t mapcount;	/* Counter for balancing page map/unmap
>> +				 * requests. Only the first map request maps
>> +				 * the page back to kernel space. Likewise,
>> +				 * only the last unmap request unmaps the page.
>> +				 */
>> +	spinlock_t lock;	/* Lock to serialize concurrent map/unmap
>> +				 * requests.
>> +				 */
>> +};
>> +
>> +extern void xpfo_clear_zap(struct page *page, int order);
>> +extern int xpfo_test_and_clear_zap(struct page *page);
>> +extern int xpfo_test_kernel(struct page *page);
>> +extern int xpfo_test_user(struct page *page);
>> +
>> +extern void xpfo_kmap(void *kaddr, struct page *page);
>> +extern void xpfo_kunmap(void *kaddr, struct page *page);
>> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
>> +extern void xpfo_free_page(struct page *page, int order);
>> +
>> +#else /* ifdef CONFIG_XPFO */
>> +
>> +static inline void xpfo_clear_zap(struct page *page, int order) { }
>> +static inline int xpfo_test_and_clear_zap(struct page *page) { return 0; }
>> +static inline int xpfo_test_kernel(struct page *page) { return 0; }
>> +static inline int xpfo_test_user(struct page *page) { return 0; }
>> +
>> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
>> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
>> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
>> +static inline void xpfo_free_page(struct page *page, int order) { }
>> +
>> +#endif /* ifdef CONFIG_XPFO */
>> +
>> +#endif /* ifndef _LINUX_XPFO_H */
>> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
>> index 76f29ec..cf57ee9 100644
>> --- a/lib/swiotlb.c
>> +++ b/lib/swiotlb.c
>> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>>  {
>>  	unsigned long pfn = PFN_DOWN(orig_addr);
>>  	unsigned char *vaddr = phys_to_virt(tlb_addr);
>> +	struct page *page = pfn_to_page(pfn);
>>  
>> -	if (PageHighMem(pfn_to_page(pfn))) {
>> +	if (PageHighMem(page) || xpfo_test_user(page)) {
>>  		/* The buffer does not have a mapping.  Map it in and copy */
>>  		unsigned int offset = orig_addr & ~PAGE_MASK;
>>  		char *buffer;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 838ca8bb..47b42a3 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1003,6 +1003,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
>>  	}
>>  	arch_free_page(page, order);
>>  	kernel_map_pages(page, 1 << order, 0);
>> +	xpfo_free_page(page, order);
>>  
>>  	return true;
>>  }
>> @@ -1398,10 +1399,13 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
>>  	arch_alloc_page(page, order);
>>  	kernel_map_pages(page, 1 << order, 1);
>>  	kasan_alloc_pages(page, order);
>> +	xpfo_alloc_page(page, order, gfp_flags);
>>  
>>  	if (gfp_flags & __GFP_ZERO)
>>  		for (i = 0; i < (1 << order); i++)
>>  			clear_highpage(page + i);
>> +	else
>> +		xpfo_clear_zap(page, order);
>>  
>>  	if (order && (gfp_flags & __GFP_COMP))
>>  		prep_compound_page(page, order);
>> @@ -2072,10 +2076,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>>  	}
>>  
>>  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>> -	if (!cold)
>> +	if (!cold && !xpfo_test_kernel(page))
>>  		list_add(&page->lru, &pcp->lists[migratetype]);
>>  	else
>>  		list_add_tail(&page->lru, &pcp->lists[migratetype]);
>> +
>>  	pcp->count++;
>>  	if (pcp->count >= pcp->high) {
>>  		unsigned long batch = READ_ONCE(pcp->batch);
> 

^ permalink raw reply	[flat|nested] 93+ messages in thread

>> +
>> +	/* The page is allocated to kernel space, so nothing to do */
>> +	if (TEST_XPFO_FLAG(kernel, page))
>> +		return;
>> +
>> +	spin_lock_irqsave(&page->xpfo.lock, flags);
>> +
>> +	/*
>> +	 * The page frame is to be allocated back to user space. So unmap it
>> +	 * from the kernel, update the TLB and mark it as a user page.
>> +	 */
>> +	if ((atomic_dec_return(&page->xpfo.mapcount) == 0) &&
>> +	    (TEST_XPFO_FLAG(user_fp, page) || TEST_XPFO_FLAG(user, page))) {
>> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
>> +		__flush_tlb_one((unsigned long)kaddr);
>> +		SET_XPFO_FLAG(user, page);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&page->xpfo.lock, flags);
>> +}
>> +EXPORT_SYMBOL(xpfo_kunmap);
>> diff --git a/block/blk-map.c b/block/blk-map.c
>> index f565e11..b7b8302 100644
>> --- a/block/blk-map.c
>> +++ b/block/blk-map.c
>> @@ -107,7 +107,12 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
>>  		prv.iov_len = iov.iov_len;
>>  	}
>>  
>> -	if (unaligned || (q->dma_pad_mask & iter->count) || map_data)
>> +	/*
>> +	 * juergh: Temporary hack to force the use of a bounce buffer if XPFO
>> +	 * is enabled. Results in an XPFO page fault otherwise.
>> +	 */
> This does look like it might add a bunch of overhead
>> +	if (unaligned || (q->dma_pad_mask & iter->count) || map_data ||
>> +	    IS_ENABLED(CONFIG_XPFO))
>>  		bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
>>  	else
>>  		bio = bio_map_user_iov(q, iter, gfp_mask);
>> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>> index bb3f329..0ca9130 100644
>> --- a/include/linux/highmem.h
>> +++ b/include/linux/highmem.h
>> @@ -55,24 +55,37 @@ static inline struct page *kmap_to_page(void *addr)
>>  #ifndef ARCH_HAS_KMAP
>>  static inline void *kmap(struct page *page)
>>  {
>> +	void *kaddr;
>> +
>>  	might_sleep();
>> -	return page_address(page);
>> +
>> +	kaddr = page_address(page);
>> +	xpfo_kmap(kaddr, page);
>> +	return kaddr;
>>  }
>>  
>>  static inline void kunmap(struct page *page)
>>  {
>> +	xpfo_kunmap(page_address(page), page);
>>  }
>>  
>>  static inline void *kmap_atomic(struct page *page)
>>  {
>> +	void *kaddr;
>> +
>>  	preempt_disable();
>>  	pagefault_disable();
>> -	return page_address(page);
>> +
>> +	kaddr = page_address(page);
>> +	xpfo_kmap(kaddr, page);
>> +	return kaddr;
>>  }
>>  #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>>  
>>  static inline void __kunmap_atomic(void *addr)
>>  {
>> +	xpfo_kunmap(addr, virt_to_page(addr));
>> +
>>  	pagefault_enable();
>>  	preempt_enable();
>>  }
>> @@ -133,7 +146,8 @@ do {                                                            \
>>  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>>  {
>>  	void *addr = kmap_atomic(page);
>> -	clear_user_page(addr, vaddr, page);
>> +	if (!xpfo_test_and_clear_zap(page))
>> +		clear_user_page(addr, vaddr, page);
>>  	kunmap_atomic(addr);
>>  }
>>  #endif
>> @@ -186,7 +200,8 @@ alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>>  static inline void clear_highpage(struct page *page)
>>  {
>>  	void *kaddr = kmap_atomic(page);
>> -	clear_page(kaddr);
>> +	if (!xpfo_test_and_clear_zap(page))
>> +		clear_page(kaddr);
>>  	kunmap_atomic(kaddr);
>>  }
>>  
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 624b78b..71c95aa 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -12,6 +12,7 @@
>>  #include <linux/cpumask.h>
>>  #include <linux/uprobes.h>
>>  #include <linux/page-flags-layout.h>
>> +#include <linux/xpfo.h>
>>  #include <asm/page.h>
>>  #include <asm/mmu.h>
>>  
>> @@ -215,6 +216,9 @@ struct page {
>>  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>>  	int _last_cpupid;
>>  #endif
>> +#ifdef CONFIG_XPFO
>> +	struct xpfo_info xpfo;
>> +#endif
>>  }
>>  /*
>>   * The struct page can be forced to be double word aligned so that atomic ops
>> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
>> new file mode 100644
>> index 0000000..c4f0871
>> --- /dev/null
>> +++ b/include/linux/xpfo.h
>> @@ -0,0 +1,88 @@
>> +/*
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
>> + *
>> + * Authors:
>> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
>> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#ifndef _LINUX_XPFO_H
>> +#define _LINUX_XPFO_H
>> +
>> +#ifdef CONFIG_XPFO
>> +
>> +/*
>> + * XPFO page flags:
>> + *
>> + * PG_XPFO_user_fp denotes that the page is allocated to user space. This flag
>> + * is used in the fast path, where the page is marked accordingly but *not*
>> + * unmapped from the kernel. In most cases, the kernel will need access to the
>> + * page immediately after its acquisition so an unnecessary mapping operation
>> + * is avoided.
>> + *
>> + * PG_XPFO_user denotes that the page is destined for user space. This flag is
>> + * used in the slow path, where the page needs to be mapped/unmapped when the
>> + * kernel wants to access it. If a page is deallocated and this flag is set,
>> + * the page is cleared and mapped back into the kernel.
>> + *
>> + * PG_XPFO_kernel denotes a page that is destined for kernel space. This is used
>> + * for identifying pages that are first assigned to kernel space and then freed
>> + * and mapped to user space. In such cases, an expensive TLB shootdown is
>> + * necessary. Pages allocated to user space, freed, and subsequently allocated
>> + * to user space again, require only local TLB invalidation.
>> + *
>> + * PG_XPFO_zap indicates that the page has been zapped. This flag is used to
>> + * avoid zapping pages multiple times. Whenever a page is freed and was
>> + * previously mapped to user space, it needs to be zapped before being
>> + * mapped back into the kernel.
>> + */
>> +
>> +enum xpfo_pageflags {
>> +	PG_XPFO_user_fp,
>> +	PG_XPFO_user,
>> +	PG_XPFO_kernel,
>> +	PG_XPFO_zap,
>> +};
>> +
>> +struct xpfo_info {
>> +	unsigned long flags;	/* Flags for tracking the page's XPFO state */
>> +	atomic_t mapcount;	/* Counter for balancing page map/unmap
>> +				 * requests. Only the first map request maps
>> +				 * the page back to kernel space. Likewise,
>> +				 * only the last unmap request unmaps the page.
>> +				 */
>> +	spinlock_t lock;	/* Lock to serialize concurrent map/unmap
>> +				 * requests.
>> +				 */
>> +};
>> +
>> +extern void xpfo_clear_zap(struct page *page, int order);
>> +extern int xpfo_test_and_clear_zap(struct page *page);
>> +extern int xpfo_test_kernel(struct page *page);
>> +extern int xpfo_test_user(struct page *page);
>> +
>> +extern void xpfo_kmap(void *kaddr, struct page *page);
>> +extern void xpfo_kunmap(void *kaddr, struct page *page);
>> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
>> +extern void xpfo_free_page(struct page *page, int order);
>> +
>> +#else /* ifdef CONFIG_XPFO */
>> +
>> +static inline void xpfo_clear_zap(struct page *page, int order) { }
>> +static inline int xpfo_test_and_clear_zap(struct page *page) { return 0; }
>> +static inline int xpfo_test_kernel(struct page *page) { return 0; }
>> +static inline int xpfo_test_user(struct page *page) { return 0; }
>> +
>> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
>> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
>> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
>> +static inline void xpfo_free_page(struct page *page, int order) { }
>> +
>> +#endif /* ifdef CONFIG_XPFO */
>> +
>> +#endif /* ifndef _LINUX_XPFO_H */
>> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
>> index 76f29ec..cf57ee9 100644
>> --- a/lib/swiotlb.c
>> +++ b/lib/swiotlb.c
>> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>>  {
>>  	unsigned long pfn = PFN_DOWN(orig_addr);
>>  	unsigned char *vaddr = phys_to_virt(tlb_addr);
>> +	struct page *page = pfn_to_page(pfn);
>>  
>> -	if (PageHighMem(pfn_to_page(pfn))) {
>> +	if (PageHighMem(page) || xpfo_test_user(page)) {
>>  		/* The buffer does not have a mapping.  Map it in and copy */
>>  		unsigned int offset = orig_addr & ~PAGE_MASK;
>>  		char *buffer;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 838ca8bb..47b42a3 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1003,6 +1003,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
>>  	}
>>  	arch_free_page(page, order);
>>  	kernel_map_pages(page, 1 << order, 0);
>> +	xpfo_free_page(page, order);
>>  
>>  	return true;
>>  }
>> @@ -1398,10 +1399,13 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
>>  	arch_alloc_page(page, order);
>>  	kernel_map_pages(page, 1 << order, 1);
>>  	kasan_alloc_pages(page, order);
>> +	xpfo_alloc_page(page, order, gfp_flags);
>>  
>>  	if (gfp_flags & __GFP_ZERO)
>>  		for (i = 0; i < (1 << order); i++)
>>  			clear_highpage(page + i);
>> +	else
>> +		xpfo_clear_zap(page, order);
>>  
>>  	if (order && (gfp_flags & __GFP_COMP))
>>  		prep_compound_page(page, order);
>> @@ -2072,10 +2076,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>>  	}
>>  
>>  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>> -	if (!cold)
>> +	if (!cold && !xpfo_test_kernel(page))
>>  		list_add(&page->lru, &pcp->lists[migratetype]);
>>  	else
>>  		list_add_tail(&page->lru, &pcp->lists[migratetype]);
>> +
>>  	pcp->count++;
>>  	if (pcp->count >= pcp->high) {
>>  		unsigned long batch = READ_ONCE(pcp->batch);
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-03-21  8:37     ` Juerg Haefliger
@ 2016-03-28 19:29       ` Laura Abbott
  -1 siblings, 0 replies; 93+ messages in thread
From: Laura Abbott @ 2016-03-28 19:29 UTC (permalink / raw)
  To: Juerg Haefliger, linux-kernel, linux-mm; +Cc: vpk, Kees Cook

On 03/21/2016 01:37 AM, Juerg Haefliger wrote:
...
>>> +void xpfo_free_page(struct page *page, int order)
>>> +{
>>> +    int i;
>>> +    unsigned long kaddr;
>>> +
>>> +    for (i = 0; i < (1 << order); i++) {
>>> +
>>> +        /* The page frame was previously allocated to user space */
>>> +        if (TEST_AND_CLEAR_XPFO_FLAG(user, page + i)) {
>>> +            kaddr = (unsigned long)page_address(page + i);
>>> +
>>> +            /* Clear the page and mark it accordingly */
>>> +            clear_page((void *)kaddr);
>>
>> Clearing the page isn't related to XPFO. There's other work ongoing to
>> do clearing of the page on free.
>
> It's not strictly related to XPFO but adds another layer of security. Do you
> happen to have a pointer to the ongoing work that you mentioned?
>
>

The work was merged for the 4.6 merge window
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8823b1dbc05fab1a8bec275eeae4709257c2661d

This is a separate option to clear the page.

...

>>> @@ -2072,10 +2076,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>>>        }
>>>
>>>        pcp = &this_cpu_ptr(zone->pageset)->pcp;
>>> -    if (!cold)
>>> +    if (!cold && !xpfo_test_kernel(page))
>>>            list_add(&page->lru, &pcp->lists[migratetype]);
>>>        else
>>>            list_add_tail(&page->lru, &pcp->lists[migratetype]);
>>> +
>>
>> What's the advantage of this?
>
> Allocating a page to userspace that was previously allocated to kernel space
> requires an expensive TLB shootdown. The above will put previously
> kernel-allocated pages in the cold page cache to postpone their allocation as
> long as possible to minimize TLB shootdowns.
>
>

That makes sense. You probably want to make this a separate commit with
this explanation as the commit text.


>>>        pcp->count++;
>>>        if (pcp->count >= pcp->high) {
>>>            unsigned long batch = READ_ONCE(pcp->batch);
>>>
>
> Thanks for the review and comments! It's highly appreciated.
>
> ...Juerg
>
>
>> Thanks,
>> Laura


* Re: [RFC PATCH] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-03-21  8:44     ` Juerg Haefliger
@ 2016-04-01  0:21       ` Balbir Singh
  -1 siblings, 0 replies; 93+ messages in thread
From: Balbir Singh @ 2016-04-01  0:21 UTC (permalink / raw)
  To: Juerg Haefliger; +Cc: linux-kernel, linux-mm, vpk

On Mon, Mar 21, 2016 at 7:44 PM, Juerg Haefliger
<juerg.haefliger@hpe.com> wrote:
> Hi Balbir,
>
> Apologies for the slow reply.
>
No problem, I lost this in my inbox as well due to the reply latency.
>
> On 03/01/2016 03:10 AM, Balbir Singh wrote:
>>
>>
>> On 27/02/16 01:21, Juerg Haefliger wrote:
>>> This patch adds support for XPFO which protects against 'ret2dir' kernel
>>> attacks. The basic idea is to enforce exclusive ownership of page frames
>>> by either the kernel or userland, unless explicitly requested by the
>>> kernel. Whenever a page destined for userland is allocated, it is
>>> unmapped from physmap. When such a page is reclaimed from userland, it is
>>> mapped back to physmap.
>> physmap == xen physmap? Please clarify
>
> No, it's not XEN related. I might have the terminology wrong. Physmap is what
> the original authors used for describing <quote> a large, contiguous virtual
> memory region inside kernel address space that contains a direct mapping of part
> or all (depending on the architecture) physical memory. </quote>
>
Thanks for clarifying
>
>>> Mapping/unmapping from physmap is accomplished by modifying the PTE
>>> permission bits to allow/disallow access to the page.
>>>
>>> Additional fields are added to the page struct for XPFO housekeeping.
>>> Specifically a flags field to distinguish user vs. kernel pages, a
>>> reference counter to track physmap map/unmap operations and a lock to
>>> protect the XPFO fields.
>>>
>>> Known issues/limitations:
>>>   - Only supported on x86-64.
>> Is it due to lack of porting or a design limitation?
>
> Lack of porting. Support for other architectures will come later.
>
OK
>
>>>   - Only supports 4k pages.
>>>   - Adds additional data to the page struct.
>>>   - There are most likely some additional and legitimate use cases where
>>>     the kernel needs to access userspace. Those need to be identified and
>>>     made XPFO-aware.
>> Why not build an audit mode for it?
>
> Can you elaborate what you mean by this?
>
What I meant is: when the kernel needs to access userspace and XPFO is
not aware of it and is going to block the access, write to a log/trace
buffer so that it can be audited for correctness.

>
>>>   - There's a performance impact if XPFO is turned on. Per the paper
>>>     referenced below it's in the 1-3% ballpark. More performance testing
>>>     wouldn't hurt. What tests to run though?
>>>
>>> Reference paper by the original patch authors:
>>>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
>>>
>>> Suggested-by: Vasileios P. Kemerlis <vpk@cs.brown.edu>
>>> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
>> This patch needs to be broken down into smaller patches - a series
>
> Agreed.
>

I think it would be good to describe what needs to be XPFO-aware:

1. How are device mmaps shared between kernel and user covered?
2. How is copy_from/to_user covered?
3. How is the vDSO covered?
4. More...


Balbir Singh.

* [RFC PATCH v2 0/3] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-02-26 14:21 ` Juerg Haefliger
  (?)
@ 2016-09-02 11:39   ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-02 11:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

Changes from:
  v1 -> v2:
    - Moved the code from arch/x86/mm/ to mm/ since it's (mostly)
      arch-agnostic.
    - Moved the config to the generic layer and added ARCH_SUPPORTS_XPFO
      for x86.
    - Use page_ext for the additional per-page data.
    - Removed the clearing of pages. This can be accomplished by using
      PAGE_POISONING.
    - Split up the patch into multiple patches.
    - Fixed additional issues identified by reviewers.

This patch series adds support for XPFO which protects against 'ret2dir'
kernel attacks. The basic idea is to enforce exclusive ownership of page
frames by either the kernel or userspace, unless explicitly requested by
the kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's page table). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping.
Specifically, two flags to distinguish user vs. kernel pages and to tag
unmapped pages, a reference counter to balance kmap/kunmap operations,
and a lock to serialize access to the XPFO fields.

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel needs
    to access userspace which need to be made XPFO-aware
  - Performance penalty

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Juerg Haefliger (3):
  Add support for eXclusive Page Frame Ownership (XPFO)
  xpfo: Only put previous userspace pages into the hot cache
  block: Always use a bounce buffer when XPFO is enabled

 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 block/blk-map.c          |   2 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  41 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |  10 ++-
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 213 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  20 +++++
 12 files changed, 314 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

-- 
2.9.3

* [RFC PATCH v2 1/3] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-09-02 11:39     ` Juerg Haefliger
  0 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-02 11:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

This patch adds support for XPFO, which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's direct mapping of physical memory).
When such a page is reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping.
Specifically: two flags to distinguish user vs. kernel pages and to tag
unmapped pages, a reference counter to balance kmap/kunmap operations,
and a lock to serialize access to the XPFO fields.

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel needs
    to access userspace; these need to be identified and made XPFO-aware
  - Performance penalty (per the reference paper below, in the 1-3%
    ballpark)

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  39 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |   2 +
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 205 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  20 +++++
 11 files changed, 296 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c580d8c33562..dc5604a710c6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -165,6 +165,7 @@ config X86
 	select HAVE_STACK_VALIDATION		if X86_64
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_SUPPORTS_XPFO		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
@@ -1350,7 +1351,7 @@ config ARCH_DMA_ADDR_T_64BIT
 
 config X86_DIRECT_GBPAGES
 	def_bool y
-	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
+	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
 	---help---
 	  Certain kernel features effectively disable kernel
 	  linear 1 GB mappings (even if the CPU otherwise
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d28a2d741f9e..426427b54639 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -161,7 +161,7 @@ static int page_size_mask;
 
 static void __init probe_page_size_mask(void)
 {
-#if !defined(CONFIG_KMEMCHECK)
+#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
 	/*
 	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
 	 * use small pages.
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f3297062a..7a17c166532f 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -7,6 +7,7 @@
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 
@@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
 #ifndef ARCH_HAS_KMAP
 static inline void *kmap(struct page *page)
 {
+	void *kaddr;
+
 	might_sleep();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 
 static inline void kunmap(struct page *page)
 {
+	xpfo_kunmap(page_address(page), page);
 }
 
 static inline void *kmap_atomic(struct page *page)
 {
+	void *kaddr;
+
 	preempt_disable();
 	pagefault_disable();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
 
 static inline void __kunmap_atomic(void *addr)
 {
+	xpfo_kunmap(addr, virt_to_page(addr));
 	pagefault_enable();
 	preempt_enable();
 }
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 03f2a3e7d76d..fdf63dcc399e 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -27,6 +27,8 @@ enum page_ext_flags {
 	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
 	PAGE_EXT_DEBUG_GUARD,
 	PAGE_EXT_OWNER,
+	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
+	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	PAGE_EXT_YOUNG,
 	PAGE_EXT_IDLE,
@@ -48,6 +50,11 @@ struct page_ext {
 	int last_migrate_reason;
 	depot_stack_handle_t handle;
 #endif
+#ifdef CONFIG_XPFO
+	int inited;		/* Map counter and lock initialized */
+	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
+	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
+#endif
 };
 
 extern void pgdat_page_ext_init(struct pglist_data *pgdat);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..77187578ca33
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+extern void xpfo_kmap(void *kaddr, struct page *page);
+extern void xpfo_kunmap(void *kaddr, struct page *page);
+extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
+extern void xpfo_free_page(struct page *page, int order);
+
+extern bool xpfo_page_is_unmapped(struct page *page);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_page(struct page *page, int order) { }
+
+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 22e13a0e19d7..455eff44604e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 {
 	unsigned long pfn = PFN_DOWN(orig_addr);
 	unsigned char *vaddr = phys_to_virt(tlb_addr);
+	struct page *page = pfn_to_page(pfn);
 
-	if (PageHighMem(pfn_to_page(pfn))) {
+	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
 		/* The buffer does not have a mapping.  Map it in and copy */
 		unsigned int offset = orig_addr & ~PAGE_MASK;
 		char *buffer;
diff --git a/mm/Makefile b/mm/Makefile
index 2ca1faf3fa09..e6f8894423da 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -103,3 +103,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fbe73a6fe4b..0241c8a7e72a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1029,6 +1029,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	kernel_poison_pages(page, 1 << order, 0);
 	kernel_map_pages(page, 1 << order, 0);
 	kasan_free_pages(page, order);
+	xpfo_free_page(page, order);
 
 	return true;
 }
@@ -1726,6 +1727,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_page(page, order, gfp_flags);
 	set_page_owner(page, order, gfp_flags);
 }
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 44a4c029c8e7..1cd7d7f460cc 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -7,6 +7,7 @@
 #include <linux/kmemleak.h>
 #include <linux/page_owner.h>
 #include <linux/page_idle.h>
+#include <linux/xpfo.h>
 
 /*
  * struct page extension
@@ -63,6 +64,9 @@ static struct page_ext_operations *page_ext_ops[] = {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
+#ifdef CONFIG_XPFO
+	&page_xpfo_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..ddb1be05485d
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,205 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool need_xpfo(void)
+{
+	return true;
+}
+
+static void init_xpfo(void)
+{
+	printk(KERN_INFO "XPFO enabled\n");
+	static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+	.need = need_xpfo,
+	.init = init_xpfo,
+};
+
+/*
+ * Update a single kernel page table entry
+ */
+static inline void set_kpte(struct page *page, unsigned long kaddr,
+			    pgprot_t prot) {
+	unsigned int level;
+	pte_t *kpte = lookup_address(kaddr, &level);
+
+	/* We only support 4k pages for now */
+	BUG_ON(!kpte || level != PG_LEVEL_4K);
+
+	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+}
+
+void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
+{
+	int i, flush_tlb = 0;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++)  {
+		page_ext = lookup_page_ext(page + i);
+
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+
+		/* Initialize the map lock and map counter */
+		if (!page_ext->inited) {
+			spin_lock_init(&page_ext->maplock);
+			atomic_set(&page_ext->mapcount, 0);
+			page_ext->inited = 1;
+		}
+		BUG_ON(atomic_read(&page_ext->mapcount));
+
+		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			/*
+			 * Flush the TLB if the page was previously allocated
+			 * to the kernel.
+			 */
+			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
+					       &page_ext->flags))
+				flush_tlb = 1;
+		} else {
+			/* Tag the page as a kernel page */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+		}
+	}
+
+	if (flush_tlb) {
+		kaddr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
+				       PAGE_SIZE);
+	}
+}
+
+void xpfo_free_page(struct page *page, int order)
+{
+	int i;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		page_ext = lookup_page_ext(page + i);
+
+		if (!page_ext->inited) {
+			/*
+			 * The page was allocated before page_ext was
+			 * initialized, so it is a kernel page and it needs to
+			 * be tagged accordingly.
+			 */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+			continue;
+		}
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
+				       &page_ext->flags)) {
+			kaddr = (unsigned long)page_address(page + i);
+			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
+		}
+	}
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page was previously allocated to user space, so map it back
+	 * into the kernel. No TLB flush required.
+	 */
+	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
+	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
+		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from the
+	 * kernel, flush the TLB and tag it as a user page.
+	 */
+	if (atomic_dec_return(&page_ext->mapcount) == 0) {
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
+		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
+		__flush_tlb_one((unsigned long)kaddr);
+	}
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
+
+inline bool xpfo_page_is_unmapped(struct page *page)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
+}
diff --git a/security/Kconfig b/security/Kconfig
index da10d9b573a4..1eac37a9bec2 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,26 @@ menu "Security options"
 
 source security/keys/Kconfig
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	default n
+	depends on DEBUG_KERNEL && ARCH_SUPPORTS_XPFO
+	select DEBUG_TLBFLUSH
+	select PAGE_EXTENSION
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
-- 
2.9.3

^ permalink raw reply	[flat|nested] 93+ messages in thread


* [kernel-hardening] [RFC PATCH v2 1/3] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-09-02 11:39     ` Juerg Haefliger
  0 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-02 11:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

This patch adds support for XPFO, which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's direct mapping of physical memory).
When such a page is reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping.
Specifically, two flags to distinguish user vs. kernel pages and to tag
unmapped pages, a reference counter to balance kmap/kunmap operations,
and a lock to serialize access to the XPFO fields.
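The housekeeping just described can be sketched as a small toy model
(illustrative only, not kernel code; all names here are invented, and the
locking is omitted for brevity): a per-page user flag, an unmapped flag,
and a map counter that balances kmap/kunmap so the page is only remapped
while someone actually holds a mapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the per-page XPFO state (not the kernel's page_ext). */
struct toy_page {
	bool user;      /* allocated to userspace */
	bool unmapped;  /* currently absent from physmap */
	int mapcount;   /* outstanding kmap()s */
};

/* Allocating a page to userspace removes it from physmap. */
static void toy_alloc_user(struct toy_page *p)
{
	p->user = true;
	p->unmapped = true;
	p->mapcount = 0;
}

/* kmap() maps a user page back in; nested kmaps only bump the counter. */
static void toy_kmap(struct toy_page *p)
{
	if (!p->user)
		return;
	if (p->mapcount++ == 0)
		p->unmapped = false;
}

/* kunmap() unmaps again once the last holder of the mapping is gone. */
static void toy_kunmap(struct toy_page *p)
{
	if (!p->user)
		return;
	if (--p->mapcount == 0)
		p->unmapped = true;
}
```

The real implementation additionally tags kernel pages, flushes the TLB on
ownership transitions, and serializes these updates with a spinlock.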

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel needs
    to access userspace; those need to be identified and made XPFO-aware
  - Performance penalty

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  39 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |   2 +
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 205 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  20 +++++
 11 files changed, 296 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c580d8c33562..dc5604a710c6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -165,6 +165,7 @@ config X86
 	select HAVE_STACK_VALIDATION		if X86_64
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_SUPPORTS_XPFO		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
@@ -1350,7 +1351,7 @@ config ARCH_DMA_ADDR_T_64BIT
 
 config X86_DIRECT_GBPAGES
 	def_bool y
-	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
+	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
 	---help---
 	  Certain kernel features effectively disable kernel
 	  linear 1 GB mappings (even if the CPU otherwise
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d28a2d741f9e..426427b54639 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -161,7 +161,7 @@ static int page_size_mask;
 
 static void __init probe_page_size_mask(void)
 {
-#if !defined(CONFIG_KMEMCHECK)
+#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
 	/*
 	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
 	 * use small pages.
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f3297062a..7a17c166532f 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -7,6 +7,7 @@
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 
@@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
 #ifndef ARCH_HAS_KMAP
 static inline void *kmap(struct page *page)
 {
+	void *kaddr;
+
 	might_sleep();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 
 static inline void kunmap(struct page *page)
 {
+	xpfo_kunmap(page_address(page), page);
 }
 
 static inline void *kmap_atomic(struct page *page)
 {
+	void *kaddr;
+
 	preempt_disable();
 	pagefault_disable();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
 
 static inline void __kunmap_atomic(void *addr)
 {
+	xpfo_kunmap(addr, virt_to_page(addr));
 	pagefault_enable();
 	preempt_enable();
 }
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 03f2a3e7d76d..fdf63dcc399e 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -27,6 +27,8 @@ enum page_ext_flags {
 	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
 	PAGE_EXT_DEBUG_GUARD,
 	PAGE_EXT_OWNER,
+	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
+	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	PAGE_EXT_YOUNG,
 	PAGE_EXT_IDLE,
@@ -48,6 +50,11 @@ struct page_ext {
 	int last_migrate_reason;
 	depot_stack_handle_t handle;
 #endif
+#ifdef CONFIG_XPFO
+	int inited;		/* Map counter and lock initialized */
+	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
+	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
+#endif
 };
 
 extern void pgdat_page_ext_init(struct pglist_data *pgdat);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..77187578ca33
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+extern void xpfo_kmap(void *kaddr, struct page *page);
+extern void xpfo_kunmap(void *kaddr, struct page *page);
+extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
+extern void xpfo_free_page(struct page *page, int order);
+
+extern bool xpfo_page_is_unmapped(struct page *page);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_page(struct page *page, int order) { }
+
+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 22e13a0e19d7..455eff44604e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 {
 	unsigned long pfn = PFN_DOWN(orig_addr);
 	unsigned char *vaddr = phys_to_virt(tlb_addr);
+	struct page *page = pfn_to_page(pfn);
 
-	if (PageHighMem(pfn_to_page(pfn))) {
+	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
 		/* The buffer does not have a mapping.  Map it in and copy */
 		unsigned int offset = orig_addr & ~PAGE_MASK;
 		char *buffer;
diff --git a/mm/Makefile b/mm/Makefile
index 2ca1faf3fa09..e6f8894423da 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -103,3 +103,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fbe73a6fe4b..0241c8a7e72a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1029,6 +1029,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	kernel_poison_pages(page, 1 << order, 0);
 	kernel_map_pages(page, 1 << order, 0);
 	kasan_free_pages(page, order);
+	xpfo_free_page(page, order);
 
 	return true;
 }
@@ -1726,6 +1727,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_page(page, order, gfp_flags);
 	set_page_owner(page, order, gfp_flags);
 }
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 44a4c029c8e7..1cd7d7f460cc 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -7,6 +7,7 @@
 #include <linux/kmemleak.h>
 #include <linux/page_owner.h>
 #include <linux/page_idle.h>
+#include <linux/xpfo.h>
 
 /*
  * struct page extension
@@ -63,6 +64,9 @@ static struct page_ext_operations *page_ext_ops[] = {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
+#ifdef CONFIG_XPFO
+	&page_xpfo_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..ddb1be05485d
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,205 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool need_xpfo(void)
+{
+	return true;
+}
+
+static void init_xpfo(void)
+{
+	printk(KERN_INFO "XPFO enabled\n");
+	static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+	.need = need_xpfo,
+	.init = init_xpfo,
+};
+
+/*
+ * Update a single kernel page table entry
+ */
+static inline void set_kpte(struct page *page, unsigned long kaddr,
+			    pgprot_t prot) {
+	unsigned int level;
+	pte_t *kpte = lookup_address(kaddr, &level);
+
+	/* We only support 4k pages for now */
+	BUG_ON(!kpte || level != PG_LEVEL_4K);
+
+	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+}
+
+void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
+{
+	int i, flush_tlb = 0;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++)  {
+		page_ext = lookup_page_ext(page + i);
+
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+
+		/* Initialize the map lock and map counter */
+		if (!page_ext->inited) {
+			spin_lock_init(&page_ext->maplock);
+			atomic_set(&page_ext->mapcount, 0);
+			page_ext->inited = 1;
+		}
+		BUG_ON(atomic_read(&page_ext->mapcount));
+
+		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			/*
+			 * Flush the TLB if the page was previously allocated
+			 * to the kernel.
+			 */
+			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
+					       &page_ext->flags))
+				flush_tlb = 1;
+		} else {
+			/* Tag the page as a kernel page */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+		}
+	}
+
+	if (flush_tlb) {
+		kaddr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
+				       PAGE_SIZE);
+	}
+}
+
+void xpfo_free_page(struct page *page, int order)
+{
+	int i;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		page_ext = lookup_page_ext(page + i);
+
+		if (!page_ext->inited) {
+			/*
+			 * The page was allocated before page_ext was
+			 * initialized, so it is a kernel page and it needs to
+			 * be tagged accordingly.
+			 */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+			continue;
+		}
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
+				       &page_ext->flags)) {
+			kaddr = (unsigned long)page_address(page + i);
+			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
+		}
+	}
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page was previously allocated to user space, so map it back
+	 * into the kernel. No TLB flush required.
+	 */
+	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
+	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
+		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from the
+	 * kernel, flush the TLB and tag it as a user page.
+	 */
+	if (atomic_dec_return(&page_ext->mapcount) == 0) {
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
+		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
+		__flush_tlb_one((unsigned long)kaddr);
+	}
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
+
+inline bool xpfo_page_is_unmapped(struct page *page)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
+}
diff --git a/security/Kconfig b/security/Kconfig
index da10d9b573a4..1eac37a9bec2 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,26 @@ menu "Security options"
 
 source security/keys/Kconfig
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	default n
+	depends on DEBUG_KERNEL && ARCH_SUPPORTS_XPFO
+	select DEBUG_TLBFLUSH
+	select PAGE_EXTENSION
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
-- 
2.9.3

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC PATCH v2 2/3] xpfo: Only put previous userspace pages into the hot cache
  2016-09-02 11:39   ` Juerg Haefliger
  (?)
@ 2016-09-02 11:39     ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-02 11:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

Allocating a page to userspace that was previously allocated to the
kernel requires an expensive TLB shootdown. To minimize this, we only
put non-kernel pages into the hot cache to favor their allocation.

Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 include/linux/xpfo.h | 2 ++
 mm/page_alloc.c      | 8 +++++++-
 mm/xpfo.c            | 8 ++++++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 77187578ca33..077d1cfadfa2 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -24,6 +24,7 @@ extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
 extern void xpfo_free_page(struct page *page, int order);
 
 extern bool xpfo_page_is_unmapped(struct page *page);
+extern bool xpfo_page_is_kernel(struct page *page);
 
 #else /* !CONFIG_XPFO */
 
@@ -33,6 +34,7 @@ static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
 static inline void xpfo_free_page(struct page *page, int order) { }
 
 static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+static inline bool xpfo_page_is_kernel(struct page *page) { return false; }
 
 #endif /* CONFIG_XPFO */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0241c8a7e72a..83404b41e52d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2421,7 +2421,13 @@ void free_hot_cold_page(struct page *page, bool cold)
 	}
 
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
-	if (!cold)
+	/*
+	 * XPFO: Allocating a page to userspace that was previously allocated
+	 * to the kernel requires an expensive TLB shootdown. To minimize this,
+	 * we only put non-kernel pages into the hot cache to favor their
+	 * allocation.
+	 */
+	if (!cold && !xpfo_page_is_kernel(page))
 		list_add(&page->lru, &pcp->lists[migratetype]);
 	else
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
diff --git a/mm/xpfo.c b/mm/xpfo.c
index ddb1be05485d..f8dffda0c961 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -203,3 +203,11 @@ inline bool xpfo_page_is_unmapped(struct page *page)
 
 	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
 }
+
+inline bool xpfo_page_is_kernel(struct page *page)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	return test_bit(PAGE_EXT_XPFO_KERNEL, &lookup_page_ext(page)->flags);
+}
-- 
2.9.3

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC PATCH v2 3/3] block: Always use a bounce buffer when XPFO is enabled
  2016-09-02 11:39   ` Juerg Haefliger
  (?)
@ 2016-09-02 11:39     ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-02 11:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

This is a temporary hack to prevent the use of bio_map_user_iov()
which causes XPFO page faults.

Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 block/blk-map.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index b8657fa8dc9a..e889dbfee6fb 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -52,7 +52,7 @@ static int __blk_rq_map_user_iov(struct request *rq,
 	struct bio *bio, *orig_bio;
 	int ret;
 
-	if (copy)
+	if (copy || IS_ENABLED(CONFIG_XPFO))
 		bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
 	else
 		bio = bio_map_user_iov(q, iter, gfp_mask);
-- 
2.9.3

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH v2 2/3] xpfo: Only put previous userspace pages into the hot cache
  2016-09-02 11:39     ` Juerg Haefliger
  (?)
@ 2016-09-02 20:39       ` Dave Hansen
  -1 siblings, 0 replies; 93+ messages in thread
From: Dave Hansen @ 2016-09-02 20:39 UTC (permalink / raw)
  To: Juerg Haefliger, linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: vpk

On 09/02/2016 04:39 AM, Juerg Haefliger wrote:
> Allocating a page to userspace that was previously allocated to the
> kernel requires an expensive TLB shootdown. To minimize this, we only
> put non-kernel pages into the hot cache to favor their allocation.

But kernel allocations do allocate from these pools, right?  Does this
just mean that kernel allocations usually have to pay the penalty to
convert a page?

So, what's the logic here?  You're assuming that order-0 kernel
allocations are more rare than allocations for userspace?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH v2 2/3] xpfo: Only put previous userspace pages into the hot cache
  2016-09-02 20:39       ` Dave Hansen
@ 2016-09-05 11:54         ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-05 11:54 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, linux-mm, kernel-hardening, linux-x86_64; +Cc: vpk


[-- Attachment #1.1: Type: text/plain, Size: 989 bytes --]

On 09/02/2016 10:39 PM, Dave Hansen wrote:
> On 09/02/2016 04:39 AM, Juerg Haefliger wrote:
>> Allocating a page to userspace that was previously allocated to the
>> kernel requires an expensive TLB shootdown. To minimize this, we only
>> put non-kernel pages into the hot cache to favor their allocation.
> 
> But kernel allocations do allocate from these pools, right?

Yes.


> Does this
> just mean that kernel allocations usually have to pay the penalty to
> convert a page?

Only pages that are allocated for userspace (gfp & GFP_HIGHUSER == GFP_HIGHUSER) which were
previously allocated for the kernel (gfp & GFP_HIGHUSER != GFP_HIGHUSER) have to pay the penalty.


> So, what's the logic here?  You're assuming that order-0 kernel
> allocations are more rare than allocations for userspace?

The logic is to put reclaimed kernel pages into the cold cache to postpone their allocation as long
as possible to minimize (potential) TLB flushes.
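The gfp test described above can be illustrated with mock flag values
(these bit values are invented for the example; the real kernel constants
differ). Since GFP_HIGHUSER is a mask composed of several bits, masking and
then comparing for equality distinguishes a full userspace allocation from
an allocation that merely shares some of those bits:

```c
#include <assert.h>

/* Illustrative mock gfp bits -- NOT the real kernel values. */
#define __GFP_HIGHMEM  0x1u
#define __GFP_USER     0x2u
#define __GFP_IO       0x4u
#define GFP_HIGHUSER   (__GFP_HIGHMEM | __GFP_USER | __GFP_IO)

/* True only if every bit of the GFP_HIGHUSER mask is set in gfp. */
static int is_user_alloc(unsigned int gfp)
{
	return (gfp & GFP_HIGHUSER) == GFP_HIGHUSER;
}
```

A plain `gfp & GFP_HIGHUSER` test would wrongly match kernel allocations
that set only one of the mask's bits, which is why the patch compares the
masked value against the full mask.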

...Juerg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [kernel-hardening] Re: [RFC PATCH v2 2/3] xpfo: Only put previous userspace pages into the hot cache
@ 2016-09-05 11:54         ` Juerg Haefliger
  0 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-05 11:54 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, linux-mm, kernel-hardening, linux-x86_64; +Cc: vpk


[-- Attachment #1.1: Type: text/plain, Size: 989 bytes --]

On 09/02/2016 10:39 PM, Dave Hansen wrote:
> On 09/02/2016 04:39 AM, Juerg Haefliger wrote:
>> Allocating a page to userspace that was previously allocated to the
>> kernel requires an expensive TLB shootdown. To minimize this, we only
>> put non-kernel pages into the hot cache to favor their allocation.
> 
> But kernel allocations do allocate from these pools, right?

Yes.


> Does this
> just mean that kernel allocations usually have to pay the penalty to
> convert a page?

Only pages that are allocated for userspace (gfp & GFP_HIGHUSER == GFP_HIGHUSER) which were
previously allocated for the kernel (gfp & GFP_HIGHUSER != GFP_HIGHUSER) have to pay the penalty.


> So, what's the logic here?  You're assuming that order-0 kernel
> allocations are more rare than allocations for userspace?

The logic is to put reclaimed kernel pages into the cold cache to postpone their allocation as long
as possible to minimize (potential) TLB flushes.

...Juerg




* [RFC PATCH v2 0/3] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-09-02 11:39   ` Juerg Haefliger
  (?)
@ 2016-09-14  7:18     ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-14  7:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

Changes from:
  v1 -> v2:
    - Moved the code from arch/x86/mm/ to mm/ since it's (mostly)
      arch-agnostic.
    - Moved the config to the generic layer and added ARCH_SUPPORTS_XPFO
      for x86.
    - Use page_ext for the additional per-page data.
    - Removed the clearing of pages. This can be accomplished by using
      PAGE_POISONING.
    - Split up the patch into multiple patches.
    - Fixed additional issues identified by reviewers.

This patch series adds support for XPFO which protects against 'ret2dir'
kernel attacks. The basic idea is to enforce exclusive ownership of page
frames by either the kernel or userspace, unless explicitly requested by
the kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's direct mapping of physical
memory). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping.
Specifically: two flags to distinguish user vs. kernel pages and to tag
unmapped pages, a reference counter to balance kmap/kunmap operations,
and a lock to serialize access to the XPFO fields.
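The kmap/kunmap balancing described above follows a simple first-mapper/last-unmapper protocol. A toy model (struct and function names are invented for the sketch; the real code lives in mm/xpfo.c, takes page_ext->maplock, and flushes the TLB on the final unmap):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the refcounted map/unmap protocol: the first mapper
 * restores the physmap mapping, the last unmapper removes it again. */
struct xpfo_state {
	int mapcount;   /* outstanding kmap()s of this user page */
	bool unmapped;  /* true while absent from physmap */
};

static void toy_kmap(struct xpfo_state *s)
{
	if (++s->mapcount == 1)
		s->unmapped = false;  /* map back into physmap */
}

static void toy_kunmap(struct xpfo_state *s)
{
	if (--s->mapcount == 0)
		s->unmapped = true;   /* unmap; real code also flushes TLB */
}
```

Nested mappings keep the page mapped until the final kunmap, which is why a plain flag would not suffice.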

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel needs
    to access userspace which need to be made XPFO-aware
  - Performance penalty

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Juerg Haefliger (3):
  Add support for eXclusive Page Frame Ownership (XPFO)
  xpfo: Only put previous userspace pages into the hot cache
  block: Always use a bounce buffer when XPFO is enabled

 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 block/blk-map.c          |   2 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  41 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |  10 ++-
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 213 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  20 +++++
 12 files changed, 314 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

-- 
2.9.3


* [RFC PATCH v2 0/3] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-09-14  7:18     ` Juerg Haefliger
  0 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-14  7:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

Changes from:
  v1 -> v2:
    - Moved the code from arch/x86/mm/ to mm/ since it's (mostly)
      arch-agnostic.
    - Moved the config to the generic layer and added ARCH_SUPPORTS_XPFO
      for x86.
    - Use page_ext for the additional per-page data.
    - Removed the clearing of pages. This can be accomplished by using
      PAGE_POISONING.
    - Split up the patch into multiple patches.
    - Fixed additional issues identified by reviewers.

This patch series adds support for XPFO which protects against 'ret2dir'
kernel attacks. The basic idea is to enforce exclusive ownership of page
frames by either the kernel or userspace, unless explicitly requested by
the kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's direct mapping of physical
memory). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping.
Specifically: two flags to distinguish user vs. kernel pages and to tag
unmapped pages, a reference counter to balance kmap/kunmap operations,
and a lock to serialize access to the XPFO fields.

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel needs
    to access userspace which need to be made XPFO-aware
  - Performance penalty

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Juerg Haefliger (3):
  Add support for eXclusive Page Frame Ownership (XPFO)
  xpfo: Only put previous userspace pages into the hot cache
  block: Always use a bounce buffer when XPFO is enabled

 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 block/blk-map.c          |   2 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  41 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |  10 ++-
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 213 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  20 +++++
 12 files changed, 314 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

-- 
2.9.3


* [kernel-hardening] [RFC PATCH v2 0/3] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-09-14  7:18     ` Juerg Haefliger
  0 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-14  7:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

Changes from:
  v1 -> v2:
    - Moved the code from arch/x86/mm/ to mm/ since it's (mostly)
      arch-agnostic.
    - Moved the config to the generic layer and added ARCH_SUPPORTS_XPFO
      for x86.
    - Use page_ext for the additional per-page data.
    - Removed the clearing of pages. This can be accomplished by using
      PAGE_POISONING.
    - Split up the patch into multiple patches.
    - Fixed additional issues identified by reviewers.

This patch series adds support for XPFO which protects against 'ret2dir'
kernel attacks. The basic idea is to enforce exclusive ownership of page
frames by either the kernel or userspace, unless explicitly requested by
the kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's direct mapping of physical
memory). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping.
Specifically: two flags to distinguish user vs. kernel pages and to tag
unmapped pages, a reference counter to balance kmap/kunmap operations,
and a lock to serialize access to the XPFO fields.

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel needs
    to access userspace which need to be made XPFO-aware
  - Performance penalty

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Juerg Haefliger (3):
  Add support for eXclusive Page Frame Ownership (XPFO)
  xpfo: Only put previous userspace pages into the hot cache
  block: Always use a bounce buffer when XPFO is enabled

 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 block/blk-map.c          |   2 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  41 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |  10 ++-
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 213 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  20 +++++
 12 files changed, 314 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

-- 
2.9.3


* [RFC PATCH v2 1/3] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-09-14  7:18     ` Juerg Haefliger
  (?)
@ 2016-09-14  7:18       ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-14  7:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's direct mapping of physical
memory). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping.
Specifically: two flags to distinguish user vs. kernel pages and to tag
unmapped pages, a reference counter to balance kmap/kunmap operations,
and a lock to serialize access to the XPFO fields.

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel needs
    to access userspace which need to be made XPFO-aware
  - Performance penalty

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  39 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |   2 +
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 205 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  20 +++++
 11 files changed, 296 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c580d8c33562..dc5604a710c6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -165,6 +165,7 @@ config X86
 	select HAVE_STACK_VALIDATION		if X86_64
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_SUPPORTS_XPFO		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
@@ -1350,7 +1351,7 @@ config ARCH_DMA_ADDR_T_64BIT
 
 config X86_DIRECT_GBPAGES
 	def_bool y
-	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
+	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
 	---help---
 	  Certain kernel features effectively disable kernel
 	  linear 1 GB mappings (even if the CPU otherwise
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d28a2d741f9e..426427b54639 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -161,7 +161,7 @@ static int page_size_mask;
 
 static void __init probe_page_size_mask(void)
 {
-#if !defined(CONFIG_KMEMCHECK)
+#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
 	/*
 	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
 	 * use small pages.
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f3297062a..7a17c166532f 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -7,6 +7,7 @@
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 
@@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
 #ifndef ARCH_HAS_KMAP
 static inline void *kmap(struct page *page)
 {
+	void *kaddr;
+
 	might_sleep();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 
 static inline void kunmap(struct page *page)
 {
+	xpfo_kunmap(page_address(page), page);
 }
 
 static inline void *kmap_atomic(struct page *page)
 {
+	void *kaddr;
+
 	preempt_disable();
 	pagefault_disable();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
 
 static inline void __kunmap_atomic(void *addr)
 {
+	xpfo_kunmap(addr, virt_to_page(addr));
 	pagefault_enable();
 	preempt_enable();
 }
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 03f2a3e7d76d..fdf63dcc399e 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -27,6 +27,8 @@ enum page_ext_flags {
 	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
 	PAGE_EXT_DEBUG_GUARD,
 	PAGE_EXT_OWNER,
+	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
+	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	PAGE_EXT_YOUNG,
 	PAGE_EXT_IDLE,
@@ -48,6 +50,11 @@ struct page_ext {
 	int last_migrate_reason;
 	depot_stack_handle_t handle;
 #endif
+#ifdef CONFIG_XPFO
+	int inited;		/* Map counter and lock initialized */
+	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
+	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
+#endif
 };
 
 extern void pgdat_page_ext_init(struct pglist_data *pgdat);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..77187578ca33
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+extern void xpfo_kmap(void *kaddr, struct page *page);
+extern void xpfo_kunmap(void *kaddr, struct page *page);
+extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
+extern void xpfo_free_page(struct page *page, int order);
+
+extern bool xpfo_page_is_unmapped(struct page *page);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_page(struct page *page, int order) { }
+
+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 22e13a0e19d7..455eff44604e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 {
 	unsigned long pfn = PFN_DOWN(orig_addr);
 	unsigned char *vaddr = phys_to_virt(tlb_addr);
+	struct page *page = pfn_to_page(pfn);
 
-	if (PageHighMem(pfn_to_page(pfn))) {
+	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
 		/* The buffer does not have a mapping.  Map it in and copy */
 		unsigned int offset = orig_addr & ~PAGE_MASK;
 		char *buffer;
diff --git a/mm/Makefile b/mm/Makefile
index 2ca1faf3fa09..e6f8894423da 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -103,3 +103,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fbe73a6fe4b..0241c8a7e72a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1029,6 +1029,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	kernel_poison_pages(page, 1 << order, 0);
 	kernel_map_pages(page, 1 << order, 0);
 	kasan_free_pages(page, order);
+	xpfo_free_page(page, order);
 
 	return true;
 }
@@ -1726,6 +1727,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_page(page, order, gfp_flags);
 	set_page_owner(page, order, gfp_flags);
 }
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 44a4c029c8e7..1cd7d7f460cc 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -7,6 +7,7 @@
 #include <linux/kmemleak.h>
 #include <linux/page_owner.h>
 #include <linux/page_idle.h>
+#include <linux/xpfo.h>
 
 /*
  * struct page extension
@@ -63,6 +64,9 @@ static struct page_ext_operations *page_ext_ops[] = {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
+#ifdef CONFIG_XPFO
+	&page_xpfo_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..ddb1be05485d
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,205 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool need_xpfo(void)
+{
+	return true;
+}
+
+static void init_xpfo(void)
+{
+	printk(KERN_INFO "XPFO enabled\n");
+	static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+	.need = need_xpfo,
+	.init = init_xpfo,
+};
+
+/*
+ * Update a single kernel page table entry
+ */
+static inline void set_kpte(struct page *page, unsigned long kaddr,
+			    pgprot_t prot) {
+	unsigned int level;
+	pte_t *kpte = lookup_address(kaddr, &level);
+
+	/* We only support 4k pages for now */
+	BUG_ON(!kpte || level != PG_LEVEL_4K);
+
+	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+}
+
+void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
+{
+	int i, flush_tlb = 0;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++)  {
+		page_ext = lookup_page_ext(page + i);
+
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+
+		/* Initialize the map lock and map counter */
+		if (!page_ext->inited) {
+			spin_lock_init(&page_ext->maplock);
+			atomic_set(&page_ext->mapcount, 0);
+			page_ext->inited = 1;
+		}
+		BUG_ON(atomic_read(&page_ext->mapcount));
+
+		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			/*
+			 * Flush the TLB if the page was previously allocated
+			 * to the kernel.
+			 */
+			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
+					       &page_ext->flags))
+				flush_tlb = 1;
+		} else {
+			/* Tag the page as a kernel page */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+		}
+	}
+
+	if (flush_tlb) {
+		kaddr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
+				       PAGE_SIZE);
+	}
+}
+
+void xpfo_free_page(struct page *page, int order)
+{
+	int i;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		page_ext = lookup_page_ext(page + i);
+
+		if (!page_ext->inited) {
+			/*
+			 * The page was allocated before page_ext was
+			 * initialized, so it is a kernel page and it needs to
+			 * be tagged accordingly.
+			 */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+			continue;
+		}
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
+				       &page_ext->flags)) {
+			kaddr = (unsigned long)page_address(page + i);
+			set_kpte(page + i, kaddr, __pgprot(__PAGE_KERNEL));
+		}
+	}
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page was previously allocated to user space, so map it back
+	 * into the kernel. No TLB flush required.
+	 */
+	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
+	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
+		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from the
+	 * kernel, flush the TLB and tag it as a user page.
+	 */
+	if (atomic_dec_return(&page_ext->mapcount) == 0) {
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
+		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
+		__flush_tlb_one((unsigned long)kaddr);
+	}
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
+
+inline bool xpfo_page_is_unmapped(struct page *page)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
+}
diff --git a/security/Kconfig b/security/Kconfig
index da10d9b573a4..1eac37a9bec2 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,26 @@ menu "Security options"
 
 source security/keys/Kconfig
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	default n
+	depends on DEBUG_KERNEL && ARCH_SUPPORTS_XPFO
+	select DEBUG_TLBFLUSH
+	select PAGE_EXTENSION
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
-- 
2.9.3


* [RFC PATCH v2 1/3] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-09-14  7:18       ` Juerg Haefliger
  0 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-14  7:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's direct mapping of physical
memory). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping.
Specifically: two flags to distinguish user vs. kernel pages and to tag
unmapped pages, a reference counter to balance kmap/kunmap operations,
and a lock to serialize access to the XPFO fields.

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel needs
    to access userspace which need to be made XPFO-aware
  - Performance penalty

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  39 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |   2 +
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 205 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  20 +++++
 11 files changed, 296 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c580d8c33562..dc5604a710c6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -165,6 +165,7 @@ config X86
 	select HAVE_STACK_VALIDATION		if X86_64
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_SUPPORTS_XPFO		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
@@ -1350,7 +1351,7 @@ config ARCH_DMA_ADDR_T_64BIT
 
 config X86_DIRECT_GBPAGES
 	def_bool y
-	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
+	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
 	---help---
 	  Certain kernel features effectively disable kernel
 	  linear 1 GB mappings (even if the CPU otherwise
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d28a2d741f9e..426427b54639 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -161,7 +161,7 @@ static int page_size_mask;
 
 static void __init probe_page_size_mask(void)
 {
-#if !defined(CONFIG_KMEMCHECK)
+#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
 	/*
 	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
 	 * use small pages.
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f3297062a..7a17c166532f 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -7,6 +7,7 @@
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 
@@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
 #ifndef ARCH_HAS_KMAP
 static inline void *kmap(struct page *page)
 {
+	void *kaddr;
+
 	might_sleep();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 
 static inline void kunmap(struct page *page)
 {
+	xpfo_kunmap(page_address(page), page);
 }
 
 static inline void *kmap_atomic(struct page *page)
 {
+	void *kaddr;
+
 	preempt_disable();
 	pagefault_disable();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
 
 static inline void __kunmap_atomic(void *addr)
 {
+	xpfo_kunmap(addr, virt_to_page(addr));
 	pagefault_enable();
 	preempt_enable();
 }
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 03f2a3e7d76d..fdf63dcc399e 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -27,6 +27,8 @@ enum page_ext_flags {
 	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
 	PAGE_EXT_DEBUG_GUARD,
 	PAGE_EXT_OWNER,
+	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
+	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	PAGE_EXT_YOUNG,
 	PAGE_EXT_IDLE,
@@ -48,6 +50,11 @@ struct page_ext {
 	int last_migrate_reason;
 	depot_stack_handle_t handle;
 #endif
+#ifdef CONFIG_XPFO
+	int inited;		/* Map counter and lock initialized */
+	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
+	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
+#endif
 };
 
 extern void pgdat_page_ext_init(struct pglist_data *pgdat);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..77187578ca33
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+extern void xpfo_kmap(void *kaddr, struct page *page);
+extern void xpfo_kunmap(void *kaddr, struct page *page);
+extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
+extern void xpfo_free_page(struct page *page, int order);
+
+extern bool xpfo_page_is_unmapped(struct page *page);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_page(struct page *page, int order) { }
+
+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 22e13a0e19d7..455eff44604e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 {
 	unsigned long pfn = PFN_DOWN(orig_addr);
 	unsigned char *vaddr = phys_to_virt(tlb_addr);
+	struct page *page = pfn_to_page(pfn);
 
-	if (PageHighMem(pfn_to_page(pfn))) {
+	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
 		/* The buffer does not have a mapping.  Map it in and copy */
 		unsigned int offset = orig_addr & ~PAGE_MASK;
 		char *buffer;
diff --git a/mm/Makefile b/mm/Makefile
index 2ca1faf3fa09..e6f8894423da 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -103,3 +103,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fbe73a6fe4b..0241c8a7e72a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1029,6 +1029,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	kernel_poison_pages(page, 1 << order, 0);
 	kernel_map_pages(page, 1 << order, 0);
 	kasan_free_pages(page, order);
+	xpfo_free_page(page, order);
 
 	return true;
 }
@@ -1726,6 +1727,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_page(page, order, gfp_flags);
 	set_page_owner(page, order, gfp_flags);
 }
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 44a4c029c8e7..1cd7d7f460cc 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -7,6 +7,7 @@
 #include <linux/kmemleak.h>
 #include <linux/page_owner.h>
 #include <linux/page_idle.h>
+#include <linux/xpfo.h>
 
 /*
  * struct page extension
@@ -63,6 +64,9 @@ static struct page_ext_operations *page_ext_ops[] = {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
+#ifdef CONFIG_XPFO
+	&page_xpfo_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..ddb1be05485d
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,205 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool need_xpfo(void)
+{
+	return true;
+}
+
+static void init_xpfo(void)
+{
+	printk(KERN_INFO "XPFO enabled\n");
+	static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+	.need = need_xpfo,
+	.init = init_xpfo,
+};
+
+/*
+ * Update a single kernel page table entry
+ */
+static inline void set_kpte(struct page *page, unsigned long kaddr,
+			    pgprot_t prot) {
+	unsigned int level;
+	pte_t *kpte = lookup_address(kaddr, &level);
+
+	/* We only support 4k pages for now */
+	BUG_ON(!kpte || level != PG_LEVEL_4K);
+
+	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+}
+
+void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
+{
+	int i, flush_tlb = 0;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++)  {
+		page_ext = lookup_page_ext(page + i);
+
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+
+		/* Initialize the map lock and map counter */
+		if (!page_ext->inited) {
+			spin_lock_init(&page_ext->maplock);
+			atomic_set(&page_ext->mapcount, 0);
+			page_ext->inited = 1;
+		}
+		BUG_ON(atomic_read(&page_ext->mapcount));
+
+		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			/*
+			 * Flush the TLB if the page was previously allocated
+			 * to the kernel.
+			 */
+			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
+					       &page_ext->flags))
+				flush_tlb = 1;
+		} else {
+			/* Tag the page as a kernel page */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+		}
+	}
+
+	if (flush_tlb) {
+		kaddr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
+				       PAGE_SIZE);
+	}
+}
+
+void xpfo_free_page(struct page *page, int order)
+{
+	int i;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		page_ext = lookup_page_ext(page + i);
+
+		if (!page_ext->inited) {
+			/*
+			 * The page was allocated before page_ext was
+			 * initialized, so it is a kernel page and it needs to
+			 * be tagged accordingly.
+			 */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+			continue;
+		}
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
+				       &page_ext->flags)) {
+			kaddr = (unsigned long)page_address(page + i);
+			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
+		}
+	}
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page was previously allocated to user space, so map it back
+	 * into the kernel. No TLB flush required.
+	 */
+	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
+	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
+		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from the
+	 * kernel, flush the TLB and tag it as a user page.
+	 */
+	if (atomic_dec_return(&page_ext->mapcount) == 0) {
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
+		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
+		__flush_tlb_one((unsigned long)kaddr);
+	}
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
+
+inline bool xpfo_page_is_unmapped(struct page *page)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
+}
diff --git a/security/Kconfig b/security/Kconfig
index da10d9b573a4..1eac37a9bec2 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,26 @@ menu "Security options"
 
 source security/keys/Kconfig
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	default n
+	depends on DEBUG_KERNEL && ARCH_SUPPORTS_XPFO
+	select DEBUG_TLBFLUSH
+	select PAGE_EXTENSION
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
-- 
2.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [kernel-hardening] [RFC PATCH v2 1/3] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-09-14  7:18       ` Juerg Haefliger
  0 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-14  7:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's direct mapping of physical memory).
When such a page is reclaimed from userspace, it is mapped back to
physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping:
two flags to distinguish user vs. kernel pages and to tag unmapped pages,
a reference counter to balance kmap/kunmap operations, and a lock to
serialize access to the XPFO fields.

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel needs
    to access userspace; those need to be identified and made XPFO-aware
  - Performance penalty

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  39 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |   2 +
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 205 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  20 +++++
 11 files changed, 296 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c580d8c33562..dc5604a710c6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -165,6 +165,7 @@ config X86
 	select HAVE_STACK_VALIDATION		if X86_64
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_SUPPORTS_XPFO		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
@@ -1350,7 +1351,7 @@ config ARCH_DMA_ADDR_T_64BIT
 
 config X86_DIRECT_GBPAGES
 	def_bool y
-	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
+	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
 	---help---
 	  Certain kernel features effectively disable kernel
 	  linear 1 GB mappings (even if the CPU otherwise
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d28a2d741f9e..426427b54639 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -161,7 +161,7 @@ static int page_size_mask;
 
 static void __init probe_page_size_mask(void)
 {
-#if !defined(CONFIG_KMEMCHECK)
+#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
 	/*
 	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
 	 * use small pages.
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f3297062a..7a17c166532f 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -7,6 +7,7 @@
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 
@@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
 #ifndef ARCH_HAS_KMAP
 static inline void *kmap(struct page *page)
 {
+	void *kaddr;
+
 	might_sleep();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 
 static inline void kunmap(struct page *page)
 {
+	xpfo_kunmap(page_address(page), page);
 }
 
 static inline void *kmap_atomic(struct page *page)
 {
+	void *kaddr;
+
 	preempt_disable();
 	pagefault_disable();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
 
 static inline void __kunmap_atomic(void *addr)
 {
+	xpfo_kunmap(addr, virt_to_page(addr));
 	pagefault_enable();
 	preempt_enable();
 }
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 03f2a3e7d76d..fdf63dcc399e 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -27,6 +27,8 @@ enum page_ext_flags {
 	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
 	PAGE_EXT_DEBUG_GUARD,
 	PAGE_EXT_OWNER,
+	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
+	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	PAGE_EXT_YOUNG,
 	PAGE_EXT_IDLE,
@@ -48,6 +50,11 @@ struct page_ext {
 	int last_migrate_reason;
 	depot_stack_handle_t handle;
 #endif
+#ifdef CONFIG_XPFO
+	int inited;		/* Map counter and lock initialized */
+	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
+	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
+#endif
 };
 
 extern void pgdat_page_ext_init(struct pglist_data *pgdat);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..77187578ca33
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+extern void xpfo_kmap(void *kaddr, struct page *page);
+extern void xpfo_kunmap(void *kaddr, struct page *page);
+extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
+extern void xpfo_free_page(struct page *page, int order);
+
+extern bool xpfo_page_is_unmapped(struct page *page);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_page(struct page *page, int order) { }
+
+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 22e13a0e19d7..455eff44604e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 {
 	unsigned long pfn = PFN_DOWN(orig_addr);
 	unsigned char *vaddr = phys_to_virt(tlb_addr);
+	struct page *page = pfn_to_page(pfn);
 
-	if (PageHighMem(pfn_to_page(pfn))) {
+	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
 		/* The buffer does not have a mapping.  Map it in and copy */
 		unsigned int offset = orig_addr & ~PAGE_MASK;
 		char *buffer;
diff --git a/mm/Makefile b/mm/Makefile
index 2ca1faf3fa09..e6f8894423da 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -103,3 +103,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fbe73a6fe4b..0241c8a7e72a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1029,6 +1029,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	kernel_poison_pages(page, 1 << order, 0);
 	kernel_map_pages(page, 1 << order, 0);
 	kasan_free_pages(page, order);
+	xpfo_free_page(page, order);
 
 	return true;
 }
@@ -1726,6 +1727,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_page(page, order, gfp_flags);
 	set_page_owner(page, order, gfp_flags);
 }
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 44a4c029c8e7..1cd7d7f460cc 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -7,6 +7,7 @@
 #include <linux/kmemleak.h>
 #include <linux/page_owner.h>
 #include <linux/page_idle.h>
+#include <linux/xpfo.h>
 
 /*
  * struct page extension
@@ -63,6 +64,9 @@ static struct page_ext_operations *page_ext_ops[] = {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
+#ifdef CONFIG_XPFO
+	&page_xpfo_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..ddb1be05485d
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,205 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool need_xpfo(void)
+{
+	return true;
+}
+
+static void init_xpfo(void)
+{
+	printk(KERN_INFO "XPFO enabled\n");
+	static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+	.need = need_xpfo,
+	.init = init_xpfo,
+};
+
+/*
+ * Update a single kernel page table entry
+ */
+static inline void set_kpte(struct page *page, unsigned long kaddr,
+			    pgprot_t prot) {
+	unsigned int level;
+	pte_t *kpte = lookup_address(kaddr, &level);
+
+	/* We only support 4k pages for now */
+	BUG_ON(!kpte || level != PG_LEVEL_4K);
+
+	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+}
+
+void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
+{
+	int i, flush_tlb = 0;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++)  {
+		page_ext = lookup_page_ext(page + i);
+
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+
+		/* Initialize the map lock and map counter */
+		if (!page_ext->inited) {
+			spin_lock_init(&page_ext->maplock);
+			atomic_set(&page_ext->mapcount, 0);
+			page_ext->inited = 1;
+		}
+		BUG_ON(atomic_read(&page_ext->mapcount));
+
+		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			/*
+			 * Flush the TLB if the page was previously allocated
+			 * to the kernel.
+			 */
+			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
+					       &page_ext->flags))
+				flush_tlb = 1;
+		} else {
+			/* Tag the page as a kernel page */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+		}
+	}
+
+	if (flush_tlb) {
+		kaddr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
+				       PAGE_SIZE);
+	}
+}
+
+void xpfo_free_page(struct page *page, int order)
+{
+	int i;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		page_ext = lookup_page_ext(page + i);
+
+		if (!page_ext->inited) {
+			/*
+			 * The page was allocated before page_ext was
+			 * initialized, so it is a kernel page and it needs to
+			 * be tagged accordingly.
+			 */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+			continue;
+		}
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
+				       &page_ext->flags)) {
+			kaddr = (unsigned long)page_address(page + i);
+			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
+		}
+	}
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page was previously allocated to user space, so map it back
+	 * into the kernel. No TLB flush required.
+	 */
+	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
+	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
+		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from the
+	 * kernel, flush the TLB and tag it as a user page.
+	 */
+	if (atomic_dec_return(&page_ext->mapcount) == 0) {
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
+		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
+		__flush_tlb_one((unsigned long)kaddr);
+	}
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
+
+inline bool xpfo_page_is_unmapped(struct page *page)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
+}
diff --git a/security/Kconfig b/security/Kconfig
index da10d9b573a4..1eac37a9bec2 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,26 @@ menu "Security options"
 
 source security/keys/Kconfig
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	default n
+	depends on DEBUG_KERNEL && ARCH_SUPPORTS_XPFO
+	select DEBUG_TLBFLUSH
+	select PAGE_EXTENSION
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
-- 
2.9.3

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC PATCH v2 2/3] xpfo: Only put previous userspace pages into the hot cache
  2016-09-14  7:18     ` Juerg Haefliger
  (?)
@ 2016-09-14  7:19       ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-14  7:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

Allocating a page to userspace that was previously allocated to the
kernel requires an expensive TLB shootdown. To minimize this, we only
put non-kernel pages into the hot cache to favor their allocation.

Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 include/linux/xpfo.h | 2 ++
 mm/page_alloc.c      | 8 +++++++-
 mm/xpfo.c            | 8 ++++++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 77187578ca33..077d1cfadfa2 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -24,6 +24,7 @@ extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
 extern void xpfo_free_page(struct page *page, int order);
 
 extern bool xpfo_page_is_unmapped(struct page *page);
+extern bool xpfo_page_is_kernel(struct page *page);
 
 #else /* !CONFIG_XPFO */
 
@@ -33,6 +34,7 @@ static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
 static inline void xpfo_free_page(struct page *page, int order) { }
 
 static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+static inline bool xpfo_page_is_kernel(struct page *page) { return false; }
 
 #endif /* CONFIG_XPFO */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0241c8a7e72a..83404b41e52d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2421,7 +2421,13 @@ void free_hot_cold_page(struct page *page, bool cold)
 	}
 
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
-	if (!cold)
+	/*
+	 * XPFO: Allocating a page to userspace that was previously allocated
+	 * to the kernel requires an expensive TLB shootdown. To minimize this,
+	 * we only put non-kernel pages into the hot cache to favor their
+	 * allocation.
+	 */
+	if (!cold && !xpfo_page_is_kernel(page))
 		list_add(&page->lru, &pcp->lists[migratetype]);
 	else
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
diff --git a/mm/xpfo.c b/mm/xpfo.c
index ddb1be05485d..f8dffda0c961 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -203,3 +203,11 @@ inline bool xpfo_page_is_unmapped(struct page *page)
 
 	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
 }
+
+inline bool xpfo_page_is_kernel(struct page *page)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	return test_bit(PAGE_EXT_XPFO_KERNEL, &lookup_page_ext(page)->flags);
+}
-- 
2.9.3

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC PATCH v2 3/3] block: Always use a bounce buffer when XPFO is enabled
  2016-09-14  7:18     ` Juerg Haefliger
  (?)
@ 2016-09-14  7:19       ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-14  7:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: juerg.haefliger, vpk

This is a temporary hack to prevent the use of bio_map_user_iov(),
which causes XPFO page faults.

Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 block/blk-map.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index b8657fa8dc9a..e889dbfee6fb 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -52,7 +52,7 @@ static int __blk_rq_map_user_iov(struct request *rq,
 	struct bio *bio, *orig_bio;
 	int ret;
 
-	if (copy)
+	if (copy || IS_ENABLED(CONFIG_XPFO))
 		bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
 	else
 		bio = bio_map_user_iov(q, iter, gfp_mask);
-- 
2.9.3

^ permalink raw reply	[flat|nested] 93+ messages in thread
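The one-line change above forces the bounce-buffer path whenever CONFIG_XPFO is built in. A hedged userspace sketch of that compile-time gating follows; CONFIG_XPFO_MODEL is an invented stand-in for the kernel's IS_ENABLED(CONFIG_XPFO), and the enum is illustrative only:

```c
#include <stdbool.h>

/* Hypothetical model of the blk-map change: with the XPFO config enabled,
 * always copy user data through kernel-owned bounce pages instead of
 * mapping user pages directly into the bio. */

#ifndef CONFIG_XPFO_MODEL
#define CONFIG_XPFO_MODEL 1	/* build with -DCONFIG_XPFO_MODEL=0 to disable */
#endif

enum bio_path { BIO_MAP_USER, BIO_COPY_USER };

static enum bio_path blk_rq_choose_path(bool copy)
{
	/* Mirrors: if (copy || IS_ENABLED(CONFIG_XPFO)) use the copy path */
	if (copy || CONFIG_XPFO_MODEL)
		return BIO_COPY_USER;	/* bounce buffer: kernel pages only */
	return BIO_MAP_USER;		/* direct map of user pages (faults
					 * under XPFO, hence the hack) */
}
```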

* Re: [RFC PATCH v2 0/3] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-09-14  7:18     ` Juerg Haefliger
@ 2016-09-14  7:23       ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-14  7:23 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64; +Cc: vpk


[-- Attachment #1.1: Type: text/plain, Size: 2720 bytes --]

Resending to include the kernel-hardening list. Sorry, I wasn't subscribed with the correct email
address when I sent this the first time.

...Juerg

On 09/14/2016 09:18 AM, Juerg Haefliger wrote:
> Changes from:
>   v1 -> v2:
>     - Moved the code from arch/x86/mm/ to mm/ since it's (mostly)
>       arch-agnostic.
>     - Moved the config to the generic layer and added ARCH_SUPPORTS_XPFO
>       for x86.
>     - Use page_ext for the additional per-page data.
>     - Removed the clearing of pages. This can be accomplished by using
>       PAGE_POISONING.
>     - Split up the patch into multiple patches.
>     - Fixed additional issues identified by reviewers.
> 
> This patch series adds support for XPFO which protects against 'ret2dir'
> kernel attacks. The basic idea is to enforce exclusive ownership of page
> frames by either the kernel or userspace, unless explicitly requested by
> the kernel. Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.
> 
> Additional fields in the page_ext struct are used for XPFO housekeeping.
> Specifically two flags to distinguish user vs. kernel pages and to tag
> unmapped pages and a reference counter to balance kmap/kunmap operations
> and a lock to serialize access to the XPFO fields.
> 
> Known issues/limitations:
>   - Only supports x86-64 (for now)
>   - Only supports 4k pages (for now)
>   - There are most likely some legitimate use cases where the kernel needs
>     to access userspace which need to be made XPFO-aware
>   - Performance penalty
> 
> Reference paper by the original patch authors:
>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
> 
> Juerg Haefliger (3):
>   Add support for eXclusive Page Frame Ownership (XPFO)
>   xpfo: Only put previous userspace pages into the hot cache
>   block: Always use a bounce buffer when XPFO is enabled
> 
>  arch/x86/Kconfig         |   3 +-
>  arch/x86/mm/init.c       |   2 +-
>  block/blk-map.c          |   2 +-
>  include/linux/highmem.h  |  15 +++-
>  include/linux/page_ext.h |   7 ++
>  include/linux/xpfo.h     |  41 +++++++++
>  lib/swiotlb.c            |   3 +-
>  mm/Makefile              |   1 +
>  mm/page_alloc.c          |  10 ++-
>  mm/page_ext.c            |   4 +
>  mm/xpfo.c                | 213 +++++++++++++++++++++++++++++++++++++++++++++++
>  security/Kconfig         |  20 +++++
>  12 files changed, 314 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/xpfo.h
>  create mode 100644 mm/xpfo.c
> 


-- 
Juerg Haefliger
Hewlett Packard Enterprise


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 93+ messages in thread
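The life cycle the cover letter describes (unmap on allocation to userspace, temporary remap via balanced kmap/kunmap, remap on reclaim) can be sketched as a small state model. This is a hypothetical illustration only; no real page tables are touched and all names are invented:

```c
#include <stdbool.h>
#include <assert.h>

/* Hypothetical model of the XPFO page life cycle from the cover letter. */
struct xpfo_state {
	bool user_page;		/* currently allocated for userspace? */
	bool mapped;		/* currently present in physmap? */
	unsigned int kmap_count;	/* balances kmap/kunmap operations */
};

static void xpfo_alloc_for_user(struct xpfo_state *s)
{
	s->user_page = true;
	s->mapped = false;	/* unmapped from physmap on user allocation */
	s->kmap_count = 0;
}

static void xpfo_kmap(struct xpfo_state *s)
{
	if (s->user_page && s->kmap_count++ == 0)
		s->mapped = true;	/* first kernel use: map back */
}

static void xpfo_kunmap(struct xpfo_state *s)
{
	assert(s->kmap_count > 0);
	if (s->user_page && --s->kmap_count == 0)
		s->mapped = false;	/* last kernel use gone: unmap again */
}

static void xpfo_reclaim_to_kernel(struct xpfo_state *s)
{
	s->user_page = false;
	s->mapped = true;	/* reclaimed from userspace: back in physmap */
}
```

The reference counter is what lets nested kmap() calls (as the page_ext fields in the series track) keep the page mapped until the outermost kunmap().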

* Re: [RFC PATCH v2 3/3] block: Always use a bounce buffer when XPFO is enabled
  2016-09-14  7:19       ` Juerg Haefliger
  (?)
@ 2016-09-14  7:33         ` Christoph Hellwig
  -1 siblings, 0 replies; 93+ messages in thread
From: Christoph Hellwig @ 2016-09-14  7:33 UTC (permalink / raw)
  To: Juerg Haefliger
  Cc: linux-kernel, linux-mm, kernel-hardening, linux-x86_64, vpk

On Wed, Sep 14, 2016 at 09:19:01AM +0200, Juerg Haefliger wrote:
> This is a temporary hack to prevent the use of bio_map_user_iov()
> which causes XPFO page faults.
> 
> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>

Sorry, but if your scheme doesn't support get_user_pages access to
user memory, it's a steaming pile of crap and entirely unacceptable.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [kernel-hardening] [RFC PATCH v2 0/3] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-09-14  7:18     ` Juerg Haefliger
@ 2016-09-14  9:36       ` Mark Rutland
  -1 siblings, 0 replies; 93+ messages in thread
From: Mark Rutland @ 2016-09-14  9:36 UTC (permalink / raw)
  To: kernel-hardening
  Cc: linux-kernel, linux-mm, linux-x86_64, juerg.haefliger, vpk

Hi,

On Wed, Sep 14, 2016 at 09:18:58AM +0200, Juerg Haefliger wrote:

> This patch series adds support for XPFO which protects against 'ret2dir'
> kernel attacks. The basic idea is to enforce exclusive ownership of page
> frames by either the kernel or userspace, unless explicitly requested by
> the kernel. Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.

> Known issues/limitations:
>   - Only supports x86-64 (for now)
>   - Only supports 4k pages (for now)
>   - There are most likely some legitimate use cases where the kernel needs
>     to access userspace which need to be made XPFO-aware
>   - Performance penalty
> 
> Reference paper by the original patch authors:
>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Just to check, doesn't DEBUG_RODATA ensure that the linear mapping is
non-executable on x86_64 (as it does for arm64)?

For both arm64 and x86_64, DEBUG_RODATA is mandatory (or soon to be so).
Assuming that implies a lack of execute permission for x86_64, that
should provide a similar level of protection against erroneously
branching to addresses in the linear map, without the complexity and
overhead of mapping/unmapping pages.

So to me it looks like this approach may only be useful for
architectures without page-granular execute permission controls.

Is this also intended to protect against erroneous *data* accesses to
the linear map?

Am I missing something?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [kernel-hardening] [RFC PATCH v2 0/3] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-09-14  9:36       ` Mark Rutland
@ 2016-09-14  9:49         ` Mark Rutland
  -1 siblings, 0 replies; 93+ messages in thread
From: Mark Rutland @ 2016-09-14  9:49 UTC (permalink / raw)
  To: kernel-hardening
  Cc: linux-kernel, linux-mm, linux-x86_64, juerg.haefliger, vpk

On Wed, Sep 14, 2016 at 10:36:34AM +0100, Mark Rutland wrote:
> On Wed, Sep 14, 2016 at 09:18:58AM +0200, Juerg Haefliger wrote:
> > This patch series adds support for XPFO which protects against 'ret2dir'
> > kernel attacks. The basic idea is to enforce exclusive ownership of page
> > frames by either the kernel or userspace, unless explicitly requested by
> > the kernel. Whenever a page destined for userspace is allocated, it is
> > unmapped from physmap (the kernel's page table). When such a page is
> > reclaimed from userspace, it is mapped back to physmap.

> > Reference paper by the original patch authors:
> >   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

> For both arm64 and x86_64, DEBUG_RODATA is mandatory (or soon to be so).
> Assuming that implies a lack of execute permission for x86_64, that
> should provide a similar level of protection against erroneously
> branching to addresses in the linear map, without the complexity and
> overhead of mapping/unmapping pages.
> 
> So to me it looks like this approach may only be useful for
> architectures without page-granular execute permission controls.
> 
> Is this also intended to protect against erroneous *data* accesses to
> the linear map?

Now that I read the paper more carefully, I can see that this is the
case, and this does catch issues which DEBUG_RODATA cannot.

Apologies for the noise.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [kernel-hardening] [RFC PATCH v2 2/3] xpfo: Only put previous userspace pages into the hot cache
  2016-09-14  7:19       ` Juerg Haefliger
@ 2016-09-14 14:33         ` Dave Hansen
  -1 siblings, 0 replies; 93+ messages in thread
From: Dave Hansen @ 2016-09-14 14:33 UTC (permalink / raw)
  To: kernel-hardening, linux-kernel, linux-mm, linux-x86_64
  Cc: juerg.haefliger, vpk

On 09/14/2016 12:19 AM, Juerg Haefliger wrote:
> Allocating a page to userspace that was previously allocated to the
> kernel requires an expensive TLB shootdown. To minimize this, we only
> put non-kernel pages into the hot cache to favor their allocation.

Hi, I had some questions about this the last time you posted it.  Maybe
you want to address them now.

--

But kernel allocations do allocate from these pools, right?  Does this
just mean that kernel allocations usually have to pay the penalty to
convert a page?

So, what's the logic here?  You're assuming that order-0 kernel
allocations are more rare than allocations for userspace?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [kernel-hardening] [RFC PATCH v2 2/3] xpfo: Only put previous userspace pages into the hot cache
  2016-09-14 14:33         ` Dave Hansen
  (?)
@ 2016-09-14 14:40         ` Juerg Haefliger
  2016-09-14 14:48             ` Dave Hansen
  -1 siblings, 1 reply; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-14 14:40 UTC (permalink / raw)
  To: Dave Hansen, kernel-hardening, linux-kernel, linux-mm, linux-x86_64; +Cc: vpk


[-- Attachment #1.1: Type: text/plain, Size: 817 bytes --]

Hi Dave,

On 09/14/2016 04:33 PM, Dave Hansen wrote:
> On 09/14/2016 12:19 AM, Juerg Haefliger wrote:
>> Allocating a page to userspace that was previously allocated to the
>> kernel requires an expensive TLB shootdown. To minimize this, we only
>> put non-kernel pages into the hot cache to favor their allocation.
> 
> Hi, I had some questions about this the last time you posted it.  Maybe
> you want to address them now.

I did reply: https://lkml.org/lkml/2016/9/5/249

...Juerg


> --
> 
> But kernel allocations do allocate from these pools, right?  Does this
> just mean that kernel allocations usually have to pay the penalty to
> convert a page?
> 
> So, what's the logic here?  You're assuming that order-0 kernel
> allocations are more rare than allocations for userspace?
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [kernel-hardening] [RFC PATCH v2 2/3] xpfo: Only put previous userspace pages into the hot cache
  2016-09-14 14:40         ` Juerg Haefliger
@ 2016-09-14 14:48             ` Dave Hansen
  0 siblings, 0 replies; 93+ messages in thread
From: Dave Hansen @ 2016-09-14 14:48 UTC (permalink / raw)
  To: Juerg Haefliger, kernel-hardening, linux-kernel, linux-mm, linux-x86_64
  Cc: vpk

> On 09/02/2016 10:39 PM, Dave Hansen wrote:
>> On 09/02/2016 04:39 AM, Juerg Haefliger wrote:
>> Does this
>> just mean that kernel allocations usually have to pay the penalty to
>> convert a page?
> 
> Only pages that are allocated for userspace (gfp & GFP_HIGHUSER == GFP_HIGHUSER) which were
> previously allocated for the kernel (gfp & GFP_HIGHUSER != GFP_HIGHUSER) have to pay the penalty.
> 
>> So, what's the logic here?  You're assuming that order-0 kernel
>> allocations are more rare than allocations for userspace?
> 
> The logic is to put reclaimed kernel pages into the cold cache to
> postpone their allocation as long as possible to minimize (potential)
> TLB flushes.

OK, but if we put them in the cold area but kernel allocations pull them
from the hot cache, aren't we virtually guaranteeing that kernel
allocations will have to do a TLB shootdown to convert a page?

It seems like you also need to convert all kernel allocations to pull
from the cold area.

^ permalink raw reply	[flat|nested] 93+ messages in thread
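The penalty condition debated in the exchange above (only a previously-kernel page being handed to userspace pays the shootdown) reduces to a simple predicate on the old and new GFP flags. A hedged sketch, using invented stand-in bit values rather than the kernel's real gfp_t constants:

```c
#include <stdbool.h>

/* Hypothetical model of the condition from the thread:
 * user allocation   <=> (gfp & GFP_HIGHUSER) == GFP_HIGHUSER
 * shootdown needed  <=> previous owner was the kernel, new owner is user. */

#define GFP_HIGHMEM_M	(1u << 0)	/* stand-in bits, not real gfp_t */
#define GFP_USER_M	(1u << 1)
#define GFP_HIGHUSER_M	(GFP_HIGHMEM_M | GFP_USER_M)

static bool gfp_is_user(unsigned int gfp)
{
	return (gfp & GFP_HIGHUSER_M) == GFP_HIGHUSER_M;
}

static bool xpfo_needs_shootdown(unsigned int prev_gfp, unsigned int new_gfp)
{
	/* Kernel-to-kernel and user-to-user reuse are free; only a
	 * kernel-to-user transition must flush the stale physmap entry. */
	return !gfp_is_user(prev_gfp) && gfp_is_user(new_gfp);
}
```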

* Re: [kernel-hardening] [RFC PATCH v2 2/3] xpfo: Only put previous userspace pages into the hot cache
  2016-09-14 14:48             ` Dave Hansen
  (?)
@ 2016-09-21  5:32             ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-09-21  5:32 UTC (permalink / raw)
  To: Dave Hansen, kernel-hardening, linux-kernel, linux-mm, linux-x86_64; +Cc: vpk


[-- Attachment #1.1: Type: text/plain, Size: 1411 bytes --]

On 09/14/2016 04:48 PM, Dave Hansen wrote:
>> On 09/02/2016 10:39 PM, Dave Hansen wrote:
>>> On 09/02/2016 04:39 AM, Juerg Haefliger wrote:
>>> Does this
>>> just mean that kernel allocations usually have to pay the penalty to
>>> convert a page?
>>
>> Only pages that are allocated for userspace (gfp & GFP_HIGHUSER == GFP_HIGHUSER) which were
>> previously allocated for the kernel (gfp & GFP_HIGHUSER != GFP_HIGHUSER) have to pay the penalty.
>>
>>> So, what's the logic here?  You're assuming that order-0 kernel
>>> allocations are more rare than allocations for userspace?
>>
>> The logic is to put reclaimed kernel pages into the cold cache to
>> postpone their allocation as long as possible to minimize (potential)
>> TLB flushes.
> 
> OK, but if we put them in the cold area but kernel allocations pull them
> from the hot cache, aren't we virtually guaranteeing that kernel
> allocations will have to do a TLB shootdown to convert a page?

No. Allocations for the kernel never require a TLB shootdown. Only allocations for userspace (and
only if the page was previously a kernel page).


> It seems like you also need to convert all kernel allocations to pull
> from the cold area.

Kernel allocations can continue to pull from the hot cache. Maybe introduce another cache for the
userspace pages? But I'm not sure what other implications this might have.

...Juerg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC PATCH v3 0/2] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-09-14  7:18     ` Juerg Haefliger
  (?)
@ 2016-11-04 14:45       ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-11-04 14:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: vpk, juerg.haefliger

Changes from:
  v2 -> v3:
    - Removed 'depends on DEBUG_KERNEL' and 'select DEBUG_TLBFLUSH'.
      These are left-overs from the original patch and are not required.
    - Make libata XPFO-aware, i.e., properly handle pages that were
      unmapped by XPFO. This takes care of the temporary hack in v2 that
      forced the use of a bounce buffer in block/blk-map.c.
  v1 -> v2:
    - Moved the code from arch/x86/mm/ to mm/ since it's (mostly)
      arch-agnostic.
    - Moved the config to the generic layer and added ARCH_SUPPORTS_XPFO
      for x86.
    - Use page_ext for the additional per-page data.
    - Removed the clearing of pages. This can be accomplished by using
      PAGE_POISONING.
    - Split up the patch into multiple patches.
    - Fixed additional issues identified by reviewers.

This patch series adds support for XPFO which protects against 'ret2dir'
kernel attacks. The basic idea is to enforce exclusive ownership of page
frames by either the kernel or userspace, unless explicitly requested by
the kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (removed from the kernel's page table). When such a
page is reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping.
Specifically two flags to distinguish user vs. kernel pages and to tag
unmapped pages and a reference counter to balance kmap/kunmap operations
and a lock to serialize access to the XPFO fields.

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel needs
    to access userspace which need to be made XPFO-aware
  - Performance penalty

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Juerg Haefliger (2):
  Add support for eXclusive Page Frame Ownership (XPFO)
  xpfo: Only put previous userspace pages into the hot cache

 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 drivers/ata/libata-sff.c |   4 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  41 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |  10 ++-
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 214 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  19 +++++
 12 files changed, 315 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

-- 
2.10.1

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC PATCH v3 0/2] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-11-04 14:45       ` Juerg Haefliger
  0 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-11-04 14:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: vpk, juerg.haefliger

Changes from:
  v2 -> v3:
    - Removed 'depends on DEBUG_KERNEL' and 'select DEBUG_TLBFLUSH'.
      These are leftovers from the original patch and are not required.
    - Make libata XPFO-aware, i.e., properly handle pages that were
      unmapped by XPFO. This takes care of the temporary hack in v2 that
      forced the use of a bounce buffer in block/blk-map.c.
  v1 -> v2:
    - Moved the code from arch/x86/mm/ to mm/ since it's (mostly)
      arch-agnostic.
    - Moved the config to the generic layer and added ARCH_SUPPORTS_XPFO
      for x86.
    - Use page_ext for the additional per-page data.
    - Removed the clearing of pages. This can be accomplished by using
      PAGE_POISONING.
    - Split up the patch into multiple patches.
    - Fixed additional issues identified by reviewers.

This patch series adds support for XPFO which protects against 'ret2dir'
kernel attacks. The basic idea is to enforce exclusive ownership of page
frames by either the kernel or userspace, unless explicitly requested by
the kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (removed from the kernel's page table). When such a
page is reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping:
two flags to distinguish user vs. kernel pages and to tag unmapped pages,
a reference counter to balance kmap/kunmap operations, and a lock to
serialize access to the XPFO fields.

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel
    needs to access userspace; those need to be identified and made
    XPFO-aware
  - Performance penalty (per the reference paper below, in the 1-3%
    ballpark)

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Juerg Haefliger (2):
  Add support for eXclusive Page Frame Ownership (XPFO)
  xpfo: Only put previous userspace pages into the hot cache

 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 drivers/ata/libata-sff.c |   4 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  41 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |  10 ++-
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 214 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  19 +++++
 12 files changed, 315 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

-- 
2.10.1



* [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-11-04 14:45       ` Juerg Haefliger
  (?)
@ 2016-11-04 14:45         ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-11-04 14:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: vpk, juerg.haefliger

This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's page table). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping:
two flags to distinguish user vs. kernel pages and to tag unmapped pages,
a reference counter to balance kmap/kunmap operations, and a lock to
serialize access to the XPFO fields.

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel
    needs to access userspace; those need to be identified and made
    XPFO-aware
  - Performance penalty (per the reference paper below, in the 1-3%
    ballpark)

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 drivers/ata/libata-sff.c |   4 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  39 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |   2 +
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 206 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  19 +++++
 12 files changed, 298 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bada636d1065..38b334f8fde5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -165,6 +165,7 @@ config X86
 	select HAVE_STACK_VALIDATION		if X86_64
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_SUPPORTS_XPFO		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
@@ -1361,7 +1362,7 @@ config ARCH_DMA_ADDR_T_64BIT
 
 config X86_DIRECT_GBPAGES
 	def_bool y
-	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
+	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
 	---help---
 	  Certain kernel features effectively disable kernel
 	  linear 1 GB mappings (even if the CPU otherwise
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 22af912d66d2..a6fafbae02bb 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -161,7 +161,7 @@ static int page_size_mask;
 
 static void __init probe_page_size_mask(void)
 {
-#if !defined(CONFIG_KMEMCHECK)
+#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
 	/*
 	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
 	 * use small pages.
diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
index 051b6158d1b7..58af734be25d 100644
--- a/drivers/ata/libata-sff.c
+++ b/drivers/ata/libata-sff.c
@@ -715,7 +715,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
 
 	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
 
-	if (PageHighMem(page)) {
+	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
 		unsigned long flags;
 
 		/* FIXME: use a bounce buffer */
@@ -860,7 +860,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
 
 	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
 
-	if (PageHighMem(page)) {
+	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
 		unsigned long flags;
 
 		/* FIXME: use bounce buffer */
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f3297062a..7a17c166532f 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -7,6 +7,7 @@
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 
@@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
 #ifndef ARCH_HAS_KMAP
 static inline void *kmap(struct page *page)
 {
+	void *kaddr;
+
 	might_sleep();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 
 static inline void kunmap(struct page *page)
 {
+	xpfo_kunmap(page_address(page), page);
 }
 
 static inline void *kmap_atomic(struct page *page)
 {
+	void *kaddr;
+
 	preempt_disable();
 	pagefault_disable();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
 
 static inline void __kunmap_atomic(void *addr)
 {
+	xpfo_kunmap(addr, virt_to_page(addr));
 	pagefault_enable();
 	preempt_enable();
 }
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 9298c393ddaa..0e451a42e5a3 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -29,6 +29,8 @@ enum page_ext_flags {
 	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
 	PAGE_EXT_DEBUG_GUARD,
 	PAGE_EXT_OWNER,
+	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
+	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	PAGE_EXT_YOUNG,
 	PAGE_EXT_IDLE,
@@ -44,6 +46,11 @@ enum page_ext_flags {
  */
 struct page_ext {
 	unsigned long flags;
+#ifdef CONFIG_XPFO
+	int inited;		/* Map counter and lock initialized */
+	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
+	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
+#endif
 };
 
 extern void pgdat_page_ext_init(struct pglist_data *pgdat);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..77187578ca33
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+extern void xpfo_kmap(void *kaddr, struct page *page);
+extern void xpfo_kunmap(void *kaddr, struct page *page);
+extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
+extern void xpfo_free_page(struct page *page, int order);
+
+extern bool xpfo_page_is_unmapped(struct page *page);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_page(struct page *page, int order) { }
+
+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 22e13a0e19d7..455eff44604e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 {
 	unsigned long pfn = PFN_DOWN(orig_addr);
 	unsigned char *vaddr = phys_to_virt(tlb_addr);
+	struct page *page = pfn_to_page(pfn);
 
-	if (PageHighMem(pfn_to_page(pfn))) {
+	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
 		/* The buffer does not have a mapping.  Map it in and copy */
 		unsigned int offset = orig_addr & ~PAGE_MASK;
 		char *buffer;
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a9f76b..175680f516aa 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fd42aa7c4bd..100e80e008e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1045,6 +1045,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	kernel_poison_pages(page, 1 << order, 0);
 	kernel_map_pages(page, 1 << order, 0);
 	kasan_free_pages(page, order);
+	xpfo_free_page(page, order);
 
 	return true;
 }
@@ -1745,6 +1746,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_page(page, order, gfp_flags);
 	set_page_owner(page, order, gfp_flags);
 }
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 121dcffc4ec1..ba6dbcacc2db 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -7,6 +7,7 @@
 #include <linux/kmemleak.h>
 #include <linux/page_owner.h>
 #include <linux/page_idle.h>
+#include <linux/xpfo.h>
 
 /*
  * struct page extension
@@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
+#ifdef CONFIG_XPFO
+	&page_xpfo_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..8e3a6a694b6a
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,206 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool need_xpfo(void)
+{
+	return true;
+}
+
+static void init_xpfo(void)
+{
+	printk(KERN_INFO "XPFO enabled\n");
+	static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+	.need = need_xpfo,
+	.init = init_xpfo,
+};
+
+/*
+ * Update a single kernel page table entry
+ */
+static inline void set_kpte(struct page *page, unsigned long kaddr,
+			    pgprot_t prot) {
+	unsigned int level;
+	pte_t *kpte = lookup_address(kaddr, &level);
+
+	/* We only support 4k pages for now */
+	BUG_ON(!kpte || level != PG_LEVEL_4K);
+
+	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+}
+
+void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
+{
+	int i, flush_tlb = 0;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++)  {
+		page_ext = lookup_page_ext(page + i);
+
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+
+		/* Initialize the map lock and map counter */
+		if (!page_ext->inited) {
+			spin_lock_init(&page_ext->maplock);
+			atomic_set(&page_ext->mapcount, 0);
+			page_ext->inited = 1;
+		}
+		BUG_ON(atomic_read(&page_ext->mapcount));
+
+		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			/*
+			 * Flush the TLB if the page was previously allocated
+			 * to the kernel.
+			 */
+			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
+					       &page_ext->flags))
+				flush_tlb = 1;
+		} else {
+			/* Tag the page as a kernel page */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+		}
+	}
+
+	if (flush_tlb) {
+		kaddr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
+				       PAGE_SIZE);
+	}
+}
+
+void xpfo_free_page(struct page *page, int order)
+{
+	int i;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		page_ext = lookup_page_ext(page + i);
+
+		if (!page_ext->inited) {
+			/*
+			 * The page was allocated before page_ext was
+			 * initialized, so it is a kernel page and it needs to
+			 * be tagged accordingly.
+			 */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+			continue;
+		}
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
+				       &page_ext->flags)) {
+			kaddr = (unsigned long)page_address(page + i);
+			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
+		}
+	}
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page was previously allocated to user space, so map it back
+	 * into the kernel. No TLB flush required.
+	 */
+	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
+	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
+		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from the
+	 * kernel, flush the TLB and tag it as a user page.
+	 */
+	if (atomic_dec_return(&page_ext->mapcount) == 0) {
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
+		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
+		__flush_tlb_one((unsigned long)kaddr);
+	}
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
+
+inline bool xpfo_page_is_unmapped(struct page *page)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
+}
+EXPORT_SYMBOL(xpfo_page_is_unmapped);
diff --git a/security/Kconfig b/security/Kconfig
index 118f4549404e..4502e15c8419 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,25 @@ menu "Security options"
 
 source security/keys/Kconfig
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	default n
+	depends on ARCH_SUPPORTS_XPFO
+	select PAGE_EXTENSION
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
-- 
2.10.1


+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+		}
+	}
+
+	if (flush_tlb) {
+		kaddr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
+				       PAGE_SIZE);
+	}
+}
+
+void xpfo_free_page(struct page *page, int order)
+{
+	int i;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		page_ext = lookup_page_ext(page + i);
+
+		if (!page_ext->inited) {
+			/*
+			 * The page was allocated before page_ext was
+			 * initialized, so it is a kernel page and it needs to
+			 * be tagged accordingly.
+			 */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+			continue;
+		}
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
+				       &page_ext->flags)) {
+			kaddr = (unsigned long)page_address(page + i);
+			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
+		}
+	}
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page was previously allocated to user space, so map it back
+	 * into the kernel. No TLB flush required.
+	 */
+	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
+	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
+		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from the
+	 * kernel, flush the TLB and tag it as a user page.
+	 */
+	if (atomic_dec_return(&page_ext->mapcount) == 0) {
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
+		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
+		__flush_tlb_one((unsigned long)kaddr);
+	}
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
+
+inline bool xpfo_page_is_unmapped(struct page *page)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
+}
+EXPORT_SYMBOL(xpfo_page_is_unmapped);
diff --git a/security/Kconfig b/security/Kconfig
index 118f4549404e..4502e15c8419 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,25 @@ menu "Security options"
 
 source security/keys/Kconfig
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	default n
+	depends on ARCH_SUPPORTS_XPFO
+	select PAGE_EXTENSION
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
-- 
2.10.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [kernel-hardening] [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-11-04 14:45         ` Juerg Haefliger
  0 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-11-04 14:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: vpk, juerg.haefliger

This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's page table). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping:
two flags to distinguish user vs. kernel pages and to tag unmapped
pages, a reference counter to balance kmap/kunmap operations, and a
lock to serialize access to the XPFO fields.
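
The per-page accounting this housekeeping enables can be sketched as a
small standalone C model (hypothetical names; the spinlock, the actual
PTE updates and the TLB flushes of the real patch are omitted here):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the XPFO state kept in struct page_ext:
 * a "kernel" flag, an "unmapped" flag and a map counter. */
struct xpfo_state {
	bool kernel;    /* page is owned by the kernel */
	bool unmapped;  /* page is currently absent from physmap */
	int mapcount;   /* outstanding kmap()s of this page */
};

/* kmap: the first mapper of a user page restores its physmap entry */
static void xpfo_kmap_model(struct xpfo_state *s)
{
	if (s->kernel)
		return;              /* kernel pages are always mapped */
	if (++s->mapcount == 1)
		s->unmapped = false; /* set_kpte(..., __PAGE_KERNEL) */
}

/* kunmap: the last unmapper removes the physmap entry again */
static void xpfo_kunmap_model(struct xpfo_state *s)
{
	if (s->kernel)
		return;
	if (--s->mapcount == 0)
		s->unmapped = true;  /* set_kpte(..., 0) + TLB flush */
}
```

Nested kmap/kunmap pairs thus leave the page mapped until the final
kunmap, which is why the counter (not just the flag) is needed.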

Known issues/limitations:
  - Only supports x86-64 (for now)
  - Only supports 4k pages (for now)
  - There are most likely some legitimate use cases where the kernel needs
    to access userspace; those need to be identified and made XPFO-aware
  - Performance penalty (per the reference paper below, in the 1-3%
    ballpark)

Reference paper by the original patch authors:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 arch/x86/Kconfig         |   3 +-
 arch/x86/mm/init.c       |   2 +-
 drivers/ata/libata-sff.c |   4 +-
 include/linux/highmem.h  |  15 +++-
 include/linux/page_ext.h |   7 ++
 include/linux/xpfo.h     |  39 +++++++++
 lib/swiotlb.c            |   3 +-
 mm/Makefile              |   1 +
 mm/page_alloc.c          |   2 +
 mm/page_ext.c            |   4 +
 mm/xpfo.c                | 206 +++++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig         |  19 +++++
 12 files changed, 298 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bada636d1065..38b334f8fde5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -165,6 +165,7 @@ config X86
 	select HAVE_STACK_VALIDATION		if X86_64
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_SUPPORTS_XPFO		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
@@ -1361,7 +1362,7 @@ config ARCH_DMA_ADDR_T_64BIT
 
 config X86_DIRECT_GBPAGES
 	def_bool y
-	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
+	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
 	---help---
 	  Certain kernel features effectively disable kernel
 	  linear 1 GB mappings (even if the CPU otherwise
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 22af912d66d2..a6fafbae02bb 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -161,7 +161,7 @@ static int page_size_mask;
 
 static void __init probe_page_size_mask(void)
 {
-#if !defined(CONFIG_KMEMCHECK)
+#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
 	/*
 	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
 	 * use small pages.
diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
index 051b6158d1b7..58af734be25d 100644
--- a/drivers/ata/libata-sff.c
+++ b/drivers/ata/libata-sff.c
@@ -715,7 +715,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
 
 	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
 
-	if (PageHighMem(page)) {
+	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
 		unsigned long flags;
 
 		/* FIXME: use a bounce buffer */
@@ -860,7 +860,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
 
 	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
 
-	if (PageHighMem(page)) {
+	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
 		unsigned long flags;
 
 		/* FIXME: use bounce buffer */
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f3297062a..7a17c166532f 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -7,6 +7,7 @@
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 
@@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
 #ifndef ARCH_HAS_KMAP
 static inline void *kmap(struct page *page)
 {
+	void *kaddr;
+
 	might_sleep();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 
 static inline void kunmap(struct page *page)
 {
+	xpfo_kunmap(page_address(page), page);
 }
 
 static inline void *kmap_atomic(struct page *page)
 {
+	void *kaddr;
+
 	preempt_disable();
 	pagefault_disable();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
 
 static inline void __kunmap_atomic(void *addr)
 {
+	xpfo_kunmap(addr, virt_to_page(addr));
 	pagefault_enable();
 	preempt_enable();
 }
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 9298c393ddaa..0e451a42e5a3 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -29,6 +29,8 @@ enum page_ext_flags {
 	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
 	PAGE_EXT_DEBUG_GUARD,
 	PAGE_EXT_OWNER,
+	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
+	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	PAGE_EXT_YOUNG,
 	PAGE_EXT_IDLE,
@@ -44,6 +46,11 @@ enum page_ext_flags {
  */
 struct page_ext {
 	unsigned long flags;
+#ifdef CONFIG_XPFO
+	int inited;		/* Map counter and lock initialized */
+	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
+	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
+#endif
 };
 
 extern void pgdat_page_ext_init(struct pglist_data *pgdat);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..77187578ca33
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+extern void xpfo_kmap(void *kaddr, struct page *page);
+extern void xpfo_kunmap(void *kaddr, struct page *page);
+extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
+extern void xpfo_free_page(struct page *page, int order);
+
+extern bool xpfo_page_is_unmapped(struct page *page);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_page(struct page *page, int order) { }
+
+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 22e13a0e19d7..455eff44604e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 {
 	unsigned long pfn = PFN_DOWN(orig_addr);
 	unsigned char *vaddr = phys_to_virt(tlb_addr);
+	struct page *page = pfn_to_page(pfn);
 
-	if (PageHighMem(pfn_to_page(pfn))) {
+	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
 		/* The buffer does not have a mapping.  Map it in and copy */
 		unsigned int offset = orig_addr & ~PAGE_MASK;
 		char *buffer;
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a9f76b..175680f516aa 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fd42aa7c4bd..100e80e008e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1045,6 +1045,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	kernel_poison_pages(page, 1 << order, 0);
 	kernel_map_pages(page, 1 << order, 0);
 	kasan_free_pages(page, order);
+	xpfo_free_page(page, order);
 
 	return true;
 }
@@ -1745,6 +1746,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_page(page, order, gfp_flags);
 	set_page_owner(page, order, gfp_flags);
 }
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 121dcffc4ec1..ba6dbcacc2db 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -7,6 +7,7 @@
 #include <linux/kmemleak.h>
 #include <linux/page_owner.h>
 #include <linux/page_idle.h>
+#include <linux/xpfo.h>
 
 /*
  * struct page extension
@@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
+#ifdef CONFIG_XPFO
+	&page_xpfo_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..8e3a6a694b6a
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,206 @@
+/*
+ * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool need_xpfo(void)
+{
+	return true;
+}
+
+static void init_xpfo(void)
+{
+	printk(KERN_INFO "XPFO enabled\n");
+	static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+	.need = need_xpfo,
+	.init = init_xpfo,
+};
+
+/*
+ * Update a single kernel page table entry
+ */
+static inline void set_kpte(struct page *page, unsigned long kaddr,
+			    pgprot_t prot) {
+	unsigned int level;
+	pte_t *kpte = lookup_address(kaddr, &level);
+
+	/* We only support 4k pages for now */
+	BUG_ON(!kpte || level != PG_LEVEL_4K);
+
+	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+}
+
+void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
+{
+	int i, flush_tlb = 0;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++)  {
+		page_ext = lookup_page_ext(page + i);
+
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+
+		/* Initialize the map lock and map counter */
+		if (!page_ext->inited) {
+			spin_lock_init(&page_ext->maplock);
+			atomic_set(&page_ext->mapcount, 0);
+			page_ext->inited = 1;
+		}
+		BUG_ON(atomic_read(&page_ext->mapcount));
+
+		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			/*
+			 * Flush the TLB if the page was previously allocated
+			 * to the kernel.
+			 */
+			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
+					       &page_ext->flags))
+				flush_tlb = 1;
+		} else {
+			/* Tag the page as a kernel page */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+		}
+	}
+
+	if (flush_tlb) {
+		kaddr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
+				       PAGE_SIZE);
+	}
+}
+
+void xpfo_free_page(struct page *page, int order)
+{
+	int i;
+	struct page_ext *page_ext;
+	unsigned long kaddr;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		page_ext = lookup_page_ext(page + i);
+
+		if (!page_ext->inited) {
+			/*
+			 * The page was allocated before page_ext was
+			 * initialized, so it is a kernel page and it needs to
+			 * be tagged accordingly.
+			 */
+			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
+			continue;
+		}
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
+				       &page_ext->flags)) {
+			kaddr = (unsigned long)page_address(page + i);
+			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
+		}
+	}
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page was previously allocated to user space, so map it back
+	 * into the kernel. No TLB flush required.
+	 */
+	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
+	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
+		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	struct page_ext *page_ext;
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	page_ext = lookup_page_ext(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!page_ext->inited ||
+	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
+		return;
+
+	spin_lock_irqsave(&page_ext->maplock, flags);
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from the
+	 * kernel, flush the TLB and tag it as a user page.
+	 */
+	if (atomic_dec_return(&page_ext->mapcount) == 0) {
+		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
+		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
+		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
+		__flush_tlb_one((unsigned long)kaddr);
+	}
+
+	spin_unlock_irqrestore(&page_ext->maplock, flags);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
+
+inline bool xpfo_page_is_unmapped(struct page *page)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
+}
+EXPORT_SYMBOL(xpfo_page_is_unmapped);
diff --git a/security/Kconfig b/security/Kconfig
index 118f4549404e..4502e15c8419 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,25 @@ menu "Security options"
 
 source security/keys/Kconfig
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	default n
+	depends on ARCH_SUPPORTS_XPFO
+	select PAGE_EXTENSION
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
-- 
2.10.1

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC PATCH v3 2/2] xpfo: Only put previous userspace pages into the hot cache
  2016-11-04 14:45       ` Juerg Haefliger
  (?)
@ 2016-11-04 14:45         ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-11-04 14:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kernel-hardening, linux-x86_64
  Cc: vpk, juerg.haefliger

Allocating a page to userspace that was previously allocated to the
kernel requires an expensive TLB shootdown. To minimize this, we only
put non-kernel pages into the hot cache to favor their allocation.
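
The placement rule can be sketched as follows (a hypothetical,
self-contained C model of the free_hot_cold_page() decision; names like
xpfo_pcp_placement are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Where a freed page lands on the per-cpu free list.  The head is
 * preferred for re-allocation ("hot"); the tail is drained last. */
enum pcp_pos { PCP_HEAD, PCP_TAIL };

/* A page goes to the head only if the caller asked for hot treatment
 * AND the page was not last owned by the kernel: handing a kernel
 * page back to userspace later would cost a TLB shootdown, so kernel
 * pages are deliberately kept out of the hot end of the list. */
static enum pcp_pos xpfo_pcp_placement(bool cold, bool page_is_kernel)
{
	if (!cold && !page_is_kernel)
		return PCP_HEAD;  /* list_add(): favored for reuse */
	return PCP_TAIL;          /* list_add_tail(): allocated last */
}
```

This biases the allocator toward recycling user pages for userspace,
which needs no physmap change, instead of repurposing kernel pages.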

Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
---
 include/linux/xpfo.h | 2 ++
 mm/page_alloc.c      | 8 +++++++-
 mm/xpfo.c            | 8 ++++++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 77187578ca33..077d1cfadfa2 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -24,6 +24,7 @@ extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
 extern void xpfo_free_page(struct page *page, int order);
 
 extern bool xpfo_page_is_unmapped(struct page *page);
+extern bool xpfo_page_is_kernel(struct page *page);
 
 #else /* !CONFIG_XPFO */
 
@@ -33,6 +34,7 @@ static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
 static inline void xpfo_free_page(struct page *page, int order) { }
 
 static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+static inline bool xpfo_page_is_kernel(struct page *page) { return false; }
 
 #endif /* CONFIG_XPFO */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 100e80e008e2..09ef4f7cfd14 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2440,7 +2440,13 @@ void free_hot_cold_page(struct page *page, bool cold)
 	}
 
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
-	if (!cold)
+	/*
+	 * XPFO: Allocating a page to userspace that was previously allocated
+	 * to the kernel requires an expensive TLB shootdown. To minimize this,
+	 * we only put non-kernel pages into the hot cache to favor their
+	 * allocation.
+	 */
+	if (!cold && !xpfo_page_is_kernel(page))
 		list_add(&page->lru, &pcp->lists[migratetype]);
 	else
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 8e3a6a694b6a..0e447e38008a 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -204,3 +204,11 @@ inline bool xpfo_page_is_unmapped(struct page *page)
 	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
 }
 EXPORT_SYMBOL(xpfo_page_is_unmapped);
+
+inline bool xpfo_page_is_kernel(struct page *page)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+
+	return test_bit(PAGE_EXT_XPFO_KERNEL, &lookup_page_ext(page)->flags);
+}
-- 
2.10.1

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-11-04 14:45         ` Juerg Haefliger
  (?)
@ 2016-11-04 14:50           ` Christoph Hellwig
  -1 siblings, 0 replies; 93+ messages in thread
From: Christoph Hellwig @ 2016-11-04 14:50 UTC (permalink / raw)
  To: Juerg Haefliger
  Cc: linux-kernel, linux-mm, kernel-hardening, linux-x86_64, vpk,
	Tejun Heo, linux-ide

The libata parts here really need to be split out and the proper list
and maintainer need to be Cc'ed.

> diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
> index 051b6158d1b7..58af734be25d 100644
> --- a/drivers/ata/libata-sff.c
> +++ b/drivers/ata/libata-sff.c
> @@ -715,7 +715,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
>  
>  	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>  
> -	if (PageHighMem(page)) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>  		unsigned long flags;
>  
>  		/* FIXME: use a bounce buffer */
> @@ -860,7 +860,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
>  
>  	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>  
> -	if (PageHighMem(page)) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>  		unsigned long flags;
>  
>  		/* FIXME: use bounce buffer */
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h

This is just piling one nasty hack on top of another.  libata should
just use the highmem case unconditionally, as it is the correct thing
to do for all cases.
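The suggestion above — drop the PageHighMem(page) || xpfo_page_is_unmapped(page) branch and always take the mapping path — can be sketched as a userspace model. The struct and helpers below are illustrative stand-ins for struct page and kmap_atomic()/kunmap_atomic(), not the kernel API; the point is that a single code path that always maps is correct for lowmem, highmem, and XPFO-unmapped pages alike.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-ins for struct page and kmap_atomic()/kunmap_atomic().
 * kmap_atomic() returns a usable mapping for *any* page, so a driver that
 * always goes through it needs no PageHighMem/xpfo special case. */
struct fake_page { char data[16]; };

static void *fake_kmap_atomic(struct fake_page *p) { return p->data; }
static void fake_kunmap_atomic(void *addr) { (void)addr; }

/* PIO-style copy with one unconditional path for all pages */
static void pio_copy(struct fake_page *p, const char *src, size_t len)
{
	void *buf = fake_kmap_atomic(p);
	memcpy(buf, src, len);
	fake_kunmap_atomic(buf);
}
```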

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [kernel-hardening] [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-11-04 14:45         ` Juerg Haefliger
  (?)
@ 2016-11-10  5:53           ` ZhaoJunmin Zhao(Junmin)
  -1 siblings, 0 replies; 93+ messages in thread
From: ZhaoJunmin Zhao(Junmin) @ 2016-11-10  5:53 UTC (permalink / raw)
  To: kernel-hardening, linux-kernel, linux-mm, linux-x86_64
  Cc: vpk, juerg.haefliger

> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userspace, unless explicitly requested by the
> kernel. Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.
>
> Additional fields in the page_ext struct are used for XPFO housekeeping.
> Specifically, two flags to distinguish user vs. kernel pages and to tag
> unmapped pages, a reference counter to balance kmap/kunmap operations,
> and a lock to serialize access to the XPFO fields.
>
> Known issues/limitations:
>    - Only supports x86-64 (for now)
>    - Only supports 4k pages (for now)
>    - There are most likely some legitimate use cases where the kernel needs
>      to access userspace which need to be made XPFO-aware
>    - Performance penalty
>
> Reference paper by the original patch authors:
>    http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
>
> Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
> ---
>   arch/x86/Kconfig         |   3 +-
>   arch/x86/mm/init.c       |   2 +-
>   drivers/ata/libata-sff.c |   4 +-
>   include/linux/highmem.h  |  15 +++-
>   include/linux/page_ext.h |   7 ++
>   include/linux/xpfo.h     |  39 +++++++++
>   lib/swiotlb.c            |   3 +-
>   mm/Makefile              |   1 +
>   mm/page_alloc.c          |   2 +
>   mm/page_ext.c            |   4 +
>   mm/xpfo.c                | 206 +++++++++++++++++++++++++++++++++++++++++++++++
>   security/Kconfig         |  19 +++++
>   12 files changed, 298 insertions(+), 7 deletions(-)
>   create mode 100644 include/linux/xpfo.h
>   create mode 100644 mm/xpfo.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index bada636d1065..38b334f8fde5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -165,6 +165,7 @@ config X86
>   	select HAVE_STACK_VALIDATION		if X86_64
>   	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
>   	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
> +	select ARCH_SUPPORTS_XPFO		if X86_64
>
>   config INSTRUCTION_DECODER
>   	def_bool y
> @@ -1361,7 +1362,7 @@ config ARCH_DMA_ADDR_T_64BIT
>
>   config X86_DIRECT_GBPAGES
>   	def_bool y
> -	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
> +	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>   	---help---
>   	  Certain kernel features effectively disable kernel
>   	  linear 1 GB mappings (even if the CPU otherwise
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 22af912d66d2..a6fafbae02bb 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -161,7 +161,7 @@ static int page_size_mask;
>
>   static void __init probe_page_size_mask(void)
>   {
> -#if !defined(CONFIG_KMEMCHECK)
> +#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
>   	/*
>   	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
>   	 * use small pages.
> diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
> index 051b6158d1b7..58af734be25d 100644
> --- a/drivers/ata/libata-sff.c
> +++ b/drivers/ata/libata-sff.c
> @@ -715,7 +715,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
>
>   	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>
> -	if (PageHighMem(page)) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>   		unsigned long flags;
>
>   		/* FIXME: use a bounce buffer */
> @@ -860,7 +860,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
>
>   	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>
> -	if (PageHighMem(page)) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>   		unsigned long flags;
>
>   		/* FIXME: use bounce buffer */
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index bb3f3297062a..7a17c166532f 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -7,6 +7,7 @@
>   #include <linux/mm.h>
>   #include <linux/uaccess.h>
>   #include <linux/hardirq.h>
> +#include <linux/xpfo.h>
>
>   #include <asm/cacheflush.h>
>
> @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
>   #ifndef ARCH_HAS_KMAP
>   static inline void *kmap(struct page *page)
>   {
> +	void *kaddr;
> +
>   	might_sleep();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>   }
>
>   static inline void kunmap(struct page *page)
>   {
> +	xpfo_kunmap(page_address(page), page);
>   }
>
>   static inline void *kmap_atomic(struct page *page)
>   {
> +	void *kaddr;
> +
>   	preempt_disable();
>   	pagefault_disable();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>   }
>   #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>
>   static inline void __kunmap_atomic(void *addr)
>   {
> +	xpfo_kunmap(addr, virt_to_page(addr));
>   	pagefault_enable();
>   	preempt_enable();
>   }
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index 9298c393ddaa..0e451a42e5a3 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -29,6 +29,8 @@ enum page_ext_flags {
>   	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
>   	PAGE_EXT_DEBUG_GUARD,
>   	PAGE_EXT_OWNER,
> +	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
> +	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
>   #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>   	PAGE_EXT_YOUNG,
>   	PAGE_EXT_IDLE,
> @@ -44,6 +46,11 @@ enum page_ext_flags {
>    */
>   struct page_ext {
>   	unsigned long flags;
> +#ifdef CONFIG_XPFO
> +	int inited;		/* Map counter and lock initialized */
> +	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
> +	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
> +#endif
>   };
>
>   extern void pgdat_page_ext_init(struct pglist_data *pgdat);
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> new file mode 100644
> index 000000000000..77187578ca33
> --- /dev/null
> +++ b/include/linux/xpfo.h
> @@ -0,0 +1,39 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_XPFO_H
> +#define _LINUX_XPFO_H
> +
> +#ifdef CONFIG_XPFO
> +
> +extern struct page_ext_operations page_xpfo_ops;
> +
> +extern void xpfo_kmap(void *kaddr, struct page *page);
> +extern void xpfo_kunmap(void *kaddr, struct page *page);
> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> +extern void xpfo_free_page(struct page *page, int order);
> +
> +extern bool xpfo_page_is_unmapped(struct page *page);
> +
> +#else /* !CONFIG_XPFO */
> +
> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> +static inline void xpfo_free_page(struct page *page, int order) { }
> +
> +static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> +
> +#endif /* CONFIG_XPFO */
> +
> +#endif /* _LINUX_XPFO_H */
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 22e13a0e19d7..455eff44604e 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>   {
>   	unsigned long pfn = PFN_DOWN(orig_addr);
>   	unsigned char *vaddr = phys_to_virt(tlb_addr);
> +	struct page *page = pfn_to_page(pfn);
>
> -	if (PageHighMem(pfn_to_page(pfn))) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>   		/* The buffer does not have a mapping.  Map it in and copy */
>   		unsigned int offset = orig_addr & ~PAGE_MASK;
>   		char *buffer;
> diff --git a/mm/Makefile b/mm/Makefile
> index 295bd7a9f76b..175680f516aa 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
>   obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
>   obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
>   obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> +obj-$(CONFIG_XPFO) += xpfo.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8fd42aa7c4bd..100e80e008e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1045,6 +1045,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>   	kernel_poison_pages(page, 1 << order, 0);
>   	kernel_map_pages(page, 1 << order, 0);
>   	kasan_free_pages(page, order);
> +	xpfo_free_page(page, order);
>
>   	return true;
>   }
> @@ -1745,6 +1746,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>   	kernel_map_pages(page, 1 << order, 1);
>   	kernel_poison_pages(page, 1 << order, 1);
>   	kasan_alloc_pages(page, order);
> +	xpfo_alloc_page(page, order, gfp_flags);
>   	set_page_owner(page, order, gfp_flags);
>   }
>
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 121dcffc4ec1..ba6dbcacc2db 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -7,6 +7,7 @@
>   #include <linux/kmemleak.h>
>   #include <linux/page_owner.h>
>   #include <linux/page_idle.h>
> +#include <linux/xpfo.h>
>
>   /*
>    * struct page extension
> @@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
>   #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>   	&page_idle_ops,
>   #endif
> +#ifdef CONFIG_XPFO
> +	&page_xpfo_ops,
> +#endif
>   };
>
>   static unsigned long total_usage;
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> new file mode 100644
> index 000000000000..8e3a6a694b6a
> --- /dev/null
> +++ b/mm/xpfo.c
> @@ -0,0 +1,206 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/page_ext.h>
> +#include <linux/xpfo.h>
> +
> +#include <asm/tlbflush.h>
> +
> +DEFINE_STATIC_KEY_FALSE(xpfo_inited);
> +
> +static bool need_xpfo(void)
> +{
> +	return true;
> +}
> +
> +static void init_xpfo(void)
> +{
> +	printk(KERN_INFO "XPFO enabled\n");
> +	static_branch_enable(&xpfo_inited);
> +}
> +
> +struct page_ext_operations page_xpfo_ops = {
> +	.need = need_xpfo,
> +	.init = init_xpfo,
> +};
> +
> +/*
> + * Update a single kernel page table entry
> + */
> +static inline void set_kpte(struct page *page, unsigned long kaddr,
> +			    pgprot_t prot) {
> +	unsigned int level;
> +	pte_t *kpte = lookup_address(kaddr, &level);
> +
> +	/* We only support 4k pages for now */
> +	BUG_ON(!kpte || level != PG_LEVEL_4K);
> +
> +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> +}
> +
> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> +{
> +	int i, flush_tlb = 0;
> +	struct page_ext *page_ext;
> +	unsigned long kaddr;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++)  {
> +		page_ext = lookup_page_ext(page + i);
> +
> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +
> +		/* Initialize the map lock and map counter */
> +		if (!page_ext->inited) {
> +			spin_lock_init(&page_ext->maplock);
> +			atomic_set(&page_ext->mapcount, 0);
> +			page_ext->inited = 1;
> +		}
> +		BUG_ON(atomic_read(&page_ext->mapcount));
> +
> +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> +			/*
> +			 * Flush the TLB if the page was previously allocated
> +			 * to the kernel.
> +			 */
> +			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
> +					       &page_ext->flags))
> +				flush_tlb = 1;
> +		} else {
> +			/* Tag the page as a kernel page */
> +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +		}
> +	}
> +
> +	if (flush_tlb) {
> +		kaddr = (unsigned long)page_address(page);
> +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> +				       PAGE_SIZE);
> +	}
> +}
> +
> +void xpfo_free_page(struct page *page, int order)
> +{
> +	int i;
> +	struct page_ext *page_ext;
> +	unsigned long kaddr;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		page_ext = lookup_page_ext(page + i);
> +
> +		if (!page_ext->inited) {
> +			/*
> +			 * The page was allocated before page_ext was
> +			 * initialized, so it is a kernel page and it needs to
> +			 * be tagged accordingly.
> +			 */
> +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +			continue;
> +		}
> +
> +		/*
> +		 * Map the page back into the kernel if it was previously
> +		 * allocated to user space.
> +		 */
> +		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
> +				       &page_ext->flags)) {
> +			kaddr = (unsigned long)page_address(page + i);
> +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> +		}
> +	}
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	struct page_ext *page_ext;
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	page_ext = lookup_page_ext(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!page_ext->inited ||
> +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +		return;
> +
> +	spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +	/*
> +	 * The page was previously allocated to user space, so map it back
> +	 * into the kernel. No TLB flush required.
> +	 */
> +	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
> +	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	struct page_ext *page_ext;
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	page_ext = lookup_page_ext(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!page_ext->inited ||
> +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +		return;
> +
> +	spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +	/*
> +	 * The page is to be allocated back to user space, so unmap it from the
> +	 * kernel, flush the TLB and tag it as a user page.
> +	 */
> +	if (atomic_dec_return(&page_ext->mapcount) == 0) {
> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +		__flush_tlb_one((unsigned long)kaddr);
> +	}
> +
> +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);
> +
> +inline bool xpfo_page_is_unmapped(struct page *page)
> +{
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return false;
> +
> +	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
> +}
> +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> diff --git a/security/Kconfig b/security/Kconfig
> index 118f4549404e..4502e15c8419 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -6,6 +6,25 @@ menu "Security options"
>
>   source security/keys/Kconfig
>
> +config ARCH_SUPPORTS_XPFO
> +	bool
> +
> +config XPFO
> +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +	default n
> +	depends on ARCH_SUPPORTS_XPFO
> +	select PAGE_EXTENSION
> +	help
> +	  This option offers protection against 'ret2dir' kernel attacks.
> +	  When enabled, every time a page frame is allocated to user space, it
> +	  is unmapped from the direct mapped RAM region in kernel space
> +	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
> +	  mapped back to physmap.
> +
> +	  There is a slight performance impact when this option is enabled.
> +
> +	  If in doubt, say "N".
> +
>   config SECURITY_DMESG_RESTRICT
>   	bool "Restrict unprivileged access to the kernel syslog"
>   	default n
>

When a physical page is assigned to a process in user space, it should
be unmapped from the kernel physmap. From the code, I can see the patch
only handles pages in the high memory zone: when the kernel uses the
high memory zone, it calls kmap(). So I would like to know, if the
physical page comes from the normal zone, how is it handled?

Thanks
Zhaojunmin
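For reference, the map-count protocol implemented by xpfo_kmap()/xpfo_kunmap() in the quoted patch can be modeled in plain C. This is a simplified userspace sketch under stated assumptions: the boolean fields stand in for the PAGE_EXT_XPFO_UNMAPPED/PAGE_EXT_XPFO_KERNEL page_ext bits and the physmap PTE state, and the spinlock and TLB flush of the real code are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the xpfo_kmap()/xpfo_kunmap() bookkeeping quoted
 * above: the first kmap of a user page restores its physmap PTE, the
 * last kunmap removes it (plus a TLB flush in the real code). Kernel
 * pages are never touched. Locking is omitted for brevity. */
struct xpfo_state {
	int mapcount;	/* balances map/unmap requests */
	bool unmapped;	/* models PAGE_EXT_XPFO_UNMAPPED */
	bool kernel;	/* models PAGE_EXT_XPFO_KERNEL */
};

static void model_kmap(struct xpfo_state *s)
{
	if (s->kernel)
		return;			/* kernel pages stay mapped */
	if (++s->mapcount == 1 && s->unmapped)
		s->unmapped = false;	/* map back into physmap */
}

static void model_kunmap(struct xpfo_state *s)
{
	if (s->kernel)
		return;
	if (--s->mapcount == 0)
		s->unmapped = true;	/* unmap PTE + flush TLB */
}
```

Nested kmap/kunmap pairs leave the page mapped until the final kunmap, matching the atomic_inc_return()/atomic_dec_return() logic in the patch.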

^ permalink raw reply	[flat|nested] 93+ messages in thread

> +
> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +
> +		/* Initialize the map lock and map counter */
> +		if (!page_ext->inited) {
> +			spin_lock_init(&page_ext->maplock);
> +			atomic_set(&page_ext->mapcount, 0);
> +			page_ext->inited = 1;
> +		}
> +		BUG_ON(atomic_read(&page_ext->mapcount));
> +
> +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> +			/*
> +			 * Flush the TLB if the page was previously allocated
> +			 * to the kernel.
> +			 */
> +			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
> +					       &page_ext->flags))
> +				flush_tlb = 1;
> +		} else {
> +			/* Tag the page as a kernel page */
> +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +		}
> +	}
> +
> +	if (flush_tlb) {
> +		kaddr = (unsigned long)page_address(page);
> +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> +				       PAGE_SIZE);
> +	}
> +}
> +
> +void xpfo_free_page(struct page *page, int order)
> +{
> +	int i;
> +	struct page_ext *page_ext;
> +	unsigned long kaddr;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		page_ext = lookup_page_ext(page + i);
> +
> +		if (!page_ext->inited) {
> +			/*
> +			 * The page was allocated before page_ext was
> +			 * initialized, so it is a kernel page and it needs to
> +			 * be tagged accordingly.
> +			 */
> +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +			continue;
> +		}
> +
> +		/*
> +		 * Map the page back into the kernel if it was previously
> +		 * allocated to user space.
> +		 */
> +		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
> +				       &page_ext->flags)) {
> +			kaddr = (unsigned long)page_address(page + i);
> +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> +		}
> +	}
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	struct page_ext *page_ext;
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	page_ext = lookup_page_ext(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!page_ext->inited ||
> +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +		return;
> +
> +	spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +	/*
> +	 * The page was previously allocated to user space, so map it back
> +	 * into the kernel. No TLB flush required.
> +	 */
> +	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
> +	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	struct page_ext *page_ext;
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	page_ext = lookup_page_ext(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!page_ext->inited ||
> +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +		return;
> +
> +	spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +	/*
> +	 * The page is to be allocated back to user space, so unmap it from the
> +	 * kernel, flush the TLB and tag it as a user page.
> +	 */
> +	if (atomic_dec_return(&page_ext->mapcount) == 0) {
> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +		__flush_tlb_one((unsigned long)kaddr);
> +	}
> +
> +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);
> +
> +inline bool xpfo_page_is_unmapped(struct page *page)
> +{
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return false;
> +
> +	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
> +}
> +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> diff --git a/security/Kconfig b/security/Kconfig
> index 118f4549404e..4502e15c8419 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -6,6 +6,25 @@ menu "Security options"
>
>   source security/keys/Kconfig
>
> +config ARCH_SUPPORTS_XPFO
> +	bool
> +
> +config XPFO
> +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +	default n
> +	depends on ARCH_SUPPORTS_XPFO
> +	select PAGE_EXTENSION
> +	help
> +	  This option offers protection against 'ret2dir' kernel attacks.
> +	  When enabled, every time a page frame is allocated to user space, it
> +	  is unmapped from the direct mapped RAM region in kernel space
> +	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
> +	  mapped back to physmap.
> +
> +	  There is a slight performance impact when this option is enabled.
> +
> +	  If in doubt, say "N".
> +
>   config SECURITY_DMESG_RESTRICT
>   	bool "Restrict unprivileged access to the kernel syslog"
>   	default n
>

When a physical page is assigned to a process in user space, it should
be unmapped from the kernel physmap. From the code, I can see the patch
only handles pages in the high memory zone: when the kernel uses high
memory, it calls kmap(). So I would like to know, if the physical page
comes from the normal zone, how is it handled?

Thanks
Zhaojunmin

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [kernel-hardening] [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-11-10  5:53           ` ZhaoJunmin Zhao(Junmin)
  0 siblings, 0 replies; 93+ messages in thread
From: ZhaoJunmin Zhao(Junmin) @ 2016-11-10  5:53 UTC (permalink / raw)
  To: kernel-hardening, linux-kernel, linux-mm, linux-x86_64
  Cc: vpk, juerg.haefliger

> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userspace, unless explicitly requested by the
> kernel. Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.
>
> Additional fields in the page_ext struct are used for XPFO housekeeping.
> Specifically two flags to distinguish user vs. kernel pages and to tag
> unmapped pages and a reference counter to balance kmap/kunmap operations
> and a lock to serialize access to the XPFO fields.
>
> Known issues/limitations:
>    - Only supports x86-64 (for now)
>    - Only supports 4k pages (for now)
>    - There are most likely some legitimate use cases where the kernel needs
>      to access userspace which need to be made XPFO-aware
>    - Performance penalty
>
> Reference paper by the original patch authors:
>    http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
>
> Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
> ---
>   arch/x86/Kconfig         |   3 +-
>   arch/x86/mm/init.c       |   2 +-
>   drivers/ata/libata-sff.c |   4 +-
>   include/linux/highmem.h  |  15 +++-
>   include/linux/page_ext.h |   7 ++
>   include/linux/xpfo.h     |  39 +++++++++
>   lib/swiotlb.c            |   3 +-
>   mm/Makefile              |   1 +
>   mm/page_alloc.c          |   2 +
>   mm/page_ext.c            |   4 +
>   mm/xpfo.c                | 206 +++++++++++++++++++++++++++++++++++++++++++++++
>   security/Kconfig         |  19 +++++
>   12 files changed, 298 insertions(+), 7 deletions(-)
>   create mode 100644 include/linux/xpfo.h
>   create mode 100644 mm/xpfo.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index bada636d1065..38b334f8fde5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -165,6 +165,7 @@ config X86
>   	select HAVE_STACK_VALIDATION		if X86_64
>   	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
>   	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
> +	select ARCH_SUPPORTS_XPFO		if X86_64
>
>   config INSTRUCTION_DECODER
>   	def_bool y
> @@ -1361,7 +1362,7 @@ config ARCH_DMA_ADDR_T_64BIT
>
>   config X86_DIRECT_GBPAGES
>   	def_bool y
> -	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
> +	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>   	---help---
>   	  Certain kernel features effectively disable kernel
>   	  linear 1 GB mappings (even if the CPU otherwise
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 22af912d66d2..a6fafbae02bb 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -161,7 +161,7 @@ static int page_size_mask;
>
>   static void __init probe_page_size_mask(void)
>   {
> -#if !defined(CONFIG_KMEMCHECK)
> +#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
>   	/*
>   	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
>   	 * use small pages.
> diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
> index 051b6158d1b7..58af734be25d 100644
> --- a/drivers/ata/libata-sff.c
> +++ b/drivers/ata/libata-sff.c
> @@ -715,7 +715,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
>
>   	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>
> -	if (PageHighMem(page)) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>   		unsigned long flags;
>
>   		/* FIXME: use a bounce buffer */
> @@ -860,7 +860,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
>
>   	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>
> -	if (PageHighMem(page)) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>   		unsigned long flags;
>
>   		/* FIXME: use bounce buffer */
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index bb3f3297062a..7a17c166532f 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -7,6 +7,7 @@
>   #include <linux/mm.h>
>   #include <linux/uaccess.h>
>   #include <linux/hardirq.h>
> +#include <linux/xpfo.h>
>
>   #include <asm/cacheflush.h>
>
> @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
>   #ifndef ARCH_HAS_KMAP
>   static inline void *kmap(struct page *page)
>   {
> +	void *kaddr;
> +
>   	might_sleep();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>   }
>
>   static inline void kunmap(struct page *page)
>   {
> +	xpfo_kunmap(page_address(page), page);
>   }
>
>   static inline void *kmap_atomic(struct page *page)
>   {
> +	void *kaddr;
> +
>   	preempt_disable();
>   	pagefault_disable();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>   }
>   #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>
>   static inline void __kunmap_atomic(void *addr)
>   {
> +	xpfo_kunmap(addr, virt_to_page(addr));
>   	pagefault_enable();
>   	preempt_enable();
>   }
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index 9298c393ddaa..0e451a42e5a3 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -29,6 +29,8 @@ enum page_ext_flags {
>   	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
>   	PAGE_EXT_DEBUG_GUARD,
>   	PAGE_EXT_OWNER,
> +	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
> +	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
>   #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>   	PAGE_EXT_YOUNG,
>   	PAGE_EXT_IDLE,
> @@ -44,6 +46,11 @@ enum page_ext_flags {
>    */
>   struct page_ext {
>   	unsigned long flags;
> +#ifdef CONFIG_XPFO
> +	int inited;		/* Map counter and lock initialized */
> +	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
> +	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
> +#endif
>   };
>
>   extern void pgdat_page_ext_init(struct pglist_data *pgdat);
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> new file mode 100644
> index 000000000000..77187578ca33
> --- /dev/null
> +++ b/include/linux/xpfo.h
> @@ -0,0 +1,39 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_XPFO_H
> +#define _LINUX_XPFO_H
> +
> +#ifdef CONFIG_XPFO
> +
> +extern struct page_ext_operations page_xpfo_ops;
> +
> +extern void xpfo_kmap(void *kaddr, struct page *page);
> +extern void xpfo_kunmap(void *kaddr, struct page *page);
> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> +extern void xpfo_free_page(struct page *page, int order);
> +
> +extern bool xpfo_page_is_unmapped(struct page *page);
> +
> +#else /* !CONFIG_XPFO */
> +
> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> +static inline void xpfo_free_page(struct page *page, int order) { }
> +
> +static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> +
> +#endif /* CONFIG_XPFO */
> +
> +#endif /* _LINUX_XPFO_H */
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 22e13a0e19d7..455eff44604e 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>   {
>   	unsigned long pfn = PFN_DOWN(orig_addr);
>   	unsigned char *vaddr = phys_to_virt(tlb_addr);
> +	struct page *page = pfn_to_page(pfn);
>
> -	if (PageHighMem(pfn_to_page(pfn))) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>   		/* The buffer does not have a mapping.  Map it in and copy */
>   		unsigned int offset = orig_addr & ~PAGE_MASK;
>   		char *buffer;
> diff --git a/mm/Makefile b/mm/Makefile
> index 295bd7a9f76b..175680f516aa 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
>   obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
>   obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
>   obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> +obj-$(CONFIG_XPFO) += xpfo.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8fd42aa7c4bd..100e80e008e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1045,6 +1045,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>   	kernel_poison_pages(page, 1 << order, 0);
>   	kernel_map_pages(page, 1 << order, 0);
>   	kasan_free_pages(page, order);
> +	xpfo_free_page(page, order);
>
>   	return true;
>   }
> @@ -1745,6 +1746,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>   	kernel_map_pages(page, 1 << order, 1);
>   	kernel_poison_pages(page, 1 << order, 1);
>   	kasan_alloc_pages(page, order);
> +	xpfo_alloc_page(page, order, gfp_flags);
>   	set_page_owner(page, order, gfp_flags);
>   }
>
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 121dcffc4ec1..ba6dbcacc2db 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -7,6 +7,7 @@
>   #include <linux/kmemleak.h>
>   #include <linux/page_owner.h>
>   #include <linux/page_idle.h>
> +#include <linux/xpfo.h>
>
>   /*
>    * struct page extension
> @@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
>   #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>   	&page_idle_ops,
>   #endif
> +#ifdef CONFIG_XPFO
> +	&page_xpfo_ops,
> +#endif
>   };
>
>   static unsigned long total_usage;
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> new file mode 100644
> index 000000000000..8e3a6a694b6a
> --- /dev/null
> +++ b/mm/xpfo.c
> @@ -0,0 +1,206 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/page_ext.h>
> +#include <linux/xpfo.h>
> +
> +#include <asm/tlbflush.h>
> +
> +DEFINE_STATIC_KEY_FALSE(xpfo_inited);
> +
> +static bool need_xpfo(void)
> +{
> +	return true;
> +}
> +
> +static void init_xpfo(void)
> +{
> +	printk(KERN_INFO "XPFO enabled\n");
> +	static_branch_enable(&xpfo_inited);
> +}
> +
> +struct page_ext_operations page_xpfo_ops = {
> +	.need = need_xpfo,
> +	.init = init_xpfo,
> +};
> +
> +/*
> + * Update a single kernel page table entry
> + */
> +static inline void set_kpte(struct page *page, unsigned long kaddr,
> +			    pgprot_t prot) {
> +	unsigned int level;
> +	pte_t *kpte = lookup_address(kaddr, &level);
> +
> +	/* We only support 4k pages for now */
> +	BUG_ON(!kpte || level != PG_LEVEL_4K);
> +
> +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> +}
> +
> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> +{
> +	int i, flush_tlb = 0;
> +	struct page_ext *page_ext;
> +	unsigned long kaddr;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++)  {
> +		page_ext = lookup_page_ext(page + i);
> +
> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +
> +		/* Initialize the map lock and map counter */
> +		if (!page_ext->inited) {
> +			spin_lock_init(&page_ext->maplock);
> +			atomic_set(&page_ext->mapcount, 0);
> +			page_ext->inited = 1;
> +		}
> +		BUG_ON(atomic_read(&page_ext->mapcount));
> +
> +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> +			/*
> +			 * Flush the TLB if the page was previously allocated
> +			 * to the kernel.
> +			 */
> +			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
> +					       &page_ext->flags))
> +				flush_tlb = 1;
> +		} else {
> +			/* Tag the page as a kernel page */
> +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +		}
> +	}
> +
> +	if (flush_tlb) {
> +		kaddr = (unsigned long)page_address(page);
> +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> +				       PAGE_SIZE);
> +	}
> +}
> +
> +void xpfo_free_page(struct page *page, int order)
> +{
> +	int i;
> +	struct page_ext *page_ext;
> +	unsigned long kaddr;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		page_ext = lookup_page_ext(page + i);
> +
> +		if (!page_ext->inited) {
> +			/*
> +			 * The page was allocated before page_ext was
> +			 * initialized, so it is a kernel page and it needs to
> +			 * be tagged accordingly.
> +			 */
> +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +			continue;
> +		}
> +
> +		/*
> +		 * Map the page back into the kernel if it was previously
> +		 * allocated to user space.
> +		 */
> +		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
> +				       &page_ext->flags)) {
> +			kaddr = (unsigned long)page_address(page + i);
> +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> +		}
> +	}
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	struct page_ext *page_ext;
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	page_ext = lookup_page_ext(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!page_ext->inited ||
> +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +		return;
> +
> +	spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +	/*
> +	 * The page was previously allocated to user space, so map it back
> +	 * into the kernel. No TLB flush required.
> +	 */
> +	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
> +	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	struct page_ext *page_ext;
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	page_ext = lookup_page_ext(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!page_ext->inited ||
> +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +		return;
> +
> +	spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +	/*
> +	 * The page is to be allocated back to user space, so unmap it from the
> +	 * kernel, flush the TLB and tag it as a user page.
> +	 */
> +	if (atomic_dec_return(&page_ext->mapcount) == 0) {
> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +		__flush_tlb_one((unsigned long)kaddr);
> +	}
> +
> +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);
> +
> +inline bool xpfo_page_is_unmapped(struct page *page)
> +{
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return false;
> +
> +	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
> +}
> +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> diff --git a/security/Kconfig b/security/Kconfig
> index 118f4549404e..4502e15c8419 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -6,6 +6,25 @@ menu "Security options"
>
>   source security/keys/Kconfig
>
> +config ARCH_SUPPORTS_XPFO
> +	bool
> +
> +config XPFO
> +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +	default n
> +	depends on ARCH_SUPPORTS_XPFO
> +	select PAGE_EXTENSION
> +	help
> +	  This option offers protection against 'ret2dir' kernel attacks.
> +	  When enabled, every time a page frame is allocated to user space, it
> +	  is unmapped from the direct mapped RAM region in kernel space
> +	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
> +	  mapped back to physmap.
> +
> +	  There is a slight performance impact when this option is enabled.
> +
> +	  If in doubt, say "N".
> +
>   config SECURITY_DMESG_RESTRICT
>   	bool "Restrict unprivileged access to the kernel syslog"
>   	default n
>

When a physical page is assigned to a process in user space, it should
be unmapped from the kernel physmap. From the code, I can see the patch
only handles pages in the high memory zone: when the kernel uses high
memory, it calls kmap(). So I would like to know, if the physical page
comes from the normal zone, how is it handled?

Thanks
Zhaojunmin

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-11-04 14:45         ` Juerg Haefliger
  (?)
@ 2016-11-10 19:11           ` Kees Cook
  -1 siblings, 0 replies; 93+ messages in thread
From: Kees Cook @ 2016-11-10 19:11 UTC (permalink / raw)
  To: Juerg Haefliger; +Cc: LKML, Linux-MM, kernel-hardening, linux-x86_64, vpk

On Fri, Nov 4, 2016 at 7:45 AM, Juerg Haefliger <juerg.haefliger@hpe.com> wrote:
> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userspace, unless explicitly requested by the
> kernel. Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.
>
> Additional fields in the page_ext struct are used for XPFO housekeeping.
> Specifically two flags to distinguish user vs. kernel pages and to tag
> unmapped pages and a reference counter to balance kmap/kunmap operations
> and a lock to serialize access to the XPFO fields.

Thanks for keeping on this! I'd really like to see it land and then
get more architectures to support it.

> Known issues/limitations:
>   - Only supports x86-64 (for now)
>   - Only supports 4k pages (for now)
>   - There are most likely some legitimate use cases where the kernel needs
>     to access userspace which need to be made XPFO-aware
>   - Performance penalty

In the Kconfig you say "slight", but I'm curious what kinds of
benchmarks you've done and if there's a more specific cost we can
declare, just to give people more of an idea what the hit looks like?
(What workloads would trigger a lot of XPFO unmapping, for example?)

Thanks!

-Kees

-- 
Kees Cook
Nexus Security

^ permalink raw reply	[flat|nested] 93+ messages in thread


* Re: [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-11-04 14:45         ` Juerg Haefliger
  (?)
@ 2016-11-10 19:24           ` Kees Cook
  -1 siblings, 0 replies; 93+ messages in thread
From: Kees Cook @ 2016-11-10 19:24 UTC (permalink / raw)
  To: Juerg Haefliger; +Cc: LKML, Linux-MM, kernel-hardening, linux-x86_64, vpk

On Fri, Nov 4, 2016 at 7:45 AM, Juerg Haefliger <juerg.haefliger@hpe.com> wrote:
> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userspace, unless explicitly requested by the
> kernel. Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.
>
> Additional fields in the page_ext struct are used for XPFO housekeeping.
> Specifically: two flags to distinguish user vs. kernel pages and to tag
> unmapped pages, a reference counter to balance kmap/kunmap operations,
> and a lock to serialize access to the XPFO fields.
>
> Known issues/limitations:
>   - Only supports x86-64 (for now)
>   - Only supports 4k pages (for now)
>   - There are most likely some legitimate use cases where the kernel needs
>     to access userspace which need to be made XPFO-aware
>   - Performance penalty
>
> Reference paper by the original patch authors:
>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Would it be possible to create an lkdtm test that can exercise this protection?

> Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
> ---
>  arch/x86/Kconfig         |   3 +-
>  arch/x86/mm/init.c       |   2 +-
>  drivers/ata/libata-sff.c |   4 +-
>  include/linux/highmem.h  |  15 +++-
>  include/linux/page_ext.h |   7 ++
>  include/linux/xpfo.h     |  39 +++++++++
>  lib/swiotlb.c            |   3 +-
>  mm/Makefile              |   1 +
>  mm/page_alloc.c          |   2 +
>  mm/page_ext.c            |   4 +
>  mm/xpfo.c                | 206 +++++++++++++++++++++++++++++++++++++++++++++++
>  security/Kconfig         |  19 +++++
>  12 files changed, 298 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/xpfo.h
>  create mode 100644 mm/xpfo.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index bada636d1065..38b334f8fde5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -165,6 +165,7 @@ config X86
>         select HAVE_STACK_VALIDATION            if X86_64
>         select ARCH_USES_HIGH_VMA_FLAGS         if X86_INTEL_MEMORY_PROTECTION_KEYS
>         select ARCH_HAS_PKEYS                   if X86_INTEL_MEMORY_PROTECTION_KEYS
> +       select ARCH_SUPPORTS_XPFO               if X86_64
>
>  config INSTRUCTION_DECODER
>         def_bool y
> @@ -1361,7 +1362,7 @@ config ARCH_DMA_ADDR_T_64BIT
>
>  config X86_DIRECT_GBPAGES
>         def_bool y
> -       depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
> +       depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>         ---help---
>           Certain kernel features effectively disable kernel
>           linear 1 GB mappings (even if the CPU otherwise
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 22af912d66d2..a6fafbae02bb 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -161,7 +161,7 @@ static int page_size_mask;
>
>  static void __init probe_page_size_mask(void)
>  {
> -#if !defined(CONFIG_KMEMCHECK)
> +#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
>         /*
>          * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
>          * use small pages.
> diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
> index 051b6158d1b7..58af734be25d 100644
> --- a/drivers/ata/libata-sff.c
> +++ b/drivers/ata/libata-sff.c
> @@ -715,7 +715,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
>
>         DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>
> -       if (PageHighMem(page)) {
> +       if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>                 unsigned long flags;
>
>                 /* FIXME: use a bounce buffer */
> @@ -860,7 +860,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
>
>         DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>
> -       if (PageHighMem(page)) {
> +       if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>                 unsigned long flags;
>
>                 /* FIXME: use bounce buffer */
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index bb3f3297062a..7a17c166532f 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -7,6 +7,7 @@
>  #include <linux/mm.h>
>  #include <linux/uaccess.h>
>  #include <linux/hardirq.h>
> +#include <linux/xpfo.h>
>
>  #include <asm/cacheflush.h>
>
> @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
>  #ifndef ARCH_HAS_KMAP
>  static inline void *kmap(struct page *page)
>  {
> +       void *kaddr;
> +
>         might_sleep();
> -       return page_address(page);
> +       kaddr = page_address(page);
> +       xpfo_kmap(kaddr, page);
> +       return kaddr;
>  }
>
>  static inline void kunmap(struct page *page)
>  {
> +       xpfo_kunmap(page_address(page), page);
>  }
>
>  static inline void *kmap_atomic(struct page *page)
>  {
> +       void *kaddr;
> +
>         preempt_disable();
>         pagefault_disable();
> -       return page_address(page);
> +       kaddr = page_address(page);
> +       xpfo_kmap(kaddr, page);
> +       return kaddr;
>  }
>  #define kmap_atomic_prot(page, prot)   kmap_atomic(page)
>
>  static inline void __kunmap_atomic(void *addr)
>  {
> +       xpfo_kunmap(addr, virt_to_page(addr));
>         pagefault_enable();
>         preempt_enable();
>  }
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index 9298c393ddaa..0e451a42e5a3 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -29,6 +29,8 @@ enum page_ext_flags {
>         PAGE_EXT_DEBUG_POISON,          /* Page is poisoned */
>         PAGE_EXT_DEBUG_GUARD,
>         PAGE_EXT_OWNER,
> +       PAGE_EXT_XPFO_KERNEL,           /* Page is a kernel page */
> +       PAGE_EXT_XPFO_UNMAPPED,         /* Page is unmapped */
>  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>         PAGE_EXT_YOUNG,
>         PAGE_EXT_IDLE,
> @@ -44,6 +46,11 @@ enum page_ext_flags {
>   */
>  struct page_ext {
>         unsigned long flags;
> +#ifdef CONFIG_XPFO
> +       int inited;             /* Map counter and lock initialized */
> +       atomic_t mapcount;      /* Counter for balancing map/unmap requests */
> +       spinlock_t maplock;     /* Lock to serialize map/unmap requests */
> +#endif
>  };
>
>  extern void pgdat_page_ext_init(struct pglist_data *pgdat);
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> new file mode 100644
> index 000000000000..77187578ca33
> --- /dev/null
> +++ b/include/linux/xpfo.h
> @@ -0,0 +1,39 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_XPFO_H
> +#define _LINUX_XPFO_H
> +
> +#ifdef CONFIG_XPFO
> +
> +extern struct page_ext_operations page_xpfo_ops;
> +
> +extern void xpfo_kmap(void *kaddr, struct page *page);
> +extern void xpfo_kunmap(void *kaddr, struct page *page);
> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> +extern void xpfo_free_page(struct page *page, int order);
> +
> +extern bool xpfo_page_is_unmapped(struct page *page);
> +
> +#else /* !CONFIG_XPFO */
> +
> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> +static inline void xpfo_free_page(struct page *page, int order) { }
> +
> +static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> +
> +#endif /* CONFIG_XPFO */
> +
> +#endif /* _LINUX_XPFO_H */
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 22e13a0e19d7..455eff44604e 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>  {
>         unsigned long pfn = PFN_DOWN(orig_addr);
>         unsigned char *vaddr = phys_to_virt(tlb_addr);
> +       struct page *page = pfn_to_page(pfn);
>
> -       if (PageHighMem(pfn_to_page(pfn))) {
> +       if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>                 /* The buffer does not have a mapping.  Map it in and copy */
>                 unsigned int offset = orig_addr & ~PAGE_MASK;
>                 char *buffer;
> diff --git a/mm/Makefile b/mm/Makefile
> index 295bd7a9f76b..175680f516aa 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
>  obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
>  obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
>  obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> +obj-$(CONFIG_XPFO) += xpfo.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8fd42aa7c4bd..100e80e008e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1045,6 +1045,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>         kernel_poison_pages(page, 1 << order, 0);
>         kernel_map_pages(page, 1 << order, 0);
>         kasan_free_pages(page, order);
> +       xpfo_free_page(page, order);
>
>         return true;
>  }
> @@ -1745,6 +1746,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>         kernel_map_pages(page, 1 << order, 1);
>         kernel_poison_pages(page, 1 << order, 1);
>         kasan_alloc_pages(page, order);
> +       xpfo_alloc_page(page, order, gfp_flags);
>         set_page_owner(page, order, gfp_flags);
>  }
>
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 121dcffc4ec1..ba6dbcacc2db 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -7,6 +7,7 @@
>  #include <linux/kmemleak.h>
>  #include <linux/page_owner.h>
>  #include <linux/page_idle.h>
> +#include <linux/xpfo.h>
>
>  /*
>   * struct page extension
> @@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
>  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>         &page_idle_ops,
>  #endif
> +#ifdef CONFIG_XPFO
> +       &page_xpfo_ops,
> +#endif
>  };
>
>  static unsigned long total_usage;
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> new file mode 100644
> index 000000000000..8e3a6a694b6a
> --- /dev/null
> +++ b/mm/xpfo.c
> @@ -0,0 +1,206 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/page_ext.h>
> +#include <linux/xpfo.h>
> +
> +#include <asm/tlbflush.h>
> +
> +DEFINE_STATIC_KEY_FALSE(xpfo_inited);
> +
> +static bool need_xpfo(void)
> +{
> +       return true;
> +}
> +
> +static void init_xpfo(void)
> +{
> +       printk(KERN_INFO "XPFO enabled\n");
> +       static_branch_enable(&xpfo_inited);
> +}
> +
> +struct page_ext_operations page_xpfo_ops = {
> +       .need = need_xpfo,
> +       .init = init_xpfo,
> +};
> +
> +/*
> + * Update a single kernel page table entry
> + */
> +static inline void set_kpte(struct page *page, unsigned long kaddr,
> +                           pgprot_t prot) {
> +       unsigned int level;
> +       pte_t *kpte = lookup_address(kaddr, &level);
> +
> +       /* We only support 4k pages for now */
> +       BUG_ON(!kpte || level != PG_LEVEL_4K);
> +
> +       set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> +}
> +
> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> +{
> +       int i, flush_tlb = 0;
> +       struct page_ext *page_ext;
> +       unsigned long kaddr;
> +
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return;
> +
> +       for (i = 0; i < (1 << order); i++)  {
> +               page_ext = lookup_page_ext(page + i);
> +
> +               BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +
> +               /* Initialize the map lock and map counter */
> +               if (!page_ext->inited) {
> +                       spin_lock_init(&page_ext->maplock);
> +                       atomic_set(&page_ext->mapcount, 0);
> +                       page_ext->inited = 1;
> +               }
> +               BUG_ON(atomic_read(&page_ext->mapcount));
> +
> +               if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> +                       /*
> +                        * Flush the TLB if the page was previously allocated
> +                        * to the kernel.
> +                        */
> +                       if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
> +                                              &page_ext->flags))
> +                               flush_tlb = 1;
> +               } else {
> +                       /* Tag the page as a kernel page */
> +                       set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +               }
> +       }
> +
> +       if (flush_tlb) {
> +               kaddr = (unsigned long)page_address(page);
> +               flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> +                                      PAGE_SIZE);
> +       }
> +}
> +
> +void xpfo_free_page(struct page *page, int order)
> +{
> +       int i;
> +       struct page_ext *page_ext;
> +       unsigned long kaddr;
> +
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return;
> +
> +       for (i = 0; i < (1 << order); i++) {
> +               page_ext = lookup_page_ext(page + i);
> +
> +               if (!page_ext->inited) {
> +                       /*
> +                        * The page was allocated before page_ext was
> +                        * initialized, so it is a kernel page and it needs to
> +                        * be tagged accordingly.
> +                        */
> +                       set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +                       continue;
> +               }
> +
> +               /*
> +                * Map the page back into the kernel if it was previously
> +                * allocated to user space.
> +                */
> +               if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
> +                                      &page_ext->flags)) {
> +                       kaddr = (unsigned long)page_address(page + i);
> +                       set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> +               }
> +       }
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +       struct page_ext *page_ext;
> +       unsigned long flags;
> +
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return;
> +
> +       page_ext = lookup_page_ext(page);
> +
> +       /*
> +        * The page was allocated before page_ext was initialized (which means
> +        * it's a kernel page) or it's allocated to the kernel, so nothing to
> +        * do.
> +        */
> +       if (!page_ext->inited ||
> +           test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +               return;
> +
> +       spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +       /*
> +        * The page was previously allocated to user space, so map it back
> +        * into the kernel. No TLB flush required.
> +        */
> +       if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
> +           test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
> +               set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +       spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +       struct page_ext *page_ext;
> +       unsigned long flags;
> +
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return;
> +
> +       page_ext = lookup_page_ext(page);
> +
> +       /*
> +        * The page was allocated before page_ext was initialized (which means
> +        * it's a kernel page) or it's allocated to the kernel, so nothing to
> +        * do.
> +        */
> +       if (!page_ext->inited ||
> +           test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +               return;
> +
> +       spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +       /*
> +        * The page is to be allocated back to user space, so unmap it from the
> +        * kernel, flush the TLB and tag it as a user page.
> +        */
> +       if (atomic_dec_return(&page_ext->mapcount) == 0) {
> +               BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +               set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
> +               set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +               __flush_tlb_one((unsigned long)kaddr);
> +       }
> +
> +       spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);
> +
> +inline bool xpfo_page_is_unmapped(struct page *page)
> +{
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return false;
> +
> +       return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
> +}
> +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> diff --git a/security/Kconfig b/security/Kconfig
> index 118f4549404e..4502e15c8419 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -6,6 +6,25 @@ menu "Security options"
>
>  source security/keys/Kconfig
>
> +config ARCH_SUPPORTS_XPFO
> +       bool

Can you include a "help" section here to describe what requirements an
architecture needs to meet to support XPFO? See HAVE_ARCH_SECCOMP_FILTER
and HAVE_ARCH_VMAP_STACK for some examples.
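[Editorial note: for illustration, a help section along these lines would cover it — the wording is a suggestion inferred from what mm/xpfo.c requires of an architecture, not text from the posted patch:]

```kconfig
config ARCH_SUPPORTS_XPFO
	bool
	help
	  An arch should select this symbol if it can support XPFO.
	  It must be able to:
	    - look up and atomically rewrite the permission bits of a
	      single kernel direct-map PTE (see set_kpte() in mm/xpfo.c)
	    - flush the TLB for an arbitrary kernel address range
	  and its direct map must use 4k pages when XPFO is enabled.
```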

> +config XPFO
> +       bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +       default n
> +       depends on ARCH_SUPPORTS_XPFO
> +       select PAGE_EXTENSION
> +       help
> +         This option offers protection against 'ret2dir' kernel attacks.
> +         When enabled, every time a page frame is allocated to user space, it
> +         is unmapped from the direct mapped RAM region in kernel space
> +         (physmap). Similarly, when a page frame is freed/reclaimed, it is
> +         mapped back to physmap.
> +
> +         There is a slight performance impact when this option is enabled.
> +
> +         If in doubt, say "N".
> +
>  config SECURITY_DMESG_RESTRICT
>         bool "Restrict unprivileged access to the kernel syslog"
>         default n
> --
> 2.10.1
>

I've added these patches to my kspp tree on kernel.org, so it should
get some 0-day testing now...

Thanks!

-Kees

-- 
Kees Cook
Nexus Security

^ permalink raw reply	[flat|nested] 93+ messages in thread

> +                       kaddr = (unsigned long)page_address(page + i);
> +                       set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> +               }
> +       }
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +       struct page_ext *page_ext;
> +       unsigned long flags;
> +
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return;
> +
> +       page_ext = lookup_page_ext(page);
> +
> +       /*
> +        * The page was allocated before page_ext was initialized (which means
> +        * it's a kernel page) or it's allocated to the kernel, so nothing to
> +        * do.
> +        */
> +       if (!page_ext->inited ||
> +           test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +               return;
> +
> +       spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +       /*
> +        * The page was previously allocated to user space, so map it back
> +        * into the kernel. No TLB flush required.
> +        */
> +       if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
> +           test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
> +               set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +       spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +       struct page_ext *page_ext;
> +       unsigned long flags;
> +
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return;
> +
> +       page_ext = lookup_page_ext(page);
> +
> +       /*
> +        * The page was allocated before page_ext was initialized (which means
> +        * it's a kernel page) or it's allocated to the kernel, so nothing to
> +        * do.
> +        */
> +       if (!page_ext->inited ||
> +           test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +               return;
> +
> +       spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +       /*
> +        * The page is to be allocated back to user space, so unmap it from the
> +        * kernel, flush the TLB and tag it as a user page.
> +        */
> +       if (atomic_dec_return(&page_ext->mapcount) == 0) {
> +               BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +               set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
> +               set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +               __flush_tlb_one((unsigned long)kaddr);
> +       }
> +
> +       spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);
> +
> +inline bool xpfo_page_is_unmapped(struct page *page)
> +{
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return false;
> +
> +       return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
> +}
> +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> diff --git a/security/Kconfig b/security/Kconfig
> index 118f4549404e..4502e15c8419 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -6,6 +6,25 @@ menu "Security options"
>
>  source security/keys/Kconfig
>
> +config ARCH_SUPPORTS_XPFO
> +       bool

Can you include a "help" section here to describe what requirements an
architecture needs to support XPFO? See HAVE_ARCH_SECCOMP_FILTER and
HAVE_ARCH_VMAP_STACK or some examples.

> +config XPFO
> +       bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +       default n
> +       depends on ARCH_SUPPORTS_XPFO
> +       select PAGE_EXTENSION
> +       help
> +         This option offers protection against 'ret2dir' kernel attacks.
> +         When enabled, every time a page frame is allocated to user space, it
> +         is unmapped from the direct mapped RAM region in kernel space
> +         (physmap). Similarly, when a page frame is freed/reclaimed, it is
> +         mapped back to physmap.
> +
> +         There is a slight performance impact when this option is enabled.
> +
> +         If in doubt, say "N".
> +
>  config SECURITY_DMESG_RESTRICT
>         bool "Restrict unprivileged access to the kernel syslog"
>         default n
> --
> 2.10.1
>

I've added these patches to my kspp tree on kernel.org, so it should
get some 0-day testing now...

Thanks!

-Kees

-- 
Kees Cook
Nexus Security

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [kernel-hardening] Re: [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-11-10 19:24           ` Kees Cook
  0 siblings, 0 replies; 93+ messages in thread
From: Kees Cook @ 2016-11-10 19:24 UTC (permalink / raw)
  To: Juerg Haefliger; +Cc: LKML, Linux-MM, kernel-hardening, linux-x86_64, vpk

On Fri, Nov 4, 2016 at 7:45 AM, Juerg Haefliger <juerg.haefliger@hpe.com> wrote:
> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userspace, unless explicitly requested by the
> kernel. Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.
>
> Additional fields in the page_ext struct are used for XPFO housekeeping.
> Specifically, two flags to distinguish user vs. kernel pages and to tag
> unmapped pages, a reference counter to balance kmap/kunmap operations,
> and a lock to serialize access to the XPFO fields.
>
> Known issues/limitations:
>   - Only supports x86-64 (for now)
>   - Only supports 4k pages (for now)
>   - There are most likely some legitimate use cases where the kernel needs
>     to access userspace which need to be made XPFO-aware
>   - Performance penalty
>
> Reference paper by the original patch authors:
>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Would it be possible to create an lkdtm test that can exercise this protection?

> Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
> ---
>  arch/x86/Kconfig         |   3 +-
>  arch/x86/mm/init.c       |   2 +-
>  drivers/ata/libata-sff.c |   4 +-
>  include/linux/highmem.h  |  15 +++-
>  include/linux/page_ext.h |   7 ++
>  include/linux/xpfo.h     |  39 +++++++++
>  lib/swiotlb.c            |   3 +-
>  mm/Makefile              |   1 +
>  mm/page_alloc.c          |   2 +
>  mm/page_ext.c            |   4 +
>  mm/xpfo.c                | 206 +++++++++++++++++++++++++++++++++++++++++++++++
>  security/Kconfig         |  19 +++++
>  12 files changed, 298 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/xpfo.h
>  create mode 100644 mm/xpfo.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index bada636d1065..38b334f8fde5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -165,6 +165,7 @@ config X86
>         select HAVE_STACK_VALIDATION            if X86_64
>         select ARCH_USES_HIGH_VMA_FLAGS         if X86_INTEL_MEMORY_PROTECTION_KEYS
>         select ARCH_HAS_PKEYS                   if X86_INTEL_MEMORY_PROTECTION_KEYS
> +       select ARCH_SUPPORTS_XPFO               if X86_64
>
>  config INSTRUCTION_DECODER
>         def_bool y
> @@ -1361,7 +1362,7 @@ config ARCH_DMA_ADDR_T_64BIT
>
>  config X86_DIRECT_GBPAGES
>         def_bool y
> -       depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
> +       depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>         ---help---
>           Certain kernel features effectively disable kernel
>           linear 1 GB mappings (even if the CPU otherwise
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 22af912d66d2..a6fafbae02bb 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -161,7 +161,7 @@ static int page_size_mask;
>
>  static void __init probe_page_size_mask(void)
>  {
> -#if !defined(CONFIG_KMEMCHECK)
> +#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
>         /*
>          * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
>          * use small pages.
> diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
> index 051b6158d1b7..58af734be25d 100644
> --- a/drivers/ata/libata-sff.c
> +++ b/drivers/ata/libata-sff.c
> @@ -715,7 +715,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
>
>         DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>
> -       if (PageHighMem(page)) {
> +       if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>                 unsigned long flags;
>
>                 /* FIXME: use a bounce buffer */
> @@ -860,7 +860,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
>
>         DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>
> -       if (PageHighMem(page)) {
> +       if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>                 unsigned long flags;
>
>                 /* FIXME: use bounce buffer */
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index bb3f3297062a..7a17c166532f 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -7,6 +7,7 @@
>  #include <linux/mm.h>
>  #include <linux/uaccess.h>
>  #include <linux/hardirq.h>
> +#include <linux/xpfo.h>
>
>  #include <asm/cacheflush.h>
>
> @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
>  #ifndef ARCH_HAS_KMAP
>  static inline void *kmap(struct page *page)
>  {
> +       void *kaddr;
> +
>         might_sleep();
> -       return page_address(page);
> +       kaddr = page_address(page);
> +       xpfo_kmap(kaddr, page);
> +       return kaddr;
>  }
>
>  static inline void kunmap(struct page *page)
>  {
> +       xpfo_kunmap(page_address(page), page);
>  }
>
>  static inline void *kmap_atomic(struct page *page)
>  {
> +       void *kaddr;
> +
>         preempt_disable();
>         pagefault_disable();
> -       return page_address(page);
> +       kaddr = page_address(page);
> +       xpfo_kmap(kaddr, page);
> +       return kaddr;
>  }
>  #define kmap_atomic_prot(page, prot)   kmap_atomic(page)
>
>  static inline void __kunmap_atomic(void *addr)
>  {
> +       xpfo_kunmap(addr, virt_to_page(addr));
>         pagefault_enable();
>         preempt_enable();
>  }
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index 9298c393ddaa..0e451a42e5a3 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -29,6 +29,8 @@ enum page_ext_flags {
>         PAGE_EXT_DEBUG_POISON,          /* Page is poisoned */
>         PAGE_EXT_DEBUG_GUARD,
>         PAGE_EXT_OWNER,
> +       PAGE_EXT_XPFO_KERNEL,           /* Page is a kernel page */
> +       PAGE_EXT_XPFO_UNMAPPED,         /* Page is unmapped */
>  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>         PAGE_EXT_YOUNG,
>         PAGE_EXT_IDLE,
> @@ -44,6 +46,11 @@ enum page_ext_flags {
>   */
>  struct page_ext {
>         unsigned long flags;
> +#ifdef CONFIG_XPFO
> +       int inited;             /* Map counter and lock initialized */
> +       atomic_t mapcount;      /* Counter for balancing map/unmap requests */
> +       spinlock_t maplock;     /* Lock to serialize map/unmap requests */
> +#endif
>  };
>
>  extern void pgdat_page_ext_init(struct pglist_data *pgdat);
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> new file mode 100644
> index 000000000000..77187578ca33
> --- /dev/null
> +++ b/include/linux/xpfo.h
> @@ -0,0 +1,39 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_XPFO_H
> +#define _LINUX_XPFO_H
> +
> +#ifdef CONFIG_XPFO
> +
> +extern struct page_ext_operations page_xpfo_ops;
> +
> +extern void xpfo_kmap(void *kaddr, struct page *page);
> +extern void xpfo_kunmap(void *kaddr, struct page *page);
> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> +extern void xpfo_free_page(struct page *page, int order);
> +
> +extern bool xpfo_page_is_unmapped(struct page *page);
> +
> +#else /* !CONFIG_XPFO */
> +
> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> +static inline void xpfo_free_page(struct page *page, int order) { }
> +
> +static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> +
> +#endif /* CONFIG_XPFO */
> +
> +#endif /* _LINUX_XPFO_H */
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 22e13a0e19d7..455eff44604e 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>  {
>         unsigned long pfn = PFN_DOWN(orig_addr);
>         unsigned char *vaddr = phys_to_virt(tlb_addr);
> +       struct page *page = pfn_to_page(pfn);
>
> -       if (PageHighMem(pfn_to_page(pfn))) {
> +       if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>                 /* The buffer does not have a mapping.  Map it in and copy */
>                 unsigned int offset = orig_addr & ~PAGE_MASK;
>                 char *buffer;
> diff --git a/mm/Makefile b/mm/Makefile
> index 295bd7a9f76b..175680f516aa 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
>  obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
>  obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
>  obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> +obj-$(CONFIG_XPFO) += xpfo.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8fd42aa7c4bd..100e80e008e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1045,6 +1045,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>         kernel_poison_pages(page, 1 << order, 0);
>         kernel_map_pages(page, 1 << order, 0);
>         kasan_free_pages(page, order);
> +       xpfo_free_page(page, order);
>
>         return true;
>  }
> @@ -1745,6 +1746,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>         kernel_map_pages(page, 1 << order, 1);
>         kernel_poison_pages(page, 1 << order, 1);
>         kasan_alloc_pages(page, order);
> +       xpfo_alloc_page(page, order, gfp_flags);
>         set_page_owner(page, order, gfp_flags);
>  }
>
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 121dcffc4ec1..ba6dbcacc2db 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -7,6 +7,7 @@
>  #include <linux/kmemleak.h>
>  #include <linux/page_owner.h>
>  #include <linux/page_idle.h>
> +#include <linux/xpfo.h>
>
>  /*
>   * struct page extension
> @@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
>  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>         &page_idle_ops,
>  #endif
> +#ifdef CONFIG_XPFO
> +       &page_xpfo_ops,
> +#endif
>  };
>
>  static unsigned long total_usage;
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> new file mode 100644
> index 000000000000..8e3a6a694b6a
> --- /dev/null
> +++ b/mm/xpfo.c
> @@ -0,0 +1,206 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/page_ext.h>
> +#include <linux/xpfo.h>
> +
> +#include <asm/tlbflush.h>
> +
> +DEFINE_STATIC_KEY_FALSE(xpfo_inited);
> +
> +static bool need_xpfo(void)
> +{
> +       return true;
> +}
> +
> +static void init_xpfo(void)
> +{
> +       printk(KERN_INFO "XPFO enabled\n");
> +       static_branch_enable(&xpfo_inited);
> +}
> +
> +struct page_ext_operations page_xpfo_ops = {
> +       .need = need_xpfo,
> +       .init = init_xpfo,
> +};
> +
> +/*
> + * Update a single kernel page table entry
> + */
> +static inline void set_kpte(struct page *page, unsigned long kaddr,
> +                           pgprot_t prot) {
> +       unsigned int level;
> +       pte_t *kpte = lookup_address(kaddr, &level);
> +
> +       /* We only support 4k pages for now */
> +       BUG_ON(!kpte || level != PG_LEVEL_4K);
> +
> +       set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> +}
> +
> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> +{
> +       int i, flush_tlb = 0;
> +       struct page_ext *page_ext;
> +       unsigned long kaddr;
> +
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return;
> +
> +       for (i = 0; i < (1 << order); i++)  {
> +               page_ext = lookup_page_ext(page + i);
> +
> +               BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +
> +               /* Initialize the map lock and map counter */
> +               if (!page_ext->inited) {
> +                       spin_lock_init(&page_ext->maplock);
> +                       atomic_set(&page_ext->mapcount, 0);
> +                       page_ext->inited = 1;
> +               }
> +               BUG_ON(atomic_read(&page_ext->mapcount));
> +
> +               if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> +                       /*
> +                        * Flush the TLB if the page was previously allocated
> +                        * to the kernel.
> +                        */
> +                       if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
> +                                              &page_ext->flags))
> +                               flush_tlb = 1;
> +               } else {
> +                       /* Tag the page as a kernel page */
> +                       set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +               }
> +       }
> +
> +       if (flush_tlb) {
> +               kaddr = (unsigned long)page_address(page);
> +               flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> +                                      PAGE_SIZE);
> +       }
> +}
> +
> +void xpfo_free_page(struct page *page, int order)
> +{
> +       int i;
> +       struct page_ext *page_ext;
> +       unsigned long kaddr;
> +
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return;
> +
> +       for (i = 0; i < (1 << order); i++) {
> +               page_ext = lookup_page_ext(page + i);
> +
> +               if (!page_ext->inited) {
> +                       /*
> +                        * The page was allocated before page_ext was
> +                        * initialized, so it is a kernel page and it needs to
> +                        * be tagged accordingly.
> +                        */
> +                       set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +                       continue;
> +               }
> +
> +               /*
> +                * Map the page back into the kernel if it was previously
> +                * allocated to user space.
> +                */
> +               if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
> +                                      &page_ext->flags)) {
> +                       kaddr = (unsigned long)page_address(page + i);
> +                       set_kpte(page + i, kaddr, __pgprot(__PAGE_KERNEL));
> +               }
> +       }
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +       struct page_ext *page_ext;
> +       unsigned long flags;
> +
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return;
> +
> +       page_ext = lookup_page_ext(page);
> +
> +       /*
> +        * The page was allocated before page_ext was initialized (which means
> +        * it's a kernel page) or it's allocated to the kernel, so nothing to
> +        * do.
> +        */
> +       if (!page_ext->inited ||
> +           test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +               return;
> +
> +       spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +       /*
> +        * The page was previously allocated to user space, so map it back
> +        * into the kernel. No TLB flush required.
> +        */
> +       if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
> +           test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
> +               set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +       spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +       struct page_ext *page_ext;
> +       unsigned long flags;
> +
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return;
> +
> +       page_ext = lookup_page_ext(page);
> +
> +       /*
> +        * The page was allocated before page_ext was initialized (which means
> +        * it's a kernel page) or it's allocated to the kernel, so nothing to
> +        * do.
> +        */
> +       if (!page_ext->inited ||
> +           test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +               return;
> +
> +       spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +       /*
> +        * The page is to be allocated back to user space, so unmap it from the
> +        * kernel, flush the TLB and tag it as a user page.
> +        */
> +       if (atomic_dec_return(&page_ext->mapcount) == 0) {
> +               BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +               set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
> +               set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +               __flush_tlb_one((unsigned long)kaddr);
> +       }
> +
> +       spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);
> +
> +inline bool xpfo_page_is_unmapped(struct page *page)
> +{
> +       if (!static_branch_unlikely(&xpfo_inited))
> +               return false;
> +
> +       return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
> +}
> +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> diff --git a/security/Kconfig b/security/Kconfig
> index 118f4549404e..4502e15c8419 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -6,6 +6,25 @@ menu "Security options"
>
>  source security/keys/Kconfig
>
> +config ARCH_SUPPORTS_XPFO
> +       bool

Can you include a "help" section here to describe what requirements an
architecture needs to support XPFO? See HAVE_ARCH_SECCOMP_FILTER and
HAVE_ARCH_VMAP_STACK or some examples.
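Something along these lines, modeled on those examples, is roughly what I had
in mind (the wording is just a sketch, not a requirements list I've audited):

```kconfig
config ARCH_SUPPORTS_XPFO
	bool
	help
	  An architecture should select this if it can support XPFO. To do
	  so it must be able to look up and modify individual kernel page
	  table entries in the direct (physmap) mapping, flush the TLB for
	  a single kernel address, and map the direct mapping with 4k pages
	  when XPFO is enabled.
```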

> +config XPFO
> +       bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +       default n
> +       depends on ARCH_SUPPORTS_XPFO
> +       select PAGE_EXTENSION
> +       help
> +         This option offers protection against 'ret2dir' kernel attacks.
> +         When enabled, every time a page frame is allocated to user space, it
> +         is unmapped from the direct mapped RAM region in kernel space
> +         (physmap). Similarly, when a page frame is freed/reclaimed, it is
> +         mapped back to physmap.
> +
> +         There is a slight performance impact when this option is enabled.
> +
> +         If in doubt, say "N".
> +
>  config SECURITY_DMESG_RESTRICT
>         bool "Restrict unprivileged access to the kernel syslog"
>         default n
> --
> 2.10.1
>

I've added these patches to my kspp tree on kernel.org, so it should
get some 0-day testing now...

Thanks!

-Kees

-- 
Kees Cook
Nexus Security

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-11-10 19:11           ` Kees Cook
@ 2016-11-15 11:15             ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-11-15 11:15 UTC (permalink / raw)
  To: Kees Cook; +Cc: LKML, Linux-MM, kernel-hardening, linux-x86_64, vpk



Sorry for the late reply, I just found your email in my cluttered inbox.

On 11/10/2016 08:11 PM, Kees Cook wrote:
> On Fri, Nov 4, 2016 at 7:45 AM, Juerg Haefliger <juerg.haefliger@hpe.com> wrote:
>> This patch adds support for XPFO which protects against 'ret2dir' kernel
>> attacks. The basic idea is to enforce exclusive ownership of page frames
>> by either the kernel or userspace, unless explicitly requested by the
>> kernel. Whenever a page destined for userspace is allocated, it is
>> unmapped from physmap (the kernel's page table). When such a page is
>> reclaimed from userspace, it is mapped back to physmap.
>>
>> Additional fields in the page_ext struct are used for XPFO housekeeping.
>> Specifically, two flags to distinguish user vs. kernel pages and to tag
>> unmapped pages, a reference counter to balance kmap/kunmap operations,
>> and a lock to serialize access to the XPFO fields.
> 
> Thanks for keeping on this! I'd really like to see it land and then
> get more architectures to support it.

Good to hear :-)


>> Known issues/limitations:
>>   - Only supports x86-64 (for now)
>>   - Only supports 4k pages (for now)
>>   - There are most likely some legitimate use cases where the kernel needs
>>     to access userspace which need to be made XPFO-aware
>>   - Performance penalty
> 
> In the Kconfig you say "slight", but I'm curious what kinds of
> benchmarks you've done and if there's a more specific cost we can
> declare, just to give people more of an idea what the hit looks like?
> (What workloads would trigger a lot of XPFO unmapping, for example?)

That 'slight' wording is based on the performance numbers published in the referenced paper.

So far I've only run kernel compilation tests. For that workload, the big performance hit comes from
disabling >4k page sizes (around 10%). Adding XPFO on top causes 'only' another 0.5% performance
penalty. I'm currently looking into adding support for larger page sizes to see what the real impact
is and then generate some more relevant numbers.

...Juerg


> Thanks!
> 
> -Kees
> 




^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-11-15 11:15             ` Juerg Haefliger
  0 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-11-15 11:15 UTC (permalink / raw)
  To: Kees Cook; +Cc: LKML, Linux-MM, kernel-hardening, linux-x86_64, vpk


[-- Attachment #1.1: Type: text/plain, Size: 2123 bytes --]

Sorry for the late reply, I just found your email in my cluttered inbox.

On 11/10/2016 08:11 PM, Kees Cook wrote:
> On Fri, Nov 4, 2016 at 7:45 AM, Juerg Haefliger <juerg.haefliger@hpe.com> wrote:
>> This patch adds support for XPFO which protects against 'ret2dir' kernel
>> attacks. The basic idea is to enforce exclusive ownership of page frames
>> by either the kernel or userspace, unless explicitly requested by the
>> kernel. Whenever a page destined for userspace is allocated, it is
>> unmapped from physmap (the kernel's page table). When such a page is
>> reclaimed from userspace, it is mapped back to physmap.
>>
>> Additional fields in the page_ext struct are used for XPFO housekeeping.
>> Specifically two flags to distinguish user vs. kernel pages and to tag
>> unmapped pages and a reference counter to balance kmap/kunmap operations
>> and a lock to serialize access to the XPFO fields.
> 
> Thanks for keeping on this! I'd really like to see it land and then
> get more architectures to support it.

Good to hear :-)


>> Known issues/limitations:
>>   - Only supports x86-64 (for now)
>>   - Only supports 4k pages (for now)
>>   - There are most likely some legitimate uses cases where the kernel needs
>>     to access userspace which need to be made XPFO-aware
>>   - Performance penalty
> 
> In the Kconfig you say "slight", but I'm curious what kinds of
> benchmarks you've done and if there's a more specific cost we can
> declare, just to give people more of an idea what the hit looks like?
> (What workloads would trigger a lot of XPFO unmapping, for example?)

That 'slight' wording is based on the performance numbers published in the referenced paper.

So far I've only run kernel compilation tests. For that workload, the big performance hit comes from
disabling >4k page sizes (around 10%). Adding XPFO on top causes 'only' another 0.5% performance
penalty. I'm currently looking into adding support for larger page sizes to see what the real impact
is and then generate some more relevant numbers.

...Juerg


> Thanks!
> 
> -Kees
> 




* Re: [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-11-10 19:24           ` Kees Cook
  (?)
@ 2016-11-15 11:18             ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-11-15 11:18 UTC (permalink / raw)
  To: Kees Cook; +Cc: LKML, Linux-MM, kernel-hardening, linux-x86_64, vpk



On 11/10/2016 08:24 PM, Kees Cook wrote:
> On Fri, Nov 4, 2016 at 7:45 AM, Juerg Haefliger <juerg.haefliger@hpe.com> wrote:
>> This patch adds support for XPFO which protects against 'ret2dir' kernel
>> attacks. The basic idea is to enforce exclusive ownership of page frames
>> by either the kernel or userspace, unless explicitly requested by the
>> kernel. Whenever a page destined for userspace is allocated, it is
>> unmapped from physmap (the kernel's page table). When such a page is
>> reclaimed from userspace, it is mapped back to physmap.
>>
>> Additional fields in the page_ext struct are used for XPFO housekeeping.
>> Specifically two flags to distinguish user vs. kernel pages and to tag
>> unmapped pages and a reference counter to balance kmap/kunmap operations
>> and a lock to serialize access to the XPFO fields.
>>
>> Known issues/limitations:
>>   - Only supports x86-64 (for now)
>>   - Only supports 4k pages (for now)
>>   - There are most likely some legitimate use cases where the kernel needs
>>     to access userspace which need to be made XPFO-aware
>>   - Performance penalty
>>
>> Reference paper by the original patch authors:
>>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
> 
> Would it be possible to create an lkdtm test that can exercise this protection?

I'll look into it.


>> diff --git a/security/Kconfig b/security/Kconfig
>> index 118f4549404e..4502e15c8419 100644
>> --- a/security/Kconfig
>> +++ b/security/Kconfig
>> @@ -6,6 +6,25 @@ menu "Security options"
>>
>>  source security/keys/Kconfig
>>
>> +config ARCH_SUPPORTS_XPFO
>> +       bool
> 
> Can you include a "help" section here to describe what requirements an
> architecture needs to support XPFO? See HAVE_ARCH_SECCOMP_FILTER and
> HAVE_ARCH_VMAP_STACK or some examples.

Will do.
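A help section along the following lines might cover it; the wording here is a guess, modeled on the HAVE_ARCH_VMAP_STACK precedent:

```kconfig
config ARCH_SUPPORTS_XPFO
	bool
	help
	  An architecture should select this if it can set and clear
	  individual kernel page table entries in the direct map (a
	  set_kpte()-style helper) and flush the corresponding TLB
	  entries, so that 4k frames handed to userspace can be
	  unmapped from and remapped into physmap.
```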


>> +config XPFO
>> +       bool "Enable eXclusive Page Frame Ownership (XPFO)"
>> +       default n
>> +       depends on ARCH_SUPPORTS_XPFO
>> +       select PAGE_EXTENSION
>> +       help
>> +         This option offers protection against 'ret2dir' kernel attacks.
>> +         When enabled, every time a page frame is allocated to user space, it
>> +         is unmapped from the direct mapped RAM region in kernel space
>> +         (physmap). Similarly, when a page frame is freed/reclaimed, it is
>> +         mapped back to physmap.
>> +
>> +         There is a slight performance impact when this option is enabled.
>> +
>> +         If in doubt, say "N".
>> +
>>  config SECURITY_DMESG_RESTRICT
>>         bool "Restrict unprivileged access to the kernel syslog"
>>         default n
> 
> I've added these patches to my kspp tree on kernel.org, so it should
> get some 0-day testing now...

Very good. Thanks!


> Thanks!

Appreciate the feedback.

...Juerg


> -Kees
> 





* Re: [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-11-04 14:45         ` Juerg Haefliger
  (?)
@ 2016-11-24 10:56           ` AKASHI Takahiro
  -1 siblings, 0 replies; 93+ messages in thread
From: AKASHI Takahiro @ 2016-11-24 10:56 UTC (permalink / raw)
  To: Juerg Haefliger
  Cc: linux-kernel, linux-mm, kernel-hardening, linux-x86_64, vpk

Hi,

I'm trying to give it a spin on arm64, but ...

On Fri, Nov 04, 2016 at 03:45:33PM +0100, Juerg Haefliger wrote:
> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userspace, unless explicitly requested by the
> kernel. Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.
> 
> Additional fields in the page_ext struct are used for XPFO housekeeping.
> Specifically two flags to distinguish user vs. kernel pages and to tag
> unmapped pages and a reference counter to balance kmap/kunmap operations
> and a lock to serialize access to the XPFO fields.
> 
> Known issues/limitations:
>   - Only supports x86-64 (for now)
>   - Only supports 4k pages (for now)
>   - There are most likely some legitimate use cases where the kernel needs
>     to access userspace which need to be made XPFO-aware
>   - Performance penalty
> 
> Reference paper by the original patch authors:
>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
> 
> Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
> ---
>  arch/x86/Kconfig         |   3 +-
>  arch/x86/mm/init.c       |   2 +-
>  drivers/ata/libata-sff.c |   4 +-
>  include/linux/highmem.h  |  15 +++-
>  include/linux/page_ext.h |   7 ++
>  include/linux/xpfo.h     |  39 +++++++++
>  lib/swiotlb.c            |   3 +-
>  mm/Makefile              |   1 +
>  mm/page_alloc.c          |   2 +
>  mm/page_ext.c            |   4 +
>  mm/xpfo.c                | 206 +++++++++++++++++++++++++++++++++++++++++++++++
>  security/Kconfig         |  19 +++++
>  12 files changed, 298 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/xpfo.h
>  create mode 100644 mm/xpfo.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index bada636d1065..38b334f8fde5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -165,6 +165,7 @@ config X86
>  	select HAVE_STACK_VALIDATION		if X86_64
>  	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
>  	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
> +	select ARCH_SUPPORTS_XPFO		if X86_64
>  
>  config INSTRUCTION_DECODER
>  	def_bool y
> @@ -1361,7 +1362,7 @@ config ARCH_DMA_ADDR_T_64BIT
>  
>  config X86_DIRECT_GBPAGES
>  	def_bool y
> -	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
> +	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>  	---help---
>  	  Certain kernel features effectively disable kernel
>  	  linear 1 GB mappings (even if the CPU otherwise
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 22af912d66d2..a6fafbae02bb 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -161,7 +161,7 @@ static int page_size_mask;
>  
>  static void __init probe_page_size_mask(void)
>  {
> -#if !defined(CONFIG_KMEMCHECK)
> +#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
>  	/*
>  	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
>  	 * use small pages.
> diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
> index 051b6158d1b7..58af734be25d 100644
> --- a/drivers/ata/libata-sff.c
> +++ b/drivers/ata/libata-sff.c
> @@ -715,7 +715,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
>  
>  	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>  
> -	if (PageHighMem(page)) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>  		unsigned long flags;
>  
>  		/* FIXME: use a bounce buffer */
> @@ -860,7 +860,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
>  
>  	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>  
> -	if (PageHighMem(page)) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>  		unsigned long flags;
>  
>  		/* FIXME: use bounce buffer */
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index bb3f3297062a..7a17c166532f 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -7,6 +7,7 @@
>  #include <linux/mm.h>
>  #include <linux/uaccess.h>
>  #include <linux/hardirq.h>
> +#include <linux/xpfo.h>
>  
>  #include <asm/cacheflush.h>
>  
> @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
>  #ifndef ARCH_HAS_KMAP
>  static inline void *kmap(struct page *page)
>  {
> +	void *kaddr;
> +
>  	might_sleep();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }
>  
>  static inline void kunmap(struct page *page)
>  {
> +	xpfo_kunmap(page_address(page), page);
>  }
>  
>  static inline void *kmap_atomic(struct page *page)
>  {
> +	void *kaddr;
> +
>  	preempt_disable();
>  	pagefault_disable();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }
>  #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>  
>  static inline void __kunmap_atomic(void *addr)
>  {
> +	xpfo_kunmap(addr, virt_to_page(addr));
>  	pagefault_enable();
>  	preempt_enable();
>  }
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index 9298c393ddaa..0e451a42e5a3 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -29,6 +29,8 @@ enum page_ext_flags {
>  	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
>  	PAGE_EXT_DEBUG_GUARD,
>  	PAGE_EXT_OWNER,
> +	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
> +	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
>  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>  	PAGE_EXT_YOUNG,
>  	PAGE_EXT_IDLE,
> @@ -44,6 +46,11 @@ enum page_ext_flags {
>   */
>  struct page_ext {
>  	unsigned long flags;
> +#ifdef CONFIG_XPFO
> +	int inited;		/* Map counter and lock initialized */
> +	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
> +	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
> +#endif
>  };
>  
>  extern void pgdat_page_ext_init(struct pglist_data *pgdat);
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> new file mode 100644
> index 000000000000..77187578ca33
> --- /dev/null
> +++ b/include/linux/xpfo.h
> @@ -0,0 +1,39 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_XPFO_H
> +#define _LINUX_XPFO_H
> +
> +#ifdef CONFIG_XPFO
> +
> +extern struct page_ext_operations page_xpfo_ops;
> +
> +extern void xpfo_kmap(void *kaddr, struct page *page);
> +extern void xpfo_kunmap(void *kaddr, struct page *page);
> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> +extern void xpfo_free_page(struct page *page, int order);
> +
> +extern bool xpfo_page_is_unmapped(struct page *page);
> +
> +#else /* !CONFIG_XPFO */
> +
> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> +static inline void xpfo_free_page(struct page *page, int order) { }
> +
> +static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> +
> +#endif /* CONFIG_XPFO */
> +
> +#endif /* _LINUX_XPFO_H */
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 22e13a0e19d7..455eff44604e 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>  {
>  	unsigned long pfn = PFN_DOWN(orig_addr);
>  	unsigned char *vaddr = phys_to_virt(tlb_addr);
> +	struct page *page = pfn_to_page(pfn);
>  
> -	if (PageHighMem(pfn_to_page(pfn))) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>  		/* The buffer does not have a mapping.  Map it in and copy */
>  		unsigned int offset = orig_addr & ~PAGE_MASK;
>  		char *buffer;
> diff --git a/mm/Makefile b/mm/Makefile
> index 295bd7a9f76b..175680f516aa 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
>  obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
>  obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
>  obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> +obj-$(CONFIG_XPFO) += xpfo.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8fd42aa7c4bd..100e80e008e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1045,6 +1045,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  	kernel_poison_pages(page, 1 << order, 0);
>  	kernel_map_pages(page, 1 << order, 0);
>  	kasan_free_pages(page, order);
> +	xpfo_free_page(page, order);
>  
>  	return true;
>  }
> @@ -1745,6 +1746,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  	kernel_map_pages(page, 1 << order, 1);
>  	kernel_poison_pages(page, 1 << order, 1);
>  	kasan_alloc_pages(page, order);
> +	xpfo_alloc_page(page, order, gfp_flags);
>  	set_page_owner(page, order, gfp_flags);
>  }
>  
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 121dcffc4ec1..ba6dbcacc2db 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -7,6 +7,7 @@
>  #include <linux/kmemleak.h>
>  #include <linux/page_owner.h>
>  #include <linux/page_idle.h>
> +#include <linux/xpfo.h>
>  
>  /*
>   * struct page extension
> @@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
>  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>  	&page_idle_ops,
>  #endif
> +#ifdef CONFIG_XPFO
> +	&page_xpfo_ops,
> +#endif
>  };
>  
>  static unsigned long total_usage;
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> new file mode 100644
> index 000000000000..8e3a6a694b6a
> --- /dev/null
> +++ b/mm/xpfo.c
> @@ -0,0 +1,206 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/page_ext.h>
> +#include <linux/xpfo.h>
> +
> +#include <asm/tlbflush.h>
> +
> +DEFINE_STATIC_KEY_FALSE(xpfo_inited);
> +
> +static bool need_xpfo(void)
> +{
> +	return true;
> +}
> +
> +static void init_xpfo(void)
> +{
> +	printk(KERN_INFO "XPFO enabled\n");
> +	static_branch_enable(&xpfo_inited);
> +}
> +
> +struct page_ext_operations page_xpfo_ops = {
> +	.need = need_xpfo,
> +	.init = init_xpfo,
> +};
> +
> +/*
> + * Update a single kernel page table entry
> + */
> +static inline void set_kpte(struct page *page, unsigned long kaddr,
> +			    pgprot_t prot) {
> +	unsigned int level;
> +	pte_t *kpte = lookup_address(kaddr, &level);
> +
> +	/* We only support 4k pages for now */
> +	BUG_ON(!kpte || level != PG_LEVEL_4K);
> +
> +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> +}

As lookup_address(), set_pte_atomic(), and PG_LEVEL_4K are all arch-specific,
would it be better to move the whole definition into the arch-specific code?

> +
> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> +{
> +	int i, flush_tlb = 0;
> +	struct page_ext *page_ext;
> +	unsigned long kaddr;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++)  {
> +		page_ext = lookup_page_ext(page + i);
> +
> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +
> +		/* Initialize the map lock and map counter */
> +		if (!page_ext->inited) {
> +			spin_lock_init(&page_ext->maplock);
> +			atomic_set(&page_ext->mapcount, 0);
> +			page_ext->inited = 1;
> +		}
> +		BUG_ON(atomic_read(&page_ext->mapcount));
> +
> +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> +			/*
> +			 * Flush the TLB if the page was previously allocated
> +			 * to the kernel.
> +			 */
> +			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
> +					       &page_ext->flags))
> +				flush_tlb = 1;
> +		} else {
> +			/* Tag the page as a kernel page */
> +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +		}
> +	}
> +
> +	if (flush_tlb) {
> +		kaddr = (unsigned long)page_address(page);
> +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> +				       PAGE_SIZE);
> +	}
> +}
> +
> +void xpfo_free_page(struct page *page, int order)
> +{
> +	int i;
> +	struct page_ext *page_ext;
> +	unsigned long kaddr;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		page_ext = lookup_page_ext(page + i);
> +
> +		if (!page_ext->inited) {
> +			/*
> +			 * The page was allocated before page_ext was
> +			 * initialized, so it is a kernel page and it needs to
> +			 * be tagged accordingly.
> +			 */
> +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +			continue;
> +		}
> +
> +		/*
> +		 * Map the page back into the kernel if it was previously
> +		 * allocated to user space.
> +		 */
> +		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
> +				       &page_ext->flags)) {
> +			kaddr = (unsigned long)page_address(page + i);
> +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));

Why not PAGE_KERNEL?

> +		}
> +	}
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	struct page_ext *page_ext;
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	page_ext = lookup_page_ext(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!page_ext->inited ||
> +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +		return;
> +
> +	spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +	/*
> +	 * The page was previously allocated to user space, so map it back
> +	 * into the kernel. No TLB flush required.
> +	 */
> +	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
> +	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	struct page_ext *page_ext;
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	page_ext = lookup_page_ext(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!page_ext->inited ||
> +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +		return;
> +
> +	spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +	/*
> +	 * The page is to be allocated back to user space, so unmap it from the
> +	 * kernel, flush the TLB and tag it as a user page.
> +	 */
> +	if (atomic_dec_return(&page_ext->mapcount) == 0) {
> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +		__flush_tlb_one((unsigned long)kaddr);

Again __flush_tlb_one() is x86-specific.
flush_tlb_kernel_range() instead?

Thanks,
-Takahiro AKASHI

> +	}
> +
> +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);
> +
> +inline bool xpfo_page_is_unmapped(struct page *page)
> +{
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return false;
> +
> +	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
> +}
> +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> diff --git a/security/Kconfig b/security/Kconfig
> index 118f4549404e..4502e15c8419 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -6,6 +6,25 @@ menu "Security options"
>  
>  source security/keys/Kconfig
>  
> +config ARCH_SUPPORTS_XPFO
> +	bool
> +
> +config XPFO
> +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +	default n
> +	depends on ARCH_SUPPORTS_XPFO
> +	select PAGE_EXTENSION
> +	help
> +	  This option offers protection against 'ret2dir' kernel attacks.
> +	  When enabled, every time a page frame is allocated to user space, it
> +	  is unmapped from the direct mapped RAM region in kernel space
> +	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
> +	  mapped back to physmap.
> +
> +	  There is a slight performance impact when this option is enabled.
> +
> +	  If in doubt, say "N".
> +
>  config SECURITY_DMESG_RESTRICT
>  	bool "Restrict unprivileged access to the kernel syslog"
>  	default n
> -- 
> 2.10.1
> 


>  		unsigned long flags;
>  
>  		/* FIXME: use bounce buffer */
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index bb3f3297062a..7a17c166532f 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -7,6 +7,7 @@
>  #include <linux/mm.h>
>  #include <linux/uaccess.h>
>  #include <linux/hardirq.h>
> +#include <linux/xpfo.h>
>  
>  #include <asm/cacheflush.h>
>  
> @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
>  #ifndef ARCH_HAS_KMAP
>  static inline void *kmap(struct page *page)
>  {
> +	void *kaddr;
> +
>  	might_sleep();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }
>  
>  static inline void kunmap(struct page *page)
>  {
> +	xpfo_kunmap(page_address(page), page);
>  }
>  
>  static inline void *kmap_atomic(struct page *page)
>  {
> +	void *kaddr;
> +
>  	preempt_disable();
>  	pagefault_disable();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }
>  #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>  
>  static inline void __kunmap_atomic(void *addr)
>  {
> +	xpfo_kunmap(addr, virt_to_page(addr));
>  	pagefault_enable();
>  	preempt_enable();
>  }
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index 9298c393ddaa..0e451a42e5a3 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -29,6 +29,8 @@ enum page_ext_flags {
>  	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
>  	PAGE_EXT_DEBUG_GUARD,
>  	PAGE_EXT_OWNER,
> +	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
> +	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
>  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>  	PAGE_EXT_YOUNG,
>  	PAGE_EXT_IDLE,
> @@ -44,6 +46,11 @@ enum page_ext_flags {
>   */
>  struct page_ext {
>  	unsigned long flags;
> +#ifdef CONFIG_XPFO
> +	int inited;		/* Map counter and lock initialized */
> +	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
> +	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
> +#endif
>  };
>  
>  extern void pgdat_page_ext_init(struct pglist_data *pgdat);
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> new file mode 100644
> index 000000000000..77187578ca33
> --- /dev/null
> +++ b/include/linux/xpfo.h
> @@ -0,0 +1,39 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_XPFO_H
> +#define _LINUX_XPFO_H
> +
> +#ifdef CONFIG_XPFO
> +
> +extern struct page_ext_operations page_xpfo_ops;
> +
> +extern void xpfo_kmap(void *kaddr, struct page *page);
> +extern void xpfo_kunmap(void *kaddr, struct page *page);
> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> +extern void xpfo_free_page(struct page *page, int order);
> +
> +extern bool xpfo_page_is_unmapped(struct page *page);
> +
> +#else /* !CONFIG_XPFO */
> +
> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> +static inline void xpfo_free_page(struct page *page, int order) { }
> +
> +static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> +
> +#endif /* CONFIG_XPFO */
> +
> +#endif /* _LINUX_XPFO_H */
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 22e13a0e19d7..455eff44604e 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>  {
>  	unsigned long pfn = PFN_DOWN(orig_addr);
>  	unsigned char *vaddr = phys_to_virt(tlb_addr);
> +	struct page *page = pfn_to_page(pfn);
>  
> -	if (PageHighMem(pfn_to_page(pfn))) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>  		/* The buffer does not have a mapping.  Map it in and copy */
>  		unsigned int offset = orig_addr & ~PAGE_MASK;
>  		char *buffer;
> diff --git a/mm/Makefile b/mm/Makefile
> index 295bd7a9f76b..175680f516aa 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
>  obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
>  obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
>  obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> +obj-$(CONFIG_XPFO) += xpfo.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8fd42aa7c4bd..100e80e008e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1045,6 +1045,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  	kernel_poison_pages(page, 1 << order, 0);
>  	kernel_map_pages(page, 1 << order, 0);
>  	kasan_free_pages(page, order);
> +	xpfo_free_page(page, order);
>  
>  	return true;
>  }
> @@ -1745,6 +1746,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  	kernel_map_pages(page, 1 << order, 1);
>  	kernel_poison_pages(page, 1 << order, 1);
>  	kasan_alloc_pages(page, order);
> +	xpfo_alloc_page(page, order, gfp_flags);
>  	set_page_owner(page, order, gfp_flags);
>  }
>  
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 121dcffc4ec1..ba6dbcacc2db 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -7,6 +7,7 @@
>  #include <linux/kmemleak.h>
>  #include <linux/page_owner.h>
>  #include <linux/page_idle.h>
> +#include <linux/xpfo.h>
>  
>  /*
>   * struct page extension
> @@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
>  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>  	&page_idle_ops,
>  #endif
> +#ifdef CONFIG_XPFO
> +	&page_xpfo_ops,
> +#endif
>  };
>  
>  static unsigned long total_usage;
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> new file mode 100644
> index 000000000000..8e3a6a694b6a
> --- /dev/null
> +++ b/mm/xpfo.c
> @@ -0,0 +1,206 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/page_ext.h>
> +#include <linux/xpfo.h>
> +
> +#include <asm/tlbflush.h>
> +
> +DEFINE_STATIC_KEY_FALSE(xpfo_inited);
> +
> +static bool need_xpfo(void)
> +{
> +	return true;
> +}
> +
> +static void init_xpfo(void)
> +{
> +	printk(KERN_INFO "XPFO enabled\n");
> +	static_branch_enable(&xpfo_inited);
> +}
> +
> +struct page_ext_operations page_xpfo_ops = {
> +	.need = need_xpfo,
> +	.init = init_xpfo,
> +};
> +
> +/*
> + * Update a single kernel page table entry
> + */
> +static inline void set_kpte(struct page *page, unsigned long kaddr,
> +			    pgprot_t prot) {
> +	unsigned int level;
> +	pte_t *kpte = lookup_address(kaddr, &level);
> +
> +	/* We only support 4k pages for now */
> +	BUG_ON(!kpte || level != PG_LEVEL_4K);
> +
> +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> +}

As lookup_address() and set_pte_atomic() (and PG_LEVEL_4K) are arch-specific,
would it be better to put the whole definition into an arch-specific part?

> +
> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> +{
> +	int i, flush_tlb = 0;
> +	struct page_ext *page_ext;
> +	unsigned long kaddr;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++)  {
> +		page_ext = lookup_page_ext(page + i);
> +
> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +
> +		/* Initialize the map lock and map counter */
> +		if (!page_ext->inited) {
> +			spin_lock_init(&page_ext->maplock);
> +			atomic_set(&page_ext->mapcount, 0);
> +			page_ext->inited = 1;
> +		}
> +		BUG_ON(atomic_read(&page_ext->mapcount));
> +
> +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> +			/*
> +			 * Flush the TLB if the page was previously allocated
> +			 * to the kernel.
> +			 */
> +			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
> +					       &page_ext->flags))
> +				flush_tlb = 1;
> +		} else {
> +			/* Tag the page as a kernel page */
> +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +		}
> +	}
> +
> +	if (flush_tlb) {
> +		kaddr = (unsigned long)page_address(page);
> +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> +				       PAGE_SIZE);
> +	}
> +}
> +
> +void xpfo_free_page(struct page *page, int order)
> +{
> +	int i;
> +	struct page_ext *page_ext;
> +	unsigned long kaddr;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		page_ext = lookup_page_ext(page + i);
> +
> +		if (!page_ext->inited) {
> +			/*
> +			 * The page was allocated before page_ext was
> +			 * initialized, so it is a kernel page and it needs to
> +			 * be tagged accordingly.
> +			 */
> +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +			continue;
> +		}
> +
> +		/*
> +		 * Map the page back into the kernel if it was previously
> +		 * allocated to user space.
> +		 */
> +		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
> +				       &page_ext->flags)) {
> +			kaddr = (unsigned long)page_address(page + i);
> +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));

Why not PAGE_KERNEL?

> +		}
> +	}
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	struct page_ext *page_ext;
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	page_ext = lookup_page_ext(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!page_ext->inited ||
> +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +		return;
> +
> +	spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +	/*
> +	 * The page was previously allocated to user space, so map it back
> +	 * into the kernel. No TLB flush required.
> +	 */
> +	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
> +	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	struct page_ext *page_ext;
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	page_ext = lookup_page_ext(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!page_ext->inited ||
> +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +		return;
> +
> +	spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +	/*
> +	 * The page is to be allocated back to user space, so unmap it from the
> +	 * kernel, flush the TLB and tag it as a user page.
> +	 */
> +	if (atomic_dec_return(&page_ext->mapcount) == 0) {
> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +		__flush_tlb_one((unsigned long)kaddr);

Again, __flush_tlb_one() is x86-specific.
flush_tlb_kernel_range() instead?

Thanks,
-Takahiro AKASHI

> +	}
> +
> +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);
> +
> +inline bool xpfo_page_is_unmapped(struct page *page)
> +{
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return false;
> +
> +	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
> +}
> +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> diff --git a/security/Kconfig b/security/Kconfig
> index 118f4549404e..4502e15c8419 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -6,6 +6,25 @@ menu "Security options"
>  
>  source security/keys/Kconfig
>  
> +config ARCH_SUPPORTS_XPFO
> +	bool
> +
> +config XPFO
> +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +	default n
> +	depends on ARCH_SUPPORTS_XPFO
> +	select PAGE_EXTENSION
> +	help
> +	  This option offers protection against 'ret2dir' kernel attacks.
> +	  When enabled, every time a page frame is allocated to user space, it
> +	  is unmapped from the direct mapped RAM region in kernel space
> +	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
> +	  mapped back to physmap.
> +
> +	  There is a slight performance impact when this option is enabled.
> +
> +	  If in doubt, say "N".
> +
>  config SECURITY_DMESG_RESTRICT
>  	bool "Restrict unprivileged access to the kernel syslog"
>  	default n
> -- 
> 2.10.1
> 


^ permalink raw reply	[flat|nested] 93+ messages in thread

* [kernel-hardening] Re: [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-11-24 10:56           ` AKASHI Takahiro
  0 siblings, 0 replies; 93+ messages in thread
From: AKASHI Takahiro @ 2016-11-24 10:56 UTC (permalink / raw)
  To: Juerg Haefliger
  Cc: linux-kernel, linux-mm, kernel-hardening, linux-x86_64, vpk

Hi,

I'm trying to give it a spin on arm64, but ...

On Fri, Nov 04, 2016 at 03:45:33PM +0100, Juerg Haefliger wrote:
> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userspace, unless explicitly requested by the
> kernel. Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.
> 
> Additional fields in the page_ext struct are used for XPFO housekeeping.
> Specifically two flags to distinguish user vs. kernel pages and to tag
> unmapped pages and a reference counter to balance kmap/kunmap operations
> and a lock to serialize access to the XPFO fields.
> 
> Known issues/limitations:
>   - Only supports x86-64 (for now)
>   - Only supports 4k pages (for now)
>   - There are most likely some legitimate uses cases where the kernel needs
>     to access userspace which need to be made XPFO-aware
>   - Performance penalty
> 
> Reference paper by the original patch authors:
>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
> 
> Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
> ---
>  arch/x86/Kconfig         |   3 +-
>  arch/x86/mm/init.c       |   2 +-
>  drivers/ata/libata-sff.c |   4 +-
>  include/linux/highmem.h  |  15 +++-
>  include/linux/page_ext.h |   7 ++
>  include/linux/xpfo.h     |  39 +++++++++
>  lib/swiotlb.c            |   3 +-
>  mm/Makefile              |   1 +
>  mm/page_alloc.c          |   2 +
>  mm/page_ext.c            |   4 +
>  mm/xpfo.c                | 206 +++++++++++++++++++++++++++++++++++++++++++++++
>  security/Kconfig         |  19 +++++
>  12 files changed, 298 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/xpfo.h
>  create mode 100644 mm/xpfo.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index bada636d1065..38b334f8fde5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -165,6 +165,7 @@ config X86
>  	select HAVE_STACK_VALIDATION		if X86_64
>  	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
>  	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
> +	select ARCH_SUPPORTS_XPFO		if X86_64
>  
>  config INSTRUCTION_DECODER
>  	def_bool y
> @@ -1361,7 +1362,7 @@ config ARCH_DMA_ADDR_T_64BIT
>  
>  config X86_DIRECT_GBPAGES
>  	def_bool y
> -	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
> +	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>  	---help---
>  	  Certain kernel features effectively disable kernel
>  	  linear 1 GB mappings (even if the CPU otherwise
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 22af912d66d2..a6fafbae02bb 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -161,7 +161,7 @@ static int page_size_mask;
>  
>  static void __init probe_page_size_mask(void)
>  {
> -#if !defined(CONFIG_KMEMCHECK)
> +#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
>  	/*
>  	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
>  	 * use small pages.
> diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
> index 051b6158d1b7..58af734be25d 100644
> --- a/drivers/ata/libata-sff.c
> +++ b/drivers/ata/libata-sff.c
> @@ -715,7 +715,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
>  
>  	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>  
> -	if (PageHighMem(page)) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>  		unsigned long flags;
>  
>  		/* FIXME: use a bounce buffer */
> @@ -860,7 +860,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
>  
>  	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
>  
> -	if (PageHighMem(page)) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>  		unsigned long flags;
>  
>  		/* FIXME: use bounce buffer */
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index bb3f3297062a..7a17c166532f 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -7,6 +7,7 @@
>  #include <linux/mm.h>
>  #include <linux/uaccess.h>
>  #include <linux/hardirq.h>
> +#include <linux/xpfo.h>
>  
>  #include <asm/cacheflush.h>
>  
> @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
>  #ifndef ARCH_HAS_KMAP
>  static inline void *kmap(struct page *page)
>  {
> +	void *kaddr;
> +
>  	might_sleep();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }
>  
>  static inline void kunmap(struct page *page)
>  {
> +	xpfo_kunmap(page_address(page), page);
>  }
>  
>  static inline void *kmap_atomic(struct page *page)
>  {
> +	void *kaddr;
> +
>  	preempt_disable();
>  	pagefault_disable();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }
>  #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
>  
>  static inline void __kunmap_atomic(void *addr)
>  {
> +	xpfo_kunmap(addr, virt_to_page(addr));
>  	pagefault_enable();
>  	preempt_enable();
>  }
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index 9298c393ddaa..0e451a42e5a3 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -29,6 +29,8 @@ enum page_ext_flags {
>  	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
>  	PAGE_EXT_DEBUG_GUARD,
>  	PAGE_EXT_OWNER,
> +	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
> +	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
>  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>  	PAGE_EXT_YOUNG,
>  	PAGE_EXT_IDLE,
> @@ -44,6 +46,11 @@ enum page_ext_flags {
>   */
>  struct page_ext {
>  	unsigned long flags;
> +#ifdef CONFIG_XPFO
> +	int inited;		/* Map counter and lock initialized */
> +	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
> +	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
> +#endif
>  };
>  
>  extern void pgdat_page_ext_init(struct pglist_data *pgdat);
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> new file mode 100644
> index 000000000000..77187578ca33
> --- /dev/null
> +++ b/include/linux/xpfo.h
> @@ -0,0 +1,39 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_XPFO_H
> +#define _LINUX_XPFO_H
> +
> +#ifdef CONFIG_XPFO
> +
> +extern struct page_ext_operations page_xpfo_ops;
> +
> +extern void xpfo_kmap(void *kaddr, struct page *page);
> +extern void xpfo_kunmap(void *kaddr, struct page *page);
> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> +extern void xpfo_free_page(struct page *page, int order);
> +
> +extern bool xpfo_page_is_unmapped(struct page *page);
> +
> +#else /* !CONFIG_XPFO */
> +
> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> +static inline void xpfo_free_page(struct page *page, int order) { }
> +
> +static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> +
> +#endif /* CONFIG_XPFO */
> +
> +#endif /* _LINUX_XPFO_H */
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 22e13a0e19d7..455eff44604e 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>  {
>  	unsigned long pfn = PFN_DOWN(orig_addr);
>  	unsigned char *vaddr = phys_to_virt(tlb_addr);
> +	struct page *page = pfn_to_page(pfn);
>  
> -	if (PageHighMem(pfn_to_page(pfn))) {
> +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>  		/* The buffer does not have a mapping.  Map it in and copy */
>  		unsigned int offset = orig_addr & ~PAGE_MASK;
>  		char *buffer;
> diff --git a/mm/Makefile b/mm/Makefile
> index 295bd7a9f76b..175680f516aa 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
>  obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
>  obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
>  obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> +obj-$(CONFIG_XPFO) += xpfo.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8fd42aa7c4bd..100e80e008e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1045,6 +1045,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  	kernel_poison_pages(page, 1 << order, 0);
>  	kernel_map_pages(page, 1 << order, 0);
>  	kasan_free_pages(page, order);
> +	xpfo_free_page(page, order);
>  
>  	return true;
>  }
> @@ -1745,6 +1746,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  	kernel_map_pages(page, 1 << order, 1);
>  	kernel_poison_pages(page, 1 << order, 1);
>  	kasan_alloc_pages(page, order);
> +	xpfo_alloc_page(page, order, gfp_flags);
>  	set_page_owner(page, order, gfp_flags);
>  }
>  
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 121dcffc4ec1..ba6dbcacc2db 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -7,6 +7,7 @@
>  #include <linux/kmemleak.h>
>  #include <linux/page_owner.h>
>  #include <linux/page_idle.h>
> +#include <linux/xpfo.h>
>  
>  /*
>   * struct page extension
> @@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
>  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
>  	&page_idle_ops,
>  #endif
> +#ifdef CONFIG_XPFO
> +	&page_xpfo_ops,
> +#endif
>  };
>  
>  static unsigned long total_usage;
> diff --git a/mm/xpfo.c b/mm/xpfo.c
> new file mode 100644
> index 000000000000..8e3a6a694b6a
> --- /dev/null
> +++ b/mm/xpfo.c
> @@ -0,0 +1,206 @@
> +/*
> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/page_ext.h>
> +#include <linux/xpfo.h>
> +
> +#include <asm/tlbflush.h>
> +
> +DEFINE_STATIC_KEY_FALSE(xpfo_inited);
> +
> +static bool need_xpfo(void)
> +{
> +	return true;
> +}
> +
> +static void init_xpfo(void)
> +{
> +	printk(KERN_INFO "XPFO enabled\n");
> +	static_branch_enable(&xpfo_inited);
> +}
> +
> +struct page_ext_operations page_xpfo_ops = {
> +	.need = need_xpfo,
> +	.init = init_xpfo,
> +};
> +
> +/*
> + * Update a single kernel page table entry
> + */
> +static inline void set_kpte(struct page *page, unsigned long kaddr,
> +			    pgprot_t prot) {
> +	unsigned int level;
> +	pte_t *kpte = lookup_address(kaddr, &level);
> +
> +	/* We only support 4k pages for now */
> +	BUG_ON(!kpte || level != PG_LEVEL_4K);
> +
> +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> +}

As lookup_address() and set_pte_atomic() (and PG_LEVEL_4K), are arch-specific,
would it be better to put the whole definition into arch-specific part?

> +
> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> +{
> +	int i, flush_tlb = 0;
> +	struct page_ext *page_ext;
> +	unsigned long kaddr;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++)  {
> +		page_ext = lookup_page_ext(page + i);
> +
> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +
> +		/* Initialize the map lock and map counter */
> +		if (!page_ext->inited) {
> +			spin_lock_init(&page_ext->maplock);
> +			atomic_set(&page_ext->mapcount, 0);
> +			page_ext->inited = 1;
> +		}
> +		BUG_ON(atomic_read(&page_ext->mapcount));
> +
> +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> +			/*
> +			 * Flush the TLB if the page was previously allocated
> +			 * to the kernel.
> +			 */
> +			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
> +					       &page_ext->flags))
> +				flush_tlb = 1;
> +		} else {
> +			/* Tag the page as a kernel page */
> +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +		}
> +	}
> +
> +	if (flush_tlb) {
> +		kaddr = (unsigned long)page_address(page);
> +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> +				       PAGE_SIZE);
> +	}
> +}
> +
> +void xpfo_free_page(struct page *page, int order)
> +{
> +	int i;
> +	struct page_ext *page_ext;
> +	unsigned long kaddr;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		page_ext = lookup_page_ext(page + i);
> +
> +		if (!page_ext->inited) {
> +			/*
> +			 * The page was allocated before page_ext was
> +			 * initialized, so it is a kernel page and it needs to
> +			 * be tagged accordingly.
> +			 */
> +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> +			continue;
> +		}
> +
> +		/*
> +		 * Map the page back into the kernel if it was previously
> +		 * allocated to user space.
> +		 */
> +		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
> +				       &page_ext->flags)) {
> +			kaddr = (unsigned long)page_address(page + i);
> +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));

Why not PAGE_KERNEL?

> +		}
> +	}
> +}
> +
> +void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	struct page_ext *page_ext;
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	page_ext = lookup_page_ext(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!page_ext->inited ||
> +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +		return;
> +
> +	spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +	/*
> +	 * The page was previously allocated to user space, so map it back
> +	 * into the kernel. No TLB flush required.
> +	 */
> +	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
> +	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> +
> +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kmap);
> +
> +void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	struct page_ext *page_ext;
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	page_ext = lookup_page_ext(page);
> +
> +	/*
> +	 * The page was allocated before page_ext was initialized (which means
> +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> +	 * do.
> +	 */
> +	if (!page_ext->inited ||
> +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> +		return;
> +
> +	spin_lock_irqsave(&page_ext->maplock, flags);
> +
> +	/*
> +	 * The page is to be allocated back to user space, so unmap it from the
> +	 * kernel, flush the TLB and tag it as a user page.
> +	 */
> +	if (atomic_dec_return(&page_ext->mapcount) == 0) {
> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> +		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> +		__flush_tlb_one((unsigned long)kaddr);

Again __flush_tlb_one() is x86-specific.
flush_tlb_kernel_range() instead?

Thanks,
-Takahiro AKASHI

> +	}
> +
> +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> +}
> +EXPORT_SYMBOL(xpfo_kunmap);
> +
> +inline bool xpfo_page_is_unmapped(struct page *page)
> +{
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return false;
> +
> +	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
> +}
> +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> diff --git a/security/Kconfig b/security/Kconfig
> index 118f4549404e..4502e15c8419 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -6,6 +6,25 @@ menu "Security options"
>  
>  source security/keys/Kconfig
>  
> +config ARCH_SUPPORTS_XPFO
> +	bool
> +
> +config XPFO
> +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +	default n
> +	depends on ARCH_SUPPORTS_XPFO
> +	select PAGE_EXTENSION
> +	help
> +	  This option offers protection against 'ret2dir' kernel attacks.
> +	  When enabled, every time a page frame is allocated to user space, it
> +	  is unmapped from the direct mapped RAM region in kernel space
> +	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
> +	  mapped back to physmap.
> +
> +	  There is a slight performance impact when this option is enabled.
> +
> +	  If in doubt, say "N".
> +
>  config SECURITY_DMESG_RESTRICT
>  	bool "Restrict unprivileged access to the kernel syslog"
>  	default n
> -- 
> 2.10.1
> 

^ permalink raw reply	[flat|nested] 93+ messages in thread
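The map/unmap accounting that the quoted xpfo_kmap()/xpfo_kunmap() implement can be sketched as a standalone C model. This is illustrative only: the real code keeps the fields in struct page_ext and edits the live kernel PTE, while here set_kpte() and the TLB flush are reduced to a boolean, and all names are made up rather than kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Standalone model of the XPFO map/unmap accounting. The real code
 * keeps these fields in struct page_ext and rewrites the kernel PTE;
 * here the mapping state is a plain boolean.
 */
struct xpfo_model {
	int mapcount;        /* balances kmap/kunmap calls */
	bool unmapped;       /* models PAGE_EXT_XPFO_UNMAPPED */
	bool kernel_mapping; /* stands in for the live PTE */
};

/* The first kmap of an unmapped user page restores the kernel mapping. */
static void model_kmap(struct xpfo_model *p)
{
	if (++p->mapcount == 1 && p->unmapped) {
		p->unmapped = false;
		p->kernel_mapping = true;  /* set_kpte(..., PAGE_KERNEL) */
	}
}

/* The last kunmap drops the mapping again (and would flush the TLB). */
static void model_kunmap(struct xpfo_model *p)
{
	if (--p->mapcount == 0) {
		assert(!p->unmapped);      /* mirrors the BUG_ON() */
		p->unmapped = true;
		p->kernel_mapping = false; /* set_kpte(..., __pgprot(0)) */
	}
}

/* Nested kmaps keep the page mapped until the final kunmap. */
static bool model_selftest(void)
{
	struct xpfo_model p = { 0, true, false };

	model_kmap(&p);
	if (!p.kernel_mapping)
		return false;
	model_kmap(&p);            /* nested map */
	model_kunmap(&p);
	if (!p.kernel_mapping)     /* must still be mapped here */
		return false;
	model_kunmap(&p);
	return p.unmapped && !p.kernel_mapping;
}
```

The spinlock in the real sequence is omitted; it only serializes the counter and flag updates shown here.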

* Re: [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-11-24 10:56           ` AKASHI Takahiro
@ 2016-11-28 11:15             ` Juerg Haefliger
  -1 siblings, 0 replies; 93+ messages in thread
From: Juerg Haefliger @ 2016-11-28 11:15 UTC (permalink / raw)
  To: AKASHI Takahiro, linux-kernel, linux-mm, kernel-hardening,
	linux-x86_64, vpk



On 11/24/2016 11:56 AM, AKASHI Takahiro wrote:
> Hi,
> 
> I'm trying to give it a spin on arm64, but ...

Thanks for trying this.


>> +/*
>> + * Update a single kernel page table entry
>> + */
>> +static inline void set_kpte(struct page *page, unsigned long kaddr,
>> +			    pgprot_t prot) {
>> +	unsigned int level;
>> +	pte_t *kpte = lookup_address(kaddr, &level);
>> +
>> +	/* We only support 4k pages for now */
>> +	BUG_ON(!kpte || level != PG_LEVEL_4K);
>> +
>> +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
>> +}
> 
> As lookup_address() and set_pte_atomic() (and PG_LEVEL_4K), are arch-specific,
> would it be better to put the whole definition into arch-specific part?

Well, yes, but I haven't really looked into splitting up the arch-specific stuff.


>> +		/*
>> +		 * Map the page back into the kernel if it was previously
>> +		 * allocated to user space.
>> +		 */
>> +		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
>> +				       &page_ext->flags)) {
>> +			kaddr = (unsigned long)page_address(page + i);
>> +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> 
> Why not PAGE_KERNEL?

Good catch, thanks!


>> +	/*
>> +	 * The page is to be allocated back to user space, so unmap it from the
>> +	 * kernel, flush the TLB and tag it as a user page.
>> +	 */
>> +	if (atomic_dec_return(&page_ext->mapcount) == 0) {
>> +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
>> +		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
>> +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
>> +		__flush_tlb_one((unsigned long)kaddr);
> 
> Again __flush_tlb_one() is x86-specific.
> flush_tlb_kernel_range() instead?

I'll take a look. If you can tell me what the relevant arm64 equivalents are for the arch-specific
functions, that would help tremendously.

Thanks for the comments!

...Juerg



> Thanks,
> -Takahiro AKASHI


-- 
Juerg Haefliger
Hewlett Packard Enterprise



^ permalink raw reply	[flat|nested] 93+ messages in thread
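Takahiro's question about the arch-specific parts points at how other architectures would opt in: per the Kconfig hunk in the patch, an architecture advertises support by selecting ARCH_SUPPORTS_XPFO. A hypothetical arm64 enablement (not part of this patch, and shown only as a sketch given the current 4k-page limitation) would look roughly like:

```
config ARM64
	...
	select ARCH_SUPPORTS_XPFO if ARM64_4K_PAGES
```

The arch would then also need to provide its own set_kpte() and TLB-flush primitives, which is exactly the split being discussed here.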


* Re: [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
  2016-11-24 10:56           ` AKASHI Takahiro
  (?)
@ 2016-12-09  9:02             ` AKASHI Takahiro
  -1 siblings, 0 replies; 93+ messages in thread
From: AKASHI Takahiro @ 2016-12-09  9:02 UTC (permalink / raw)
  To: Juerg Haefliger, linux-kernel, linux-mm, kernel-hardening,
	linux-x86_64, vpk

On Thu, Nov 24, 2016 at 07:56:30PM +0900, AKASHI Takahiro wrote:
> Hi,
> 
> I'm trying to give it a spin on arm64, but ...

In my experiment on HiKey, the kernel failed to boot, hitting a page fault
in cache maintenance operations:
(a) __clean_dcache_area_pou() on a 4KB-page kernel,
(b) __inval_cache_range() on a 64KB-page kernel.
(See the backtraces below for more details.)

This is because, on arm64, cache maintenance is performed by VA (in
particular, through the direct/linear mapping of physical memory). So I
think that naively unmapping a page from physmap in xpfo_kunmap() won't
work well on arm64.

-Takahiro AKASHI

case (a)
--------
Unable to handle kernel paging request at virtual address ffff800000cba000
pgd = ffff80003ba8c000
*pgd=0000000000000000
task: ffff80003be38000 task.stack: ffff80003be40000
PC is at __clean_dcache_area_pou+0x20/0x38
LR is at sync_icache_aliases+0x2c/0x40
 ...
Call trace:
 ...
__clean_dcache_area_pou+0x20/0x38
__sync_icache_dcache+0x6c/0xa8
alloc_set_pte+0x33c/0x588
filemap_map_pages+0x3a8/0x3b8
handle_mm_fault+0x910/0x1080
do_page_fault+0x2b0/0x358
do_mem_abort+0x44/0xa0
el0_ia+0x18/0x1c

case (b)
--------
Unable to handle kernel paging request at virtual address ffff80002aed0000
pgd = ffff000008f40000
, *pud=000000003dfc0003
, *pmd=000000003dfa0003
, *pte=000000002aed0000
task: ffff800028711900 task.stack: ffff800029020000
PC is at __inval_cache_range+0x3c/0x60
LR is at __swiotlb_map_sg_attrs+0x6c/0x98
 ...

Call trace:
 ...
__inval_cache_range+0x3c/0x60
dw_mci_pre_dma_transfer.isra.7+0xfc/0x190
dw_mci_pre_req+0x50/0x60
mmc_start_req+0x4c/0x420
mmc_blk_issue_rw_rq+0xb0/0x9b8
mmc_blk_issue_rq+0x154/0x518
mmc_queue_thread+0xac/0x158
kthread+0xd0/0xe8
ret_from_fork+0x10/0x20


> 
> On Fri, Nov 04, 2016 at 03:45:33PM +0100, Juerg Haefliger wrote:
> > This patch adds support for XPFO which protects against 'ret2dir' kernel
> > attacks. The basic idea is to enforce exclusive ownership of page frames
> > by either the kernel or userspace, unless explicitly requested by the
> > kernel. Whenever a page destined for userspace is allocated, it is
> > unmapped from physmap (the kernel's page table). When such a page is
> > reclaimed from userspace, it is mapped back to physmap.
> > 
> > Additional fields in the page_ext struct are used for XPFO housekeeping.
> > Specifically two flags to distinguish user vs. kernel pages and to tag
> > unmapped pages and a reference counter to balance kmap/kunmap operations
> > and a lock to serialize access to the XPFO fields.
> > 
> > Known issues/limitations:
> >   - Only supports x86-64 (for now)
> >   - Only supports 4k pages (for now)
> >   - There are most likely some legitimate use cases where the kernel needs
> >     to access userspace which need to be made XPFO-aware
> >   - Performance penalty
> > 
> > Reference paper by the original patch authors:
> >   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
> > 
> > Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
> > Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
> > ---
> >  arch/x86/Kconfig         |   3 +-
> >  arch/x86/mm/init.c       |   2 +-
> >  drivers/ata/libata-sff.c |   4 +-
> >  include/linux/highmem.h  |  15 +++-
> >  include/linux/page_ext.h |   7 ++
> >  include/linux/xpfo.h     |  39 +++++++++
> >  lib/swiotlb.c            |   3 +-
> >  mm/Makefile              |   1 +
> >  mm/page_alloc.c          |   2 +
> >  mm/page_ext.c            |   4 +
> >  mm/xpfo.c                | 206 +++++++++++++++++++++++++++++++++++++++++++++++
> >  security/Kconfig         |  19 +++++
> >  12 files changed, 298 insertions(+), 7 deletions(-)
> >  create mode 100644 include/linux/xpfo.h
> >  create mode 100644 mm/xpfo.c
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index bada636d1065..38b334f8fde5 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -165,6 +165,7 @@ config X86
> >  	select HAVE_STACK_VALIDATION		if X86_64
> >  	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
> >  	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
> > +	select ARCH_SUPPORTS_XPFO		if X86_64
> >  
> >  config INSTRUCTION_DECODER
> >  	def_bool y
> > @@ -1361,7 +1362,7 @@ config ARCH_DMA_ADDR_T_64BIT
> >  
> >  config X86_DIRECT_GBPAGES
> >  	def_bool y
> > -	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
> > +	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
> >  	---help---
> >  	  Certain kernel features effectively disable kernel
> >  	  linear 1 GB mappings (even if the CPU otherwise
> > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > index 22af912d66d2..a6fafbae02bb 100644
> > --- a/arch/x86/mm/init.c
> > +++ b/arch/x86/mm/init.c
> > @@ -161,7 +161,7 @@ static int page_size_mask;
> >  
> >  static void __init probe_page_size_mask(void)
> >  {
> > -#if !defined(CONFIG_KMEMCHECK)
> > +#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
> >  	/*
> >  	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
> >  	 * use small pages.
> > diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
> > index 051b6158d1b7..58af734be25d 100644
> > --- a/drivers/ata/libata-sff.c
> > +++ b/drivers/ata/libata-sff.c
> > @@ -715,7 +715,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
> >  
> >  	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
> >  
> > -	if (PageHighMem(page)) {
> > +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> >  		unsigned long flags;
> >  
> >  		/* FIXME: use a bounce buffer */
> > @@ -860,7 +860,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
> >  
> >  	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
> >  
> > -	if (PageHighMem(page)) {
> > +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> >  		unsigned long flags;
> >  
> >  		/* FIXME: use bounce buffer */
> > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> > index bb3f3297062a..7a17c166532f 100644
> > --- a/include/linux/highmem.h
> > +++ b/include/linux/highmem.h
> > @@ -7,6 +7,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/hardirq.h>
> > +#include <linux/xpfo.h>
> >  
> >  #include <asm/cacheflush.h>
> >  
> > @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
> >  #ifndef ARCH_HAS_KMAP
> >  static inline void *kmap(struct page *page)
> >  {
> > +	void *kaddr;
> > +
> >  	might_sleep();
> > -	return page_address(page);
> > +	kaddr = page_address(page);
> > +	xpfo_kmap(kaddr, page);
> > +	return kaddr;
> >  }
> >  
> >  static inline void kunmap(struct page *page)
> >  {
> > +	xpfo_kunmap(page_address(page), page);
> >  }
> >  
> >  static inline void *kmap_atomic(struct page *page)
> >  {
> > +	void *kaddr;
> > +
> >  	preempt_disable();
> >  	pagefault_disable();
> > -	return page_address(page);
> > +	kaddr = page_address(page);
> > +	xpfo_kmap(kaddr, page);
> > +	return kaddr;
> >  }
> >  #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
> >  
> >  static inline void __kunmap_atomic(void *addr)
> >  {
> > +	xpfo_kunmap(addr, virt_to_page(addr));
> >  	pagefault_enable();
> >  	preempt_enable();
> >  }
> > diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> > index 9298c393ddaa..0e451a42e5a3 100644
> > --- a/include/linux/page_ext.h
> > +++ b/include/linux/page_ext.h
> > @@ -29,6 +29,8 @@ enum page_ext_flags {
> >  	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
> >  	PAGE_EXT_DEBUG_GUARD,
> >  	PAGE_EXT_OWNER,
> > +	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
> > +	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
> >  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
> >  	PAGE_EXT_YOUNG,
> >  	PAGE_EXT_IDLE,
> > @@ -44,6 +46,11 @@ enum page_ext_flags {
> >   */
> >  struct page_ext {
> >  	unsigned long flags;
> > +#ifdef CONFIG_XPFO
> > +	int inited;		/* Map counter and lock initialized */
> > +	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
> > +	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
> > +#endif
> >  };
> >  
> >  extern void pgdat_page_ext_init(struct pglist_data *pgdat);
> > diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> > new file mode 100644
> > index 000000000000..77187578ca33
> > --- /dev/null
> > +++ b/include/linux/xpfo.h
> > @@ -0,0 +1,39 @@
> > +/*
> > + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> > + * Copyright (C) 2016 Brown University. All rights reserved.
> > + *
> > + * Authors:
> > + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> > + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of the GNU General Public License version 2 as published by
> > + * the Free Software Foundation.
> > + */
> > +
> > +#ifndef _LINUX_XPFO_H
> > +#define _LINUX_XPFO_H
> > +
> > +#ifdef CONFIG_XPFO
> > +
> > +extern struct page_ext_operations page_xpfo_ops;
> > +
> > +extern void xpfo_kmap(void *kaddr, struct page *page);
> > +extern void xpfo_kunmap(void *kaddr, struct page *page);
> > +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> > +extern void xpfo_free_page(struct page *page, int order);
> > +
> > +extern bool xpfo_page_is_unmapped(struct page *page);
> > +
> > +#else /* !CONFIG_XPFO */
> > +
> > +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> > +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> > +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> > +static inline void xpfo_free_page(struct page *page, int order) { }
> > +
> > +static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> > +
> > +#endif /* CONFIG_XPFO */
> > +
> > +#endif /* _LINUX_XPFO_H */
> > diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> > index 22e13a0e19d7..455eff44604e 100644
> > --- a/lib/swiotlb.c
> > +++ b/lib/swiotlb.c
> > @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
> >  {
> >  	unsigned long pfn = PFN_DOWN(orig_addr);
> >  	unsigned char *vaddr = phys_to_virt(tlb_addr);
> > +	struct page *page = pfn_to_page(pfn);
> >  
> > -	if (PageHighMem(pfn_to_page(pfn))) {
> > +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> >  		/* The buffer does not have a mapping.  Map it in and copy */
> >  		unsigned int offset = orig_addr & ~PAGE_MASK;
> >  		char *buffer;
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 295bd7a9f76b..175680f516aa 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> >  obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
> >  obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
> >  obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> > +obj-$(CONFIG_XPFO) += xpfo.o
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8fd42aa7c4bd..100e80e008e2 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1045,6 +1045,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
> >  	kernel_poison_pages(page, 1 << order, 0);
> >  	kernel_map_pages(page, 1 << order, 0);
> >  	kasan_free_pages(page, order);
> > +	xpfo_free_page(page, order);
> >  
> >  	return true;
> >  }
> > @@ -1745,6 +1746,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  	kernel_map_pages(page, 1 << order, 1);
> >  	kernel_poison_pages(page, 1 << order, 1);
> >  	kasan_alloc_pages(page, order);
> > +	xpfo_alloc_page(page, order, gfp_flags);
> >  	set_page_owner(page, order, gfp_flags);
> >  }
> >  
> > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > index 121dcffc4ec1..ba6dbcacc2db 100644
> > --- a/mm/page_ext.c
> > +++ b/mm/page_ext.c
> > @@ -7,6 +7,7 @@
> >  #include <linux/kmemleak.h>
> >  #include <linux/page_owner.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/xpfo.h>
> >  
> >  /*
> >   * struct page extension
> > @@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
> >  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
> >  	&page_idle_ops,
> >  #endif
> > +#ifdef CONFIG_XPFO
> > +	&page_xpfo_ops,
> > +#endif
> >  };
> >  
> >  static unsigned long total_usage;
> > diff --git a/mm/xpfo.c b/mm/xpfo.c
> > new file mode 100644
> > index 000000000000..8e3a6a694b6a
> > --- /dev/null
> > +++ b/mm/xpfo.c
> > @@ -0,0 +1,206 @@
> > +/*
> > + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> > + * Copyright (C) 2016 Brown University. All rights reserved.
> > + *
> > + * Authors:
> > + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> > + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of the GNU General Public License version 2 as published by
> > + * the Free Software Foundation.
> > + */
> > +
> > +#include <linux/mm.h>
> > +#include <linux/module.h>
> > +#include <linux/page_ext.h>
> > +#include <linux/xpfo.h>
> > +
> > +#include <asm/tlbflush.h>
> > +
> > +DEFINE_STATIC_KEY_FALSE(xpfo_inited);
> > +
> > +static bool need_xpfo(void)
> > +{
> > +	return true;
> > +}
> > +
> > +static void init_xpfo(void)
> > +{
> > +	printk(KERN_INFO "XPFO enabled\n");
> > +	static_branch_enable(&xpfo_inited);
> > +}
> > +
> > +struct page_ext_operations page_xpfo_ops = {
> > +	.need = need_xpfo,
> > +	.init = init_xpfo,
> > +};
> > +
> > +/*
> > + * Update a single kernel page table entry
> > + */
> > +static inline void set_kpte(struct page *page, unsigned long kaddr,
> > +			    pgprot_t prot) {
> > +	unsigned int level;
> > +	pte_t *kpte = lookup_address(kaddr, &level);
> > +
> > +	/* We only support 4k pages for now */
> > +	BUG_ON(!kpte || level != PG_LEVEL_4K);
> > +
> > +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> > +}
> 
> As lookup_address() and set_pte_atomic() (and PG_LEVEL_4K), are arch-specific,
> would it be better to put the whole definition into arch-specific part?
> 
> > +
> > +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> > +{
> > +	int i, flush_tlb = 0;
> > +	struct page_ext *page_ext;
> > +	unsigned long kaddr;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	for (i = 0; i < (1 << order); i++)  {
> > +		page_ext = lookup_page_ext(page + i);
> > +
> > +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> > +
> > +		/* Initialize the map lock and map counter */
> > +		if (!page_ext->inited) {
> > +			spin_lock_init(&page_ext->maplock);
> > +			atomic_set(&page_ext->mapcount, 0);
> > +			page_ext->inited = 1;
> > +		}
> > +		BUG_ON(atomic_read(&page_ext->mapcount));
> > +
> > +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> > +			/*
> > +			 * Flush the TLB if the page was previously allocated
> > +			 * to the kernel.
> > +			 */
> > +			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
> > +					       &page_ext->flags))
> > +				flush_tlb = 1;
> > +		} else {
> > +			/* Tag the page as a kernel page */
> > +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> > +		}
> > +	}
> > +
> > +	if (flush_tlb) {
> > +		kaddr = (unsigned long)page_address(page);
> > +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> > +				       PAGE_SIZE);
> > +	}
> > +}
> > +
> > +void xpfo_free_page(struct page *page, int order)
> > +{
> > +	int i;
> > +	struct page_ext *page_ext;
> > +	unsigned long kaddr;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	for (i = 0; i < (1 << order); i++) {
> > +		page_ext = lookup_page_ext(page + i);
> > +
> > +		if (!page_ext->inited) {
> > +			/*
> > +			 * The page was allocated before page_ext was
> > +			 * initialized, so it is a kernel page and it needs to
> > +			 * be tagged accordingly.
> > +			 */
> > +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * Map the page back into the kernel if it was previously
> > +		 * allocated to user space.
> > +		 */
> > +		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
> > +				       &page_ext->flags)) {
> > +			kaddr = (unsigned long)page_address(page + i);
> > +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> 
> Why not PAGE_KERNEL?
> 
> > +		}
> > +	}
> > +}
> > +
> > +void xpfo_kmap(void *kaddr, struct page *page)
> > +{
> > +	struct page_ext *page_ext;
> > +	unsigned long flags;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	page_ext = lookup_page_ext(page);
> > +
> > +	/*
> > +	 * The page was allocated before page_ext was initialized (which means
> > +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> > +	 * do.
> > +	 */
> > +	if (!page_ext->inited ||
> > +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> > +		return;
> > +
> > +	spin_lock_irqsave(&page_ext->maplock, flags);
> > +
> > +	/*
> > +	 * The page was previously allocated to user space, so map it back
> > +	 * into the kernel. No TLB flush required.
> > +	 */
> > +	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
> > +	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
> > +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> > +
> > +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> > +}
> > +EXPORT_SYMBOL(xpfo_kmap);
> > +
> > +void xpfo_kunmap(void *kaddr, struct page *page)
> > +{
> > +	struct page_ext *page_ext;
> > +	unsigned long flags;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	page_ext = lookup_page_ext(page);
> > +
> > +	/*
> > +	 * The page was allocated before page_ext was initialized (which means
> > +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> > +	 * do.
> > +	 */
> > +	if (!page_ext->inited ||
> > +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> > +		return;
> > +
> > +	spin_lock_irqsave(&page_ext->maplock, flags);
> > +
> > +	/*
> > +	 * The page is to be allocated back to user space, so unmap it from the
> > +	 * kernel, flush the TLB and tag it as a user page.
> > +	 */
> > +	if (atomic_dec_return(&page_ext->mapcount) == 0) {
> > +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> > +		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
> > +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> > +		__flush_tlb_one((unsigned long)kaddr);
> 
> Again __flush_tlb_one() is x86-specific.
> flush_tlb_kernel_range() instead?
> 
> Thanks,
> -Takahiro AKASHI
> 
> > +	}
> > +
> > +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> > +}
> > +EXPORT_SYMBOL(xpfo_kunmap);
> > +
> > +inline bool xpfo_page_is_unmapped(struct page *page)
> > +{
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return false;
> > +
> > +	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
> > +}
> > +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> > diff --git a/security/Kconfig b/security/Kconfig
> > index 118f4549404e..4502e15c8419 100644
> > --- a/security/Kconfig
> > +++ b/security/Kconfig
> > @@ -6,6 +6,25 @@ menu "Security options"
> >  
> >  source security/keys/Kconfig
> >  
> > +config ARCH_SUPPORTS_XPFO
> > +	bool
> > +
> > +config XPFO
> > +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> > +	default n
> > +	depends on ARCH_SUPPORTS_XPFO
> > +	select PAGE_EXTENSION
> > +	help
> > +	  This option offers protection against 'ret2dir' kernel attacks.
> > +	  When enabled, every time a page frame is allocated to user space, it
> > +	  is unmapped from the direct mapped RAM region in kernel space
> > +	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
> > +	  mapped back to physmap.
> > +
> > +	  There is a slight performance impact when this option is enabled.
> > +
> > +	  If in doubt, say "N".
> > +
> >  config SECURITY_DMESG_RESTRICT
> >  	bool "Restrict unprivileged access to the kernel syslog"
> >  	default n
> > -- 
> > 2.10.1
> > 


> >  	pagefault_enable();
> >  	preempt_enable();
> >  }
> > diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> > index 9298c393ddaa..0e451a42e5a3 100644
> > --- a/include/linux/page_ext.h
> > +++ b/include/linux/page_ext.h
> > @@ -29,6 +29,8 @@ enum page_ext_flags {
> >  	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
> >  	PAGE_EXT_DEBUG_GUARD,
> >  	PAGE_EXT_OWNER,
> > +	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
> > +	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
> >  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
> >  	PAGE_EXT_YOUNG,
> >  	PAGE_EXT_IDLE,
> > @@ -44,6 +46,11 @@ enum page_ext_flags {
> >   */
> >  struct page_ext {
> >  	unsigned long flags;
> > +#ifdef CONFIG_XPFO
> > +	int inited;		/* Map counter and lock initialized */
> > +	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
> > +	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
> > +#endif
> >  };
> >  
> >  extern void pgdat_page_ext_init(struct pglist_data *pgdat);
> > diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> > new file mode 100644
> > index 000000000000..77187578ca33
> > --- /dev/null
> > +++ b/include/linux/xpfo.h
> > @@ -0,0 +1,39 @@
> > +/*
> > + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> > + * Copyright (C) 2016 Brown University. All rights reserved.
> > + *
> > + * Authors:
> > + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> > + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of the GNU General Public License version 2 as published by
> > + * the Free Software Foundation.
> > + */
> > +
> > +#ifndef _LINUX_XPFO_H
> > +#define _LINUX_XPFO_H
> > +
> > +#ifdef CONFIG_XPFO
> > +
> > +extern struct page_ext_operations page_xpfo_ops;
> > +
> > +extern void xpfo_kmap(void *kaddr, struct page *page);
> > +extern void xpfo_kunmap(void *kaddr, struct page *page);
> > +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> > +extern void xpfo_free_page(struct page *page, int order);
> > +
> > +extern bool xpfo_page_is_unmapped(struct page *page);
> > +
> > +#else /* !CONFIG_XPFO */
> > +
> > +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> > +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> > +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> > +static inline void xpfo_free_page(struct page *page, int order) { }
> > +
> > +static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> > +
> > +#endif /* CONFIG_XPFO */
> > +
> > +#endif /* _LINUX_XPFO_H */
> > diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> > index 22e13a0e19d7..455eff44604e 100644
> > --- a/lib/swiotlb.c
> > +++ b/lib/swiotlb.c
> > @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
> >  {
> >  	unsigned long pfn = PFN_DOWN(orig_addr);
> >  	unsigned char *vaddr = phys_to_virt(tlb_addr);
> > +	struct page *page = pfn_to_page(pfn);
> >  
> > -	if (PageHighMem(pfn_to_page(pfn))) {
> > +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> >  		/* The buffer does not have a mapping.  Map it in and copy */
> >  		unsigned int offset = orig_addr & ~PAGE_MASK;
> >  		char *buffer;
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 295bd7a9f76b..175680f516aa 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> >  obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
> >  obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
> >  obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> > +obj-$(CONFIG_XPFO) += xpfo.o
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8fd42aa7c4bd..100e80e008e2 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1045,6 +1045,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
> >  	kernel_poison_pages(page, 1 << order, 0);
> >  	kernel_map_pages(page, 1 << order, 0);
> >  	kasan_free_pages(page, order);
> > +	xpfo_free_page(page, order);
> >  
> >  	return true;
> >  }
> > @@ -1745,6 +1746,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  	kernel_map_pages(page, 1 << order, 1);
> >  	kernel_poison_pages(page, 1 << order, 1);
> >  	kasan_alloc_pages(page, order);
> > +	xpfo_alloc_page(page, order, gfp_flags);
> >  	set_page_owner(page, order, gfp_flags);
> >  }
> >  
> > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > index 121dcffc4ec1..ba6dbcacc2db 100644
> > --- a/mm/page_ext.c
> > +++ b/mm/page_ext.c
> > @@ -7,6 +7,7 @@
> >  #include <linux/kmemleak.h>
> >  #include <linux/page_owner.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/xpfo.h>
> >  
> >  /*
> >   * struct page extension
> > @@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
> >  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
> >  	&page_idle_ops,
> >  #endif
> > +#ifdef CONFIG_XPFO
> > +	&page_xpfo_ops,
> > +#endif
> >  };
> >  
> >  static unsigned long total_usage;
> > diff --git a/mm/xpfo.c b/mm/xpfo.c
> > new file mode 100644
> > index 000000000000..8e3a6a694b6a
> > --- /dev/null
> > +++ b/mm/xpfo.c
> > @@ -0,0 +1,206 @@
> > +/*
> > + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> > + * Copyright (C) 2016 Brown University. All rights reserved.
> > + *
> > + * Authors:
> > + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> > + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of the GNU General Public License version 2 as published by
> > + * the Free Software Foundation.
> > + */
> > +
> > +#include <linux/mm.h>
> > +#include <linux/module.h>
> > +#include <linux/page_ext.h>
> > +#include <linux/xpfo.h>
> > +
> > +#include <asm/tlbflush.h>
> > +
> > +DEFINE_STATIC_KEY_FALSE(xpfo_inited);
> > +
> > +static bool need_xpfo(void)
> > +{
> > +	return true;
> > +}
> > +
> > +static void init_xpfo(void)
> > +{
> > +	printk(KERN_INFO "XPFO enabled\n");
> > +	static_branch_enable(&xpfo_inited);
> > +}
> > +
> > +struct page_ext_operations page_xpfo_ops = {
> > +	.need = need_xpfo,
> > +	.init = init_xpfo,
> > +};
> > +
> > +/*
> > + * Update a single kernel page table entry
> > + */
> > +static inline void set_kpte(struct page *page, unsigned long kaddr,
> > +			    pgprot_t prot) {
> > +	unsigned int level;
> > +	pte_t *kpte = lookup_address(kaddr, &level);
> > +
> > +	/* We only support 4k pages for now */
> > +	BUG_ON(!kpte || level != PG_LEVEL_4K);
> > +
> > +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> > +}
> 
> As lookup_address() and set_pte_atomic() (and PG_LEVEL_4K), are arch-specific,
> would it be better to put the whole definition into arch-specific part?
> 
> > +
> > +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> > +{
> > +	int i, flush_tlb = 0;
> > +	struct page_ext *page_ext;
> > +	unsigned long kaddr;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	for (i = 0; i < (1 << order); i++)  {
> > +		page_ext = lookup_page_ext(page + i);
> > +
> > +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> > +
> > +		/* Initialize the map lock and map counter */
> > +		if (!page_ext->inited) {
> > +			spin_lock_init(&page_ext->maplock);
> > +			atomic_set(&page_ext->mapcount, 0);
> > +			page_ext->inited = 1;
> > +		}
> > +		BUG_ON(atomic_read(&page_ext->mapcount));
> > +
> > +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> > +			/*
> > +			 * Flush the TLB if the page was previously allocated
> > +			 * to the kernel.
> > +			 */
> > +			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
> > +					       &page_ext->flags))
> > +				flush_tlb = 1;
> > +		} else {
> > +			/* Tag the page as a kernel page */
> > +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> > +		}
> > +	}
> > +
> > +	if (flush_tlb) {
> > +		kaddr = (unsigned long)page_address(page);
> > +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> > +				       PAGE_SIZE);
> > +	}
> > +}
> > +
> > +void xpfo_free_page(struct page *page, int order)
> > +{
> > +	int i;
> > +	struct page_ext *page_ext;
> > +	unsigned long kaddr;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	for (i = 0; i < (1 << order); i++) {
> > +		page_ext = lookup_page_ext(page + i);
> > +
> > +		if (!page_ext->inited) {
> > +			/*
> > +			 * The page was allocated before page_ext was
> > +			 * initialized, so it is a kernel page and it needs to
> > +			 * be tagged accordingly.
> > +			 */
> > +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * Map the page back into the kernel if it was previously
> > +		 * allocated to user space.
> > +		 */
> > +		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
> > +				       &page_ext->flags)) {
> > +			kaddr = (unsigned long)page_address(page + i);
> > +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> 
> Why not PAGE_KERNEL?
> 
> > +		}
> > +	}
> > +}
> > +
> > +void xpfo_kmap(void *kaddr, struct page *page)
> > +{
> > +	struct page_ext *page_ext;
> > +	unsigned long flags;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	page_ext = lookup_page_ext(page);
> > +
> > +	/*
> > +	 * The page was allocated before page_ext was initialized (which means
> > +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> > +	 * do.
> > +	 */
> > +	if (!page_ext->inited ||
> > +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> > +		return;
> > +
> > +	spin_lock_irqsave(&page_ext->maplock, flags);
> > +
> > +	/*
> > +	 * The page was previously allocated to user space, so map it back
> > +	 * into the kernel. No TLB flush required.
> > +	 */
> > +	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
> > +	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
> > +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> > +
> > +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> > +}
> > +EXPORT_SYMBOL(xpfo_kmap);
> > +
> > +void xpfo_kunmap(void *kaddr, struct page *page)
> > +{
> > +	struct page_ext *page_ext;
> > +	unsigned long flags;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	page_ext = lookup_page_ext(page);
> > +
> > +	/*
> > +	 * The page was allocated before page_ext was initialized (which means
> > +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> > +	 * do.
> > +	 */
> > +	if (!page_ext->inited ||
> > +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> > +		return;
> > +
> > +	spin_lock_irqsave(&page_ext->maplock, flags);
> > +
> > +	/*
> > +	 * The page is to be allocated back to user space, so unmap it from the
> > +	 * kernel, flush the TLB and tag it as a user page.
> > +	 */
> > +	if (atomic_dec_return(&page_ext->mapcount) == 0) {
> > +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> > +		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
> > +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> > +		__flush_tlb_one((unsigned long)kaddr);
> 
> Again __flush_tlb_one() is x86-specific.
> flush_tlb_kernel_range() instead?
> 
> Thanks,
> -Takahiro AKASHI
> 
> > +	}
> > +
> > +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> > +}
> > +EXPORT_SYMBOL(xpfo_kunmap);
> > +
> > +inline bool xpfo_page_is_unmapped(struct page *page)
> > +{
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return false;
> > +
> > +	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
> > +}
> > +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> > diff --git a/security/Kconfig b/security/Kconfig
> > index 118f4549404e..4502e15c8419 100644
> > --- a/security/Kconfig
> > +++ b/security/Kconfig
> > @@ -6,6 +6,25 @@ menu "Security options"
> >  
> >  source security/keys/Kconfig
> >  
> > +config ARCH_SUPPORTS_XPFO
> > +	bool
> > +
> > +config XPFO
> > +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> > +	default n
> > +	depends on ARCH_SUPPORTS_XPFO
> > +	select PAGE_EXTENSION
> > +	help
> > +	  This option offers protection against 'ret2dir' kernel attacks.
> > +	  When enabled, every time a page frame is allocated to user space, it
> > +	  is unmapped from the direct mapped RAM region in kernel space
> > +	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
> > +	  mapped back to physmap.
> > +
> > +	  There is a slight performance impact when this option is enabled.
> > +
> > +	  If in doubt, say "N".
> > +
> >  config SECURITY_DMESG_RESTRICT
> >  	bool "Restrict unprivileged access to the kernel syslog"
> >  	default n
> > -- 
> > 2.10.1
> > 


* [kernel-hardening] Re: [RFC PATCH v3 1/2] Add support for eXclusive Page Frame Ownership (XPFO)
@ 2016-12-09  9:02             ` AKASHI Takahiro
  0 siblings, 0 replies; 93+ messages in thread
From: AKASHI Takahiro @ 2016-12-09  9:02 UTC (permalink / raw)
  To: Juerg Haefliger, linux-kernel, linux-mm, kernel-hardening,
	linux-x86_64, vpk

On Thu, Nov 24, 2016 at 07:56:30PM +0900, AKASHI Takahiro wrote:
> Hi,
> 
> I'm trying to give it a spin on arm64, but ...

In my experiment on hikey,
the kernel failed to boot, hitting a page fault in cache operations:
(a) __clean_dcache_area_pou() on a 4KB-page kernel,
(b) __inval_cache_range() on a 64KB-page kernel.
(See the backtraces below for details.)

This is because, on arm64, cache operations are performed by VA (in
particular, on the direct/linear mapping of physical memory). So I think
that naively unmapping a page from the physmap in xpfo_kunmap() won't
work well on arm64.

-Takahiro AKASHI

case (a)
--------
Unable to handle kernel paging request at virtual address ffff800000cba000
pgd = ffff80003ba8c000
*pgd=0000000000000000
task: ffff80003be38000 task.stack: ffff80003be40000
PC is at __clean_dcache_area_pou+0x20/0x38
LR is at sync_icache_aliases+0x2c/0x40
 ...
Call trace:
 ...
__clean_dcache_area_pou+0x20/0x38
__sync_icache_dcache+0x6c/0xa8
alloc_set_pte+0x33c/0x588
filemap_map_pages+0x3a8/0x3b8
handle_mm_fault+0x910/0x1080
do_page_fault+0x2b0/0x358
do_mem_abort+0x44/0xa0
el0_ia+0x18/0x1c

case (b)
--------
Unable to handle kernel paging request at virtual address ffff80002aed0000
pgd = ffff000008f40000
, *pud=000000003dfc0003
, *pmd=000000003dfa0003
, *pte=000000002aed0000
task: ffff800028711900 task.stack: ffff800029020000
PC is at __inval_cache_range+0x3c/0x60
LR is at __swiotlb_map_sg_attrs+0x6c/0x98
 ...

Call trace:
 ...
__inval_cache_range+0x3c/0x60
dw_mci_pre_dma_transfer.isra.7+0xfc/0x190
dw_mci_pre_req+0x50/0x60
mmc_start_req+0x4c/0x420
mmc_blk_issue_rw_rq+0xb0/0x9b8
mmc_blk_issue_rq+0x154/0x518
mmc_queue_thread+0xac/0x158
kthread+0xd0/0xe8
ret_from_fork+0x10/0x20


> 
> On Fri, Nov 04, 2016 at 03:45:33PM +0100, Juerg Haefliger wrote:
> > This patch adds support for XPFO which protects against 'ret2dir' kernel
> > attacks. The basic idea is to enforce exclusive ownership of page frames
> > by either the kernel or userspace, unless explicitly requested by the
> > kernel. Whenever a page destined for userspace is allocated, it is
> > unmapped from physmap (the kernel's page table). When such a page is
> > reclaimed from userspace, it is mapped back to physmap.
> > 
> > Additional fields in the page_ext struct are used for XPFO housekeeping.
> > Specifically two flags to distinguish user vs. kernel pages and to tag
> > unmapped pages and a reference counter to balance kmap/kunmap operations
> > and a lock to serialize access to the XPFO fields.
> > 
> > Known issues/limitations:
> >   - Only supports x86-64 (for now)
> >   - Only supports 4k pages (for now)
> >   - There are most likely some legitimate use cases where the kernel needs
> >     to access userspace which need to be made XPFO-aware
> >   - Performance penalty
> > 
> > Reference paper by the original patch authors:
> >   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
> > 
> > Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
> > Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
> > ---
> >  arch/x86/Kconfig         |   3 +-
> >  arch/x86/mm/init.c       |   2 +-
> >  drivers/ata/libata-sff.c |   4 +-
> >  include/linux/highmem.h  |  15 +++-
> >  include/linux/page_ext.h |   7 ++
> >  include/linux/xpfo.h     |  39 +++++++++
> >  lib/swiotlb.c            |   3 +-
> >  mm/Makefile              |   1 +
> >  mm/page_alloc.c          |   2 +
> >  mm/page_ext.c            |   4 +
> >  mm/xpfo.c                | 206 +++++++++++++++++++++++++++++++++++++++++++++++
> >  security/Kconfig         |  19 +++++
> >  12 files changed, 298 insertions(+), 7 deletions(-)
> >  create mode 100644 include/linux/xpfo.h
> >  create mode 100644 mm/xpfo.c
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index bada636d1065..38b334f8fde5 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -165,6 +165,7 @@ config X86
> >  	select HAVE_STACK_VALIDATION		if X86_64
> >  	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
> >  	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
> > +	select ARCH_SUPPORTS_XPFO		if X86_64
> >  
> >  config INSTRUCTION_DECODER
> >  	def_bool y
> > @@ -1361,7 +1362,7 @@ config ARCH_DMA_ADDR_T_64BIT
> >  
> >  config X86_DIRECT_GBPAGES
> >  	def_bool y
> > -	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
> > +	depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
> >  	---help---
> >  	  Certain kernel features effectively disable kernel
> >  	  linear 1 GB mappings (even if the CPU otherwise
> > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > index 22af912d66d2..a6fafbae02bb 100644
> > --- a/arch/x86/mm/init.c
> > +++ b/arch/x86/mm/init.c
> > @@ -161,7 +161,7 @@ static int page_size_mask;
> >  
> >  static void __init probe_page_size_mask(void)
> >  {
> > -#if !defined(CONFIG_KMEMCHECK)
> > +#if !defined(CONFIG_KMEMCHECK) && !defined(CONFIG_XPFO)
> >  	/*
> >  	 * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
> >  	 * use small pages.
> > diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
> > index 051b6158d1b7..58af734be25d 100644
> > --- a/drivers/ata/libata-sff.c
> > +++ b/drivers/ata/libata-sff.c
> > @@ -715,7 +715,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
> >  
> >  	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
> >  
> > -	if (PageHighMem(page)) {
> > +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> >  		unsigned long flags;
> >  
> >  		/* FIXME: use a bounce buffer */
> > @@ -860,7 +860,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
> >  
> >  	DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
> >  
> > -	if (PageHighMem(page)) {
> > +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> >  		unsigned long flags;
> >  
> >  		/* FIXME: use bounce buffer */
> > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> > index bb3f3297062a..7a17c166532f 100644
> > --- a/include/linux/highmem.h
> > +++ b/include/linux/highmem.h
> > @@ -7,6 +7,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/hardirq.h>
> > +#include <linux/xpfo.h>
> >  
> >  #include <asm/cacheflush.h>
> >  
> > @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
> >  #ifndef ARCH_HAS_KMAP
> >  static inline void *kmap(struct page *page)
> >  {
> > +	void *kaddr;
> > +
> >  	might_sleep();
> > -	return page_address(page);
> > +	kaddr = page_address(page);
> > +	xpfo_kmap(kaddr, page);
> > +	return kaddr;
> >  }
> >  
> >  static inline void kunmap(struct page *page)
> >  {
> > +	xpfo_kunmap(page_address(page), page);
> >  }
> >  
> >  static inline void *kmap_atomic(struct page *page)
> >  {
> > +	void *kaddr;
> > +
> >  	preempt_disable();
> >  	pagefault_disable();
> > -	return page_address(page);
> > +	kaddr = page_address(page);
> > +	xpfo_kmap(kaddr, page);
> > +	return kaddr;
> >  }
> >  #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
> >  
> >  static inline void __kunmap_atomic(void *addr)
> >  {
> > +	xpfo_kunmap(addr, virt_to_page(addr));
> >  	pagefault_enable();
> >  	preempt_enable();
> >  }
> > diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> > index 9298c393ddaa..0e451a42e5a3 100644
> > --- a/include/linux/page_ext.h
> > +++ b/include/linux/page_ext.h
> > @@ -29,6 +29,8 @@ enum page_ext_flags {
> >  	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
> >  	PAGE_EXT_DEBUG_GUARD,
> >  	PAGE_EXT_OWNER,
> > +	PAGE_EXT_XPFO_KERNEL,		/* Page is a kernel page */
> > +	PAGE_EXT_XPFO_UNMAPPED,		/* Page is unmapped */
> >  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
> >  	PAGE_EXT_YOUNG,
> >  	PAGE_EXT_IDLE,
> > @@ -44,6 +46,11 @@ enum page_ext_flags {
> >   */
> >  struct page_ext {
> >  	unsigned long flags;
> > +#ifdef CONFIG_XPFO
> > +	int inited;		/* Map counter and lock initialized */
> > +	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
> > +	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
> > +#endif
> >  };
> >  
> >  extern void pgdat_page_ext_init(struct pglist_data *pgdat);
> > diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> > new file mode 100644
> > index 000000000000..77187578ca33
> > --- /dev/null
> > +++ b/include/linux/xpfo.h
> > @@ -0,0 +1,39 @@
> > +/*
> > + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> > + * Copyright (C) 2016 Brown University. All rights reserved.
> > + *
> > + * Authors:
> > + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> > + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of the GNU General Public License version 2 as published by
> > + * the Free Software Foundation.
> > + */
> > +
> > +#ifndef _LINUX_XPFO_H
> > +#define _LINUX_XPFO_H
> > +
> > +#ifdef CONFIG_XPFO
> > +
> > +extern struct page_ext_operations page_xpfo_ops;
> > +
> > +extern void xpfo_kmap(void *kaddr, struct page *page);
> > +extern void xpfo_kunmap(void *kaddr, struct page *page);
> > +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
> > +extern void xpfo_free_page(struct page *page, int order);
> > +
> > +extern bool xpfo_page_is_unmapped(struct page *page);
> > +
> > +#else /* !CONFIG_XPFO */
> > +
> > +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
> > +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
> > +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
> > +static inline void xpfo_free_page(struct page *page, int order) { }
> > +
> > +static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
> > +
> > +#endif /* CONFIG_XPFO */
> > +
> > +#endif /* _LINUX_XPFO_H */
> > diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> > index 22e13a0e19d7..455eff44604e 100644
> > --- a/lib/swiotlb.c
> > +++ b/lib/swiotlb.c
> > @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
> >  {
> >  	unsigned long pfn = PFN_DOWN(orig_addr);
> >  	unsigned char *vaddr = phys_to_virt(tlb_addr);
> > +	struct page *page = pfn_to_page(pfn);
> >  
> > -	if (PageHighMem(pfn_to_page(pfn))) {
> > +	if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> >  		/* The buffer does not have a mapping.  Map it in and copy */
> >  		unsigned int offset = orig_addr & ~PAGE_MASK;
> >  		char *buffer;
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 295bd7a9f76b..175680f516aa 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> >  obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
> >  obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
> >  obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> > +obj-$(CONFIG_XPFO) += xpfo.o
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8fd42aa7c4bd..100e80e008e2 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1045,6 +1045,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
> >  	kernel_poison_pages(page, 1 << order, 0);
> >  	kernel_map_pages(page, 1 << order, 0);
> >  	kasan_free_pages(page, order);
> > +	xpfo_free_page(page, order);
> >  
> >  	return true;
> >  }
> > @@ -1745,6 +1746,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  	kernel_map_pages(page, 1 << order, 1);
> >  	kernel_poison_pages(page, 1 << order, 1);
> >  	kasan_alloc_pages(page, order);
> > +	xpfo_alloc_page(page, order, gfp_flags);
> >  	set_page_owner(page, order, gfp_flags);
> >  }
> >  
> > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > index 121dcffc4ec1..ba6dbcacc2db 100644
> > --- a/mm/page_ext.c
> > +++ b/mm/page_ext.c
> > @@ -7,6 +7,7 @@
> >  #include <linux/kmemleak.h>
> >  #include <linux/page_owner.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/xpfo.h>
> >  
> >  /*
> >   * struct page extension
> > @@ -68,6 +69,9 @@ static struct page_ext_operations *page_ext_ops[] = {
> >  #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
> >  	&page_idle_ops,
> >  #endif
> > +#ifdef CONFIG_XPFO
> > +	&page_xpfo_ops,
> > +#endif
> >  };
> >  
> >  static unsigned long total_usage;
> > diff --git a/mm/xpfo.c b/mm/xpfo.c
> > new file mode 100644
> > index 000000000000..8e3a6a694b6a
> > --- /dev/null
> > +++ b/mm/xpfo.c
> > @@ -0,0 +1,206 @@
> > +/*
> > + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
> > + * Copyright (C) 2016 Brown University. All rights reserved.
> > + *
> > + * Authors:
> > + *   Juerg Haefliger <juerg.haefliger@hpe.com>
> > + *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of the GNU General Public License version 2 as published by
> > + * the Free Software Foundation.
> > + */
> > +
> > +#include <linux/mm.h>
> > +#include <linux/module.h>
> > +#include <linux/page_ext.h>
> > +#include <linux/xpfo.h>
> > +
> > +#include <asm/tlbflush.h>
> > +
> > +DEFINE_STATIC_KEY_FALSE(xpfo_inited);
> > +
> > +static bool need_xpfo(void)
> > +{
> > +	return true;
> > +}
> > +
> > +static void init_xpfo(void)
> > +{
> > +	printk(KERN_INFO "XPFO enabled\n");
> > +	static_branch_enable(&xpfo_inited);
> > +}
> > +
> > +struct page_ext_operations page_xpfo_ops = {
> > +	.need = need_xpfo,
> > +	.init = init_xpfo,
> > +};
> > +
> > +/*
> > + * Update a single kernel page table entry
> > + */
> > +static inline void set_kpte(struct page *page, unsigned long kaddr,
> > +			    pgprot_t prot) {
> > +	unsigned int level;
> > +	pte_t *kpte = lookup_address(kaddr, &level);
> > +
> > +	/* We only support 4k pages for now */
> > +	BUG_ON(!kpte || level != PG_LEVEL_4K);
> > +
> > +	set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
> > +}
> 
> As lookup_address() and set_pte_atomic() (and PG_LEVEL_4K), are arch-specific,
> would it be better to put the whole definition into arch-specific part?
> 
> > +
> > +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
> > +{
> > +	int i, flush_tlb = 0;
> > +	struct page_ext *page_ext;
> > +	unsigned long kaddr;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	for (i = 0; i < (1 << order); i++)  {
> > +		page_ext = lookup_page_ext(page + i);
> > +
> > +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> > +
> > +		/* Initialize the map lock and map counter */
> > +		if (!page_ext->inited) {
> > +			spin_lock_init(&page_ext->maplock);
> > +			atomic_set(&page_ext->mapcount, 0);
> > +			page_ext->inited = 1;
> > +		}
> > +		BUG_ON(atomic_read(&page_ext->mapcount));
> > +
> > +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> > +			/*
> > +			 * Flush the TLB if the page was previously allocated
> > +			 * to the kernel.
> > +			 */
> > +			if (test_and_clear_bit(PAGE_EXT_XPFO_KERNEL,
> > +					       &page_ext->flags))
> > +				flush_tlb = 1;
> > +		} else {
> > +			/* Tag the page as a kernel page */
> > +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> > +		}
> > +	}
> > +
> > +	if (flush_tlb) {
> > +		kaddr = (unsigned long)page_address(page);
> > +		flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
> > +				       PAGE_SIZE);
> > +	}
> > +}
> > +
> > +void xpfo_free_page(struct page *page, int order)
> > +{
> > +	int i;
> > +	struct page_ext *page_ext;
> > +	unsigned long kaddr;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	for (i = 0; i < (1 << order); i++) {
> > +		page_ext = lookup_page_ext(page + i);
> > +
> > +		if (!page_ext->inited) {
> > +			/*
> > +			 * The page was allocated before page_ext was
> > +			 * initialized, so it is a kernel page and it needs to
> > +			 * be tagged accordingly.
> > +			 */
> > +			set_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags);
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * Map the page back into the kernel if it was previously
> > +		 * allocated to user space.
> > +		 */
> > +		if (test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED,
> > +				       &page_ext->flags)) {
> > +			kaddr = (unsigned long)page_address(page + i);
> > +			set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
> 
> Why not PAGE_KERNEL?
> 
> > +		}
> > +	}
> > +}
> > +
> > +void xpfo_kmap(void *kaddr, struct page *page)
> > +{
> > +	struct page_ext *page_ext;
> > +	unsigned long flags;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	page_ext = lookup_page_ext(page);
> > +
> > +	/*
> > +	 * The page was allocated before page_ext was initialized (which means
> > +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> > +	 * do.
> > +	 */
> > +	if (!page_ext->inited ||
> > +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> > +		return;
> > +
> > +	spin_lock_irqsave(&page_ext->maplock, flags);
> > +
> > +	/*
> > +	 * The page was previously allocated to user space, so map it back
> > +	 * into the kernel. No TLB flush required.
> > +	 */
> > +	if ((atomic_inc_return(&page_ext->mapcount) == 1) &&
> > +	    test_and_clear_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags))
> > +		set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
> > +
> > +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> > +}
> > +EXPORT_SYMBOL(xpfo_kmap);
> > +
> > +void xpfo_kunmap(void *kaddr, struct page *page)
> > +{
> > +	struct page_ext *page_ext;
> > +	unsigned long flags;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	page_ext = lookup_page_ext(page);
> > +
> > +	/*
> > +	 * The page was allocated before page_ext was initialized (which means
> > +	 * it's a kernel page) or it's allocated to the kernel, so nothing to
> > +	 * do.
> > +	 */
> > +	if (!page_ext->inited ||
> > +	    test_bit(PAGE_EXT_XPFO_KERNEL, &page_ext->flags))
> > +		return;
> > +
> > +	spin_lock_irqsave(&page_ext->maplock, flags);
> > +
> > +	/*
> > +	 * The page is to be allocated back to user space, so unmap it from the
> > +	 * kernel, flush the TLB and tag it as a user page.
> > +	 */
> > +	if (atomic_dec_return(&page_ext->mapcount) == 0) {
> > +		BUG_ON(test_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags));
> > +		set_bit(PAGE_EXT_XPFO_UNMAPPED, &page_ext->flags);
> > +		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
> > +		__flush_tlb_one((unsigned long)kaddr);
> 
> Again __flush_tlb_one() is x86-specific.
> flush_tlb_kernel_range() instead?
> 
> Thanks,
> -Takahiro AKASHI
> 
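For reference, the suggested arch-neutral replacement would look roughly like this (untested sketch; assumes the single 4k-page granularity of xpfo_kunmap(), with flush_tlb_kernel_range() taking a [start, end) virtual range):

```
		set_kpte(page, (unsigned long)kaddr, __pgprot(0));
		flush_tlb_kernel_range((unsigned long)kaddr,
				       (unsigned long)kaddr + PAGE_SIZE);
```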
> > +	}
> > +
> > +	spin_unlock_irqrestore(&page_ext->maplock, flags);
> > +}
> > +EXPORT_SYMBOL(xpfo_kunmap);
> > +
> > +inline bool xpfo_page_is_unmapped(struct page *page)
> > +{
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return false;
> > +
> > +	return test_bit(PAGE_EXT_XPFO_UNMAPPED, &lookup_page_ext(page)->flags);
> > +}
> > +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> > diff --git a/security/Kconfig b/security/Kconfig
> > index 118f4549404e..4502e15c8419 100644
> > --- a/security/Kconfig
> > +++ b/security/Kconfig
> > @@ -6,6 +6,25 @@ menu "Security options"
> >  
> >  source security/keys/Kconfig
> >  
> > +config ARCH_SUPPORTS_XPFO
> > +	bool
> > +
> > +config XPFO
> > +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> > +	default n
> > +	depends on ARCH_SUPPORTS_XPFO
> > +	select PAGE_EXTENSION
> > +	help
> > +	  This option offers protection against 'ret2dir' kernel attacks.
> > +	  When enabled, every time a page frame is allocated to user space, it
> > +	  is unmapped from the direct mapped RAM region in kernel space
> > +	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
> > +	  mapped back to physmap.
> > +
> > +	  There is a slight performance impact when this option is enabled.
> > +
> > +	  If in doubt, say "N".
> > +
> >  config SECURITY_DMESG_RESTRICT
> >  	bool "Restrict unprivileged access to the kernel syslog"
> >  	default n
> > -- 
> > 2.10.1
> > 
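For anyone wanting to try the series: with the patch applied on an architecture that selects ARCH_SUPPORTS_XPFO (x86-64 only at this point), the feature is switched on like any other Kconfig knob, e.g. via a config fragment:

```
CONFIG_XPFO=y
```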


Thread overview: 93+ messages
2016-02-26 14:21 [RFC PATCH] Add support for eXclusive Page Frame Ownership (XPFO) Juerg Haefliger
2016-03-01  1:31 ` Laura Abbott
2016-03-21  8:37   ` Juerg Haefliger
2016-03-28 19:29     ` Laura Abbott
2016-03-01  2:10 ` Balbir Singh
2016-03-21  8:44   ` Juerg Haefliger
2016-04-01  0:21     ` Balbir Singh
2016-09-02 11:39 ` [RFC PATCH v2 0/3] " Juerg Haefliger
2016-09-02 11:39   ` [RFC PATCH v2 1/3] " Juerg Haefliger
2016-09-02 11:39   ` [RFC PATCH v2 2/3] xpfo: Only put previous userspace pages into the hot cache Juerg Haefliger
2016-09-02 20:39     ` Dave Hansen
2016-09-05 11:54       ` Juerg Haefliger
2016-09-02 11:39   ` [RFC PATCH v2 3/3] block: Always use a bounce buffer when XPFO is enabled Juerg Haefliger
2016-09-14  7:18   ` [RFC PATCH v2 0/3] Add support for eXclusive Page Frame Ownership (XPFO) Juerg Haefliger
2016-09-14  7:18     ` [RFC PATCH v2 1/3] " Juerg Haefliger
2016-09-14  7:19     ` [RFC PATCH v2 2/3] xpfo: Only put previous userspace pages into the hot cache Juerg Haefliger
2016-09-14 14:33       ` [kernel-hardening] " Dave Hansen
2016-09-14 14:40         ` Juerg Haefliger
2016-09-14 14:48           ` Dave Hansen
2016-09-21  5:32             ` Juerg Haefliger
2016-09-14  7:19     ` [RFC PATCH v2 3/3] block: Always use a bounce buffer when XPFO is enabled Juerg Haefliger
2016-09-14  7:33       ` Christoph Hellwig
2016-09-14  7:23     ` [RFC PATCH v2 0/3] Add support for eXclusive Page Frame Ownership (XPFO) Juerg Haefliger
2016-09-14  9:36     ` [kernel-hardening] " Mark Rutland
2016-09-14  9:49       ` Mark Rutland
2016-11-04 14:45     ` [RFC PATCH v3 0/2] " Juerg Haefliger
2016-11-04 14:45       ` [RFC PATCH v3 1/2] " Juerg Haefliger
2016-11-04 14:50         ` Christoph Hellwig
2016-11-10  5:53         ` [kernel-hardening] " ZhaoJunmin Zhao(Junmin)
2016-11-10 19:11         ` Kees Cook
2016-11-15 11:15           ` Juerg Haefliger
2016-11-10 19:24         ` Kees Cook
2016-11-15 11:18           ` Juerg Haefliger
2016-11-24 10:56         ` AKASHI Takahiro
2016-11-28 11:15           ` Juerg Haefliger
2016-12-09  9:02           ` AKASHI Takahiro
2016-11-04 14:45       ` [RFC PATCH v3 2/2] xpfo: Only put previous userspace pages into the hot cache Juerg Haefliger
