* [RFC PTI 0/3] Clean up pgd handling and fix VSYSCALL and LDT
@ 2017-12-08 19:59 Andy Lutomirski
  2017-12-08 19:59 ` [RFC PTI 1/3] x86/pti: Vastly simplify pgd synchronization Andy Lutomirski
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Andy Lutomirski @ 2017-12-08 19:59 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

This needs more testing, but here goes.

Andy Lutomirski (3):
  x86/pti: Vastly simplify pgd synchronization
  Revert "x86/mm/pti: Disable native VSYSCALL"
  x86/pti: Put the LDT in its own PGD if PTI is on

 Documentation/x86/x86_64/mm.txt         |  11 ++-
 arch/x86/Kconfig                        |   8 --
 arch/x86/include/asm/mmu_context.h      |  33 +++++++-
 arch/x86/include/asm/pgtable_64.h       |  74 +++++++----------
 arch/x86/include/asm/pgtable_64_types.h |   2 +
 arch/x86/include/asm/processor.h        |  23 ++++--
 arch/x86/kernel/ldt.c                   | 139 +++++++++++++++++++++++++++++---
 arch/x86/mm/pti.c                       |  52 +++---------
 8 files changed, 224 insertions(+), 118 deletions(-)

-- 
2.13.6

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC PTI 1/3] x86/pti: Vastly simplify pgd synchronization
  2017-12-08 19:59 [RFC PTI 0/3] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
@ 2017-12-08 19:59 ` Andy Lutomirski
  2017-12-08 20:17   ` Dave Hansen
  2017-12-08 19:59 ` [RFC PTI 2/3] Revert "x86/mm/pti: Disable native VSYSCALL" Andy Lutomirski
  2017-12-08 19:59 ` [RFC PTI 3/3] x86/pti: Put the LDT in its own PGD if PTI is on Andy Lutomirski
  2 siblings, 1 reply; 8+ messages in thread
From: Andy Lutomirski @ 2017-12-08 19:59 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

Back when we would dynamically add mappings to the usermode tables,
we needed to preallocate all the high top-level entries in the
usermode tables.  We don't need this in recent versions of PTI, so
get rid of preallocation.

With preallocation gone, the comments in pti_set_user_pgd() make
even less sense.  Rearrange the function to make it entirely obvious
what it does and does not do.  FWIW, I haven't even tried to wrap my
head around the old logic, since it seemed to be somewhere between
incomprehensible and wrong.

I admit that a bit of the earlier complexity was based on my
suggestions.  Mea culpa.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/pgtable_64.h | 74 +++++++++++++++------------------------
 arch/x86/mm/pti.c                 | 52 ++++++---------------------
 2 files changed, 39 insertions(+), 87 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index f5adf92091c6..be8d086de927 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -195,14 +195,6 @@ static inline bool pgdp_maps_userspace(void *__ptr)
 }
 
 /*
- * Does this PGD allow access from userspace?
- */
-static inline bool pgd_userspace_access(pgd_t pgd)
-{
-	return pgd.pgd & _PAGE_USER;
-}
-
-/*
  * Take a PGD location (pgdp) and a pgd value that needs to be set there.
  * Populates the user and returns the resulting PGD that must be set in
  * the kernel copy of the page tables.
@@ -213,50 +205,42 @@ static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
 	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
 		return pgd;
 
-	if (pgd_userspace_access(pgd)) {
-		if (pgdp_maps_userspace(pgdp)) {
-			/*
-			 * The user page tables get the full PGD,
-			 * accessible from userspace:
-			 */
-			kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
-			/*
-			 * For the copy of the pgd that the kernel uses,
-			 * make it unusable to userspace.  This ensures on
-			 * in case that a return to userspace with the
-			 * kernel CR3 value, userspace will crash instead
-			 * of running.
-			 *
-			 * Note: NX might be not available or disabled.
-			 */
-			if (__supported_pte_mask & _PAGE_NX)
-				pgd.pgd |= _PAGE_NX;
-		}
-	} else if (pgd_userspace_access(*pgdp)) {
+	if (pgdp_maps_userspace(pgdp)) {
 		/*
-		 * We are clearing a _PAGE_USER PGD for which we presumably
-		 * populated the user PGD.  We must now clear the user PGD
-		 * entry.
+		 * The user page tables get the full PGD,
+		 * accessible from userspace:
 		 */
-		if (pgdp_maps_userspace(pgdp)) {
-			kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
-		} else {
-			/*
-			 * Attempted to clear a _PAGE_USER PGD which is in
-			 * the kernel portion of the address space.  PGDs
-			 * are pre-populated and we never clear them.
-			 */
-			WARN_ON_ONCE(1);
-		}
+		kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
+
+		/*
+		 * If this is normal user memory, make it NX in the kernel
+		 * pagetables so that, if we somehow screw up and return to
+		 * usermode with the kernel CR3 loaded, we'll get a page
+		 * fault instead of allowing user code to execute with
+		 * the wrong CR3.
+		 *
+		 * As exceptions, we don't set NX if:
+		 *  - this is EFI or similar, the kernel may execute from it
+		 *  - we don't have NX support
+		 *  - we're clearing the PGD (i.e. pgd.pgd == 0).
+		 */
+		if ((pgd.pgd & _PAGE_USER) && (__supported_pte_mask & _PAGE_NX))
+			pgd.pgd |= _PAGE_NX;
 	} else {
 		/*
-		 * _PAGE_USER was not set in either the PGD being set or
-		 * cleared.  All kernel PGDs should be pre-populated so
-		 * this should never happen after boot.
+		 * Changes to the high (kernel) portion of the kernelmode
+		 * page tables are not automatically propagated to the
+		 * usermode tables.
+		 *
+		 * Users should keep in mind that, unlike the kernelmode
+		 * tables, there is no vmalloc_fault equivalent for the
+		 * usermode tables.  Top-level entries added to init_mm's
+		 * usermode pgd after boot will not be automatically
+		 * propagated to other mms.
 		 */
-		WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
 	}
 #endif
+
 	/* return the copy of the PGD we want the kernel to use: */
 	return pgd;
 }
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index bd5d042adb3c..b74f3f89f4d9 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -83,8 +83,16 @@ static pmd_t *pti_user_pagetable_walk_pmd(unsigned long address)
 	}
 
 	if (pgd_none(*pgd)) {
-		WARN_ONCE(1, "All user pgds should have been populated\n");
-		return NULL;
+		unsigned long new_p4d_page = __get_free_page(gfp);
+		if (!new_p4d_page)
+			return NULL;
+
+		if (p4d_none(*p4d)) {
+			set_pgd(pgd, __pgd(_KERNPG_TABLE | __pa(new_p4d_page)));
+			new_p4d_page = 0;
+		}
+		if (new_p4d_page)
+			free_page(new_p4d_page);
 	}
 	BUILD_BUG_ON(pgd_large(*pgd) != 0);
 
@@ -193,45 +201,6 @@ static void __init pti_clone_entry_text(void)
 }
 
 /*
- * Ensure that the top level of the user page tables are entirely
- * populated.  This ensures that all processes that get forked have the
- * same entries.  This way, we do not have to ever go set up new entries in
- * older processes.
- *
- * Note: we never free these, so there are no updates to them after this.
- */
-static void __init pti_init_all_pgds(void)
-{
-	pgd_t *pgd;
-	int i;
-
-	pgd = kernel_to_user_pgdp(pgd_offset_k(0UL));
-	for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
-		/*
-		 * Each PGD entry moves up PGDIR_SIZE bytes through the
-		 * address space, so get the first virtual address mapped
-		 * by PGD #i:
-		 */
-		unsigned long addr = i * PGDIR_SIZE;
-#if CONFIG_PGTABLE_LEVELS > 4
-		p4d_t *p4d = p4d_alloc_one(&init_mm, addr);
-		if (!p4d) {
-			WARN_ON(1);
-			break;
-		}
-		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(p4d)));
-#else /* CONFIG_PGTABLE_LEVELS <= 4 */
-		pud_t *pud = pud_alloc_one(&init_mm, addr);
-		if (!pud) {
-			WARN_ON(1);
-			break;
-		}
-		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(pud)));
-#endif /* CONFIG_PGTABLE_LEVELS */
-	}
-}
-
-/*
  * Initialize kernel page table isolation
  */
 void __init pti_init(void)
@@ -241,7 +210,6 @@ void __init pti_init(void)
 
 	pr_info("enabled\n");
 
-	pti_init_all_pgds();
 	pti_clone_user_shared();
 	pti_clone_entry_text();
 }
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [RFC PTI 2/3] Revert "x86/mm/pti: Disable native VSYSCALL"
  2017-12-08 19:59 [RFC PTI 0/3] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
  2017-12-08 19:59 ` [RFC PTI 1/3] x86/pti: Vastly simplify pgd synchronization Andy Lutomirski
@ 2017-12-08 19:59 ` Andy Lutomirski
  2017-12-08 19:59 ` [RFC PTI 3/3] x86/pti: Put the LDT in its own PGD if PTI is on Andy Lutomirski
  2 siblings, 0 replies; 8+ messages in thread
From: Andy Lutomirski @ 2017-12-08 19:59 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

This reverts commit 6a7b4041b853ecc653e2c1dda5b736ab5fd29357.

With the PGD-propagation logic simplified, there's no need for this.
---
 arch/x86/Kconfig | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 411838058194..babb1e53b0a6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2233,9 +2233,6 @@ choice
 
 	config LEGACY_VSYSCALL_NATIVE
 		bool "Native"
-		# The VSYSCALL page comes from the kernel page tables
-		# and is not available when PAGE_TABLE_ISOLATION is enabled.
-		depends on !PAGE_TABLE_ISOLATION
 		help
 		  Actual executable code is located in the fixed vsyscall
 		  address mapping, implementing time() efficiently. Since
@@ -2253,11 +2250,6 @@ choice
 		  exploits. This configuration is recommended when userspace
 		  still uses the vsyscall area.
 
-		  When PAGE_TABLE_ISOLATION is enabled, the vsyscall area will become
-		  unreadable.  This emulation option still works, but PAGE_TABLE_ISOLATION
-		  will make it harder to do things like trace code using the
-		  emulation.
-
 	config LEGACY_VSYSCALL_NONE
 		bool "None"
 		help
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [RFC PTI 3/3] x86/pti: Put the LDT in its own PGD if PTI is on
  2017-12-08 19:59 [RFC PTI 0/3] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
  2017-12-08 19:59 ` [RFC PTI 1/3] x86/pti: Vastly simplify pgd synchronization Andy Lutomirski
  2017-12-08 19:59 ` [RFC PTI 2/3] Revert "x86/mm/pti: Disable native VSYSCALL" Andy Lutomirski
@ 2017-12-08 19:59 ` Andy Lutomirski
  2017-12-08 21:48   ` Thomas Gleixner
  2 siblings, 1 reply; 8+ messages in thread
From: Andy Lutomirski @ 2017-12-08 19:59 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

With PTI on, we need the LDT to be in the usermode tables somewhere,
and the LDT is per-mm.

tglx had a hack to have a per-cpu LDT and context switch it, but it
was probably insanely slow due to the required TLB flushes.

Instead, take advantage of the fact that we have an address space
hole that gives us a completely unused pgd and make that pgd be
per-mm.  We can put the LDT in it.
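(Concretely: with 4-level paging, PGDIR_SHIFT is 39, so the new
LDT_PGD_ENTRY of -3 covers fffffe8000000000 - fffffeffffffffff, the
"LDT range" added to Documentation/x86/x86_64/mm.txt below.)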

This has a down side: the LDT isn't (currently) randomized, and an
attack that can write the LDT is instant root due to call gates
(thanks, AMD, for leaving call gates in AMD64 but designing them
wrong so they're only useful for exploits).  We could mitigate this
by making the LDT read-only or randomizing it, either of which is
straightforward on top of this patch.

XXX: The 5-level case needs testing and docs updates

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 Documentation/x86/x86_64/mm.txt         |  11 ++-
 arch/x86/include/asm/mmu_context.h      |  33 +++++++-
 arch/x86/include/asm/pgtable_64_types.h |   2 +
 arch/x86/include/asm/processor.h        |  23 ++++--
 arch/x86/kernel/ldt.c                   | 139 +++++++++++++++++++++++++++++---
 5 files changed, 185 insertions(+), 23 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 2d7d6590ade8..bfa44e1cb293 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -12,13 +12,15 @@ ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
 ... unused hole ...
 ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
 ... unused hole ...
+fffffe8000000000 - fffffeffffffffff (=39 bits) LDT range
 ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
 ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
 ... unused hole ...
 ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
-ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space (variable)
-ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+ffffffffa0000000 - [fixmap start]   (~1526 MB) module mapping space (variable)
+[fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
+ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
 ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
 
 Virtual memory map with 5 level page tables:
@@ -39,8 +41,9 @@ ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
 ... unused hole ...
 ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
-ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
-ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+ffffffffa0000000 - [fixmap start]   (~1526 MB) module mapping space
+[fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
+ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
 ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
 
 Architecture defines a 64-bit virtual address. Implementations can support
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 5e1a1ecb65c6..eb87bbeddacc 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -52,13 +52,29 @@ struct ldt_struct {
 	 */
 	struct desc_struct *entries;
 	unsigned int nr_entries;
+
+	/*
+	 * If PTI is in use, then the entries array is not mapped while we're
+	 * in user mode.  The whole array will be aliased at the addressed
+	 * given by ldt_slot_va(slot).
+	 */
+	int slot;
 };
 
+/* This is a multiple of PAGE_SIZE. */
+#define LDT_SLOT_STRIDE (LDT_ENTRIES * LDT_ENTRY_SIZE)
+
+static void *ldt_slot_va(int slot)
+{
+	return (void *)(LDT_BASE_ADDR + LDT_SLOT_STRIDE * slot);
+}
+
 /*
  * Used for LDT copy/destruction.
  */
 int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm);
 void destroy_context_ldt(struct mm_struct *mm);
+void ldt_arch_exit_mmap(struct mm_struct *mm);
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
 static inline int init_new_context_ldt(struct task_struct *tsk,
 				       struct mm_struct *mm)
@@ -90,10 +106,20 @@ static inline void load_mm_ldt(struct mm_struct *mm)
 	 * that we can see.
 	 */
 
-	if (unlikely(ldt))
-		set_ldt(ldt->entries, ldt->nr_entries);
-	else
+	if (unlikely(ldt)) {
+		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
+			if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {
+				clear_LDT();
+				return;
+			}
+
+			set_ldt(ldt_slot_va(ldt->slot), ldt->nr_entries);
+		} else {
+			set_ldt(ldt->entries, ldt->nr_entries);
+		}
+	} else {
 		clear_LDT();
+	}
 #else
 	clear_LDT();
 #endif
@@ -185,6 +211,7 @@ static inline void arch_dup_mmap(struct mm_struct *oldmm,
 static inline void arch_exit_mmap(struct mm_struct *mm)
 {
 	paravirt_arch_exit_mmap(mm);
+	ldt_arch_exit_mmap(mm);
 }
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 6d5f45dcd4a1..130f575f8d1e 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -100,6 +100,8 @@ typedef struct { pteval_t pte; } pte_t;
 #define MODULES_LEN   (MODULES_END - MODULES_VADDR)
 #define ESPFIX_PGD_ENTRY _AC(-2, UL)
 #define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << P4D_SHIFT)
+#define LDT_PGD_ENTRY _AC(-3, UL)
+#define LDT_BASE_ADDR (LDT_PGD_ENTRY << PGDIR_SHIFT)
 #define EFI_VA_START	 ( -4 * (_AC(1, UL) << 30))
 #define EFI_VA_END	 (-68 * (_AC(1, UL) << 30))
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 9e482d8b0b97..9c18da64daa9 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -851,13 +851,22 @@ static inline void spin_lock_prefetch(const void *x)
 
 #else
 /*
- * User space process size. 47bits minus one guard page.  The guard
- * page is necessary on Intel CPUs: if a SYSCALL instruction is at
- * the highest possible canonical userspace address, then that
- * syscall will enter the kernel with a non-canonical return
- * address, and SYSRET will explode dangerously.  We avoid this
- * particular problem by preventing anything from being mapped
- * at the maximum canonical address.
+ * User space process size.  This is the first address outside the user range.
+ * There are a few constraints that determine this:
+ *
+ * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
+ * address, then that syscall will enter the kernel with a
+ * non-canonical return address, and SYSRET will explode dangerously.
+ * We avoid this particular problem by preventing anything executable
+ * from being mapped at the maximum canonical address.
+ *
+ * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
+ * CPUs malfunction if they execute code from the highest canonical page.
+ * They'll speculate right off the end of the canonical space, and
+ * bad things happen.  This is worked around in the same way as the
+ * Intel problem.
+ *
+ * With page table isolation enabled, we map the LDT in ... [stay tuned]
  */
 #define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
 
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index ae5615b03def..8170b57fc498 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -19,6 +19,7 @@
 #include <linux/uaccess.h>
 
 #include <asm/ldt.h>
+#include <asm/tlb.h>
 #include <asm/desc.h>
 #include <asm/mmu_context.h>
 #include <asm/syscalls.h>
@@ -46,13 +47,12 @@ static void refresh_ldt_segments(void)
 static void flush_ldt(void *__mm)
 {
 	struct mm_struct *mm = __mm;
-	mm_context_t *pc;
 
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
 		return;
 
-	pc = &mm->context;
-	set_ldt(pc->ldt->entries, pc->ldt->nr_entries);
+	__flush_tlb_all();
+	load_mm_ldt(mm);
 
 	refresh_ldt_segments();
 }
@@ -90,9 +90,113 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	}
 
 	new_ldt->nr_entries = num_entries;
+	new_ldt->slot = -1;
 	return new_ldt;
 }
 
+static int map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
+{
+#ifdef CONFIG_X86_64
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	bool is_vmalloc;
+	bool had_top_level_entry;
+	int i;
+
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
+		return 0;
+
+	WARN_ON(ldt->slot != -1);
+
+	/*
+	 * Both LDT slots are contained in a single PMD, so we can pass
+	 * LDT_BASE_ADDR instead of the real address to all the _offset
+	 * helpers.
+	 */
+	pgd = pgd_offset(mm, LDT_BASE_ADDR);
+
+	/*
+	 * Did we already have the top level entry allocated?  We can't
+	 * use pgd_none() for this because it doesn't do anything on
+	 * 4-level page table kernels.
+	 */
+	had_top_level_entry = (pgd->pgd != 0);
+
+	p4d = p4d_alloc(mm, pgd, LDT_BASE_ADDR);
+	if (!p4d)
+		return -ENOMEM;
+
+	pud = pud_alloc(mm, p4d, LDT_BASE_ADDR);
+	if (!pud)
+		return -ENOMEM;
+	pmd = pmd_alloc(mm, pud, LDT_BASE_ADDR);
+	if (!pmd)
+		return -ENOMEM;
+	if (pte_alloc(mm, pmd, LDT_BASE_ADDR))
+		return -ENOMEM;
+
+	if (mm->context.ldt) {
+		/*
+		 * We already had an LDT.  The top-level entry should already
+		 * have been allocated and synchronized with the usermode
+		 * tables.
+		 */
+		WARN_ON(!had_top_level_entry);
+		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
+			WARN_ON(!kernel_to_user_pgdp(pgd)->pgd);
+	} else {
+		WARN_ON(had_top_level_entry);
+		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
+			WARN_ON(kernel_to_user_pgdp(pgd)->pgd);
+			set_pgd(kernel_to_user_pgdp(pgd), *pgd);
+		}
+	}
+
+	is_vmalloc = is_vmalloc_addr(ldt->entries);
+
+	for (i = 0; i * PAGE_SIZE < ldt->nr_entries * LDT_ENTRY_SIZE; i++) {
+		unsigned long offset = i << PAGE_SHIFT;
+		unsigned long va = (unsigned long)ldt_slot_va(slot) + offset;
+		const void *src = (char *)ldt->entries + offset;
+		unsigned long pfn = is_vmalloc ? vmalloc_to_pfn(src) :
+			page_to_pfn(virt_to_page(src));
+
+		pte = pte_offset_kernel(pmd, va);
+		set_pte(pte, pfn_pte(pfn, __pgprot(__PAGE_KERNEL & ~_PAGE_GLOBAL)));
+	}
+
+	flush_tlb_mm_range(mm,
+			   (unsigned long)ldt_slot_va(slot),
+			   (unsigned long)ldt_slot_va(slot) + LDT_SLOT_STRIDE,
+			   0);
+
+	ldt->slot = slot;
+
+	return 0;
+#else
+	return -EINVAL;
+#endif
+}
+
+static void free_ldt_pgtables(struct mm_struct *mm)
+{
+#ifdef CONFIG_X86_64
+	struct mmu_gather tlb;
+	unsigned long start = LDT_BASE_ADDR;
+	unsigned long end = start + (1UL << PGDIR_SHIFT);
+
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
+		return;
+
+	tlb_gather_mmu(&tlb, mm, start, end);
+	free_pgd_range(&tlb, start, end, start, end);
+	tlb_finish_mmu(&tlb, start, end);
+#endif
+}
+
 /* After calling this, the LDT is immutable. */
 static void finalize_ldt_struct(struct ldt_struct *ldt)
 {
@@ -134,17 +238,15 @@ int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
 	int retval = 0;
 
 	mutex_init(&mm->context.lock);
+	mm->context.ldt = NULL;
+
 	old_mm = current->mm;
-	if (!old_mm) {
-		mm->context.ldt = NULL;
+	if (!old_mm)
 		return 0;
-	}
 
 	mutex_lock(&old_mm->context.lock);
-	if (!old_mm->context.ldt) {
-		mm->context.ldt = NULL;
+	if (!old_mm->context.ldt)
 		goto out_unlock;
-	}
 
 	new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
 	if (!new_ldt) {
@@ -155,8 +257,17 @@ int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
 	memcpy(new_ldt->entries, old_mm->context.ldt->entries,
 	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
 	finalize_ldt_struct(new_ldt);
+	retval = map_ldt_struct(mm, new_ldt, 0);
+	if (retval)
+		goto out_free;
 
 	mm->context.ldt = new_ldt;
+	goto out_unlock;
+
+out_free:
+	free_ldt_pgtables(mm);
+	free_ldt_struct(new_ldt);
+	return retval;
 
 out_unlock:
 	mutex_unlock(&old_mm->context.lock);
@@ -174,6 +285,11 @@ void destroy_context_ldt(struct mm_struct *mm)
 	mm->context.ldt = NULL;
 }
 
+void ldt_arch_exit_mmap(struct mm_struct *mm)
+{
+	free_ldt_pgtables(mm);
+}
+
 static int read_ldt(void __user *ptr, unsigned long bytecount)
 {
 	struct mm_struct *mm = current->mm;
@@ -285,6 +401,11 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 
 	new_ldt->entries[ldt_info.entry_number] = ldt;
 	finalize_ldt_struct(new_ldt);
+	error = map_ldt_struct(mm, new_ldt, old_ldt ? !old_ldt->slot : 0);
+	if (error) {
+		free_ldt_struct(old_ldt);
+		goto out_unlock;
+	}
 
 	install_ldt(mm, new_ldt);
 	free_ldt_struct(old_ldt);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC PTI 1/3] x86/pti: Vastly simplify pgd synchronization
  2017-12-08 19:59 ` [RFC PTI 1/3] x86/pti: Vastly simplify pgd synchronization Andy Lutomirski
@ 2017-12-08 20:17   ` Dave Hansen
  2017-12-08 20:49     ` Andy Lutomirski
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Hansen @ 2017-12-08 20:17 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra

From a high level, what finally allowed this to happen?  Because
kpti_add_user_map() all went away, including the LDT one?


> +	if (pgdp_maps_userspace(pgdp)) {
>  		/*
> +		 * The user page tables get the full PGD,
> +		 * accessible from userspace:
>  		 */
> +		kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
> +
> +		/*
> +		 * If this is normal user memory, make it NX in the kernel
> +		 * pagetables so that, if we somehow screw up and return to
> +		 * usermode with the kernel CR3 loaded, we'll get a page
> +		 * fault instead of allowing user code to execute with
> +		 * the wrong CR3.
> +		 *
> +		 * As exceptions, we don't set NX if:
> +		 *  - this is EFI or similar, the kernel may execute from it
> +		 *  - we don't have NX support
> +		 *  - we're clearing the PGD (i.e. pgd.pgd == 0).
> +		 */
> +		if ((pgd.pgd & _PAGE_USER) && (__supported_pte_mask & _PAGE_NX))
> +			pgd.pgd |= _PAGE_NX;

I scratched my head for a sec to realize why you didn't need an explicit
pgd.pgd==0 check.  It's part of the _PAGE_USER check, which is a bit
confusing.  Could we do:

if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) &&
    (__supported_pte_mask & _PAGE_NX))

And change the comment:

	 * As exceptions, we don't set NX for:
	 *  - PGDs without _PAGE_USER:  Assume this is for a weird in-kernel
	 *    user like EFI from which we may need to execute.
	 *  - PGDs without _PAGE_PRESENT: PGD is being cleared, must
	 *    not set _PAGE_NX
	 *  - we don't have NX support





>  	} else {
>  		/*
> +		 * Changes to the high (kernel) portion of the kernelmode
> +		 * page tables are not automatically propagated to the
> +		 * usermode tables.
> +		 *
> +		 * Users should keep in mind that, unlike the kernelmode
> +		 * tables, there is no vmalloc_fault equivalent for the
> +		 * usermode tables.  Top-level entries added to init_mm's
> +		 * usermode pgd after boot will not be automatically
> +		 * propagated to other mms.
>  		 */
> -		WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
>  	}
>  #endif

How does the VSYSCALL page get into the user page tables now?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PTI 1/3] x86/pti: Vastly simplify pgd synchronization
  2017-12-08 20:49     ` Andy Lutomirski
@ 2017-12-08 20:48       ` Dave Hansen
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Hansen @ 2017-12-08 20:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra

On 12/08/2017 12:49 PM, Andy Lutomirski wrote:
>> if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) &&
>>     (__supported_pte_mask & _PAGE_NX))
> I assume you mean pgd.pgd & (_PAGE_USER|_PAGE_PRESENT) ==
> (_PAGE_USER|_PAGE_PRESENT)?

Yeah.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PTI 1/3] x86/pti: Vastly simplify pgd synchronization
  2017-12-08 20:17   ` Dave Hansen
@ 2017-12-08 20:49     ` Andy Lutomirski
  2017-12-08 20:48       ` Dave Hansen
  0 siblings, 1 reply; 8+ messages in thread
From: Andy Lutomirski @ 2017-12-08 20:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Brian Gerst, David Laight, Kees Cook, Peter Zijlstra

On Fri, Dec 8, 2017 at 12:17 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> From a high level, what finally allowed this to happen?  Because
> kpti_add_user_map() all went away, including the LDT one?
>

Yes, exactly.

>
>> +     if (pgdp_maps_userspace(pgdp)) {
>>               /*
>> +              * The user page tables get the full PGD,
>> +              * accessible from userspace:
>>                */
>> +             kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
>> +
>> +             /*
>> +              * If this is normal user memory, make it NX in the kernel
>> +              * pagetables so that, if we somehow screw up and return to
>> +              * usermode with the kernel CR3 loaded, we'll get a page
>> +              * fault instead of allowing user code to execute with
>> +              * the wrong CR3.
>> +              *
>> +              * As exceptions, we don't set NX if:
>> +              *  - this is EFI or similar, the kernel may execute from it
>> +              *  - we don't have NX support
>> +              *  - we're clearing the PGD (i.e. pgd.pgd == 0).
>> +              */
>> +             if ((pgd.pgd & _PAGE_USER) && (__supported_pte_mask & _PAGE_NX))
>> +                     pgd.pgd |= _PAGE_NX;
>
> I scratched my head for a sec to realize why you didn't need an explicit
> pgd.pgd==0 check.  It's part of the _PAGE_USER check, which is a bit
> confusing.  Could we do:
>
> if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) &&
>     (__supported_pte_mask & _PAGE_NX))

I assume you mean pgd.pgd & (_PAGE_USER|_PAGE_PRESENT) ==
(_PAGE_USER|_PAGE_PRESENT)?
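
I.e., a minimal sketch of the check, just to spell out the
parenthesization (untested):

	if ((pgd.pgd & (_PAGE_USER | _PAGE_PRESENT)) ==
			(_PAGE_USER | _PAGE_PRESENT) &&
	    (__supported_pte_mask & _PAGE_NX))
		pgd.pgd |= _PAGE_NX;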

>
> And change the comment:
>
>          * As exceptions, we don't set NX for:
>          *  - PGDs without _PAGE_USER:  Assume this is for a weird in-kernel
>          *    user like EFI from which we may need to execute.
>          *  - PGDs without _PAGE_PRESENT: PGD is being cleared, must
>          *    not set _PAGE_NX
>          *  - we don't have NX support
>
>
>
>
>
>>       } else {
>>               /*
>> +              * Changes to the high (kernel) portion of the kernelmode
>> +              * page tables are not automatically propagated to the
>> +              * usermode tables.
>> +              *
>> +              * Users should keep in mind that, unlike the kernelmode
>> +              * tables, there is no vmalloc_fault equivalent for the
>> +              * usermode tables.  Top-level entries added to init_mm's
>> +              * usermode pgd after boot will not be automatically
>> +              * propagated to other mms.
>>                */
>> -             WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
>>       }
>>  #endif
>
> How does the VSYSCALL page get into the user page tables now?

Fantastic question.  I appear to be bad at testing.  Lemme fix that.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PTI 3/3] x86/pti: Put the LDT in its own PGD if PTI is on
  2017-12-08 19:59 ` [RFC PTI 3/3] x86/pti: Put the LDT in its own PGD if PTI is on Andy Lutomirski
@ 2017-12-08 21:48   ` Thomas Gleixner
  0 siblings, 0 replies; 8+ messages in thread
From: Thomas Gleixner @ 2017-12-08 21:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra

On Fri, 8 Dec 2017, Andy Lutomirski wrote:

> With PTI on, we need the LDT to be in the usermode tables somewhere,
> and the LDT is per-mm.
> 
> tglx had a hack to have a per-cpu LDT and context switch it, but it
> was probably insanely slow due to the required TLB flushes.
>
> Instead, take advantage of the fact that we have an address space
> hole that gives us a completely unused pgd and make that pgd be
> per-mm.  We can put the LDT in it.
> 
> This has a down side: the LDT isn't (currently) randomized, and an
> attack that can write the LDT is instant root due to call gates
> (thanks, AMD, for leaving call gates in AMD64 but designing them
> wrong so they're only useful for exploits).  We could mitigate this
> by making the LDT read-only or randomizing it, either of which is
> straightforward on top of this patch.

Randomizing yes. RO not so much. I spent quite some time with Peter Zijlstra
to make that work.

The problem is that the CPU tries to write the accessed bit in the entry
which is referenced by a selector register.

In most cases this is a non-issue because the access happens not when the
selector is loaded, it happens when the selector is referenced, e.g. mov
fs:.... This results in a very nice to identify page fault and writing the
accessed bit from a special fault handler works nicely.

Now what's special is the stack segment. If the LDT has been modified and
IRET (or whatever brings the task back to user space) loads SS from the
stack, this immediately results in a #GP, which does not tell us what
failed, and a fixup of the accessed bit does not work in that case. So the
#GP just comes back.

Forcing the accessed bit to 1 when installing the descriptors does not help
either.

So yes, thanks a lot to the folks who came up with that wonderful exploit
target.

> +static int map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
> +{
> +#ifdef CONFIG_X86_64
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +	bool is_vmalloc;
> +	bool had_top_level_entry;
> +	int i;
> +
> +	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
> +		return 0;
> +
> +	WARN_ON(ldt->slot != -1);
> +
> +	/*
> +	 * Both LDT slots are contained in a single PMD, so we can pass
> +	 * LDT_BASE_ADDR instead of the real address to all the _offset
> +	 * helpers.
> +	 */
> +	pgd = pgd_offset(mm, LDT_BASE_ADDR);
> +
> +	/*
> +	 * Did we already have the top level entry allocated?  We can't
> +	 * use pgd_none() for this because it doesn't do anything on
> +	 * 4-level page table kernels.
> +	 */
> +	had_top_level_entry = (pgd->pgd != 0);
> +
> +	p4d = p4d_alloc(mm, pgd, LDT_BASE_ADDR);
> +	if (!p4d)
> +		return -ENOMEM;
> +
> +	pud = pud_alloc(mm, p4d, LDT_BASE_ADDR);
> +	if (!pud)
> +		return -ENOMEM;
> +	pmd = pmd_alloc(mm, pud, LDT_BASE_ADDR);
> +	if (!pmd)
> +		return -ENOMEM;
> +	if (pte_alloc(mm, pmd, LDT_BASE_ADDR))
> +		return -ENOMEM;

I really find it disgusting that we have a gazillion copies of exactly the
same code which just populates stuff up to the place where PTEs can be
installed.

Can we pretty please have _ONE_ general out of line helper function for
this?
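
Something along these lines, for instance -- a rough sketch only, with a
made-up name and no error unwinding:

static pmd_t *mm_populate_pmd(struct mm_struct *mm, unsigned long addr)
{
	/* Allocate, if needed, everything down to the PTE page for @addr. */
	pgd_t *pgd = pgd_offset(mm, addr);
	p4d_t *p4d = p4d_alloc(mm, pgd, addr);
	pud_t *pud;
	pmd_t *pmd;

	if (!p4d)
		return NULL;

	pud = pud_alloc(mm, p4d, addr);
	if (!pud)
		return NULL;

	pmd = pmd_alloc(mm, pud, addr);
	if (!pmd)
		return NULL;

	if (pte_alloc(mm, pmd, addr))
		return NULL;

	return pmd;
}

Then map_ldt_struct() and its various cousins would only be left with the
PTE-level work.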

> +
> +	if (mm->context.ldt) {
> +		/*
> +		 * We already had an LDT.  The top-level entry should already
> +		 * have been allocated and synchronized with the usermode
> +		 * tables.
> +		 */
> +		WARN_ON(!had_top_level_entry);
> +		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
> +			WARN_ON(!kernel_to_user_pgdp(pgd)->pgd);
> +	} else {
> +		WARN_ON(had_top_level_entry);
> +		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
> +			WARN_ON(kernel_to_user_pgdp(pgd)->pgd);
> +			set_pgd(kernel_to_user_pgdp(pgd), *pgd);
> +		}
> +	}
> +
> +	is_vmalloc = is_vmalloc_addr(ldt->entries);

Just get rid of that single page allocation thing and use vmalloc
unconditionally.
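
I.e., alloc_ldt_struct() could simply do something like this (sketch,
untested):

	new_ldt->entries = vzalloc(num_entries * LDT_ENTRY_SIZE);
	if (!new_ldt->entries) {
		kfree(new_ldt);
		return NULL;
	}

and then is_vmalloc_addr() is always true, so the pfn lookup in
map_ldt_struct() is always just vmalloc_to_pfn().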

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-12-08 21:48 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-08 19:59 [RFC PTI 0/3] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
2017-12-08 19:59 ` [RFC PTI 1/3] x86/pti: Vastly simplify pgd synchronization Andy Lutomirski
2017-12-08 20:17   ` Dave Hansen
2017-12-08 20:49     ` Andy Lutomirski
2017-12-08 20:48       ` Dave Hansen
2017-12-08 19:59 ` [RFC PTI 2/3] Revert "x86/mm/pti: Disable native VSYSCALL" Andy Lutomirski
2017-12-08 19:59 ` [RFC PTI 3/3] x86/pti: Put the LDT in its own PGD if PTI is on Andy Lutomirski
2017-12-08 21:48   ` Thomas Gleixner
