* [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT
@ 2017-12-11  6:47 Andy Lutomirski
  2017-12-11  6:47 ` [PATCH PTI v2 1/6] x86/pti: Vastly simplify pgd synchronization Andy Lutomirski
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Andy Lutomirski @ 2017-12-11  6:47 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

I'm getting reasonably happy with this.  It still needs more testing,
but I want to get it out there.

The main things that need testing are the 5-level case for both the
vsyscalls and the LDT.  I'm getting a double-fault in ldt_gdt_64,
but I haven't checked whether it's a bug in this series, and it
kind of looks like it isn't.  I'll figure it out in the morning.
The docs also want updating for the 5-level case.

Changes from v1:
 - vsyscalls actually work.
 - Added the "Warn and fail" patch to prevent the testing goof I had on v1.
 - Lots of cleanups

Andy Lutomirski (6):
  x86/pti: Vastly simplify pgd synchronization
  Revert "x86/mm/pti: Disable native VSYSCALL"
  x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy
  x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
  x86/pti: Map the vsyscall page if needed
  x86/pti: Put the LDT in its own PGD if PTI is on

 Documentation/x86/x86_64/mm.txt         |  11 ++-
 arch/x86/Kconfig                        |   8 --
 arch/x86/entry/vsyscall/vsyscall_64.c   |  37 ++++++++-
 arch/x86/include/asm/mmu_context.h      |  33 +++++++-
 arch/x86/include/asm/pgtable.h          |   6 +-
 arch/x86/include/asm/pgtable_64.h       |  77 ++++++++----------
 arch/x86/include/asm/pgtable_64_types.h |   2 +
 arch/x86/include/asm/processor.h        |  23 ++++--
 arch/x86/include/asm/vsyscall.h         |   1 +
 arch/x86/kernel/ldt.c                   | 138 +++++++++++++++++++++++++++++---
 arch/x86/mm/pti.c                       | 115 ++++++++++++++++----------
 11 files changed, 331 insertions(+), 120 deletions(-)

-- 
2.13.6

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH PTI v2 1/6] x86/pti: Vastly simplify pgd synchronization
  2017-12-11  6:47 [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
@ 2017-12-11  6:47 ` Andy Lutomirski
  2017-12-11  6:47 ` [PATCH PTI v2 2/6] Revert "x86/mm/pti: Disable native VSYSCALL" Andy Lutomirski
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2017-12-11  6:47 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

Back when we would dynamically add mappings to the usermode tables,
we needed to preallocate all the high top-level entries in the
usermode tables.  We don't need this in recent versions of PTI, so
get rid of preallocation.

With preallocation gone, the comments in pti_set_user_pgd() make
even less sense.  Rearrange the function to make it entirely obvious
what it does and does not do.  FWIW, I haven't even tried to wrap my
head around the old logic, since it seemed to be somewhere between
incomprehensible and wrong.

I admit that a bit of the earlier complexity was based on my
suggestions.  Mea culpa.
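
For reference, the rearranged function boils down to roughly this (a
condensed sketch of the code in the diff below, with the
CONFIG_PAGE_TABLE_ISOLATION #ifdef and the long comments elided):

    static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
    {
            if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
                    return pgd;

            if (pgdp_maps_userspace(pgdp)) {
                    /* Mirror the entry into the usermode tables... */
                    kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;

                    /* ...and make the kernel copy NX if it's user memory. */
                    if ((pgd.pgd & _PAGE_USER) && (__supported_pte_mask & _PAGE_NX))
                            pgd.pgd |= _PAGE_NX;
            }
            /* Kernel-range entries are never copied to the usermode tables. */

            return pgd;
    }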

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/pgtable_64.h | 74 +++++++++++++++------------------------
 arch/x86/mm/pti.c                 | 52 ++++++---------------------
 2 files changed, 39 insertions(+), 87 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index f5adf92091c6..be8d086de927 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -195,14 +195,6 @@ static inline bool pgdp_maps_userspace(void *__ptr)
 }
 
 /*
- * Does this PGD allow access from userspace?
- */
-static inline bool pgd_userspace_access(pgd_t pgd)
-{
-	return pgd.pgd & _PAGE_USER;
-}
-
-/*
  * Take a PGD location (pgdp) and a pgd value that needs to be set there.
  * Populates the user and returns the resulting PGD that must be set in
  * the kernel copy of the page tables.
@@ -213,50 +205,42 @@ static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
 	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
 		return pgd;
 
-	if (pgd_userspace_access(pgd)) {
-		if (pgdp_maps_userspace(pgdp)) {
-			/*
-			 * The user page tables get the full PGD,
-			 * accessible from userspace:
-			 */
-			kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
-			/*
-			 * For the copy of the pgd that the kernel uses,
-			 * make it unusable to userspace.  This ensures on
-			 * in case that a return to userspace with the
-			 * kernel CR3 value, userspace will crash instead
-			 * of running.
-			 *
-			 * Note: NX might be not available or disabled.
-			 */
-			if (__supported_pte_mask & _PAGE_NX)
-				pgd.pgd |= _PAGE_NX;
-		}
-	} else if (pgd_userspace_access(*pgdp)) {
+	if (pgdp_maps_userspace(pgdp)) {
 		/*
-		 * We are clearing a _PAGE_USER PGD for which we presumably
-		 * populated the user PGD.  We must now clear the user PGD
-		 * entry.
+		 * The user page tables get the full PGD,
+		 * accessible from userspace:
 		 */
-		if (pgdp_maps_userspace(pgdp)) {
-			kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
-		} else {
-			/*
-			 * Attempted to clear a _PAGE_USER PGD which is in
-			 * the kernel portion of the address space.  PGDs
-			 * are pre-populated and we never clear them.
-			 */
-			WARN_ON_ONCE(1);
-		}
+		kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
+
+		/*
+		 * If this is normal user memory, make it NX in the kernel
+		 * pagetables so that, if we somehow screw up and return to
+		 * usermode with the kernel CR3 loaded, we'll get a page
+		 * fault instead of allowing user code to execute with
+		 * the wrong CR3.
+		 *
+		 * As exceptions, we don't set NX if:
+		 *  - this is EFI or similar, the kernel may execute from it
+		 *  - we don't have NX support
+		 *  - we're clearing the PGD (i.e. pgd.pgd == 0).
+		 */
+		if ((pgd.pgd & _PAGE_USER) && (__supported_pte_mask & _PAGE_NX))
+			pgd.pgd |= _PAGE_NX;
 	} else {
 		/*
-		 * _PAGE_USER was not set in either the PGD being set or
-		 * cleared.  All kernel PGDs should be pre-populated so
-		 * this should never happen after boot.
+		 * Changes to the high (kernel) portion of the kernelmode
+		 * page tables are not automatically propagated to the
+		 * usermode tables.
+		 *
+		 * Users should keep in mind that, unlike the kernelmode
+		 * tables, there is no vmalloc_fault equivalent for the
+		 * usermode tables.  Top-level entries added to init_mm's
+		 * usermode pgd after boot will not be automatically
+		 * propagated to other mms.
 		 */
-		WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
 	}
 #endif
+
 	/* return the copy of the PGD we want the kernel to use: */
 	return pgd;
 }
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index bd5d042adb3c..f48645d2f3fd 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -83,8 +83,16 @@ static pmd_t *pti_user_pagetable_walk_pmd(unsigned long address)
 	}
 
 	if (pgd_none(*pgd)) {
-		WARN_ONCE(1, "All user pgds should have been populated\n");
-		return NULL;
+		unsigned long new_p4d_page = __get_free_page(gfp);
+		if (!new_p4d_page)
+			return NULL;
+
+		if (pgd_none(*pgd)) {
+			set_pgd(pgd, __pgd(_KERNPG_TABLE | __pa(new_p4d_page)));
+			new_p4d_page = 0;
+		}
+		if (new_p4d_page)
+			free_page(new_p4d_page);
 	}
 	BUILD_BUG_ON(pgd_large(*pgd) != 0);
 
@@ -193,45 +201,6 @@ static void __init pti_clone_entry_text(void)
 }
 
 /*
- * Ensure that the top level of the user page tables are entirely
- * populated.  This ensures that all processes that get forked have the
- * same entries.  This way, we do not have to ever go set up new entries in
- * older processes.
- *
- * Note: we never free these, so there are no updates to them after this.
- */
-static void __init pti_init_all_pgds(void)
-{
-	pgd_t *pgd;
-	int i;
-
-	pgd = kernel_to_user_pgdp(pgd_offset_k(0UL));
-	for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
-		/*
-		 * Each PGD entry moves up PGDIR_SIZE bytes through the
-		 * address space, so get the first virtual address mapped
-		 * by PGD #i:
-		 */
-		unsigned long addr = i * PGDIR_SIZE;
-#if CONFIG_PGTABLE_LEVELS > 4
-		p4d_t *p4d = p4d_alloc_one(&init_mm, addr);
-		if (!p4d) {
-			WARN_ON(1);
-			break;
-		}
-		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(p4d)));
-#else /* CONFIG_PGTABLE_LEVELS <= 4 */
-		pud_t *pud = pud_alloc_one(&init_mm, addr);
-		if (!pud) {
-			WARN_ON(1);
-			break;
-		}
-		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(pud)));
-#endif /* CONFIG_PGTABLE_LEVELS */
-	}
-}
-
-/*
  * Initialize kernel page table isolation
  */
 void __init pti_init(void)
@@ -241,7 +210,6 @@ void __init pti_init(void)
 
 	pr_info("enabled\n");
 
-	pti_init_all_pgds();
 	pti_clone_user_shared();
 	pti_clone_entry_text();
 }
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH PTI v2 2/6] Revert "x86/mm/pti: Disable native VSYSCALL"
  2017-12-11  6:47 [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
  2017-12-11  6:47 ` [PATCH PTI v2 1/6] x86/pti: Vastly simplify pgd synchronization Andy Lutomirski
@ 2017-12-11  6:47 ` Andy Lutomirski
  2017-12-11  6:47 ` [PATCH PTI v2 3/6] x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy Andy Lutomirski
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2017-12-11  6:47 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

This reverts commit 6a7b4041b853ecc653e2c1dda5b736ab5fd29357.

With the PGD-propagation logic simplified, there's no need for this.
---
 arch/x86/Kconfig | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 411838058194..babb1e53b0a6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2233,9 +2233,6 @@ choice
 
 	config LEGACY_VSYSCALL_NATIVE
 		bool "Native"
-		# The VSYSCALL page comes from the kernel page tables
-		# and is not available when PAGE_TABLE_ISOLATION is enabled.
-		depends on !PAGE_TABLE_ISOLATION
 		help
 		  Actual executable code is located in the fixed vsyscall
 		  address mapping, implementing time() efficiently. Since
@@ -2253,11 +2250,6 @@ choice
 		  exploits. This configuration is recommended when userspace
 		  still uses the vsyscall area.
 
-		  When PAGE_TABLE_ISOLATION is enabled, the vsyscall area will become
-		  unreadable.  This emulation option still works, but PAGE_TABLE_ISOLATION
-		  will make it harder to do things like trace code using the
-		  emulation.
-
 	config LEGACY_VSYSCALL_NONE
 		bool "None"
 		help
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH PTI v2 3/6] x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy
  2017-12-11  6:47 [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
  2017-12-11  6:47 ` [PATCH PTI v2 1/6] x86/pti: Vastly simplify pgd synchronization Andy Lutomirski
  2017-12-11  6:47 ` [PATCH PTI v2 2/6] Revert "x86/mm/pti: Disable native VSYSCALL" Andy Lutomirski
@ 2017-12-11  6:47 ` Andy Lutomirski
  2017-12-11 13:39   ` Ingo Molnar
  2017-12-11  6:47 ` [PATCH PTI v2 4/6] x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode Andy Lutomirski
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2017-12-11  6:47 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

The kernel is very erratic as to which pagetables have _PAGE_USER
set.  The vsyscall page gets lucky: it seems that all of the
relevant pagetables are among the apparently arbitrary ones that set
_PAGE_USER.  Rather than relying on chance, just explicitly set
_PAGE_USER.

This will let us clean up pagetable setup to stop setting
_PAGE_USER.  The added code can also be reused by pagetable
isolation to manage the _PAGE_USER bit in the usermode tables.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/vsyscall/vsyscall_64.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index f279ba2643dc..bc88a0540347 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -329,16 +329,47 @@ int in_gate_area_no_mm(unsigned long addr)
 	return vsyscall_mode != NONE && (addr & PAGE_MASK) == VSYSCALL_ADDR;
 }
 
+/*
+ * The VSYSCALL page is the only user-accessible page in the kernel address
+ * range.  Normally, the kernel page tables can have _PAGE_USER clear, but
+ * the tables covering VSYSCALL_ADDR need _PAGE_USER set if vsyscalls
+ * are enabled.
+ *
+ * Some day we may create a "minimal" vsyscall mode in which we emulate
+ * vsyscalls but leave the page not present.  If so, we skip calling
+ * this.
+ */
+static void __init set_vsyscall_pgtable_user_bits(void)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset_k(VSYSCALL_ADDR);
+	pgd->pgd |= _PAGE_USER;
+	p4d = p4d_offset(pgd, VSYSCALL_ADDR);
+#if CONFIG_PGTABLE_LEVELS >= 5
+	p4d->p4d |= _PAGE_USER;
+#endif
+	pud = pud_offset(p4d, VSYSCALL_ADDR);
+	pud->pud |= _PAGE_USER;
+	pmd = pmd_offset(pud, VSYSCALL_ADDR);
+	pmd->pmd |= _PAGE_USER;
+}
+
 void __init map_vsyscall(void)
 {
 	extern char __vsyscall_page;
 	unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
 
-	if (vsyscall_mode != NONE)
+	if (vsyscall_mode != NONE) {
 		__set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
 			     vsyscall_mode == NATIVE
 			     ? PAGE_KERNEL_VSYSCALL
 			     : PAGE_KERNEL_VVAR);
+		set_vsyscall_pgtable_user_bits();
+	}
 
 	BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
 		     (unsigned long)VSYSCALL_ADDR);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH PTI v2 4/6] x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
  2017-12-11  6:47 [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
                   ` (2 preceding siblings ...)
  2017-12-11  6:47 ` [PATCH PTI v2 3/6] x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy Andy Lutomirski
@ 2017-12-11  6:47 ` Andy Lutomirski
  2017-12-11  6:47 ` [PATCH PTI v2 5/6] x86/pti: Map the vsyscall page if needed Andy Lutomirski
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2017-12-11  6:47 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

If something goes wrong with pagetable setup, vsyscall=native will
accidentally fall back to emulation.  Make it warn and fail so that
we notice.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/vsyscall/vsyscall_64.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index bc88a0540347..a06f2ae09ad6 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -138,6 +138,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
 
 	WARN_ON_ONCE(address != regs->ip);
 
+	/* This should be unreachable in NATIVE mode. */
+	if (WARN_ON(vsyscall_mode == NATIVE))
+		return false;
+
 	if (vsyscall_mode == NONE) {
 		warn_bad_vsyscall(KERN_INFO, regs,
 				  "vsyscall attempted with vsyscall=none");
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH PTI v2 5/6] x86/pti: Map the vsyscall page if needed
  2017-12-11  6:47 [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
                   ` (3 preceding siblings ...)
  2017-12-11  6:47 ` [PATCH PTI v2 4/6] x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode Andy Lutomirski
@ 2017-12-11  6:47 ` Andy Lutomirski
  2017-12-11  6:47 ` [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on Andy Lutomirski
  2017-12-11  6:54 ` [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
  6 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2017-12-11  6:47 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

Make VSYSCALLs work fully in PTI mode.
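
One quick way to check this from userspace (a sketch for manual testing,
not part of the patch): call the legacy vsyscall time() entry directly
and make sure it returns a sane value instead of faulting.  The fixed
addresses (gettimeofday at 0xffffffffff600000, time at 0xffffffffff600400,
getcpu at 0xffffffffff600800) are part of the legacy ABI.

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
            /* time() entry in the fixed-address legacy vsyscall page. */
            time_t (*vsys_time)(time_t *) =
                    (time_t (*)(time_t *))0xffffffffff600400UL;
            time_t t = vsys_time(NULL);

            printf("vsyscall time() -> %ld\n", (long)t);
            return (t == (time_t)-1) ? 1 : 0;
    }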

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/vsyscall/vsyscall_64.c |  6 ++--
 arch/x86/include/asm/pgtable.h        |  6 +++-
 arch/x86/include/asm/pgtable_64.h     |  9 +++--
 arch/x86/include/asm/vsyscall.h       |  1 +
 arch/x86/mm/pti.c                     | 63 +++++++++++++++++++++++++++++++++++
 5 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index a06f2ae09ad6..e4a6fe8354f0 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -343,14 +343,14 @@ int in_gate_area_no_mm(unsigned long addr)
  * vsyscalls but leave the page not present.  If so, we skip calling
  * this.
  */
-static void __init set_vsyscall_pgtable_user_bits(void)
+void __init set_vsyscall_pgtable_user_bits(pgd_t *root)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 
-	pgd = pgd_offset_k(VSYSCALL_ADDR);
+	pgd = pgd_offset_pgd(root, VSYSCALL_ADDR);
 	pgd->pgd |= _PAGE_USER;
 	p4d = p4d_offset(pgd, VSYSCALL_ADDR);
 #if CONFIG_PGTABLE_LEVELS >= 5
@@ -372,7 +372,7 @@ void __init map_vsyscall(void)
 			     vsyscall_mode == NATIVE
 			     ? PAGE_KERNEL_VSYSCALL
 			     : PAGE_KERNEL_VVAR);
-		set_vsyscall_pgtable_user_bits();
+		set_vsyscall_pgtable_user_bits(swapper_pg_dir);
 	}
 
 	BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 83c0c77e7365..a8a8fc15ca16 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -920,7 +920,11 @@ static inline int pgd_none(pgd_t pgd)
  * pgd_offset() returns a (pgd_t *)
  * pgd_index() is used get the offset into the pgd page's array of pgd_t's;
  */
-#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
+#define pgd_offset_pgd(pgd, address) (pgd + pgd_index((address)))
+/*
+ * a shortcut to get a pgd_t in a given mm
+ */
+#define pgd_offset(mm, address) pgd_offset_pgd((mm)->pgd, (address))
 /*
  * a shortcut which implies the use of the kernel's pgd, instead
  * of a process's
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index be8d086de927..a2fb3f8bc985 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -220,11 +220,14 @@ static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
 		 * the wrong CR3.
 		 *
 		 * As exceptions, we don't set NX if:
-		 *  - this is EFI or similar, the kernel may execute from it
+		 *  - _PAGE_USER is not set.  This could be an executable
+		 *     EFI runtime mapping or something similar, and the kernel
+		 *     may execute from it
 		 *  - we don't have NX support
-		 *  - we're clearing the PGD (i.e. pgd.pgd == 0).
+		 *  - we're clearing the PGD (i.e. the new pgd is not present).
 		 */
-		if ((pgd.pgd & _PAGE_USER) && (__supported_pte_mask & _PAGE_NX))
+		if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
+		    (__supported_pte_mask & _PAGE_NX))
 			pgd.pgd |= _PAGE_NX;
 	} else {
 		/*
diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index d9a7c659009c..b986b2ca688a 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -7,6 +7,7 @@
 
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
 extern void map_vsyscall(void);
+extern void set_vsyscall_pgtable_user_bits(pgd_t *root);
 
 /*
  * Called on instruction fetch fault in vsyscall page.
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index f48645d2f3fd..a9c53d21f0a8 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -38,6 +38,7 @@
 
 #include <asm/cpufeature.h>
 #include <asm/hypervisor.h>
+#include <asm/vsyscall.h>
 #include <asm/cmdline.h>
 #include <asm/pti.h>
 #include <asm/pgtable.h>
@@ -133,6 +134,48 @@ static pmd_t *pti_user_pagetable_walk_pmd(unsigned long address)
 	return pmd_offset(pud, address);
 }
 
+/*
+ * Walk the shadow copy of the page tables (optionally) trying to allocate
+ * page table pages on the way down.  Does not support large pages.
+ *
+ * Note: this is only used when mapping *new* kernel data into the
+ * user/shadow page tables.  It is never used for userspace data.
+ *
+ * Returns a pointer to a PTE on success, or NULL on failure.
+ */
+static pte_t *pti_user_pagetable_walk_pte(unsigned long address)
+{
+	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
+	pmd_t *pmd = pti_user_pagetable_walk_pmd(address);
+	pte_t *pte;
+
+	/* We can't do anything sensible if we hit a large mapping. */
+	if (pmd_large(*pmd)) {
+		WARN_ON(1);
+		return NULL;
+	}
+
+	if (pmd_none(*pmd)) {
+		unsigned long new_pte_page = __get_free_page(gfp);
+		if (!new_pte_page)
+			return NULL;
+
+		if (pmd_none(*pmd)) {
+			set_pmd(pmd, __pmd(_KERNPG_TABLE | __pa(new_pte_page)));
+			new_pte_page = 0;
+		}
+		if (new_pte_page)
+			free_page(new_pte_page);
+	}
+
+	pte = pte_offset_kernel(pmd, address);
+	if (pte_flags(*pte) & _PAGE_USER) {
+		WARN_ONCE(1, "attempt to walk to user pte\n");
+		return NULL;
+	}
+	return pte;
+}
+
 static void __init
 pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
 {
@@ -174,6 +217,25 @@ pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
 	}
 }
 
+static void __init pti_setup_vsyscall(void)
+{
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+	pte_t *pte, *target_pte;
+	unsigned int level;
+
+	pte = lookup_address(VSYSCALL_ADDR, &level);
+	if (!pte || WARN_ON(level != PG_LEVEL_4K))
+		return;
+
+	target_pte = pti_user_pagetable_walk_pte(VSYSCALL_ADDR);
+	if (WARN_ON(!target_pte))
+		return;
+
+	*target_pte = *pte;
+	set_vsyscall_pgtable_user_bits(kernel_to_user_pgdp(swapper_pg_dir));
+#endif
+}
+
 /*
  * Clone the populated PMDs of the user shared fixmaps into the user space
  * visible page table.
@@ -212,4 +274,5 @@ void __init pti_init(void)
 
 	pti_clone_user_shared();
 	pti_clone_entry_text();
+	pti_setup_vsyscall();
 }
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on
  2017-12-11  6:47 [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
                   ` (4 preceding siblings ...)
  2017-12-11  6:47 ` [PATCH PTI v2 5/6] x86/pti: Map the vsyscall page if needed Andy Lutomirski
@ 2017-12-11  6:47 ` Andy Lutomirski
  2017-12-11 17:49   ` Dave Hansen
  2017-12-11  6:54 ` [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
  6 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2017-12-11  6:47 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

With PTI on, we need the LDT to be in the usermode tables somewhere,
and the LDT is per-mm.

tglx had a hack to have a per-cpu LDT and context switch it, but it
was probably insanely slow due to the required TLB flushes.

Instead, take advantage of the fact that we have an address space
hole that gives us a completely unused pgd and make that pgd be
per-mm.  We can put the LDT in it.

This has a downside: the LDT isn't (currently) randomized, and an
attack that can write the LDT is instant root due to call gates
(thanks, AMD, for leaving call gates in AMD64 but designing them
wrong so they're only useful for exploits).  We could mitigate this
by making the LDT read-only or randomizing it, either of which is
straightforward on top of this patch.

XXX: The 5-level case needs testing and docs updates
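
For concreteness, here is where the aliases land with 4-level paging (a
worked sketch using the constants added below; LDT_ENTRIES == 8192 and
LDT_ENTRY_SIZE == 8 are the existing values, and PGDIR_SHIFT == 39):

    LDT_BASE_ADDR   = LDT_PGD_ENTRY << PGDIR_SHIFT = -3UL << 39
                    = 0xfffffe8000000000
    LDT_SLOT_STRIDE = LDT_ENTRIES * LDT_ENTRY_SIZE = 8192 * 8 = 64 KiB
    ldt_slot_va(0)  = 0xfffffe8000000000
    ldt_slot_va(1)  = 0xfffffe8000010000

which matches the new "LDT range" line in the mm.txt hunk.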

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 Documentation/x86/x86_64/mm.txt         |  11 ++-
 arch/x86/include/asm/mmu_context.h      |  33 +++++++-
 arch/x86/include/asm/pgtable_64_types.h |   2 +
 arch/x86/include/asm/processor.h        |  23 ++++--
 arch/x86/kernel/ldt.c                   | 138 +++++++++++++++++++++++++++++---
 5 files changed, 184 insertions(+), 23 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 2d7d6590ade8..bfa44e1cb293 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -12,13 +12,15 @@ ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
 ... unused hole ...
 ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
 ... unused hole ...
+fffffe8000000000 - fffffeffffffffff (=39 bits) LDT range
 ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
 ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
 ... unused hole ...
 ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
-ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space (variable)
-ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+ffffffffa0000000 - [fixmap start]   (~1526 MB) module mapping space (variable)
+[fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
+ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
 ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
 
 Virtual memory map with 5 level page tables:
@@ -39,8 +41,9 @@ ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
 ... unused hole ...
 ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
-ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
-ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+ffffffffa0000000 - [fixmap start]   (~1526 MB) module mapping space
+[fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
+ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
 ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
 
 Architecture defines a 64-bit virtual address. Implementations can support
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 5e1a1ecb65c6..eb87bbeddacc 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -52,13 +52,29 @@ struct ldt_struct {
 	 */
 	struct desc_struct *entries;
 	unsigned int nr_entries;
+
+	/*
+	 * If PTI is in use, then the entries array is not mapped while we're
+	 * in user mode.  The whole array will be aliased at the addressed
+	 * given by ldt_slot_va(slot).
+	 */
+	int slot;
 };
 
+/* This is a multiple of PAGE_SIZE. */
+#define LDT_SLOT_STRIDE (LDT_ENTRIES * LDT_ENTRY_SIZE)
+
+static void *ldt_slot_va(int slot)
+{
+	return (void *)(LDT_BASE_ADDR + LDT_SLOT_STRIDE * slot);
+}
+
 /*
  * Used for LDT copy/destruction.
  */
 int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm);
 void destroy_context_ldt(struct mm_struct *mm);
+void ldt_arch_exit_mmap(struct mm_struct *mm);
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
 static inline int init_new_context_ldt(struct task_struct *tsk,
 				       struct mm_struct *mm)
@@ -90,10 +106,20 @@ static inline void load_mm_ldt(struct mm_struct *mm)
 	 * that we can see.
 	 */
 
-	if (unlikely(ldt))
-		set_ldt(ldt->entries, ldt->nr_entries);
-	else
+	if (unlikely(ldt)) {
+		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
+			if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {
+				clear_LDT();
+				return;
+			}
+
+			set_ldt(ldt_slot_va(ldt->slot), ldt->nr_entries);
+		} else {
+			set_ldt(ldt->entries, ldt->nr_entries);
+		}
+	} else {
 		clear_LDT();
+	}
 #else
 	clear_LDT();
 #endif
@@ -185,6 +211,7 @@ static inline void arch_dup_mmap(struct mm_struct *oldmm,
 static inline void arch_exit_mmap(struct mm_struct *mm)
 {
 	paravirt_arch_exit_mmap(mm);
+	ldt_arch_exit_mmap(mm);
 }
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 6d5f45dcd4a1..130f575f8d1e 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -100,6 +100,8 @@ typedef struct { pteval_t pte; } pte_t;
 #define MODULES_LEN   (MODULES_END - MODULES_VADDR)
 #define ESPFIX_PGD_ENTRY _AC(-2, UL)
 #define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << P4D_SHIFT)
+#define LDT_PGD_ENTRY _AC(-3, UL)
+#define LDT_BASE_ADDR (LDT_PGD_ENTRY << PGDIR_SHIFT)
 #define EFI_VA_START	 ( -4 * (_AC(1, UL) << 30))
 #define EFI_VA_END	 (-68 * (_AC(1, UL) << 30))
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 9e482d8b0b97..9c18da64daa9 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -851,13 +851,22 @@ static inline void spin_lock_prefetch(const void *x)
 
 #else
 /*
- * User space process size. 47bits minus one guard page.  The guard
- * page is necessary on Intel CPUs: if a SYSCALL instruction is at
- * the highest possible canonical userspace address, then that
- * syscall will enter the kernel with a non-canonical return
- * address, and SYSRET will explode dangerously.  We avoid this
- * particular problem by preventing anything from being mapped
- * at the maximum canonical address.
+ * User space process size.  This is the first address outside the user range.
+ * There are a few constraints that determine this:
+ *
+ * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
+ * address, then that syscall will enter the kernel with a
+ * non-canonical return address, and SYSRET will explode dangerously.
+ * We avoid this particular problem by preventing anything executable
+ * from being mapped at the maximum canonical address.
+ *
+ * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
+ * CPUs malfunction if they execute code from the highest canonical page.
+ * They'll speculate right off the end of the canonical space, and
+ * bad things happen.  This is worked around in the same way as the
+ * Intel problem.
+ *
+ * With page table isolation enabled, we map the LDT in ... [stay tuned]
  */
 #define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
 
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index ae5615b03def..46ad333ed797 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -19,6 +19,7 @@
 #include <linux/uaccess.h>
 
 #include <asm/ldt.h>
+#include <asm/tlb.h>
 #include <asm/desc.h>
 #include <asm/mmu_context.h>
 #include <asm/syscalls.h>
@@ -46,13 +47,12 @@ static void refresh_ldt_segments(void)
 static void flush_ldt(void *__mm)
 {
 	struct mm_struct *mm = __mm;
-	mm_context_t *pc;
 
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
 		return;
 
-	pc = &mm->context;
-	set_ldt(pc->ldt->entries, pc->ldt->nr_entries);
+	__flush_tlb_all();
+	load_mm_ldt(mm);
 
 	refresh_ldt_segments();
 }
@@ -90,9 +90,112 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	}
 
 	new_ldt->nr_entries = num_entries;
+	new_ldt->slot = -1;
 	return new_ldt;
 }
 
+/*
+ * If PTI is enabled, this maps the LDT into the kernelmode and
+ * usermode tables for the given mm.
+ *
+ * There is no corresponding unmap function.  Even if the LDT is freed, we
+ * leave the PTEs around until the slot is reused or the mm is destroyed.
+ * This is harmless: the LDT is always in ordinary memory, and no one will
+ * access the freed slot.
+ *
+ * If we wanted to unmap freed LDTs, we'd also need to do a flush to make
+ * it useful, and the flush would slow down modify_ldt().
+ */
+static int map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
+{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	spinlock_t *ptl;
+	bool is_vmalloc;
+	bool had_top_level_entry;
+	pgd_t *pgd;
+	int i;
+
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
+		return 0;
+
+	WARN_ON(ldt->slot != -1);
+
+	/*
+	 * Did we already have the top level entry allocated?  We can't
+	 * use pgd_none() for this because it doesn't do anything on
+	 * 4-level page table kernels.
+	 */
+	pgd = pgd_offset(mm, LDT_BASE_ADDR);
+	had_top_level_entry = (pgd->pgd != 0);
+
+	is_vmalloc = is_vmalloc_addr(ldt->entries);
+
+	for (i = 0; i * PAGE_SIZE < ldt->nr_entries * LDT_ENTRY_SIZE; i++) {
+		unsigned long offset = i << PAGE_SHIFT;
+		unsigned long va = (unsigned long)ldt_slot_va(slot) + offset;
+		const void *src = (char *)ldt->entries + offset;
+		unsigned long pfn = is_vmalloc ? vmalloc_to_pfn(src) :
+			page_to_pfn(virt_to_page(src));
+		pte_t pte, *ptep;
+
+		ptep = get_locked_pte(mm, va, &ptl);
+		if (!ptep)
+			return -ENOMEM;
+		pte = pfn_pte(pfn, __pgprot(__PAGE_KERNEL & ~_PAGE_GLOBAL));
+			      set_pte_at(mm, va, ptep, pte);
+		pte_unmap_unlock(ptep, ptl);
+	}
+
+	if (mm->context.ldt) {
+		/*
+		 * We already had an LDT.  The top-level entry should already
+		 * have been allocated and synchronized with the usermode
+		 * tables.
+		 */
+		WARN_ON(!had_top_level_entry);
+		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
+			WARN_ON(!kernel_to_user_pgdp(pgd)->pgd);
+	} else {
+		/*
+		 * This is the first time we're mapping an LDT for this process.
+		 * Sync the pgd to the usermode tables.
+		 */
+		WARN_ON(had_top_level_entry);
+		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
+			WARN_ON(kernel_to_user_pgdp(pgd)->pgd);
+			set_pgd(kernel_to_user_pgdp(pgd), *pgd);
+		}
+	}
+
+	flush_tlb_mm_range(mm,
+			   (unsigned long)ldt_slot_va(slot),
+			   (unsigned long)ldt_slot_va(slot) + LDT_SLOT_STRIDE,
+			   0);
+
+	ldt->slot = slot;
+
+	return 0;
+#else
+	return -EINVAL;
+#endif
+}
+
+static void free_ldt_pgtables(struct mm_struct *mm)
+{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	struct mmu_gather tlb;
+	unsigned long start = LDT_BASE_ADDR;
+	unsigned long end = start + (1UL << PGDIR_SHIFT);
+
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
+		return;
+
+	tlb_gather_mmu(&tlb, mm, start, end);
+	free_pgd_range(&tlb, start, end, start, end);
+	tlb_finish_mmu(&tlb, start, end);
+#endif
+}
+
 /* After calling this, the LDT is immutable. */
 static void finalize_ldt_struct(struct ldt_struct *ldt)
 {
@@ -134,17 +237,15 @@ int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
 	int retval = 0;
 
 	mutex_init(&mm->context.lock);
+	mm->context.ldt = NULL;
+
 	old_mm = current->mm;
-	if (!old_mm) {
-		mm->context.ldt = NULL;
+	if (!old_mm)
 		return 0;
-	}
 
 	mutex_lock(&old_mm->context.lock);
-	if (!old_mm->context.ldt) {
-		mm->context.ldt = NULL;
+	if (!old_mm->context.ldt)
 		goto out_unlock;
-	}
 
 	new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
 	if (!new_ldt) {
@@ -155,8 +256,17 @@ int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
 	memcpy(new_ldt->entries, old_mm->context.ldt->entries,
 	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
 	finalize_ldt_struct(new_ldt);
+	retval = map_ldt_struct(mm, new_ldt, 0);
+	if (retval)
+		goto out_free;
 
 	mm->context.ldt = new_ldt;
+	goto out_unlock;
+
+out_free:
+	free_ldt_pgtables(mm);
+	free_ldt_struct(new_ldt);
+	return retval;
 
 out_unlock:
 	mutex_unlock(&old_mm->context.lock);
@@ -174,6 +284,11 @@ void destroy_context_ldt(struct mm_struct *mm)
 	mm->context.ldt = NULL;
 }
 
+void ldt_arch_exit_mmap(struct mm_struct *mm)
+{
+	free_ldt_pgtables(mm);
+}
+
 static int read_ldt(void __user *ptr, unsigned long bytecount)
 {
 	struct mm_struct *mm = current->mm;
@@ -285,6 +400,11 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 
 	new_ldt->entries[ldt_info.entry_number] = ldt;
 	finalize_ldt_struct(new_ldt);
+	error = map_ldt_struct(mm, new_ldt, old_ldt ? !old_ldt->slot : 0);
+	if (error) {
+		free_ldt_struct(old_ldt);
+		goto out_unlock;
+	}
 
 	install_ldt(mm, new_ldt);
 	free_ldt_struct(old_ldt);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT
  2017-12-11  6:47 [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
                   ` (5 preceding siblings ...)
  2017-12-11  6:47 ` [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on Andy Lutomirski
@ 2017-12-11  6:54 ` Andy Lutomirski
  2017-12-12 16:01   ` Kirill A. Shutemov
  6 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2017-12-11  6:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra

On Sun, Dec 10, 2017 at 10:47 PM, Andy Lutomirski <luto@kernel.org> wrote:
> I'm getting reasonably happy with this.  It still needs more testing,
> but I want to get it out there.
>
> The main things that need testing are the 5-level case for the both
> vsyscalls and the LDT.  I'm getting a double-fault in ldt_gdt_64,
> but I haven't checked whether it's a bug in this series, and it
> kind of looks like it isn't.  I'll figure it out in the morning.
> The docs also want updating for the 5 level case.
>

The actual failure looks a lot like the ESPFIX stacks aren't mapped in
the usermode tables, which reinforces my old belief that this bit of
code is bogus:

    /*
     * Just copy the top-level PGD that is mapping the espfix area to
     * ensure it is mapped into the user page tables.
     *
     * For 5-level paging, the espfix pgd was populated when
     * pti_init() pre-populated all the pgd entries.  The above
     * p4d_alloc() would never do anything and the p4d_populate() would
     * be done to a p4d already mapped in the userspace pgd.
     */
#ifdef CONFIG_PAGE_TABLE_ISOLATION
    if (CONFIG_PGTABLE_LEVELS <= 4) {
        set_pgd(kernel_to_user_pgdp(pgd),
            __pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));
    }
#endif

Of course, the comment is even more wrong with this series applied,
but I think it's been wrong all along.

I'll fix it in the morning if no one beats me to it.

(Hint: this can be tested in QEMU with -machine accel=tcg -cpu qemu64,+la57)

> Changes from v1:
>  - vsyscalls actually work.
>  - Added the "Warn and fail" patch to prevent the testing goof I had on v1.
>  - Lots of cleanups
>
> Andy Lutomirski (6):
>   x86/pti: Vastly simplify pgd synchronization
>   Revert "x86/mm/pti: Disable native VSYSCALL"
>   x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy
>   x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
>   x86/pti: Map the vsyscall page if needed
>   x86/pti: Put the LDT in its own PGD if PTI is on
>
>  Documentation/x86/x86_64/mm.txt         |  11 ++-
>  arch/x86/Kconfig                        |   8 --
>  arch/x86/entry/vsyscall/vsyscall_64.c   |  37 ++++++++-
>  arch/x86/include/asm/mmu_context.h      |  33 +++++++-
>  arch/x86/include/asm/pgtable.h          |   6 +-
>  arch/x86/include/asm/pgtable_64.h       |  77 ++++++++----------
>  arch/x86/include/asm/pgtable_64_types.h |   2 +
>  arch/x86/include/asm/processor.h        |  23 ++++--
>  arch/x86/include/asm/vsyscall.h         |   1 +
>  arch/x86/kernel/ldt.c                   | 138 +++++++++++++++++++++++++++++---
>  arch/x86/mm/pti.c                       | 115 ++++++++++++++++----------
>  11 files changed, 331 insertions(+), 120 deletions(-)
>
> --
> 2.13.6
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH PTI v2 3/6] x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy
  2017-12-11  6:47 ` [PATCH PTI v2 3/6] x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy Andy Lutomirski
@ 2017-12-11 13:39   ` Ingo Molnar
  2017-12-11 16:01     ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2017-12-11 13:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra


* Andy Lutomirski <luto@kernel.org> wrote:

> The kernel is very erratic as to which pagetables have _PAGE_USER
> set.  The vsyscall page gets lucky: it seems that all of the
> relevant pagetables are among the apparently arbitrary ones that set
> _PAGE_USER.  Rather than relying on chance, just explicitly set
> _PAGE_USER.
> 
> This will let us clean up pagetable setup to stop setting
> _PAGE_USER.  The added code can also be reused by pagetable
> isolation to manage the _PAGE_USER bit in the usermode tables.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/vsyscall/vsyscall_64.c | 33 ++++++++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)

Btw., would it make sense to clean up all this confusion?

In particular a 'KERNEL' pre- or postfix is ambiguous in this context I think, and
the PAGE_KERNEL_ prefix is actively harmful I think and is at the root of the
confusion.

So if we renamed it and used this nomenclature consistently instead:

  PAGE_USER_
  PAGE_SYSTEM_

... and got rid of PAGE_KERNEL uses in arch/x86/, then it would be obvious at 
first glance what kind of mapping is established in a particular place - and it 
would stay so in the future as well.
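
As a purely hypothetical illustration of the proposed naming (these
defines don't exist anywhere; they just alias the current names):

    #define PAGE_SYSTEM           PAGE_KERNEL            /* not user-accessible */
    #define PAGE_SYSTEM_EXEC      PAGE_KERNEL_EXEC       /* not user-accessible, executable */
    #define PAGE_SYSTEM_RO        PAGE_KERNEL_RO         /* not user-accessible, read-only */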

( There's some interaction with generic MM code which needs the original defines 
  like PAGE_KERNEL[_EXEC], but those generic masks could be defined as aliases, to 
  keep this cleanup within x86 for now. )

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH PTI v2 3/6] x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy
  2017-12-11 13:39   ` Ingo Molnar
@ 2017-12-11 16:01     ` Andy Lutomirski
  2017-12-11 16:24       ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2017-12-11 16:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Brian Gerst, David Laight, Kees Cook, Peter Zijlstra

On Mon, Dec 11, 2017 at 5:39 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>> The kernel is very erratic as to which pagetables have _PAGE_USER
>> set.  The vsyscall page gets lucky: it seems that all of the
>> relevant pagetables are among the apparently arbitrary ones that set
>> _PAGE_USER.  Rather than relying on chance, just explicitly set
>> _PAGE_USER.
>>
>> This will let us clean up pagetable setup to stop setting
>> _PAGE_USER.  The added code can also be reused by pagetable
>> isolation to manage the _PAGE_USER bit in the usermode tables.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/entry/vsyscall/vsyscall_64.c | 33 ++++++++++++++++++++++++++++++++-
>>  1 file changed, 32 insertions(+), 1 deletion(-)
>
> Btw., would it make sense to clean up all this confusion?
>
> In particular a 'KERNEL' pre of post fix is ambiguous in this context I think, and
> the PAGE_KERNEL_ prefix is actively harmful I think and is at the root of the
> confusion.
>
> So if renamed it and used this nomenclature consistently instead:
>
>   PAGE_USER_
>   PAGE_SYSTEM_

Like _PAGE_USER_VSYSCALL?

Anyway, that's not the confusion I'm talking about.  I'm talking about
_KERNPG_TABLE vs _PAGE_TABLE.  The latter should be called
_USERPG_TABLE, and a whole bunch of its users should be switched to
_KERNPG_TABLE.
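
For anyone following along: the two masks differ only in _PAGE_USER.
Roughly (simplified from arch/x86/include/asm/pgtable_types.h of this
era, ignoring the SME encryption bit that may also be folded in):

    #define _KERNPG_TABLE  (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
    #define _PAGE_TABLE    (_KERNPG_TABLE | _PAGE_USER)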

But, since PTI is intended for backporting, I think these types of big
cleanups should wait.

>
> ... and got rid of PAGE_KERNEL uses in arch/x86/, then it would be obvious at
> first glance what kind of mapping is established in a particular place - and it
> would stay so in the future as well.
>
> ( There's some interaction with generic MM code which needs the original defines
>   like PAGE_KERNEL[_EXEC], but those generic masks could be defined as aliases, to
>   keep this cleanup within x86 for now. )
>
> Thanks,
>
>         Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH PTI v2 3/6] x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy
  2017-12-11 16:01     ` Andy Lutomirski
@ 2017-12-11 16:24       ` Ingo Molnar
  0 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2017-12-11 16:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra


* Andy Lutomirski <luto@kernel.org> wrote:

> On Mon, Dec 11, 2017 at 5:39 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Andy Lutomirski <luto@kernel.org> wrote:
> >
> >> The kernel is very erratic as to which pagetables have _PAGE_USER
> >> set.  The vsyscall page gets lucky: it seems that all of the
> >> relevant pagetables are among the apparently arbitrary ones that set
> >> _PAGE_USER.  Rather than relying on chance, just explicitly set
> >> _PAGE_USER.
> >>
> >> This will let us clean up pagetable setup to stop setting
> >> _PAGE_USER.  The added code can also be reused by pagetable
> >> isolation to manage the _PAGE_USER bit in the usermode tables.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >> ---
> >>  arch/x86/entry/vsyscall/vsyscall_64.c | 33 ++++++++++++++++++++++++++++++++-
> >>  1 file changed, 32 insertions(+), 1 deletion(-)
> >
> > Btw., would it make sense to clean up all this confusion?
> >
> > In particular a 'KERNEL' pre of post fix is ambiguous in this context I think, and
> > the PAGE_KERNEL_ prefix is actively harmful I think and is at the root of the
> > confusion.
> >
> > So if renamed it and used this nomenclature consistently instead:
> >
> >   PAGE_USER_
> >   PAGE_SYSTEM_
> 
> Like _PAGE_USER_VSYSCALL?
> 
> Anyway, that's not the confusion I'm talking about.  I'm talking about
> _KERNPG_TABLE vs _PAGE_TABLE.  The latter should be called
> _USERPG_TABLE, and a whole bunch of its users should be switched to
> _KERNPG_TABLE.

Yeah.

> But, since PTI is intended for backporting, I think these types of big
> cleanups should wait.

Absolutely.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on
  2017-12-11  6:47 ` [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on Andy Lutomirski
@ 2017-12-11 17:49   ` Dave Hansen
  2017-12-11 18:40     ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Hansen @ 2017-12-11 17:49 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Kirill A. Shutemov

So, before this,

On 12/10/2017 10:47 PM, Andy Lutomirski wrote:
...> +	if (unlikely(ldt)) {
> +		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
> +			if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {
> +				clear_LDT();
> +				return;
> +			}

I'm missing the purpose of the slots.  Are you hoping to use those
eventually for randomization, but just punting on implementing it for now?

> +
> +			set_ldt(ldt_slot_va(ldt->slot), ldt->nr_entries);
> +		} else {
> +			set_ldt(ldt->entries, ldt->nr_entries);
> +		}

This seems like a much better place to point out why the aliasing exists
and what it is doing than the place it is actually commented.

Maybe:

			/*
			 * ldt->entries is not mapped into the user page
			 * tables when page table isolation is enabled.
			 * Point the hardware to the alias we created.
			 */
			set_ldt(ldt_slot_va(ldt->slot), ...
		} else {
			/*
			 * Point the hardware at the normal kernel
			 * mapping when not isolated.
			 */
			set_ldt(ldt->entries, ldt->nr_entries);
		}

>  /*
> - * User space process size. 47bits minus one guard page.  The guard
> - * page is necessary on Intel CPUs: if a SYSCALL instruction is at
> - * the highest possible canonical userspace address, then that
> - * syscall will enter the kernel with a non-canonical return
> - * address, and SYSRET will explode dangerously.  We avoid this
> - * particular problem by preventing anything from being mapped
> - * at the maximum canonical address.
> + * User space process size.  This is the first address outside the user range.
> + * There are a few constraints that determine this:
> + *
> + * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
> + * address, then that syscall will enter the kernel with a
> + * non-canonical return address, and SYSRET will explode dangerously.
> + * We avoid this particular problem by preventing anything executable
> + * from being mapped at the maximum canonical address.
> + *
> + * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
> + * CPUs malfunction if they execute code from the highest canonical page.
> + * They'll speculate right off the end of the canonical space, and
> + * bad things happen.  This is worked around in the same way as the
> + * Intel problem.
> + *
> + * With page table isolation enabled, we map the LDT in ... [stay tuned]
>   */
>  #define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
>  
> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index ae5615b03def..46ad333ed797 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -19,6 +19,7 @@
>  #include <linux/uaccess.h>
>  
>  #include <asm/ldt.h>
> +#include <asm/tlb.h>
>  #include <asm/desc.h>
>  #include <asm/mmu_context.h>
>  #include <asm/syscalls.h>
> @@ -46,13 +47,12 @@ static void refresh_ldt_segments(void)
>  static void flush_ldt(void *__mm)
>  {
>  	struct mm_struct *mm = __mm;
> -	mm_context_t *pc;
>  
>  	if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
>  		return;
>  
> -	pc = &mm->context;
> -	set_ldt(pc->ldt->entries, pc->ldt->nr_entries);
> +	__flush_tlb_all();
> +	load_mm_ldt(mm);

Why the new TLB flush?

Shouldn't this be done in the more obvious place closer to the page
table manipulation?

> @@ -90,9 +90,112 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
>  	}
>  
>  	new_ldt->nr_entries = num_entries;
> +	new_ldt->slot = -1;
>  	return new_ldt;
>  }

This seems a bit silly to do given that 'slot' is an int and this patch
introduces warnings looking for positive values:

	if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {

Seems like a good idea to just have a single warning in there looking
for non-zero, probably covering the PTI and non-PTI cases (at least for
now until the slots get used).

> +/*
> + * If PTI is enabled, this maps the LDT into the kernelmode and
> + * usermode tables for the given mm.
> + *
> + * There is no corresponding unmap function.  Even if the LDT is freed, we
> + * leave the PTEs around until the slot is reused or the mm is destroyed.
> + * This is harmless: the LDT is always in ordinary memory, and no one will
> + * access the freed slot.
> + *
> + * If we wanted to unmap freed LDTs, we'd also need to do a flush to make
> + * it useful, and the flush would slow down modify_ldt().
> + */
> +static int map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
> +{
> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> +	spinlock_t *ptl;
> +	bool is_vmalloc;
> +	bool had_top_level_entry;
> +	pgd_t *pgd;
> +	int i;
> +
> +	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
> +		return 0;
> +
> +	WARN_ON(ldt->slot != -1);

Only allow mapping newly-allocated LDTs?

> +	/*
> +	 * Did we already have the top level entry allocated?  We can't
> +	 * use pgd_none() for this because it doens't do anything on
> +	 * 4-level page table kernels.
> +	 */
> +	pgd = pgd_offset(mm, LDT_BASE_ADDR);
> +	had_top_level_entry = (pgd->pgd != 0);
> +
> +	is_vmalloc = is_vmalloc_addr(ldt->entries);
> +
> +	for (i = 0; i * PAGE_SIZE < ldt->nr_entries * LDT_ENTRY_SIZE; i++) {
> +		unsigned long offset = i << PAGE_SHIFT;
> +		unsigned long va = (unsigned long)ldt_slot_va(slot) + offset;
> +		const void *src = (char *)ldt->entries + offset;
> +		unsigned long pfn = is_vmalloc ? vmalloc_to_pfn(src) :
> +			page_to_pfn(virt_to_page(src));
> +		pte_t pte, *ptep;
> +
> +		ptep = get_locked_pte(mm, va, &ptl);

It's *probably* worth calling out that all the page table allocation
happens in there.  I went looking for it in this patch and it took me a
few minutes to find it.

> +		if (!ptep)
> +			return -ENOMEM;
> +		pte = pfn_pte(pfn, __pgprot(__PAGE_KERNEL & ~_PAGE_GLOBAL));

This ~_PAGE_GLOBAL is for the same reason as all the other KPTI code,
right?  BTW, does this function deserve to be in the LDT code or kpti?

> +			      set_pte_at(mm, va, ptep, pte);
> +		pte_unmap_unlock(ptep, ptl);
> +	}

Might want to fix up the set_pte_at() whitespace damage.

> +	if (mm->context.ldt) {
> +		/*
> +		 * We already had an LDT.  The top-level entry should already
> +		 * have been allocated and synchronized with the usermode
> +		 * tables.
> +		 */
> +		WARN_ON(!had_top_level_entry);
> +		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
> +			WARN_ON(!kernel_to_user_pgdp(pgd)->pgd);
> +	} else {
> +		/*
> +		 * This is the first time we're mapping an LDT for this process.
> +		 * Sync the pgd to the usermode tables.
> +		 */
> +		WARN_ON(had_top_level_entry);
> +		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
> +			WARN_ON(kernel_to_user_pgdp(pgd)->pgd);
> +			set_pgd(kernel_to_user_pgdp(pgd), *pgd);
> +		}
> +	}
> +
> +	flush_tlb_mm_range(mm,
> +			   (unsigned long)ldt_slot_va(slot),
> +			   (unsigned long)ldt_slot_va(slot) + LDT_SLOT_STRIDE,
> +			   0);

Why wait until here to flush?  Isn't this primarily for the case where
set_pte_at() overwrote something?

> +	ldt->slot = slot;
> +
> +	return 0;
> +#else
> +	return -EINVAL;
> +#endif
> +}
> +
> +static void free_ldt_pgtables(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> +	struct mmu_gather tlb;
> +	unsigned long start = LDT_BASE_ADDR;
> +	unsigned long end = start + (1UL << PGDIR_SHIFT);
> +
> +	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
> +		return;
> +
> +	tlb_gather_mmu(&tlb, mm, start, end);
> +	free_pgd_range(&tlb, start, end, start, end);
> +	tlb_finish_mmu(&tlb, start, end);
> +#endif
> +}

Isn't this primarily called at exit()?  Isn't it a bit of a shame we
can't combine this with the other exit()-time TLB flushing?

>  /* After calling this, the LDT is immutable. */
>  static void finalize_ldt_struct(struct ldt_struct *ldt)
>  {
> @@ -134,17 +237,15 @@ int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
>  	int retval = 0;
>  
>  	mutex_init(&mm->context.lock);
> +	mm->context.ldt = NULL;
> +
>  	old_mm = current->mm;
> -	if (!old_mm) {
> -		mm->context.ldt = NULL;
> +	if (!old_mm)
>  		return 0;
> -	}
>  
>  	mutex_lock(&old_mm->context.lock);
> -	if (!old_mm->context.ldt) {
> -		mm->context.ldt = NULL;
> +	if (!old_mm->context.ldt)
>  		goto out_unlock;
> -	}
>  
>  	new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
>  	if (!new_ldt) {
> @@ -155,8 +256,17 @@ int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
>  	memcpy(new_ldt->entries, old_mm->context.ldt->entries,
>  	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
>  	finalize_ldt_struct(new_ldt);
> +	retval = map_ldt_struct(mm, new_ldt, 0);
> +	if (retval)
> +		goto out_free;
>  
>  	mm->context.ldt = new_ldt;
> +	goto out_unlock;
> +
> +out_free:
> +	free_ldt_pgtables(mm);
> +	free_ldt_struct(new_ldt);
> +	return retval;
>  
>  out_unlock:
>  	mutex_unlock(&old_mm->context.lock);
> @@ -174,6 +284,11 @@ void destroy_context_ldt(struct mm_struct *mm)
>  	mm->context.ldt = NULL;
>  }
>  
> +void ldt_arch_exit_mmap(struct mm_struct *mm)
> +{
> +	free_ldt_pgtables(mm);
> +}
> +
>  static int read_ldt(void __user *ptr, unsigned long bytecount)
>  {
>  	struct mm_struct *mm = current->mm;
> @@ -285,6 +400,11 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
>  
>  	new_ldt->entries[ldt_info.entry_number] = ldt;
>  	finalize_ldt_struct(new_ldt);
> +	error = map_ldt_struct(mm, new_ldt, old_ldt ? !old_ldt->slot : 0);
> +	if (error) {
> +		free_ldt_struct(old_ldt);
> +		goto out_unlock;
> +	}

Ahh, and the slots finally get used in the last hunk!  That was a long
wait! :)

OK, so slots can be 0 or 1, and we need that so we *have* an LDT to use
while we're setting up the new one.  Sounds sane, but it was pretty
non-obvious from everything up to this point and it's still pretty hard
to spot with the !old_ldt->slot in there.

Might be worth commenting when slot is defined.
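
Something along these lines, perhaps (a sketch of what the comment could
say, going by how write_ldt() picks the slot):

        /*
         * With PTI, the LDT is aliased into the usermode tables at
         * ldt_slot_va(slot).  Two slots exist so that a replacement LDT
         * can be mapped while the old one is still live: write_ldt()
         * installs the new LDT into whichever slot the current LDT is
         * *not* using.
         */
        int slot;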

Also, from a high level, this does increase the overhead of KPTI in a
non-trivial way, right?  It costs us three more page table pages per
process allocated at fork() and freed at exit() and a new TLB flush.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on
  2017-12-11 17:49   ` Dave Hansen
@ 2017-12-11 18:40     ` Andy Lutomirski
  2017-12-11 19:32       ` Dave Hansen
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2017-12-11 18:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Brian Gerst, David Laight, Kees Cook, Peter Zijlstra,
	Kirill A. Shutemov

On Mon, Dec 11, 2017 at 9:49 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> So, before this,
>
> On 12/10/2017 10:47 PM, Andy Lutomirski wrote:
> ...> +  if (unlikely(ldt)) {
>> +             if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
>> +                     if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {
>> +                             clear_LDT();
>> +                             return;
>> +                     }
>
> I'm missing the purpose of the slots.  Are you hoping to use those
> eventually for randomization, but just punting on implementing it for now?
>
>> +
>> +                     set_ldt(ldt_slot_va(ldt->slot), ldt->nr_entries);
>> +             } else {
>> +                     set_ldt(ldt->entries, ldt->nr_entries);
>> +             }
>
> This seems like a much better place to point out why the aliasing exists
> and what it is doing than the place it is actually commented.
>
> Maybe:
>
>                         /*
>                          * ldt->entries is not mapped into the user page
>                          * tables when page table isolation is enabled.
>                          * Point the hardware to the alias we created.
>                          */
>                         set_ldt(ldt_slot_va(ldt->slot), ...
>                 } else {
>                         /*
>                          * Point the hardware at the normal kernel
>                          * mapping when not isolated.
>                          */
>                         set_ldt(ldt->entries, ldt->nr_entries);
>                 }
>

Good call.

>>  /*
>> - * User space process size. 47bits minus one guard page.  The guard
>> - * page is necessary on Intel CPUs: if a SYSCALL instruction is at
>> - * the highest possible canonical userspace address, then that
>> - * syscall will enter the kernel with a non-canonical return
>> - * address, and SYSRET will explode dangerously.  We avoid this
>> - * particular problem by preventing anything from being mapped
>> - * at the maximum canonical address.
>> + * User space process size.  This is the first address outside the user range.
>> + * There are a few constraints that determine this:
>> + *
>> + * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
>> + * address, then that syscall will enter the kernel with a
>> + * non-canonical return address, and SYSRET will explode dangerously.
>> + * We avoid this particular problem by preventing anything executable
>> + * from being mapped at the maximum canonical address.
>> + *
>> + * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
>> + * CPUs malfunction if they execute code from the highest canonical page.
>> + * They'll speculate right off the end of the canonical space, and
>> + * bad things happen.  This is worked around in the same way as the
>> + * Intel problem.
>> + *
>> + * With page table isolation enabled, we map the LDT in ... [stay tuned]
>>   */
>>  #define TASK_SIZE_MAX        ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
>>
>> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
>> index ae5615b03def..46ad333ed797 100644
>> --- a/arch/x86/kernel/ldt.c
>> +++ b/arch/x86/kernel/ldt.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/uaccess.h>
>>
>>  #include <asm/ldt.h>
>> +#include <asm/tlb.h>
>>  #include <asm/desc.h>
>>  #include <asm/mmu_context.h>
>>  #include <asm/syscalls.h>
>> @@ -46,13 +47,12 @@ static void refresh_ldt_segments(void)
>>  static void flush_ldt(void *__mm)
>>  {
>>       struct mm_struct *mm = __mm;
>> -     mm_context_t *pc;
>>
>>       if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
>>               return;
>>
>> -     pc = &mm->context;
>> -     set_ldt(pc->ldt->entries, pc->ldt->nr_entries);
>> +     __flush_tlb_all();
>> +     load_mm_ldt(mm);
>
> Why the new TLB flush?

It was an attempt to debug a bug and I forgot to delete it.

>> @@ -90,9 +90,112 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
>>       }
>>
>>       new_ldt->nr_entries = num_entries;
>> +     new_ldt->slot = -1;
>>       return new_ldt;
>>  }
>
> This seems a bit silly to do given that 'slot' is an int and this patch
> introduces warnings looking for positive values:
>
>         if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {
>
> Seems like a good idea to just have a single warning in there looking
> for non-zero, probably covering the PTI and non-PTI cases (at least for
> now until the slots get used).

The idea is to warn if we haven't mapped it yet.
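
In other words, the intended lifecycle is roughly (restating the hunks
above, not new behaviour):

	new_ldt->slot = -1;		/* alloc_ldt_struct(): not mapped yet */
	...
	WARN_ON(ldt->slot != -1);	/* map_ldt_struct(): expects an unmapped LDT */
	...
	WARN_ON_ONCE((unsigned long)ldt->slot > 1);	/* the load path quoted
							   earlier: by now the slot
							   must be 0 or 1; -1 means
							   we never mapped it */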

>
>> +/*
>> + * If PTI is enabled, this maps the LDT into the kernelmode and
>> + * usermode tables for the given mm.
>> + *
>> + * There is no corresponding unmap function.  Even if the LDT is freed, we
>> + * leave the PTEs around until the slot is reused or the mm is destroyed.
>> + * This is harmless: the LDT is always in ordinary memory, and no one will
>> + * access the freed slot.
>> + *
>> + * If we wanted to unmap freed LDTs, we'd also need to do a flush to make
>> + * it useful, and the flush would slow down modify_ldt().
>> + */
>> +static int map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
>> +{
>> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
>> +     spinlock_t *ptl;
>> +     bool is_vmalloc;
>> +     bool had_top_level_entry;
>> +     pgd_t *pgd;
>> +     int i;
>> +
>> +     if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
>> +             return 0;
>> +
>> +     WARN_ON(ldt->slot != -1);
>
> Only allow mapping newly-allocated LDTs?
>
>> +     /*
>> +      * Did we already have the top level entry allocated?  We can't
>> +      * use pgd_none() for this because it doens't do anything on
>> +      * 4-level page table kernels.
>> +      */
>> +     pgd = pgd_offset(mm, LDT_BASE_ADDR);
>> +     had_top_level_entry = (pgd->pgd != 0);
>> +
>> +     is_vmalloc = is_vmalloc_addr(ldt->entries);
>> +
>> +     for (i = 0; i * PAGE_SIZE < ldt->nr_entries * LDT_ENTRY_SIZE; i++) {
>> +             unsigned long offset = i << PAGE_SHIFT;
>> +             unsigned long va = (unsigned long)ldt_slot_va(slot) + offset;
>> +             const void *src = (char *)ldt->entries + offset;
>> +             unsigned long pfn = is_vmalloc ? vmalloc_to_pfn(src) :
>> +                     page_to_pfn(virt_to_page(src));
>> +             pte_t pte, *ptep;
>> +
>> +             ptep = get_locked_pte(mm, va, &ptl);
>
> It's *probably* worth calling out that all the page table allocation
> happens in there.  I went looking for it in this patch and it took me a
> few minutes to find it.
>
>> +             if (!ptep)
>> +                     return -ENOMEM;
>> +             pte = pfn_pte(pfn, __pgprot(__PAGE_KERNEL & ~_PAGE_GLOBAL));
>
> This ~_PAGE_GLOBAL is for the same reason as all the other KPTI code,
> right?  BTW, does this function deserve to be in the LDT code or kpti?
>
>> +                           set_pte_at(mm, va, ptep, pte);
>> +             pte_unmap_unlock(ptep, ptl);
>> +     }
>
> Might want to fix up the set_pte_at() whitespace damage.

It's not damaged -- it lines up nicely if you look at the file instead
of the patch :)

>
>> +     if (mm->context.ldt) {
>> +             /*
>> +              * We already had an LDT.  The top-level entry should already
>> +              * have been allocated and synchronized with the usermode
>> +              * tables.
>> +              */
>> +             WARN_ON(!had_top_level_entry);
>> +             if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
>> +                     WARN_ON(!kernel_to_user_pgdp(pgd)->pgd);
>> +     } else {
>> +             /*
>> +              * This is the first time we're mapping an LDT for this process.
>> +              * Sync the pgd to the usermode tables.
>> +              */
>> +             WARN_ON(had_top_level_entry);
>> +             if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
>> +                     WARN_ON(kernel_to_user_pgdp(pgd)->pgd);
>> +                     set_pgd(kernel_to_user_pgdp(pgd), *pgd);
>> +             }
>> +     }
>> +
>> +     flush_tlb_mm_range(mm,
>> +                        (unsigned long)ldt_slot_va(slot),
>> +                        (unsigned long)ldt_slot_va(slot) + LDT_SLOT_STRIDE,
>> +                        0);
>
> Why wait until here to flush?  Isn't this primarily for the case where
> set_pte_at() overwrote something?

I think it would be okay if we did it sooner, but CPUs are allowed to
cache intermediate mappings, and we're changing the userspace tables
above it.
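
In other words, the ordering being relied on is roughly (a sketch of the
sequence in the hunk above, nothing new):

	set_pte_at(mm, va, ptep, pte);			/* fill the slot's PTEs */
	set_pgd(kernel_to_user_pgdp(pgd), *pgd);	/* first LDT only: sync the
							   pgd into the user tables */
	flush_tlb_mm_range(mm, (unsigned long)ldt_slot_va(slot),
			   (unsigned long)ldt_slot_va(slot) + LDT_SLOT_STRIDE, 0);

One flush at the end covers both the new leaf PTEs and any cached
intermediate entries for the user tables, which is why doing it earlier
wouldn't buy anything.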

>> +
>> +static void free_ldt_pgtables(struct mm_struct *mm)
>> +{
>> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
>> +     struct mmu_gather tlb;
>> +     unsigned long start = LDT_BASE_ADDR;
>> +     unsigned long end = start + (1UL << PGDIR_SHIFT);
>> +
>> +     if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
>> +             return;
>> +
>> +     tlb_gather_mmu(&tlb, mm, start, end);
>> +     free_pgd_range(&tlb, start, end, start, end);
>> +     tlb_finish_mmu(&tlb, start, end);
>> +#endif
>> +}
>
> Isn't this primarily called at exit()?  Isn't it a bit of a shame we
> can't combine this with the other exit()-time TLB flushing?

Yes.  In fact, we don't really need that flush at all since the mm is
totally dead.  But the free_pgd_range() API is too dumb.  And yes, we
have the same issue in the normal mm/memory.c code.  In general, exit
handling is seriously overengineered.

> Also, from a high level, this does increase the overhead of KPTI in a
> non-trivial way, right?  It costs us three more page table pages per
> process allocated at fork() and freed at exit() and a new TLB flush.

Yeah, but no one will care.  modify_ldt() is used for DOSEMU, Wine,
and really old 32-bit programs.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on
  2017-12-11 18:40     ` Andy Lutomirski
@ 2017-12-11 19:32       ` Dave Hansen
  2017-12-11 19:39         ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Hansen @ 2017-12-11 19:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Kirill A. Shutemov

On 12/11/2017 10:40 AM, Andy Lutomirski wrote:
>> Also, from a high level, this does increase the overhead of KPTI in a
>> non-trivial way, right?  It costs us three more page table pages per
>> process allocated at fork() and freed at exit() and a new TLB flush.
> Yeah, but no one will care.  modify_ldt() is used for DOSEMU, Wine,
> and really old 32-bit programs.

The heavyweight part of map_ldt_struct() (and unmap) looks to run
whenever we have KPTI enabled.  I'm missing how it gets avoided for the
non-DOSEMU cases.

I thought there would be a "fast path" where we just use the normal
clear_LDT() LDT from the cpu_entry_area and don't have to do any of
this, but I'm missing where that happens.  Do we need a check in
(un)map_ldt_struct() for !mm->context.ldt?

Just to make sure I understand this: We now have two places that LDTs
live in virtual space:

1. The "plain" one that we get from clear_LDT() which lives in the
   cpu_entry_area.  (No additional overhead when doing this)
2. The new one under the special PGD that's only used for modify_ldt()
   and is fairly slow.  (plenty of overhead, but nobody cares).

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on
  2017-12-11 19:32       ` Dave Hansen
@ 2017-12-11 19:39         ` Andy Lutomirski
  2017-12-11 19:47           ` Dave Hansen
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2017-12-11 19:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Brian Gerst, David Laight, Kees Cook, Peter Zijlstra,
	Kirill A. Shutemov

On Mon, Dec 11, 2017 at 11:32 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 12/11/2017 10:40 AM, Andy Lutomirski wrote:
>>> Also, from a high level, this does increase the overhead of KPTI in a
>>> non-trivial way, right?  It costs us three more page table pages per
>>> process allocated at fork() and freed at exit() and a new TLB flush.
>> Yeah, but no one will care.  modify_ldt() is used for DOSEMU, Wine,
>> and really old 32-bit programs.
>
> The heavyweight part of map_ldt_struct() (and unmap) looks to run
> whenever we have KPTI enabled.  I'm missing how it gets avoided for the
> non-DOSEMU cases.

It doesn't get called unless modify_ldt() is used.
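
To spell out the paths (as I read this series -- a sketch, not a complete
call graph):

	modify_ldt(2)
	  -> write_ldt()
	       -> map_ldt_struct(mm, new_ldt, old_ldt ? !old_ldt->slot : 0);

	fork()
	  -> init_new_context_ldt()
	       -> map_ldt_struct(mm, new_ldt, 0);	/* only if the parent
							   already had an LDT */

Everything else leaves mm->context.ldt NULL and just hits clear_LDT().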

>
> I thought there would be a "fast path" where we just use the normal
> clear_LDT() LDT from the cpu_entry_area and don't have to do any of
> this, but I'm missing where that happens.  Do we need a check in
> (un)map_ldt_struct() for !mm->context.ldt?

I'm confused.

if (unlikely(ldt)) {
  do something slowish;
} else {
  clear_LDT();
}

>
> Just to make sure I understand this: We now have two places that LDTs
> live in virtual space:
>
> 1. The "plain" one that we get from clear_LDT() which lives in the
>    cpu_entry_area.  (No additional overhead when doing this)
> 2. The new one under the special PGD that's only used for modify_ldt()
>    and is fairly slow.  (plenty of overhead, but nobody cares).

Yes.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on
  2017-12-11 19:39         ` Andy Lutomirski
@ 2017-12-11 19:47           ` Dave Hansen
  2017-12-11 20:06             ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Hansen @ 2017-12-11 19:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Kirill A. Shutemov

On 12/11/2017 11:39 AM, Andy Lutomirski wrote:
>> I thought there would be a "fast path" where we just use the normal
>> clear_LDT() LDT from the cpu_entry_area and don't have to do any of
>> this, but I'm missing where that happens.  Do we need a check in
>> (un)map_ldt_struct() for !mm->context.ldt?
> I'm confused.
> 
> if (unlikely(ldt)) {
>   do something slowish;
> } else {
>   clear_LDT();
> }

I was looking at the map/unmap paths.  It looks to me like, in the cases
where there is map/unmap overhead, we *are* checking against
mm->context.ldt.  It just wasn't visible from the patch context.

In any case, it would be really nice to call that out if you revise
these in the patch description: none of these LDT acrobatics are used in
the common case.  Virtually every process uses the !ldt paths which
don't do any of this.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on
  2017-12-11 19:47           ` Dave Hansen
@ 2017-12-11 20:06             ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2017-12-11 20:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Brian Gerst, David Laight, Kees Cook, Peter Zijlstra,
	Kirill A. Shutemov

On Mon, Dec 11, 2017 at 11:47 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 12/11/2017 11:39 AM, Andy Lutomirski wrote:
>>> I thought there would be a "fast path" where we just use the normal
>>> clear_LDT() LDT from the cpu_entry_area and don't have to do any of
>>> this, but I'm missing where that happens.  Do we need a check in
>>> (un)map_ldt_struct() for !mm->context.ldt?
>> I'm confused.
>>
>> if (unlikely(ldt)) {
>>   do something slowish;
>> } else {
>>   clear_LDT();
>> }
>
> I was looking at the map/unmap paths.  It looks to me like, in the cases
> where there is map/unmap overhead, we *are* checking against
> mm->context.ldt.  It just wasn't visible from the patch context.
>
> In any case, it would be really nice to call that out if you revise
> these in the patch description: none of these LDT acrobatics are used in
> the common case.  Virtually every process uses the !ldt paths which
> don't do any of this.

Will do.

I'm currently fighting with the 5 level case.  I need to reorganize
the memory map a bit, but it's blowing up, and I'm not sure why yet.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT
  2017-12-11  6:54 ` [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
@ 2017-12-12 16:01   ` Kirill A. Shutemov
  0 siblings, 0 replies; 18+ messages in thread
From: Kirill A. Shutemov @ 2017-12-12 16:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra

On Sun, Dec 10, 2017 at 10:54:38PM -0800, Andy Lutomirski wrote:
> On Sun, Dec 10, 2017 at 10:47 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > I'm getting reasonably happy with this.  It still needs more testing,
> > but I want to get it out there.
> >
> > The main things that need testing are the 5-level case for the both
> > vsyscalls and the LDT.  I'm getting a double-fault in ldt_gdt_64,
> > but I haven't checked whether it's a bug in this series, and it
> > kind of looks like it isn't.  I'll figure it out in the morning.
> > The docs also want updating for the 5 level case.
> >
> 
> The actual failure looks a lot like the ESPFIX stacks aren't mapped in
> the usermode tables, which reinforces my old belief that this bit of
> code is bogus:
> 
>     /*
>      * Just copy the top-level PGD that is mapping the espfix area to
>      * ensure it is mapped into the user page tables.
>      *
>      * For 5-level paging, the espfix pgd was populated when
>      * pti_init() pre-populated all the pgd entries.  The above
>      * p4d_alloc() would never do anything and the p4d_populate() would
>      * be done to a p4d already mapped in the userspace pgd.
>      */
> #ifdef CONFIG_PAGE_TABLE_ISOLATION
>     if (CONFIG_PGTABLE_LEVELS <= 4) {
>         set_pgd(kernel_to_user_pgdp(pgd),
>             __pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));
>     }
> #endif
> 
> Of course, the comment is even more wrong with this series applied,
> but I think it's been wrong all along.
> 
> I'll fix it in the morning if no one beats me to it.
> 
> (Hint: this can be tested in QEMU with -machine accel=tcg -cpu qemu64,+la57)

I thought something like the patch below would help (for both paging modes).
It indeed made sigreturn_64 run a bit longer in 5-level paging mode, but it
still double faults in the end.

Maybe it's another bug. I don't know. I don't have time right now to look into it.

BTW, sigreturn_64 reports FAILS in 4-level paging mode.

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 1c44e72ed1bc..0725c04b2d3f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -129,20 +129,11 @@ void __init init_espfix_bsp(void)
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
-	/*
-	 * Just copy the top-level PGD that is mapping the espfix area to
-	 * ensure it is mapped into the user page tables.
-	 *
-	 * For 5-level paging, the espfix pgd was populated when
-	 * pti_init() pre-populated all the pgd entries.  The above
-	 * p4d_alloc() would never do anything and the p4d_populate() would
-	 * be done to a p4d already mapped in the userspace pgd.
-	 */
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
-	if (CONFIG_PGTABLE_LEVELS <= 4) {
-		set_pgd(kernel_to_user_pgdp(pgd),
-			__pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));
-	}
+	/* Install the espfix pud into user space page tables too */
+	pgd = kernel_to_user_pgdp(pgd);
+	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
+	p4d_populate(&init_mm, p4d, espfix_pud_page);
 #endif
 
 	/* Randomize the locations */
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-12-12 16:01 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-11  6:47 [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
2017-12-11  6:47 ` [PATCH PTI v2 1/6] x86/pti: Vastly simplify pgd synchronization Andy Lutomirski
2017-12-11  6:47 ` [PATCH PTI v2 2/6] Revert "x86/mm/pti: Disable native VSYSCALL" Andy Lutomirski
2017-12-11  6:47 ` [PATCH PTI v2 3/6] x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy Andy Lutomirski
2017-12-11 13:39   ` Ingo Molnar
2017-12-11 16:01     ` Andy Lutomirski
2017-12-11 16:24       ` Ingo Molnar
2017-12-11  6:47 ` [PATCH PTI v2 4/6] x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode Andy Lutomirski
2017-12-11  6:47 ` [PATCH PTI v2 5/6] x86/pti: Map the vsyscall page if needed Andy Lutomirski
2017-12-11  6:47 ` [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on Andy Lutomirski
2017-12-11 17:49   ` Dave Hansen
2017-12-11 18:40     ` Andy Lutomirski
2017-12-11 19:32       ` Dave Hansen
2017-12-11 19:39         ` Andy Lutomirski
2017-12-11 19:47           ` Dave Hansen
2017-12-11 20:06             ` Andy Lutomirski
2017-12-11  6:54 ` [PATCH PTI v2 0/6] Clean up pgd handling and fix VSYSCALL and LDT Andy Lutomirski
2017-12-12 16:01   ` Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).