* [RFC 00/10] Process-local memory allocations for hiding KVM secrets
@ 2019-06-12 17:08 Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 01/10] x86/mm/kaslr: refactor to use enum indices for regions Marius Hillenbrand
                   ` (11 more replies)
  0 siblings, 12 replies; 44+ messages in thread
From: Marius Hillenbrand @ 2019-06-12 17:08 UTC (permalink / raw)
  To: kvm
  Cc: Marius Hillenbrand, linux-kernel, kernel-hardening, linux-mm,
	Alexander Graf, David Woodhouse

The Linux kernel has a global address space that is the same for any
kernel code. This address space becomes a liability in a world with
processor information leak vulnerabilities, such as L1TF. With the right
cache load gadget, an attacker-controlled hyperthread pair can leak
arbitrary data via L1TF. Disabling hyperthreading is one recommended
mitigation, but it comes with a large performance hit for a wide range
of workloads.

An alternative mitigation is to not make certain data in the kernel
globally visible, but to expose it only while the kernel executes in the
context of the process to which that data belongs.

This patch series proposes to introduce a region for what we call
process-local memory into the kernel's virtual address space. Page
tables and mappings in that region will be exclusive to one address
space, instead of implicitly shared between all kernel address spaces.
Any data placed in that region will be out of reach of cache load
gadgets that execute in different address spaces. To implement
process-local memory, we introduce a new interface kmalloc_proclocal() /
kfree_proclocal() that allocates and maps pages exclusively into the
current kernel address space. As a first use case, we move architectural
state of guest CPUs in KVM out of reach of other kernel address spaces.
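
To illustrate the proposed interface, here is a minimal sketch of how a
subsystem could use it (the structure and helper names are made up for
this example; the real user is the KVM integration later in the series):

#include <linux/err.h>
#include <linux/proclocal.h>
#include <linux/slab.h>
#include <linux/types.h>

/* hypothetical per-process secret, for illustration only */
struct guest_secrets {
	u8 key_material[64];
};

static struct guest_secrets *secrets_alloc(void)
{
	/*
	 * The backing pages are mapped only into the current process's
	 * kernel address space and removed from the shared direct map.
	 */
	struct guest_secrets *s = kzalloc_proclocal(sizeof(*s));

	return IS_ERR_OR_NULL(s) ? NULL : s;
}

static void secrets_free(struct guest_secrets *s)
{
	/* must run in the context of the process that owns the allocation */
	kfree_proclocal(s);
}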

The patch set is a prototype for x86-64 that we have developed on top of
kernel 4.20.17 (with cherry-picked commit d253ca0c3865 "x86/mm/cpa: Add
set_direct_map_*() functions"). I am aware that the integration with KVM
will see some changes while rebasing to 5.x. Patches 7 and 8, in
particular, mainly serve to make patch 9 more readable and will be dropped
when rebasing. We have tested the code on both Intel and AMD hardware,
launching VMs in a loop. So far, we have not done an in-depth performance
evaluation; the impact on starting VMs was within measurement noise.

---

Julian Stecklina (2):
  kvm, vmx: move CR2 context switch out of assembly path
  kvm, vmx: move register clearing out of assembly path

Marius Hillenbrand (8):
  x86/mm/kaslr: refactor to use enum indices for regions
  x86/speculation, mm: add process local virtual memory region
  x86/mm, mm,kernel: add teardown for process-local memory to mm cleanup
  mm: allocate virtual space for process-local memory
  mm: allocate/release physical pages for process-local memory
  kvm/x86: add support for storing vCPU state in process-local memory
  kvm, vmx: move gprs to process local memory
  kvm, x86: move guest FPU state into process local memory

 Documentation/x86/x86_64/mm.txt         |  11 +-
 arch/x86/Kconfig                        |   1 +
 arch/x86/include/asm/kvm_host.h         |  40 ++-
 arch/x86/include/asm/page_64.h          |   4 +
 arch/x86/include/asm/pgtable_64_types.h |  12 +
 arch/x86/include/asm/proclocal.h        |  11 +
 arch/x86/kernel/head64.c                |   8 +
 arch/x86/kvm/Kconfig                    |  10 +
 arch/x86/kvm/kvm_cache_regs.h           |   4 +-
 arch/x86/kvm/svm.c                      | 104 +++++--
 arch/x86/kvm/vmx.c                      | 213 ++++++++++-----
 arch/x86/kvm/x86.c                      |  31 ++-
 arch/x86/mm/Makefile                    |   1 +
 arch/x86/mm/dump_pagetables.c           |   9 +
 arch/x86/mm/fault.c                     |  19 ++
 arch/x86/mm/kaslr.c                     |  63 ++++-
 arch/x86/mm/proclocal.c                 | 136 +++++++++
 include/linux/mm_types.h                |  13 +
 include/linux/proclocal.h               |  35 +++
 kernel/fork.c                           |   6 +
 mm/Makefile                             |   1 +
 mm/proclocal.c                          | 348 ++++++++++++++++++++++++
 security/Kconfig                        |  18 ++
 23 files changed, 978 insertions(+), 120 deletions(-)
 create mode 100644 arch/x86/include/asm/proclocal.h
 create mode 100644 arch/x86/mm/proclocal.c
 create mode 100644 include/linux/proclocal.h
 create mode 100644 mm/proclocal.c

-- 
2.21.0


* [RFC 01/10] x86/mm/kaslr: refactor to use enum indices for regions
  2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
@ 2019-06-12 17:08 ` Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 02/10] x86/speculation, mm: add process local virtual memory region Marius Hillenbrand
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Marius Hillenbrand @ 2019-06-12 17:08 UTC (permalink / raw)
  To: kvm
  Cc: Marius Hillenbrand, linux-kernel, kernel-hardening, linux-mm,
	Alexander Graf, David Woodhouse

The KASLR randomization code currently refers to specific regions, such
as the vmalloc area, by literal indices into an array. When adding new
regions, we have to be careful to update every index that shifts as a
result. Avoid that risk by introducing an enum whose values serve as the
indices.

Signed-off-by: Marius Hillenbrand <mhillenb@amazon.de>
Cc: Alexander Graf <graf@amazon.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/mm/kaslr.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 3f452ffed7e9..c455f1ffba29 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -41,6 +41,12 @@
  */
 static const unsigned long vaddr_end = CPU_ENTRY_AREA_BASE;
 
+enum {
+	PHYSMAP,
+	VMALLOC,
+	VMMEMMAP,
+};
+
 /*
  * Memory regions randomized by KASLR (except modules that use a separate logic
  * earlier during boot). The list is ordered based on virtual addresses. This
@@ -50,9 +56,9 @@ static __initdata struct kaslr_memory_region {
 	unsigned long *base;
 	unsigned long size_tb;
 } kaslr_regions[] = {
-	{ &page_offset_base, 0 },
-	{ &vmalloc_base, 0 },
-	{ &vmemmap_base, 1 },
+	[PHYSMAP] = { &page_offset_base, 0 },
+	[VMALLOC] = { &vmalloc_base, 0 },
+	[VMMEMMAP] = { &vmemmap_base, 1 },
 };
 
 /* Get size in bytes used by the memory region */
@@ -94,20 +100,20 @@ void __init kernel_randomize_memory(void)
 	if (!kaslr_memory_enabled())
 		return;
 
-	kaslr_regions[0].size_tb = 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT);
-	kaslr_regions[1].size_tb = VMALLOC_SIZE_TB;
+	kaslr_regions[PHYSMAP].size_tb = 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT);
+	kaslr_regions[VMALLOC].size_tb = VMALLOC_SIZE_TB;
 
 	/*
 	 * Update Physical memory mapping to available and
 	 * add padding if needed (especially for memory hotplug support).
 	 */
-	BUG_ON(kaslr_regions[0].base != &page_offset_base);
+	BUG_ON(kaslr_regions[PHYSMAP].base != &page_offset_base);
 	memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
 		CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
 
 	/* Adapt phyiscal memory region size based on available memory */
-	if (memory_tb < kaslr_regions[0].size_tb)
-		kaslr_regions[0].size_tb = memory_tb;
+	if (memory_tb < kaslr_regions[PHYSMAP].size_tb)
+		kaslr_regions[PHYSMAP].size_tb = memory_tb;
 
 	/* Calculate entropy available between regions */
 	remain_entropy = vaddr_end - vaddr_start;
-- 
2.21.0


* [RFC 02/10] x86/speculation, mm: add process local virtual memory region
  2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 01/10] x86/mm/kaslr: refactor to use enum indices for regions Marius Hillenbrand
@ 2019-06-12 17:08 ` Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 03/10] x86/mm, mm,kernel: add teardown for process-local memory to mm cleanup Marius Hillenbrand
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Marius Hillenbrand @ 2019-06-12 17:08 UTC (permalink / raw)
  To: kvm
  Cc: Marius Hillenbrand, linux-kernel, kernel-hardening, linux-mm,
	Alexander Graf, David Woodhouse, Julian Stecklina

The Linux kernel has a global address space that is the same for any
kernel code. This address space becomes a liability in a world with
processor information leak vulnerabilities, such as L1TF. With the right
cache load gadget, an attacker-controlled hyperthread pair can leak
arbitrary data via L1TF. Disabling hyperthreading is one recommended
mitigation, but it comes with a large performance hit for a wide range
of workloads.

An alternative mitigation is to not make certain data in the kernel
globally visible, but to expose it only while the kernel executes in the
context of the process to which that data belongs.

This patch introduces a region for process-local memory into the
kernel's virtual address space. It has a length of 64 GiB (to give more
than enough space while leaving enough room for KASLR) and will always
occupy a pgd entry that is exclusive to process-local mappings (other
pgd entries may point to page tables that are shared by all kernel
address spaces).
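
As a sanity check on the sizing: with 4-level page tables, one pgd entry
spans 512 GiB, so reserving two pgd entries (1 TiB) for randomization and
then shifting the base, as adjust_proclocal_base() does below, always
leaves the whole 64 GiB region inside a single exclusive pgd entry. The
following stand-alone user-space sketch (not kernel code; all names are
local to the sketch) checks that arithmetic exhaustively:

#include <assert.h>
#include <stdio.h>

#define GiB		(1UL << 30)
#define PGDIR_SIZE	(512 * GiB)	/* one pgd entry with 4-level paging */
#define PROCLOCAL_SIZE	(64 * GiB)
#define REGION_SIZE	(2 * PGDIR_SIZE)	/* reserved for randomization */

/* mirrors the base adjustment done by adjust_proclocal_base() */
static unsigned long adjust_base(unsigned long base)
{
	base += REGION_SIZE / 2;
	if ((base % PGDIR_SIZE) > (PGDIR_SIZE - PROCLOCAL_SIZE))
		base -= PROCLOCAL_SIZE;
	return base;
}

int main(void)
{
	unsigned long off;

	/* every page-aligned offset a randomized base may have within a pgd */
	for (off = 0; off < PGDIR_SIZE; off += 4096) {
		unsigned long base = adjust_base(off);

		/* the whole region must fall into one pgd entry */
		assert((base % PGDIR_SIZE) + PROCLOCAL_SIZE <= PGDIR_SIZE);
	}
	printf("the 64 GiB region always fits within one pgd entry\n");
	return 0;
}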

Signed-off-by: Marius Hillenbrand <mhillenb@amazon.de>
Inspired-by: Julian Stecklina <js@alien8.de> (while jsteckli@amazon.de)
Cc: Alexander Graf <graf@amazon.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
---
 Documentation/x86/x86_64/mm.txt         | 11 +++++--
 arch/x86/Kconfig                        |  1 +
 arch/x86/include/asm/page_64.h          |  4 +++
 arch/x86/include/asm/pgtable_64_types.h | 12 ++++++++
 arch/x86/kernel/head64.c                |  8 +++++
 arch/x86/mm/dump_pagetables.c           |  9 ++++++
 arch/x86/mm/fault.c                     | 19 ++++++++++++
 arch/x86/mm/kaslr.c                     | 41 +++++++++++++++++++++++++
 security/Kconfig                        | 18 +++++++++++
 9 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 804f9426ed17..476519759cdc 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -40,7 +40,10 @@ ____________________________________________________________|___________________
  ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
  ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
  ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
- ffffeb0000000000 |  -21    TB | ffffebffffffffff |    1 TB | ... unused hole
+ ffffeb0000000000 |  -21    TB | ffffeb7fffffffff |  512 GB | ... unused hole
+ ffffeb8000000000 |  -20.5  TB | ffffebffffffffff |  512 GB | process-local kernel memory (layout shared but mappings
+                  |            |                  |         | exclusive to processes, needs an exclusive entry in the
+                  |            |                  |         | top-level page table)
  ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
 __________________|____________|__________________|_________|____________________________________________________________
                                                             |
@@ -98,7 +101,11 @@ ____________________________________________________________|___________________
  ffa0000000000000 |  -24    PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
  ffd2000000000000 |  -11.5  PB | ffd3ffffffffffff |  0.5 PB | ... unused hole
  ffd4000000000000 |  -11    PB | ffd5ffffffffffff |  0.5 PB | virtual memory map (vmemmap_base)
- ffd6000000000000 |  -10.5  PB | ffdeffffffffffff | 2.25 PB | ... unused hole
+ ffd6000000000000 |  -10.5  PB | ffd7ffffffffffff |  0.5 PB | ... unused hole
+ ffd8000000000000 |  -10    PB | ffd8ffffffffffff |  256 TB | process-local kernel memory (layout shared but mappings
+                  |            |                  |         | exclusive to processes, needs an exclusive entry in the
+                  |            |                  |         | top-level page table)
+ ffd9000000000000 |   -9.75 PB | ffdeffffffffffff |  1.5 PB | ... unused hole
  ffdf000000000000 |   -8.25 PB | fffffdffffffffff |   ~8 PB | KASAN shadow memory
 __________________|____________|__________________|_________|____________________________________________________________
                                                             |
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3b8cc39ae52d..9924d542d44a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -32,6 +32,7 @@ config X86_64
 	select SWIOTLB
 	select X86_DEV_DMA_OPS
 	select ARCH_HAS_SYSCALL_WRAPPER
+	select ARCH_SUPPORTS_PROCLOCAL
 
 #
 # Arch settings
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 939b1cff4a7b..e6f0d76de849 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -15,6 +15,10 @@ extern unsigned long page_offset_base;
 extern unsigned long vmalloc_base;
 extern unsigned long vmemmap_base;
 
+#ifdef CONFIG_PROCLOCAL
+extern unsigned long proclocal_base;
+#endif
+
 static inline unsigned long __phys_addr_nodebug(unsigned long x)
 {
 	unsigned long y = x - __START_KERNEL_map;
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 14cd41b989d6..cb1b789a55c2 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -141,6 +141,18 @@ extern unsigned int ptrs_per_p4d;
 
 #define VMALLOC_END		(VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1)
 
+#ifdef CONFIG_PROCLOCAL
+# define __PROCLOCAL_BASE_L4	0xffffeb8000000000UL
+# define __PROCLOCAL_BASE_L5	0xffd8000000000000UL
+# define PROCLOCAL_SIZE	(64UL * 1024 * 1024 * 1024)
+
+# ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
+#  define PROCLOCAL_START	proclocal_base
+# else /* CONFIG_DYNAMIC_MEMORY_LAYOUT */
+#  define PROCLOCAL_START	__PROCLOCAL_BASE_L4
+# endif /* CONFIG_DYNAMIC_MEMORY_LAYOUT */
+#endif /* CONFIG_PROCLOCAL */
+
 #define MODULES_VADDR		(__START_KERNEL_map + KERNEL_IMAGE_SIZE)
 /* The module sections ends with the start of the fixmap */
 #define MODULES_END		_AC(0xfffffffff4000000, UL)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 509de5a2a122..490b5255aad3 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -59,6 +59,10 @@ unsigned long vmalloc_base __ro_after_init = __VMALLOC_BASE_L4;
 EXPORT_SYMBOL(vmalloc_base);
 unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
 EXPORT_SYMBOL(vmemmap_base);
+#ifdef CONFIG_PROCLOCAL
+unsigned long proclocal_base __ro_after_init = __PROCLOCAL_BASE_L4;
+EXPORT_SYMBOL(proclocal_base);
+#endif
 #endif
 
 #define __head	__section(.head.text)
@@ -94,6 +98,10 @@ static bool __head check_la57_support(unsigned long physaddr)
 	*fixup_long(&page_offset_base, physaddr) = __PAGE_OFFSET_BASE_L5;
 	*fixup_long(&vmalloc_base, physaddr) = __VMALLOC_BASE_L5;
 	*fixup_long(&vmemmap_base, physaddr) = __VMEMMAP_BASE_L5;
+#ifdef CONFIG_PROCLOCAL
+#warning "Process-local memory with 5-level page tables is compile-tested only."
+	*fixup_long(&proclocal_base, physaddr) = __PROCLOCAL_BASE_L5;
+#endif
 
 	return true;
 }
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index abcb8d00b014..88fa2da94cfe 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -61,6 +61,9 @@ enum address_markers_idx {
 	LOW_KERNEL_NR,
 	VMALLOC_START_NR,
 	VMEMMAP_START_NR,
+#ifdef CONFIG_PROCLOCAL
+	PROCLOCAL_START_NR,
+#endif
 #ifdef CONFIG_KASAN
 	KASAN_SHADOW_START_NR,
 	KASAN_SHADOW_END_NR,
@@ -85,6 +88,9 @@ static struct addr_marker address_markers[] = {
 	[LOW_KERNEL_NR]		= { 0UL,		"Low Kernel Mapping" },
 	[VMALLOC_START_NR]	= { 0UL,		"vmalloc() Area" },
 	[VMEMMAP_START_NR]	= { 0UL,		"Vmemmap" },
+#ifdef CONFIG_PROCLOCAL
+	[PROCLOCAL_START_NR]    = { 0UL,                "Process local" },
+#endif
 #ifdef CONFIG_KASAN
 	/*
 	 * These fields get initialized with the (dynamic)
@@ -622,6 +628,9 @@ static int __init pt_dump_init(void)
 	address_markers[KASAN_SHADOW_START_NR].start_address = KASAN_SHADOW_START;
 	address_markers[KASAN_SHADOW_END_NR].start_address = KASAN_SHADOW_END;
 #endif
+#ifdef CONFIG_PROCLOCAL
+	address_markers[PROCLOCAL_START_NR].start_address = PROCLOCAL_START;
+#endif
 #endif
 #ifdef CONFIG_X86_32
 	address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ba51652fbd33..befea89c5d6f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1171,6 +1171,15 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
 	return true;
 }
 
+static int fault_in_process_local(unsigned long address)
+{
+#ifdef CONFIG_PROCLOCAL
+	return address >= PROCLOCAL_START && address < (PROCLOCAL_START + PROCLOCAL_SIZE);
+#else
+	return false;
+#endif
+}
+
 /*
  * Called for all faults where 'address' is part of the kernel address
  * space.  Might get called for faults that originate from *code* that
@@ -1214,6 +1223,16 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	if (spurious_kernel_fault(hw_error_code, address))
 		return;
 
+	/*
+	 * Faults in process-local memory may be caused by process-local
+	 * addresses leaking into other contexts.
+	 * tbd: warn and handle gracefully.
+	 */
+	if (unlikely(fault_in_process_local(address))) {
+		pr_err("page fault in PROCLOCAL at %lx", address);
+		force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current);
+	}
+
 	/* kprobes don't want to hook the spurious faults: */
 	if (kprobes_fault(regs))
 		return;
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index c455f1ffba29..395d8868aeb8 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -45,6 +45,9 @@ enum {
 	PHYSMAP,
 	VMALLOC,
 	VMMEMMAP,
+#ifdef CONFIG_PROCLOCAL
+	PROCLOCAL,
+#endif
 };
 
 /*
@@ -59,6 +62,9 @@ static __initdata struct kaslr_memory_region {
 	[PHYSMAP] = { &page_offset_base, 0 },
 	[VMALLOC] = { &vmalloc_base, 0 },
 	[VMMEMMAP] = { &vmemmap_base, 1 },
+#ifdef CONFIG_PROCLOCAL
+	[PROCLOCAL] = { &proclocal_base, 0 },
+#endif
 };
 
 /* Get size in bytes used by the memory region */
@@ -76,6 +82,26 @@ static inline bool kaslr_memory_enabled(void)
 	return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
 }
 
+#ifdef CONFIG_PROCLOCAL
+/*
+ * The process-local memory area must use an exclusive pgd entry. The area is
+ * allocated as 2x PGDIR_SIZE such that it contains at least one exclusive pgd
+ * entry. Shift the base address into that exclusive pgd. Keep the offset from
+ * randomization but make sure the whole actual process-local memory region fits
+ * into the pgd.
+ */
+static void adjust_proclocal_base(void)
+{
+	unsigned long size_tb = kaslr_regions[PROCLOCAL].size_tb;
+	proclocal_base += ((size_tb << TB_SHIFT) / 2);
+	if ((proclocal_base % PGDIR_SIZE) > (PGDIR_SIZE - PROCLOCAL_SIZE))
+		proclocal_base -= PROCLOCAL_SIZE;
+
+	BUILD_BUG_ON(2 * PROCLOCAL_SIZE >= PGDIR_SIZE);
+	BUG_ON(((proclocal_base % PGDIR_SIZE) + PROCLOCAL_SIZE) > PGDIR_SIZE);
+}
+#endif
+
 /* Initialize base and padding for each memory region randomized with KASLR */
 void __init kernel_randomize_memory(void)
 {
@@ -103,6 +129,17 @@ void __init kernel_randomize_memory(void)
 	kaslr_regions[PHYSMAP].size_tb = 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT);
 	kaslr_regions[VMALLOC].size_tb = VMALLOC_SIZE_TB;
 
+#ifdef CONFIG_PROCLOCAL
+	/*
+	 * Note that the process-local memory area must use a non-overlapping
+	 * pgd. Thus, round up the size to 2 pgd entries and adjust the base
+	 * address into the dedicated pgd below. With 4-level page tables, that
+	 * keeps the size at the minimum of 1 TiB used by the kernel.
+	 */
+	kaslr_regions[PROCLOCAL].size_tb = round_up(round_up(PROCLOCAL_SIZE, 2ULL<<PGDIR_SHIFT),
+						    1ULL<<TB_SHIFT) / (1ULL<<TB_SHIFT);
+#endif
+
 	/*
 	 * Update Physical memory mapping to available and
 	 * add padding if needed (especially for memory hotplug support).
@@ -149,6 +186,10 @@ void __init kernel_randomize_memory(void)
 			vaddr = round_up(vaddr + 1, PUD_SIZE);
 		remain_entropy -= entropy;
 	}
+
+#ifdef CONFIG_PROCLOCAL
+	adjust_proclocal_base();
+#endif
 }
 
 static void __meminit init_trampoline_pud(void)
diff --git a/security/Kconfig b/security/Kconfig
index c7c581bac963..714808cf6604 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -35,6 +35,24 @@ config XPFO_DEBUG
 
 	 If in doubt, say "N".
 
+config ARCH_SUPPORTS_PROCLOCAL
+	bool
+	default n
+
+config PROCLOCAL
+	bool "Support process-local allocations in the kernel"
+	depends on ARCH_SUPPORTS_PROCLOCAL
+	select GENERIC_ALLOCATOR
+	default n
+	help
+	  This feature allows subsystems in the kernel to allocate memory that
+	  is only visible in the context of a specific process. This hardens the
+	  kernel against information leak vulnerabilities.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If you are unsure how to answer this question, answer N.
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
-- 
2.21.0


* [RFC 03/10] x86/mm, mm,kernel: add teardown for process-local memory to mm cleanup
  2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 01/10] x86/mm/kaslr: refactor to use enum indices for regions Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 02/10] x86/speculation, mm: add process local virtual memory region Marius Hillenbrand
@ 2019-06-12 17:08 ` Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 04/10] mm: allocate virtual space for process-local memory Marius Hillenbrand
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Marius Hillenbrand @ 2019-06-12 17:08 UTC (permalink / raw)
  To: kvm
  Cc: Marius Hillenbrand, linux-kernel, kernel-hardening, linux-mm,
	Alexander Graf, David Woodhouse

Process-local memory uses a dedicated pgd entry in kernel space and its
own page table structure. Hook the mm exit path to clean up those
dedicated page tables. As a preparatory step, release any left-over
process-local allocations in the address space.

Signed-off-by: Marius Hillenbrand <mhillenb@amazon.de>
Cc: Alexander Graf <graf@amazon.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/proclocal.h |  11 +++
 arch/x86/mm/Makefile             |   1 +
 arch/x86/mm/proclocal.c          | 131 +++++++++++++++++++++++++++++++
 include/linux/mm_types.h         |   9 +++
 include/linux/proclocal.h        |  16 ++++
 kernel/fork.c                    |   6 ++
 mm/Makefile                      |   1 +
 mm/proclocal.c                   |  35 +++++++++
 8 files changed, 210 insertions(+)
 create mode 100644 arch/x86/include/asm/proclocal.h
 create mode 100644 arch/x86/mm/proclocal.c
 create mode 100644 include/linux/proclocal.h
 create mode 100644 mm/proclocal.c

diff --git a/arch/x86/include/asm/proclocal.h b/arch/x86/include/asm/proclocal.h
new file mode 100644
index 000000000000..a66983e49209
--- /dev/null
+++ b/arch/x86/include/asm/proclocal.h
@@ -0,0 +1,11 @@
+/*
+ * Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+#ifndef _ASM_X86_PROCLOCAL_H
+#define _ASM_X86_PROCLOCAL_H
+
+struct mm_struct;
+
+void arch_proclocal_teardown_pages_and_pt(struct mm_struct *mm);
+
+#endif	/* _ASM_X86_PROCLOCAL_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index e4ffcf74770e..a7c7111eb8f0 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -56,3 +56,4 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
 
 obj-$(CONFIG_XPFO)		+= xpfo.o
 
+obj-$(CONFIG_PROCLOCAL) += proclocal.o
diff --git a/arch/x86/mm/proclocal.c b/arch/x86/mm/proclocal.c
new file mode 100644
index 000000000000..c64a8ea6360d
--- /dev/null
+++ b/arch/x86/mm/proclocal.c
@@ -0,0 +1,131 @@
+/*
+ * Architecture-specific code for handling process-local memory on x86-64.
+ *
+ * Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/proclocal.h>
+#include <linux/set_memory.h>
+#include <asm/tlb.h>
+
+
+static void unmap_leftover_pages_pte(struct mm_struct *mm, pmd_t *pmd,
+				     unsigned long addr, unsigned long end,
+				     struct list_head *page_list)
+{
+	pte_t *pte;
+	struct page *page;
+
+	for (pte = pte_offset_map(pmd, addr);
+	     addr < end; addr += PAGE_SIZE, pte++) {
+		if (!pte_present(*pte))
+			continue;
+
+		page = pte_page(*pte);
+		pte_clear(mm, addr, pte);
+		set_direct_map_default_noflush(page);
+
+		/*
+		 * scrub page contents. since mm teardown happens from a
+		 * different mm, we cannot just use the process-local virtual
+		 * address; access the page via the physmap instead. note that
+		 * there is a small time frame where leftover data is globally
+		 * visible in the kernel address space.
+		 *
+		 * tbd in later commit: scrub the page via a temporary mapping
+		 * in process-local memory area before re-attaching it to the
+		 * physmap.
+		 */
+		memset(page_to_virt(page), 0, PAGE_SIZE);
+
+		/*
+		 * track page for cleanup later;
+		 * note that the proclocal_next list is used only for regular
+		 * kfree_proclocal, so ripping pages out now is fine.
+		 */
+		INIT_LIST_HEAD(&page->proclocal_next);
+		list_add_tail(&page->proclocal_next, page_list);
+	}
+}
+
+/*
+ * Walk through process-local mappings on each page table level. Avoid code
+ * duplication and use a macro to generate one function for each level.
+ *
+ * The macro generates a function for page table level LEVEL. The function is
+ * passed a pointer to the entry in the page table level ABOVE and recurses into
+ * the page table level BELOW.
+ */
+#define UNMAP_LEFTOVER_LEVEL(LEVEL, ABOVE, BELOW) \
+	static void unmap_leftover_pages_ ## LEVEL (struct mm_struct *mm, ABOVE ## _t *ABOVE,	\
+						    unsigned long addr, unsigned long end,	\
+						    struct list_head *page_list)		\
+	{										\
+		LEVEL ## _t *LEVEL  = LEVEL ## _offset(ABOVE, addr);			\
+		unsigned long next;							\
+		do {									\
+			next = LEVEL ## _addr_end(addr, end);				\
+			if (LEVEL ## _present(*LEVEL))					\
+				unmap_leftover_pages_## BELOW (mm, LEVEL, addr, next, page_list); \
+		} while (LEVEL++, addr = next, addr < end);				\
+	}
+
+UNMAP_LEFTOVER_LEVEL(pmd, pud, pte)
+UNMAP_LEFTOVER_LEVEL(pud, p4d, pmd)
+UNMAP_LEFTOVER_LEVEL(p4d, pgd, pud)
+#undef UNMAP_LEFTOVER_LEVEL
+
+extern void proclocal_release_pages(struct list_head *pages);
+
+static void unmap_free_leftover_proclocal_pages(struct mm_struct *mm)
+{
+	LIST_HEAD(page_list);
+	unsigned long addr = PROCLOCAL_START, next;
+	unsigned long end = PROCLOCAL_START + PROCLOCAL_SIZE;
+
+	/*
+	 * Walk page tables in process-local memory area and handle leftover
+	 * process-local pages. Note that we cannot use the kernel's
+	 * walk_page_range, because that function assumes walking across vmas.
+	 */
+	spin_lock(&mm->page_table_lock);
+	do {
+		pgd_t *pgd = pgd_offset(mm, addr);
+		next = pgd_addr_end(addr, end);
+
+		if (pgd_present(*pgd)) {
+			unmap_leftover_pages_p4d(mm, pgd, addr, next, &page_list);
+		}
+		addr = next;
+	} while (addr < end);
+	spin_unlock(&mm->page_table_lock);
+	/*
+	 * Flush any mappings of process-local pages from the TLBs, so that we
+	 * can release the pages afterwards.
+	 */
+	flush_tlb_mm_range(mm, PROCLOCAL_START, end, PAGE_SHIFT, false);
+	proclocal_release_pages(&page_list);
+}
+
+static void arch_proclocal_teardown_pt(struct mm_struct *mm)
+{
+	struct mmu_gather tlb;
+	/*
+	 * clean up page tables for the whole pgd used exclusively by
+	 * process-local memory.
+	 */
+	unsigned long proclocal_base_pgd = PROCLOCAL_START & PGDIR_MASK;
+	unsigned long proclocal_end_pgd = proclocal_base_pgd + PGDIR_SIZE;
+
+	tlb_gather_mmu(&tlb, mm, proclocal_base_pgd, proclocal_end_pgd);
+	free_pgd_range(&tlb, proclocal_base_pgd, proclocal_end_pgd, 0, 0);
+	tlb_finish_mmu(&tlb, proclocal_base_pgd, proclocal_end_pgd);
+}
+
+void arch_proclocal_teardown_pages_and_pt(struct mm_struct *mm)
+{
+	unmap_free_leftover_proclocal_pages(mm);
+	arch_proclocal_teardown_pt(mm);
+}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2fe4dbfcdebd..1cb9243dd299 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -158,6 +158,15 @@ struct page {
 
 		/** @rcu_head: You can use this to free a page by RCU. */
 		struct rcu_head rcu_head;
+
+#ifdef CONFIG_PROCLOCAL
+		struct {	/* PROCLOCAL pages */
+			struct list_head proclocal_next; /* track pages in one allocation */
+			unsigned long _proclocal_pad_1; /* mapping */
+			/* head page of an allocation stores its length */
+			size_t proclocal_nr_pages;
+		};
+#endif
 	};
 
 	union {		/* This union is 4 bytes in size. */
diff --git a/include/linux/proclocal.h b/include/linux/proclocal.h
new file mode 100644
index 000000000000..9dae140c0796
--- /dev/null
+++ b/include/linux/proclocal.h
@@ -0,0 +1,16 @@
+/*
+ * Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+#ifndef _PROCLOCAL_H
+#define _PROCLOCAL_H
+
+#ifdef CONFIG_PROCLOCAL
+
+struct mm_struct;
+
+void proclocal_mm_exit(struct mm_struct *mm);
+#else  /* !CONFIG_PROCLOCAL */
+static inline void proclocal_mm_exit(struct mm_struct *mm) { }
+#endif
+
+#endif /* _PROCLOCAL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 6e37f5626417..caca6b16ee1e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -92,6 +92,7 @@
 #include <linux/livepatch.h>
 #include <linux/thread_info.h>
 #include <linux/stackleak.h>
+#include <linux/proclocal.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1051,6 +1052,11 @@ static inline void __mmput(struct mm_struct *mm)
 	exit_mmap(mm);
 	mm_put_huge_zero_page(mm);
 	set_mm_exe_file(mm, NULL);
+	/*
+	 * No real users of this address space left, dropping process-local
+	 * mappings.
+	 */
+	proclocal_mm_exit(mm);
 	if (!list_empty(&mm->mmlist)) {
 		spin_lock(&mmlist_lock);
 		list_del(&mm->mmlist);
diff --git a/mm/Makefile b/mm/Makefile
index e99e1e6ae5ae..029d7e2ee80b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -100,3 +100,4 @@ obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_XPFO) += xpfo.o
+obj-$(CONFIG_PROCLOCAL) += proclocal.o
diff --git a/mm/proclocal.c b/mm/proclocal.c
new file mode 100644
index 000000000000..72c485c450bf
--- /dev/null
+++ b/mm/proclocal.c
@@ -0,0 +1,35 @@
+/*
+ * mm/proclocal.c
+ *
+ * The code in this file implements process-local mappings in the Linux kernel
+ * address space. This memory is only usable in the process context. With memory
+ * not globally visible in the kernel, it cannot easily be prefetched and leaked
+ * via L1TF.
+ *
+ * Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+#include <linux/mm.h>
+#include <linux/proclocal.h>
+#include <linux/set_memory.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+#include <asm/proclocal.h>
+#include <asm/pgalloc.h>
+#include <asm/tlb.h>
+
+void proclocal_release_pages(struct list_head *pages)
+{
+	struct page *pos, *n;
+	list_for_each_entry_safe(pos, n, pages, proclocal_next) {
+		list_del(&pos->proclocal_next);
+		__free_page(pos);
+	}
+}
+
+void proclocal_mm_exit(struct mm_struct *mm)
+{
+	pr_debug("proclocal_mm_exit for mm %p pgd %p (current is %p)\n", mm, mm->pgd, current);
+
+	arch_proclocal_teardown_pages_and_pt(mm);
+}
-- 
2.21.0


* [RFC 04/10] mm: allocate virtual space for process-local memory
  2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
                   ` (2 preceding siblings ...)
  2019-06-12 17:08 ` [RFC 03/10] x86/mm, mm,kernel: add teardown for process-local memory to mm cleanup Marius Hillenbrand
@ 2019-06-12 17:08 ` Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 05/10] mm: allocate/release physical pages " Marius Hillenbrand
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Marius Hillenbrand @ 2019-06-12 17:08 UTC (permalink / raw)
  To: kvm
  Cc: Marius Hillenbrand, linux-kernel, kernel-hardening, linux-mm,
	Alexander Graf, David Woodhouse

Implement the first half of kmalloc and kfree for process-local memory,
which deals with allocating virtual address ranges in the process-local
memory area.

While process-local mappings will be visible only in a single address
space, the address of each allocation is still unique across processes
to aid in debugging (e.g., to see page faults instead of silent invalid
accesses). For that purpose, use a global allocator for virtual address
ranges of process-local mappings. Note that the single centralized lock
is good enough for our use case.

Signed-off-by: Marius Hillenbrand <mhillenb@amazon.de>
Cc: Alexander Graf <graf@amazon.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/mm/proclocal.c   |   7 +-
 include/linux/mm_types.h  |   4 +
 include/linux/proclocal.h |  19 +++++
 mm/proclocal.c            | 150 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 179 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/proclocal.c b/arch/x86/mm/proclocal.c
index c64a8ea6360d..dd641995cc9f 100644
--- a/arch/x86/mm/proclocal.c
+++ b/arch/x86/mm/proclocal.c
@@ -10,6 +10,8 @@
 #include <linux/set_memory.h>
 #include <asm/tlb.h>
 
+extern void handle_proclocal_page(struct mm_struct *mm, struct page *page,
+				  unsigned long addr);
 
 static void unmap_leftover_pages_pte(struct mm_struct *mm, pmd_t *pmd,
 				     unsigned long addr, unsigned long end,
@@ -27,6 +29,8 @@ static void unmap_leftover_pages_pte(struct mm_struct *mm, pmd_t *pmd,
 		pte_clear(mm, addr, pte);
 		set_direct_map_default_noflush(page);
 
+		/* callback to non-arch allocator */
+		handle_proclocal_page(mm, page, addr);
 		/*
 		 * scrub page contents. since mm teardown happens from a
 		 * different mm, we cannot just use the process-local virtual
@@ -126,6 +130,7 @@ static void arch_proclocal_teardown_pt(struct mm_struct *mm)
 
 void arch_proclocal_teardown_pages_and_pt(struct mm_struct *mm)
 {
-	unmap_free_leftover_proclocal_pages(mm);
+	if (mm->proclocal_nr_pages)
+		unmap_free_leftover_proclocal_pages(mm);
 	arch_proclocal_teardown_pt(mm);
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1cb9243dd299..4bd8737cc7a6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -510,6 +510,10 @@ struct mm_struct {
 		/* HMM needs to track a few things per mm */
 		struct hmm *hmm;
 #endif
+
+#ifdef CONFIG_PROCLOCAL
+		size_t proclocal_nr_pages;
+#endif
 	} __randomize_layout;
 
 	/*
diff --git a/include/linux/proclocal.h b/include/linux/proclocal.h
index 9dae140c0796..c408e0d1104c 100644
--- a/include/linux/proclocal.h
+++ b/include/linux/proclocal.h
@@ -8,8 +8,27 @@
 
 struct mm_struct;
 
+void *kmalloc_proclocal(size_t size);
+void *kzalloc_proclocal(size_t size);
+void kfree_proclocal(void *vaddr);
+
 void proclocal_mm_exit(struct mm_struct *mm);
 #else  /* !CONFIG_PROCLOCAL */
+static inline void *kmalloc_proclocal(size_t size)
+{
+	return kmalloc(size, GFP_KERNEL);
+}
+
+static inline void * kzalloc_proclocal(size_t size)
+{
+	return kzalloc(size, GFP_KERNEL);
+}
+
+static inline void kfree_proclocal(void *vaddr)
+{
+	kfree(vaddr);
+}
+
 static inline void proclocal_mm_exit(struct mm_struct *mm) { }
 #endif
 
diff --git a/mm/proclocal.c b/mm/proclocal.c
index 72c485c450bf..7a6217faf765 100644
--- a/mm/proclocal.c
+++ b/mm/proclocal.c
@@ -8,6 +8,7 @@
  *
  * Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  */
+#include <linux/genalloc.h>
 #include <linux/mm.h>
 #include <linux/proclocal.h>
 #include <linux/set_memory.h>
@@ -18,6 +19,43 @@
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
+static pte_t *pte_lookup_map(struct mm_struct *mm, unsigned long kvaddr)
+{
+	pgd_t *pgd = pgd_offset(mm, kvaddr);
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	if (IS_ERR_OR_NULL(pgd) || !pgd_present(*pgd))
+		return ERR_PTR(-1);
+
+	p4d = p4d_offset(pgd, kvaddr);
+	if (IS_ERR_OR_NULL(p4d) || !p4d_present(*p4d))
+		return ERR_PTR(-1);
+
+	pud = pud_offset(p4d, kvaddr);
+	if (IS_ERR_OR_NULL(pud) || !pud_present(*pud))
+		return ERR_PTR(-1);
+
+	pmd = pmd_offset(pud, kvaddr);
+	if (IS_ERR_OR_NULL(pmd) || !pmd_present(*pmd))
+		return ERR_PTR(-1);
+
+	return pte_offset_map(pmd, kvaddr);
+}
+
+static struct page *proclocal_find_first_page(struct mm_struct *mm, const void *kvaddr)
+{
+	pte_t *ptep = pte_lookup_map(mm, (unsigned long) kvaddr);
+
+	if (IS_ERR_OR_NULL(ptep))
+		return NULL;
+	if (!pte_present(*ptep))
+		return NULL;
+
+	return pfn_to_page(pte_pfn(*ptep));
+}
+
 void proclocal_release_pages(struct list_head *pages)
 {
 	struct page *pos, *n;
@@ -27,9 +65,121 @@ void proclocal_release_pages(struct list_head *pages)
 	}
 }
 
+static DEFINE_SPINLOCK(proclocal_lock);
+static struct gen_pool *allocator;
+
+static int proclocal_allocator_init(void)
+{
+	int rc;
+
+	allocator = gen_pool_create(PAGE_SHIFT, -1);
+	if (unlikely(IS_ERR(allocator)))
+		return PTR_ERR(allocator);
+	if (!allocator)
+		return -1;
+
+	rc = gen_pool_add(allocator, PROCLOCAL_START, PROCLOCAL_SIZE, -1);
+
+	if (rc)
+		gen_pool_destroy(allocator);
+
+	return rc;
+}
+late_initcall(proclocal_allocator_init);
+
+static void *alloc_virtual(size_t nr_pages)
+{
+	void *kvaddr;
+	spin_lock(&proclocal_lock);
+	kvaddr = (void *)gen_pool_alloc(allocator, nr_pages * PAGE_SIZE);
+	spin_unlock(&proclocal_lock);
+	return kvaddr;
+}
+
+static void free_virtual(const void *kvaddr, size_t nr_pages)
+{
+	spin_lock(&proclocal_lock);
+	gen_pool_free(allocator, (unsigned long)kvaddr,
+			     nr_pages * PAGE_SIZE);
+	spin_unlock(&proclocal_lock);
+}
+
+void *kmalloc_proclocal(size_t size)
+{
+	void *kvaddr = NULL;
+	size_t nr_pages = round_up(size, PAGE_SIZE) / PAGE_SIZE;
+	size_t nr_pages_virtual = nr_pages + 1; /* + guard page */
+
+	BUG_ON(!current);
+	if (!size)
+		return ZERO_SIZE_PTR;
+	might_sleep();
+
+	kvaddr = alloc_virtual(nr_pages_virtual);
+
+	if (IS_ERR_OR_NULL(kvaddr))
+		return kvaddr;
+
+	/* tbd: subsequent patch will allocate and map physical pages */
+
+	return kvaddr;
+}
+EXPORT_SYMBOL(kmalloc_proclocal);
+
+void * kzalloc_proclocal(size_t size)
+{
+	void * kvaddr = kmalloc_proclocal(size);
+
+	if (!IS_ERR_OR_NULL(kvaddr))
+		memset(kvaddr, 0, size);
+	return kvaddr;
+}
+EXPORT_SYMBOL(kzalloc_proclocal);
+
+void kfree_proclocal(void *kvaddr)
+{
+	struct page *first_page;
+	int nr_pages;
+	struct mm_struct *mm;
+
+	if (!kvaddr || kvaddr == ZERO_SIZE_PTR)
+		return;
+
+	pr_debug("kfree for %p (current %p mm %p)\n", kvaddr,
+		 current, current ? current->mm : 0);
+
+	BUG_ON((unsigned long)kvaddr < PROCLOCAL_START);
+	BUG_ON((unsigned long)kvaddr >= (PROCLOCAL_START + PROCLOCAL_SIZE));
+	BUG_ON(!current);
+
+	mm = current->mm;
+	down_write(&mm->mmap_sem);
+	first_page = proclocal_find_first_page(mm, kvaddr);
+	if (IS_ERR_OR_NULL(first_page)) {
+		pr_err("double-free?!\n");
+		BUG();
+	} /* double-free? */
+	nr_pages = first_page->proclocal_nr_pages;
+	BUG_ON(!nr_pages);
+	mm->proclocal_nr_pages -= nr_pages;
+	/* subsequent patch will unmap and release physical pages */
+	up_write(&mm->mmap_sem);
+
+	free_virtual(kvaddr, nr_pages + 1);
+}
+EXPORT_SYMBOL(kfree_proclocal);
+
 void proclocal_mm_exit(struct mm_struct *mm)
 {
 	pr_debug("proclocal_mm_exit for mm %p pgd %p (current is %p)\n", mm, mm->pgd, current);
 
 	arch_proclocal_teardown_pages_and_pt(mm);
 }
+
+void handle_proclocal_page(struct mm_struct *mm, struct page *page,
+				  unsigned long addr)
+{
+	if (page->proclocal_nr_pages) {
+		free_virtual((void *)addr, page->proclocal_nr_pages + 1);
+	}
+}
-- 
2.21.0


* [RFC 05/10] mm: allocate/release physical pages for process-local memory
  2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
                   ` (3 preceding siblings ...)
  2019-06-12 17:08 ` [RFC 04/10] mm: allocate virtual space for process-local memory Marius Hillenbrand
@ 2019-06-12 17:08 ` Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 06/10] kvm/x86: add support for storing vCPU state in " Marius Hillenbrand
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Marius Hillenbrand @ 2019-06-12 17:08 UTC (permalink / raw)
  To: kvm
  Cc: Marius Hillenbrand, linux-kernel, kernel-hardening, linux-mm,
	Alexander Graf, David Woodhouse

Implement the second half of kmalloc and kfree for process-local memory,
which allocates physical pages, maps them into the kernel virtual
address space of the current process, and removes them from the kernel's
shared direct physical mapping. On kfree, the code performs that
sequence in reverse, after scrubbing the pages' contents.

Note that both the allocation and the free path require TLB flushes: on
allocation, to remove stale mappings of the pages from the shared direct
physical mapping; on release, to remove the process-local mappings. Aim
to keep the impact of these flushes minimal by flushing only the
necessary address ranges.

The allocator only handles the smallest page size (i.e., 4 KiB), not
huge pages, because in our use case the size of individual allocations
is on the order of 4 KiB.

Signed-off-by: Marius Hillenbrand <mhillenb@amazon.de>
Cc: Alexander Graf <graf@amazon.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
---
 mm/proclocal.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 165 insertions(+), 2 deletions(-)

diff --git a/mm/proclocal.c b/mm/proclocal.c
index 7a6217faf765..26bc1f3f68a2 100644
--- a/mm/proclocal.c
+++ b/mm/proclocal.c
@@ -56,6 +56,70 @@ static struct page *proclocal_find_first_page(struct mm_struct *mm, const void *
 	return pfn_to_page(pte_pfn(*ptep));
 }
 
+/*
+ * Lookup PTE for a given virtual address. Allocate page table structures, if
+ * they are not present yet.
+ */
+static pte_t *pte_lookup_alloc_map(struct mm_struct *mm, unsigned long kvaddr)
+{
+	pgd_t *pgd = pgd_offset(mm, kvaddr);
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	p4d = p4d_alloc(mm, pgd, kvaddr);
+	if (IS_ERR_OR_NULL(p4d))
+		return (pte_t *)p4d;
+
+	pud = pud_alloc(mm, p4d, kvaddr);
+	if (IS_ERR_OR_NULL(pud))
+		return (pte_t *)pud;
+
+	pmd = pmd_alloc(mm, pud, kvaddr);
+	if (IS_ERR_OR_NULL(pmd))
+		return (pte_t *)pmd;
+
+	return pte_alloc_map(mm, pmd, kvaddr);
+}
+
+static int proclocal_map_notlbflush(struct mm_struct *mm, struct page *page, void *kvaddr)
+{
+	int rc;
+	pte_t *ptep = pte_lookup_alloc_map(mm, (unsigned long)kvaddr);
+
+	if (IS_ERR_OR_NULL(ptep)) {
+		pr_err("failed to pte_lookup_alloc_map, ptep=0x%lx\n",
+		       (unsigned long)ptep);
+		return ptep ? PTR_ERR(ptep) : -ENOMEM;
+	}
+
+	set_pte(ptep, mk_pte(page, kmap_prot));
+	rc = set_direct_map_invalid_noflush(page);
+	if (rc)
+		pte_clear(mm, (unsigned long)kvaddr, ptep);
+	else
+		pr_debug("map pfn %lx at %p for mm %p pgd %p\n", page_to_pfn(page), kvaddr, mm, mm->pgd);
+	return rc;
+}
+
+static void proclocal_unmap_page_notlbflush(struct mm_struct *mm, void *vaddr)
+{
+	pte_t *ptep = pte_lookup_map(mm, (unsigned long)vaddr);
+	pte_t pte;
+	struct page *page;
+
+	BUG_ON(IS_ERR_OR_NULL(ptep));
+	BUG_ON(!pte_present(*ptep)); // already cleared?!
+
+	/* scrub page contents */
+	memset(vaddr, 0, PAGE_SIZE);
+
+	pte = ptep_get_and_clear(mm, (unsigned long)vaddr, ptep);
+	page = pfn_to_page(pte_pfn(pte));
+
+	BUG_ON(set_direct_map_default_noflush(page)); /* should never fail for mapped 4K-pages */
+}
+
 void proclocal_release_pages(struct list_head *pages)
 {
 	struct page *pos, *n;
@@ -65,6 +129,76 @@ void proclocal_release_pages(struct list_head *pages)
 	}
 }
 
+static void proclocal_release_pages_incl_head(struct list_head *pages)
+{
+	proclocal_release_pages(pages);
+	/* the list_head itself is embedded in a struct page we want to release. */
+	__free_page(list_entry(pages, struct page, proclocal_next));
+}
+
+struct physmap_tlb_flush {
+	unsigned long start;
+	unsigned long end;
+};
+
+static inline void track_page_to_flush(struct physmap_tlb_flush *flush, struct page *page)
+{
+	const unsigned long page_start = (unsigned long)page_to_virt(page);
+	const unsigned long page_end = page_start + PAGE_SIZE;
+
+	if (page_start < flush->start)
+		flush->start = page_start;
+	if (page_end > flush->end)
+		flush->end = page_end;
+}
+
+static int alloc_and_map_proclocal_pages(struct mm_struct *mm, void *kvaddr, size_t nr_pages)
+{
+	int rc;
+	size_t i, j;
+	struct page *page;
+	struct list_head *pages_list = NULL;
+	struct physmap_tlb_flush flush = { -1, 0 };
+
+	for (i = 0; i < nr_pages; i++) {
+		page = alloc_page(GFP_KERNEL);
+
+		if (!page) {
+			rc = -ENOMEM;
+			goto unmap_release;
+		}
+
+		rc = proclocal_map_notlbflush(mm, page, kvaddr + i * PAGE_SIZE);
+		if (rc) {
+			__free_page(page);
+			goto unmap_release;
+		}
+
+		track_page_to_flush(&flush, page);
+		INIT_LIST_HEAD(&page->proclocal_next);
+		/* track allocation in first struct page */
+		if (!pages_list) {
+			pages_list = &page->proclocal_next;
+			page->proclocal_nr_pages = nr_pages;
+		} else {
+			list_add_tail(&page->proclocal_next, pages_list);
+			page->proclocal_nr_pages = 0;
+		}
+	}
+
+	/* flush direct mappings of allocated pages from TLBs. */
+	flush_tlb_kernel_range(flush.start, flush.end);
+	return 0;
+
+unmap_release:
+	for (j = 0; j < i; j++)
+		proclocal_unmap_page_notlbflush(mm, kvaddr + j * PAGE_SIZE);
+
+	if (pages_list)
+		proclocal_release_pages_incl_head(pages_list);
+	return rc;
+}
+
 static DEFINE_SPINLOCK(proclocal_lock);
 static struct gen_pool *allocator;
 
@@ -106,9 +240,11 @@ static void free_virtual(const void *kvaddr, size_t nr_pages)
 
 void *kmalloc_proclocal(size_t size)
 {
+	int rc;
 	void *kvaddr = NULL;
 	size_t nr_pages = round_up(size, PAGE_SIZE) / PAGE_SIZE;
 	size_t nr_pages_virtual = nr_pages + 1; /* + guard page */
+	struct mm_struct *mm;
 
 	BUG_ON(!current);
 	if (!size)
@@ -120,7 +256,18 @@ void *kmalloc_proclocal(size_t size)
 	if (IS_ERR_OR_NULL(kvaddr))
 		return kvaddr;
 
-	/* tbd: subsequent patch will allocate and map physical pages */
+	mm = current->mm;
+	down_write(&mm->mmap_sem);
+	rc = alloc_and_map_proclocal_pages(mm, kvaddr, nr_pages);
+	if (!rc)
+		mm->proclocal_nr_pages += nr_pages;
+	up_write(&mm->mmap_sem);
+
+	if (unlikely(rc))
+		kvaddr = ERR_PTR(rc);
+
+	pr_debug("allocated %zd bytes at %p (current %p mm %p)\n", size, kvaddr,
+		 current, current ? current->mm : 0);
 
 	return kvaddr;
 }
@@ -138,6 +285,7 @@ EXPORT_SYMBOL(kzalloc_proclocal);
 
 void kfree_proclocal(void *kvaddr)
 {
+	int i;
 	struct page *first_page;
 	int nr_pages;
 	struct mm_struct *mm;
@@ -152,8 +300,10 @@ void kfree_proclocal(void *kvaddr)
 	BUG_ON((unsigned long)kvaddr >= (PROCLOCAL_START + PROCLOCAL_SIZE));
 	BUG_ON(!current);
 
+	might_sleep();
 	mm = current->mm;
 	down_write(&mm->mmap_sem);
+
 	first_page = proclocal_find_first_page(mm, kvaddr);
 	if (IS_ERR_OR_NULL(first_page)) {
 		pr_err("double-free?!\n");
@@ -162,9 +312,22 @@ void kfree_proclocal(void *kvaddr)
 	nr_pages = first_page->proclocal_nr_pages;
 	BUG_ON(!nr_pages);
 	mm->proclocal_nr_pages -= nr_pages;
-	/* subsequent patch will unmap and release physical pages */
+
+	for (i = 0; i < nr_pages; i++)
+		proclocal_unmap_page_notlbflush(mm, kvaddr + i * PAGE_SIZE);
+
 	up_write(&mm->mmap_sem);
 
+	/*
+	 * Flush process-local mappings from TLBs so that we can release the
+	 * pages afterwards.
+	 */
+	flush_tlb_mm_range(mm, (unsigned long)kvaddr,
+			   (unsigned long)kvaddr + nr_pages * PAGE_SIZE,
+			   PAGE_SHIFT, false);
+
+	proclocal_release_pages_incl_head(&first_page->proclocal_next);
+
 	free_virtual(kvaddr, nr_pages + 1);
 }
 EXPORT_SYMBOL(kfree_proclocal);
-- 
2.21.0


* [RFC 06/10] kvm/x86: add support for storing vCPU state in process-local memory
  2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
                   ` (4 preceding siblings ...)
  2019-06-12 17:08 ` [RFC 05/10] mm: allocate/release physical pages " Marius Hillenbrand
@ 2019-06-12 17:08 ` Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 07/10] kvm, vmx: move CR2 context switch out of assembly path Marius Hillenbrand
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Marius Hillenbrand @ 2019-06-12 17:08 UTC (permalink / raw)
  To: kvm
  Cc: Marius Hillenbrand, linux-kernel, kernel-hardening, linux-mm,
	Alexander Graf, David Woodhouse

The hidden KVM state will contain both guest state that is specific to
x86-64 and state that is specific to SVM or VMX, respectively.
allocate the hidden state in the code paths specific to SVM and VMX. For
the code that is shared between SVM and VMX, introduce a common struct
for hidden guest state.
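
Later patches in this series (8 and 9) move the guest GPRs and FPU state
into this hidden struct. Purely as an illustration of the resulting access
pattern (this is not code from those patches, and the 'regs' field is
hypothetical), a field living in hidden state is reached through the
per-process pointer rather than the globally mapped vcpu structure:

/* illustration only: 'regs' does not exist yet in kvm_vcpu_arch_hidden */
static inline unsigned long hidden_gpr_read(struct kvm_vcpu *vcpu,
					    enum kvm_reg reg)
{
	return vcpu->arch.hidden->regs[reg];
}

static inline void hidden_gpr_write(struct kvm_vcpu *vcpu, enum kvm_reg reg,
				    unsigned long val)
{
	vcpu->arch.hidden->regs[reg] = val;
}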

Signed-off-by: Marius Hillenbrand <mhillenb@amazon.de>
Cc: Alexander Graf <graf@amazon.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |  9 ++++++++
 arch/x86/kvm/Kconfig            | 10 +++++++++
 arch/x86/kvm/svm.c              | 37 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx.c              | 38 ++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |  5 +++++
 5 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5772fba1c64e..41c7b06588f9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -534,7 +534,16 @@ struct kvm_vcpu_hv {
 	cpumask_t tlb_flush;
 };
 
+#ifdef CONFIG_KVM_PROCLOCAL
+struct kvm_vcpu_arch_hidden {
+	u64 placeholder;
+};
+#endif
+
 struct kvm_vcpu_arch {
+#ifdef CONFIG_KVM_PROCLOCAL
+	struct kvm_vcpu_arch_hidden *hidden;
+#endif
 	/*
 	 * rip and regs accesses must go through
 	 * kvm_{register,rip}_{read,write} functions.
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 80abc68b3e90..a3640e2f1a32 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -97,6 +97,16 @@ config KVM_MMU_AUDIT
 	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
 	 auditing of KVM MMU events at runtime.
 
+config KVM_PROCLOCAL
+	bool "Use process-local allocation for KVM"
+	depends on KVM && PROCLOCAL
+	---help---
+	  Use process-local memory for storing vCPU state in KVM. This
+	  option removes assets from the kernel's global direct mapping
+	  of physical memory and stores them only in the address space
+	  of the process hosting a VM.
+
+
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
 source drivers/vhost/Kconfig
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cb202c238de2..af66b93902e5 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -41,6 +41,7 @@
 #include <linux/file.h>
 #include <linux/pagemap.h>
 #include <linux/swap.h>
+#include <linux/proclocal.h>
 
 #include <asm/apic.h>
 #include <asm/perf_event.h>
@@ -190,9 +191,20 @@ static u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
  */
 static uint64_t osvw_len = 4, osvw_status;
 
+#ifdef CONFIG_KVM_PROCLOCAL
+struct vcpu_svm_hidden {
+	struct { /* mimic topology in vcpu_svm: */
+		struct kvm_vcpu_arch_hidden arch;
+	} vcpu;
+};
+#endif
+
 struct vcpu_svm {
 	struct kvm_vcpu vcpu;
 	struct vmcb *vmcb;
+#ifdef CONFIG_KVM_PROCLOCAL
+	struct vcpu_svm_hidden *hidden;
+#endif
 	unsigned long vmcb_pa;
 	struct svm_cpu_data *svm_data;
 	uint64_t asid_generation;
@@ -2129,9 +2141,18 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
 		goto out;
 	}
 
+#ifdef CONFIG_KVM_PROCLOCAL
+	svm->hidden = kzalloc_proclocal(sizeof(struct vcpu_svm_hidden));
+	if (!svm->hidden) {
+		err = -ENOMEM;
+		goto free_svm;
+	}
+	svm->vcpu.arch.hidden = &svm->hidden->vcpu.arch;
+#endif
+
 	err = kvm_vcpu_init(&svm->vcpu, kvm, id);
 	if (err)
-		goto free_svm;
+		goto free_hidden;
 
 	err = -ENOMEM;
 	page = alloc_page(GFP_KERNEL);
@@ -2187,7 +2208,11 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
 	__free_page(page);
 uninit:
 	kvm_vcpu_uninit(&svm->vcpu);
+free_hidden:
+#ifdef CONFIG_KVM_PROCLOCAL
+	kfree_proclocal(svm->hidden);
 free_svm:
+#endif
 	kmem_cache_free(kvm_vcpu_cache, svm);
 out:
 	return ERR_PTR(err);
@@ -2205,6 +2230,16 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+#ifdef CONFIG_KVM_PROCLOCAL
+	/*
+	 * note that the hidden vCPU state in a process-local allocation is
+	 * already cleaned up, because a process's mm is torn down before files
+	 * are closed. make any access in the cleanup code very visible.
+	 */
+	svm->hidden = (struct vcpu_svm_hidden *)POISON_POINTER_DELTA;
+	svm->vcpu.arch.hidden = (struct kvm_vcpu_arch_hidden *)POISON_POINTER_DELTA;
+#endif
+
 	/*
 	 * The vmcb page can be recycled, causing a false negative in
 	 * svm_vcpu_load(). So, ensure that no logical CPU has this
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f9a4faf2d1bc..6f59a6ad7835 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -37,6 +37,8 @@
 #include <linux/hrtimer.h>
 #include <linux/frame.h>
 #include <linux/nospec.h>
+#include <linux/proclocal.h>
+
 #include "kvm_cache_regs.h"
 #include "x86.h"
 
@@ -975,8 +977,19 @@ struct vmx_msrs {
 	struct vmx_msr_entry	val[NR_AUTOLOAD_MSRS];
 };
 
+#ifdef CONFIG_KVM_PROCLOCAL
+struct vcpu_vmx_hidden {
+	struct { /* mimic topology in vcpu_vmx: */
+		struct kvm_vcpu_arch_hidden arch;
+	} vcpu;
+};
+#endif
+
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
+#ifdef CONFIG_KVM_PROCLOCAL
+	struct vcpu_vmx_hidden *hidden;
+#endif
 	unsigned long         host_rsp;
 	u8                    fail;
 	u8		      msr_bitmap_mode;
@@ -11756,6 +11769,16 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+#ifdef CONFIG_KVM_PROCLOCAL
+	/*
+	 * note that the hidden vCPU state in a process-local allocation is
+	 * already cleaned up, because a process's mm is torn down before files
+	 * are closed. make any access in the cleanup code very visible.
+	 */
+	vmx->hidden = (struct vcpu_vmx_hidden *)POISON_POINTER_DELTA;
+	vmx->vcpu.arch.hidden = (struct kvm_vcpu_arch_hidden *)POISON_POINTER_DELTA;
+#endif
+
 	if (enable_pml)
 		vmx_destroy_pml_buffer(vmx);
 	free_vpid(vmx->vpid);
@@ -11777,11 +11800,20 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 	if (!vmx)
 		return ERR_PTR(-ENOMEM);
 
+#ifdef CONFIG_KVM_PROCLOCAL
+	vmx->hidden = kzalloc_proclocal(sizeof(struct vcpu_vmx_hidden));
+	if (!vmx->hidden) {
+		err = -ENOMEM;
+		goto free_vcpu;
+	}
+	vmx->vcpu.arch.hidden = &vmx->hidden->vcpu.arch;
+#endif
+
 	vmx->vpid = allocate_vpid();
 
 	err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
 	if (err)
-		goto free_vcpu;
+		goto free_hidden;
 
 	err = -ENOMEM;
 
@@ -11868,7 +11900,11 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 	vmx_destroy_pml_buffer(vmx);
 uninit_vcpu:
 	kvm_vcpu_uninit(&vmx->vcpu);
+free_hidden:
+#ifdef CONFIG_KVM_PROCLOCAL
+	kfree_proclocal(vmx->hidden);
 free_vcpu:
+#endif
 	free_vpid(vmx->vpid);
 	kmem_cache_free(kvm_vcpu_cache, vmx);
 	return ERR_PTR(err);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 371d98422631..2cfb96ca8cc8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9309,6 +9309,11 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	free_page((unsigned long)vcpu->arch.pio_data);
 	if (!lapic_in_kernel(vcpu))
 		static_key_slow_dec(&kvm_no_apic_vcpu);
+	/*
+	 * note that the hidden vCPU state in a process-local allocation is
+	 * already cleaned up at this point, because a process's mm is torn down
+	 * before files are closed.
+	 */
 }
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
-- 
2.21.0


* [RFC 07/10] kvm, vmx: move CR2 context switch out of assembly path
  2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
                   ` (5 preceding siblings ...)
  2019-06-12 17:08 ` [RFC 06/10] kvm/x86: add support for storing vCPU state in " Marius Hillenbrand
@ 2019-06-12 17:08 ` Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 08/10] kvm, vmx: move register clearing " Marius Hillenbrand
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Marius Hillenbrand @ 2019-06-12 17:08 UTC (permalink / raw)
  To: kvm
  Cc: Marius Hillenbrand, linux-kernel, kernel-hardening, linux-mm,
	Alexander Graf, David Woodhouse, Julian Stecklina

From: Julian Stecklina <jsteckli@amazon.de>

The VM entry/exit path is a giant inline assembly statement. Simplify it
by doing CR2 context switching in plain C. Move CR2 restore behind IBRS
clearing, so we reduce the amount of code we execute with IBRS on.

Using {read,write}_cr2() means KVM will use pv_mmu_ops instead of open
coding native_{read,write}_cr2(). The CR2 code has been done in
assembly since KVM's genesis[1], which predates the addition of the
paravirt ops[2], i.e. KVM isn't deliberately avoiding the paravirt
ops.

[1] Commit 6aa8b732ca01 ("[PATCH] kvm: userspace interface")
[2] Commit d3561b7fa0fb ("[PATCH] paravirt: header and stubs for paravirtualisation")
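
For illustration, the resulting C-level CR2 handling around the VM entry/exit
assembly looks roughly like this (a condensed sketch of the hunks below, not a
literal excerpt):

	/* before entering the guest: restore the guest's CR2 if it changed */
	if (read_cr2() != vcpu->arch.cr2)
		write_cr2(vcpu->arch.cr2);

	/* ... inline-assembly VM entry/exit ... */

	/* after VM exit: save the guest's CR2 */
	vcpu->arch.cr2 = read_cr2();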

Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
[rebased; note that this patch mainly improves the readability of
subsequent patches; we will drop it when rebasing to 5.x, since major
refactoring of KVM makes this patch redundant.]
Signed-off-by: Marius Hillenbrand <mhillenb@amazon.de>
Cc: Alexander Graf <graf@amazon.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/vmx.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6f59a6ad7835..16a383635b59 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -11513,6 +11513,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	evmcs_rsp = static_branch_unlikely(&enable_evmcs) ?
 		(unsigned long)&current_evmcs->host_rsp : 0;
 
+	if (read_cr2() != vcpu->arch.cr2)
+		write_cr2(vcpu->arch.cr2);
+
 	if (static_branch_unlikely(&vmx_l1d_should_flush))
 		vmx_l1d_flush(vcpu);
 
@@ -11532,13 +11535,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 		"2: \n\t"
 		__ex("vmwrite %%" _ASM_SP ", %%" _ASM_DX) "\n\t"
 		"1: \n\t"
-		/* Reload cr2 if changed */
-		"mov %c[cr2](%0), %%" _ASM_AX " \n\t"
-		"mov %%cr2, %%" _ASM_DX " \n\t"
-		"cmp %%" _ASM_AX ", %%" _ASM_DX " \n\t"
-		"je 3f \n\t"
-		"mov %%" _ASM_AX", %%cr2 \n\t"
-		"3: \n\t"
 		/* Check if vmlaunch of vmresume is needed */
 		"cmpl $0, %c[launched](%0) \n\t"
 		/* Load guest registers.  Don't clobber flags. */
@@ -11599,8 +11595,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 		"xor %%r14d, %%r14d \n\t"
 		"xor %%r15d, %%r15d \n\t"
 #endif
-		"mov %%cr2, %%" _ASM_AX "   \n\t"
-		"mov %%" _ASM_AX ", %c[cr2](%0) \n\t"
 
 		"xor %%eax, %%eax \n\t"
 		"xor %%ebx, %%ebx \n\t"
@@ -11632,7 +11626,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 		[r14]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R14])),
 		[r15]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R15])),
 #endif
-		[cr2]"i"(offsetof(struct vcpu_vmx, vcpu.arch.cr2)),
 		[wordsize]"i"(sizeof(ulong))
 	      : "cc", "memory"
 #ifdef CONFIG_X86_64
@@ -11666,6 +11659,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	/* Eliminate branch target predictions from guest mode */
 	vmexit_fill_RSB();
 
+	vcpu->arch.cr2 = read_cr2();
+
 	/* All fields are clean at this point */
 	if (static_branch_unlikely(&enable_evmcs))
 		current_evmcs->hv_clean_fields |=
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC 08/10] kvm, vmx: move register clearing out of assembly path
  2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
                   ` (6 preceding siblings ...)
  2019-06-12 17:08 ` [RFC 07/10] kvm, vmx: move CR2 context switch out of assembly path Marius Hillenbrand
@ 2019-06-12 17:08 ` Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 09/10] kvm, vmx: move gprs to process local memory Marius Hillenbrand
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Marius Hillenbrand @ 2019-06-12 17:08 UTC (permalink / raw)
  To: kvm
  Cc: Marius Hillenbrand, linux-kernel, kernel-hardening, linux-mm,
	Alexander Graf, David Woodhouse, Julian Stecklina

From: Julian Stecklina <jsteckli@amazon.de>

Split the security related register clearing out of the large inline
assembly VM entry path. This results in two slightly less complicated
inline assembly statements, where it is clearer what each one does.
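
For reference, the 32-bit flavor of the new stand-alone clearing statement
boils down to the following (condensed from the hunk below; the 64-bit variant
additionally zeroes r8d through r15d and lists the corresponding clobbers):

	/* zero every GPR that was not reloaded with host state */
	asm volatile(
		"xor %%eax, %%eax \n\t"
		"xor %%ebx, %%ebx \n\t"
		"xor %%esi, %%esi \n\t"
		"xor %%edi, %%edi \n\t"
		::: "cc", "eax", "ebx", "esi", "edi");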

Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
[rebased to 4.20; note that the purpose of this patch is to make the
changes in the next commit more readable. we will drop this patch when
rebasing to 5.x, since major refactoring of KVM makes it redundant.]
Signed-off-by: Marius Hillenbrand <mhillenb@amazon.de>
Cc: Alexander Graf <graf@amazon.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/vmx.c | 46 +++++++++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 16a383635b59..0fe9a4ab8268 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -11582,24 +11582,7 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 		"mov %%r13, %c[r13](%0) \n\t"
 		"mov %%r14, %c[r14](%0) \n\t"
 		"mov %%r15, %c[r15](%0) \n\t"
-		/*
-		* Clear host registers marked as clobbered to prevent
-		* speculative use.
-		*/
-		"xor %%r8d,  %%r8d \n\t"
-		"xor %%r9d,  %%r9d \n\t"
-		"xor %%r10d, %%r10d \n\t"
-		"xor %%r11d, %%r11d \n\t"
-		"xor %%r12d, %%r12d \n\t"
-		"xor %%r13d, %%r13d \n\t"
-		"xor %%r14d, %%r14d \n\t"
-		"xor %%r15d, %%r15d \n\t"
 #endif
-
-		"xor %%eax, %%eax \n\t"
-		"xor %%ebx, %%ebx \n\t"
-		"xor %%esi, %%esi \n\t"
-		"xor %%edi, %%edi \n\t"
 		"pop  %%" _ASM_BP "; pop  %%" _ASM_DX " \n\t"
 		".pushsection .rodata \n\t"
 		".global vmx_return \n\t"
@@ -11636,6 +11619,35 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 #endif
 	      );
 
+	/*
+	 * Explicitly clear (in addition to marking them as clobbered) all GPRs
+	 * that have not been loaded with host state to prevent speculatively
+	 * using the guest's values.
+	 */
+	asm volatile (
+		"xor %%eax, %%eax \n\t"
+		"xor %%ebx, %%ebx \n\t"
+		"xor %%esi, %%esi \n\t"
+		"xor %%edi, %%edi \n\t"
+#ifdef CONFIG_X86_64
+		"xor %%r8d,  %%r8d \n\t"
+		"xor %%r9d,  %%r9d \n\t"
+		"xor %%r10d, %%r10d \n\t"
+		"xor %%r11d, %%r11d \n\t"
+		"xor %%r12d, %%r12d \n\t"
+		"xor %%r13d, %%r13d \n\t"
+		"xor %%r14d, %%r14d \n\t"
+		"xor %%r15d, %%r15d \n\t"
+#endif
+		::: "cc"
+#ifdef CONFIG_X86_64
+		 , "rax", "rbx", "rsi", "rdi"
+		 , "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
+#else
+		 , "eax", "ebx", "esi", "edi"
+#endif
+		);
+
 	/*
 	 * We do not use IBRS in the kernel. If this vCPU has used the
 	 * SPEC_CTRL MSR it may have left it on; save the value and
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC 09/10] kvm, vmx: move gprs to process local memory
  2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
                   ` (7 preceding siblings ...)
  2019-06-12 17:08 ` [RFC 08/10] kvm, vmx: move register clearing " Marius Hillenbrand
@ 2019-06-12 17:08 ` Marius Hillenbrand
  2019-06-12 17:08 ` [RFC 10/10] kvm, x86: move guest FPU state into " Marius Hillenbrand
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Marius Hillenbrand @ 2019-06-12 17:08 UTC (permalink / raw)
  To: kvm
  Cc: Marius Hillenbrand, linux-kernel, kernel-hardening, linux-mm,
	Alexander Graf, David Woodhouse, Julian Stecklina

General-purpose registers (GPRs) contain guest data and must be
protected from information leak vulnerabilities in the kernel.

Move GPRs into process local memory and change the VMX and SVM world
switch and related code accordingly.

The VMX and SVM world switches are giant inline assembly code blocks. To
keep the changes minimal, we pass all required state to them as a pointer
to a single struct in process-local memory, which is hidden from other
processes.

Note that this feature is strictly opt-in. When disabled, the world
switch code remains unchanged.
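
As a rough illustration, the resulting access pattern looks like the following
sketch (condensed from the hunks below; with CONFIG_KVM_PROCLOCAL the accessor
returns the hidden, process-local struct, otherwise it simply returns the vcpu
arch struct itself):

	#ifdef CONFIG_KVM_PROCLOCAL
	static inline struct kvm_vcpu_arch_hidden *kvm_vcpu_arch_state(struct kvm_vcpu_arch *arch)
	{
		return arch->hidden;	/* backed by process-local memory */
	}
	#endif

	/* callers no longer touch vcpu->arch.regs[] directly */
	unsigned long rip = kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RIP];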

Signed-off-by: Marius Hillenbrand <mhillenb@amazon.de>
Inspired-by: Julian Stecklina <js@alien8.de> (while jsteckli@amazon.de)
Cc: Alexander Graf <graf@amazon.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |  27 +++++++-
 arch/x86/kvm/kvm_cache_regs.h   |   4 +-
 arch/x86/kvm/svm.c              |  67 +++++++++++--------
 arch/x86/kvm/vmx.c              | 114 +++++++++++++++++++++-----------
 arch/x86/kvm/x86.c              |   2 +-
 5 files changed, 143 insertions(+), 71 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 41c7b06588f9..4896ecde1c11 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -534,21 +534,30 @@ struct kvm_vcpu_hv {
 	cpumask_t tlb_flush;
 };
 
+typedef unsigned long kvm_arch_regs_t[NR_VCPU_REGS];
 #ifdef CONFIG_KVM_PROCLOCAL
+/*
+ * access to vcpu guest state must go through kvm_vcpu_arch_state(vcpu).
+ */
 struct kvm_vcpu_arch_hidden {
-	u64 placeholder;
+	/*
+	 * rip and regs accesses must go through
+	 * kvm_{register,rip}_{read,write} functions.
+	 */
+	kvm_arch_regs_t regs;
 };
 #endif
 
 struct kvm_vcpu_arch {
 #ifdef CONFIG_KVM_PROCLOCAL
 	struct kvm_vcpu_arch_hidden *hidden;
-#endif
+#else
 	/*
 	 * rip and regs accesses must go through
 	 * kvm_{register,rip}_{read,write} functions.
 	 */
-	unsigned long regs[NR_VCPU_REGS];
+	kvm_arch_regs_t regs;
+#endif
 	u32 regs_avail;
 	u32 regs_dirty;
 
@@ -791,6 +800,18 @@ struct kvm_vcpu_arch {
 	bool l1tf_flush_l1d;
 };
 
+#ifdef CONFIG_KVM_PROCLOCAL
+static inline struct kvm_vcpu_arch_hidden *kvm_vcpu_arch_state(struct kvm_vcpu_arch *arch)
+{
+	return arch->hidden;
+}
+#else
+static inline struct kvm_vcpu_arch *kvm_vcpu_arch_state(struct kvm_vcpu_arch *vcpu_arch)
+{
+	return vcpu_arch;
+}
+#endif
+
 struct kvm_lpage_info {
 	int disallow_lpage;
 };
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 9619dcc2b325..b62c42d637e0 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -13,14 +13,14 @@ static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu,
 	if (!test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail))
 		kvm_x86_ops->cache_reg(vcpu, reg);
 
-	return vcpu->arch.regs[reg];
+	return kvm_vcpu_arch_state(&vcpu->arch)->regs[reg];
 }
 
 static inline void kvm_register_write(struct kvm_vcpu *vcpu,
 				      enum kvm_reg reg,
 				      unsigned long val)
 {
-	vcpu->arch.regs[reg] = val;
+	kvm_vcpu_arch_state(&vcpu->arch)->regs[reg] = val;
 	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
 	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
 }
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index af66b93902e5..486ad451a67d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -196,6 +196,7 @@ struct vcpu_svm_hidden {
 	struct { /* mimic topology in vcpu_svm: */
 		struct kvm_vcpu_arch_hidden arch;
 	} vcpu;
+	unsigned long vmcb_pa;
 };
 #endif
 
@@ -1585,7 +1586,7 @@ static void init_vmcb(struct vcpu_svm *svm)
 	save->dr6 = 0xffff0ff0;
 	kvm_set_rflags(&svm->vcpu, 2);
 	save->rip = 0x0000fff0;
-	svm->vcpu.arch.regs[VCPU_REGS_RIP] = save->rip;
+	kvm_vcpu_arch_state(&svm->vcpu.arch)->regs[VCPU_REGS_RIP] = save->rip;
 
 	/*
 	 * svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0.
@@ -3150,7 +3151,7 @@ static int nested_svm_exit_handled_msr(struct vcpu_svm *svm)
 	if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT)))
 		return NESTED_EXIT_HOST;
 
-	msr    = svm->vcpu.arch.regs[VCPU_REGS_RCX];
+	msr    = kvm_vcpu_arch_state(&svm->vcpu.arch)->regs[VCPU_REGS_RCX];
 	offset = svm_msrpm_offset(msr);
 	write  = svm->vmcb->control.exit_info_1 & 1;
 	mask   = 1 << ((2 * (msr & 0xf)) + write);
@@ -5656,10 +5657,11 @@ static void svm_cancel_injection(struct kvm_vcpu *vcpu)
 static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	unsigned long *regs = kvm_vcpu_arch_state(&vcpu->arch)->regs;
 
-	svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX];
-	svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
-	svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
+	svm->vmcb->save.rax = regs[VCPU_REGS_RAX];
+	svm->vmcb->save.rsp = regs[VCPU_REGS_RSP];
+	svm->vmcb->save.rip = regs[VCPU_REGS_RIP];
 
 	/*
 	 * A vmexit emulation is required before the vcpu can be executed
@@ -5690,6 +5692,10 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 
 	svm->vmcb->save.cr2 = vcpu->arch.cr2;
 
+#ifdef CONFIG_KVM_PROCLOCAL
+	svm->hidden->vmcb_pa = svm->vmcb_pa;
+#endif
+
 	clgi();
 
 	/*
@@ -5765,24 +5771,31 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 		"xor %%edi, %%edi \n\t"
 		"pop %%" _ASM_BP
 		:
+#ifdef CONFIG_KVM_PROCLOCAL
+		: [svm]"a"(svm->hidden),
+#define SVM_STATE_STRUCT vcpu_svm_hidden
+#else
 		: [svm]"a"(svm),
-		  [vmcb]"i"(offsetof(struct vcpu_svm, vmcb_pa)),
-		  [rbx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RBX])),
-		  [rcx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RCX])),
-		  [rdx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RDX])),
-		  [rsi]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RSI])),
-		  [rdi]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RDI])),
-		  [rbp]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RBP]))
+#define SVM_STATE_STRUCT vcpu_svm
+#endif
+		  [vmcb]"i"(offsetof(struct SVM_STATE_STRUCT, vmcb_pa)),
+		  [rbx]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RBX])),
+		  [rcx]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RCX])),
+		  [rdx]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RDX])),
+		  [rsi]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RSI])),
+		  [rdi]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RDI])),
+		  [rbp]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RBP]))
 #ifdef CONFIG_X86_64
-		  , [r8]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R8])),
-		  [r9]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R9])),
-		  [r10]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R10])),
-		  [r11]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R11])),
-		  [r12]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R12])),
-		  [r13]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R13])),
-		  [r14]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R14])),
-		  [r15]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R15]))
+		  , [r8]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R8])),
+		  [r9]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R9])),
+		  [r10]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R10])),
+		  [r11]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R11])),
+		  [r12]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R12])),
+		  [r13]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R13])),
+		  [r14]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R14])),
+		  [r15]"i"(offsetof(struct SVM_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R15]))
 #endif
+#undef SVM_STATE_STRUCT
 		: "cc", "memory"
 #ifdef CONFIG_X86_64
 		, "rbx", "rcx", "rdx", "rsi", "rdi"
@@ -5829,9 +5842,9 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 	x86_spec_ctrl_restore_host(svm->spec_ctrl, svm->virt_spec_ctrl);
 
 	vcpu->arch.cr2 = svm->vmcb->save.cr2;
-	vcpu->arch.regs[VCPU_REGS_RAX] = svm->vmcb->save.rax;
-	vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp;
-	vcpu->arch.regs[VCPU_REGS_RIP] = svm->vmcb->save.rip;
+	regs[VCPU_REGS_RAX] = svm->vmcb->save.rax;
+	regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp;
+	regs[VCPU_REGS_RIP] = svm->vmcb->save.rip;
 
 	if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
 		kvm_before_interrupt(&svm->vcpu);
@@ -6265,14 +6278,16 @@ static int svm_pre_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
 	int ret;
 
 	if (is_guest_mode(vcpu)) {
+		unsigned long *regs = kvm_vcpu_arch_state(&vcpu->arch)->regs;
+
 		/* FED8h - SVM Guest */
 		put_smstate(u64, smstate, 0x7ed8, 1);
 		/* FEE0h - SVM Guest VMCB Physical Address */
 		put_smstate(u64, smstate, 0x7ee0, svm->nested.vmcb);
 
-		svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX];
-		svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
-		svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
+		svm->vmcb->save.rax = regs[VCPU_REGS_RAX];
+		svm->vmcb->save.rsp = regs[VCPU_REGS_RSP];
+		svm->vmcb->save.rip = regs[VCPU_REGS_RIP];
 
 		ret = nested_svm_vmexit(svm);
 		if (ret)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0fe9a4ab8268..0d2d6b7b0d50 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -982,6 +982,10 @@ struct vcpu_vmx_hidden {
 	struct { /* mimic topology in vcpu_vmx: */
 		struct kvm_vcpu_arch_hidden arch;
 	} vcpu;
+	bool          __launched; /* temporary, used in vmx_vcpu_run */
+	/* shadow fields, used in vmx_vcpu_run */
+	u8            fail;
+	unsigned long host_rsp;
 };
 #endif
 
@@ -1024,7 +1028,9 @@ struct vcpu_vmx {
 	struct loaded_vmcs    vmcs01;
 	struct loaded_vmcs   *loaded_vmcs;
 	struct loaded_vmcs   *loaded_cpu_state;
+#ifndef CONFIG_KVM_PROCLOCAL
 	bool                  __launched; /* temporary, used in vmx_vcpu_run */
+#endif
 	struct msr_autoload {
 		struct vmx_msrs guest;
 		struct vmx_msrs host;
@@ -4612,10 +4618,10 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
 	switch (reg) {
 	case VCPU_REGS_RSP:
-		vcpu->arch.regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
+		kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
 		break;
 	case VCPU_REGS_RIP:
-		vcpu->arch.regs[VCPU_REGS_RIP] = vmcs_readl(GUEST_RIP);
+		kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RIP] = vmcs_readl(GUEST_RIP);
 		break;
 	case VCPU_EXREG_PDPTR:
 		if (enable_ept)
@@ -6965,7 +6971,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmx->rmode.vm86_active = 0;
 	vmx->spec_ctrl = 0;
 
-	vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
+	kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RDX] = get_rdx_init_val();
 	kvm_set_cr8(vcpu, 0);
 
 	if (!init_event) {
@@ -7712,7 +7718,7 @@ static int handle_cpuid(struct kvm_vcpu *vcpu)
 
 static int handle_rdmsr(struct kvm_vcpu *vcpu)
 {
-	u32 ecx = vcpu->arch.regs[VCPU_REGS_RCX];
+	u32 ecx = kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RCX];
 	struct msr_data msr_info;
 
 	msr_info.index = ecx;
@@ -7726,17 +7732,17 @@ static int handle_rdmsr(struct kvm_vcpu *vcpu)
 	trace_kvm_msr_read(ecx, msr_info.data);
 
 	/* FIXME: handling of bits 32:63 of rax, rdx */
-	vcpu->arch.regs[VCPU_REGS_RAX] = msr_info.data & -1u;
-	vcpu->arch.regs[VCPU_REGS_RDX] = (msr_info.data >> 32) & -1u;
+	kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RAX] = msr_info.data & -1u;
+	kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RDX] = (msr_info.data >> 32) & -1u;
 	return kvm_skip_emulated_instruction(vcpu);
 }
 
 static int handle_wrmsr(struct kvm_vcpu *vcpu)
 {
 	struct msr_data msr;
-	u32 ecx = vcpu->arch.regs[VCPU_REGS_RCX];
-	u64 data = (vcpu->arch.regs[VCPU_REGS_RAX] & -1u)
-		| ((u64)(vcpu->arch.regs[VCPU_REGS_RDX] & -1u) << 32);
+	u32 ecx = kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RCX];
+	u64 data = (kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RAX] & -1u)
+		| ((u64)(kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RDX] & -1u) << 32);
 
 	msr.data = data;
 	msr.index = ecx;
@@ -10036,7 +10042,7 @@ static bool valid_ept_address(struct kvm_vcpu *vcpu, u64 address)
 static int nested_vmx_eptp_switching(struct kvm_vcpu *vcpu,
 				     struct vmcs12 *vmcs12)
 {
-	u32 index = vcpu->arch.regs[VCPU_REGS_RCX];
+	u32 index = kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RCX];
 	u64 address;
 	bool accessed_dirty;
 	struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
@@ -10082,7 +10088,7 @@ static int handle_vmfunc(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct vmcs12 *vmcs12;
-	u32 function = vcpu->arch.regs[VCPU_REGS_RAX];
+	u32 function = kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RAX];
 
 	/*
 	 * VMFUNC is only supported for nested guests, but we always enable the
@@ -10241,7 +10247,7 @@ static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu,
 static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
 	struct vmcs12 *vmcs12, u32 exit_reason)
 {
-	u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
+	u32 msr_index = kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RCX];
 	gpa_t bitmap;
 
 	if (!nested_cpu_has(vmcs12, CPU_BASED_USE_MSR_BITMAPS))
@@ -11467,9 +11473,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	}
 
 	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
-		vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
+		vmcs_writel(GUEST_RSP, kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RSP]);
 	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
-		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+		vmcs_writel(GUEST_RIP, kvm_vcpu_arch_state(&vcpu->arch)->regs[VCPU_REGS_RIP]);
 
 	cr3 = __get_current_cr3_fast();
 	if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
@@ -11508,7 +11514,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	 */
 	x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);
 
+#ifdef CONFIG_KVM_PROCLOCAL
+	vmx->hidden->__launched = vmx->loaded_vmcs->launched;
+	vmx->hidden->host_rsp = vmx->host_rsp;
+#else
 	vmx->__launched = vmx->loaded_vmcs->launched;
+#endif
 
 	evmcs_rsp = static_branch_unlikely(&enable_evmcs) ?
 		(unsigned long)&current_evmcs->host_rsp : 0;
@@ -11588,27 +11599,35 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 		".global vmx_return \n\t"
 		"vmx_return: " _ASM_PTR " 2b \n\t"
 		".popsection"
-	      : : "c"(vmx), "d"((unsigned long)HOST_RSP), "S"(evmcs_rsp),
-		[launched]"i"(offsetof(struct vcpu_vmx, __launched)),
-		[fail]"i"(offsetof(struct vcpu_vmx, fail)),
-		[host_rsp]"i"(offsetof(struct vcpu_vmx, host_rsp)),
-		[rax]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RAX])),
-		[rbx]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RBX])),
-		[rcx]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RCX])),
-		[rdx]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RDX])),
-		[rsi]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RSI])),
-		[rdi]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RDI])),
-		[rbp]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RBP])),
+#ifdef CONFIG_KVM_PROCLOCAL
+		: : "c"(vmx->hidden),
+#define VMX_STATE_STRUCT vcpu_vmx_hidden
+#else
+		: : "c"(vmx),
+#define VMX_STATE_STRUCT vcpu_vmx
+#endif
+		"d"((unsigned long)HOST_RSP), "S"(evmcs_rsp),
+		[launched]"i"(offsetof(struct VMX_STATE_STRUCT, __launched)),
+		[fail]"i"(offsetof(struct VMX_STATE_STRUCT, fail)),
+		[host_rsp]"i"(offsetof(struct VMX_STATE_STRUCT, host_rsp)),
+		[rax]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RAX])),
+		[rbx]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RBX])),
+		[rcx]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RCX])),
+		[rdx]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RDX])),
+		[rsi]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RSI])),
+		[rdi]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RDI])),
+		[rbp]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_RBP])),
 #ifdef CONFIG_X86_64
-		[r8]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R8])),
-		[r9]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R9])),
-		[r10]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R10])),
-		[r11]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R11])),
-		[r12]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R12])),
-		[r13]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R13])),
-		[r14]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R14])),
-		[r15]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R15])),
+		[r8]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R8])),
+		[r9]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R9])),
+		[r10]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R10])),
+		[r11]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R11])),
+		[r12]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R12])),
+		[r13]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R13])),
+		[r14]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R14])),
+		[r15]"i"(offsetof(struct VMX_STATE_STRUCT, vcpu.arch.regs[VCPU_REGS_R15])),
 #endif
+#undef VMX_STATE_STRUCT
 		[wordsize]"i"(sizeof(ulong))
 	      : "cc", "memory"
 #ifdef CONFIG_X86_64
@@ -11671,6 +11690,11 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	/* Eliminate branch target predictions from guest mode */
 	vmexit_fill_RSB();
 
+#ifdef CONFIG_KVM_PROCLOCAL
+	vmx->fail = vmx->hidden->fail;
+	vmx->host_rsp = vmx->hidden->host_rsp;
+#endif
+
 	vcpu->arch.cr2 = read_cr2();
 
 	/* All fields are clean at this point */
@@ -11794,7 +11818,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 	free_loaded_vmcs(vmx->loaded_vmcs);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);
-	kmem_cache_free(kvm_vcpu_cache, vmx);
+		kmem_cache_free(kvm_vcpu_cache, vmx);
 }
 
 static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
@@ -13570,7 +13594,12 @@ static int __noclone nested_vmx_check_vmentry_hw(struct kvm_vcpu *vcpu)
 		vmx->loaded_vmcs->host_state.cr4 = cr4;
 	}
 
+#ifdef CONFIG_KVM_PROCLOCAL
+	vmx->hidden->__launched = vmx->loaded_vmcs->launched;
+	vmx->hidden->host_rsp = vmx->host_rsp;
+#else
 	vmx->__launched = vmx->loaded_vmcs->launched;
+#endif
 
 	asm(
 		/* Set HOST_RSP */
@@ -13593,12 +13622,19 @@ static int __noclone nested_vmx_check_vmentry_hw(struct kvm_vcpu *vcpu)
 		".global vmx_early_consistency_check_return\n\t"
 		"vmx_early_consistency_check_return: " _ASM_PTR " 2b\n\t"
 		".popsection"
-	      :
-	      : "c"(vmx), "d"((unsigned long)HOST_RSP),
-		[launched]"i"(offsetof(struct vcpu_vmx, __launched)),
-		[fail]"i"(offsetof(struct vcpu_vmx, fail)),
-		[host_rsp]"i"(offsetof(struct vcpu_vmx, host_rsp))
+#ifdef CONFIG_KVM_PROCLOCAL
+	      : : "c"(vmx->hidden),
+#define VMX_STATE_STRUCT vcpu_vmx_hidden
+#else
+	      : : "c"(vmx),
+#define VMX_STATE_STRUCT vcpu_vmx
+#endif
+	      "d"((unsigned long)HOST_RSP),
+		[launched]"i"(offsetof(struct VMX_STATE_STRUCT, __launched)),
+		[fail]"i"(offsetof(struct VMX_STATE_STRUCT, fail)),
+		[host_rsp]"i"(offsetof(struct VMX_STATE_STRUCT, host_rsp))
 	      : "rax", "cc", "memory"
+#undef VMX_STATE_STRUCT
 	);
 
 	vmcs_writel(HOST_RIP, vmx_return);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2cfb96ca8cc8..35e41a772807 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9048,7 +9048,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 		vcpu->arch.xcr0 = XFEATURE_MASK_FP;
 	}
 
-	memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
+	memset(kvm_vcpu_arch_state(&vcpu->arch)->regs, 0, sizeof(kvm_arch_regs_t));
 	vcpu->arch.regs_avail = ~0;
 	vcpu->arch.regs_dirty = ~0;
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC 10/10] kvm, x86: move guest FPU state into process local memory
  2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
                   ` (8 preceding siblings ...)
  2019-06-12 17:08 ` [RFC 09/10] kvm, vmx: move gprs to process local memory Marius Hillenbrand
@ 2019-06-12 17:08 ` Marius Hillenbrand
  2019-06-12 18:25 ` [RFC 00/10] Process-local memory allocations for hiding KVM secrets Sean Christopherson
  2019-06-12 19:55 ` Dave Hansen
  11 siblings, 0 replies; 44+ messages in thread
From: Marius Hillenbrand @ 2019-06-12 17:08 UTC (permalink / raw)
  To: kvm
  Cc: Marius Hillenbrand, linux-kernel, kernel-hardening, linux-mm,
	Alexander Graf, David Woodhouse, Julian Stecklina

FPU registers contain guest data and must be protected from information
leak vulnerabilities in the kernel.

FPU register state for vCPUs is allocated from the globally visible
kernel heap. Change this to use process-local memory instead, and thus
prevent access (or prefetching) from any other kernel context.
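
For illustration, a condensed sketch of the resulting layout and access pattern
(mirroring the hunks below; the hidden struct itself is allocated with
kzalloc_proclocal() by the earlier KVM patch in this series):

	#ifdef CONFIG_KVM_PROCLOCAL
	struct kvm_vcpu_arch_hidden {
		kvm_arch_regs_t regs;
		struct fpu guest_fpu;	/* guest FPU state, now process-local */
	};
	#endif

	/* all guest FPU accesses go through the accessor, e.g.: */
	copy_fpregs_to_fpstate(&kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu);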

Signed-off-by: Marius Hillenbrand <mhillenb@amazon.de>
Inspired-by: Julian Stecklina <js@alien8.de> (while jsteckli@amazon.de)
Cc: Alexander Graf <graf@amazon.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |  8 ++++++++
 arch/x86/kvm/x86.c              | 24 ++++++++++++------------
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4896ecde1c11..b3574217b011 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -36,6 +36,7 @@
 #include <asm/asm.h>
 #include <asm/kvm_page_track.h>
 #include <asm/hyperv-tlfs.h>
+#include <asm/proclocal.h>
 
 #define KVM_MAX_VCPUS 288
 #define KVM_SOFT_MAX_VCPUS 240
@@ -545,6 +546,7 @@ struct kvm_vcpu_arch_hidden {
 	 * kvm_{register,rip}_{read,write} functions.
 	 */
 	kvm_arch_regs_t regs;
+	struct fpu guest_fpu;
 };
 #endif
 
@@ -631,9 +633,15 @@ struct kvm_vcpu_arch {
 	 * it is switched out separately at VMENTER and VMEXIT time. The
 	 * "guest_fpu" state here contains the guest FPU context, with the
 	 * host PRKU bits.
+	 *
+	 * With process-local memory, the guest FPU state will be hidden in
+	 * kvm_vcpu_arch_hidden. Thus, access to this struct must go through
+	 * kvm_vcpu_arch_state(vcpu).
 	 */
 	struct fpu user_fpu;
+#ifndef CONFIG_KVM_PROCLOCAL
 	struct fpu guest_fpu;
+#endif
 
 	u64 xcr0;
 	u64 guest_supported_xcr0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 35e41a772807..480b4ed438ae 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3792,7 +3792,7 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
 
 static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 {
-	struct xregs_state *xsave = &vcpu->arch.guest_fpu.state.xsave;
+	struct xregs_state *xsave = &kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu.state.xsave;
 	u64 xstate_bv = xsave->header.xfeatures;
 	u64 valid;
 
@@ -3834,7 +3834,7 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 
 static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
 {
-	struct xregs_state *xsave = &vcpu->arch.guest_fpu.state.xsave;
+	struct xregs_state *xsave = &kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu.state.xsave;
 	u64 xstate_bv = *(u64 *)(src + XSAVE_HDR_OFFSET);
 	u64 valid;
 
@@ -3882,7 +3882,7 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
 		fill_xsave((u8 *) guest_xsave->region, vcpu);
 	} else {
 		memcpy(guest_xsave->region,
-			&vcpu->arch.guest_fpu.state.fxsave,
+			&kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu.state.fxsave,
 			sizeof(struct fxregs_state));
 		*(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)] =
 			XFEATURE_MASK_FPSSE;
@@ -3912,7 +3912,7 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
 		if (xstate_bv & ~XFEATURE_MASK_FPSSE ||
 			mxcsr & ~mxcsr_feature_mask)
 			return -EINVAL;
-		memcpy(&vcpu->arch.guest_fpu.state.fxsave,
+		memcpy(&kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu.state.fxsave,
 			guest_xsave->region, sizeof(struct fxregs_state));
 	}
 	return 0;
@@ -8302,7 +8302,7 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 	preempt_disable();
 	copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
 	/* PKRU is separately restored in kvm_x86_ops->run.  */
-	__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
+	__copy_kernel_to_fpregs(&kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu.state,
 				~XFEATURE_MASK_PKRU);
 	preempt_enable();
 	trace_kvm_fpu(1);
@@ -8312,7 +8312,7 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 {
 	preempt_disable();
-	copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
+	copy_fpregs_to_fpstate(&kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu);
 	copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
 	preempt_enable();
 	++vcpu->stat.fpu_reload;
@@ -8807,7 +8807,7 @@ int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 
 	vcpu_load(vcpu);
 
-	fxsave = &vcpu->arch.guest_fpu.state.fxsave;
+	fxsave = &kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu.state.fxsave;
 	memcpy(fpu->fpr, fxsave->st_space, 128);
 	fpu->fcw = fxsave->cwd;
 	fpu->fsw = fxsave->swd;
@@ -8827,7 +8827,7 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 
 	vcpu_load(vcpu);
 
-	fxsave = &vcpu->arch.guest_fpu.state.fxsave;
+	fxsave = &kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu.state.fxsave;
 
 	memcpy(fxsave->st_space, fpu->fpr, 128);
 	fxsave->cwd = fpu->fcw;
@@ -8883,9 +8883,9 @@ static int sync_regs(struct kvm_vcpu *vcpu)
 
 static void fx_init(struct kvm_vcpu *vcpu)
 {
-	fpstate_init(&vcpu->arch.guest_fpu.state);
+	fpstate_init(&kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu.state);
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		vcpu->arch.guest_fpu.state.xsave.header.xcomp_bv =
+		kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu.state.xsave.header.xcomp_bv =
 			host_xcr0 | XSTATE_COMPACTION_ENABLED;
 
 	/*
@@ -9009,11 +9009,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 		 */
 		if (init_event)
 			kvm_put_guest_fpu(vcpu);
-		mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu.state.xsave,
+		mpx_state_buffer = get_xsave_addr(&kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu.state.xsave,
 					XFEATURE_MASK_BNDREGS);
 		if (mpx_state_buffer)
 			memset(mpx_state_buffer, 0, sizeof(struct mpx_bndreg_state));
-		mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu.state.xsave,
+		mpx_state_buffer = get_xsave_addr(&kvm_vcpu_arch_state(&vcpu->arch)->guest_fpu.state.xsave,
 					XFEATURE_MASK_BNDCSR);
 		if (mpx_state_buffer)
 			memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
                   ` (9 preceding siblings ...)
  2019-06-12 17:08 ` [RFC 10/10] kvm, x86: move guest FPU state into " Marius Hillenbrand
@ 2019-06-12 18:25 ` Sean Christopherson
  2019-06-13  7:20   ` Alexander Graf
  2019-06-13 10:54   ` Liran Alon
  2019-06-12 19:55 ` Dave Hansen
  11 siblings, 2 replies; 44+ messages in thread
From: Sean Christopherson @ 2019-06-12 18:25 UTC (permalink / raw)
  To: Marius Hillenbrand
  Cc: kvm, linux-kernel, kernel-hardening, linux-mm, Alexander Graf,
	David Woodhouse

On Wed, Jun 12, 2019 at 07:08:24PM +0200, Marius Hillenbrand wrote:
> The Linux kernel has a global address space that is the same for any
> kernel code. This address space becomes a liability in a world with
> processor information leak vulnerabilities, such as L1TF. With the right
> cache load gadget, an attacker-controlled hyperthread pair can leak
> arbitrary data via L1TF. Disabling hyperthreading is one recommended
> mitigation, but it comes with a large performance hit for a wide range
> of workloads.
> 
> An alternative mitigation is to not make certain data in the kernel
> globally visible, but only when the kernel executes in the context of
> the process where this data belongs to.
>
> This patch series proposes to introduce a region for what we call
> process-local memory into the kernel's virtual address space. Page
> tables and mappings in that region will be exclusive to one address
> space, instead of implicitly shared between all kernel address spaces.
> Any data placed in that region will be out of reach of cache load
> gadgets that execute in different address spaces. To implement
> process-local memory, we introduce a new interface kmalloc_proclocal() /
> kfree_proclocal() that allocates and maps pages exclusively into the
> current kernel address space. As a first use case, we move architectural
> state of guest CPUs in KVM out of reach of other kernel address spaces.

Can you briefly describe what types of attacks this is intended to
mitigate?  E.g. guest-guest, userspace-guest, etc...  I don't want to
make comments based on my potentially bad assumptions.
 
> The patch set is a prototype for x86-64 that we have developed on top of
> kernel 4.20.17 (with cherry-picked commit d253ca0c3865 "x86/mm/cpa: Add
> set_direct_map_*() functions"). I am aware that the integration with KVM
> will see some changes while rebasing to 5.x. Patches 7 and 8, in

Ha, "some" :-)

> particular, help make patch 9 more readable, but will be dropped in
> rebasing. We have tested the code on both Intel and AMDs, launching VMs
> in a loop. So far, we have not done in-depth performance evaluation.
> Impact on starting VMs was within measurement noise.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
                   ` (10 preceding siblings ...)
  2019-06-12 18:25 ` [RFC 00/10] Process-local memory allocations for hiding KVM secrets Sean Christopherson
@ 2019-06-12 19:55 ` Dave Hansen
  2019-06-12 20:27   ` Andy Lutomirski
  2019-06-13  7:27   ` Alexander Graf
  11 siblings, 2 replies; 44+ messages in thread
From: Dave Hansen @ 2019-06-12 19:55 UTC (permalink / raw)
  To: Marius Hillenbrand, kvm
  Cc: linux-kernel, kernel-hardening, linux-mm, Alexander Graf,
	David Woodhouse, the arch/x86 maintainers, Andy Lutomirski,
	Peter Zijlstra

On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> This patch series proposes to introduce a region for what we call
> process-local memory into the kernel's virtual address space. 

It might be fun to cc some x86 folks on this series.  They might have
some relevant opinions. ;)

A few high-level questions:

Why go to all this trouble to hide guest state like registers if all the
guest data itself is still mapped?

Where's the context-switching code?  Did I just miss it?

We've discussed having per-cpu page tables where a given PGD is only in
use from one CPU at a time.  I *think* this scheme still works in such a
case, it just adds one more PGD entry that would have to be context-switched.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-12 19:55 ` Dave Hansen
@ 2019-06-12 20:27   ` Andy Lutomirski
  2019-06-12 20:41     ` Dave Hansen
                       ` (2 more replies)
  2019-06-13  7:27   ` Alexander Graf
  1 sibling, 3 replies; 44+ messages in thread
From: Andy Lutomirski @ 2019-06-12 20:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Marius Hillenbrand, kvm, linux-kernel, kernel-hardening,
	linux-mm, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Andy Lutomirski, Peter Zijlstra



> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
>> This patch series proposes to introduce a region for what we call
>> process-local memory into the kernel's virtual address space. 
> 
> It might be fun to cc some x86 folks on this series.  They might have
> some relevant opinions. ;)
> 
> A few high-level questions:
> 
> Why go to all this trouble to hide guest state like registers if all the
> guest data itself is still mapped?
> 
> Where's the context-switching code?  Did I just miss it?
> 
> We've discussed having per-cpu page tables where a given PGD is only in
> use from one CPU at a time.  I *think* this scheme still works in such a
> case, it just adds one more PGD entry that would have to context-switched.

Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-12 20:27   ` Andy Lutomirski
@ 2019-06-12 20:41     ` Dave Hansen
  2019-06-12 20:56       ` Andy Lutomirski
  2019-06-13  1:30     ` Andy Lutomirski
  2019-06-14 14:21     ` Thomas Gleixner
  2 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2019-06-12 20:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Marius Hillenbrand, kvm, linux-kernel, kernel-hardening,
	linux-mm, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Andy Lutomirski, Peter Zijlstra

On 6/12/19 1:27 PM, Andy Lutomirski wrote:
>> We've discussed having per-cpu page tables where a given PGD is
>> only in use from one CPU at a time.  I *think* this scheme still
>> works in such a case, it just adds one more PGD entry that would
>> have to context-switched.
> Fair warning: Linus is on record as absolutely hating this idea. He
> might change his mind, but it’s an uphill battle.

Just to be clear, are you referring to the per-cpu PGDs, or to this
patch set with a per-mm kernel area?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-12 20:41     ` Dave Hansen
@ 2019-06-12 20:56       ` Andy Lutomirski
  0 siblings, 0 replies; 44+ messages in thread
From: Andy Lutomirski @ 2019-06-12 20:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Marius Hillenbrand, kvm, linux-kernel, kernel-hardening,
	linux-mm, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Andy Lutomirski, Peter Zijlstra



> On Jun 12, 2019, at 1:41 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 6/12/19 1:27 PM, Andy Lutomirski wrote:
>>> We've discussed having per-cpu page tables where a given PGD is
>>> only in use from one CPU at a time.  I *think* this scheme still
>>> works in such a case, it just adds one more PGD entry that would
>>> have to context-switched.
>> Fair warning: Linus is on record as absolutely hating this idea. He
>> might change his mind, but it’s an uphill battle.
> 
> Just to be clear, are you referring to the per-cpu PGDs, or to this
> patch set with a per-mm kernel area?

per-CPU PGDs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-12 20:27   ` Andy Lutomirski
  2019-06-12 20:41     ` Dave Hansen
@ 2019-06-13  1:30     ` Andy Lutomirski
  2019-06-13  1:50       ` Nadav Amit
  2019-06-13  7:52       ` Alexander Graf
  2019-06-14 14:21     ` Thomas Gleixner
  2 siblings, 2 replies; 44+ messages in thread
From: Andy Lutomirski @ 2019-06-13  1:30 UTC (permalink / raw)
  To: Dave Hansen, Nadav Amit
  Cc: Marius Hillenbrand, kvm list, LKML, Kernel Hardening, Linux-MM,
	Alexander Graf, David Woodhouse, the arch/x86 maintainers,
	Andy Lutomirski, Peter Zijlstra

On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
> > On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> >> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> >> This patch series proposes to introduce a region for what we call
> >> process-local memory into the kernel's virtual address space.
> >
> > It might be fun to cc some x86 folks on this series.  They might have
> > some relevant opinions. ;)
> >
> > A few high-level questions:
> >
> > Why go to all this trouble to hide guest state like registers if all the
> > guest data itself is still mapped?
> >
> > Where's the context-switching code?  Did I just miss it?
> >
> > We've discussed having per-cpu page tables where a given PGD is only in
> > use from one CPU at a time.  I *think* this scheme still works in such a
> > case, it just adds one more PGD entry that would have to context-switched.
>
> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.

I looked at the patch, and it (sensibly) has nothing to do with
per-cpu PGDs.  So it's in great shape!

Seriously, though, here are some very high-level review comments:

Please don't call it "process local", since "process" is meaningless.
Call it "mm local" or something like that.

We already have a per-mm kernel mapping: the LDT.  So please nix all
the code that adds a new VA region, etc, except to the extent that
some of it consists of valid cleanups in and of itself.  Instead,
please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make
it use a more general "mm local" address range, and then reuse the
same infrastructure for other fancy things.  The code that makes it
KASLR-able should be in its very own patch that applies *after* the
code that makes it all work so that, when the KASLR part causes a
crash, we can bisect it.

+ /*
+ * Faults in process-local memory may be caused by process-local
+ * addresses leaking into other contexts.
+ * tbd: warn and handle gracefully.
+ */
+ if (unlikely(fault_in_process_local(address))) {
+ pr_err("page fault in PROCLOCAL at %lx", address);
+ force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current);
+ }
+

Huh?  Either it's an OOPS or you shouldn't print any special
debugging.  As it is, you're just blatantly leaking the address of the
mm-local range to malicious user programs.

Also, you should IMO consider using this mechanism for kmap_atomic().
Hi, Nadav!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-13  1:30     ` Andy Lutomirski
@ 2019-06-13  1:50       ` Nadav Amit
  2019-06-13 16:16         ` Andy Lutomirski
  2019-06-13  7:52       ` Alexander Graf
  1 sibling, 1 reply; 44+ messages in thread
From: Nadav Amit @ 2019-06-13  1:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Marius Hillenbrand, kvm list, LKML,
	Kernel Hardening, Linux-MM, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Peter Zijlstra

> On Jun 12, 2019, at 6:30 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> 
>>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
>>>> This patch series proposes to introduce a region for what we call
>>>> process-local memory into the kernel's virtual address space.
>>> 
>>> It might be fun to cc some x86 folks on this series.  They might have
>>> some relevant opinions. ;)
>>> 
>>> A few high-level questions:
>>> 
>>> Why go to all this trouble to hide guest state like registers if all the
>>> guest data itself is still mapped?
>>> 
>>> Where's the context-switching code?  Did I just miss it?
>>> 
>>> We've discussed having per-cpu page tables where a given PGD is only in
>>> use from one CPU at a time.  I *think* this scheme still works in such a
>>> case, it just adds one more PGD entry that would have to context-switched.
>> 
>> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.
> 
> I looked at the patch, and it (sensibly) has nothing to do with
> per-cpu PGDs.  So it's in great shape!
> 
> Seriously, though, here are some very high-level review comments:
> 
> Please don't call it "process local", since "process" is meaningless.
> Call it "mm local" or something like that.
> 
> We already have a per-mm kernel mapping: the LDT.  So please nix all
> the code that adds a new VA region, etc, except to the extent that
> some of it consists of valid cleanups in and of itself.  Instead,
> please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make
> it use a more general "mm local" address range, and then reuse the
> same infrastructure for other fancy things.  The code that makes it
> KASLR-able should be in its very own patch that applies *after* the
> code that makes it all work so that, when the KASLR part causes a
> crash, we can bisect it.
> 
> + /*
> + * Faults in process-local memory may be caused by process-local
> + * addresses leaking into other contexts.
> + * tbd: warn and handle gracefully.
> + */
> + if (unlikely(fault_in_process_local(address))) {
> + pr_err("page fault in PROCLOCAL at %lx", address);
> + force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current);
> + }
> +
> 
> Huh?  Either it's an OOPS or you shouldn't print any special
> debugging.  As it is, you're just blatantly leaking the address of the
> mm-local range to malicious user programs.
> 
> Also, you should IMO consider using this mechanism for kmap_atomic().
> Hi, Nadav!

Well, some context for the “hi” would have been helpful. (Do I have a bug
and I still don’t understand it?)

Perhaps you are referring to a use-case for a similar mechanism that I
mentioned before. I did implement something similar (but not the way that you
wanted) to improve the performance of seccomp and system calls when retpolines
are used. I set up a per-mm code area that held code which used direct calls to
invoke seccomp filters and frequently used system calls.

My mechanism, I think, is not really suitable for this use-case. I needed my
code page to be in the same 2GB range as the kernel text/modules, which does
complicate things. For the same reason, it is also limited in the size of
the data/code that it can hold.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-12 18:25 ` [RFC 00/10] Process-local memory allocations for hiding KVM secrets Sean Christopherson
@ 2019-06-13  7:20   ` Alexander Graf
  2019-06-13 10:54   ` Liran Alon
  1 sibling, 0 replies; 44+ messages in thread
From: Alexander Graf @ 2019-06-13  7:20 UTC (permalink / raw)
  To: Sean Christopherson, Marius Hillenbrand
  Cc: kvm, linux-kernel, kernel-hardening, linux-mm, Alexander Graf,
	David Woodhouse


On 12.06.19 20:25, Sean Christopherson wrote:
> On Wed, Jun 12, 2019 at 07:08:24PM +0200, Marius Hillenbrand wrote:
>> The Linux kernel has a global address space that is the same for any
>> kernel code. This address space becomes a liability in a world with
>> processor information leak vulnerabilities, such as L1TF. With the right
>> cache load gadget, an attacker-controlled hyperthread pair can leak
>> arbitrary data via L1TF. Disabling hyperthreading is one recommended
>> mitigation, but it comes with a large performance hit for a wide range
>> of workloads.
>>
>> An alternative mitigation is to not make certain data in the kernel
>> globally visible, but only when the kernel executes in the context of
>> the process where this data belongs to.
>>
>> This patch series proposes to introduce a region for what we call
>> process-local memory into the kernel's virtual address space. Page
>> tables and mappings in that region will be exclusive to one address
>> space, instead of implicitly shared between all kernel address spaces.
>> Any data placed in that region will be out of reach of cache load
>> gadgets that execute in different address spaces. To implement
>> process-local memory, we introduce a new interface kmalloc_proclocal() /
>> kfree_proclocal() that allocates and maps pages exclusively into the
>> current kernel address space. As a first use case, we move architectural
>> state of guest CPUs in KVM out of reach of other kernel address spaces.
> Can you briefly describe what types of attacks this is intended to
> mitigate?  E.g. guest-guest, userspace-guest, etc...  I don't want to
> make comments based on my potentially bad assumptions.


(quickly jumping in for Marius, he's offline today)

The main purpose of this is to protect against leakage of data from one
guest into another guest via speculation gadgets on the host.

The same mechanism can be used to prevent leakage of secrets from one 
host process into another host process though, as host processes 
potentially have access to gadgets via the syscall interface.


Alex


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-12 19:55 ` Dave Hansen
  2019-06-12 20:27   ` Andy Lutomirski
@ 2019-06-13  7:27   ` Alexander Graf
  2019-06-13 14:19     ` Dave Hansen
  1 sibling, 1 reply; 44+ messages in thread
From: Alexander Graf @ 2019-06-13  7:27 UTC (permalink / raw)
  To: Dave Hansen, Marius Hillenbrand, kvm
  Cc: linux-kernel, kernel-hardening, linux-mm, Alexander Graf,
	David Woodhouse, the arch/x86 maintainers, Andy Lutomirski,
	Peter Zijlstra


On 12.06.19 21:55, Dave Hansen wrote:
> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
>> This patch series proposes to introduce a region for what we call
>> process-local memory into the kernel's virtual address space.
> It might be fun to cc some x86 folks on this series.  They might have
> some relevant opinions. ;)
>
> A few high-level questions:
>
> Why go to all this trouble to hide guest state like registers if all the
> guest data itself is still mapped?


(jumping in for Marius, he's offline today)

Glad you asked :). I hope this cover letter explains how guest data itself
can be kept unmapped:

https://lkml.org/lkml/2019/1/31/933


> Where's the context-switching code?  Did I just miss it?


I'm not sure I understand the question. With this mechanism, the global 
linear map pages are just not present anymore, so there is no context 
switching needed. For the process local memory, the page table is 
already mm local, so we don't need to do anything special during context 
switch, no?


Alex


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-13  1:30     ` Andy Lutomirski
  2019-06-13  1:50       ` Nadav Amit
@ 2019-06-13  7:52       ` Alexander Graf
  2019-06-13 16:13         ` Andy Lutomirski
  1 sibling, 1 reply; 44+ messages in thread
From: Alexander Graf @ 2019-06-13  7:52 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen, Nadav Amit
  Cc: Marius Hillenbrand, kvm list, LKML, Kernel Hardening, Linux-MM,
	Alexander Graf, David Woodhouse, the arch/x86 maintainers,
	Peter Zijlstra


On 13.06.19 03:30, Andy Lutomirski wrote:
> On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>
>>
>>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>>
>>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
>>>> This patch series proposes to introduce a region for what we call
>>>> process-local memory into the kernel's virtual address space.
>>> It might be fun to cc some x86 folks on this series.  They might have
>>> some relevant opinions. ;)
>>>
>>> A few high-level questions:
>>>
>>> Why go to all this trouble to hide guest state like registers if all the
>>> guest data itself is still mapped?
>>>
>>> Where's the context-switching code?  Did I just miss it?
>>>
>>> We've discussed having per-cpu page tables where a given PGD is only in
>>> use from one CPU at a time.  I *think* this scheme still works in such a
>>> case, it just adds one more PGD entry that would have to context-switched.
>> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.
> I looked at the patch, and it (sensibly) has nothing to do with
> per-cpu PGDs.  So it's in great shape!


Thanks a lot for the very timely review!


>
> Seriously, though, here are some very high-level review comments:
>
> Please don't call it "process local", since "process" is meaningless.
> Call it "mm local" or something like that.


Naming is hard, yes :). Is "mmlocal" obvious enough to most readers? I'm 
not fully convinced, but I don't find it better or worse than proclocal. 
So whatever flies with the majority works for me :).


> We already have a per-mm kernel mapping: the LDT.  So please nix all
> the code that adds a new VA region, etc, except to the extent that
> some of it consists of valid cleanups in and of itself.  Instead,
> please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make
> it use a more general "mm local" address range, and then reuse the
> same infrastructure for other fancy things.  The code that makes it


I don't fully understand how those two are related. Are you referring to 
the KPTI enabling code in there? That just maps the LDT at the same 
address in both kernel and user mappings, no?

So you're suggesting we use the new mm local address as LDT address 
instead and have that mapped in both kernel and user space? This patch 
set today maps "mm local" data only in kernel space, not in user space, 
as it's meant for kernel data structures.

So I'm not really seeing the path to adapt any of the LDT logic to this. 
Could you please elaborate?


> KASLR-able should be in its very own patch that applies *after* the
> code that makes it all work so that, when the KASLR part causes a
> crash, we can bisect it.


That sounds very reasonable, yes.


>
> + /*
> + * Faults in process-local memory may be caused by process-local
> + * addresses leaking into other contexts.
> + * tbd: warn and handle gracefully.
> + */
> + if (unlikely(fault_in_process_local(address))) {
> + pr_err("page fault in PROCLOCAL at %lx", address);
> + force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current);
> + }
> +
>
> Huh?  Either it's an OOPS or you shouldn't print any special
> debugging.  As it is, you're just blatantly leaking the address of the
> mm-local range to malicious user programs.


Yes, this is a left over bit from an idea that we discussed and rejected 
yesterday. The idea was to have a DEBUG config option that allows 
proclocal memory to leak into other processes, but print debug output so 
that it's easier to catch bugs. After discussion, I think we managed to 
convince everyone that an OOPS is the better tool to find bugs :).

Any trace of this will disappear in the next version.
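
Just to illustrate the direction (a sketch only, not the actual next 
version; fault_in_process_local() is the check from the snippet quoted 
above):

        /*
         * A process-local address faulting in a foreign context means a
         * proclocal pointer leaked out of its mm. That is a kernel bug,
         * so don't report anything to userspace and don't print the
         * address -- just die.
         */
        if (unlikely(fault_in_process_local(address)))
                BUG();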


>
> Also, you should IMO consider using this mechanism for kmap_atomic().


It might make sense to use it for kmap_atomic() for debug purposes, as 
it ensures that other users can no longer access the same mapping 
through the linear map. However, it does come at quite a big cost, as we 
need to shoot down the TLB of all other threads in the system. So I'm 
not sure it's of general value?


Alex


> Hi, Nadav!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-12 18:25 ` [RFC 00/10] Process-local memory allocations for hiding KVM secrets Sean Christopherson
  2019-06-13  7:20   ` Alexander Graf
@ 2019-06-13 10:54   ` Liran Alon
  1 sibling, 0 replies; 44+ messages in thread
From: Liran Alon @ 2019-06-13 10:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marius Hillenbrand, kvm, linux-kernel, kernel-hardening,
	linux-mm, Alexander Graf, David Woodhouse



> On 12 Jun 2019, at 21:25, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Wed, Jun 12, 2019 at 07:08:24PM +0200, Marius Hillenbrand wrote:
>> The Linux kernel has a global address space that is the same for any
>> kernel code. This address space becomes a liability in a world with
>> processor information leak vulnerabilities, such as L1TF. With the right
>> cache load gadget, an attacker-controlled hyperthread pair can leak
>> arbitrary data via L1TF. Disabling hyperthreading is one recommended
>> mitigation, but it comes with a large performance hit for a wide range
>> of workloads.
>> 
>> An alternative mitigation is to not make certain data in the kernel
>> globally visible, but only when the kernel executes in the context of
>> the process where this data belongs to.
>> 
>> This patch series proposes to introduce a region for what we call
>> process-local memory into the kernel's virtual address space. Page
>> tables and mappings in that region will be exclusive to one address
>> space, instead of implicitly shared between all kernel address spaces.
>> Any data placed in that region will be out of reach of cache load
>> gadgets that execute in different address spaces. To implement
>> process-local memory, we introduce a new interface kmalloc_proclocal() /
>> kfree_proclocal() that allocates and maps pages exclusively into the
>> current kernel address space. As a first use case, we move architectural
>> state of guest CPUs in KVM out of reach of other kernel address spaces.
> 
> Can you briefly describe what types of attacks this is intended to
> mitigate?  E.g. guest-guest, userspace-guest, etc...  I don't want to
> make comments based on my potentially bad assumptions.

I think I can assist in the explanation.

Consider the following scenario:
1) Hyperthread A in a CPU core runs in the guest and triggers a VMExit which is handled by the host kernel.
While hyperthread A runs the VMExit handler, it populates the CPU core's cache / internal resources (e.g. MDS buffers)
with some sensitive data it has speculatively or architecturally accessed.
2) While hyperthread A runs in the host kernel, hyperthread B on the same CPU core runs in the guest and uses
some CPU speculative-execution vulnerability to leak the sensitive host data populated by hyperthread A
into the CPU core's cache / internal resources.

Current CPU microcode mitigations (L1D/MDS flush) only handle the case of a single hyperthread and don’t
provide a mechanism to mitigate this hyperthreading attack scenario.

Assuming there is some guest-triggerable speculative-load gadget in some VMExit path,
it can be used to force any data that is mapped into the kernel address space to be loaded into a CPU resource that is subject to leak.
Therefore, there have been multiple attempts to reduce the amount of sensitive information mapped into the kernel address space
that is accessible by this VMExit path.

One attempt was XPFO, which removes from the kernel direct map any page that is currently used only by userspace.
Unfortunately, XPFO currently exhibits multiple performance issues that make it impractical, as far as I know.

Another attempt is this patch series, which removes the state of other guests' vCPUs from a given vCPU thread's
host kernel address space. That is very specific, but I personally have additional ideas on how this patch series can be further used.
For example, vhost-net needs to kmap() entire guest memory into kernel space to write ingress packet data into guest memory.
Thus, a vCPU thread's kernel address space currently maps the entire memory of another guest, which can be leaked using the technique described above.
Therefore, it should be useful to also move this kmap() into the process-local kernel virtual address region.

One could argue, however, that there is still a much bigger issue because the kernel direct map maps all physical pages that the kernel
manages (i.e. that have a struct page) into the kernel virtual address space, and all of those pages can theoretically be leaked.
However, this could be handled by complementary techniques such as booting the host kernel with “mem=X” and mapping guest memory
by directly mmap()ing the relevant portion of /dev/mem.
That is probably what AWS does, given these upstream KVM patches they have contributed:
bd53cb35a3e9 X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs
e45adf665a53 KVM: Introduce a new guest mapping API
0c55671f84ff kvm, x86: Properly check whether a pfn is an MMIO or not

Also note that when using such a “mem=X” technique, you can also avoid performance penalties introduced by the CPU microcode mitigations.
E.g. you can avoid doing an L1D flush on VMEntry if the VMExit handler ran only in the kernel and didn’t context-switch, as you assume the kernel
address space doesn’t map any sensitive host data.
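
For illustration, the userspace side of that “mem=X” approach is just an
mmap() of the reserved physical range (a sketch; the offset/length are
arbitrary example values, and /dev/mem access must be permitted by the host
configuration):

#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map physical memory above the kernel-managed "mem=X" limit so it can be
 * handed to the VMM as guest RAM without ever appearing in the kernel
 * direct map. */
static void *map_guest_ram(off_t phys_off, size_t len)
{
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        void *mem;

        if (fd < 0)
                return NULL;
        mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys_off);
        close(fd);
        return mem == MAP_FAILED ? NULL : mem;
}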

It’s also worth mentioning that another alternative to this “mem=X” technique that I have attempted
was to create an isolated address space that is only used when running KVM VMExit handlers.
For more information, refer to:
https://lkml.org/lkml/2019/5/13/515
(See some of my comments on that thread)

This is my 2 cents on this, at least.

-Liran



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-13  7:27   ` Alexander Graf
@ 2019-06-13 14:19     ` Dave Hansen
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Hansen @ 2019-06-13 14:19 UTC (permalink / raw)
  To: Alexander Graf, Marius Hillenbrand, kvm
  Cc: linux-kernel, kernel-hardening, linux-mm, Alexander Graf,
	David Woodhouse, the arch/x86 maintainers, Andy Lutomirski,
	Peter Zijlstra

On 6/13/19 12:27 AM, Alexander Graf wrote:
>> Where's the context-switching code?  Did I just miss it?
> 
> I'm not sure I understand the question. With this mechanism, the global
> linear map pages are just not present anymore, so there is no context
> switching needed. For the process local memory, the page table is
> already mm local, so we don't need to do anything special during context
> switch, no?

Thanks for explaining, I was just confused.

Andy reminded me when comparing it to the LDT area: since this area is
per-mm/pgd and we context switch that *obviously* there's no extra work
to do at context switch time, as long as the area is marked non-Global.
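
For completeness: "non-Global" here just means the region's pages are
mapped without _PAGE_GLOBAL, so their TLB entries die naturally when CR3
changes. Sketched (not taken from the patches):

        /* protections for process-local pages: ordinary kernel RW
         * mapping, minus the global bit */
        pgprot_t proclocal_prot =
                __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);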

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-13  7:52       ` Alexander Graf
@ 2019-06-13 16:13         ` Andy Lutomirski
  2019-06-13 16:20           ` Dave Hansen
  0 siblings, 1 reply; 44+ messages in thread
From: Andy Lutomirski @ 2019-06-13 16:13 UTC (permalink / raw)
  To: Alexander Graf, Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, Marius Hillenbrand, kvm list, LKML,
	Kernel Hardening, Linux-MM, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Peter Zijlstra

On Thu, Jun 13, 2019 at 12:53 AM Alexander Graf <graf@amazon.com> wrote:
>
>
> On 13.06.19 03:30, Andy Lutomirski wrote:
> > On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >>
> >>
> >>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >>>
> >>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> >>>> This patch series proposes to introduce a region for what we call
> >>>> process-local memory into the kernel's virtual address space.
> >>> It might be fun to cc some x86 folks on this series.  They might have
> >>> some relevant opinions. ;)
> >>>
> >>> A few high-level questions:
> >>>
> >>> Why go to all this trouble to hide guest state like registers if all the
> >>> guest data itself is still mapped?
> >>>
> >>> Where's the context-switching code?  Did I just miss it?
> >>>
> >>> We've discussed having per-cpu page tables where a given PGD is only in
> >>> use from one CPU at a time.  I *think* this scheme still works in such a
> >>> case, it just adds one more PGD entry that would have to context-switched.
> >> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.
> > I looked at the patch, and it (sensibly) has nothing to do with
> > per-cpu PGDs.  So it's in great shape!
>
>
> Thanks a lot for the very timely review!
>
>
> >
> > Seriously, though, here are some very high-level review comments:
> >
> > Please don't call it "process local", since "process" is meaningless.
> > Call it "mm local" or something like that.
>
>
> Naming is hard, yes :). Is "mmlocal" obvious enough to most readers? I'm
> not fully convinced, but I don't find it better or worse than proclocal.
> So whatever flies with the majority works for me :).

My objection to "proc" is that we have many concepts of "process" in
the kernel: task, mm, signal handling context, etc.  These memory
ranges are specifically local to the mm.  Admittedly, it would be very
surprising to have memory that is local to a signal handling context,
but still.

>
>
> > We already have a per-mm kernel mapping: the LDT.  So please nix all
> > the code that adds a new VA region, etc, except to the extent that
> > some of it consists of valid cleanups in and of itself.  Instead,
> > please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make
> > it use a more general "mm local" address range, and then reuse the
> > same infrastructure for other fancy things.  The code that makes it
>
>
> I don't fully understand how those two are related. Are you referring to
> the KPTI enabling code in there? That just maps the LDT at the same
> address in both kernel and user mappings, no?

The relevance here is that, when KPTI is on, the exact same address
refers to a different LDT in different mms, so it's genuinely an
mm-local mapping.  It works just like yours: a whole top-level paging
entry is reserved for it.  What I'm suggesting is that, when you're
all done, the LDT should be more or less just one more mm-local
mapping, with two caveats.  First, the LDT needs special KPTI
handling, but that's fine.  Second, the LDT address is visible to user
code on non-UMIP systems, so you'll have to decide if that's okay.  My
suggestion is to have the LDT be the very first address in the
mm-local range and then to randomize everything else in the mm-local
range.

>
> So you're suggesting we use the new mm local address as LDT address
> instead and have that mapped in both kernel and user space? This patch
> set today maps "mm local" data only in kernel space, not in user space,
> as it's meant for kernel data structures.

Yes, exactly.

>
> So I'm not really seeing the path to adapt any of the LDT logic to this.
> Could you please elaborate?
>
>
> > KASLR-able should be in its very own patch that applies *after* the
> > code that makes it all work so that, when the KASLR part causes a
> > crash, we can bisect it.
>
>
> That sounds very reasonable, yes.
>
>
> >
> > + /*
> > + * Faults in process-local memory may be caused by process-local
> > + * addresses leaking into other contexts.
> > + * tbd: warn and handle gracefully.
> > + */
> > + if (unlikely(fault_in_process_local(address))) {
> > + pr_err("page fault in PROCLOCAL at %lx", address);
> > + force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current);
> > + }
> > +
> >
> > Huh?  Either it's an OOPS or you shouldn't print any special
> > debugging.  As it is, you're just blatantly leaking the address of the
> > mm-local range to malicious user programs.
>
>
> Yes, this is a left over bit from an idea that we discussed and rejected
> yesterday. The idea was to have a DEBUG config option that allows
> proclocal memory to leak into other processes, but print debug output so
> that it's easier to catch bugs. After discussion, I think we managed to
> convince everyone that an OOPS is the better tool to find bugs :).
>
> Any trace of this will disappear in the next version.
>
>
> >
> > Also, you should IMO consider using this mechanism for kmap_atomic().
>
>
> It might make sense to use it for kmap_atomic() for debug purposes, as
> it ensures that other users can no longer access the same mapping
> through the linear map. However, it does come at quite a big cost, as we
> need to shoot down the TLB of all other threads in the system. So I'm
> not sure it's of general value?

What I meant was that kmap_atomic() could use mm-local memory so that
it doesn't need to do a global shootdown.  But I guess it's not
actually used for real on 64-bit, so this is mostly moot.  Are you
planning to support mm-local on 32-bit?

--Andy
>
>
> Alex
>
>
> > Hi, Nadav!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-13  1:50       ` Nadav Amit
@ 2019-06-13 16:16         ` Andy Lutomirski
  0 siblings, 0 replies; 44+ messages in thread
From: Andy Lutomirski @ 2019-06-13 16:16 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, Marius Hillenbrand, kvm list, LKML,
	Kernel Hardening, Linux-MM, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Peter Zijlstra

On Wed, Jun 12, 2019 at 6:50 PM Nadav Amit <namit@vmware.com> wrote:
>
> > On Jun 12, 2019, at 6:30 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >>>
> >>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> >>>> This patch series proposes to introduce a region for what we call
> >>>> process-local memory into the kernel's virtual address space.
> >>>
> >>> It might be fun to cc some x86 folks on this series.  They might have
> >>> some relevant opinions. ;)
> >>>
> >>> A few high-level questions:
> >>>
> >>> Why go to all this trouble to hide guest state like registers if all the
> >>> guest data itself is still mapped?
> >>>
> >>> Where's the context-switching code?  Did I just miss it?
> >>>
> >>> We've discussed having per-cpu page tables where a given PGD is only in
> >>> use from one CPU at a time.  I *think* this scheme still works in such a
> >>> case, it just adds one more PGD entry that would have to context-switched.
> >>
> >> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.
> >
> > I looked at the patch, and it (sensibly) has nothing to do with
> > per-cpu PGDs.  So it's in great shape!
> >
> > Seriously, though, here are some very high-level review comments:
> >
> > Please don't call it "process local", since "process" is meaningless.
> > Call it "mm local" or something like that.
> >
> > We already have a per-mm kernel mapping: the LDT.  So please nix all
> > the code that adds a new VA region, etc, except to the extent that
> > some of it consists of valid cleanups in and of itself.  Instead,
> > please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make
> > it use a more general "mm local" address range, and then reuse the
> > same infrastructure for other fancy things.  The code that makes it
> > KASLR-able should be in its very own patch that applies *after* the
> > code that makes it all work so that, when the KASLR part causes a
> > crash, we can bisect it.
> >
> > + /*
> > + * Faults in process-local memory may be caused by process-local
> > + * addresses leaking into other contexts.
> > + * tbd: warn and handle gracefully.
> > + */
> > + if (unlikely(fault_in_process_local(address))) {
> > + pr_err("page fault in PROCLOCAL at %lx", address);
> > + force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current);
> > + }
> > +
> >
> > Huh?  Either it's an OOPS or you shouldn't print any special
> > debugging.  As it is, you're just blatantly leaking the address of the
> > mm-local range to malicious user programs.
> >
> > Also, you should IMO consider using this mechanism for kmap_atomic().
> > Hi, Nadav!
>
> Well, some context for the “hi” would have been helpful. (Do I have a bug
> and I still don’t understand it?)

Fair enough :)

>
> Perhaps you are thinking of a use-case for a similar mechanism that I mentioned
> before. I did implement something similar (but not the way that you wanted)
> to improve the performance of seccomp and system-calls when retpolines are
> used. I set up a per-mm code area that held code that used direct calls to
> invoke seccomp filters and frequently used system-calls.
>
> My mechanism, I think, is not really suitable for this use-case. I needed my
> code-page to be in the same 2GB range as the kernel text/modules, which does
> complicate things. For the same reason, it is also limited in the size of
> the data/code that it can hold.
>

I actually meant the opposite.  If we had a general-purpose per-mm
kernel address range, could it be used to optimize kmap_atomic() by
limiting the scope of any shootdowns?  As a rough sketch, we'd have
some kmap_atomic slots for each cpu *in the mm-local region*.  I'm not
entirely sure this is a win.
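
A rough sketch of that idea (all helper names invented): one slot per CPU
inside the mm-local range, so tearing down a mapping only needs a local
INVLPG instead of an all-CPU shootdown.

static void *mmlocal_kmap_atomic(struct page *page)
{
        int cpu = smp_processor_id();
        unsigned long vaddr = mmlocal_kmap_base() + cpu * PAGE_SIZE; /* invented */
        pte_t *ptep = mmlocal_kmap_pte(cpu);                         /* invented */

        set_pte(ptep, mk_pte(page, PAGE_KERNEL));
        return (void *)vaddr;
}

static void mmlocal_kunmap_atomic(void *addr)
{
        unsigned long vaddr = (unsigned long)addr;

        set_pte(mmlocal_kmap_pte(smp_processor_id()), __pte(0));
        __flush_tlb_one_kernel(vaddr);  /* local flush only, no IPIs */
}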

--Andy

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-13 16:13         ` Andy Lutomirski
@ 2019-06-13 16:20           ` Dave Hansen
  2019-06-13 17:29             ` Nadav Amit
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2019-06-13 16:20 UTC (permalink / raw)
  To: Andy Lutomirski, Alexander Graf, Nadav Amit
  Cc: Marius Hillenbrand, kvm list, LKML, Kernel Hardening, Linux-MM,
	Alexander Graf, David Woodhouse, the arch/x86 maintainers,
	Peter Zijlstra

On 6/13/19 9:13 AM, Andy Lutomirski wrote:
>> It might make sense to use it for kmap_atomic() for debug purposes, as
>> it ensures that other users can no longer access the same mapping
>> through the linear map. However, it does come at quite a big cost, as we
>> need to shoot down the TLB of all other threads in the system. So I'm
>> not sure it's of general value?
> What I meant was that kmap_atomic() could use mm-local memory so that
> it doesn't need to do a global shootdown.  But I guess it's not
> actually used for real on 64-bit, so this is mostly moot.  Are you
> planning to support mm-local on 32-bit?

Do we *do* global shootdowns on kmap_atomic()s on 32-bit?  I thought we
used entirely per-cpu addresses, so a stale entry from another CPU can
get loaded in the TLB speculatively but it won't ever actually get used.
 I think it goes:

kunmap_atomic() ->
__kunmap_atomic() ->
kpte_clear_flush() ->
__flush_tlb_one_kernel() ->
__flush_tlb_one_user() ->
__native_flush_tlb_one_user() ->
invlpg

The per-cpu address calculation is visible in kmap_atomic_prot():

        idx = type + KM_TYPE_NR*smp_processor_id();
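
and the rest of that function (quoting from memory, so double-check against
the tree) installs the mapping at a per-cpu fixmap slot, which is why only
the local flush shown in the call chain above is needed on kunmap:

        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
        set_pte(kmap_pte - idx, mk_pte(page, prot));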

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-13 16:20           ` Dave Hansen
@ 2019-06-13 17:29             ` Nadav Amit
  2019-06-13 17:49               ` Dave Hansen
  0 siblings, 1 reply; 44+ messages in thread
From: Nadav Amit @ 2019-06-13 17:29 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Alexander Graf, Marius Hillenbrand, kvm list, LKML,
	Kernel Hardening, Linux-MM, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Peter Zijlstra

> On Jun 13, 2019, at 9:20 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 6/13/19 9:13 AM, Andy Lutomirski wrote:
>>> It might make sense to use it for kmap_atomic() for debug purposes, as
>>> it ensures that other users can no longer access the same mapping
>>> through the linear map. However, it does come at quite a big cost, as we
>>> need to shoot down the TLB of all other threads in the system. So I'm
>>> not sure it's of general value?
>> What I meant was that kmap_atomic() could use mm-local memory so that
>> it doesn't need to do a global shootdown.  But I guess it's not
>> actually used for real on 64-bit, so this is mostly moot.  Are you
>> planning to support mm-local on 32-bit?
> 
> Do we *do* global shootdowns on kmap_atomic()s on 32-bit?  I thought we
> used entirely per-cpu addresses, so a stale entry from another CPU can
> get loaded in the TLB speculatively but it won't ever actually get used.
> I think it goes:
> 
> kunmap_atomic() ->
> __kunmap_atomic() ->
> kpte_clear_flush() ->
> __flush_tlb_one_kernel() ->
> __flush_tlb_one_user() ->
> __native_flush_tlb_one_user() ->
> invlpg
> 
> The per-cpu address calculation is visible in kmap_atomic_prot():
> 
>        idx = type + KM_TYPE_NR*smp_processor_id();

From a security point-of-view, having such an entry is still not too good,
since the mapping protection might override the default protection. This
might lead to potential W+X cases, for example, that might stay for a long
time if they are speculatively cached in the TLB and not invalidated upon
kunmap_atomic().

Having said that, I am not too excited to deal with this issue. Do people
still care about x86/32-bit? In addition, if kunmap_atomic() is used when
IRQs are disabled, sending a TLB shootdown during kunmap_atomic() can cause
a deadlock.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-13 17:29             ` Nadav Amit
@ 2019-06-13 17:49               ` Dave Hansen
  2019-06-13 20:05                 ` Sean Christopherson
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2019-06-13 17:49 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski
  Cc: Alexander Graf, Marius Hillenbrand, kvm list, LKML,
	Kernel Hardening, Linux-MM, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Peter Zijlstra

On 6/13/19 10:29 AM, Nadav Amit wrote:
> Having said that, I am not too excited to deal with this issue. Do
> people still care about x86/32-bit?
No, not really.  It's just fun to try to give history lessons about the
good old days. ;)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-13 17:49               ` Dave Hansen
@ 2019-06-13 20:05                 ` Sean Christopherson
  0 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2019-06-13 20:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Nadav Amit, Andy Lutomirski, Alexander Graf, Marius Hillenbrand,
	kvm list, LKML, Kernel Hardening, Linux-MM, Alexander Graf,
	David Woodhouse, the arch/x86 maintainers, Peter Zijlstra

On Thu, Jun 13, 2019 at 10:49:42AM -0700, Dave Hansen wrote:
> On 6/13/19 10:29 AM, Nadav Amit wrote:
> > Having said that, I am not too excited to deal with this issue. Do
> > people still care about x86/32-bit?
> No, not really.

Especially not for KVM, given the number of times 32-bit KVM has been
broken recently without anyone noticing for several kernel releases.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-12 20:27   ` Andy Lutomirski
  2019-06-12 20:41     ` Dave Hansen
  2019-06-13  1:30     ` Andy Lutomirski
@ 2019-06-14 14:21     ` Thomas Gleixner
  2019-06-16 22:18       ` Andy Lutomirski
  2019-06-17  7:38       ` Alexander Graf
  2 siblings, 2 replies; 44+ messages in thread
From: Thomas Gleixner @ 2019-06-14 14:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Marius Hillenbrand, kvm, linux-kernel,
	kernel-hardening, linux-mm, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Andy Lutomirski, Peter Zijlstra

On Wed, 12 Jun 2019, Andy Lutomirski wrote:
> > On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > 
> >> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> >> This patch series proposes to introduce a region for what we call
> >> process-local memory into the kernel's virtual address space. 
> > 
> > It might be fun to cc some x86 folks on this series.  They might have
> > some relevant opinions. ;)
> > 
> > A few high-level questions:
> > 
> > Why go to all this trouble to hide guest state like registers if all the
> > guest data itself is still mapped?
> > 
> > Where's the context-switching code?  Did I just miss it?
> > 
> > We've discussed having per-cpu page tables where a given PGD is only in
> > use from one CPU at a time.  I *think* this scheme still works in such a
> > case, it just adds one more PGD entry that would have to context-switched.
>
> Fair warning: Linus is on record as absolutely hating this idea. He might
> change his mind, but it’s an uphill battle.

Yes I know, but as a benefit we could get rid of all the GSBASE horrors in
the entry code as we could just put the percpu space into the local PGD.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-14 14:21     ` Thomas Gleixner
@ 2019-06-16 22:18       ` Andy Lutomirski
  2019-06-16 22:28         ` Thomas Gleixner
  2019-06-17  7:38       ` Alexander Graf
  1 sibling, 1 reply; 44+ messages in thread
From: Andy Lutomirski @ 2019-06-16 22:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Marius Hillenbrand, kvm list, LKML,
	Kernel Hardening, Linux-MM, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Andy Lutomirski, Peter Zijlstra

On Fri, Jun 14, 2019 at 7:21 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, 12 Jun 2019, Andy Lutomirski wrote:
> > > On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > >> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> > >> This patch series proposes to introduce a region for what we call
> > >> process-local memory into the kernel's virtual address space.
> > >
> > > It might be fun to cc some x86 folks on this series.  They might have
> > > some relevant opinions. ;)
> > >
> > > A few high-level questions:
> > >
> > > Why go to all this trouble to hide guest state like registers if all the
> > > guest data itself is still mapped?
> > >
> > > Where's the context-switching code?  Did I just miss it?
> > >
> > > We've discussed having per-cpu page tables where a given PGD is only in
> > > use from one CPU at a time.  I *think* this scheme still works in such a
> > > case, it just adds one more PGD entry that would have to context-switched.
> >
> > Fair warning: Linus is on record as absolutely hating this idea. He might
> > change his mind, but it’s an uphill battle.
>
> Yes I know, but as a benefit we could get rid of all the GSBASE horrors in
> the entry code as we could just put the percpu space into the local PGD.
>

I have personally suggested this to Linus on a couple of occasions,
and he seemed quite skeptical.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-16 22:18       ` Andy Lutomirski
@ 2019-06-16 22:28         ` Thomas Gleixner
  0 siblings, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2019-06-16 22:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Marius Hillenbrand, kvm list, LKML,
	Kernel Hardening, Linux-MM, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Peter Zijlstra

On Sun, 16 Jun 2019, Andy Lutomirski wrote:
> On Fri, Jun 14, 2019 at 7:21 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Wed, 12 Jun 2019, Andy Lutomirski wrote:
> > >
> > > Fair warning: Linus is on record as absolutely hating this idea. He might
> > > change his mind, but it’s an uphill battle.
> >
> > Yes I know, but as a benefit we could get rid of all the GSBASE horrors in
> > the entry code as we could just put the percpu space into the local PGD.
> >
> 
> I have personally suggested this to Linus on a couple of occasions,
> and he seemed quite skeptical.

The only way to find out is the good old: numbers talk ....

So someone has to bite the bullet, implement it and figure out whether it's
bollocks or not. :)

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-14 14:21     ` Thomas Gleixner
  2019-06-16 22:18       ` Andy Lutomirski
@ 2019-06-17  7:38       ` Alexander Graf
  2019-06-17 15:50         ` Dave Hansen
  1 sibling, 1 reply; 44+ messages in thread
From: Alexander Graf @ 2019-06-17  7:38 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirski
  Cc: Dave Hansen, Marius Hillenbrand, kvm, linux-kernel,
	kernel-hardening, linux-mm, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Andy Lutomirski, Peter Zijlstra


On 14.06.19 16:21, Thomas Gleixner wrote:
> On Wed, 12 Jun 2019, Andy Lutomirski wrote:
>>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>>
>>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
>>>> This patch series proposes to introduce a region for what we call
>>>> process-local memory into the kernel's virtual address space.
>>> It might be fun to cc some x86 folks on this series.  They might have
>>> some relevant opinions. ;)
>>>
>>> A few high-level questions:
>>>
>>> Why go to all this trouble to hide guest state like registers if all the
>>> guest data itself is still mapped?
>>>
>>> Where's the context-switching code?  Did I just miss it?
>>>
>>> We've discussed having per-cpu page tables where a given PGD is only in
>>> use from one CPU at a time.  I *think* this scheme still works in such a
>>> case, it just adds one more PGD entry that would have to context-switched.
>> Fair warning: Linus is on record as absolutely hating this idea. He might
>> change his mind, but it’s an uphill battle.
> Yes I know, but as a benefit we could get rid of all the GSBASE horrors in
> the entry code as we could just put the percpu space into the local PGD.


Would that mean that with Meltdown-affected CPUs we open up speculation 
attacks against the mmlocal memory from KVM user space?


Alex


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-17  7:38       ` Alexander Graf
@ 2019-06-17 15:50         ` Dave Hansen
  2019-06-17 15:54           ` Andy Lutomirski
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2019-06-17 15:50 UTC (permalink / raw)
  To: Alexander Graf, Thomas Gleixner, Andy Lutomirski
  Cc: Marius Hillenbrand, kvm, linux-kernel, kernel-hardening,
	linux-mm, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Andy Lutomirski, Peter Zijlstra

On 6/17/19 12:38 AM, Alexander Graf wrote:
>> Yes I know, but as a benefit we could get rid of all the GSBASE
>> horrors in
>> the entry code as we could just put the percpu space into the local PGD.
> 
> Would that mean that with Meltdown affected CPUs we open speculation
> attacks against the mmlocal memory from KVM user space?

Not necessarily.  There would likely be a _set_ of local PGDs.  We could
still have a pair of PTI PGDs just like we do now; they'd just be a local
PGD pair.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-17 15:50         ` Dave Hansen
@ 2019-06-17 15:54           ` Andy Lutomirski
  2019-06-17 16:03             ` Dave Hansen
  0 siblings, 1 reply; 44+ messages in thread
From: Andy Lutomirski @ 2019-06-17 15:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alexander Graf, Thomas Gleixner, Marius Hillenbrand, kvm list,
	LKML, Kernel Hardening, Linux-MM, Alexander Graf,
	David Woodhouse, the arch/x86 maintainers, Andy Lutomirski,
	Peter Zijlstra

On Mon, Jun 17, 2019 at 8:50 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/17/19 12:38 AM, Alexander Graf wrote:
> >> Yes I know, but as a benefit we could get rid of all the GSBASE
> >> horrors in
> >> the entry code as we could just put the percpu space into the local PGD.
> >
> > Would that mean that with Meltdown affected CPUs we open speculation
> > attacks against the mmlocal memory from KVM user space?
>
> Not necessarily.  There would likely be a _set_ of local PGDs.  We could
> still have pair of PTI PGDs just like we do know, they'd just be a local
> PGD pair.
>

Unfortunately, this would mean that we need to sync twice as many
top-level entries when we context switch.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-17 15:54           ` Andy Lutomirski
@ 2019-06-17 16:03             ` Dave Hansen
  2019-06-17 16:14               ` Andy Lutomirski
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2019-06-17 16:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexander Graf, Thomas Gleixner, Marius Hillenbrand, kvm list,
	LKML, Kernel Hardening, Linux-MM, Alexander Graf,
	David Woodhouse, the arch/x86 maintainers, Peter Zijlstra

On 6/17/19 8:54 AM, Andy Lutomirski wrote:
>>> Would that mean that with Meltdown affected CPUs we open speculation
>>> attacks against the mmlocal memory from KVM user space?
>> Not necessarily.  There would likely be a _set_ of local PGDs.  We could
>> still have pair of PTI PGDs just like we do know, they'd just be a local
>> PGD pair.
>>
> Unfortunately, this would mean that we need to sync twice as many
> top-level entries when we context switch.

Yeah, PTI sucks. :)

For anyone following along at home, I'm going to go off into crazy
per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)

But, I was thinking we could get away with not doing this on _every_
context switch at least.  For instance, couldn't 'struct tlb_context'
have a PGD pointer (or two with PTI) in addition to the TLB info?  That
way we only do the copying when we change the context.  Or does that tie
the implementation up too much with PCIDs?
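
Roughly like this, I suppose (a sketch; the real struct tlb_context in
arch/x86/include/asm/tlbflush.h only has ctx_id/tlb_gen today, and the new
field names are invented):

struct tlb_context {
        u64 ctx_id;
        u64 tlb_gen;
        /* new: top-level table(s) backing this ASID slot; only re-copied
         * when the slot is repopulated with a different mm, not on every
         * context switch */
        pgd_t *pgd;
        pgd_t *user_pgd;        /* second copy when PTI is enabled */
};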

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-17 16:03             ` Dave Hansen
@ 2019-06-17 16:14               ` Andy Lutomirski
  2019-06-17 16:53                 ` Nadav Amit
  0 siblings, 1 reply; 44+ messages in thread
From: Andy Lutomirski @ 2019-06-17 16:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Alexander Graf, Thomas Gleixner,
	Marius Hillenbrand, kvm list, LKML, Kernel Hardening, Linux-MM,
	Alexander Graf, David Woodhouse, the arch/x86 maintainers,
	Peter Zijlstra

On Mon, Jun 17, 2019 at 9:09 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/17/19 8:54 AM, Andy Lutomirski wrote:
> >>> Would that mean that with Meltdown affected CPUs we open speculation
> >>> attacks against the mmlocal memory from KVM user space?
> >> Not necessarily.  There would likely be a _set_ of local PGDs.  We could
> >> still have pair of PTI PGDs just like we do know, they'd just be a local
> >> PGD pair.
> >>
> > Unfortunately, this would mean that we need to sync twice as many
> > top-level entries when we context switch.
>
> Yeah, PTI sucks. :)
>
> For anyone following along at home, I'm going to go off into crazy
> per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)
>
> But, I was thinking we could get away with not doing this on _every_
> context switch at least.  For instance, couldn't 'struct tlb_context'
> have PGD pointer (or two with PTI) in addition to the TLB info?  That
> way we only do the copying when we change the context.  Or does that tie
> the implementation up too much with PCIDs?

Hmm, that seems entirely reasonable.  I think the nasty bit would be
figuring out all the interactions with PV TLB flushing.  PV TLB
flushes already don't play so well with PCID tracking, and this will
make it worse.  We probably need to rewrite all that code regardless.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-17 16:14               ` Andy Lutomirski
@ 2019-06-17 16:53                 ` Nadav Amit
  2019-06-17 18:07                   ` Dave Hansen
  0 siblings, 1 reply; 44+ messages in thread
From: Nadav Amit @ 2019-06-17 16:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Alexander Graf, Thomas Gleixner, Marius Hillenbrand,
	kvm list, LKML, Kernel Hardening, Linux-MM, Alexander Graf,
	David Woodhouse, the arch/x86 maintainers, Peter Zijlstra

> On Jun 17, 2019, at 9:14 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, Jun 17, 2019 at 9:09 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 6/17/19 8:54 AM, Andy Lutomirski wrote:
>>>>> Would that mean that with Meltdown affected CPUs we open speculation
>>>>> attacks against the mmlocal memory from KVM user space?
>>>> Not necessarily.  There would likely be a _set_ of local PGDs.  We could
>>>> still have pair of PTI PGDs just like we do know, they'd just be a local
>>>> PGD pair.
>>> Unfortunately, this would mean that we need to sync twice as many
>>> top-level entries when we context switch.
>> 
>> Yeah, PTI sucks. :)
>> 
>> For anyone following along at home, I'm going to go off into crazy
>> per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)
>> 
>> But, I was thinking we could get away with not doing this on _every_
>> context switch at least.  For instance, couldn't 'struct tlb_context'
>> have PGD pointer (or two with PTI) in addition to the TLB info?  That
>> way we only do the copying when we change the context.  Or does that tie
>> the implementation up too much with PCIDs?
> 
> Hmm, that seems entirely reasonable.  I think the nasty bit would be
> figuring out all the interactions with PV TLB flushing.  PV TLB
> flushes already don't play so well with PCID tracking, and this will
> make it worse.  We probably need to rewrite all that code regardless.

How is PCID (as you implemented) related to TLB flushing of kernel (not
user) PTEs? These kernel PTEs would be global, so they would be invalidated
from all the address-spaces using INVLPG, I presume. No?

Having said that, the fact that every hypervisor implements PV-TLB
completely differently might be unwarranted.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-17 16:53                 ` Nadav Amit
@ 2019-06-17 18:07                   ` Dave Hansen
  2019-06-17 18:45                     ` Konrad Rzeszutek Wilk
  2019-06-17 18:50                     ` Nadav Amit
  0 siblings, 2 replies; 44+ messages in thread
From: Dave Hansen @ 2019-06-17 18:07 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski
  Cc: Alexander Graf, Thomas Gleixner, Marius Hillenbrand, kvm list,
	LKML, Kernel Hardening, Linux-MM, Alexander Graf,
	David Woodhouse, the arch/x86 maintainers, Peter Zijlstra

On 6/17/19 9:53 AM, Nadav Amit wrote:
>>> For anyone following along at home, I'm going to go off into crazy
>>> per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)
>>>
>>> But, I was thinking we could get away with not doing this on _every_
>>> context switch at least.  For instance, couldn't 'struct tlb_context'
>>> have PGD pointer (or two with PTI) in addition to the TLB info?  That
>>> way we only do the copying when we change the context.  Or does that tie
>>> the implementation up too much with PCIDs?
>> Hmm, that seems entirely reasonable.  I think the nasty bit would be
>> figuring out all the interactions with PV TLB flushing.  PV TLB
>> flushes already don't play so well with PCID tracking, and this will
>> make it worse.  We probably need to rewrite all that code regardless.
> How is PCID (as you implemented) related to TLB flushing of kernel (not
> user) PTEs? These kernel PTEs would be global, so they would be invalidated
> from all the address-spaces using INVLPG, I presume. No?

The idea is that you have a per-cpu address space.  Certain kernel
virtual addresses would map to a different physical address based on
where you are running.  Each of the physical addresses would be "owned"
by a single CPU and, by convention, a PGD that maps such an address
would never be used except on the CPU that "owns" it.

In that case, you never really invalidate those addresses.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-17 18:07                   ` Dave Hansen
@ 2019-06-17 18:45                     ` Konrad Rzeszutek Wilk
  2019-06-17 18:49                       ` Dave Hansen
  2019-06-17 18:53                       ` Andy Lutomirski
  2019-06-17 18:50                     ` Nadav Amit
  1 sibling, 2 replies; 44+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-06-17 18:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Nadav Amit, Andy Lutomirski, Alexander Graf, Thomas Gleixner,
	Marius Hillenbrand, kvm list, LKML, Kernel Hardening, Linux-MM,
	Alexander Graf, David Woodhouse, the arch/x86 maintainers,
	Peter Zijlstra

On Mon, Jun 17, 2019 at 11:07:45AM -0700, Dave Hansen wrote:
> On 6/17/19 9:53 AM, Nadav Amit wrote:
> >>> For anyone following along at home, I'm going to go off into crazy
> >>> per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)
> >>>
> >>> But, I was thinking we could get away with not doing this on _every_
> >>> context switch at least.  For instance, couldn't 'struct tlb_context'
> >>> have PGD pointer (or two with PTI) in addition to the TLB info?  That
> >>> way we only do the copying when we change the context.  Or does that tie
> >>> the implementation up too much with PCIDs?
> >> Hmm, that seems entirely reasonable.  I think the nasty bit would be
> >> figuring out all the interactions with PV TLB flushing.  PV TLB
> >> flushes already don't play so well with PCID tracking, and this will
> >> make it worse.  We probably need to rewrite all that code regardless.
> > How is PCID (as you implemented) related to TLB flushing of kernel (not
> > user) PTEs? These kernel PTEs would be global, so they would be invalidated
> > from all the address-spaces using INVLPG, I presume. No?
> 
> The idea is that you have a per-cpu address space.  Certain kernel
> virtual addresses would map to different physical address based on where
> you are running.  Each of the physical addresses would be "owned" by a
> single CPU and would, by convention, never use a PGD that mapped an
> address unless that CPU that "owned" it.
> 
> In that case, you never really invalidate those addresses.

But you would need to invalidate if the process moved to another CPU, correct?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-17 18:45                     ` Konrad Rzeszutek Wilk
@ 2019-06-17 18:49                       ` Dave Hansen
  2019-06-17 18:53                       ` Andy Lutomirski
  1 sibling, 0 replies; 44+ messages in thread
From: Dave Hansen @ 2019-06-17 18:49 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Nadav Amit, Andy Lutomirski, Alexander Graf, Thomas Gleixner,
	Marius Hillenbrand, kvm list, LKML, Kernel Hardening, Linux-MM,
	Alexander Graf, David Woodhouse, the arch/x86 maintainers,
	Peter Zijlstra

On 6/17/19 11:45 AM, Konrad Rzeszutek Wilk wrote:
>> The idea is that you have a per-cpu address space.  Certain kernel
>> virtual addresses would map to different physical address based on where
>> you are running.  Each of the physical addresses would be "owned" by a
>> single CPU and would, by convention, never use a PGD that mapped an
>> address unless that CPU that "owned" it.
>>
>> In that case, you never really invalidate those addresses.
> But you would need to invalidate if the process moved to another CPU, correct?

If you have a per-cpu PGD, the rule is that you never use the PGD on
more than one CPU.  In that model, processes "take over" an existing PGD
to run on the CPU rather than having the CPU use an existing PGD that
came with the process.
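
A sketch of what "take over" could look like at context-switch time
(invented names; PTI, PCID bits and locking all ignored):

static void takeover_cpu_pgd(struct mm_struct *next)
{
        /* the PGD page this CPU owns -- hypothetical per-cpu variables */
        pgd_t *cpu_pgd = this_cpu_read(cpu_owned_pgd);

        /* pull in the next mm's user and kernel mappings ... */
        memcpy(cpu_pgd, next->pgd, PTRS_PER_PGD * sizeof(pgd_t));
        /* ... then restore the entry that only this CPU ever uses,
         * e.g. its own percpu area */
        cpu_pgd[CPU_PRIVATE_PGD_INDEX] = *this_cpu_ptr(&cpu_private_pgd_entry);

        write_cr3(__pa(cpu_pgd));
}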

But we've really hijacked the original thread at this point, which is
probably causing a ton of confusion.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-17 18:07                   ` Dave Hansen
  2019-06-17 18:45                     ` Konrad Rzeszutek Wilk
@ 2019-06-17 18:50                     ` Nadav Amit
  2019-06-17 18:55                       ` Dave Hansen
  1 sibling, 1 reply; 44+ messages in thread
From: Nadav Amit @ 2019-06-17 18:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Alexander Graf, Thomas Gleixner,
	Marius Hillenbrand, kvm list, LKML, Kernel Hardening, Linux-MM,
	Alexander Graf, David Woodhouse, the arch/x86 maintainers,
	Peter Zijlstra

> On Jun 17, 2019, at 11:07 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 6/17/19 9:53 AM, Nadav Amit wrote:
>>>> For anyone following along at home, I'm going to go off into crazy
>>>> per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)
>>>> 
>>>> But, I was thinking we could get away with not doing this on _every_
>>>> context switch at least.  For instance, couldn't 'struct tlb_context'
>>>> have PGD pointer (or two with PTI) in addition to the TLB info?  That
>>>> way we only do the copying when we change the context.  Or does that tie
>>>> the implementation up too much with PCIDs?
>>> Hmm, that seems entirely reasonable.  I think the nasty bit would be
>>> figuring out all the interactions with PV TLB flushing.  PV TLB
>>> flushes already don't play so well with PCID tracking, and this will
>>> make it worse.  We probably need to rewrite all that code regardless.
>> How is PCID (as you implemented) related to TLB flushing of kernel (not
>> user) PTEs? These kernel PTEs would be global, so they would be invalidated
>> from all the address-spaces using INVLPG, I presume. No?
> 
> The idea is that you have a per-cpu address space.  Certain kernel
> virtual addresses would map to different physical address based on where
> you are running.  Each of the physical addresses would be "owned" by a
> single CPU and would, by convention, never use a PGD that mapped an
> address unless that CPU that "owned" it.
> 
> In that case, you never really invalidate those addresses.

I understand, but as I see it, this is not related directly to PCIDs.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-17 18:45                     ` Konrad Rzeszutek Wilk
  2019-06-17 18:49                       ` Dave Hansen
@ 2019-06-17 18:53                       ` Andy Lutomirski
  1 sibling, 0 replies; 44+ messages in thread
From: Andy Lutomirski @ 2019-06-17 18:53 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Dave Hansen, Nadav Amit, Andy Lutomirski, Alexander Graf,
	Thomas Gleixner, Marius Hillenbrand, kvm list, LKML,
	Kernel Hardening, Linux-MM, Alexander Graf, David Woodhouse,
	the arch/x86 maintainers, Peter Zijlstra

On Mon, Jun 17, 2019 at 11:44 AM Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>
> On Mon, Jun 17, 2019 at 11:07:45AM -0700, Dave Hansen wrote:
> > On 6/17/19 9:53 AM, Nadav Amit wrote:
> > >>> For anyone following along at home, I'm going to go off into crazy
> > >>> per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)
> > >>>
> > >>> But, I was thinking we could get away with not doing this on _every_
> > >>> context switch at least.  For instance, couldn't 'struct tlb_context'
> > >>> have PGD pointer (or two with PTI) in addition to the TLB info?  That
> > >>> way we only do the copying when we change the context.  Or does that tie
> > >>> the implementation up too much with PCIDs?
> > >> Hmm, that seems entirely reasonable.  I think the nasty bit would be
> > >> figuring out all the interactions with PV TLB flushing.  PV TLB
> > >> flushes already don't play so well with PCID tracking, and this will
> > >> make it worse.  We probably need to rewrite all that code regardless.
> > > How is PCID (as you implemented) related to TLB flushing of kernel (not
> > > user) PTEs? These kernel PTEs would be global, so they would be invalidated
> > > from all the address-spaces using INVLPG, I presume. No?
> >
> > The idea is that you have a per-cpu address space.  Certain kernel
> > virtual addresses would map to different physical address based on where
> > you are running.  Each of the physical addresses would be "owned" by a
> > single CPU and would, by convention, never use a PGD that mapped an
> > address unless that CPU that "owned" it.
> >
> > In that case, you never really invalidate those addresses.
>
> But you would need to invalidate if the process moved to another CPU, correct?
>

There's nothing to invalidate.  It's a different CPU with a different TLB.

The big problem is that you have a choice.  Either you can have one
PGD per (mm, cpu) or you just have one or a few PGDs per CPU and you
change them every time you change processes.  Dave's idea to have one
or two per (cpu, asid) is right, though.  It means we have a decent
chance of context switching without rewriting the whole thing, and it
also means we don't need to write to the one that's currently loaded
when we switch CR3.  The latter could plausibly be important enough
that we'd want to pretend we're using PCID even if we're not.
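
In data-structure terms, something like this (a sketch; TLB_NR_DYN_ASIDS is
the existing number of dynamic ASID slots, everything else is invented):

/* one PGD -- or a PGD pair with PTI -- per (cpu, asid) slot */
struct cpu_asid_pgds {
        pgd_t *pgd[TLB_NR_DYN_ASIDS];
        pgd_t *user_pgd[TLB_NR_DYN_ASIDS];      /* PTI copies */
};
static DEFINE_PER_CPU(struct cpu_asid_pgds, cpu_asid_pgds);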

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets
  2019-06-17 18:50                     ` Nadav Amit
@ 2019-06-17 18:55                       ` Dave Hansen
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Hansen @ 2019-06-17 18:55 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Alexander Graf, Thomas Gleixner,
	Marius Hillenbrand, kvm list, LKML, Kernel Hardening, Linux-MM,
	Alexander Graf, David Woodhouse, the arch/x86 maintainers,
	Peter Zijlstra

On 6/17/19 11:50 AM, Nadav Amit wrote:
>> The idea is that you have a per-cpu address space.  Certain kernel
>> virtual addresses would map to different physical address based on where
>> you are running.  Each of the physical addresses would be "owned" by a
>> single CPU and would, by convention, never use a PGD that mapped an
>> address unless that CPU that "owned" it.
>>
>> In that case, you never really invalidate those addresses.
> I understand, but as I see it, this is not related directly to PCIDs.

Yeah, the only link I was thinking of is that we can manage per-CPU PGDs
in the same way that we manage PCIDs.  Basically we can reuse a chunk of
the software concept.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2019-06-17 18:56 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-12 17:08 [RFC 00/10] Process-local memory allocations for hiding KVM secrets Marius Hillenbrand
2019-06-12 17:08 ` [RFC 01/10] x86/mm/kaslr: refactor to use enum indices for regions Marius Hillenbrand
2019-06-12 17:08 ` [RFC 02/10] x86/speculation, mm: add process local virtual memory region Marius Hillenbrand
2019-06-12 17:08 ` [RFC 03/10] x86/mm, mm,kernel: add teardown for process-local memory to mm cleanup Marius Hillenbrand
2019-06-12 17:08 ` [RFC 04/10] mm: allocate virtual space for process-local memory Marius Hillenbrand
2019-06-12 17:08 ` [RFC 05/10] mm: allocate/release physical pages " Marius Hillenbrand
2019-06-12 17:08 ` [RFC 06/10] kvm/x86: add support for storing vCPU state in " Marius Hillenbrand
2019-06-12 17:08 ` [RFC 07/10] kvm, vmx: move CR2 context switch out of assembly path Marius Hillenbrand
2019-06-12 17:08 ` [RFC 08/10] kvm, vmx: move register clearing " Marius Hillenbrand
2019-06-12 17:08 ` [RFC 09/10] kvm, vmx: move gprs to process local memory Marius Hillenbrand
2019-06-12 17:08 ` [RFC 10/10] kvm, x86: move guest FPU state into " Marius Hillenbrand
2019-06-12 18:25 ` [RFC 00/10] Process-local memory allocations for hiding KVM secrets Sean Christopherson
2019-06-13  7:20   ` Alexander Graf
2019-06-13 10:54   ` Liran Alon
2019-06-12 19:55 ` Dave Hansen
2019-06-12 20:27   ` Andy Lutomirski
2019-06-12 20:41     ` Dave Hansen
2019-06-12 20:56       ` Andy Lutomirski
2019-06-13  1:30     ` Andy Lutomirski
2019-06-13  1:50       ` Nadav Amit
2019-06-13 16:16         ` Andy Lutomirski
2019-06-13  7:52       ` Alexander Graf
2019-06-13 16:13         ` Andy Lutomirski
2019-06-13 16:20           ` Dave Hansen
2019-06-13 17:29             ` Nadav Amit
2019-06-13 17:49               ` Dave Hansen
2019-06-13 20:05                 ` Sean Christopherson
2019-06-14 14:21     ` Thomas Gleixner
2019-06-16 22:18       ` Andy Lutomirski
2019-06-16 22:28         ` Thomas Gleixner
2019-06-17  7:38       ` Alexander Graf
2019-06-17 15:50         ` Dave Hansen
2019-06-17 15:54           ` Andy Lutomirski
2019-06-17 16:03             ` Dave Hansen
2019-06-17 16:14               ` Andy Lutomirski
2019-06-17 16:53                 ` Nadav Amit
2019-06-17 18:07                   ` Dave Hansen
2019-06-17 18:45                     ` Konrad Rzeszutek Wilk
2019-06-17 18:49                       ` Dave Hansen
2019-06-17 18:53                       ` Andy Lutomirski
2019-06-17 18:50                     ` Nadav Amit
2019-06-17 18:55                       ` Dave Hansen
2019-06-13  7:27   ` Alexander Graf
2019-06-13 14:19     ` Dave Hansen
