KVM Archive on lore.kernel.org
* [RFC v2 00/27] Kernel Address Space Isolation
@ 2019-07-11 14:25 Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 01/26] mm/x86: Introduce kernel address space isolation Alexandre Chartre
                   ` (28 more replies)
  0 siblings, 29 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Hi,

This is version 2 of the "KVM Address Space Isolation" RFC. The code
has been completely changed compared to v1: it now provides a generic
kernel framework for Address Space Isolation, and KVM is now a simple
consumer of that framework. That's why the RFC title has been changed
from "KVM Address Space Isolation" to "Kernel Address Space
Isolation".

Kernel Address Space Isolation aims to use address spaces to isolate some
parts of the kernel (for example KVM) to prevent leaking sensitive data
between hyper-threads under speculative execution attacks. You can refer
to the first version of this RFC for more context:

   https://lkml.org/lkml/2019/5/13/515

The new code is still a proof of concept. It is much more stable than v1:
I am able to run a VM with a full OS (and also a nested VM) with multiple
vcpus. But it looks like there are still some corner cases which cause the
system to crash/hang.

I am looking for feedback about this new approach where address space
isolation is provided by the kernel, and KVM is just a consumer of this
new framework.


Changes
=======

- Address Space Isolation (ASI) is now provided as a kernel framework:
  interfaces for creating and managing an ASI are provided by the kernel,
  they are not implemented in KVM.

- An ASI is associated with a page-table; we don't use an mm anymore. Entering
  isolation is done by just updating CR3 to use the ASI page-table. Exiting
  isolation restores CR3 to the value it had before entering isolation.

- Isolation is exited at the beginning of any interrupt/exception handler,
  and on context switch.

- Isolation doesn't disable interrupts, but if an interrupt occurs the
  interrupt handler will exit isolation.

- The current stack is mapped when entering isolation and unmapped when
  exiting isolation.

- The current task is not mapped by default, but there's an option to map it.
  In such a case, the current task is mapped when entering isolation and
  unmapped when exiting isolation.

- Kernel code mapped to the ASI page-table has been reduced to:
  . the entire kernel (I still need to test with only the kernel text)
  . the cpu entry area (because we need the GDT to be mapped)
  . the cpu ASI session (for managing ASI)
  . the current stack

- Optionally, an ASI can request the following kernel mappings to be added:
  . the stack canary
  . the cpu offsets (this_cpu_off)
  . the current task
  . RCU data (rcu_data)
  . CPU HW events (cpu_hw_events).

  All these optional mappings are used for KVM isolation.
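
As an illustration of how the core and optional mappings listed above
could be selected with a flags bitmask, here is a small userspace sketch.
It is not code from the series. Only ASI_MAP_STACK_CANARY is named in this
cover letter; the other flag names, all bit values, and the helper
asi_list_mappings() are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Only ASI_MAP_STACK_CANARY is named in the cover letter. The other
 * flag names and all bit values are assumptions made for this sketch.
 */
#define ASI_MAP_STACK_CANARY	(1 << 0)
#define ASI_MAP_CPU_PTR		(1 << 1)	/* hypothetical: this_cpu_off */
#define ASI_MAP_CURRENT_TASK	(1 << 2)	/* hypothetical */
#define ASI_MAP_RCU_DATA	(1 << 3)	/* hypothetical: rcu_data */
#define ASI_MAP_CPU_HW_EVENTS	(1 << 4)	/* hypothetical: cpu_hw_events */

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Mappings that every ASI page-table receives. */
static const char *const asi_core_mappings[] = {
	"kernel", "cpu entry area", "cpu ASI session", "current stack",
};

/* Optional mappings, indexed by flag bit number. */
static const char *const asi_optional_mappings[] = {
	"stack canary", "this_cpu_off", "current task", "rcu_data",
	"cpu_hw_events",
};

/*
 * Fill out[] with the names of all mappings that an ASI created with
 * map_flags would contain. Returns the number of names written.
 */
size_t asi_list_mappings(int map_flags, const char **out, size_t max)
{
	size_t n = 0;

	assert(out != NULL);

	for (size_t i = 0; i < ARRAY_SIZE(asi_core_mappings) && n < max; i++)
		out[n++] = asi_core_mappings[i];

	for (size_t i = 0; i < ARRAY_SIZE(asi_optional_mappings) && n < max; i++)
		if (map_flags & (1 << i))
			out[n++] = asi_optional_mappings[i];

	return n;
}
```

With no flags set, only the four core mappings are listed; a consumer
such as KVM, which uses all the optional mappings, would pass all five
flags.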
  

Patches:
========

The proposed patches provide a framework for creating an Address Space
Isolation (ASI) (represented by a struct asi). The ASI has a page-table which
can be populated by copying mappings from the kernel page-table. The ASI can
then be entered/exited by switching between the kernel page-table and the
ASI page-table. In addition, any interrupt, exception or context switch
will automatically abort and exit the isolation. Finally, the last patches
use the ASI framework to implement KVM isolation.

- 01-03: Core of the ASI framework: create/destroy ASI, enter/exit/abort
  isolation, ASI page-fault handler.

- 04-14: Functions to manage, populate and clear an ASI page-table.

- 15-20: ASI core mappings and optional mappings.

- 21: Make functions to read cr3/cr4 ASI aware

- 22-26: Use ASI in KVM to provide isolation for VMExit handlers.


API Overview:
=============
Here is a short description of the main ASI functions provided by the framework.

struct asi *asi_create(int map_flags)

  Create an Address Space Isolation (ASI). map_flags can be used to specify
  optional kernel mappings to be added to the ASI page-table (for example,
  ASI_MAP_STACK_CANARY to map the stack canary).


void asi_destroy(struct asi *asi)

  Destroy an ASI.


int asi_enter(struct asi *asi)

  Enter isolation for the specified ASI. This switches from the kernel page-table
  to the page-table associated with the ASI.


void asi_exit(struct asi *asi)

  Exit isolation for the specified ASI. This switches back to the kernel
  page-table.


int asi_map(struct asi *asi, void *ptr, unsigned long size);

  Copy kernel mapping to the specified ASI page-table.


void asi_unmap(struct asi *asi, void *ptr);

  Clear kernel mapping from the specified ASI page-table.
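
The intended call sequence for the functions above can be sketched with
a userspace mock. This is not the kernel implementation: the function
names and signatures follow the overview, but the bodies, the range
array standing in for the ASI page-table, and the helpers
mock_is_mapped() and asi_lifecycle_demo() are assumptions made purely
for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ASI_MAX_RANGES 8

struct asi_range { void *ptr; unsigned long size; };

struct asi {
	int map_flags;
	int nr_ranges;
	struct asi_range ranges[ASI_MAX_RANGES];  /* stands in for the page-table */
};

/* NULL means the full kernel page-table is in use. */
static struct asi *mock_active_asi;

struct asi *asi_create(int map_flags)
{
	struct asi *asi = calloc(1, sizeof(*asi));

	if (asi)
		asi->map_flags = map_flags;
	return asi;
}

void asi_destroy(struct asi *asi)
{
	if (mock_active_asi == asi)
		mock_active_asi = NULL;
	free(asi);
}

int asi_map(struct asi *asi, void *ptr, unsigned long size)
{
	if (asi->nr_ranges == ASI_MAX_RANGES)
		return -1;
	asi->ranges[asi->nr_ranges++] = (struct asi_range){ ptr, size };
	return 0;
}

void asi_unmap(struct asi *asi, void *ptr)
{
	for (int i = 0; i < asi->nr_ranges; i++) {
		if (asi->ranges[i].ptr == ptr) {
			/* Fill the hole with the last entry. */
			asi->ranges[i] = asi->ranges[asi->nr_ranges - 1];
			asi->nr_ranges--;
			return;
		}
	}
}

int asi_enter(struct asi *asi)
{
	if (mock_active_asi && mock_active_asi != asi)
		return -1;	/* nesting different ASIs is not supported */
	mock_active_asi = asi;
	return 0;
}

void asi_exit(struct asi *asi)
{
	if (mock_active_asi == asi)
		mock_active_asi = NULL;
}

/* Mock only: is addr reachable with the page-table currently in use? */
int mock_is_mapped(const void *addr)
{
	uintptr_t a = (uintptr_t)addr;

	if (!mock_active_asi)
		return 1;
	for (int i = 0; i < mock_active_asi->nr_ranges; i++) {
		uintptr_t start = (uintptr_t)mock_active_asi->ranges[i].ptr;

		if (a >= start && a - start < mock_active_asi->ranges[i].size)
			return 1;
	}
	return 0;
}

/* Returns 1 if the mock behaves as the overview describes, -1 on setup error. */
int asi_lifecycle_demo(void)
{
	static char shared[64], secret[64];
	struct asi *asi = asi_create(0);
	int visible, hidden, restored;

	if (!asi || asi_map(asi, shared, sizeof(shared)) || asi_enter(asi)) {
		asi_destroy(asi);
		return -1;
	}
	visible = mock_is_mapped(shared);	/* copied into the ASI */
	hidden = !mock_is_mapped(secret);	/* never mapped */
	asi_exit(asi);
	restored = mock_is_mapped(secret);	/* full address space again */
	asi_unmap(asi, shared);
	asi_destroy(asi);
	return visible && hidden && restored;
}
```

In the mock, data is only reachable during isolation if it was first
copied into the ASI with asi_map(); everything becomes reachable again
after asi_exit().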


----
Alexandre Chartre (23):
  mm/x86: Introduce kernel address space isolation
  mm/asi: Abort isolation on interrupt, exception and context switch
  mm/asi: Handle page fault due to address space isolation
  mm/asi: Functions to track buffers allocated for an ASI page-table
  mm/asi: Add ASI page-table entry offset functions
  mm/asi: Add ASI page-table entry allocation functions
  mm/asi: Add ASI page-table entry set functions
  mm/asi: Functions to populate an ASI page-table from a VA range
  mm/asi: Helper functions to map module into ASI
  mm/asi: Keep track of VA ranges mapped in ASI page-table
  mm/asi: Functions to clear ASI page-table entries for a VA range
  mm/asi: Function to copy page-table entries for percpu buffer
  mm/asi: Add asi_remap() function
  mm/asi: Handle ASI mapped range leaks and overlaps
  mm/asi: Initialize the ASI page-table with core mappings
  mm/asi: Option to map current task into ASI
  rcu: Move tree.h static forward declarations to tree.c
  rcu: Make percpu rcu_data non-static
  mm/asi: Add option to map RCU data
  mm/asi: Add option to map cpu_hw_events
  mm/asi: Make functions to read cr3/cr4 ASI aware
  KVM: x86/asi: Populate the KVM ASI page-table
  KVM: x86/asi: Map KVM memslots and IO buses into KVM ASI

Liran Alon (3):
  KVM: x86/asi: Introduce address_space_isolation module parameter
  KVM: x86/asi: Introduce KVM address space isolation
  KVM: x86/asi: Switch to KVM address space on entry to guest

 arch/x86/entry/entry_64.S          |   42 ++-
 arch/x86/include/asm/asi.h         |  237 ++++++++
 arch/x86/include/asm/mmu_context.h |   20 +-
 arch/x86/include/asm/tlbflush.h    |   10 +
 arch/x86/kernel/asm-offsets.c      |    4 +
 arch/x86/kvm/Makefile              |    3 +-
 arch/x86/kvm/mmu.c                 |    2 +-
 arch/x86/kvm/vmx/isolation.c       |  231 ++++++++
 arch/x86/kvm/vmx/vmx.c             |   14 +-
 arch/x86/kvm/vmx/vmx.h             |   24 +
 arch/x86/kvm/x86.c                 |   68 +++-
 arch/x86/kvm/x86.h                 |    1 +
 arch/x86/mm/Makefile               |    2 +
 arch/x86/mm/asi.c                  |  459 +++++++++++++++
 arch/x86/mm/asi_pagetable.c        | 1077 ++++++++++++++++++++++++++++++++++++
 arch/x86/mm/fault.c                |    7 +
 include/linux/kvm_host.h           |    7 +
 kernel/rcu/tree.c                  |   56 ++-
 kernel/rcu/tree.h                  |   56 +--
 kernel/sched/core.c                |    4 +
 security/Kconfig                   |   10 +
 21 files changed, 2269 insertions(+), 65 deletions(-)
 create mode 100644 arch/x86/include/asm/asi.h
 create mode 100644 arch/x86/kvm/vmx/isolation.c
 create mode 100644 arch/x86/mm/asi.c
 create mode 100644 arch/x86/mm/asi_pagetable.c



* [RFC v2 01/26] mm/x86: Introduce kernel address space isolation
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 21:33   ` Thomas Gleixner
  2019-07-11 14:25 ` [RFC v2 02/26] mm/asi: Abort isolation on interrupt, exception and context switch Alexandre Chartre
                   ` (27 subsequent siblings)
  28 siblings, 1 reply; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Introduce core functions and structures for implementing Address Space
Isolation (ASI). Kernel address space isolation provides the ability to
run some kernel code with a reduced kernel address space.

An address space isolation is defined with a struct asi structure which
has its own page-table. While, for now, this page-table is empty, it
will eventually be possible to populate it so that it is much smaller
than the full kernel page-table.

Isolation is entered by calling asi_enter() which switches the kernel
page-table to the address space isolation page-table. Isolation is then
exited by calling asi_exit() which switches the page-table back to the
kernel page-table.
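
The enter/exit behaviour described above can be modelled in userspace
as follows. This is only a sketch under stated assumptions: a plain
variable (model_cr3) stands in for the CR3 register, a single static
struct stands in for the per-cpu session, and the model_ prefixed names
are hypothetical. Only the state names and the -EBUSY return for a
different ASI come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum asi_session_state {
	ASI_SESSION_STATE_INACTIVE,
	ASI_SESSION_STATE_ACTIVE,
};

struct model_asi { unsigned long pgd; };	/* fake page-table address */

struct model_session {
	struct model_asi *asi;
	enum asi_session_state state;
	unsigned long original_cr3;
};

static unsigned long model_cr3 = 0x1000;	/* pretend kernel page-table */
static struct model_session model_session;	/* models a single cpu */

int model_asi_enter(struct model_asi *asi)
{
	struct model_session *s = &model_session;

	assert(asi != NULL);

	/* Re-entering is allowed, but only with the same ASI. */
	if (s->state == ASI_SESSION_STATE_ACTIVE)
		return s->asi == asi ? 0 : -EBUSY;

	s->asi = asi;
	s->original_cr3 = model_cr3;	/* remembered for the exit path */
	s->state = ASI_SESSION_STATE_ACTIVE;
	model_cr3 = asi->pgd;		/* switch to the ASI page-table */
	return 0;
}

void model_asi_exit(void)
{
	struct model_session *s = &model_session;

	if (s->state == ASI_SESSION_STATE_INACTIVE)
		return;

	model_cr3 = s->original_cr3;	/* back to the kernel page-table */
	s->state = ASI_SESSION_STATE_INACTIVE;
	s->asi = NULL;
	s->original_cr3 = 0;
}
```

The key property is that exiting always restores whatever value was
saved on entry, so the caller does not need to know which kernel
page-table was in use beforehand.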

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h |   41 ++++++++++++
 arch/x86/mm/Makefile       |    2 +
 arch/x86/mm/asi.c          |  152 ++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig           |   10 +++
 4 files changed, 205 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/asi.h
 create mode 100644 arch/x86/mm/asi.c

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
new file mode 100644
index 0000000..8a13f73
--- /dev/null
+++ b/arch/x86/include/asm/asi.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ARCH_X86_MM_ASI_H
+#define ARCH_X86_MM_ASI_H
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+#include <linux/spinlock.h>
+#include <asm/pgtable.h>
+
+struct asi {
+	spinlock_t		lock;		/* protect all attributes */
+	pgd_t			*pgd;		/* ASI page-table */
+};
+
+/*
+ * An ASI session maintains the state of address space isolation on a
+ * cpu. There is one ASI session per cpu. There is no lock to protect
+ * members of the asi_session structure as each cpu is managing its
+ * own ASI session.
+ */
+
+enum asi_session_state {
+	ASI_SESSION_STATE_INACTIVE,	/* no address space isolation */
+	ASI_SESSION_STATE_ACTIVE,	/* address space isolation is active */
+};
+
+struct asi_session {
+	struct asi		*asi;		/* ASI for this session */
+	enum asi_session_state	state;		/* state of ASI session */
+	unsigned long		original_cr3;	/* cr3 before entering ASI */
+	struct task_struct	*task;		/* task during isolation */
+} __aligned(PAGE_SIZE);
+
+extern struct asi *asi_create(void);
+extern void asi_destroy(struct asi *asi);
+extern int asi_enter(struct asi *asi);
+extern void asi_exit(struct asi *asi);
+
+#endif	/* CONFIG_ADDRESS_SPACE_ISOLATION */
+
+#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 84373dc..dae5c8a 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -49,7 +49,9 @@ obj-$(CONFIG_X86_INTEL_MPX)			+= mpx.o
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
+obj-$(CONFIG_ADDRESS_SPACE_ISOLATION)		+= asi.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
+
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
new file mode 100644
index 0000000..c3993b7
--- /dev/null
+++ b/arch/x86/mm/asi.c
@@ -0,0 +1,152 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ *
+ * Kernel Address Space Isolation (ASI)
+ */
+
+#include <linux/export.h>
+#include <linux/gfp.h>
+#include <linux/mm.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+
+#include <asm/asi.h>
+#include <asm/bug.h>
+#include <asm/mmu_context.h>
+
+/* ASI sessions, one per cpu */
+DEFINE_PER_CPU_PAGE_ALIGNED(struct asi_session, cpu_asi_session);
+
+static int asi_init_mapping(struct asi *asi)
+{
+	/*
+	 * TODO: Populate the ASI page-table with minimal mappings so
+	 * that we can at least enter isolation and abort.
+	 */
+	return 0;
+}
+
+struct asi *asi_create(void)
+{
+	struct page *page;
+	struct asi *asi;
+	int err;
+
+	asi = kzalloc(sizeof(*asi), GFP_KERNEL);
+	if (!asi)
+		return NULL;
+
+	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	if (!page)
+		goto error;
+
+	asi->pgd = page_address(page);
+	spin_lock_init(&asi->lock);
+
+	err = asi_init_mapping(asi);
+	if (err)
+		goto error;
+
+	return asi;
+
+error:
+	asi_destroy(asi);
+	return NULL;
+}
+EXPORT_SYMBOL(asi_create);
+
+void asi_destroy(struct asi *asi)
+{
+	if (!asi)
+		return;
+
+	if (asi->pgd)
+		free_page((unsigned long)asi->pgd);
+
+	kfree(asi);
+}
+EXPORT_SYMBOL(asi_destroy);
+
+
+/*
+ * When isolation is active, the address space doesn't necessarily map
+ * the percpu offset value (this_cpu_off) which is used to get pointers
+ * to percpu variables. So functions which can be invoked while isolation
+ * is active shouldn't be getting pointers to percpu variables (i.e. with
+ * get_cpu_var() or this_cpu_ptr()). Instead percpu variable should be
+ * directly read or written to (i.e. with this_cpu_read() or
+ * this_cpu_write()).
+ */
+
+int asi_enter(struct asi *asi)
+{
+	enum asi_session_state state;
+	struct asi *current_asi;
+	struct asi_session *asi_session;
+
+	state = this_cpu_read(cpu_asi_session.state);
+	/*
+	 * We can re-enter isolation, but only with the same ASI (we don't
+	 * support nesting isolation). Also, if isolation is still active,
+	 * then we should be re-entering with the same task.
+	 */
+	if (state == ASI_SESSION_STATE_ACTIVE) {
+		current_asi = this_cpu_read(cpu_asi_session.asi);
+		if (current_asi != asi) {
+			WARN_ON(1);
+			return -EBUSY;
+		}
+		WARN_ON(this_cpu_read(cpu_asi_session.task) != current);
+		return 0;
+	}
+
+	/* isolation is not active so we can safely access the percpu pointer */
+	asi_session = &get_cpu_var(cpu_asi_session);
+	asi_session->asi = asi;
+	asi_session->task = current;
+	asi_session->original_cr3 = __get_current_cr3_fast();
+	if (!asi_session->original_cr3) {
+		WARN_ON(1);
+		err = -EINVAL;
+		goto err_clear_asi;
+	}
+	asi_session->state = ASI_SESSION_STATE_ACTIVE;
+
+	load_cr3(asi->pgd);
+
+	return 0;
+
+err_clear_asi:
+	asi_session->asi = NULL;
+	asi_session->task = NULL;
+
+	return err;
+
+}
+EXPORT_SYMBOL(asi_enter);
+
+void asi_exit(struct asi *asi)
+{
+	struct asi_session *asi_session;
+	enum asi_session_state asi_state;
+	unsigned long original_cr3;
+
+	asi_state = this_cpu_read(cpu_asi_session.state);
+	if (asi_state == ASI_SESSION_STATE_INACTIVE)
+		return;
+
+	/* TODO: Kick sibling hyperthread before switching to kernel cr3 */
+	original_cr3 = this_cpu_read(cpu_asi_session.original_cr3);
+	if (original_cr3)
+		write_cr3(original_cr3);
+
+	/* page-table was switched, we can now access the percpu pointer */
+	asi_session = &get_cpu_var(cpu_asi_session);
+	WARN_ON(asi_session->task != current);
+	asi_session->state = ASI_SESSION_STATE_INACTIVE;
+	asi_session->asi = NULL;
+	asi_session->task = NULL;
+	asi_session->original_cr3 = 0;
+}
+EXPORT_SYMBOL(asi_exit);
diff --git a/security/Kconfig b/security/Kconfig
index 466cc1f..241b9a7 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -65,6 +65,16 @@ config PAGE_TABLE_ISOLATION
 
 	  See Documentation/x86/pti.txt for more details.
 
+config ADDRESS_SPACE_ISOLATION
+	bool "Allow code to run with a reduced kernel address space"
+	default y
+	depends on (X86_64 || X86_PAE) && !UML
+	help
+	   This feature provides the ability to run some kernel code
+	   with a reduced kernel address space. This can be used to
+	   mitigate speculative execution attacks which are able to
+	   leak data between sibling CPU hyper-threads.
+
 config SECURITY_INFINIBAND
 	bool "Infiniband Security Hooks"
 	depends on SECURITY && INFINIBAND
-- 
1.7.1



* [RFC v2 02/26] mm/asi: Abort isolation on interrupt, exception and context switch
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 01/26] mm/x86: Introduce kernel address space isolation Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 20:11   ` Andi Kleen
  2019-07-12  0:05   ` Andy Lutomirski
  2019-07-11 14:25 ` [RFC v2 03/26] mm/asi: Handle page fault due to address space isolation Alexandre Chartre
                   ` (26 subsequent siblings)
  28 siblings, 2 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Address space isolation should be aborted if there is an interrupt,
an exception or a context switch. Interrupt/exception handlers and
context switch code need to run with the full kernel address space.
Address space isolation is aborted by restoring the original CR3
value used before entering address space isolation.
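
The abort logic added by this patch (restore CR3 on the first handler
entry, count nested handlers, exit isolation when the outermost handler
returns) can be illustrated with the userspace model below. It is a
sketch, not the kernel code: model_cr3 stands in for the CR3 register,
a static struct stands in for the per-cpu session, and every model_
prefixed name is hypothetical. The state names and the retry_abort and
abort_depth fields mirror the patch.

```c
#include <assert.h>
#include <stdbool.h>

enum asi_session_state {
	ASI_SESSION_STATE_INACTIVE,
	ASI_SESSION_STATE_ACTIVE,
	ASI_SESSION_STATE_ABORTED,
};

struct model_session {
	enum asi_session_state state;
	bool retry_abort;
	unsigned int abort_depth;
	unsigned long original_cr3;
};

static unsigned long model_cr3 = 0x1000;	/* pretend kernel page-table */
static struct model_session model_session;	/* models a single cpu */

void model_enter(unsigned long asi_pgd)
{
	struct model_session *s = &model_session;

	s->original_cr3 = model_cr3;
	s->state = ASI_SESSION_STATE_ACTIVE;
	model_cr3 = asi_pgd;
}

static void model_exit(void)
{
	struct model_session *s = &model_session;

	/* In the aborted state CR3 has already been restored. */
	if (s->state == ASI_SESSION_STATE_ACTIVE)
		model_cr3 = s->original_cr3;
	s->state = ASI_SESSION_STATE_INACTIVE;
	s->original_cr3 = 0;
	s->abort_depth = 0;
}

/* Called on entry to an interrupt/exception handler. */
void model_start_abort(void)
{
	struct model_session *s = &model_session;

	switch (s->state) {
	case ASI_SESSION_STATE_INACTIVE:
		return;
	case ASI_SESSION_STATE_ACTIVE:
		model_cr3 = s->original_cr3;
		s->state = ASI_SESSION_STATE_ABORTED;
		break;
	case ASI_SESSION_STATE_ABORTED:
		/* Nested handler: restore CR3 again only if asked to. */
		if (s->retry_abort)
			model_cr3 = s->original_cr3;
		break;
	}
	s->abort_depth++;
}

/* Called on return from an interrupt/exception handler. */
void model_finish_abort(void)
{
	struct model_session *s = &model_session;

	if (s->state == ASI_SESSION_STATE_INACTIVE)
		return;

	assert(s->state == ASI_SESSION_STATE_ABORTED);

	/* Inner handlers, or an open barrier, keep the aborted state. */
	if (--s->abort_depth > 0 || s->retry_abort)
		return;

	model_exit();
}
```

With two nested handlers, the first model_finish_abort() only decrements
the depth; the session becomes inactive when the outermost handler
finishes.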

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/entry_64.S     |   42 ++++++++++-
 arch/x86/include/asm/asi.h    |  114 ++++++++++++++++++++++++++++
 arch/x86/kernel/asm-offsets.c |    4 +
 arch/x86/mm/asi.c             |  165 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/core.c           |    4 +
 5 files changed, 315 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 11aa3b2..3dc6174 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -38,6 +38,7 @@
 #include <asm/export.h>
 #include <asm/frame.h>
 #include <asm/nospec-branch.h>
+#include <asm/asi.h>
 #include <linux/err.h>
 
 #include "calling.h"
@@ -558,8 +559,15 @@ ENTRY(interrupt_entry)
 	TRACE_IRQS_OFF
 
 	CALL_enter_from_user_mode
-
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+	jmp	2f
+#endif
 1:
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+	/* Abort address space isolation if it is active */
+	ASI_START_ABORT
+2:
+#endif
 	ENTER_IRQ_STACK old_rsp=%rdi save_ret=1
 	/* We entered an interrupt context - irqs are off: */
 	TRACE_IRQS_OFF
@@ -583,6 +591,9 @@ common_interrupt:
 	call	do_IRQ	/* rdi points to pt_regs */
 	/* 0(%rsp): old RSP */
 ret_from_intr:
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+	ASI_FINISH_ABORT
+#endif
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
 
@@ -947,6 +958,9 @@ ENTRY(\sym)
 	addq	$\ist_offset, CPU_TSS_IST(\shift_ist)
 	.endif
 
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+	ASI_FINISH_ABORT
+#endif
 	/* these procedures expect "no swapgs" flag in ebx */
 	.if \paranoid
 	jmp	paranoid_exit
@@ -1182,6 +1196,16 @@ ENTRY(paranoid_entry)
 	xorl	%ebx, %ebx
 
 1:
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+	/*
+	 * If address space isolation is active then abort it and return
+	 * the original kernel CR3 in %r14.
+	 */
+	ASI_START_ABORT_ELSE_JUMP 2f
+	movq	%rdi, %r14
+	ret
+2:
+#endif
 	/*
 	 * Always stash CR3 in %r14.  This value will be restored,
 	 * verbatim, at exit.  Needed if paranoid_entry interrupted
@@ -1265,6 +1289,15 @@ ENTRY(error_entry)
 	CALL_enter_from_user_mode
 	ret
 
+.Lerror_entry_check_address_space_isolation:
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+	/*
+	 * Abort address space isolation if it is active. This will restore
+	 * the original kernel CR3.
+	 */
+	ASI_START_ABORT
+#endif
+
 .Lerror_entry_done:
 	TRACE_IRQS_OFF
 	ret
@@ -1283,7 +1316,7 @@ ENTRY(error_entry)
 	cmpq	%rax, RIP+8(%rsp)
 	je	.Lbstep_iret
 	cmpq	$.Lgs_change, RIP+8(%rsp)
-	jne	.Lerror_entry_done
+	jne	.Lerror_entry_check_address_space_isolation
 
 	/*
 	 * hack: .Lgs_change can fail with user gsbase.  If this happens, fix up
@@ -1632,7 +1665,10 @@ end_repeat_nmi:
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
 	call	do_nmi
-
+	
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+	ASI_FINISH_ABORT
+#endif
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
 
diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 8a13f73..ff126e1 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -4,6 +4,8 @@
 
 #ifdef CONFIG_ADDRESS_SPACE_ISOLATION
 
+#ifndef __ASSEMBLY__
+
 #include <linux/spinlock.h>
 #include <asm/pgtable.h>
 
@@ -22,20 +24,132 @@ struct asi {
 enum asi_session_state {
 	ASI_SESSION_STATE_INACTIVE,	/* no address space isolation */
 	ASI_SESSION_STATE_ACTIVE,	/* address space isolation is active */
+	ASI_SESSION_STATE_ABORTED,	/* isolation has been aborted */
 };
 
 struct asi_session {
 	struct asi		*asi;		/* ASI for this session */
 	enum asi_session_state	state;		/* state of ASI session */
+	bool			retry_abort;	/* always retry abort */
+	unsigned int		abort_depth;	/* abort depth */
 	unsigned long		original_cr3;	/* cr3 before entering ASI */
 	struct task_struct	*task;		/* task during isolation */
 } __aligned(PAGE_SIZE);
 
+DECLARE_PER_CPU_PAGE_ALIGNED(struct asi_session, cpu_asi_session);
+
 extern struct asi *asi_create(void);
 extern void asi_destroy(struct asi *asi);
 extern int asi_enter(struct asi *asi);
 extern void asi_exit(struct asi *asi);
 
+/*
+ * Function to exit the current isolation. This is used to abort isolation
+ * when a task using isolation is scheduled out.
+ */
+static inline void asi_abort(void)
+{
+	enum asi_session_state asi_state;
+
+	asi_state = this_cpu_read(cpu_asi_session.state);
+	if (asi_state == ASI_SESSION_STATE_INACTIVE)
+		return;
+
+	asi_exit(this_cpu_read(cpu_asi_session.asi));
+}
+
+/*
+ * Barriers for code which sets CR3 to use the ASI page-table. That's
+ * the case, for example, when entering isolation, or during a VMExit if
+ * isolation was active. If such a code is interrupted before CR3 is
+ * effectively set, then the interrupt will abort isolation and restore
+ * the original CR3 value. But then, the code will set CR3 to use the
+ * ASI page-table while isolation has been aborted by the interrupt.
+ *
+ * To prevent this issue, such a code should call asi_barrier_begin()
+ * before CR3 gets updated, and asi_barrier_end() after CR3 has been
+ * updated.
+ *
+ * asi_barrier_begin() will set retry_abort to true. This will force
+ * interrupts to retain the isolation abort state. Then, after the code
+ * has updated CR3, asi_barrier_end() will be able to check if isolation
+ * was aborted and effectively abort isolation in that case. Setting
+ * retry_abort to true will also force all interrupts to restore the
+ * original CR3; that's in case we have interrupts both before and
+ * after CR3 is set.
+ */
+static inline unsigned long asi_restore_cr3(void)
+{
+	unsigned long original_cr3;
+
+	/* TODO: Kick sibling hyperthread before switching to kernel cr3 */
+	original_cr3 = this_cpu_read(cpu_asi_session.original_cr3);
+	if (original_cr3)
+		write_cr3(original_cr3);
+
+	return original_cr3;
+}
+
+static inline void asi_barrier_begin(void)
+{
+	this_cpu_write(cpu_asi_session.retry_abort, true);
+	mb();
+}
+
+static inline void asi_barrier_end(void)
+{
+	enum asi_session_state state;
+
+	this_cpu_write(cpu_asi_session.retry_abort, false);
+	mb();
+	state = this_cpu_read(cpu_asi_session.state);
+	if (state == ASI_SESSION_STATE_ABORTED) {
+		(void) asi_restore_cr3();
+		asi_abort();
+		return;
+	}
+
+}
+
+#else  /* __ASSEMBLY__ */
+
+/*
+ * If address space isolation is active, start aborting isolation.
+ */
+.macro ASI_START_ABORT
+	movl	PER_CPU_VAR(cpu_asi_session + CPU_ASI_SESSION_state), %edi
+	testl	%edi, %edi
+	jz	.Lasi_start_abort_done_\@
+	call	asi_start_abort
+.Lasi_start_abort_done_\@:
+.endm
+
+/*
+ * If address space isolation is active, finish aborting isolation.
+ */
+.macro ASI_FINISH_ABORT
+	movl	PER_CPU_VAR(cpu_asi_session + CPU_ASI_SESSION_state), %edi
+	testl	%edi, %edi
+	jz	.Lasi_finish_abort_done_\@
+	call	asi_finish_abort
+.Lasi_finish_abort_done_\@:
+.endm
+
+/*
+ * If address space isolation is inactive then jump to the specified
+ * label. Otherwise, start aborting isolation.
+ */
+.macro ASI_START_ABORT_ELSE_JUMP asi_inactive_label:req
+	movl	PER_CPU_VAR(cpu_asi_session + CPU_ASI_SESSION_state), %edi
+	testl	%edi, %edi
+	jz	\asi_inactive_label
+	call	asi_start_abort
+	testq	%rdi, %rdi
+	jz	\asi_inactive_label
+.endm
+
+#endif	/* __ASSEMBLY__ */
+
 #endif	/* CONFIG_ADDRESS_SPACE_ISOLATION */
 
 #endif
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 168543d..395d0c6 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -18,6 +18,7 @@
 #include <asm/bootparam.h>
 #include <asm/suspend.h>
 #include <asm/tlbflush.h>
+#include <asm/asi.h>
 
 #ifdef CONFIG_XEN
 #include <xen/interface/xen.h>
@@ -105,4 +106,7 @@ static void __used common(void)
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
 	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
+
+	BLANK();
+	OFFSET(CPU_ASI_SESSION_state, asi_session, state);
 }
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index c3993b7..fabb923 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -84,9 +84,17 @@ int asi_enter(struct asi *asi)
 	enum asi_session_state state;
 	struct asi *current_asi;
 	struct asi_session *asi_session;
+	unsigned long original_cr3;
 
 	state = this_cpu_read(cpu_asi_session.state);
 	/*
+	 * The "aborted" state is a transient state used in interrupt and
+	 * exception handlers while aborting isolation. So it shouldn't be
+	 * set when entering isolation.
+	 */
+	WARN_ON(state == ASI_SESSION_STATE_ABORTED);
+
+	/*
 	 * We can re-enter isolation, but only with the same ASI (we don't
 	 * support nesting isolation). Also, if isolation is still active,
 	 * then we should be re-entering with the same task.
@@ -105,15 +113,44 @@ int asi_enter(struct asi *asi)
 	asi_session = &get_cpu_var(cpu_asi_session);
 	asi_session->asi = asi;
 	asi_session->task = current;
-	asi_session->original_cr3 = __get_current_cr3_fast();
-	if (!asi_session->original_cr3) {
+	WARN_ON(asi_session->abort_depth > 0);
+
+	/*
+	 * Instructions ordering is important here because we should be
+	 * able to deal with any interrupt/exception which will abort
+	 * the isolation and restore CR3 to its original value:
+	 *
+	 * - asi_session->original_cr3 must be set before the ASI session
+	 *   becomes active (i.e. before setting asi_session->state to
+	 *   ASI_SESSION_STATE_ACTIVE);
+	 * - the ASI session must be marked as active (i.e. set
+	 *   asi_session->state to ASI_SESSION_STATE_ACTIVE) before
+	 *   loading the CR3 used during isolation.
+	 *
+	 * Any exception or interrupt occurring after asi_session->state is
+	 * set to ASI_SESSION_STATE_ACTIVE will cause the exception/interrupt
+	 * handler to abort the isolation. The handler will then restore
+	 * cr3 to asi_session->original_cr3 and move asi_session->state to
+	 * ASI_SESSION_STATE_ABORTED.
+	 */
+	original_cr3 = __get_current_cr3_fast();
+	if (!original_cr3) {
 		WARN_ON(1);
 		err = -EINVAL;
 		goto err_clear_asi;
 	}
-	asi_session->state = ASI_SESSION_STATE_ACTIVE;
+	asi_session->original_cr3 = original_cr3;
 
+	/*
+	 * Use ASI barrier as we are setting CR3 with the ASI page-table.
+	 * The barrier should begin before setting the state to active as
+	 * any interrupt after the state is active will abort isolation.
+	 */
+	asi_barrier_begin();
+	asi_session->state = ASI_SESSION_STATE_ACTIVE;
+	mb();
 	load_cr3(asi->pgd);
+	asi_barrier_end();
 
 	return 0;
 
@@ -130,23 +167,129 @@ void asi_exit(struct asi *asi)
 {
 	struct asi_session *asi_session;
 	enum asi_session_state asi_state;
-	unsigned long original_cr3;
 
 	asi_state = this_cpu_read(cpu_asi_session.state);
-	if (asi_state == ASI_SESSION_STATE_INACTIVE)
+	switch (asi_state) {
+	case ASI_SESSION_STATE_INACTIVE:
 		return;
-
-	/* TODO: Kick sibling hyperthread before switching to kernel cr3 */
-	original_cr3 = this_cpu_read(cpu_asi_session.original_cr3);
-	if (original_cr3)
-		write_cr3(original_cr3);
+	case ASI_SESSION_STATE_ACTIVE:
+		(void) asi_restore_cr3();
+		break;
+	case ASI_SESSION_STATE_ABORTED:
+		/*
+		 * No need to restore cr3, this was already done during
+		 * the isolation abort.
+		 */
+		break;
+	}
 
 	/* page-table was switched, we can now access the percpu pointer */
 	asi_session = &get_cpu_var(cpu_asi_session);
-	WARN_ON(asi_session->task != current);
+	/*
+	 * asi_exit() can be interrupted before setting the state to
+	 * ASI_SESSION_STATE_INACTIVE. In that case, the interrupt will
+	 * exit isolation before we have started the actual exit. So
+	 * check that the session ASI is still set to verify that an
+	 * exit hasn't already been done.
+	 */
 	asi_session->state = ASI_SESSION_STATE_INACTIVE;
+	mb();
+	if (asi_session->asi == NULL) {
+		/* exit was already done */
+		return;
+	}
+	WARN_ON(asi_session->retry_abort);
+	WARN_ON(asi_session->task != current);
 	asi_session->asi = NULL;
 	asi_session->task = NULL;
 	asi_session->original_cr3 = 0;
+
+	/*
+	 * Reset abort_depth because some interrupt/exception handlers
+	 * (like the user page-fault handler) can schedule us out and so
+	 * exit isolation before abort_depth reaches 0.
+	 */
+	asi_session->abort_depth = 0;
 }
 EXPORT_SYMBOL(asi_exit);
+
+/*
+ * Functions to abort isolation. When address space isolation is active,
+ * these functions are used by interrupt/exception handlers to abort
+ * isolation.
+ *
+ * Common Case
+ * -----------
+ * asi_start_abort() is invoked at the beginning of the interrupt/exception
+ * handler. It aborts isolation by restoring the original CR3 value,
+ * increments the abort count, and moves the isolation state to "aborted"
+ * (ASI_SESSION_STATE_ABORTED). If the interrupt/exception is interrupted
+ * by another interrupt/exception then the new interrupt/exception will
+ * just increment the abort count.
+ *
+ * asi_finish_abort() is invoked at the end of the interrupt/exception
+ * handler. It decrements the abort count and if that count reaches zero
+ * then it invokes asi_exit() to exit isolation.
+ *
+ * Special Case When Entering Isolation
+ * ------------------------------------
+ * When entering isolation, asi_enter() will set cpu_asi_session.retry_abort
+ * while updating CR3 to the ASI page-table. This forces asi_start_abort()
+ * handlers to abort isolation even if isolation was already aborted. Also
+ * asi_finish_abort() will retain the aborted state and not exit isolation
+ * (no call to asi_exit()).
+ */
+unsigned long asi_start_abort(void)
+{
+	enum asi_session_state state;
+	unsigned long original_cr3;
+
+	state = this_cpu_read(cpu_asi_session.state);
+
+	switch (state) {
+
+	case ASI_SESSION_STATE_INACTIVE:
+		return 0;
+
+	case ASI_SESSION_STATE_ACTIVE:
+		original_cr3 = asi_restore_cr3();
+		this_cpu_write(cpu_asi_session.state,
+			       ASI_SESSION_STATE_ABORTED);
+		break;
+
+	case ASI_SESSION_STATE_ABORTED:
+		/*
+		 * In the normal case, if the session was already aborted
+		 * then CR3 has already been restored. However if retry_abort
+		 * is set then we restore CR3 again.
+		 */
+		if (this_cpu_read(cpu_asi_session.retry_abort))
+			original_cr3 = asi_restore_cr3();
+		else
+			original_cr3 = this_cpu_read(
+				cpu_asi_session.original_cr3);
+		break;
+	}
+
+	this_cpu_inc(cpu_asi_session.abort_depth);
+
+	return original_cr3;
+}
+
+void asi_finish_abort(void)
+{
+	enum asi_session_state state;
+
+	state = this_cpu_read(cpu_asi_session.state);
+	if (state == ASI_SESSION_STATE_INACTIVE)
+		return;
+
+	WARN_ON(state != ASI_SESSION_STATE_ABORTED);
+
+	/* if retry_abort is set then we retain the abort state */
+	if (this_cpu_dec_return(cpu_asi_session.abort_depth) > 0 ||
+	    this_cpu_read(cpu_asi_session.retry_abort))
+		return;
+
+	asi_exit(this_cpu_read(cpu_asi_session.asi));
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..bb363f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -14,6 +14,7 @@
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
+#include <asm/asi.h>
 
 #include "../workqueue_internal.h"
 #include "../smpboot.h"
@@ -2597,6 +2598,9 @@ static inline void finish_lock_switch(struct rq *rq)
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+	asi_abort();
+#endif
 	kcov_prepare_switch(prev);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
-- 
1.7.1


* [RFC v2 03/26] mm/asi: Handle page fault due to address space isolation
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 01/26] mm/x86: Introduce kernel address space isolation Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 02/26] mm/asi: Abort isolation on interrupt, exception and context switch Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 04/26] mm/asi: Functions to track buffers allocated for an ASI page-table Alexandre Chartre
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

When address space isolation is active, kernel page faults can occur
because data are not mapped in the ASI page-table. In such a case, log
information about the fault and report the page fault as handled. As
the page fault handler (like any exception handler) aborts isolation
and switches back to the full kernel page-table, the faulty instruction
will be retried using the full kernel address space.
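
As an illustration only, here is a user-space sketch of the "log each
faulting address once" policy described above. The names (struct
fault_log, fault_log_record, FAULT_LOG_SIZE) are hypothetical and are
not part of the patch; the kernel version additionally takes a spinlock
and prints with pr_info().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FAULT_LOG_SIZE 4	/* tiny on purpose; the patch uses 128 */

struct fault_log {
	unsigned long ip[FAULT_LOG_SIZE];	/* 0 marks a free slot */
};

/*
 * Record a faulting instruction pointer. Returns true when the caller
 * should print a message (first time this ip is seen, or the log is
 * full), and false when the ip was already recorded.
 */
static bool fault_log_record(struct fault_log *log, unsigned long ip)
{
	size_t i;

	for (i = 0; i < FAULT_LOG_SIZE; i++) {
		if (log->ip[i] == ip)
			return false;		/* known fault: stay quiet */
		if (log->ip[i] == 0) {
			log->ip[i] = ip;	/* new fault: remember it */
			return true;
		}
	}

	return true;	/* log full: unknown faults are always reported */
}
```

With this policy a hot faulting instruction produces a single log line,
while a full log degrades to reporting every unknown fault.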

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h |    7 ++++
 arch/x86/mm/asi.c          |   68 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/fault.c        |    7 ++++
 3 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index ff126e1..013d77a 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -9,9 +9,14 @@
 #include <linux/spinlock.h>
 #include <asm/pgtable.h>
 
+#define ASI_FAULT_LOG_SIZE	128
+
 struct asi {
 	spinlock_t		lock;		/* protect all attributes */
 	pgd_t			*pgd;		/* ASI page-table */
+	spinlock_t		fault_lock;	/* protect fault_log */
+	unsigned long		fault_log[ASI_FAULT_LOG_SIZE];
+	bool			fault_stack;	/* display stack of fault? */
 };
 
 /*
@@ -42,6 +47,8 @@ struct asi_session {
 extern void asi_destroy(struct asi *asi);
 extern int asi_enter(struct asi *asi);
 extern void asi_exit(struct asi *asi);
+extern bool asi_fault(struct pt_regs *regs, unsigned long error_code,
+		      unsigned long address);
 
 /*
  * Function to exit the current isolation. This is used to abort isolation
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index fabb923..717160d 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -9,6 +9,7 @@
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/printk.h>
+#include <linux/sched/debug.h>
 #include <linux/slab.h>
 
 #include <asm/asi.h>
@@ -18,6 +19,72 @@
 /* ASI sessions, one per cpu */
 DEFINE_PER_CPU_PAGE_ALIGNED(struct asi_session, cpu_asi_session);
 
+static void asi_log_fault(struct asi *asi, struct pt_regs *regs,
+			  unsigned long error_code, unsigned long address)
+{
+	int i = 0;
+
+	/*
+	 * Log information about the fault only if this is a fault
+	 * we don't know about yet (and the fault log is not full).
+	 */
+	spin_lock(&asi->fault_lock);
+	for (i = 0; i < ASI_FAULT_LOG_SIZE; i++) {
+		if (asi->fault_log[i] == regs->ip) {
+			spin_unlock(&asi->fault_lock);
+			return;
+		}
+		if (!asi->fault_log[i]) {
+			asi->fault_log[i] = regs->ip;
+			break;
+		}
+	}
+	spin_unlock(&asi->fault_lock);
+
+	if (i >= ASI_FAULT_LOG_SIZE)
+		pr_warn("ASI %p: fault log buffer is full [%d]\n", asi, i);
+
+	pr_info("ASI %p: PF#%d (%ld) at %pS on %px\n", asi, i,
+		error_code, (void *)regs->ip, (void *)address);
+
+	if (asi->fault_stack)
+		show_stack(NULL, (unsigned long *)regs->sp);
+}
+
+bool asi_fault(struct pt_regs *regs, unsigned long error_code,
+	       unsigned long address)
+{
+	struct asi_session *asi_session;
+
+	/*
+	 * If address space isolation was active when the fault occurred
+	 * then the page fault handler has already aborted the isolation
+	 * (exception handlers abort isolation very early) and switched
+	 * CR3 back to its original value.
+	 */
+
+	/*
+	 * If address space isolation is not active, or we have a fault
+	 * after isolation was aborted then this is a regular kernel fault,
+	 * and we don't handle it.
+	 */
+	asi_session = this_cpu_ptr(&cpu_asi_session);
+	if (asi_session->state == ASI_SESSION_STATE_INACTIVE)
+		return false;
+
+	WARN_ON(asi_session->state != ASI_SESSION_STATE_ABORTED);
+	WARN_ON(asi_session->abort_depth != 1);
+
+	/*
+	 * We have a fault while the cpu is using address space isolation.
+	 * Log the fault and report that we have handled the fault. This way,
+	 * the faulty instruction will be retried with no isolation.
+	 *
+	 */
+	asi_log_fault(asi_session->asi, regs, error_code, address);
+	return true;
+}
+
 static int asi_init_mapping(struct asi *asi)
 {
 	/*
@@ -43,6 +110,7 @@ struct asi *asi_create(void)
 
 	asi->pgd = page_address(page);
 	spin_lock_init(&asi->lock);
+	spin_lock_init(&asi->fault_lock);
 
 	err = asi_init_mapping(asi);
 	if (err)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 46df4c6..a405c43 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -29,6 +29,7 @@
 #include <asm/efi.h>			/* efi_recover_from_page_fault()*/
 #include <asm/desc.h>			/* store_idt(), ...		*/
 #include <asm/cpu_entry_area.h>		/* exception stack		*/
+#include <asm/asi.h>			/* asi_fault()			*/
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -1252,6 +1253,12 @@ static int fault_in_kernel_space(unsigned long address)
 	 */
 	WARN_ON_ONCE(hw_error_code & X86_PF_PK);
 
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+	/* Check if the fault occurs with address space isolation */
+	if (asi_fault(regs, hw_error_code, address))
+		return;
+#endif
+
 	/*
 	 * We can fault-in kernel-space virtual memory on-demand. The
 	 * 'reference' page table is init_mm.pgd.
-- 
1.7.1


* [RFC v2 04/26] mm/asi: Functions to track buffers allocated for an ASI page-table
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (2 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 03/26] mm/asi: Handle page fault due to address space isolation Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 05/26] mm/asi: Add ASI page-table entry offset functions Alexandre Chartre
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Add functions to track buffers allocated for an ASI page-table. An ASI
page-table can have direct references to the kernel page table, at
different levels (PGD, P4D, PUD, PMD). When freeing an ASI page-table,
we should make sure that we free parts actually allocated for the ASI
page-table, and not parts of the kernel page table referenced from the
ASI page-table. To do so, we will keep track of buffers when building
the ASI page-table.
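
As a rough user-space sketch of the bookkeeping idea (not the kernel
code): because a page address is page-aligned, its low bits are free to
carry the page-table level. The SKETCH_* constants and entry_*() helpers
below are hypothetical names, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL			/* assumed page size */
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

enum level { LEVEL_PTE, LEVEL_PMD, LEVEL_PUD, LEVEL_P4D, LEVEL_PGD };

/* Combine a page-aligned address and a level into one tracking entry. */
static uintptr_t entry_pack(uintptr_t page_addr, enum level level)
{
	return page_addr | (uintptr_t)level;
}

/* Recover the page address by clearing the low (in-page) bits. */
static uintptr_t entry_addr(uintptr_t entry)
{
	return entry & SKETCH_PAGE_MASK;
}

/* Recover the level stored in the low bits. */
static enum level entry_level(uintptr_t entry)
{
	return (enum level)(entry & ~SKETCH_PAGE_MASK);
}
```

Only entries recorded this way would be freed on teardown, so pages
that belong to the kernel page table are never touched.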

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h  |   26 +++++++++++
 arch/x86/mm/Makefile        |    2 +-
 arch/x86/mm/asi.c           |    3 +
 arch/x86/mm/asi_pagetable.c |   99 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 129 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/mm/asi_pagetable.c

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 013d77a..3d965e6 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -8,12 +8,35 @@
 
 #include <linux/spinlock.h>
 #include <asm/pgtable.h>
+#include <linux/xarray.h>
+
+enum page_table_level {
+	PGT_LEVEL_PTE,
+	PGT_LEVEL_PMD,
+	PGT_LEVEL_PUD,
+	PGT_LEVEL_P4D,
+	PGT_LEVEL_PGD
+};
 
 #define ASI_FAULT_LOG_SIZE	128
 
 struct asi {
 	spinlock_t		lock;		/* protect all attributes */
 	pgd_t			*pgd;		/* ASI page-table */
+
+	/*
+	 * An ASI page-table can have direct references to the full kernel
+	 * page-table, at different levels (PGD, P4D, PUD, PMD). When freeing
+	 * an ASI page-table, we should make sure that we free parts actually
+	 * allocated for the ASI page-table, and not part of the full kernel
+	 * page-table referenced from the ASI page-table.
+	 *
+	 * To do so, the backend_pages XArray is used to keep track of pages
+	 * used for the kernel isolation page-table.
+	 */
+	struct xarray		backend_pages;		/* page-table pages */
+	unsigned long		backend_pages_count;	/* pages count */
+
 	spinlock_t		fault_lock;	/* protect fault_log */
 	unsigned long		fault_log[ASI_FAULT_LOG_SIZE];
 	bool			fault_stack;	/* display stack of fault? */
@@ -43,6 +66,9 @@ struct asi_session {
 
 DECLARE_PER_CPU_PAGE_ALIGNED(struct asi_session, cpu_asi_session);
 
+void asi_init_backend(struct asi *asi);
+void asi_fini_backend(struct asi *asi);
+
 extern struct asi *asi_create(void);
 extern void asi_destroy(struct asi *asi);
 extern int asi_enter(struct asi *asi);
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index dae5c8a..b972f0f 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -49,7 +49,7 @@ obj-$(CONFIG_X86_INTEL_MPX)			+= mpx.o
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
-obj-$(CONFIG_ADDRESS_SPACE_ISOLATION)		+= asi.o
+obj-$(CONFIG_ADDRESS_SPACE_ISOLATION)		+= asi.o asi_pagetable.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 717160d..dfde245 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -111,6 +111,7 @@ struct asi *asi_create(void)
 	asi->pgd = page_address(page);
 	spin_lock_init(&asi->lock);
 	spin_lock_init(&asi->fault_lock);
+	asi_init_backend(asi);
 
 	err = asi_init_mapping(asi);
 	if (err)
@@ -132,6 +133,8 @@ void asi_destroy(struct asi *asi)
 	if (asi->pgd)
 		free_page((unsigned long)asi->pgd);
 
+	asi_fini_backend(asi);
+
 	kfree(asi);
 }
 EXPORT_SYMBOL(asi_destroy);
diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
new file mode 100644
index 0000000..7a8f791
--- /dev/null
+++ b/arch/x86/mm/asi_pagetable.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ *
+ */
+
+#include <asm/asi.h>
+
+/*
+ * Get the pointer to the beginning of a page table directory from a page
+ * table directory entry.
+ */
+#define ASI_BACKEND_PAGE_ALIGN(entry)	\
+	((typeof(entry))(((unsigned long)(entry)) & PAGE_MASK))
+
+/*
+ * Pages used to build the address space isolation page-table are stored
+ * in the backend_pages XArray. Each entry in the array is a logical OR
+ * of the page address and the page table level (PTE, PMD, PUD, P4D) this
+ * page is used for in the address space isolation page-table.
+ *
+ * As a page address is aligned with PAGE_SIZE, we have plenty of space
+ * for storing the page table level (which is a value between 0 and 4) in
+ * the low bits of the page address.
+ *
+ */
+
+#define ASI_BACKEND_PAGE_ENTRY(addr, level)	\
+	((typeof(addr))(((unsigned long)(addr)) | ((unsigned long)(level))))
+#define ASI_BACKEND_PAGE_ADDR(entry)		\
+	((void *)(((unsigned long)(entry)) & PAGE_MASK))
+#define ASI_BACKEND_PAGE_LEVEL(entry)		\
+	((enum page_table_level)(((unsigned long)(entry)) & ~PAGE_MASK))
+
+static int asi_add_backend_page(struct asi *asi, void *addr,
+				enum page_table_level level)
+{
+	unsigned long index;
+	void *old_entry;
+
+	if ((!addr) || ((unsigned long)addr) & ~PAGE_MASK)
+		return -EINVAL;
+
+	lockdep_assert_held(&asi->lock);
+	index = asi->backend_pages_count;
+
+	old_entry = xa_store(&asi->backend_pages, index,
+			     ASI_BACKEND_PAGE_ENTRY(addr, level),
+			     GFP_KERNEL);
+	if (xa_is_err(old_entry))
+		return xa_err(old_entry);
+	if (old_entry)
+		return -EBUSY;
+
+	asi->backend_pages_count++;
+
+	return 0;
+}
+
+void asi_init_backend(struct asi *asi)
+{
+	xa_init(&asi->backend_pages);
+}
+
+void asi_fini_backend(struct asi *asi)
+{
+	unsigned long index;
+	void *entry;
+
+	if (asi->backend_pages_count) {
+		xa_for_each(&asi->backend_pages, index, entry)
+			free_page((unsigned long)ASI_BACKEND_PAGE_ADDR(entry));
+	}
+}
+
+/*
+ * Check if an offset in the address space isolation page-table is valid,
+ * i.e. check that the offset is on a page effectively belonging to the
+ * address space isolation page-table.
+ */
+static bool asi_valid_offset(struct asi *asi, void *offset)
+{
+	unsigned long index;
+	void *addr, *entry;
+	bool valid;
+
+	addr = ASI_BACKEND_PAGE_ALIGN(offset);
+	valid = false;
+
+	lockdep_assert_held(&asi->lock);
+	xa_for_each(&asi->backend_pages, index, entry) {
+		if (ASI_BACKEND_PAGE_ADDR(entry) == addr) {
+			valid = true;
+			break;
+		}
+	}
+
+	return valid;
+}
-- 
1.7.1


* [RFC v2 05/26] mm/asi: Add ASI page-table entry offset functions
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (3 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 04/26] mm/asi: Functions to track buffers allocated for an ASI page-table Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 06/26] mm/asi: Add ASI page-table entry allocation functions Alexandre Chartre
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Add wrappers around the p4d/pud/pmd/pte offset kernel functions which
ensure that page-table pointers are in the specified ASI page-table.
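
A simplified user-space model of the check, for illustration: an entry
pointer is accepted only if the page it lives on was recorded as
belonging to the isolation page-table. struct tracked_pages,
valid_offset() and checked_offset() are hypothetical names; a flat array
stands in for the XArray, and 8-byte entries with 4 KiB pages are
assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL			/* assumed page size */
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

struct tracked_pages {
	uintptr_t page[8];	/* page-aligned addresses we own */
	size_t count;
};

/* True if entry_ptr lies on one of the tracked pages. */
static bool valid_offset(const struct tracked_pages *t, uintptr_t entry_ptr)
{
	uintptr_t page = entry_ptr & SKETCH_PAGE_MASK;
	size_t i;

	for (i = 0; i < t->count; i++) {
		if (t->page[i] == page)
			return true;
	}

	return false;
}

/*
 * Compute the address of entry 'index' in 'table' and return it only
 * if it belongs to a tracked page; return 0 otherwise.
 */
static uintptr_t checked_offset(const struct tracked_pages *t,
				uintptr_t table, size_t index)
{
	uintptr_t entry_ptr = table + index * sizeof(uint64_t);

	return valid_offset(t, entry_ptr) ? entry_ptr : 0;
}
```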

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/mm/asi_pagetable.c |   62 +++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index 7a8f791..a89e02e 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -97,3 +97,65 @@ static bool asi_valid_offset(struct asi *asi, void *offset)
 
 	return valid;
 }
+
+/*
+ * asi_pXX_offset() functions are equivalent to kernel pXX_offset()
+ * functions but, in addition, they ensure that page table pointers
+ * are in the kernel isolation page table. Otherwise an error is
+ * returned.
+ */
+
+static pte_t *asi_pte_offset(struct asi *asi, pmd_t *pmd, unsigned long addr)
+{
+	pte_t *pte;
+
+	pte = pte_offset_map(pmd, addr);
+	if (!asi_valid_offset(asi, pte)) {
+		pr_err("ASI %p: PTE %px not found\n", asi, pte);
+		return ERR_PTR(-EINVAL);
+	}
+
+	return pte;
+}
+
+static pmd_t *asi_pmd_offset(struct asi *asi, pud_t *pud, unsigned long addr)
+{
+	pmd_t *pmd;
+
+	pmd = pmd_offset(pud, addr);
+	if (!asi_valid_offset(asi, pmd)) {
+		pr_err("ASI %p: PMD %px not found\n", asi, pmd);
+		return ERR_PTR(-EINVAL);
+	}
+
+	return pmd;
+}
+
+static pud_t *asi_pud_offset(struct asi *asi, p4d_t *p4d, unsigned long addr)
+{
+	pud_t *pud;
+
+	pud = pud_offset(p4d, addr);
+	if (!asi_valid_offset(asi, pud)) {
+		pr_err("ASI %p: PUD %px not found\n", asi, pud);
+		return ERR_PTR(-EINVAL);
+	}
+
+	return pud;
+}
+
+static p4d_t *asi_p4d_offset(struct asi *asi, pgd_t *pgd, unsigned long addr)
+{
+	p4d_t *p4d;
+
+	p4d = p4d_offset(pgd, addr);
+	/*
+	 * p4d is the same as pgd if we don't have a 5-level page table.
+	 */
+	if ((p4d != (p4d_t *)pgd) && !asi_valid_offset(asi, p4d)) {
+		pr_err("ASI %p: P4D %px not found\n", asi, p4d);
+		return ERR_PTR(-EINVAL);
+	}
+
+	return p4d;
+}
-- 
1.7.1


* [RFC v2 06/26] mm/asi: Add ASI page-table entry allocation functions
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (4 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 05/26] mm/asi: Add ASI page-table entry offset functions Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 07/26] mm/asi: Add ASI page-table entry set functions Alexandre Chartre
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Add functions to allocate p4d/pud/pmd/pte pages for an ASI page-table
and keep track of them.
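
The allocate-on-demand pattern can be sketched in user space as
follows. This is only a model under stated assumptions:
get_or_alloc_table(), struct tracker and tracker_free_all() are
hypothetical names, aligned_alloc() stands in for alloc_page(), and a
plain pointer slot stands in for a page-table entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */
#define TRACKER_MAX 16

struct tracker {
	void *page[TRACKER_MAX];	/* pages allocated by us */
	size_t count;
};

/*
 * Return the table referenced by *slot. If the slot is empty, allocate
 * a zeroed page, record it in the tracker and install it in the slot.
 * Returns NULL on allocation failure or when the tracker is full.
 */
static void *get_or_alloc_table(struct tracker *t, void **slot)
{
	void *page;

	if (*slot)
		return *slot;		/* already populated */

	if (t->count == TRACKER_MAX)
		return NULL;

	page = aligned_alloc(SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
	if (!page)
		return NULL;
	memset(page, 0, SKETCH_PAGE_SIZE);

	t->page[t->count++] = page;	/* track before publishing */
	*slot = page;

	return page;
}

/* Free exactly the pages that were allocated through the tracker. */
static void tracker_free_all(struct tracker *t)
{
	while (t->count > 0)
		free(t->page[--t->count]);
}
```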

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/mm/asi_pagetable.c |  111 +++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 111 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index a89e02e..0fc6d59 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -4,6 +4,8 @@
  *
  */
 
+#include <linux/mm.h>
+
 #include <asm/asi.h>
 
 /*
@@ -159,3 +161,112 @@ static bool asi_valid_offset(struct asi *asi, void *offset)
 
 	return p4d;
 }
+
+/*
+ * asi_pXX_alloc() functions are equivalent to kernel pXX_alloc() functions
+ * but, in addition, they keep track of new pages allocated for the specified
+ * ASI.
+ */
+
+static pte_t *asi_pte_alloc(struct asi *asi, pmd_t *pmd, unsigned long addr)
+{
+	struct page *page;
+	pte_t *pte;
+	int err;
+
+	if (pmd_none(*pmd)) {
+		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+		if (!page)
+			return ERR_PTR(-ENOMEM);
+		pte = (pte_t *)page_address(page);
+		err = asi_add_backend_page(asi, pte, PGT_LEVEL_PTE);
+		if (err) {
+			free_page((unsigned long)pte);
+			return ERR_PTR(err);
+		}
+		set_pmd_safe(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
+		pte = pte_offset_map(pmd, addr);
+	} else {
+		pte = asi_pte_offset(asi, pmd, addr);
+	}
+
+	return pte;
+}
+
+static pmd_t *asi_pmd_alloc(struct asi *asi, pud_t *pud, unsigned long addr)
+{
+	struct page *page;
+	pmd_t *pmd;
+	int err;
+
+	if (pud_none(*pud)) {
+		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+		if (!page)
+			return ERR_PTR(-ENOMEM);
+		pmd = (pmd_t *)page_address(page);
+		err = asi_add_backend_page(asi, pmd, PGT_LEVEL_PMD);
+		if (err) {
+			free_page((unsigned long)pmd);
+			return ERR_PTR(err);
+		}
+		set_pud_safe(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
+		pmd = pmd_offset(pud, addr);
+	} else {
+		pmd = asi_pmd_offset(asi, pud, addr);
+	}
+
+	return pmd;
+}
+
+static pud_t *asi_pud_alloc(struct asi *asi, p4d_t *p4d, unsigned long addr)
+{
+	struct page *page;
+	pud_t *pud;
+	int err;
+
+	if (p4d_none(*p4d)) {
+		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+		if (!page)
+			return ERR_PTR(-ENOMEM);
+		pud = (pud_t *)page_address(page);
+		err = asi_add_backend_page(asi, pud, PGT_LEVEL_PUD);
+		if (err) {
+			free_page((unsigned long)pud);
+			return ERR_PTR(err);
+		}
+		set_p4d_safe(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
+		pud = pud_offset(p4d, addr);
+	} else {
+		pud = asi_pud_offset(asi, p4d, addr);
+	}
+
+	return pud;
+}
+
+static p4d_t *asi_p4d_alloc(struct asi *asi, pgd_t *pgd, unsigned long addr)
+{
+	struct page *page;
+	p4d_t *p4d;
+	int err;
+
+	if (!pgtable_l5_enabled())
+		return (p4d_t *)pgd;
+
+	if (pgd_none(*pgd)) {
+		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+		if (!page)
+			return ERR_PTR(-ENOMEM);
+		p4d = (p4d_t *)page_address(page);
+		err = asi_add_backend_page(asi, p4d, PGT_LEVEL_P4D);
+		if (err) {
+			free_page((unsigned long)p4d);
+			return ERR_PTR(err);
+		}
+		set_pgd_safe(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+		p4d = p4d_offset(pgd, addr);
+	} else {
+		p4d = asi_p4d_offset(asi, pgd, addr);
+	}
+
+	return p4d;
+}
-- 
1.7.1


* [RFC v2 07/26] mm/asi: Add ASI page-table entry set functions
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (5 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 06/26] mm/asi: Add ASI page-table entry allocation functions Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 08/26] mm/asi: Functions to populate an ASI page-table from a VA range Alexandre Chartre
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Add wrappers around the page table entry (pgd/p4d/pud/pmd) set
functions which check that an existing entry is not being
overwritten.
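
The rule these wrappers enforce can be modelled in a few lines of
user-space C. set_entry_once() is a hypothetical name, a plain 64-bit
word stands in for a page-table entry, and 0 plays the role of a "none"
entry.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Install 'value' in '*slot' unless the slot already holds a different
 * non-empty value. Writing the same value again is a harmless no-op.
 */
static int set_entry_once(uint64_t *slot, uint64_t value)
{
	if (*slot == value)
		return 0;		/* nothing to do */

	if (*slot != 0)
		return -EBUSY;		/* refuse to overwrite */

	*slot = value;

	return 0;
}
```

Mapping the same range twice is therefore idempotent, while a
conflicting mapping is reported instead of being silently replaced.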

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/mm/asi_pagetable.c |  124 +++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 124 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index 0fc6d59..e17af9e 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -270,3 +270,127 @@ static bool asi_valid_offset(struct asi *asi, void *offset)
 
 	return p4d;
 }
+
+/*
+ * asi_set_pXX() functions are equivalent to kernel set_pXX() functions
+ * but, in addition, they ensure that they are not overwriting an already
+ * existing reference in the page table. Otherwise an error is returned.
+ */
+static int asi_set_pte(struct asi *asi, pte_t *pte, pte_t pte_value)
+{
+#ifdef DEBUG
+	/*
+	 * The pte pointer should come from asi_pte_alloc() or asi_pte_offset()
+	 * both of which check if the pointer is in the kernel isolation page
+	 * table. So this is a paranoid check to ensure the pointer is really
+	 * in the kernel isolation page table.
+	 */
+	if (!asi_valid_offset(asi, pte)) {
+		pr_err("ASI %p: PTE %px not found\n", asi, pte);
+		return -EINVAL;
+	}
+#endif
+	set_pte(pte, pte_value);
+
+	return 0;
+}
+
+static int asi_set_pmd(struct asi *asi, pmd_t *pmd, pmd_t pmd_value)
+{
+#ifdef DEBUG
+	/*
+	 * The pmd pointer should come from asi_pmd_alloc() or asi_pmd_offset()
+	 * both of which check if the pointer is in the kernel isolation page
+	 * table. So this is a paranoid check to ensure the pointer is really
+	 * in the kernel isolation page table.
+	 */
+	if (!asi_valid_offset(asi, pmd)) {
+		pr_err("ASI %p: PMD %px not found\n", asi, pmd);
+		return -EINVAL;
+	}
+#endif
+	if (pmd_val(*pmd) == pmd_val(pmd_value))
+		return 0;
+
+	if (!pmd_none(*pmd)) {
+		pr_err("ASI %p: PMD %px overwriting %lx with %lx\n",
+		       asi, pmd, pmd_val(*pmd), pmd_val(pmd_value));
+		return -EBUSY;
+	}
+
+	set_pmd(pmd, pmd_value);
+
+	return 0;
+}
+
+static int asi_set_pud(struct asi *asi, pud_t *pud, pud_t pud_value)
+{
+#ifdef DEBUG
+	/*
+	 * The pud pointer should come from asi_pud_alloc() or asi_pud_offset()
+	 * both of which check if the pointer is in the kernel isolation page
+	 * table. So this is a paranoid check to ensure the pointer is really
+	 * in the kernel isolation page table.
+	 */
+	if (!asi_valid_offset(asi, pud)) {
+		pr_err("ASI %p: PUD %px not found\n", asi, pud);
+		return -EINVAL;
+	}
+#endif
+	if (pud_val(*pud) == pud_val(pud_value))
+		return 0;
+
+	if (!pud_none(*pud)) {
+		pr_err("ASI %p: PUD %px overwriting %lx with %lx\n",
+		       asi, pud, pud_val(*pud), pud_val(pud_value));
+		return -EBUSY;
+	}
+
+	set_pud(pud, pud_value);
+
+	return 0;
+}
+
+static int asi_set_p4d(struct asi *asi, p4d_t *p4d, p4d_t p4d_value)
+{
+#ifdef DEBUG
+	/*
+	 * The p4d pointer should come from asi_p4d_alloc() or asi_p4d_offset()
+	 * both of which check if the pointer is in the kernel isolation page
+	 * table. So this is a paranoid check to ensure the pointer is really
+	 * in the kernel isolation page table.
+	 */
+	if (!asi_valid_offset(asi, p4d)) {
+		pr_err("ASI %p: P4D %px not found\n", asi, p4d);
+		return -EINVAL;
+	}
+#endif
+	if (p4d_val(*p4d) == p4d_val(p4d_value))
+		return 0;
+
+	if (!p4d_none(*p4d)) {
+		pr_err("ASI %p: P4D %px overwriting %lx with %lx\n",
+		       asi, p4d, p4d_val(*p4d), p4d_val(p4d_value));
+		return -EBUSY;
+	}
+
+	set_p4d(p4d, p4d_value);
+
+	return 0;
+}
+
+static int asi_set_pgd(struct asi *asi, pgd_t *pgd, pgd_t pgd_value)
+{
+	if (pgd_val(*pgd) == pgd_val(pgd_value))
+		return 0;
+
+	if (!pgd_none(*pgd)) {
+		pr_err("ASI %p: PGD %px overwriting %lx with %lx\n",
+		       asi, pgd, pgd_val(*pgd), pgd_val(pgd_value));
+		return -EBUSY;
+	}
+
+	set_pgd(pgd, pgd_value);
+
+	return 0;
+}
-- 
1.7.1


* [RFC v2 08/26] mm/asi: Functions to populate an ASI page-table from a VA range
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (6 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 07/26] mm/asi: Add ASI page-table entry set functions Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 09/26] mm/asi: Helper functions to map module into ASI Alexandre Chartre
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Provide functions to copy page-table entries from the kernel page-table
to an ASI page-table for a specified VA range. These functions are based
on the copy_pxx_range() functions defined in mm/memory.c. A difference
is that a level parameter can be specified to indicate the page-table
level (PGD, P4D, PUD, PMD, PTE) at which the copy should be done. Also,
these functions don't rely on mm or vma, and they don't alter the source
page-table even if an entry is bad. Finally, the VA range start and size
don't need to be page-aligned.
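
To show why an unaligned range is acceptable, here is a user-space
sketch of the boundary arithmetic used by such a walk. addr_end()
mirrors the logic of the kernel's pXX_addr_end() helpers;
count_chunks() is a hypothetical helper that counts how many
entries of a given span (a power of two) a range touches.

```c
#include <assert.h>

/*
 * Return the end of the current 'span'-sized chunk containing 'addr',
 * clamped to 'end'. 'span' must be a power of two.
 */
static unsigned long addr_end(unsigned long addr, unsigned long end,
			      unsigned long span)
{
	unsigned long boundary = (addr + span) & ~(span - 1);

	/* The "- 1" keeps the comparison correct if boundary wraps to 0. */
	return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * Count the chunks visited when walking [addr, addr + size). Like the
 * do/while loops in the patch, the body runs at least once.
 */
static unsigned long count_chunks(unsigned long addr, unsigned long size,
				  unsigned long span)
{
	unsigned long end = addr + size;
	unsigned long next;
	unsigned long n = 0;

	do {
		next = addr_end(addr, end, span);
		n++;
		addr = next;
	} while (addr < end);

	return n;
}
```

For example, a 0x1000-byte range starting at 0x1800 straddles two 4 KiB
pages, so two PTEs are visited even though the size is a single page.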

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h  |    4 +
 arch/x86/mm/asi_pagetable.c |  205 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 209 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 3d965e6..19656aa 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -76,6 +76,10 @@ struct asi_session {
 extern bool asi_fault(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address);
 
+extern int asi_map_range(struct asi *asi, void *ptr, size_t size,
+			 enum page_table_level level);
+extern int asi_map(struct asi *asi, void *ptr, unsigned long size);
+
 /*
  * Function to exit the current isolation. This is used to abort isolation
  * when a task using isolation is scheduled out.
diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index e17af9e..0169395 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -394,3 +394,208 @@ static int asi_set_pgd(struct asi *asi, pgd_t *pgd, pgd_t pgd_value)
 
 	return 0;
 }
+
+static int asi_copy_pte_range(struct asi *asi, pmd_t *dst_pmd, pmd_t *src_pmd,
+			      unsigned long addr, unsigned long end)
+{
+	pte_t *src_pte, *dst_pte;
+
+	dst_pte = asi_pte_alloc(asi, dst_pmd, addr);
+	if (IS_ERR(dst_pte))
+		return PTR_ERR(dst_pte);
+
+	addr &= PAGE_MASK;
+	src_pte = pte_offset_map(src_pmd, addr);
+
+	do {
+		asi_set_pte(asi, dst_pte, *src_pte);
+
+	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr < end);
+
+	return 0;
+}
+
+static int asi_copy_pmd_range(struct asi *asi, pud_t *dst_pud, pud_t *src_pud,
+			      unsigned long addr, unsigned long end,
+			      enum page_table_level level)
+{
+	pmd_t *src_pmd, *dst_pmd;
+	unsigned long next;
+	int err;
+
+	dst_pmd = asi_pmd_alloc(asi, dst_pud, addr);
+	if (IS_ERR(dst_pmd))
+		return PTR_ERR(dst_pmd);
+
+	src_pmd = pmd_offset(src_pud, addr);
+
+	do {
+		next = pmd_addr_end(addr, end);
+		if (level == PGT_LEVEL_PMD || pmd_none(*src_pmd) ||
+		    pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
+			err = asi_set_pmd(asi, dst_pmd, *src_pmd);
+			if (err)
+				return err;
+			continue;
+		}
+
+		if (!pmd_present(*src_pmd)) {
+			pr_warn("ASI %p: PMD not present for [%lx,%lx]\n",
+				asi, addr, next - 1);
+			pmd_clear(dst_pmd);
+			continue;
+		}
+
+		err = asi_copy_pte_range(asi, dst_pmd, src_pmd, addr, next);
+		if (err) {
+			pr_err("ASI %p: PMD error copying PTE addr=%lx next=%lx\n",
+			       asi, addr, next);
+			return err;
+		}
+
+	} while (dst_pmd++, src_pmd++, addr = next, addr < end);
+
+	return 0;
+}
+
+static int asi_copy_pud_range(struct asi *asi, p4d_t *dst_p4d, p4d_t *src_p4d,
+			      unsigned long addr, unsigned long end,
+			      enum page_table_level level)
+{
+	pud_t *src_pud, *dst_pud;
+	unsigned long next;
+	int err;
+
+	dst_pud = asi_pud_alloc(asi, dst_p4d, addr);
+	if (IS_ERR(dst_pud))
+		return PTR_ERR(dst_pud);
+
+	src_pud = pud_offset(src_p4d, addr);
+
+	do {
+		next = pud_addr_end(addr, end);
+		if (level == PGT_LEVEL_PUD || pud_none(*src_pud) ||
+		    pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
+			err = asi_set_pud(asi, dst_pud, *src_pud);
+			if (err)
+				return err;
+			continue;
+		}
+
+		err = asi_copy_pmd_range(asi, dst_pud, src_pud, addr, next,
+					 level);
+		if (err) {
+			pr_err("ASI %p: PUD error copying PMD addr=%lx next=%lx\n",
+			       asi, addr, next);
+			return err;
+		}
+
+	} while (dst_pud++, src_pud++, addr = next, addr < end);
+
+	return 0;
+}
+
+static int asi_copy_p4d_range(struct asi *asi, pgd_t *dst_pgd, pgd_t *src_pgd,
+			      unsigned long addr, unsigned long end,
+			      enum page_table_level level)
+{
+	p4d_t *src_p4d, *dst_p4d;
+	unsigned long next;
+	int err;
+
+	dst_p4d = asi_p4d_alloc(asi, dst_pgd, addr);
+	if (IS_ERR(dst_p4d))
+		return PTR_ERR(dst_p4d);
+
+	src_p4d = p4d_offset(src_pgd, addr);
+
+	do {
+		next = p4d_addr_end(addr, end);
+		if (level == PGT_LEVEL_P4D || p4d_none(*src_p4d)) {
+			err = asi_set_p4d(asi, dst_p4d, *src_p4d);
+			if (err)
+				return err;
+			continue;
+		}
+
+		err = asi_copy_pud_range(asi, dst_p4d, src_p4d, addr, next,
+					 level);
+		if (err) {
+			pr_err("ASI %p: P4D error copying PUD addr=%lx next=%lx\n",
+			       asi, addr, next);
+			return err;
+		}
+
+	} while (dst_p4d++, src_p4d++, addr = next, addr < end);
+
+	return 0;
+}
+
+static int asi_copy_pgd_range(struct asi *asi,
+			      pgd_t *dst_pagetable, pgd_t *src_pagetable,
+			      unsigned long addr, unsigned long end,
+			      enum page_table_level level)
+{
+	pgd_t *src_pgd, *dst_pgd;
+	unsigned long next;
+	int err;
+
+	dst_pgd = pgd_offset_pgd(dst_pagetable, addr);
+	src_pgd = pgd_offset_pgd(src_pagetable, addr);
+
+	do {
+		next = pgd_addr_end(addr, end);
+		if (level == PGT_LEVEL_PGD || pgd_none(*src_pgd)) {
+			err = asi_set_pgd(asi, dst_pgd, *src_pgd);
+			if (err)
+				return err;
+			continue;
+		}
+
+		err = asi_copy_p4d_range(asi, dst_pgd, src_pgd, addr, next,
+					 level);
+		if (err) {
+			pr_err("ASI %p: PGD error copying P4D addr=%lx next=%lx\n",
+			       asi, addr, next);
+			return err;
+		}
+
+	} while (dst_pgd++, src_pgd++, addr = next, addr < end);
+
+	return 0;
+}
+
+/*
+ * Copy page table entries from the current page table (i.e. from the
+ * kernel page table) to the specified ASI page-table. The level
+ * parameter specifies the page-table level (PGD, P4D, PUD, PMD, PTE)
+ * at which the copy should be done.
+ */
+int asi_map_range(struct asi *asi, void *ptr, size_t size,
+		  enum page_table_level level)
+{
+	unsigned long addr = (unsigned long)ptr;
+	unsigned long end = addr + ((unsigned long)size);
+	unsigned long flags;
+	int err;
+
+	pr_debug("ASI %p: MAP %px/%lx/%d\n", asi, ptr, size, level);
+
+	spin_lock_irqsave(&asi->lock, flags);
+	err = asi_copy_pgd_range(asi, asi->pgd, current->mm->pgd,
+				 addr, end, level);
+	spin_unlock_irqrestore(&asi->lock, flags);
+
+	return err;
+}
+EXPORT_SYMBOL(asi_map_range);
+
+/*
+ * Copy page-table PTE entries from the current page-table to the
+ * specified ASI page-table.
+ */
+int asi_map(struct asi *asi, void *ptr, unsigned long size)
+{
+	return asi_map_range(asi, ptr, size, PGT_LEVEL_PTE);
+}
+EXPORT_SYMBOL(asi_map);
-- 
1.7.1



* [RFC v2 09/26] mm/asi: Helper functions to map module into ASI
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (7 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 08/26] mm/asi: Functions to populate an ASI page-table from a VA range Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 10/26] mm/asi: Keep track of VA ranges mapped in ASI page-table Alexandre Chartre
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Add helper functions to easily map a module into an ASI.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 19656aa..b5dbc49 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -6,6 +6,7 @@
 
 #ifndef __ASSEMBLY__
 
+#include <linux/module.h>
 #include <linux/spinlock.h>
 #include <asm/pgtable.h>
 #include <linux/xarray.h>
@@ -81,6 +82,26 @@ extern int asi_map_range(struct asi *asi, void *ptr, size_t size,
 extern int asi_map(struct asi *asi, void *ptr, unsigned long size);
 
 /*
+ * Copy the memory mapping for the current module. This is defined as a
+ * macro to ensure it is expanded in the module making the call so that
+ * THIS_MODULE has the correct value.
+ */
+#define ASI_MAP_THIS_MODULE(asi)			\
+	(asi_map(asi, THIS_MODULE->core_layout.base,	\
+		 THIS_MODULE->core_layout.size))
+
+static inline int asi_map_module(struct asi *asi, char *module_name)
+{
+	struct module *module;
+
+	module = find_module(module_name);
+	if (!module)
+		return -ESRCH;
+
+	return asi_map(asi, module->core_layout.base, module->core_layout.size);
+}
+
+/*
  * Function to exit the current isolation. This is used to abort isolation
  * when a task using isolation is scheduled out.
  */
-- 
1.7.1



* [RFC v2 10/26] mm/asi: Keep track of VA ranges mapped in ASI page-table
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (8 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 09/26] mm/asi: Helper functions to map module into ASI Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 11/26] mm/asi: Functions to clear ASI page-table entries for a VA range Alexandre Chartre
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Add functions to keep track of VA ranges mapped in an ASI page-table.
This will be used when unmapping to ensure the same range is unmapped,
at the same page-table level. This will also be used to handle mapping
and unmapping of overlapping VA ranges.
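
As an illustration of this bookkeeping, here is a minimal userspace
sketch. It is not the code from this patch: struct sketch_range and the
sketch_range_*() helpers are hypothetical names, a singly linked list
stands in for list_head, and locking is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for struct asi_range_mapping (hypothetical). */
struct sketch_range {
	struct sketch_range *next;
	void *ptr;	/* range start address */
	size_t size;	/* range size */
	int level;	/* page-table level used for the mapping */
};

/* Return the entry starting at ptr, or NULL if there is none. */
struct sketch_range *sketch_range_find(struct sketch_range *head, void *ptr)
{
	struct sketch_range *r;

	for (r = head; r; r = r->next) {
		if (r->ptr == ptr)
			return r;
	}
	return NULL;
}

/* Record a new range; return 0, or -1 if already tracked or out of memory. */
int sketch_range_add(struct sketch_range **head, void *ptr, size_t size,
		     int level)
{
	struct sketch_range *r;

	assert(size > 0);
	if (sketch_range_find(*head, ptr))
		return -1;

	r = malloc(sizeof(*r));
	if (!r)
		return -1;

	r->ptr = ptr;
	r->size = size;
	r->level = level;
	r->next = *head;
	*head = r;
	return 0;
}

/*
 * Forget the range starting at ptr. The recorded size and level are
 * handed back so the caller can unmap exactly what was mapped.
 */
int sketch_range_del(struct sketch_range **head, void *ptr, size_t *size,
		     int *level)
{
	struct sketch_range **pp, *r;

	for (pp = head; (r = *pp) != NULL; pp = &r->next) {
		if (r->ptr != ptr)
			continue;
		*size = r->size;
		*level = r->level;
		*pp = r->next;
		free(r);
		return 0;
	}
	return -1;
}
```

A caller records (ptr, size, level) at map time and later passes only
ptr to sketch_range_del(), which returns the size and level that were
recorded for that start address.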

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h  |    3 ++
 arch/x86/mm/asi.c           |    3 ++
 arch/x86/mm/asi_pagetable.c |   71 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 77 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index b5dbc49..be1c190 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -24,6 +24,7 @@ enum page_table_level {
 struct asi {
 	spinlock_t		lock;		/* protect all attributes */
 	pgd_t			*pgd;		/* ASI page-table */
+	struct list_head	mapping_list;	/* list of VA range mapping */
 
 	/*
 	 * An ASI page-table can have direct references to the full kernel
@@ -69,6 +70,8 @@ struct asi_session {
 
 void asi_init_backend(struct asi *asi);
 void asi_fini_backend(struct asi *asi);
+void asi_init_range_mapping(struct asi *asi);
+void asi_fini_range_mapping(struct asi *asi);
 
 extern struct asi *asi_create(void);
 extern void asi_destroy(struct asi *asi);
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index dfde245..25633a6 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -104,6 +104,8 @@ struct asi *asi_create(void)
 	if (!asi)
 		return NULL;
 
+	asi_init_range_mapping(asi);
+
 	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!page)
 		goto error;
@@ -133,6 +135,7 @@ void asi_destroy(struct asi *asi)
 	if (asi->pgd)
 		free_page((unsigned long)asi->pgd);
 
+	asi_fini_range_mapping(asi);
 	asi_fini_backend(asi);
 
 	kfree(asi);
diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index 0169395..a09a22d 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -5,10 +5,21 @@
  */
 
 #include <linux/mm.h>
+#include <linux/slab.h>
 
 #include <asm/asi.h>
 
 /*
+ * Structure to keep track of address ranges mapped into an ASI.
+ */
+struct asi_range_mapping {
+	struct list_head list;
+	void *ptr;			/* range start address */
+	size_t size;			/* range size */
+	enum page_table_level level;	/* mapping level */
+};
+
+/*
  * Get the pointer to the beginning of a page table directory from a page
  * table directory entry.
  */
@@ -75,6 +86,39 @@ void asi_fini_backend(struct asi *asi)
 	}
 }
 
+void asi_init_range_mapping(struct asi *asi)
+{
+	INIT_LIST_HEAD(&asi->mapping_list);
+}
+
+void asi_fini_range_mapping(struct asi *asi)
+{
+	struct asi_range_mapping *range, *range_next;
+
+	list_for_each_entry_safe(range, range_next, &asi->mapping_list, list) {
+		list_del(&range->list);
+		kfree(range);
+	}
+}
+
+/*
+ * Return the range mapping starting at the specified address, or NULL if
+ * no such range is found.
+ */
+static struct asi_range_mapping *asi_get_range_mapping(struct asi *asi,
+						       void *ptr)
+{
+	struct asi_range_mapping *range;
+
+	lockdep_assert_held(&asi->lock);
+	list_for_each_entry(range, &asi->mapping_list, list) {
+		if (range->ptr == ptr)
+			return range;
+	}
+
+	return NULL;
+}
+
 /*
  * Check if an offset in the address space isolation page-table is valid,
  * i.e. check that the offset is on a page effectively belonging to the
@@ -574,6 +618,7 @@ static int asi_copy_pgd_range(struct asi *asi,
 int asi_map_range(struct asi *asi, void *ptr, size_t size,
 		  enum page_table_level level)
 {
+	struct asi_range_mapping *range_mapping;
 	unsigned long addr = (unsigned long)ptr;
 	unsigned long end = addr + ((unsigned long)size);
 	unsigned long flags;
@@ -582,8 +627,34 @@ int asi_map_range(struct asi *asi, void *ptr, size_t size,
 	pr_debug("ASI %p: MAP %px/%lx/%d\n", asi, ptr, size, level);
 
 	spin_lock_irqsave(&asi->lock, flags);
+
+	/* check if the range is already mapped */
+	range_mapping = asi_get_range_mapping(asi, ptr);
+	if (range_mapping) {
+		pr_debug("ASI %p: MAP %px/%lx/%d already mapped\n",
+			 asi, ptr, size, level);
+		err = -EBUSY;
+		goto done;
+	}
+
+	/* map new range */
+	range_mapping = kmalloc(sizeof(*range_mapping), GFP_KERNEL);
+	if (!range_mapping) {
+		err = -ENOMEM;
+		goto done;
+	}
+
 	err = asi_copy_pgd_range(asi, asi->pgd, current->mm->pgd,
 				 addr, end, level);
+	if (err)
+		goto done;
+
+	INIT_LIST_HEAD(&range_mapping->list);
+	range_mapping->ptr = ptr;
+	range_mapping->size = size;
+	range_mapping->level = level;
+	list_add(&range_mapping->list, &asi->mapping_list);
+done:
 	spin_unlock_irqrestore(&asi->lock, flags);
 
 	return err;
-- 
1.7.1



* [RFC v2 11/26] mm/asi: Functions to clear ASI page-table entries for a VA range
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (9 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 10/26] mm/asi: Keep track of VA ranges mapped in ASI page-table Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 12/26] mm/asi: Function to copy page-table entries for percpu buffer Alexandre Chartre
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Provide functions to clear page-table entries in the ASI page-table for
a specified VA range. The functions also check that the clearing actually
happens in the ASI page-table and does not cross the ASI page-table
boundary (through references to the kernel page table), so that the
kernel page table is not modified by mistake.

As information (address, size, page-table level) about VA ranges mapped
to the ASI page-table is tracked, clearing only requires specifying the
start address of the range.
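
The clear functions in this patch walk the VA range one entry at a
time, using the *_addr_end() helpers to stop at each entry boundary.
The userspace sketch below shows that walk in isolation. The SKETCH_*
sizes assume x86-64 with 4 KiB pages and 2 MiB PMD entries, and the
sketch_*() names are made up for the example.

```c
#include <assert.h>

/* Illustrative sizes (assumption: x86-64, 4 KiB pages, 2 MiB PMDs). */
#define SKETCH_PAGE_SIZE	(1UL << 12)
#define SKETCH_PMD_SIZE		(1UL << 21)

/* Same idea as pmd_addr_end(): next entry boundary, capped at end. */
unsigned long sketch_addr_end(unsigned long addr, unsigned long end,
			      unsigned long size)
{
	unsigned long boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Count the entries of the given size that the range [addr, end) touches. */
unsigned int sketch_count_entries(unsigned long addr, unsigned long end,
				  unsigned long size)
{
	unsigned long next;
	unsigned int count = 0;

	assert(addr < end);
	do {
		next = sketch_addr_end(addr, end, size);
		count++;	/* the real code clears or descends here */
	} while (addr = next, addr < end);

	return count;
}
```

For instance, a two-page range that straddles a 2 MiB boundary touches
two PMD entries even though it is only 8 KiB long, which is why each
level has to be walked entry by entry.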

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h  |    1 +
 arch/x86/mm/asi_pagetable.c |  134 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 135 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index be1c190..919129f 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -83,6 +83,7 @@ extern bool asi_fault(struct pt_regs *regs, unsigned long error_code,
 extern int asi_map_range(struct asi *asi, void *ptr, size_t size,
 			 enum page_table_level level);
 extern int asi_map(struct asi *asi, void *ptr, unsigned long size);
+extern void asi_unmap(struct asi *asi, void *ptr);
 
 /*
  * Copy the memory mapping for the current module. This is defined as a
diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index a09a22d..7aee236 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -670,3 +670,137 @@ int asi_map(struct asi *asi, void *ptr, unsigned long size)
 	return asi_map_range(asi, ptr, size, PGT_LEVEL_PTE);
 }
 EXPORT_SYMBOL(asi_map);
+
+static void asi_clear_pte_range(struct asi *asi, pmd_t *pmd,
+				unsigned long addr, unsigned long end)
+{
+	pte_t *pte;
+
+	pte = asi_pte_offset(asi, pmd, addr);
+	if (IS_ERR(pte))
+		return;
+
+	do {
+		pte_clear(NULL, addr, pte);
+	} while (pte++, addr += PAGE_SIZE, addr < end);
+}
+
+static void asi_clear_pmd_range(struct asi *asi, pud_t *pud,
+				unsigned long addr, unsigned long end,
+				enum page_table_level level)
+{
+	unsigned long next;
+	pmd_t *pmd;
+
+	pmd = asi_pmd_offset(asi, pud, addr);
+	if (IS_ERR(pmd))
+		return;
+
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(*pmd) || pmd_present(*pmd))
+			continue;
+		if (level == PGT_LEVEL_PMD || pmd_trans_huge(*pmd) ||
+		    pmd_devmap(*pmd)) {
+			pmd_clear(pmd);
+			continue;
+		}
+		asi_clear_pte_range(asi, pmd, addr, next);
+	} while (pmd++, addr = next, addr < end);
+}
+
+static void asi_clear_pud_range(struct asi *asi, p4d_t *p4d,
+				unsigned long addr, unsigned long end,
+				enum page_table_level level)
+{
+	unsigned long next;
+	pud_t *pud;
+
+	pud = asi_pud_offset(asi, p4d, addr);
+	if (IS_ERR(pud))
+		return;
+
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none(*pud))
+			continue;
+		if (level == PGT_LEVEL_PUD || pud_trans_huge(*pud) ||
+		    pud_devmap(*pud)) {
+			pud_clear(pud);
+			continue;
+		}
+		asi_clear_pmd_range(asi, pud, addr, next, level);
+	} while (pud++, addr = next, addr < end);
+}
+
+static void asi_clear_p4d_range(struct asi *asi, pgd_t *pgd,
+				unsigned long addr, unsigned long end,
+				enum page_table_level level)
+{
+	unsigned long next;
+	p4d_t *p4d;
+
+	p4d = asi_p4d_offset(asi, pgd, addr);
+	if (IS_ERR(p4d))
+		return;
+
+	do {
+		next = p4d_addr_end(addr, end);
+		if (p4d_none(*p4d))
+			continue;
+		if (level == PGT_LEVEL_P4D) {
+			p4d_clear(p4d);
+			continue;
+		}
+		asi_clear_pud_range(asi, p4d, addr, next, level);
+	} while (p4d++, addr = next, addr < end);
+}
+
+static void asi_clear_pgd_range(struct asi *asi, pgd_t *pagetable,
+				unsigned long addr, unsigned long end,
+				enum page_table_level level)
+{
+	unsigned long next;
+	pgd_t *pgd;
+
+	pgd = pgd_offset_pgd(pagetable, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(*pgd))
+			continue;
+		if (level == PGT_LEVEL_PGD) {
+			pgd_clear(pgd);
+			continue;
+		}
+		asi_clear_p4d_range(asi, pgd, addr, next, level);
+	} while (pgd++, addr = next, addr < end);
+}
+
+/*
+ * Clear page table entries in the specified ASI page-table.
+ */
+void asi_unmap(struct asi *asi, void *ptr)
+{
+	struct asi_range_mapping *range_mapping;
+	unsigned long addr, end;
+	unsigned long flags;
+
+	spin_lock_irqsave(&asi->lock, flags);
+
+	range_mapping = asi_get_range_mapping(asi, ptr);
+	if (!range_mapping) {
+		pr_debug("ASI %p: UNMAP %px - not mapped\n", asi, ptr);
+		goto done;
+	}
+
+	addr = (unsigned long)range_mapping->ptr;
+	end = addr + range_mapping->size;
+	pr_debug("ASI %p: UNMAP %px/%lx/%d\n", asi, ptr,
+		 range_mapping->size, range_mapping->level);
+	asi_clear_pgd_range(asi, asi->pgd, addr, end, range_mapping->level);
+	list_del(&range_mapping->list);
+	kfree(range_mapping);
+done:
+	spin_unlock_irqrestore(&asi->lock, flags);
+}
+EXPORT_SYMBOL(asi_unmap);
-- 
1.7.1



* [RFC v2 12/26] mm/asi: Function to copy page-table entries for percpu buffer
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (10 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 11/26] mm/asi: Functions to clear ASI page-table entries for a VA range Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 13/26] mm/asi: Add asi_remap() function Alexandre Chartre
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Provide functions to copy page-table entries from the kernel page-table
to an ASI page-table for a percpu buffer. A percpu buffer has a different
VA range for each cpu and all of them have to be copied.
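
The all-or-nothing behaviour of the percpu mapping can be sketched in
userspace as follows. This is only a model: an array of flags stands in
for the ASI page-table, SKETCH_NR_CPUS is an arbitrary CPU count, and
sketch_fail_cpu is a hook used to force a failure.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 4	/* hypothetical CPU count */

/* Per-CPU "mapped" state standing in for the ASI page-table. */
bool sketch_mapped[SKETCH_NR_CPUS];

/* CPU whose mapping should fail, or -1 for none (test hook). */
int sketch_fail_cpu = -1;

/* Map the copy of the buffer belonging to one CPU. */
int sketch_map_one(int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	if (cpu == sketch_fail_cpu)
		return -1;
	sketch_mapped[cpu] = true;
	return 0;
}

/* Unmapping a CPU copy that was never mapped is a harmless no-op. */
void sketch_unmap_all(void)
{
	int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sketch_mapped[cpu] = false;
}

/* Map the buffer for every CPU, or leave nothing mapped on failure. */
int sketch_map_all(void)
{
	int cpu, err;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		err = sketch_map_one(cpu);
		if (err) {
			/* undo the mappings that succeeded so far */
			sketch_unmap_all();
			return err;
		}
	}
	return 0;
}
```

The rollback can simply iterate over all CPUs because unmapping a range
that is not tracked does nothing, which matches how asi_unmap() treats
an unknown start address.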

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h  |    6 ++++++
 arch/x86/mm/asi_pagetable.c |   38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 919129f..912b6a7 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -105,6 +105,12 @@ static inline int asi_map_module(struct asi *asi, char *module_name)
 	return asi_map(asi, module->core_layout.base, module->core_layout.size);
 }
 
+#define	ASI_MAP_CPUVAR(asi, cpuvar)	\
+	asi_map_percpu(asi, &cpuvar, sizeof(cpuvar))
+
+extern int asi_map_percpu(struct asi *asi, void *percpu_ptr, size_t size);
+extern void asi_unmap_percpu(struct asi *asi, void *percpu_ptr);
+
 /*
  * Function to exit the current isolation. This is used to abort isolation
  * when a task using isolation is scheduled out.
diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index 7aee236..a4fe867 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -804,3 +804,41 @@ void asi_unmap(struct asi *asi, void *ptr)
 	spin_unlock_irqrestore(&asi->lock, flags);
 }
 EXPORT_SYMBOL(asi_unmap);
+
+void asi_unmap_percpu(struct asi *asi, void *percpu_ptr)
+{
+	void *ptr;
+	int cpu;
+
+	pr_debug("ASI %p: UNMAP PERCPU %px\n", asi, percpu_ptr);
+	for_each_possible_cpu(cpu) {
+		ptr = per_cpu_ptr(percpu_ptr, cpu);
+		pr_debug("ASI %p: UNMAP PERCPU%d %px\n", asi, cpu, ptr);
+		asi_unmap(asi, ptr);
+	}
+}
+EXPORT_SYMBOL(asi_unmap_percpu);
+
+int asi_map_percpu(struct asi *asi, void *percpu_ptr, size_t size)
+{
+	int cpu, err;
+	void *ptr;
+
+	pr_debug("ASI %p: MAP PERCPU %px\n", asi, percpu_ptr);
+	for_each_possible_cpu(cpu) {
+		ptr = per_cpu_ptr(percpu_ptr, cpu);
+		pr_debug("ASI %p: MAP PERCPU%d %px\n", asi, cpu, ptr);
+		err = asi_map(asi, ptr, size);
+		if (err) {
+			/*
+			 * Need to unmap any percpu mapping which has
+			 * succeeded before the failure.
+			 */
+			asi_unmap_percpu(asi, percpu_ptr);
+			return err;
+		}
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(asi_map_percpu);
-- 
1.7.1



* [RFC v2 13/26] mm/asi: Add asi_remap() function
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (11 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 12/26] mm/asi: Function to copy page-table entries for percpu buffer Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 14/26] mm/asi: Handle ASI mapped range leaks and overlaps Alexandre Chartre
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Add a function to remap an already mapped buffer with a new address
in an ASI page-table: the already mapped buffer is unmapped, and a
new mapping is added for the specified new address.

This is useful to track and remap a buffer which can be freed and
then reallocated.
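
To make the caller-side contract concrete, here is a small userspace
model of the remap logic. The control flow follows asi_remap() from
this patch, but the one-slot "page-table", sketch_map(), sketch_unmap()
and the sketch_map_fails hook are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* One-slot stand-in for the ASI page-table (hypothetical). */
void *sketch_slot;

/* When non-zero, sketch_map() fails (test hook). */
int sketch_map_fails;

int sketch_map(void *ptr)
{
	assert(ptr != NULL);
	if (sketch_map_fails || sketch_slot)
		return -1;
	sketch_slot = ptr;
	return 0;
}

void sketch_unmap(void *ptr)
{
	if (sketch_slot == ptr)
		sketch_slot = NULL;
}

/* Same control flow as asi_remap(), applied to the one-slot model. */
int sketch_remap(void **current_ptrp, void *new_ptr)
{
	void *current_ptr = *current_ptrp;
	int err;

	if (current_ptr == new_ptr)
		return 0;	/* no change, already mapped */

	if (current_ptr) {
		sketch_unmap(current_ptr);
		*current_ptrp = NULL;
	}

	err = sketch_map(new_ptr);
	if (err)
		return err;	/* old mapping is gone, nothing is tracked */

	*current_ptrp = new_ptr;
	return 0;
}
```

Note that when the new mapping fails, the old one has already been
removed and the tracking pointer is left NULL, so the caller ends up
with no mapping at all rather than the previous one.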

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h  |    1 +
 arch/x86/mm/asi_pagetable.c |   25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 912b6a7..cf5d198 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -84,6 +84,7 @@ extern int asi_map_range(struct asi *asi, void *ptr, size_t size,
 			 enum page_table_level level);
 extern int asi_map(struct asi *asi, void *ptr, unsigned long size);
 extern void asi_unmap(struct asi *asi, void *ptr);
+extern int asi_remap(struct asi *asi, void **mapping, void *ptr, size_t size);
 
 /*
  * Copy the memory mapping for the current module. This is defined as a
diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index a4fe867..1ff0c47 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -842,3 +842,28 @@ int asi_map_percpu(struct asi *asi, void *percpu_ptr, size_t size)
 	return 0;
 }
 EXPORT_SYMBOL(asi_map_percpu);
+
+int asi_remap(struct asi *asi, void **current_ptrp, void *new_ptr, size_t size)
+{
+	void *current_ptr = *current_ptrp;
+	int err;
+
+	if (current_ptr == new_ptr) {
+		/* no change, already mapped */
+		return 0;
+	}
+
+	if (current_ptr) {
+		asi_unmap(asi, current_ptr);
+		*current_ptrp = NULL;
+	}
+
+	err = asi_map(asi, new_ptr, size);
+	if (err)
+		return err;
+
+	*current_ptrp = new_ptr;
+
+	return 0;
+}
+EXPORT_SYMBOL(asi_remap);
-- 
1.7.1



* [RFC v2 14/26] mm/asi: Handle ASI mapped range leaks and overlaps
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (12 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 13/26] mm/asi: Add asi_remap() function Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 15/26] mm/asi: Initialize the ASI page-table with core mappings Alexandre Chartre
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

When mapping a buffer in an ASI page-table, data around the buffer can
also be mapped if the entire buffer is not aligned with the page directory
size used for the mapping. So, data can potentially leak into the ASI
page-table. In such a case, print a warning that data are leaking.

Also, data actually mapped can overlap with an already mapped buffer.
This is not an issue when mapping data but, when unmapping, make sure
data from another buffer don't get unmapped as a side effect.
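
As a worked example of the leak and overlap arithmetic, the userspace
sketch below rounds a buffer out to the directory size used for the
mapping. The sketch_*() helpers and SKETCH_PAGE_SIZE (4 KiB assumed)
are illustrative names, not part of the patch.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE (1UL << 12)	/* assumption: 4 KiB pages */

/* Half-open VA span [start, end). */
struct sketch_span {
	unsigned long start;
	unsigned long end;
};

/* Round x down/up to a multiple of size (size must be a power of two). */
unsigned long sketch_round_down(unsigned long x, unsigned long size)
{
	return x & ~(size - 1);
}

unsigned long sketch_round_up(unsigned long x, unsigned long size)
{
	return (x + size - 1) & ~(size - 1);
}

/* Span actually mapped for a buffer of len bytes at addr. */
struct sketch_span sketch_mapped_span(unsigned long addr, unsigned long len,
				      unsigned long size)
{
	struct sketch_span span;

	assert(size != 0 && (size & (size - 1)) == 0);
	span.start = sketch_round_down(addr, size);
	span.end = sketch_round_up(addr + len, size);
	return span;
}

/* Two half-open spans overlap iff each starts before the other ends. */
int sketch_overlaps(struct sketch_span a, struct sketch_span b)
{
	return a.start < b.end && b.start < a.end;
}
```

With 4 KiB entries, a 0x200 byte buffer at 0x1100 is mapped as
0x1000-0x2000, so 0x100 bytes before it and 0xd00 bytes after it are
exposed as well. A second buffer at 0x1800 lands in the same page and
therefore overlaps, while one starting at 0x2000 does not.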

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/mm/asi_pagetable.c |  230 +++++++++++++++++++++++++++++++++++++++----
 1 files changed, 212 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index 1ff0c47..f1ee65b 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -9,6 +9,14 @@
 
 #include <asm/asi.h>
 
+static unsigned long page_directory_size[] = {
+	[PGT_LEVEL_PTE] = PAGE_SIZE,
+	[PGT_LEVEL_PMD] = PMD_SIZE,
+	[PGT_LEVEL_PUD] = PUD_SIZE,
+	[PGT_LEVEL_P4D] = P4D_SIZE,
+	[PGT_LEVEL_PGD] = PGDIR_SIZE,
+};
+
 /*
  * Structure to keep track of address ranges mapped into an ASI.
  */
@@ -17,8 +25,16 @@ struct asi_range_mapping {
 	void *ptr;			/* range start address */
 	size_t size;			/* range size */
 	enum page_table_level level;	/* mapping level */
+	int overlap;			/* overlap count */
 };
 
+#define ASI_RANGE_MAP_ADDR(r)	\
+	round_down((unsigned long)((r)->ptr), page_directory_size[(r)->level])
+
+#define ASI_RANGE_MAP_END(r)	\
+	round_up((unsigned long)((r)->ptr + (r)->size), \
+		 page_directory_size[(r)->level])
+
 /*
  * Get the pointer to the beginning of a page table directory from a page
  * table directory entry.
@@ -609,6 +625,71 @@ static int asi_copy_pgd_range(struct asi *asi,
 	return 0;
 }
 
+
+/*
+ * Map a VA range, taking into account any overlap with already mapped
+ * VA ranges. On error, return < 0. Otherwise return the number of
+ * ranges the specified range is overlapping with.
+ */
+static int asi_map_overlap(struct asi *asi, void *ptr, size_t size,
+			   enum page_table_level level)
+{
+	unsigned long map_addr, map_end;
+	unsigned long addr, end;
+	struct asi_range_mapping *range;
+	bool need_mapping;
+	int err, overlap;
+
+	addr = (unsigned long)ptr;
+	end = addr + (unsigned long)size;
+	need_mapping = true;
+	overlap = 0;
+
+	lockdep_assert_held(&asi->lock);
+	list_for_each_entry(range, &asi->mapping_list, list) {
+
+		if (range->ptr == ptr && range->size == size) {
+			/* we are mapping the same range again */
+			pr_debug("ASI %p: MAP %px/%lx/%d already mapped\n",
+				 asi, ptr, size, level);
+			return -EBUSY;
+		}
+
+		/* check overlap with mapped range */
+		map_addr = ASI_RANGE_MAP_ADDR(range);
+		map_end = ASI_RANGE_MAP_END(range);
+		if (end <= map_addr || addr >= map_end) {
+			/* no overlap, continue */
+			continue;
+		}
+
+		pr_debug("ASI %p: MAP %px/%lx/%d overlaps with %px/%lx/%d\n",
+			 asi, ptr, size, level,
+			 range->ptr, range->size, range->level);
+		range->overlap++;
+		overlap++;
+
+		/*
+		 * Check if new range is included into an existing range.
+		 * If so then the new range is already entirely mapped.
+		 */
+		if (addr >= map_addr && end <= map_end) {
+			pr_debug("ASI %p: MAP %px/%lx/%d implicitly mapped\n",
+				 asi, ptr, size, level);
+			need_mapping = false;
+		}
+	}
+
+	if (need_mapping) {
+		err = asi_copy_pgd_range(asi, asi->pgd, current->mm->pgd,
+					 addr, end, level);
+		if (err)
+			return err;
+	}
+
+	return overlap;
+}
+
 /*
  * Copy page table entries from the current page table (i.e. from the
  * kernel page table) to the specified ASI page-table. The level
@@ -619,44 +700,53 @@ int asi_map_range(struct asi *asi, void *ptr, size_t size,
 		  enum page_table_level level)
 {
 	struct asi_range_mapping *range_mapping;
+	unsigned long page_dir_size = page_directory_size[level];
 	unsigned long addr = (unsigned long)ptr;
 	unsigned long end = addr + ((unsigned long)size);
+	unsigned long map_addr, map_end;
 	unsigned long flags;
-	int err;
+	int err, overlap;
+
+	map_addr = round_down(addr, page_dir_size);
+	map_end = round_up(end, page_dir_size);
 
-	pr_debug("ASI %p: MAP %px/%lx/%d\n", asi, ptr, size, level);
+	pr_debug("ASI %p: MAP %px/%lx/%d -> %lx-%lx\n", asi, ptr, size, level,
+		 map_addr, map_end);
+	if (map_addr < addr)
+		pr_debug("ASI %p: MAP LEAK %lx-%lx\n", asi, map_addr, addr);
+	if (map_end > end)
+		pr_debug("ASI %p: MAP LEAK %lx-%lx\n", asi, end, map_end);
 
 	spin_lock_irqsave(&asi->lock, flags);
 
-	/* check if the range is already mapped */
-	range_mapping = asi_get_range_mapping(asi, ptr);
-	if (range_mapping) {
-		pr_debug("ASI %p: MAP %px/%lx/%d already mapped\n",
-			 asi, ptr, size, level);
-		err = -EBUSY;
-		goto done;
+	/*
+	 * Map the new range, taking overlap with already mapped ranges
+	 * into account.
+	 */
+	overlap = asi_map_overlap(asi, ptr, size, level);
+	if (overlap < 0) {
+		err = overlap;
+		goto error;
 	}
 
-	/* map new range */
+	/* add new range */
 	range_mapping = kmalloc(sizeof(*range_mapping), GFP_KERNEL);
 	if (!range_mapping) {
 		err = -ENOMEM;
-		goto done;
+		goto error;
 	}
 
-	err = asi_copy_pgd_range(asi, asi->pgd, current->mm->pgd,
-				 addr, end, level);
-	if (err)
-		goto done;
-
 	INIT_LIST_HEAD(&range_mapping->list);
 	range_mapping->ptr = ptr;
 	range_mapping->size = size;
 	range_mapping->level = level;
+	range_mapping->overlap = overlap;
 	list_add(&range_mapping->list, &asi->mapping_list);
-done:
 	spin_unlock_irqrestore(&asi->lock, flags);
+	return 0;
 
+error:
+	spin_unlock_irqrestore(&asi->lock, flags);
 	return err;
 }
 EXPORT_SYMBOL(asi_map_range);
@@ -776,6 +866,110 @@ static void asi_clear_pgd_range(struct asi *asi, pgd_t *pagetable,
 	} while (pgd++, addr = next, addr < end);
 }
 
+
+/*
+ * Unmap a VA range, taking into account any overlap with other mapped
+ * VA ranges. This unmaps the specified range, then remaps any range this
+ * range was overlapping with.
+ */
+static void asi_unmap_overlap(struct asi *asi, struct asi_range_mapping *range)
+{
+	unsigned long map_addr, map_end;
+	struct asi_range_mapping *r;
+	unsigned long addr, end;
+	unsigned long r_addr;
+	bool need_unmapping;
+	int err, overlap;
+
+	addr = (unsigned long)range->ptr;
+	end = addr + (unsigned long)range->size;
+	overlap = range->overlap;
+	need_unmapping = true;
+
+	lockdep_assert_held(&asi->lock);
+
+	/*
+	 * Adjust overlap information and check if range effectively needs
+	 * to be unmapped.
+	 */
+	list_for_each_entry(r, &asi->mapping_list, list) {
+
+		if (!overlap) {
+			/* no more overlap */
+			break;
+		}
+
+		WARN_ON(range->ptr == r->ptr && range->size == r->size);
+
+		/* check overlap with other range */
+		map_addr = ASI_RANGE_MAP_ADDR(r);
+		map_end = ASI_RANGE_MAP_END(r);
+		if (end <= map_addr || addr >= map_end) {
+			/* no overlap, continue */
+			continue;
+		}
+
+		pr_debug("ASI %p: UNMAP %px/%lx/%d overlaps with %px/%lx/%d\n",
+			 asi, range->ptr, range->size, range->level,
+			 r->ptr, r->size, r->level);
+		r->overlap--;
+		overlap--;
+
+		/*
+		 * Check if range is included into a remaining mapped range.
+		 * If so then there's no need to unmap.
+		 */
+		if (map_addr <= addr && end <= map_end) {
+			pr_debug("ASI %p: UNMAP %px/%lx/%d still mapped\n",
+				 asi, range->ptr, range->size, range->level);
+			need_unmapping = false;
+		}
+	}
+
+	WARN_ON(overlap);
+
+	if (need_unmapping) {
+		asi_clear_pgd_range(asi, asi->pgd, addr, end, range->level);
+
+		/*
+		 * Remap all ranges we overlap with, as clearing the mapping
+		 * will have unmapped the overlap.
+		 */
+		overlap = range->overlap;
+		list_for_each_entry(r, &asi->mapping_list, list) {
+			if (!overlap) {
+				/* no more overlap */
+				break;
+			}
+
+			/* check overlap with other range */
+			map_addr = ASI_RANGE_MAP_ADDR(r);
+			map_end = ASI_RANGE_MAP_END(r);
+			if (end <= map_addr || addr >= map_end) {
+				/* no overlap, continue */
+				continue;
+			}
+			pr_debug("ASI %p: UNMAP %px/%lx/%d remaps %px/%lx/%d\n",
+				 asi, range->ptr, range->size, range->level,
+				 r->ptr, r->size, r->level);
+			overlap--;
+
+			r_addr = (unsigned long)r->ptr;
+			err = asi_copy_pgd_range(asi, asi->pgd,
+						 current->mm->pgd,
+						 r_addr, r_addr + r->size,
+						 r->level);
+			if (err) {
+				pr_debug("ASI %p: UNMAP %px/%lx/%d remaps %px/%lx/%d error %d\n",
+					 asi, range->ptr, range->size,
+					 range->level,
+					 r->ptr, r->size, r->level,
+					 err);
+			}
+		}
+	}
+}
+
 /*
  * Clear page table entries in the specified ASI page-table.
  */
@@ -797,8 +991,8 @@ void asi_unmap(struct asi *asi, void *ptr)
 	end = addr + range_mapping->size;
 	pr_debug("ASI %p: UNMAP %px/%lx/%d\n", asi, ptr,
 		 range_mapping->size, range_mapping->level);
-	asi_clear_pgd_range(asi, asi->pgd, addr, end, range_mapping->level);
 	list_del(&range_mapping->list);
+	asi_unmap_overlap(asi, range_mapping);
 	kfree(range_mapping);
 done:
 	spin_unlock_irqrestore(&asi->lock, flags);
-- 
1.7.1



* [RFC v2 15/26] mm/asi: Initialize the ASI page-table with core mappings
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (13 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 14/26] mm/asi: Handle ASI mapped range leaks and overlaps Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 16/26] mm/asi: Option to map current task into ASI Alexandre Chartre
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Core mappings are the minimal mappings we need to be able to
enter isolation and handle an isolation abort or exit. This
includes the kernel code, the GDT and the percpu ASI sessions.
We also need a stack so we map the current stack when entering
isolation and unmap it on exit/abort.

Optionally, additional mappings can be added, such as the stack canary
or the percpu offset to be able to use get_cpu_var()/this_cpu_ptr()
when isolation is active.
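
The optional mappings are selected with a bit mask that is matched
against a table of options. The userspace sketch below shows that
selection pattern only. The two flag values are the ones defined in
this patch; the table contents, the names and sketch_select() are
hypothetical, and the walk over the table is an assumption about how
such flags would typically be consumed.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values copied from the patch; everything else is hypothetical. */
#define SKETCH_MAP_STACK_CANARY	0x01
#define SKETCH_MAP_CPU_PTR	0x02

struct sketch_map_option {
	int flag;		/* flag selecting this optional mapping */
	const char *name;	/* stands in for the (ptr, size) pair */
};

const struct sketch_map_option sketch_options[] = {
	{ SKETCH_MAP_STACK_CANARY, "stack canary" },
	{ SKETCH_MAP_CPU_PTR, "percpu offset" },
};

#define SKETCH_NR_OPTIONS \
	(sizeof(sketch_options) / sizeof(sketch_options[0]))

/*
 * Store the names of the options selected by flags into out[] and
 * return how many were selected (at most max).
 */
size_t sketch_select(int flags, const char **out, size_t max)
{
	size_t i, n = 0;

	assert(out != NULL);
	for (i = 0; i < SKETCH_NR_OPTIONS && n < max; i++) {
		if (flags & sketch_options[i].flag)
			out[n++] = sketch_options[i].name;
	}
	return n;
}
```

A caller that wants both optional mappings would pass
SKETCH_MAP_STACK_CANARY | SKETCH_MAP_CPU_PTR, and a caller that wants
only the core mappings would pass 0.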

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h  |    9 ++++-
 arch/x86/mm/asi.c           |   75 +++++++++++++++++++++++++++++++++++++++---
 arch/x86/mm/asi_pagetable.c |   30 ++++++++++++----
 3 files changed, 99 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index cf5d198..1ac8fd3 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -11,6 +11,13 @@
 #include <asm/pgtable.h>
 #include <linux/xarray.h>
 
+/*
+ * asi_create() map flags. Flags are used to map optional data
+ * when creating an ASI.
+ */
+#define ASI_MAP_STACK_CANARY	0x01	/* map stack canary */
+#define ASI_MAP_CPU_PTR		0x02	/* for get_cpu_var()/this_cpu_ptr() */
+
 enum page_table_level {
 	PGT_LEVEL_PTE,
 	PGT_LEVEL_PMD,
@@ -73,7 +80,7 @@ struct asi_session {
 void asi_init_range_mapping(struct asi *asi);
 void asi_fini_range_mapping(struct asi *asi);
 
-extern struct asi *asi_create(void);
+extern struct asi *asi_create(int map_flags);
 extern void asi_destroy(struct asi *asi);
 extern int asi_enter(struct asi *asi);
 extern void asi_exit(struct asi *asi);
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 25633a6..f049438 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -19,6 +19,17 @@
 /* ASI sessions, one per cpu */
 DEFINE_PER_CPU_PAGE_ALIGNED(struct asi_session, cpu_asi_session);
 
+struct asi_map_option {
+	int	flag;
+	void	*ptr;
+	size_t	size;
+};
+
+struct asi_map_option asi_map_percpu_options[] = {
+	{ ASI_MAP_STACK_CANARY, &fixed_percpu_data, sizeof(fixed_percpu_data) },
+	{ ASI_MAP_CPU_PTR, &this_cpu_off, sizeof(this_cpu_off) },
+};
+
 static void asi_log_fault(struct asi *asi, struct pt_regs *regs,
 			  unsigned long error_code, unsigned long address)
 {
@@ -85,16 +96,55 @@ bool asi_fault(struct pt_regs *regs, unsigned long error_code,
 	return true;
 }
 
-static int asi_init_mapping(struct asi *asi)
+static int asi_init_mapping(struct asi *asi, int flags)
 {
+	struct asi_map_option *option;
+	int i, err;
+
+	/*
+	 * Map the kernel.
+	 *
+	 * XXX We should check if we can map only kernel text, i.e. map with
+	 * size = _etext - _text
+	 */
+	err = asi_map(asi, (void *)__START_KERNEL_map, KERNEL_IMAGE_SIZE);
+	if (err)
+		return err;
+
 	/*
-	 * TODO: Populate the ASI page-table with minimal mappings so
-	 * that we can at least enter isolation and abort.
+	 * Map the cpu_entry_area because we need the GDT to be mapped.
+	 * Not sure we need anything else from cpu_entry_area.
 	 */
+	err = asi_map_range(asi, (void *)CPU_ENTRY_AREA_PER_CPU, P4D_SIZE,
+			    PGT_LEVEL_P4D);
+	if (err)
+		return err;
+
+	/*
+	 * Map the percpu ASI sessions. This is used by interrupt handlers
+	 * to figure out if we have entered isolation and switch back to
+	 * the kernel address space.
+	 */
+	err = ASI_MAP_CPUVAR(asi, cpu_asi_session);
+	if (err)
+		return err;
+
+	/*
+	 * Optional percpu mappings.
+	 */
+	for (i = 0; i < ARRAY_SIZE(asi_map_percpu_options); i++) {
+		option = &asi_map_percpu_options[i];
+		if (flags & option->flag) {
+			err = asi_map_percpu(asi, option->ptr, option->size);
+			if (err)
+				return err;
+		}
+	}
+
 	return 0;
 }
 
-struct asi *asi_create(void)
+struct asi *asi_create(int map_flags)
 {
 	struct page *page;
 	struct asi *asi;
@@ -115,7 +165,7 @@ struct asi *asi_create(void)
 	spin_lock_init(&asi->fault_lock);
 	asi_init_backend(asi);
 
-	err = asi_init_mapping(asi);
+	err = asi_init_mapping(asi, map_flags);
 	if (err)
 		goto error;
 
@@ -159,6 +209,7 @@ int asi_enter(struct asi *asi)
 	struct asi *current_asi;
 	struct asi_session *asi_session;
 	unsigned long original_cr3;
+	int err;
 
 	state = this_cpu_read(cpu_asi_session.state);
 	/*
@@ -190,6 +241,13 @@ int asi_enter(struct asi *asi)
 	WARN_ON(asi_session->abort_depth > 0);
 
 	/*
+	 * We need a stack to run with isolation, so map the current stack.
+	 */
+	err = asi_map(asi, current->stack, PAGE_SIZE << THREAD_SIZE_ORDER);
+	if (err)
+		goto err_clear_asi;
+
+	/*
 	 * Instructions ordering is important here because we should be
 	 * able to deal with any interrupt/exception which will abort
 	 * the isolation and restore CR3 to its original value:
@@ -211,7 +269,7 @@ int asi_enter(struct asi *asi)
 	if (!original_cr3) {
 		WARN_ON(1);
 		err = -EINVAL;
-		goto err_clear_asi;
+		goto err_unmap_stack;
 	}
 	asi_session->original_cr3 = original_cr3;
 
@@ -228,6 +286,8 @@ int asi_enter(struct asi *asi)
 
 	return 0;
 
+err_unmap_stack:
+	asi_unmap(asi, current->stack);
 err_clear_asi:
 	asi_session->asi = NULL;
 	asi_session->task = NULL;
@@ -284,6 +344,9 @@ void asi_exit(struct asi *asi)
 	 * exit isolation before abort_depth reaches 0.
 	 */
 	asi_session->abort_depth = 0;
+
+	/* unmap stack */
+	asi_unmap(asi, current->stack);
 }
 EXPORT_SYMBOL(asi_exit);
 
diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index f1ee65b..bcc95f2 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -710,12 +710,20 @@ int asi_map_range(struct asi *asi, void *ptr, size_t size,
 	map_addr = round_down(addr, page_dir_size);
 	map_end = round_up(end, page_dir_size);
 
-	pr_debug("ASI %p: MAP %px/%lx/%d -> %lx-%lx\n", asi, ptr, size, level,
-		 map_addr, map_end);
-	if (map_addr < addr)
-		pr_debug("ASI %p: MAP LEAK %lx-%lx\n", asi, map_addr, addr);
-	if (map_end > end)
-		pr_debug("ASI %p: MAP LEAK %lx-%lx\n", asi, end, map_end);
+	/*
+	 * Don't log info the current stack because it is mapped/unmapped
+	 * everytime we enter/exit isolation.
+	 */
+	if (ptr != current->stack) {
+		pr_debug("ASI %p: MAP %px/%lx/%d -> %lx-%lx\n",
+			 asi, ptr, size, level, map_addr, map_end);
+		if (map_addr < addr)
+			pr_debug("ASI %p: MAP LEAK %lx-%lx\n",
+				 asi, map_addr, addr);
+		if (map_end > end)
+			pr_debug("ASI %p: MAP LEAK %lx-%lx\n",
+				 asi, end, map_end);
+	}
 
 	spin_lock_irqsave(&asi->lock, flags);
 
@@ -989,8 +997,14 @@ void asi_unmap(struct asi *asi, void *ptr)
 
 	addr = (unsigned long)range_mapping->ptr;
 	end = addr + range_mapping->size;
-	pr_debug("ASI %p: UNMAP %px/%lx/%d\n", asi, ptr,
-		 range_mapping->size, range_mapping->level);
+	/*
+	 * Don't log info the current stack because it is mapped/unmapped
+	 * everytime we enter/exit isolation.
+	 */
+	if (ptr != current->stack) {
+		pr_debug("ASI %p: UNMAP %px/%lx/%d\n", asi, ptr,
+			 range_mapping->size, range_mapping->level);
+	}
 	list_del(&range_mapping->list);
 	asi_unmap_overlap(asi, range_mapping);
 	kfree(range_mapping);
-- 
1.7.1



* [RFC v2 16/26] mm/asi: Option to map current task into ASI
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (14 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 15/26] mm/asi: Initialize the ASI page-table with core mappings Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 17/26] rcu: Move tree.h static forward declarations to tree.c Alexandre Chartre
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Add an option to map the current task into an ASI page-table.
The task is mapped when entering isolation and unmapped on
abort/exit.
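
The map-on-enter/unmap-on-exit behaviour, including the error
unwinding, can be sketched in user space as follows. This is only an
illustration under stated assumptions: struct demo_asi, its boolean
fields and the fail_* test hooks are hypothetical and stand in for the
real asi_map()/asi_unmap() calls and for the CR3 check in asi_enter().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag, modelled on ASI_MAP_CURRENT_TASK. */
#define DEMO_MAP_CURRENT_TASK	0x04

struct demo_asi {
	int	mapping_flags;
	bool	stack_mapped;
	bool	task_mapped;
	bool	fail_task_map;	/* test hook: task mapping fails */
	bool	fail_late;	/* test hook: failure after all mappings */
};

static int demo_enter(struct demo_asi *asi)
{
	int err;

	assert(asi != NULL);

	/* The stack is always mapped when entering isolation. */
	asi->stack_mapped = true;

	/* Optionally, also map the current task. */
	if (asi->mapping_flags & DEMO_MAP_CURRENT_TASK) {
		if (asi->fail_task_map) {
			err = -1;
			goto err_unmap_stack;
		}
		asi->task_mapped = true;
	}

	/* Stands in for a later failure, e.g. an invalid original CR3. */
	if (asi->fail_late) {
		err = -2;
		goto err_unmap_task;
	}

	return 0;

	/* Undo in reverse order of setup; labels fall through. */
err_unmap_task:
	if (asi->mapping_flags & DEMO_MAP_CURRENT_TASK)
		asi->task_mapped = false;
err_unmap_stack:
	asi->stack_mapped = false;
	return err;
}

static void demo_exit(struct demo_asi *asi)
{
	assert(asi != NULL);

	/* Unmap the stack and, if it was requested, the task. */
	asi->stack_mapped = false;
	if (asi->mapping_flags & DEMO_MAP_CURRENT_TASK)
		asi->task_mapped = false;
}
```

The point of the label ordering is that a failure at any step only
undoes the steps that had already succeeded.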

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h  |    2 ++
 arch/x86/mm/asi.c           |   25 +++++++++++++++++++++----
 arch/x86/mm/asi_pagetable.c |    4 ++--
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 1ac8fd3..a277e43 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -17,6 +17,7 @@
  */
 #define ASI_MAP_STACK_CANARY	0x01	/* map stack canary */
 #define ASI_MAP_CPU_PTR		0x02	/* for get_cpu_var()/this_cpu_ptr() */
+#define ASI_MAP_CURRENT_TASK	0x04	/* map the current task */
 
 enum page_table_level {
 	PGT_LEVEL_PTE,
@@ -31,6 +32,7 @@ enum page_table_level {
 struct asi {
 	spinlock_t		lock;		/* protect all attributes */
 	pgd_t			*pgd;		/* ASI page-table */
+	int			mapping_flags;	/* map flags */
 	struct list_head	mapping_list;	/* list of VA range mapping */
 
 	/*
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index f049438..acd1135 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -28,6 +28,7 @@ struct asi_map_option {
 struct asi_map_option asi_map_percpu_options[] = {
 	{ ASI_MAP_STACK_CANARY, &fixed_percpu_data, sizeof(fixed_percpu_data) },
 	{ ASI_MAP_CPU_PTR, &this_cpu_off, sizeof(this_cpu_off) },
+	{ ASI_MAP_CURRENT_TASK, &current_task, sizeof(current_task) },
 };
 
 static void asi_log_fault(struct asi *asi, struct pt_regs *regs,
@@ -96,8 +97,9 @@ bool asi_fault(struct pt_regs *regs, unsigned long error_code,
 	return true;
 }
 
-static int asi_init_mapping(struct asi *asi, int flags)
+static int asi_init_mapping(struct asi *asi)
 {
+	int flags = asi->mapping_flags;
 	struct asi_map_option *option;
 	int i, err;
 
@@ -164,8 +166,9 @@ struct asi *asi_create(int map_flags)
 	spin_lock_init(&asi->lock);
 	spin_lock_init(&asi->fault_lock);
 	asi_init_backend(asi);
+	asi->mapping_flags = map_flags;
 
-	err = asi_init_mapping(asi, map_flags);
+	err = asi_init_mapping(asi);
 	if (err)
 		goto error;
 
@@ -248,6 +251,15 @@ int asi_enter(struct asi *asi)
 		goto err_clear_asi;
 
 	/*
+	 * Optionally, also map the current task.
+	 */
+	if (asi->mapping_flags & ASI_MAP_CURRENT_TASK) {
+		err = asi_map(asi, current, sizeof(struct task_struct));
+		if (err)
+			goto err_unmap_stack;
+	}
+
+	/*
 	 * Instructions ordering is important here because we should be
 	 * able to deal with any interrupt/exception which will abort
 	 * the isolation and restore CR3 to its original value:
@@ -269,7 +281,7 @@ int asi_enter(struct asi *asi)
 	if (!original_cr3) {
 		WARN_ON(1);
 		err = -EINVAL;
-		goto err_unmap_stack;
+		goto err_unmap_task;
 	}
 	asi_session->original_cr3 = original_cr3;
 
@@ -286,6 +298,9 @@ int asi_enter(struct asi *asi)
 
 	return 0;
 
+err_unmap_task:
+	if (asi->mapping_flags & ASI_MAP_CURRENT_TASK)
+		asi_unmap(asi, current);
 err_unmap_stack:
 	asi_unmap(asi, current->stack);
 err_clear_asi:
@@ -345,8 +360,10 @@ void asi_exit(struct asi *asi)
 	 */
 	asi_session->abort_depth = 0;
 
-	/* unmap stack */
+	/* unmap stack and task */
 	asi_unmap(asi, current->stack);
+	if (asi->mapping_flags & ASI_MAP_CURRENT_TASK)
+		asi_unmap(asi, current);
 }
 EXPORT_SYMBOL(asi_exit);
 
diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index bcc95f2..8076626 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -714,7 +714,7 @@ int asi_map_range(struct asi *asi, void *ptr, size_t size,
 	 * Don't log info the current stack because it is mapped/unmapped
 	 * everytime we enter/exit isolation.
 	 */
-	if (ptr != current->stack) {
+	if (ptr != current->stack && ptr != current) {
 		pr_debug("ASI %p: MAP %px/%lx/%d -> %lx-%lx\n",
 			 asi, ptr, size, level, map_addr, map_end);
 		if (map_addr < addr)
@@ -1001,7 +1001,7 @@ void asi_unmap(struct asi *asi, void *ptr)
 	 * Don't log info the current stack because it is mapped/unmapped
 	 * everytime we enter/exit isolation.
 	 */
-	if (ptr != current->stack) {
+	if (ptr != current->stack && ptr != current) {
 		pr_debug("ASI %p: UNMAP %px/%lx/%d\n", asi, ptr,
 			 range_mapping->size, range_mapping->level);
 	}
-- 
1.7.1



* [RFC v2 17/26] rcu: Move tree.h static forward declarations to tree.c
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (15 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 16/26] mm/asi: Option to map current task into ASI Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 18/26] rcu: Make percpu rcu_data non-static Alexandre Chartre
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

tree.h has static forward declarations for inline functions defined
in tree_plugin.h and tree_stall.h. These forward declarations prevent
including tree.h in any file other than tree.c, so move them to tree.c.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 kernel/rcu/tree.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/rcu/tree.h |   55 +----------------------------------------------------
 2 files changed, 55 insertions(+), 54 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 980ca3c..44dd3b4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -55,6 +55,60 @@
 #include "tree.h"
 #include "rcu.h"
 
+/* Forward declarations for tree_plugin.h */
+static void rcu_bootup_announce(void);
+static void rcu_qs(void);
+static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp);
+#ifdef CONFIG_HOTPLUG_CPU
+static bool rcu_preempt_has_tasks(struct rcu_node *rnp);
+#endif /* #ifdef CONFIG_HOTPLUG_CPU */
+static int rcu_print_task_exp_stall(struct rcu_node *rnp);
+static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp);
+static void rcu_flavor_sched_clock_irq(int user);
+static void dump_blkd_tasks(struct rcu_node *rnp, int ncheck);
+static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags);
+static void rcu_preempt_boost_start_gp(struct rcu_node *rnp);
+static void invoke_rcu_callbacks_kthread(void);
+static bool rcu_is_callbacks_kthread(void);
+static void __init rcu_spawn_boost_kthreads(void);
+static void rcu_prepare_kthreads(int cpu);
+static void rcu_cleanup_after_idle(void);
+static void rcu_prepare_for_idle(void);
+static bool rcu_preempt_has_tasks(struct rcu_node *rnp);
+static bool rcu_preempt_need_deferred_qs(struct task_struct *t);
+static void rcu_preempt_deferred_qs(struct task_struct *t);
+static void zero_cpu_stall_ticks(struct rcu_data *rdp);
+static bool rcu_nocb_cpu_needs_barrier(int cpu);
+static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
+static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
+static void rcu_init_one_nocb(struct rcu_node *rnp);
+static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
+			    bool lazy, unsigned long flags);
+static bool rcu_nocb_adopt_orphan_cbs(struct rcu_data *my_rdp,
+				      struct rcu_data *rdp,
+				      unsigned long flags);
+static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
+static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
+static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
+static void rcu_spawn_cpu_nocb_kthread(int cpu);
+static void __init rcu_spawn_nocb_kthreads(void);
+#ifdef CONFIG_RCU_NOCB_CPU
+static void __init rcu_organize_nocb_kthreads(void);
+#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
+static bool init_nocb_callback_list(struct rcu_data *rdp);
+static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp);
+static void rcu_bind_gp_kthread(void);
+static bool rcu_nohz_full_cpu(void);
+static void rcu_dynticks_task_enter(void);
+static void rcu_dynticks_task_exit(void);
+
+/* Forward declarations for tree_stall.h */
+static void record_gp_stall_check_time(void);
+static void rcu_iw_handler(struct irq_work *iwp);
+static void check_cpu_stall(struct rcu_data *rdp);
+static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
+				     const unsigned long gpssdelay);
+
 #ifdef MODULE_PARAM_PREFIX
 #undef MODULE_PARAM_PREFIX
 #endif
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index e253d11..9790b58 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -392,58 +392,5 @@ struct rcu_state {
 #endif /* #else #ifdef CONFIG_TRACING */
 
 int rcu_dynticks_snap(struct rcu_data *rdp);
-
-/* Forward declarations for tree_plugin.h */
-static void rcu_bootup_announce(void);
-static void rcu_qs(void);
-static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp);
-#ifdef CONFIG_HOTPLUG_CPU
-static bool rcu_preempt_has_tasks(struct rcu_node *rnp);
-#endif /* #ifdef CONFIG_HOTPLUG_CPU */
-static int rcu_print_task_exp_stall(struct rcu_node *rnp);
-static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp);
-static void rcu_flavor_sched_clock_irq(int user);
 void call_rcu(struct rcu_head *head, rcu_callback_t func);
-static void dump_blkd_tasks(struct rcu_node *rnp, int ncheck);
-static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags);
-static void rcu_preempt_boost_start_gp(struct rcu_node *rnp);
-static void invoke_rcu_callbacks_kthread(void);
-static bool rcu_is_callbacks_kthread(void);
-static void __init rcu_spawn_boost_kthreads(void);
-static void rcu_prepare_kthreads(int cpu);
-static void rcu_cleanup_after_idle(void);
-static void rcu_prepare_for_idle(void);
-static bool rcu_preempt_has_tasks(struct rcu_node *rnp);
-static bool rcu_preempt_need_deferred_qs(struct task_struct *t);
-static void rcu_preempt_deferred_qs(struct task_struct *t);
-static void zero_cpu_stall_ticks(struct rcu_data *rdp);
-static bool rcu_nocb_cpu_needs_barrier(int cpu);
-static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
-static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
-static void rcu_init_one_nocb(struct rcu_node *rnp);
-static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
-			    bool lazy, unsigned long flags);
-static bool rcu_nocb_adopt_orphan_cbs(struct rcu_data *my_rdp,
-				      struct rcu_data *rdp,
-				      unsigned long flags);
-static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
-static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
-static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
-static void rcu_spawn_cpu_nocb_kthread(int cpu);
-static void __init rcu_spawn_nocb_kthreads(void);
-#ifdef CONFIG_RCU_NOCB_CPU
-static void __init rcu_organize_nocb_kthreads(void);
-#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
-static bool init_nocb_callback_list(struct rcu_data *rdp);
-static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp);
-static void rcu_bind_gp_kthread(void);
-static bool rcu_nohz_full_cpu(void);
-static void rcu_dynticks_task_enter(void);
-static void rcu_dynticks_task_exit(void);
-
-/* Forward declarations for tree_stall.h */
-static void record_gp_stall_check_time(void);
-static void rcu_iw_handler(struct irq_work *iwp);
-static void check_cpu_stall(struct rcu_data *rdp);
-static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
-				     const unsigned long gpssdelay);
+
-- 
1.7.1



* [RFC v2 18/26] rcu: Make percpu rcu_data non-static
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (16 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 17/26] rcu: Move tree.h static forward declarations to tree.c Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 19/26] mm/asi: Add option to map RCU data Alexandre Chartre
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Make percpu rcu_data non-static so that it can be mapped into an
isolation address space page-table. This will allow address space
isolation to use RCU without faulting.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 kernel/rcu/tree.c |    2 +-
 kernel/rcu/tree.h |    1 +
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 44dd3b4..2827b2b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -126,7 +126,7 @@ static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
 #define rcu_eqs_special_exit() do { } while (0)
 #endif
 
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
+DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
 	.dynticks_nesting = 1,
 	.dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE,
 	.dynticks = ATOMIC_INIT(RCU_DYNTICK_CTRL_CTR),
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 9790b58..a043fde 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -394,3 +394,4 @@ struct rcu_state {
 int rcu_dynticks_snap(struct rcu_data *rdp);
 void call_rcu(struct rcu_head *head, rcu_callback_t func);
 
+DECLARE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data);
-- 
1.7.1



* [RFC v2 19/26] mm/asi: Add option to map RCU data
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (17 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 18/26] rcu: Make percpu rcu_data non-static Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 20/26] mm/asi: Add option to map cpu_hw_events Alexandre Chartre
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Add an option to map RCU data when creating an ASI. This will map
the percpu rcu_data (which is not exported by the kernel), and
allow ASI to use RCU without faulting.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h |    1 +
 arch/x86/mm/asi.c          |    4 ++++
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index a277e43..8199618 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -18,6 +18,7 @@
 #define ASI_MAP_STACK_CANARY	0x01	/* map stack canary */
 #define ASI_MAP_CPU_PTR		0x02	/* for get_cpu_var()/this_cpu_ptr() */
 #define ASI_MAP_CURRENT_TASK	0x04	/* map the current task */
+#define ASI_MAP_RCU_DATA	0x08	/* map rcu data */
 
 enum page_table_level {
 	PGT_LEVEL_PTE,
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index acd1135..20c23dc 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -7,6 +7,7 @@
 
 #include <linux/export.h>
 #include <linux/gfp.h>
+#include <linux/irq_work.h>
 #include <linux/mm.h>
 #include <linux/printk.h>
 #include <linux/sched/debug.h>
@@ -16,6 +17,8 @@
 #include <asm/bug.h>
 #include <asm/mmu_context.h>
 
+#include "../../../kernel/rcu/tree.h"
+
 /* ASI sessions, one per cpu */
 DEFINE_PER_CPU_PAGE_ALIGNED(struct asi_session, cpu_asi_session);
 
@@ -29,6 +32,7 @@ struct asi_map_option asi_map_percpu_options[] = {
 	{ ASI_MAP_STACK_CANARY, &fixed_percpu_data, sizeof(fixed_percpu_data) },
 	{ ASI_MAP_CPU_PTR, &this_cpu_off, sizeof(this_cpu_off) },
 	{ ASI_MAP_CURRENT_TASK, &current_task, sizeof(current_task) },
+	{ ASI_MAP_RCU_DATA, &rcu_data, sizeof(rcu_data) },
 };
 
 static void asi_log_fault(struct asi *asi, struct pt_regs *regs,
-- 
1.7.1



* [RFC v2 20/26] mm/asi: Add option to map cpu_hw_events
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (18 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 19/26] mm/asi: Add option to map RCU data Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 21/26] mm/asi: Make functions to read cr3/cr4 ASI aware Alexandre Chartre
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Add an option to map cpu_hw_events in the ASI page-table. Also restructure
to select options for percpu optional mapping.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h |    1 +
 arch/x86/mm/asi.c          |    3 +++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 8199618..f489551 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -19,6 +19,7 @@
 #define ASI_MAP_CPU_PTR		0x02	/* for get_cpu_var()/this_cpu_ptr() */
 #define ASI_MAP_CURRENT_TASK	0x04	/* map the current task */
 #define ASI_MAP_RCU_DATA	0x08	/* map rcu data */
+#define ASI_MAP_CPU_HW_EVENTS	0x10	/* map cpu hw events */
 
 enum page_table_level {
 	PGT_LEVEL_PTE,
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 20c23dc..d488704 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -8,6 +8,7 @@
 #include <linux/export.h>
 #include <linux/gfp.h>
 #include <linux/irq_work.h>
+#include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/printk.h>
 #include <linux/sched/debug.h>
@@ -17,6 +18,7 @@
 #include <asm/bug.h>
 #include <asm/mmu_context.h>
 
+#include "../events/perf_event.h"
 #include "../../../kernel/rcu/tree.h"
 
 /* ASI sessions, one per cpu */
@@ -33,6 +35,7 @@ struct asi_map_option asi_map_percpu_options[] = {
 	{ ASI_MAP_CPU_PTR, &this_cpu_off, sizeof(this_cpu_off) },
 	{ ASI_MAP_CURRENT_TASK, &current_task, sizeof(current_task) },
 	{ ASI_MAP_RCU_DATA, &rcu_data, sizeof(rcu_data) },
+	{ ASI_MAP_CPU_HW_EVENTS, &cpu_hw_events, sizeof(cpu_hw_events) },
 };
 
 static void asi_log_fault(struct asi *asi, struct pt_regs *regs,
-- 
1.7.1



* [RFC v2 21/26] mm/asi: Make functions to read cr3/cr4 ASI aware
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (19 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 20/26] mm/asi: Add option to map cpu_hw_events Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 22/26] KVM: x86/asi: Introduce address_space_isolation module parameter Alexandre Chartre
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

When address space isolation is active, cpu_tlbstate isn't necessarily
mapped in the ASI page-table, so accessing it would cause ASI to fault.
Instead of just mapping cpu_tlbstate, update __get_current_cr3_fast() and
cr4_read_shadow() to return cr3/cr4 values cached in the ASI session
when ASI is active.

Note that the cached cr3 value is the ASI cr3 value (i.e. the current
CR3 value when ASI is active). The cached cr4 value is the cr4 value
when isolation was entered (ASI doesn't change cr4).
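
The selection between the cached values and cpu_tlbstate can be
modelled in user space roughly as below. This is a sketch under
assumptions, not the kernel implementation: the demo_* types and
functions are hypothetical, the percpu accessors are replaced by plain
pointers, and the tlbstate cr3 field stands in for the value that
build_cr3() would compute.

```c
#include <assert.h>
#include <stddef.h>

enum demo_session_state { DEMO_INACTIVE, DEMO_ACTIVE };

/* Hypothetical stand-in for the percpu ASI session. */
struct demo_session {
	enum demo_session_state	state;
	unsigned long		isolation_cr3;	/* cr3 while ASI is active */
	unsigned long		original_cr4;	/* cr4 when ASI was entered */
};

/* Hypothetical stand-in for cpu_tlbstate. */
struct demo_tlbstate {
	unsigned long	cr3;
	unsigned long	cr4;
};

/* Return the cached ASI cr3 when isolation is active. */
static unsigned long demo_read_cr3(const struct demo_session *s,
				   const struct demo_tlbstate *t)
{
	assert(s != NULL && t != NULL);

	if (s->state == DEMO_ACTIVE)
		return s->isolation_cr3;

	return t->cr3;
}

/* Return the cr4 cached at entry when isolation is active. */
static unsigned long demo_read_cr4(const struct demo_session *s,
				   const struct demo_tlbstate *t)
{
	assert(s != NULL && t != NULL);

	if (s->state == DEMO_ACTIVE)
		return s->original_cr4;

	return t->cr4;
}
```

In this model the tlbstate structure is never dereferenced while the
session is active, which is the property the patch relies on.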

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/asi.h         |    2 ++
 arch/x86/include/asm/mmu_context.h |   20 ++++++++++++++++++--
 arch/x86/include/asm/tlbflush.h    |   10 ++++++++++
 arch/x86/mm/asi.c                  |    3 +++
 4 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index f489551..07c2b50 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -73,7 +73,9 @@ struct asi_session {
 	enum asi_session_state	state;		/* state of ASI session */
 	bool			retry_abort;	/* always retry abort */
 	unsigned int		abort_depth;	/* abort depth */
+	unsigned long		isolation_cr3;	/* cr3 when ASI is active */
 	unsigned long		original_cr3;	/* cr3 before entering ASI */
+	unsigned long		original_cr4;	/* cr4 before entering ASI */
 	struct task_struct	*task;		/* task during isolation */
 } __aligned(PAGE_SIZE);
 
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 9024236..8cec983 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -14,6 +14,7 @@
 #include <asm/paravirt.h>
 #include <asm/mpx.h>
 #include <asm/debugreg.h>
+#include <asm/asi.h>
 
 extern atomic64_t last_mm_ctx_id;
 
@@ -347,8 +348,23 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
  */
 static inline unsigned long __get_current_cr3_fast(void)
 {
-	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
-		this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+	unsigned long cr3;
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+	/*
+	 * If isolation is active, cpu_tlbstate isn't necessarily mapped
+	 * in the ASI page-table (and it doesn't have the current pgd anyway).
+	 * The current CR3 is cached in the CPU ASI session.
+	 */
+	if (this_cpu_read(cpu_asi_session.state) == ASI_SESSION_STATE_ACTIVE)
+		cr3 = this_cpu_read(cpu_asi_session.isolation_cr3);
+	else
+		cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
+				this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+#else
+	cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
+			this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+#endif
 
 	/* For now, be very restrictive about when this can be called. */
 	VM_WARN_ON(in_nmi() || preemptible());
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee3758..917f9a5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -12,6 +12,7 @@
 #include <asm/invpcid.h>
 #include <asm/pti.h>
 #include <asm/processor-flags.h>
+#include <asm/asi.h>
 
 /*
  * The x86 feature is called PCID (Process Context IDentifier). It is similar
@@ -324,6 +325,15 @@ static inline void cr4_toggle_bits_irqsoff(unsigned long mask)
 /* Read the CR4 shadow. */
 static inline unsigned long cr4_read_shadow(void)
 {
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+	/*
+	 * If isolation is active, cpu_tlbstate isn't necessarily mapped
+	 * in the ASI page-table. The CR4 value is cached in the CPU
+	 * ASI session.
+	 */
+	if (this_cpu_read(cpu_asi_session.state) == ASI_SESSION_STATE_ACTIVE)
+		return this_cpu_read(cpu_asi_session.original_cr4);
+#endif
 	return this_cpu_read(cpu_tlbstate.cr4);
 }
 
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index d488704..4a5a4ba 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -23,6 +23,7 @@
 
 /* ASI sessions, one per cpu */
 DEFINE_PER_CPU_PAGE_ALIGNED(struct asi_session, cpu_asi_session);
+EXPORT_SYMBOL(cpu_asi_session);
 
 struct asi_map_option {
 	int	flag;
@@ -291,6 +292,8 @@ int asi_enter(struct asi *asi)
 		goto err_unmap_task;
 	}
 	asi_session->original_cr3 = original_cr3;
+	asi_session->original_cr4 = cr4_read_shadow();
+	asi_session->isolation_cr3 = __sme_pa(asi->pgd);
 
 	/*
 	 * Use ASI barrier as we are setting CR3 with the ASI page-table.
-- 
1.7.1



* [RFC v2 22/26] KVM: x86/asi: Introduce address_space_isolation module parameter
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (20 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 21/26] mm/asi: Make functions to read cr3/cr4 ASI aware Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 23/26] KVM: x86/asi: Introduce KVM address space isolation Alexandre Chartre
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

From: Liran Alon <liran.alon@oracle.com>

Add the address_space_isolation parameter to the kvm-intel module.

When set to true, KVM #VMExit handlers run in an isolated address space
that maps only the code KVM requires and per-VM information, instead of
the entire kernel address space.

This mechanism is meant to mitigate memory-leak side-channel CPU
vulnerabilities (e.g. Spectre, L1TF, etc.), but it can also be viewed
as defense in depth: it helps generically against info-leak
vulnerabilities in KVM #VMExit handlers and reduces the gadgets
available for ROP attacks.

This is set to false by default because it incurs a performance hit
which some users will not want to pay for the security gain.
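
Illustration only, not part of this patch: since the parameter is
registered with mode 0444, its value should be readable from sysfs once
the module is loaded. The userspace sketch below reads such a boolean
parameter file. read_bool_param() is a hypothetical helper name, and the
path in the usage note assumes the parameter lives in kvm-intel, as the
Makefile change suggests.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a boolean module parameter from a sysfs-style file.
 * Returns 1 if enabled, 0 if disabled, and -1 if the file cannot be
 * opened or does not start with a recognised character.
 */
int read_bool_param(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (!f)
		return -1;
	c = fgetc(f);
	fclose(f);

	switch (c) {
	case 'Y': case 'y': case '1':
		return 1;
	case 'N': case 'n': case '0':
		return 0;
	default:
		return -1;
	}
}
```

A caller would probably use something like
read_bool_param("/sys/module/kvm_intel/parameters/address_space_isolation")
and treat -1 as "module not loaded or parameter not present".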

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/kvm/Makefile        |    3 ++-
 arch/x86/kvm/vmx/isolation.c |   26 ++++++++++++++++++++++++++
 2 files changed, 28 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/isolation.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 31ecf7a..71579ed 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -12,7 +12,8 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
 			   hyperv.o page_track.o debugfs.o
 
-kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
+kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
+			   vmx/evmcs.o vmx/nested.o vmx/isolation.o
 kvm-amd-y		+= svm.o pmu_amd.o
 
 obj-$(CONFIG_KVM)	+= kvm.o
diff --git a/arch/x86/kvm/vmx/isolation.c b/arch/x86/kvm/vmx/isolation.c
new file mode 100644
index 0000000..e25f663
--- /dev/null
+++ b/arch/x86/kvm/vmx/isolation.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ *
+ * KVM Address Space Isolation
+ */
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+
+/*
+ * When set to true, KVM #VMExit handlers run in isolated address space
+ * which maps only KVM required code and per-VM information instead of
+ * entire kernel address space.
+ *
+ * This mechanism is meant to mitigate memory-leak side-channels CPU
+ * vulnerabilities (e.g. Spectre, L1TF and etc.) but can also be viewed
+ * as security in-depth as it also helps generically against info-leaks
+ * vulnerabilities in KVM #VMExit handlers and reduce the available
+ * gadgets for ROP attacks.
+ *
+ * This is set to false by default because it incurs a performance hit
+ * which some users will not want to take for security gain.
+ */
+static bool __read_mostly address_space_isolation;
+module_param(address_space_isolation, bool, 0444);
-- 
1.7.1


* [RFC v2 23/26] KVM: x86/asi: Introduce KVM address space isolation
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (21 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 22/26] KVM: x86/asi: Introduce address_space_isolation module parameter Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 24/26] KVM: x86/asi: Populate the KVM ASI page-table Alexandre Chartre
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

From: Liran Alon <liran.alon@oracle.com>

Create a separate address space for KVM that will be active when
KVM #VMExit handlers run, up until the point at which we architecturally
need to access host (or other VM) sensitive data.

This patch only creates the address space using address space
isolation (ASI); it does not make it active yet. That will be done
by subsequent commits.
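
Illustration only, not part of this patch: the create/destroy lifecycle
described above can be modelled in userspace roughly as follows. All
model_* names and struct vcpu_model are hypothetical stand-ins, and the
real struct asi layout is not modelled. Note that the sketch destroys the
ASI when populating it fails; that cleanup is an assumption about the
desired behaviour, not a description of what the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct asi; only the flags are kept. */
struct asi {
	int map_flags;
};

struct vcpu_model {
	struct asi *asi;	/* NULL when isolation is not in use */
};

/* Hypothetical userspace stubs for asi_create()/asi_destroy(). */
struct asi *model_asi_create(int map_flags)
{
	struct asi *asi = malloc(sizeof(*asi));

	if (asi)
		asi->map_flags = map_flags;
	return asi;
}

void model_asi_destroy(struct asi *asi)
{
	free(asi);
}

/*
 * Create the ASI only when the feature is enabled. populate_err
 * simulates the result of populating the ASI page-table.
 */
int model_isolation_init(struct vcpu_model *vcpu, bool enabled,
			 int populate_err)
{
	struct asi *asi;

	vcpu->asi = NULL;
	if (!enabled)
		return 0;

	asi = model_asi_create(0);
	if (!asi)
		return -ENXIO;

	if (populate_err) {
		/* Assumption: release the ASI so that it does not leak. */
		model_asi_destroy(asi);
		return populate_err;
	}

	vcpu->asi = asi;
	return 0;
}

void model_isolation_uninit(struct vcpu_model *vcpu)
{
	if (!vcpu->asi)
		return;
	model_asi_destroy(vcpu->asi);
	vcpu->asi = NULL;
}
```

In this model, calling model_isolation_uninit() twice is harmless because
the pointer is cleared after the first call.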

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/kvm/vmx/isolation.c |   58 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c       |    7 ++++-
 arch/x86/kvm/vmx/vmx.h       |    3 ++
 include/linux/kvm_host.h     |    5 +++
 4 files changed, 72 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx/isolation.c b/arch/x86/kvm/vmx/isolation.c
index e25f663..644d8d3 100644
--- a/arch/x86/kvm/vmx/isolation.c
+++ b/arch/x86/kvm/vmx/isolation.c
@@ -7,6 +7,15 @@
 
 #include <linux/module.h>
 #include <linux/moduleparam.h>
+#include <linux/printk.h>
+#include <asm/asi.h>
+#include <asm/vmx.h>
+
+#include "vmx.h"
+#include "x86.h"
+
+#define VMX_ASI_MAP_FLAGS	\
+	(ASI_MAP_STACK_CANARY | ASI_MAP_CPU_PTR | ASI_MAP_CURRENT_TASK)
 
 /*
  * When set to true, KVM #VMExit handlers run in isolated address space
@@ -24,3 +33,52 @@
  */
 static bool __read_mostly address_space_isolation;
 module_param(address_space_isolation, bool, 0444);
+
+static int vmx_isolation_init_mapping(struct asi *asi, struct vcpu_vmx *vmx)
+{
+	/* TODO: Populate the KVM ASI page-table */
+
+	return 0;
+}
+
+int vmx_isolation_init(struct vcpu_vmx *vmx)
+{
+	struct kvm_vcpu *vcpu = &vmx->vcpu;
+	struct asi *asi;
+	int err;
+
+	if (!address_space_isolation) {
+		vcpu->asi = NULL;
+		return 0;
+	}
+
+	asi = asi_create(VMX_ASI_MAP_FLAGS);
+	if (!asi) {
+		pr_debug("KVM: x86: Failed to create address space isolation\n");
+		return -ENXIO;
+	}
+
+	err = vmx_isolation_init_mapping(asi, vmx);
+	if (err) {
+		vcpu->asi = NULL;
+		return err;
+	}
+
+	vcpu->asi = asi;
+
+	pr_info("KVM: x86: Running with isolated address space\n");
+
+	return 0;
+}
+
+void vmx_isolation_uninit(struct vcpu_vmx *vmx)
+{
+	struct kvm_vcpu *vcpu = &vmx->vcpu;
+
+	if (!address_space_isolation || !vcpu->asi)
+		return;
+
+	asi_destroy(vcpu->asi);
+	vcpu->asi = NULL;
+	pr_info("KVM: x86: End of isolated address space\n");
+}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d98eac3..9b92467 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -202,7 +202,7 @@
 };
 
 #define L1D_CACHE_ORDER 4
-static void *vmx_l1d_flush_pages;
+void *vmx_l1d_flush_pages;
 
 static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
 {
@@ -6561,6 +6561,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	vmx_isolation_uninit(vmx);
 	if (enable_pml)
 		vmx_destroy_pml_buffer(vmx);
 	free_vpid(vmx->vpid);
@@ -6672,6 +6673,10 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 
 	vmx->ept_pointer = INVALID_PAGE;
 
+	err = vmx_isolation_init(vmx);
+	if (err)
+		goto free_vmcs;
+
 	return &vmx->vcpu;
 
 free_vmcs:
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 61128b4..09c1593 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -525,4 +525,7 @@ static inline void decache_tsc_multiplier(struct vcpu_vmx *vmx)
 
 void dump_vmcs(void);
 
+int vmx_isolation_init(struct vcpu_vmx *vmx);
+void vmx_isolation_uninit(struct vcpu_vmx *vmx);
+
 #endif /* __KVM_X86_VMX_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d1ad38a..2a9d073 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -34,6 +34,7 @@
 #include <linux/kvm_types.h>
 
 #include <asm/kvm_host.h>
+#include <asm/asi.h>
 
 #ifndef KVM_MAX_VCPU_ID
 #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
@@ -320,6 +321,10 @@ struct kvm_vcpu {
 	bool preempted;
 	struct kvm_vcpu_arch arch;
 	struct dentry *debugfs_dentry;
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+	struct asi *asi;
+#endif
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
-- 
1.7.1


* [RFC v2 24/26] KVM: x86/asi: Populate the KVM ASI page-table
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (22 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 23/26] KVM: x86/asi: Introduce KVM address space isolation Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 25/26] KVM: x86/asi: Switch to KVM address space on entry to guest Alexandre Chartre
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Add mappings to the KVM ASI page-table so that KVM can run under
address space isolation without faulting too often.
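
Illustration only, not part of this patch: the reason for adding these
mappings can be modelled in userspace as a set of mapped ranges, where
any access outside the set would fault while isolation is active. The
model_* names, the fixed 4 KiB page size and the fixed-size range table
are assumptions made for the sketch; it does not reflect how the ASI
page-table is actually implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE		4096UL
#define MODEL_MAX_RANGES	16

/* Half-open, page-aligned range [start, end). */
struct range {
	uintptr_t start;
	uintptr_t end;
};

struct asi_model {
	struct range ranges[MODEL_MAX_RANGES];
	size_t count;
};

/* Record a mapping, widened to whole pages as a page-table would be. */
int model_asi_map(struct asi_model *asi, uintptr_t addr, size_t size)
{
	uintptr_t mask = ~(uintptr_t)(MODEL_PAGE_SIZE - 1);

	if (size == 0 || asi->count == MODEL_MAX_RANGES)
		return -1;

	asi->ranges[asi->count].start = addr & mask;
	asi->ranges[asi->count].end =
		(addr + size + MODEL_PAGE_SIZE - 1) & mask;
	asi->count++;
	return 0;
}

/* An access outside every recorded range would fault under isolation. */
bool model_asi_would_fault(const struct asi_model *asi, uintptr_t addr)
{
	size_t i;

	for (i = 0; i < asi->count; i++) {
		if (addr >= asi->ranges[i].start && addr < asi->ranges[i].end)
			return false;
	}
	return true;
}
```

Mapping a 32-byte object therefore also covers the rest of its page in
this model, while the first byte of the following page still faults.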

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/kvm/vmx/isolation.c |  155 ++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.c       |    1 -
 arch/x86/kvm/vmx/vmx.h       |    3 +
 3 files changed, 154 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/isolation.c b/arch/x86/kvm/vmx/isolation.c
index 644d8d3..d82f6b6 100644
--- a/arch/x86/kvm/vmx/isolation.c
+++ b/arch/x86/kvm/vmx/isolation.c
@@ -5,7 +5,7 @@
  * KVM Address Space Isolation
  */
 
-#include <linux/module.h>
+#include <linux/kvm_host.h>
 #include <linux/moduleparam.h>
 #include <linux/printk.h>
 #include <asm/asi.h>
@@ -14,8 +14,11 @@
 #include "vmx.h"
 #include "x86.h"
 
-#define VMX_ASI_MAP_FLAGS	\
-	(ASI_MAP_STACK_CANARY | ASI_MAP_CPU_PTR | ASI_MAP_CURRENT_TASK)
+#define VMX_ASI_MAP_FLAGS (ASI_MAP_STACK_CANARY |	\
+			   ASI_MAP_CPU_PTR |		\
+			   ASI_MAP_CURRENT_TASK |	\
+			   ASI_MAP_RCU_DATA |		\
+			   ASI_MAP_CPU_HW_EVENTS)
 
 /*
  * When set to true, KVM #VMExit handlers run in isolated address space
@@ -34,9 +37,153 @@
 static bool __read_mostly address_space_isolation;
 module_param(address_space_isolation, bool, 0444);
 
+/*
+ * Map various kernel data.
+ */
+static int vmx_isolation_map_kernel_data(struct asi *asi)
+{
+	int err;
+
+	/* map context_tracking, used by guest_enter_irqoff() */
+	err = ASI_MAP_CPUVAR(asi, context_tracking);
+	if (err)
+		return err;
+
+	/* map irq_stat, used by kvm_*_cpu_l1tf_flush_l1d */
+	err = ASI_MAP_CPUVAR(asi, irq_stat);
+	if (err)
+		return err;
+	return 0;
+}
+
+/*
+ * Map kvm module and data from that module.
+ */
+static int vmx_isolation_map_kvm_data(struct asi *asi, struct kvm *kvm)
+{
+	int err;
+
+	/* map kvm module */
+	err = asi_map_module(asi, "kvm");
+	if (err)
+		return err;
+
+	err = asi_map_percpu(asi, kvm->srcu.sda,
+			     sizeof(struct srcu_data));
+	if (err)
+		return err;
+
+	return 0;
+}
+
+/*
+ * Map kvm-intel module and generic x86 data.
+ */
+static int vmx_isolation_map_kvm_x86_data(struct asi *asi)
+{
+	int err;
+
+	/* map current module (kvm-intel) */
+	err = ASI_MAP_THIS_MODULE(asi);
+	if (err)
+		return err;
+
+	/* map current_vcpu, used by vcpu_enter_guest() */
+	err = ASI_MAP_CPUVAR(asi, current_vcpu);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+/*
+ * Map vmx data.
+ */
+static int vmx_isolation_map_kvm_vmx_data(struct asi *asi, struct vcpu_vmx *vmx)
+{
+	struct kvm_vmx *kvm_vmx;
+	struct kvm_vcpu *vcpu;
+	struct kvm *kvm;
+	int err;
+
+	vcpu = &vmx->vcpu;
+	kvm = vcpu->kvm;
+	kvm_vmx = to_kvm_vmx(kvm);
+
+	/* map kvm_vmx (this also maps kvm) */
+	err = asi_map(asi, kvm_vmx, sizeof(*kvm_vmx));
+	if (err)
+		return err;
+
+	/* map vmx (this also maps vcpu) */
+	err = asi_map(asi, vmx, sizeof(*vmx));
+	if (err)
+		return err;
+
+	/* map vcpu data */
+	err = asi_map(asi, vcpu->run, PAGE_SIZE);
+	if (err)
+		return err;
+
+	err = asi_map(asi, vcpu->arch.apic, sizeof(struct kvm_lapic));
+	if (err)
+		return err;
+
+	/*
+	 * Map additional vmx data.
+	 */
+
+	if (vmx_l1d_flush_pages) {
+		err = asi_map(asi, vmx_l1d_flush_pages,
+			      PAGE_SIZE << L1D_CACHE_ORDER);
+		if (err)
+			return err;
+	}
+
+	if (enable_pml) {
+		err = asi_map(asi, vmx->pml_pg, sizeof(struct page));
+		if (err)
+			return err;
+	}
+
+	err = asi_map(asi, vmx->guest_msrs, PAGE_SIZE);
+	if (err)
+		return err;
+
+	err = asi_map(asi, vmx->vmcs01.vmcs, PAGE_SIZE << vmcs_config.order);
+	if (err)
+		return err;
+
+	err = asi_map(asi, vmx->vmcs01.msr_bitmap, PAGE_SIZE);
+	if (err)
+		return err;
+
+	err = asi_map(asi, vmx->vcpu.arch.pio_data, PAGE_SIZE);
+	if (err)
+		return err;
+
+	return 0;
+}
+
 static int vmx_isolation_init_mapping(struct asi *asi, struct vcpu_vmx *vmx)
 {
-	/* TODO: Populate the KVM ASI page-table */
+	int err;
+
+	err = vmx_isolation_map_kernel_data(asi);
+	if (err)
+		return err;
+
+	err = vmx_isolation_map_kvm_data(asi, vmx->vcpu.kvm);
+	if (err)
+		return err;
+
+	err = vmx_isolation_map_kvm_x86_data(asi);
+	if (err)
+		return err;
+
+	err = vmx_isolation_map_kvm_vmx_data(asi, vmx);
+	if (err)
+		return err;
 
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9b92467..d47f093 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -201,7 +201,6 @@
 	[VMENTER_L1D_FLUSH_NOT_REQUIRED] = {"not required", false},
 };
 
-#define L1D_CACHE_ORDER 4
 void *vmx_l1d_flush_pages;
 
 static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 09c1593..e8de23b 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -11,6 +11,9 @@
 #include "ops.h"
 #include "vmcs.h"
 
+#define L1D_CACHE_ORDER 4
+extern void *vmx_l1d_flush_pages;
+
 extern const u32 vmx_msr_index[];
 extern u64 host_efer;
 
-- 
1.7.1


* [RFC v2 25/26] KVM: x86/asi: Switch to KVM address space on entry to guest
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (23 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 24/26] KVM: x86/asi: Populate the KVM ASI page-table Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:25 ` [RFC v2 26/26] KVM: x86/asi: Map KVM memslots and IO buses into KVM ASI Alexandre Chartre
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

From: Liran Alon <liran.alon@oracle.com>

Switch to the KVM address space on entry to the guest. Most KVM #VMExit
handlers will run in the KVM isolated address space and switch back to
the host address space only before accessing sensitive data. Sensitive
data is defined as either host data or other VM data.

Currently, we switch back to the host address space in the following
scenarios:
1) When handling guest page-faults:
   As this will access SPTs which contain host PFNs.
2) On schedule-out of vCPU thread
3) On write to guest virtual memory
   (kvm_write_guest_virt_system() can pull in tons of pages)
4) On return to userspace (e.g. QEMU)
5) On interrupt or exception
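
Illustration only, not part of this patch: the enter/exit behaviour
listed above can be modelled in userspace as a small state machine.
struct vcpu_model and the model_* functions are hypothetical names; the
sketch only tracks which address space is active and whether an L1D
flush has been requested, and does not model CR3 switching.

```c
#include <assert.h>
#include <stdbool.h>

struct vcpu_model {
	bool has_asi;		/* isolation was set up for this vCPU */
	bool isolated;		/* running on the ASI page-table */
	bool l1tf_flush_l1d;	/* L1D flush requested before VM entry */
};

/* Called just before entering the guest. */
void model_isolation_enter(struct vcpu_model *vcpu)
{
	if (vcpu->has_asi)
		vcpu->isolated = true;
}

/* Called before touching host data or another VM's data. */
void model_may_access_sensitive_data(struct vcpu_model *vcpu)
{
	vcpu->l1tf_flush_l1d = true;
	if (vcpu->has_asi)
		vcpu->isolated = false;
}
```

In this model a vCPU without an ASI never becomes isolated, but it still
requests the L1D flush, which matches the existing l1tf_flush_l1d
behaviour that the new helper wraps.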

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/kvm/mmu.c           |    2 +-
 arch/x86/kvm/vmx/isolation.c |    2 +-
 arch/x86/kvm/vmx/vmx.c       |    6 ++++++
 arch/x86/kvm/vmx/vmx.h       |   18 ++++++++++++++++++
 arch/x86/kvm/x86.c           |   34 +++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.h           |    1 +
 6 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 98f6e4f..298f602 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4067,7 +4067,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 {
 	int r = 1;
 
-	vcpu->arch.l1tf_flush_l1d = true;
+	kvm_may_access_sensitive_data(vcpu);
 	switch (vcpu->arch.apf.host_apf_reason) {
 	default:
 		trace_kvm_page_fault(fault_address, error_code);
diff --git a/arch/x86/kvm/vmx/isolation.c b/arch/x86/kvm/vmx/isolation.c
index d82f6b6..8f57f10 100644
--- a/arch/x86/kvm/vmx/isolation.c
+++ b/arch/x86/kvm/vmx/isolation.c
@@ -34,7 +34,7 @@
  * This is set to false by default because it incurs a performance hit
  * which some users will not want to take for security gain.
  */
-static bool __read_mostly address_space_isolation;
+bool __read_mostly address_space_isolation;
 module_param(address_space_isolation, bool, 0444);
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d47f093..b5867cc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6458,8 +6458,14 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.cr2 != read_cr2())
 		write_cr2(vcpu->arch.cr2);
 
+	/*
+	 * Use an isolation barrier as VMExit will restore the isolation
+	 * CR3 while interrupts can abort isolation.
+	 */
+	vmx_isolation_barrier_begin(vmx);
 	vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs,
 				   vmx->loaded_vmcs->launched);
+	vmx_isolation_barrier_end(vmx);
 
 	vcpu->arch.cr2 = read_cr2();
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e8de23b..b65f059 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -531,4 +531,22 @@ static inline void decache_tsc_multiplier(struct vcpu_vmx *vmx)
 int vmx_isolation_init(struct vcpu_vmx *vmx);
 void vmx_isolation_uninit(struct vcpu_vmx *vmx);
 
+extern bool __read_mostly address_space_isolation;
+
+static inline void vmx_isolation_barrier_begin(struct vcpu_vmx *vmx)
+{
+	if (!address_space_isolation || !vmx->vcpu.asi)
+		return;
+
+	asi_barrier_begin();
+}
+
+static inline void vmx_isolation_barrier_end(struct vcpu_vmx *vmx)
+{
+	if (!address_space_isolation || !vmx->vcpu.asi)
+		return;
+
+	asi_barrier_end();
+}
+
 #endif /* __KVM_X86_VMX_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9857992..9458413 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3346,6 +3346,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	 * guest. do_debug expects dr6 to be cleared after it runs, do the same.
 	 */
 	set_debugreg(0, 6);
+
+	kvm_may_access_sensitive_data(vcpu);
 }
 
 static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
@@ -5259,7 +5261,7 @@ int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu, gva_t addr, void *val,
 				unsigned int bytes, struct x86_exception *exception)
 {
 	/* kvm_write_guest_virt_system can pull in tons of pages. */
-	vcpu->arch.l1tf_flush_l1d = true;
+	kvm_may_access_sensitive_data(vcpu);
 
 	return kvm_write_guest_virt_helper(addr, val, bytes, vcpu,
 					   PFERR_WRITE_MASK, exception);
@@ -7744,6 +7746,32 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(__kvm_request_immediate_exit);
 
+static void vcpu_isolation_enter(struct kvm_vcpu *vcpu)
+{
+	int err;
+
+	if (!vcpu->asi)
+		return;
+
+	err = asi_enter(vcpu->asi);
+	if (err)
+		pr_debug("KVM isolation failed: error %d\n", err);
+}
+
+static void vcpu_isolation_exit(struct kvm_vcpu *vcpu)
+{
+	if (!vcpu->asi)
+		return;
+
+	asi_exit(vcpu->asi);
+}
+
+void kvm_may_access_sensitive_data(struct kvm_vcpu *vcpu)
+{
+	vcpu->arch.l1tf_flush_l1d = true;
+	vcpu_isolation_exit(vcpu);
+}
+
 /*
  * Returns 1 to let vcpu_run() continue the guest execution loop without
  * exiting to the userspace.  Otherwise, the value will be returned to the
@@ -7944,6 +7972,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		goto cancel_injection;
 	}
 
+	vcpu_isolation_enter(vcpu);
+
 	if (req_immediate_exit) {
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
 		kvm_x86_ops->request_immediate_exit(vcpu);
@@ -8130,6 +8160,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 
 	srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
 
+	kvm_may_access_sensitive_data(vcpu);
+
 	return r;
 }
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index a470ff0..69a7402 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -356,5 +356,6 @@ static inline bool kvm_pat_valid(u64 data)
 
 void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu);
 void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu);
+void kvm_may_access_sensitive_data(struct kvm_vcpu *vcpu);
 
 #endif
-- 
1.7.1


* [RFC v2 26/26] KVM: x86/asi: Map KVM memslots and IO buses into KVM ASI
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (24 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 25/26] KVM: x86/asi: Switch to KVM address space on entry to guest Alexandre Chartre
@ 2019-07-11 14:25 ` Alexandre Chartre
  2019-07-11 14:40 ` [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:25 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	alexandre.chartre

Map KVM memslots and IO buses into the KVM ASI. The mappings are checked
on each KVM ASI entry because memslots and buses can change.
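
Illustration only, not part of this patch: the check-and-remap idea can
be modelled in userspace by caching the last mapped pointer for each
slot and remapping only when the live pointer differs. The model_* names
are hypothetical, and the behaviour of model_asi_remap() (unmap old, map
new, update the cache) is an assumption inferred from the call site, not
a description of the real asi_remap().

```c
#include <assert.h>
#include <stddef.h>

struct asi_model {
	int map_calls;
	int unmap_calls;
};

/*
 * Hypothetical stand-in for asi_remap(): drop the mapping for the old
 * pointer (if any), map the new one, and remember it in *cached.
 */
int model_asi_remap(struct asi_model *asi, void **cached, void *new_ptr)
{
	if (*cached)
		asi->unmap_calls++;
	*cached = NULL;

	if (new_ptr) {
		asi->map_calls++;
		*cached = new_ptr;
	}
	return 0;
}

/* Refresh only the entries whose pointer changed; return how many. */
int model_refresh(struct asi_model *asi, void **cached, void **live,
		  size_t n)
{
	size_t i;
	int remapped = 0;

	for (i = 0; i < n; i++) {
		if (cached[i] == live[i])
			continue;
		model_asi_remap(asi, &cached[i], live[i]);
		remapped++;
	}
	return remapped;
}
```

The pointer comparison keeps the common case cheap: when nothing has
changed since the previous entry, no map or unmap work is done.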

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/kvm/x86.c       |   36 +++++++++++++++++++++++++++++++++++-
 include/linux/kvm_host.h |    2 ++
 2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9458413..7c52827 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7748,11 +7748,45 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
 
 static void vcpu_isolation_enter(struct kvm_vcpu *vcpu)
 {
-	int err;
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_io_bus *bus;
+	int i, err;
 
 	if (!vcpu->asi)
 		return;
 
+	/*
+	 * Check memslots and buses mapping as they tend to change.
+	 */
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		if (vcpu->asi_memslots[i] == kvm->memslots[i])
+			continue;
+		pr_debug("remapping kvm memslots[%d]: %px -> %px\n",
+			 i, vcpu->asi_memslots[i], kvm->memslots[i]);
+		err = asi_remap(vcpu->asi, &vcpu->asi_memslots[i],
+				kvm->memslots[i], sizeof(struct kvm_memslots));
+		if (err) {
+			pr_debug("failed to map kvm memslots[%d]: error %d\n",
+				 i, err);
+		}
+	}
+
+
+	for (i = 0; i < KVM_NR_BUSES; i++) {
+		bus = kvm->buses[i];
+		if (bus == vcpu->asi_buses[i])
+			continue;
+		pr_debug("remapping kvm buses[%d]: %px -> %px\n",
+			 i, vcpu->asi_buses[i], bus);
+		err = asi_remap(vcpu->asi, &vcpu->asi_buses[i], bus,
+				sizeof(*bus) + bus->dev_count *
+				sizeof(struct kvm_io_range));
+		if (err) {
+			pr_debug("failed to map kvm buses[%d]: error %d\n",
+				 i, err);
+		}
+	}
+
 	err = asi_enter(vcpu->asi);
 	if (err)
 		pr_debug("KVM isolation failed: error %d\n", err);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2a9d073..1f82de4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -324,6 +324,8 @@ struct kvm_vcpu {
 
 #ifdef CONFIG_ADDRESS_SPACE_ISOLATION
 	struct asi *asi;
+	void *asi_memslots[KVM_ADDRESS_SPACE_NUM];
+	void *asi_buses[KVM_NR_BUSES];
 #endif
 };
 
-- 
1.7.1


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (25 preceding siblings ...)
  2019-07-11 14:25 ` [RFC v2 26/26] KVM: x86/asi: Map KVM memslots and IO buses into KVM ASI Alexandre Chartre
@ 2019-07-11 14:40 ` Alexandre Chartre
  2019-07-11 22:38 ` Dave Hansen
  2019-07-12 11:44 ` Peter Zijlstra
  28 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 14:40 UTC (permalink / raw)
  To: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt


And I've just noticed that I've messed up the subject of the cover letter.
There are 26 patches, not 27. So it should have been 00/26 not 00/27.

Sorry about that.

alex.

On 7/11/19 4:25 PM, Alexandre Chartre wrote:
> Hi,
> 
> This is version 2 of the "KVM Address Space Isolation" RFC. The code
> has been completely changed compared to v1 and it now provides a generic
> kernel framework which provides Address Space Isolation; and KVM is now
> a simple consumer of that framework. That's why the RFC title has been
> changed from "KVM Address Space Isolation" to "Kernel Address Space
> Isolation".
> 
> Kernel Address Space Isolation aims to use address spaces to isolate some
> parts of the kernel (for example KVM) to prevent leaking sensitive data
> between hyper-threads under speculative execution attacks. You can refer
> to the first version of this RFC for more context:
> 
>     https://lkml.org/lkml/2019/5/13/515
> 
> The new code is still a proof of concept. It is much more stable than v1:
> I am able to run a VM with a full OS (and also a nested VM) with multiple
> vcpus. But it looks like there are still some corner cases which cause the
> system to crash/hang.
> 
> I am looking for feedback about this new approach where address space
> isolation is provided by the kernel, and KVM is just a consumer of this
> new framework.
> 
> 
> Changes
> =======
> 
> - Address Space Isolation (ASI) is now provided as a kernel framework:
>    interfaces for creating and managing an ASI are provided by the kernel,
>    they are not implemented in KVM.
> 
> - An ASI is associated with a page-table, we don't use mm anymore. Entering
>    isolation is done by just updating CR3 to use the ASI page-table. Exiting
>    isolation restores CR3 with the CR3 value present before entering isolation.
> 
> - Isolation is exited at the beginning of any interrupt/exception handler,
>    and on context switch.
> 
> - Isolation doesn't disable interrupt, but if an interrupt occurs the
>    interrupt handler will exit isolation.
> 
> - The current stack is mapped when entering isolation and unmapped when
>    exiting isolation.
> 
> - The current task is not mapped by default, but there's an option to map it.
>    In such a case, the current task is mapped when entering isolation and
>    unmapped when exiting isolation.
> 
> - Kernel code mapped to the ASI page-table has been reduced to:
>    . the entire kernel (I still need to test with only the kernel text)
>    . the cpu entry area (because we need the GDT to be mapped)
>    . the cpu ASI session (for managing ASI)
>    . the current stack
> 
> - Optionally, an ASI can request the following kernel mapping to be added:
>    . the stack canary
>    . the cpu offsets (this_cpu_off)
>    . the current task
>    . RCU data (rcu_data)
>    . CPU HW events (cpu_hw_events).
> 
>    All these optional mappings are used for KVM isolation.
>    
> 
> Patches:
> ========
> 
> The proposed patches provide a framework for creating an Address Space
> Isolation (ASI) (represented by a struct asi). The ASI has a page-table which
> can be populated by copying mappings from the kernel page-table. The ASI can
> then be entered/exited by switching between the kernel page-table and the
> ASI page-table. In addition, any interrupt, exception or context switch
> will automatically abort and exit the isolation. Finally patches use the
> ASI framework to implement KVM isolation.
> 
> - 01-03: Core of the ASI framework: create/destroy ASI, enter/exit/abort
>    isolation, ASI page-fault handler.
> 
> - 04-14: Functions to manage, populate and clear an ASI page-table.
> 
> - 15-20: ASI core mappings and optional mappings.
> 
> - 21: Make functions to read cr3/cr4 ASI aware
> 
> - 22-26: Use ASI in KVM to provide isolation for VMExit handlers.
> 
> 
> API Overview:
> =============
> Here is a short description of the main ASI functions provided by the framework.
> 
> struct asi *asi_create(int map_flags)
> 
>    Create an Address Space Isolation (ASI). map_flags can be used to specify
>    optional kernel mapping to be added to the ASI page-table (for example,
>    ASI_MAP_STACK_CANARY to map the stack canary).
> 
> 
> void asi_destroy(struct asi *asi)
> 
>    Destroy an ASI.
> 
> 
> int asi_enter(struct asi *asi)
> 
>    Enter isolation for the specified ASI. This switches from the kernel page-table
>    to the page-table associated with the ASI.
> 
> 
> void asi_exit(struct asi *asi)
> 
>    Exit isolation for the specified ASI. This switches back to the kernel
>    page-table
> 
> 
> int asi_map(struct asi *asi, void *ptr, unsigned long size);
> 
>    Copy kernel mapping to the specified ASI page-table.
> 
> 
> void asi_unmap(struct asi *asi, void *ptr);
> 
>    Clear kernel mapping from the specified ASI page-table.
> 
> 
> ----
> Alexandre Chartre (23):
>    mm/x86: Introduce kernel address space isolation
>    mm/asi: Abort isolation on interrupt, exception and context switch
>    mm/asi: Handle page fault due to address space isolation
>    mm/asi: Functions to track buffers allocated for an ASI page-table
>    mm/asi: Add ASI page-table entry offset functions
>    mm/asi: Add ASI page-table entry allocation functions
>    mm/asi: Add ASI page-table entry set functions
>    mm/asi: Functions to populate an ASI page-table from a VA range
>    mm/asi: Helper functions to map module into ASI
>    mm/asi: Keep track of VA ranges mapped in ASI page-table
>    mm/asi: Functions to clear ASI page-table entries for a VA range
>    mm/asi: Function to copy page-table entries for percpu buffer
>    mm/asi: Add asi_remap() function
>    mm/asi: Handle ASI mapped range leaks and overlaps
>    mm/asi: Initialize the ASI page-table with core mappings
>    mm/asi: Option to map current task into ASI
>    rcu: Move tree.h static forward declarations to tree.c
>    rcu: Make percpu rcu_data non-static
>    mm/asi: Add option to map RCU data
>    mm/asi: Add option to map cpu_hw_events
>    mm/asi: Make functions to read cr3/cr4 ASI aware
>    KVM: x86/asi: Populate the KVM ASI page-table
>    KVM: x86/asi: Map KVM memslots and IO buses into KVM ASI
> 
> Liran Alon (3):
>    KVM: x86/asi: Introduce address_space_isolation module parameter
>    KVM: x86/asi: Introduce KVM address space isolation
>    KVM: x86/asi: Switch to KVM address space on entry to guest
> 
>   arch/x86/entry/entry_64.S          |   42 ++-
>   arch/x86/include/asm/asi.h         |  237 ++++++++
>   arch/x86/include/asm/mmu_context.h |   20 +-
>   arch/x86/include/asm/tlbflush.h    |   10 +
>   arch/x86/kernel/asm-offsets.c      |    4 +
>   arch/x86/kvm/Makefile              |    3 +-
>   arch/x86/kvm/mmu.c                 |    2 +-
>   arch/x86/kvm/vmx/isolation.c       |  231 ++++++++
>   arch/x86/kvm/vmx/vmx.c             |   14 +-
>   arch/x86/kvm/vmx/vmx.h             |   24 +
>   arch/x86/kvm/x86.c                 |   68 +++-
>   arch/x86/kvm/x86.h                 |    1 +
>   arch/x86/mm/Makefile               |    2 +
>   arch/x86/mm/asi.c                  |  459 +++++++++++++++
>   arch/x86/mm/asi_pagetable.c        | 1077 ++++++++++++++++++++++++++++++++++++
>   arch/x86/mm/fault.c                |    7 +
>   include/linux/kvm_host.h           |    7 +
>   kernel/rcu/tree.c                  |   56 ++-
>   kernel/rcu/tree.h                  |   56 +--
>   kernel/sched/core.c                |    4 +
>   security/Kconfig                   |   10 +
>   21 files changed, 2269 insertions(+), 65 deletions(-)
>   create mode 100644 arch/x86/include/asm/asi.h
>   create mode 100644 arch/x86/kvm/vmx/isolation.c
>   create mode 100644 arch/x86/mm/asi.c
>   create mode 100644 arch/x86/mm/asi_pagetable.c
> 

* Re: [RFC v2 02/26] mm/asi: Abort isolation on interrupt, exception and context switch
  2019-07-11 14:25 ` [RFC v2 02/26] mm/asi: Abort isolation on interrupt, exception and context switch Alexandre Chartre
@ 2019-07-11 20:11   ` Andi Kleen
  2019-07-11 20:17     ` Mike Rapoport
  2019-07-12  0:05   ` Andy Lutomirski
  1 sibling, 1 reply; 68+ messages in thread
From: Andi Kleen @ 2019-07-11 20:11 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt

Alexandre Chartre <alexandre.chartre@oracle.com> writes:
>  	jmp	paranoid_exit
> @@ -1182,6 +1196,16 @@ ENTRY(paranoid_entry)
>  	xorl	%ebx, %ebx
>  
>  1:
> +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
> +	/*
> +	 * If address space isolation is active then abort it and return
> +	 * the original kernel CR3 in %r14.
> +	 */
> +	ASI_START_ABORT_ELSE_JUMP 2f
> +	movq	%rdi, %r14
> +	ret
> +2:
> +#endif

Unless I missed it you don't map the exception stacks into ASI, so it
has likely already triple faulted at this point.

-Andi

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 02/26] mm/asi: Abort isolation on interrupt, exception and context switch
  2019-07-11 20:11   ` Andi Kleen
@ 2019-07-11 20:17     ` Mike Rapoport
  2019-07-11 20:41       ` Alexandre Chartre
  0 siblings, 1 reply; 68+ messages in thread
From: Mike Rapoport @ 2019-07-11 20:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alexandre Chartre, pbonzini, rkrcmar, tglx, mingo, bp, hpa,
	dave.hansen, luto, peterz, kvm, x86, linux-mm, linux-kernel,
	konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt

On Thu, Jul 11, 2019 at 01:11:43PM -0700, Andi Kleen wrote:
> Alexandre Chartre <alexandre.chartre@oracle.com> writes:
> >  	jmp	paranoid_exit
> > @@ -1182,6 +1196,16 @@ ENTRY(paranoid_entry)
> >  	xorl	%ebx, %ebx
> >  
> >  1:
> > +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
> > +	/*
> > +	 * If address space isolation is active then abort it and return
> > +	 * the original kernel CR3 in %r14.
> > +	 */
> > +	ASI_START_ABORT_ELSE_JUMP 2f
> > +	movq	%rdi, %r14
> > +	ret
> > +2:
> > +#endif
> 
> Unless I missed it you don't map the exception stacks into ASI, so it
> has likely already triple faulted at this point.

The exception stacks are in the CPU entry area, aren't they?
 
> -Andi
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 02/26] mm/asi: Abort isolation on interrupt, exception and context switch
  2019-07-11 20:17     ` Mike Rapoport
@ 2019-07-11 20:41       ` Alexandre Chartre
  0 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-11 20:41 UTC (permalink / raw)
  To: Mike Rapoport, Andi Kleen
  Cc: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt



On 7/11/19 10:17 PM, Mike Rapoport wrote:
> On Thu, Jul 11, 2019 at 01:11:43PM -0700, Andi Kleen wrote:
>> Alexandre Chartre <alexandre.chartre@oracle.com> writes:
>>>   	jmp	paranoid_exit
>>> @@ -1182,6 +1196,16 @@ ENTRY(paranoid_entry)
>>>   	xorl	%ebx, %ebx
>>>   
>>>   1:
>>> +#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
>>> +	/*
>>> +	 * If address space isolation is active then abort it and return
>>> +	 * the original kernel CR3 in %r14.
>>> +	 */
>>> +	ASI_START_ABORT_ELSE_JUMP 2f
>>> +	movq	%rdi, %r14
>>> +	ret
>>> +2:
>>> +#endif
>>
>> Unless I missed it you don't map the exception stacks into ASI, so it
>> has likely already triple faulted at this point.
> 
> The exception stacks are in the CPU entry area, aren't they?
>   

That's my understanding: the stacks come from the TSS in the CPU entry area,
and the CPU entry area is part of the core ASI mappings (see patch 15/26).

alex.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 01/26] mm/x86: Introduce kernel address space isolation
  2019-07-11 14:25 ` [RFC v2 01/26] mm/x86: Introduce kernel address space isolation Alexandre Chartre
@ 2019-07-11 21:33   ` Thomas Gleixner
  2019-07-12  7:43     ` Alexandre Chartre
  0 siblings, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2019-07-11 21:33 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: pbonzini, rkrcmar, mingo, bp, hpa, dave.hansen, luto, peterz,
	kvm, x86, linux-mm, linux-kernel, konrad.wilk, jan.setjeeilers,
	liran.alon, jwadams, graf, rppt

On Thu, 11 Jul 2019, Alexandre Chartre wrote:
> +/*
> + * When isolation is active, the address space doesn't necessarily map
> + * the percpu offset value (this_cpu_off) which is used to get pointers
> + * to percpu variables. So functions which can be invoked while isolation
> + * is active shouldn't be getting pointers to percpu variables (i.e. with
> + * get_cpu_var() or this_cpu_ptr()). Instead percpu variable should be
> + * directly read or written to (i.e. with this_cpu_read() or
> + * this_cpu_write()).
> + */
> +
> +int asi_enter(struct asi *asi)
> +{
> +	enum asi_session_state state;
> +	struct asi *current_asi;
> +	struct asi_session *asi_session;
> +
> +	state = this_cpu_read(cpu_asi_session.state);
> +	/*
> +	 * We can re-enter isolation, but only with the same ASI (we don't
> +	 * support nesting isolation). Also, if isolation is still active,
> +	 * then we should be re-entering with the same task.
> +	 */
> +	if (state == ASI_SESSION_STATE_ACTIVE) {
> +		current_asi = this_cpu_read(cpu_asi_session.asi);
> +		if (current_asi != asi) {
> +			WARN_ON(1);
> +			return -EBUSY;
> +		}
> +		WARN_ON(this_cpu_read(cpu_asi_session.task) != current);
> +		return 0;
> +	}
> +
> +	/* isolation is not active so we can safely access the percpu pointer */
> +	asi_session = &get_cpu_var(cpu_asi_session);

get_cpu_var()?? Where is the matching put_cpu_var() ? get_cpu_var()
contains a preempt_disable ...

What's wrong with a simple this_cpu_ptr() here?

> +void asi_exit(struct asi *asi)
> +{
> +	struct asi_session *asi_session;
> +	enum asi_session_state asi_state;
> +	unsigned long original_cr3;
> +
> +	asi_state = this_cpu_read(cpu_asi_session.state);
> +	if (asi_state == ASI_SESSION_STATE_INACTIVE)
> +		return;
> +
> +	/* TODO: Kick sibling hyperthread before switching to kernel cr3 */
> +	original_cr3 = this_cpu_read(cpu_asi_session.original_cr3);
> +	if (original_cr3)

Why would this be 0 if the session is active?

> +		write_cr3(original_cr3);
> +
> +	/* page-table was switched, we can now access the percpu pointer */
> +	asi_session = &get_cpu_var(cpu_asi_session);

See above.

> +	WARN_ON(asi_session->task != current);
> +	asi_session->state = ASI_SESSION_STATE_INACTIVE;
> +	asi_session->asi = NULL;
> +	asi_session->task = NULL;
> +	asi_session->original_cr3 = 0;
> +}

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (26 preceding siblings ...)
  2019-07-11 14:40 ` [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
@ 2019-07-11 22:38 ` Dave Hansen
  2019-07-12  8:09   ` Alexandre Chartre
  2019-07-12 10:44   ` Thomas Gleixner
  2019-07-12 11:44 ` Peter Zijlstra
  28 siblings, 2 replies; 68+ messages in thread
From: Dave Hansen @ 2019-07-11 22:38 UTC (permalink / raw)
  To: Alexandre Chartre, pbonzini, rkrcmar, tglx, mingo, bp, hpa,
	dave.hansen, luto, peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt

On 7/11/19 7:25 AM, Alexandre Chartre wrote:
> - Kernel code mapped to the ASI page-table has been reduced to:
>   . the entire kernel (I still need to test with only the kernel text)
>   . the cpu entry area (because we need the GDT to be mapped)
>   . the cpu ASI session (for managing ASI)
>   . the current stack
> 
> - Optionally, an ASI can request the following kernel mapping to be added:
>   . the stack canary
>   . the cpu offsets (this_cpu_off)
>   . the current task
>   . RCU data (rcu_data)
>   . CPU HW events (cpu_hw_events).

I don't see the per-cpu areas in here.  But, the ASI macros in
entry_64.S (and asi_start_abort()) use per-cpu data.

Also, this stuff seems to do naughty stuff (calling C code, touching
per-cpu data) before the PTI CR3 writes have been done.  But, I don't
see anything excluding PTI and this code from coexisting.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 02/26] mm/asi: Abort isolation on interrupt, exception and context switch
  2019-07-11 14:25 ` [RFC v2 02/26] mm/asi: Abort isolation on interrupt, exception and context switch Alexandre Chartre
  2019-07-11 20:11   ` Andi Kleen
@ 2019-07-12  0:05   ` Andy Lutomirski
  2019-07-12  7:50     ` Alexandre Chartre
  1 sibling, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2019-07-12  0:05 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt


> On Jul 11, 2019, at 8:25 AM, Alexandre Chartre <alexandre.chartre@oracle.com> wrote:
> 
> Address space isolation should be aborted if there is an interrupt,
> an exception or a context switch. Interrupt/exception handlers and
> context switch code need to run with the full kernel address space.
> Address space isolation is aborted by restoring the original CR3
> value used before entering address space isolation.
> 

NAK to the entry changes. That code you're changing is already known
to be a bit buggy, and it's spaghetti. PeterZ and I are gradually
working on fixing some bugs and C-ifying it. ASI can go on top.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 01/26] mm/x86: Introduce kernel address space isolation
  2019-07-11 21:33   ` Thomas Gleixner
@ 2019-07-12  7:43     ` Alexandre Chartre
  0 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-12  7:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: pbonzini, rkrcmar, mingo, bp, hpa, dave.hansen, luto, peterz,
	kvm, x86, linux-mm, linux-kernel, konrad.wilk, jan.setjeeilers,
	liran.alon, jwadams, graf, rppt


On 7/11/19 11:33 PM, Thomas Gleixner wrote:
> On Thu, 11 Jul 2019, Alexandre Chartre wrote:
>> +/*
>> + * When isolation is active, the address space doesn't necessarily map
>> + * the percpu offset value (this_cpu_off) which is used to get pointers
>> + * to percpu variables. So functions which can be invoked while isolation
>> + * is active shouldn't be getting pointers to percpu variables (i.e. with
>> + * get_cpu_var() or this_cpu_ptr()). Instead percpu variable should be
>> + * directly read or written to (i.e. with this_cpu_read() or
>> + * this_cpu_write()).
>> + */
>> +
>> +int asi_enter(struct asi *asi)
>> +{
>> +	enum asi_session_state state;
>> +	struct asi *current_asi;
>> +	struct asi_session *asi_session;
>> +
>> +	state = this_cpu_read(cpu_asi_session.state);
>> +	/*
>> +	 * We can re-enter isolation, but only with the same ASI (we don't
>> +	 * support nesting isolation). Also, if isolation is still active,
>> +	 * then we should be re-entering with the same task.
>> +	 */
>> +	if (state == ASI_SESSION_STATE_ACTIVE) {
>> +		current_asi = this_cpu_read(cpu_asi_session.asi);
>> +		if (current_asi != asi) {
>> +			WARN_ON(1);
>> +			return -EBUSY;
>> +		}
>> +		WARN_ON(this_cpu_read(cpu_asi_session.task) != current);
>> +		return 0;
>> +	}
>> +
>> +	/* isolation is not active so we can safely access the percpu pointer */
>> +	asi_session = &get_cpu_var(cpu_asi_session);
> 
> get_cpu_var()?? Where is the matching put_cpu_var() ? get_cpu_var()
> contains a preempt_disable ...
> 
> What's wrong with a simple this_cpu_ptr() here?
> 

Oops, my mistake, I should be using this_cpu_ptr(). I will replace all
get_cpu_var() calls with this_cpu_ptr().
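
To illustrate the imbalance discussed above, here is a minimal userspace
sketch, not kernel code. It assumes that get_cpu_var() disables preemption
before yielding the variable and that put_cpu_var() re-enables it, as Thomas
describes. Every model_* name is hypothetical and only stands in for the
kernel primitive named in its comment.

```c
#include <assert.h>

/* Hypothetical stand-in for the preempt counter. */
static int model_preempt_count;

struct model_asi_session { int state; };

/* A single "CPU" is enough for this model. */
static struct model_asi_session model_cpu_asi_session;

/* Models get_cpu_var(): disable preemption, then yield the variable. */
static struct model_asi_session *model_get_cpu_var(void)
{
	model_preempt_count++;
	return &model_cpu_asi_session;
}

/* Models put_cpu_var(): re-enable preemption. */
static void model_put_cpu_var(void)
{
	model_preempt_count--;
}

/* Models this_cpu_ptr(): no effect on the preempt counter. */
static struct model_asi_session *model_this_cpu_ptr(void)
{
	return &model_cpu_asi_session;
}

/* Pattern from the RFC: get without put, so each call leaks one disable. */
static int model_enter_unbalanced(void)
{
	struct model_asi_session *s = model_get_cpu_var();

	s->state = 1;
	return model_preempt_count;
}

/* Proposed fix: take the pointer without touching the preempt counter. */
static int model_enter_fixed(void)
{
	struct model_asi_session *s = model_this_cpu_ptr();

	s->state = 1;
	return model_preempt_count;
}
```

In this model the counter grows by one on every unbalanced call and stays
unchanged with the this_cpu_ptr() variant.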


>> +void asi_exit(struct asi *asi)
>> +{
>> +	struct asi_session *asi_session;
>> +	enum asi_session_state asi_state;
>> +	unsigned long original_cr3;
>> +
>> +	asi_state = this_cpu_read(cpu_asi_session.state);
>> +	if (asi_state == ASI_SESSION_STATE_INACTIVE)
>> +		return;
>> +
>> +	/* TODO: Kick sibling hyperthread before switching to kernel cr3 */
>> +	original_cr3 = this_cpu_read(cpu_asi_session.original_cr3);
>> +	if (original_cr3)
> 
> Why would this be 0 if the session is active?
> 

Correct, original_cr3 won't be 0. I think this is a leftover from a previous
version where original_cr3 was handled differently.


>> +		write_cr3(original_cr3);
>> +
>> +	/* page-table was switched, we can now access the percpu pointer */
>> +	asi_session = &get_cpu_var(cpu_asi_session);
> 
> See above.
> 

Will fix that.


Thanks,

alex.

>> +	WARN_ON(asi_session->task != current);
>> +	asi_session->state = ASI_SESSION_STATE_INACTIVE;
>> +	asi_session->asi = NULL;
>> +	asi_session->task = NULL;
>> +	asi_session->original_cr3 = 0;
>> +}
> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 02/26] mm/asi: Abort isolation on interrupt, exception and context switch
  2019-07-12  0:05   ` Andy Lutomirski
@ 2019-07-12  7:50     ` Alexandre Chartre
  0 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-12  7:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto,
	peterz, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt



On 7/12/19 2:05 AM, Andy Lutomirski wrote:
> 
>> On Jul 11, 2019, at 8:25 AM, Alexandre Chartre <alexandre.chartre@oracle.com> wrote:
>>
>> Address space isolation should be aborted if there is an interrupt,
>> an exception or a context switch. Interrupt/exception handlers and
>> context switch code need to run with the full kernel address space.
>> Address space isolation is aborted by restoring the original CR3
>> value used before entering address space isolation.
>>
> 
> NAK to the entry changes. That code you're changing is already known
> to be a bit buggy, and it's spaghetti. PeterZ and I are gradually
> working on fixing some bugs and C-ifying it. ASI can go on top.
> 

Agreed, this is spaghetti, and I will be happy to move ASI on top. I will keep
an eye out for your changes and adapt the ASI code accordingly.

Thanks,

alex.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-11 22:38 ` Dave Hansen
@ 2019-07-12  8:09   ` Alexandre Chartre
  2019-07-12 13:51     ` Dave Hansen
  2019-07-12 10:44   ` Thomas Gleixner
  1 sibling, 1 reply; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-12  8:09 UTC (permalink / raw)
  To: Dave Hansen, pbonzini, rkrcmar, tglx, mingo, bp, hpa,
	dave.hansen, luto, peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt


On 7/12/19 12:38 AM, Dave Hansen wrote:
> On 7/11/19 7:25 AM, Alexandre Chartre wrote:
>> - Kernel code mapped to the ASI page-table has been reduced to:
>>    . the entire kernel (I still need to test with only the kernel text)
>>    . the cpu entry area (because we need the GDT to be mapped)
>>    . the cpu ASI session (for managing ASI)
>>    . the current stack
>>
>> - Optionally, an ASI can request the following kernel mapping to be added:
>>    . the stack canary
>>    . the cpu offsets (this_cpu_off)
>>    . the current task
>>    . RCU data (rcu_data)
>>    . CPU HW events (cpu_hw_events).
> 
> I don't see the per-cpu areas in here.  But, the ASI macros in
> entry_64.S (and asi_start_abort()) use per-cpu data.

We don't map all per-cpu areas, but only the per-cpu variables we need. ASI
code uses the per-cpu cpu_asi_session variable which is mapped when an ASI
is created (see patch 15/26):

+	/*
+	 * Map the percpu ASI sessions. This is used by interrupt handlers
+	 * to figure out if we have entered isolation and switch back to
+	 * the kernel address space.
+	 */
+	err = ASI_MAP_CPUVAR(asi, cpu_asi_session);
+	if (err)
+		return err;


> Also, this stuff seems to do naughty stuff (calling C code, touching
> per-cpu data) before the PTI CR3 writes have been done.  But, I don't
> see anything excluding PTI and this code from coexisting.

My understanding is that PTI CR3 writes only happen when switching to/from
userland, while ASI enter/exit/abort happens when we are already in the kernel.
So asi_start_abort() is not called when coming from userland and does not
interact with PTI.

For example, if ASI is used during a syscall (e.g. with KVM), we have:

  -> syscall
     - PTI CR3 write (kernel CR3)
     - syscall handler:
       ...
       asi_enter()-> write ASI CR3
       .. code run with ASI ..
       asi_exit() or asi abort -> restore original CR3
       ...
     - PTI CR3 write (userland CR3)
  <- syscall
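
The flow above can be sketched as a small state model. This is a userspace
illustration only, under the assumption that ASI saves whatever CR3 is loaded
when isolation is entered and restores it on exit or abort. The model_* names
are hypothetical and do not come from the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the three page-tables involved. */
enum model_cr3 { MODEL_CR3_USER, MODEL_CR3_KERNEL, MODEL_CR3_ASI };

struct model_cpu {
	enum model_cr3 cr3;          /* page-table currently loaded */
	enum model_cr3 original_cr3; /* saved when isolation is entered */
	bool asi_active;
};

/* PTI: entering the kernel from userland loads the kernel page-table. */
static void model_syscall_entry(struct model_cpu *cpu)
{
	cpu->cr3 = MODEL_CR3_KERNEL;
}

/* PTI: returning to userland loads the user page-table. */
static void model_syscall_return(struct model_cpu *cpu)
{
	cpu->cr3 = MODEL_CR3_USER;
}

/* ASI: save the current CR3 and switch to the ASI page-table. */
static void model_asi_enter(struct model_cpu *cpu)
{
	if (cpu->asi_active)
		return;
	cpu->original_cr3 = cpu->cr3;
	cpu->cr3 = MODEL_CR3_ASI;
	cpu->asi_active = true;
}

/* ASI exit (or abort): restore the CR3 saved on entry. */
static void model_asi_exit(struct model_cpu *cpu)
{
	if (!cpu->asi_active)
		return;
	cpu->cr3 = cpu->original_cr3;
	cpu->asi_active = false;
}
```

In this model the ASI switch is always nested inside the PTI switch, so the
value restored by model_asi_exit() is the kernel CR3, never the user one.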


Thanks,

alex.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-11 22:38 ` Dave Hansen
  2019-07-12  8:09   ` Alexandre Chartre
@ 2019-07-12 10:44   ` Thomas Gleixner
  2019-07-12 11:56     ` Alexandre Chartre
  1 sibling, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2019-07-12 10:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alexandre Chartre, pbonzini, rkrcmar, mingo, bp, hpa,
	dave.hansen, luto, peterz, kvm, x86, linux-mm, linux-kernel,
	konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt

On Thu, 11 Jul 2019, Dave Hansen wrote:

> On 7/11/19 7:25 AM, Alexandre Chartre wrote:
> > - Kernel code mapped to the ASI page-table has been reduced to:
> >   . the entire kernel (I still need to test with only the kernel text)
> >   . the cpu entry area (because we need the GDT to be mapped)
> >   . the cpu ASI session (for managing ASI)
> >   . the current stack
> > 
> > - Optionally, an ASI can request the following kernel mapping to be added:
> >   . the stack canary
> >   . the cpu offsets (this_cpu_off)
> >   . the current task
> >   . RCU data (rcu_data)
> >   . CPU HW events (cpu_hw_events).
> 
> I don't see the per-cpu areas in here.  But, the ASI macros in
> entry_64.S (and asi_start_abort()) use per-cpu data.
> 
> Also, this stuff seems to do naughty stuff (calling C code, touching
> per-cpu data) before the PTI CR3 writes have been done.  But, I don't
> see anything excluding PTI and this code from coexisting.

That ASI thing is just PTI on steroids.

So why do we need two versions of the same thing? That's absolutely bonkers
and will just introduce subtle bugs and conflicting decisions all over the
place.

The need for ASI is very tightly coupled to the need for PTI and there is
absolutely no point in keeping them separate.

The only difference vs. interrupts and exceptions is that the PTI logic
cares whether they enter from user or from kernel space while ASI only
cares about the kernel entry.

But most exception/interrupt transitions do not need to be handled at
the entry code level because on VMEXIT the exit reason clearly tells
whether a switch to the kernel CR3 is necessary or not. So this has to be
handled at the VMM level already in a very clean and simple way.

I'm not a virt wizard, but according to code inspection and instrumentation
even the NMI on the host is actually reinjected manually into the host via
'int $2' after the VMEXIT and for MCE it looks like manual handling as
well. So why do we need to sprinkle that muck all over the entry code?

From a semantical perspective VMENTER/VMEXIT are very similar to the return
to user / enter to user mechanics. Just that the transition happens in the
VMM code and not at the regular user/kernel transition points.

So why do you want to treat that differently? There is absolutely zero
reason to do so. And there is no reason to create a pointlessly different
version of PTI which introduces yet another variant of a restricted page
table instead of just reusing and extending what's there already.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
                   ` (27 preceding siblings ...)
  2019-07-11 22:38 ` Dave Hansen
@ 2019-07-12 11:44 ` Peter Zijlstra
  2019-07-12 12:17   ` Alexandre Chartre
  28 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2019-07-12 11:44 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto, kvm,
	x86, linux-mm, linux-kernel, konrad.wilk, jan.setjeeilers,
	liran.alon, jwadams, graf, rppt, Paul Turner

On Thu, Jul 11, 2019 at 04:25:12PM +0200, Alexandre Chartre wrote:
> Kernel Address Space Isolation aims to use address spaces to isolate some
> parts of the kernel (for example KVM) to prevent leaking sensitive data
> between hyper-threads under speculative execution attacks. You can refer
> to the first version of this RFC for more context:
> 
>    https://lkml.org/lkml/2019/5/13/515

No, no, no!

That is the crux of this entire series; you're not punting on explaining
exactly why we want to go dig through 26 patches of gunk.

You get to exactly explain what (your definition of) sensitive data is,
and which speculative scenarios and how this approach mitigates them.

And included in that is a high level overview of the whole thing.

On the one hand you've made this implementation for KVM, while on the
other hand you're saying it is generic but then fail to describe any
!KVM user.

AFAIK all speculative fails this is relevant to are now public, so
excruciating horrible details are fine and required.

AFAIK2 this is all because of MDS but it also helps with v1.

AFAIK3 this wants/needs to be combined with core-scheduling to be
useful, but not a single mention of that is anywhere.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 10:44   ` Thomas Gleixner
@ 2019-07-12 11:56     ` Alexandre Chartre
  2019-07-12 12:50       ` Peter Zijlstra
  2019-07-12 16:00       ` Thomas Gleixner
  0 siblings, 2 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-12 11:56 UTC (permalink / raw)
  To: Thomas Gleixner, Dave Hansen
  Cc: pbonzini, rkrcmar, mingo, bp, hpa, dave.hansen, luto, peterz,
	kvm, x86, linux-mm, linux-kernel, konrad.wilk, jan.setjeeilers,
	liran.alon, jwadams, graf, rppt


On 7/12/19 12:44 PM, Thomas Gleixner wrote:
> On Thu, 11 Jul 2019, Dave Hansen wrote:
> 
>> On 7/11/19 7:25 AM, Alexandre Chartre wrote:
>>> - Kernel code mapped to the ASI page-table has been reduced to:
>>>    . the entire kernel (I still need to test with only the kernel text)
>>>    . the cpu entry area (because we need the GDT to be mapped)
>>>    . the cpu ASI session (for managing ASI)
>>>    . the current stack
>>>
>>> - Optionally, an ASI can request the following kernel mapping to be added:
>>>    . the stack canary
>>>    . the cpu offsets (this_cpu_off)
>>>    . the current task
>>>    . RCU data (rcu_data)
>>>    . CPU HW events (cpu_hw_events).
>>
>> I don't see the per-cpu areas in here.  But, the ASI macros in
>> entry_64.S (and asi_start_abort()) use per-cpu data.
>>
>> Also, this stuff seems to do naughty stuff (calling C code, touching
>> per-cpu data) before the PTI CR3 writes have been done.  But, I don't
>> see anything excluding PTI and this code from coexisting.
> 
> That ASI thing is just PTI on steroids.
> 
> So why do we need two versions of the same thing? That's absolutely bonkers
> and will just introduce subtle bugs and conflicting decisions all over the
> place.
> 
> The need for ASI is very tightly coupled to the need for PTI and there is
> absolutely no point in keeping them separate.
>
> The only difference vs. interrupts and exceptions is that the PTI logic
> cares whether they enter from user or from kernel space while ASI only
> cares about the kernel entry.

I think that's precisely what makes ASI and PTI different and independent.
PTI is just about switching between userland and kernel page-tables, while
ASI is about switching page-table inside the kernel. You can have ASI without
having PTI. You can also use ASI for kernel threads, that is, for code that
won't be triggered from userland and so won't involve PTI.

> But most exception/interrupt transitions do not need to be handled at
> the entry code level because on VMEXIT the exit reason clearly tells
> whether a switch to the kernel CR3 is necessary or not. So this has to be
> handled at the VMM level already in a very clean and simple way.
> 
> I'm not a virt wizard, but according to code inspection and instrumentation
> even the NMI on the host is actually reinjected manually into the host via
> 'int $2' after the VMEXIT and for MCE it looks like manual handling as
> well. So why do we need to sprinkle that muck all over the entry code?
> 
>  From a semantical perspective VMENTER/VMEXIT are very similar to the return
> to user / enter to user mechanics. Just that the transition happens in the
> VMM code and not at the regular user/kernel transition points.

VMExit returns to the kernel, and ASI is used to run the VMExit handler with
a limited kernel address space instead of using the full kernel address space.
Changes to the entry code are required to handle any interrupt/exception that
can occur while running code with ASI (like the KVM VMExit handler).

Note that KVM is an example of an ASI consumer, but ASI is generic and can be
used to run (mostly) any kernel code if you want to run code with a reduced
kernel address space.

> So why do you want to treat that differently? There is absolutely zero
> reason to do so. And there is no reason to create a pointlessly different
> version of PTI which introduces yet another variant of a restricted page
> table instead of just reusing and extending what's there already.
> 

As I've tried to explain, to me PTI and ASI are different and independent.
PTI manages switching between userland and kernel page-table, and ASI manages
switching between kernel and a reduced-kernel page-table.


Thanks,

alex.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 11:44 ` Peter Zijlstra
@ 2019-07-12 12:17   ` Alexandre Chartre
  2019-07-12 12:36     ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-12 12:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto, kvm,
	x86, linux-mm, linux-kernel, konrad.wilk, jan.setjeeilers,
	liran.alon, jwadams, graf, rppt, Paul Turner


On 7/12/19 1:44 PM, Peter Zijlstra wrote:
> On Thu, Jul 11, 2019 at 04:25:12PM +0200, Alexandre Chartre wrote:
>> Kernel Address Space Isolation aims to use address spaces to isolate some
>> parts of the kernel (for example KVM) to prevent leaking sensitive data
>> between hyper-threads under speculative execution attacks. You can refer
>> to the first version of this RFC for more context:
>>
>>     https://lkml.org/lkml/2019/5/13/515
> 
> No, no, no!
> 
> That is the crux of this entire series; you're not punting on explaining
> exactly why we want to go dig through 26 patches of gunk.
> 
> You get to exactly explain what (your definition of) sensitive data is,
> and which speculative scenarios and how this approach mitigates them.
> 
> And included in that is a high level overview of the whole thing.
> 

Ok, I will rework the explanation. Sorry about that.

> On the one hand you've made this implementation for KVM, while on the
> other hand you're saying it is generic but then fail to describe any
> !KVM user.
> 
> AFAIK all speculative fails this is relevant to are now public, so
> excruciating horrible details are fine and required.

Ok.

> AFAIK2 this is all because of MDS but it also helps with v1.

Yes, mostly MDS and also L1TF.

> AFAIK3 this wants/needs to be combined with core-scheduling to be
> useful, but not a single mention of that is anywhere.

No. This is actually an alternative to core-scheduling. Eventually, ASI
will kick all sibling hyperthreads when it exits isolation and needs to
run with the full kernel page-table (note that this is not yet in these
patches).

So ASI can be seen as an optimization to disabling hyperthreading: instead
of just disabling hyperthreading you run with ASI, and when ASI can't preserve
isolation you will basically run with a single thread.

I will add all that to the explanation.

Thanks,

alex.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 12:17   ` Alexandre Chartre
@ 2019-07-12 12:36     ` Peter Zijlstra
  2019-07-12 12:47       ` Alexandre Chartre
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2019-07-12 12:36 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto, kvm,
	x86, linux-mm, linux-kernel, konrad.wilk, jan.setjeeilers,
	liran.alon, jwadams, graf, rppt, Paul Turner

On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre wrote:
> On 7/12/19 1:44 PM, Peter Zijlstra wrote:

> > AFAIK3 this wants/needs to be combined with core-scheduling to be
> > useful, but not a single mention of that is anywhere.
> 
> No. This is actually an alternative to core-scheduling. Eventually, ASI
> will kick all sibling hyperthreads when it exits isolation and needs to
> run with the full kernel page-table (note that this is not yet in these
> patches).
> 
> So ASI can be seen as an optimization to disabling hyperthreading: instead
> of just disabling hyperthreading you run with ASI, and when ASI can't preserve
> isolation you will basically run with a single thread.

You can't do that without much of the scheduler changes present in the
core-scheduling patches.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 12:36     ` Peter Zijlstra
@ 2019-07-12 12:47       ` Alexandre Chartre
  2019-07-12 13:07         ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-12 12:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto, kvm,
	x86, linux-mm, linux-kernel, konrad.wilk, jan.setjeeilers,
	liran.alon, jwadams, graf, rppt, Paul Turner


On 7/12/19 2:36 PM, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre wrote:
>> On 7/12/19 1:44 PM, Peter Zijlstra wrote:
> 
>>> AFAIK3 this wants/needs to be combined with core-scheduling to be
>>> useful, but not a single mention of that is anywhere.
>>
>> No. This is actually an alternative to core-scheduling. Eventually, ASI
>> will kick all sibling hyperthreads when it exits isolation and needs to
>> run with the full kernel page-table (note that this is not yet in these
>> patches).
>>
>> So ASI can be seen as an optimization to disabling hyperthreading: instead
>> of just disabling hyperthreading you run with ASI, and when ASI can't preserve
>> isolation you will basically run with a single thread.
> 
> You can't do that without much of the scheduler changes present in the
> core-scheduling patches.
> 

We hope we can do that without the whole core-scheduling mechanism. The idea
is to send an IPI to all sibling hyperthreads. This IPI will interrupt these
sibling hyperthreads and have them wait for a condition that will allow them
to resume execution (for example when re-entering isolation). We are
investigating this in parallel to ASI.
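
As a rough illustration of the idea only (none of this is in the patches),
here is a single-threaded toy model of a two-thread core. It assumes that
leaving isolation parks the sibling, and that re-entering isolation is the
condition that lets it resume. The model_* names and the two-sibling limit
are hypothetical, and a real implementation would need actual IPIs and
proper synchronization.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_SIBLINGS 2 /* assumption: two hyperthreads per core */

struct model_core {
	bool isolated[MODEL_SIBLINGS]; /* thread runs with the ASI page-table */
	bool stalled[MODEL_SIBLINGS];  /* thread is parked after the "IPI" */
};

/* Thread `cpu` leaves isolation: park every sibling first. */
static void model_asi_exit_kick(struct model_core *core, int cpu)
{
	for (int i = 0; i < MODEL_SIBLINGS; i++)
		if (i != cpu)
			core->stalled[i] = true; /* stands in for the IPI */
	core->isolated[cpu] = false;
}

/* Thread `cpu` re-enters isolation: the siblings may resume. */
static void model_asi_enter_release(struct model_core *core, int cpu)
{
	core->isolated[cpu] = true;
	for (int i = 0; i < MODEL_SIBLINGS; i++)
		if (i != cpu)
			core->stalled[i] = false;
}

/* Number of hardware threads currently making progress. */
static int model_running_threads(const struct model_core *core)
{
	int n = 0;

	for (int i = 0; i < MODEL_SIBLINGS; i++)
		if (!core->stalled[i])
			n++;
	return n;
}
```

While one thread runs outside isolation the model reports a single running
thread, which matches the "basically run with a single thread" description
given earlier in the thread.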

alex.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 11:56     ` Alexandre Chartre
@ 2019-07-12 12:50       ` Peter Zijlstra
  2019-07-12 13:43         ` Alexandre Chartre
                           ` (2 more replies)
  2019-07-12 16:00       ` Thomas Gleixner
  1 sibling, 3 replies; 68+ messages in thread
From: Peter Zijlstra @ 2019-07-12 12:50 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Thomas Gleixner, Dave Hansen, pbonzini, rkrcmar, mingo, bp, hpa,
	dave.hansen, luto, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt, Paul Turner

On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:

> I think that's precisely what makes ASI and PTI different and independent.
> PTI is just about switching between userland and kernel page-tables, while
> ASI is about switching page-table inside the kernel. You can have ASI without
> having PTI. You can also use ASI for kernel threads so for code that won't
> be triggered from userland and so which won't involve PTI.

PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).

See how very similar they are?

Furthermore, to recover SMT for userspace (under MDS) we not only need
core-scheduling but core-scheduling per address space. And ASI was
specifically designed to help mitigate the trainwreck just described.

By explicitly exposing (hopefully harmless) part of the kernel to MDS,
we reduce the part that needs core-scheduling and thus reduce the rate
the SMT siblings need to sync up/schedule.

But looking at it that way, it makes no sense to retain 3 address
spaces, namely:

  user / kernel exposed / kernel private.

Specifically, it makes no sense to expose part of the kernel through MDS
but not through Meltdown. Therefore we can merge the user and kernel
exposed address spaces.

And then we've fully replaced PTI.

So no, they're not orthogonal.
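
As an illustration only (not code from the patch series), the merge argued
for above can be modelled with a toy bitmask: one "exposed" page table that
maps user space plus the harmless part of the kernel, and one full kernel
page table. All names here (REGION_*, EXPOSED_TABLE, FULL_TABLE,
needs_cr3_switch) are hypothetical and exist only for this sketch.

```c
#include <stdbool.h>

/* Hypothetical regions, following the "user / kernel exposed / kernel
 * private" split named in the message above. */
enum region {
	REGION_USER           = 1u << 0,
	REGION_KERNEL_EXPOSED = 1u << 1,
	REGION_KERNEL_PRIVATE = 1u << 2,
};

/* After merging user and kernel-exposed, only two page tables remain. */
static const unsigned int EXPOSED_TABLE =
	REGION_USER | REGION_KERNEL_EXPOSED;
static const unsigned int FULL_TABLE =
	REGION_USER | REGION_KERNEL_EXPOSED | REGION_KERNEL_PRIVATE;

/* True if the given page table maps the region. */
static bool is_mapped(unsigned int table, enum region r)
{
	return (table & r) != 0;
}

/* A CR3 switch is needed only when the target region is not mapped in
 * the page table currently in use. */
static bool needs_cr3_switch(unsigned int current_table, enum region target)
{
	return !is_mapped(current_table, target);
}
```

In this model, code running on the exposed table can touch the exposed part
of the kernel without a switch, and only kernel-private accesses force a
transition to the full table.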


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 12:47       ` Alexandre Chartre
@ 2019-07-12 13:07         ` Peter Zijlstra
  2019-07-12 13:46           ` Alexandre Chartre
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2019-07-12 13:07 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto, kvm,
	x86, linux-mm, linux-kernel, konrad.wilk, jan.setjeeilers,
	liran.alon, jwadams, graf, rppt, Paul Turner

On Fri, Jul 12, 2019 at 02:47:23PM +0200, Alexandre Chartre wrote:
> On 7/12/19 2:36 PM, Peter Zijlstra wrote:
> > On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre wrote:
> > > On 7/12/19 1:44 PM, Peter Zijlstra wrote:
> > 
> > > > AFAIK this wants/needs to be combined with core-scheduling to be
> > > > useful, but not a single mention of that is anywhere.
> > > 
> > > No. This is actually an alternative to core-scheduling. Eventually, ASI
> > > will kick all sibling hyperthreads when exiting isolation and it needs to
> > > run with the full kernel page-table (note that's currently not in these
> > > patches).
> > > 
> > > So ASI can be seen as an optimization to disabling hyperthreading: instead
> > > of just disabling hyperthreading you run with ASI, and when ASI can't preserve
> > > isolation you will basically run with a single thread.
> > 
> > You can't do that without much of the scheduler changes present in the
> > core-scheduling patches.
> > 
> 
> We hope we can do that without the whole core-scheduling mechanism. The idea
> is to send an IPI to all sibling hyperthreads. This IPI will interrupt these
> sibling hyperthreads and have them wait for a condition that will allow them
> to resume execution (for example when re-entering isolation). We are
> investigating this in parallel to ASI.

You cannot wait from IPI context, so you have to go somewhere else to
wait.

Also, consider what happens when the task that entered isolation decides
to schedule out / gets migrated.

I think you'll quickly find yourself back at core-scheduling.
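
To make the exchange above concrete, here is a toy, single-threaded model of
the "kick the siblings and have them wait" idea. It is not kernel code and
not what the patches do: the names (core_state, asi_exit_kick_siblings,
asi_reenter, sibling_may_run) and the 2-way SMT assumption are invented for
the sketch. Following the objection about IPI context, the stand-in for the
IPI only sets a flag, and the sibling tests that flag later at a point where
it is allowed to wait.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SIBLINGS_PER_CORE 2	/* assumption: 2-way SMT */

/* Hypothetical per-core state shared by the sibling hyperthreads. */
struct core_state {
	atomic_bool unsafe;			/* a thread runs with the full kernel page-table */
	atomic_bool kicked[SIBLINGS_PER_CORE];	/* set by the (simulated) IPI */
};

/* Exiting isolation: mark the core unsafe and flag every other sibling.
 * Setting the flag stands in for sending the IPI; nothing waits here. */
static void asi_exit_kick_siblings(struct core_state *core, int self)
{
	atomic_store(&core->unsafe, true);
	for (int cpu = 0; cpu < SIBLINGS_PER_CORE; cpu++)
		if (cpu != self)
			atomic_store(&core->kicked[cpu], true);
}

/* Re-entering isolation: the condition the siblings wait for. */
static void asi_reenter(struct core_state *core)
{
	atomic_store(&core->unsafe, false);
}

/* Checked by a sibling outside IPI context: a kicked sibling may resume
 * only once the core is safe again, and then clears its own flag. */
static bool sibling_may_run(struct core_state *core, int cpu)
{
	if (!atomic_load(&core->kicked[cpu]))
		return true;
	if (atomic_load(&core->unsafe))
		return false;
	atomic_store(&core->kicked[cpu], false);
	return true;
}
```

The model deliberately ignores the case raised above where the task that
entered isolation schedules out or migrates; handling that is where the
scheduler has to get involved.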


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 12:50       ` Peter Zijlstra
@ 2019-07-12 13:43         ` Alexandre Chartre
  2019-07-12 13:58           ` Dave Hansen
  2019-07-12 14:36           ` Andy Lutomirski
  2019-07-12 13:54         ` Dave Hansen
  2019-07-12 15:16         ` Thomas Gleixner
  2 siblings, 2 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-12 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Dave Hansen, pbonzini, rkrcmar, mingo, bp, hpa,
	dave.hansen, luto, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt, Paul Turner


On 7/12/19 2:50 PM, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> 
>> I think that's precisely what makes ASI and PTI different and independent.
>> PTI is just about switching between userland and kernel page-tables, while
>> ASI is about switching page-table inside the kernel. You can have ASI without
>> having PTI. You can also use ASI for kernel threads so for code that won't
>> be triggered from userland and so which won't involve PTI.
> 
> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> 
> See how very similar they are?
>
> 
> Furthermore, to recover SMT for userspace (under MDS) we not only need
> core-scheduling but core-scheduling per address space. And ASI was
> specifically designed to help mitigate the trainwreck just described.
> 
> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
> we reduce the part that needs core-scheduling and thus reduce the rate
> the SMT siblings need to sync up/schedule.
> 
> But looking at it that way, it makes no sense to retain 3 address
> spaces, namely:
> 
>    user / kernel exposed / kernel private.
> 
> Specifically, it makes no sense to expose part of the kernel through MDS
> but not through Meltdown. Therefore we can merge the user and kernel
> exposed address spaces.

The goal of ASI is to provide a reduced address space which excludes sensitive
data. A user process (for example a database daemon, a web server, or a vmm
like qemu) will likely have sensitive data mapped in its user address space.
Such data shouldn't be mapped with ASI because it can potentially leak to the
sibling hyperthread. For example, if a hyperthread is running a VM then the
VM could potentially access sensitive user data if it is mapped on the
sibling hyperthread with ASI.

The current approach is assuming that anything in the user address space
can be sensitive, and so the user address space shouldn't be mapped in ASI.

It looks like what you are suggesting could be an optimization when creating
an ASI for a process which has no sensitive data (this could be an option to
specify when creating an ASI, for example).

alex.

> 
> And then we've fully replaced PTI.
> 
> So no, they're not orthogonal.
> 


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 13:07         ` Peter Zijlstra
@ 2019-07-12 13:46           ` Alexandre Chartre
  2019-07-31 16:31             ` Dario Faggioli
  0 siblings, 1 reply; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-12 13:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto, kvm,
	x86, linux-mm, linux-kernel, konrad.wilk, jan.setjeeilers,
	liran.alon, jwadams, graf, rppt, Paul Turner


On 7/12/19 3:07 PM, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 02:47:23PM +0200, Alexandre Chartre wrote:
>> On 7/12/19 2:36 PM, Peter Zijlstra wrote:
>>> On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre wrote:
>>>> On 7/12/19 1:44 PM, Peter Zijlstra wrote:
>>>
>>>>> AFAIK this wants/needs to be combined with core-scheduling to be
>>>>> useful, but not a single mention of that is anywhere.
>>>>
>>>> No. This is actually an alternative to core-scheduling. Eventually, ASI
>>>> will kick all sibling hyperthreads when exiting isolation and it needs to
>>>> run with the full kernel page-table (note that's currently not in these
>>>> patches).
>>>>
>>>> So ASI can be seen as an optimization to disabling hyperthreading: instead
>>>> of just disabling hyperthreading you run with ASI, and when ASI can't preserve
>>>> isolation you will basically run with a single thread.
>>>
>>> You can't do that without much of the scheduler changes present in the
>>> core-scheduling patches.
>>>
>>
>> We hope we can do that without the whole core-scheduling mechanism. The idea
>> is to send an IPI to all sibling hyperthreads. This IPI will interrupt these
>> sibling hyperthreads and have them wait for a condition that will allow them
>> to resume execution (for example when re-entering isolation). We are
>> investigating this in parallel to ASI.
> 
> You cannot wait from IPI context, so you have to go somewhere else to
> wait.
> 
> Also, consider what happens when the task that entered isolation decides
> to schedule out / gets migrated.
> 
> I think you'll quickly find yourself back at core-scheduling.
> 

I haven't looked at details about what has been done so far. Hopefully, we
can do something not too complex, or reuse a (small) part of co-scheduling.

Thanks for pointing this out.

alex.


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12  8:09   ` Alexandre Chartre
@ 2019-07-12 13:51     ` Dave Hansen
  2019-07-12 14:06       ` Alexandre Chartre
  0 siblings, 1 reply; 68+ messages in thread
From: Dave Hansen @ 2019-07-12 13:51 UTC (permalink / raw)
  To: Alexandre Chartre, pbonzini, rkrcmar, tglx, mingo, bp, hpa,
	dave.hansen, luto, peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt

On 7/12/19 1:09 AM, Alexandre Chartre wrote:
> On 7/12/19 12:38 AM, Dave Hansen wrote:
>> I don't see the per-cpu areas in here.  But, the ASI macros in
>> entry_64.S (and asi_start_abort()) use per-cpu data.
> 
> We don't map all per-cpu areas, but only the per-cpu variables we need. ASI
> code uses the per-cpu cpu_asi_session variable which is mapped when an ASI
> is created (see patch 15/26):

No fair!  I had per-cpu variables just for PTI at some point and had to
give them up! ;)

> +    /*
> +     * Map the percpu ASI sessions. This is used by interrupt handlers
> +     * to figure out if we have entered isolation and switch back to
> +     * the kernel address space.
> +     */
> +    err = ASI_MAP_CPUVAR(asi, cpu_asi_session);
> +    if (err)
> +        return err;
> 
> 
>> Also, this stuff seems to do naughty stuff (calling C code, touching
>> per-cpu data) before the PTI CR3 writes have been done.  But, I don't
>> see anything excluding PTI and this code from coexisting.
> 
> My understanding is that PTI CR3 writes only happen when switching to/from
> userland. While ASI enter/exit/abort happens while we are already in the
> kernel, so asi_start_abort() is not called when coming from userland and so
> is not interacting with PTI.

OK, that makes sense.  You only need to call C code when interrupted
from something in the kernel (deeper than the entry code), and those
were already running kernel C code anyway.

If this continues to live in the entry code, I think you have a good
clue where to start commenting.

BTW, the PTI CR3 writes are not *strictly* about the interrupt coming
from user vs. kernel.  It's tricky because there's a window both in the
entry and exit code where you are in the kernel but have a userspace CR3
value.  You end up needing a CR3 write when you have a userspace CR3
value when the interrupt occurred, not only when you interrupt userspace
itself.
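
A small model of that distinction, as a sketch rather than the real entry
code: the decision has to look at the CR3 value that was live when the
interrupt hit, not at the privilege level that was interrupted. The struct,
the function names and the bit position used to tag a userspace page table
are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: one CR3 bit distinguishes the userspace
 * page table from the kernel one. */
#define USER_PGTABLE_BIT	12
#define USER_PGTABLE_MASK	(UINT64_C(1) << USER_PGTABLE_BIT)

/* Hypothetical snapshot of the interrupted context. */
struct entry_ctx {
	bool     from_user_mode;	/* interrupted code ran at CPL 3 */
	uint64_t cr3;			/* CR3 value when the interrupt hit */
};

/* Too simple: misses the entry/exit windows where the kernel is
 * running but CR3 still (or already) holds the userspace value. */
static bool naive_needs_cr3_write(const struct entry_ctx *ctx)
{
	return ctx->from_user_mode;
}

/* Decide from the CR3 value itself. */
static bool needs_cr3_write(const struct entry_ctx *ctx)
{
	return (ctx->cr3 & USER_PGTABLE_MASK) != 0;
}
```

For a context interrupted in kernel mode during one of those windows, the
naive check says no write is needed while the CR3-based check says one is.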


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 12:50       ` Peter Zijlstra
  2019-07-12 13:43         ` Alexandre Chartre
@ 2019-07-12 13:54         ` Dave Hansen
  2019-07-12 15:20           ` Peter Zijlstra
  2019-07-12 15:16         ` Thomas Gleixner
  2 siblings, 1 reply; 68+ messages in thread
From: Dave Hansen @ 2019-07-12 13:54 UTC (permalink / raw)
  To: Peter Zijlstra, Alexandre Chartre
  Cc: Thomas Gleixner, pbonzini, rkrcmar, mingo, bp, hpa, dave.hansen,
	luto, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt, Paul Turner

On 7/12/19 5:50 AM, Peter Zijlstra wrote:
> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> 
> See how very similar they are?

That's an interesting point.

I'd add that PTI maps a part of kernel space that partially overlaps
with what ASI wants.

> But looking at it that way, it makes no sense to retain 3 address
> spaces, namely:
> 
>   user / kernel exposed / kernel private.
> 
> Specifically, it makes no sense to expose part of the kernel through MDS
> but not through Meltdown. Therefore we can merge the user and kernel
> exposed address spaces.
> 
> And then we've fully replaced PTI.

So, in one address space (PTI/user or ASI), we say, "screw it" and all
the data mapped is exposed to speculation attacks.  We have to be very
careful about what we map and expose here.

The other (full kernel) address space we are more careful about what we
*do* instead of what we map.  We map everything but have to add
mitigations to ensure that we don't leak anything back to the exposed
address space.

So, maybe we're not replacing PTI as much as we're growing PTI so that
we can run more kernel code with the (now inappropriately named) user
page tables.


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 13:43         ` Alexandre Chartre
@ 2019-07-12 13:58           ` Dave Hansen
  2019-07-12 14:36           ` Andy Lutomirski
  1 sibling, 0 replies; 68+ messages in thread
From: Dave Hansen @ 2019-07-12 13:58 UTC (permalink / raw)
  To: Alexandre Chartre, Peter Zijlstra
  Cc: Thomas Gleixner, pbonzini, rkrcmar, mingo, bp, hpa, dave.hansen,
	luto, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt, Paul Turner

On 7/12/19 6:43 AM, Alexandre Chartre wrote:
> The current approach is assuming that anything in the user address space
> can be sensitive, and so the user address space shouldn't be mapped in ASI.

Is this universally true?

There's certainly *some* mitigation provided by SMAP that would allow
userspace to remain mapped and still protected.


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 13:51     ` Dave Hansen
@ 2019-07-12 14:06       ` Alexandre Chartre
  2019-07-12 15:23         ` Thomas Gleixner
  0 siblings, 1 reply; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-12 14:06 UTC (permalink / raw)
  To: Dave Hansen, pbonzini, rkrcmar, tglx, mingo, bp, hpa,
	dave.hansen, luto, peterz, kvm, x86, linux-mm, linux-kernel
  Cc: konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt


On 7/12/19 3:51 PM, Dave Hansen wrote:
> On 7/12/19 1:09 AM, Alexandre Chartre wrote:
>> On 7/12/19 12:38 AM, Dave Hansen wrote:
>>> I don't see the per-cpu areas in here.  But, the ASI macros in
>>> entry_64.S (and asi_start_abort()) use per-cpu data.
>>
>> We don't map all per-cpu areas, but only the per-cpu variables we need. ASI
>> code uses the per-cpu cpu_asi_session variable which is mapped when an ASI
>> is created (see patch 15/26):
> 
> No fair!  I had per-cpu variables just for PTI at some point and had to
> give them up! ;)
> 
>> +    /*
>> +     * Map the percpu ASI sessions. This is used by interrupt handlers
>> +     * to figure out if we have entered isolation and switch back to
>> +     * the kernel address space.
>> +     */
>> +    err = ASI_MAP_CPUVAR(asi, cpu_asi_session);
>> +    if (err)
>> +        return err;
>>
>>
>>> Also, this stuff seems to do naughty stuff (calling C code, touching
>>> per-cpu data) before the PTI CR3 writes have been done.  But, I don't
>>> see anything excluding PTI and this code from coexisting.
>>
>> My understanding is that PTI CR3 writes only happen when switching to/from
>> userland. While ASI enter/exit/abort happens while we are already in the
>> kernel, so asi_start_abort() is not called when coming from userland and so
>> is not interacting with PTI.
> 
> OK, that makes sense.  You only need to call C code when interrupted
> from something in the kernel (deeper than the entry code), and those
> were already running kernel C code anyway.
> 

Exactly.

> If this continues to live in the entry code, I think you have a good
> clue where to start commenting.

Yeah, lot of writing to do... :-)
  
> BTW, the PTI CR3 writes are not *strictly* about the interrupt coming
> from user vs. kernel.  It's tricky because there's a window both in the
> entry and exit code where you are in the kernel but have a userspace CR3
> value.  You end up needing a CR3 write when you have a userspace CR3
> value when the interrupt occurred, not only when you interrupt userspace
> itself.
> 

Right. ASI is simpler because it comes from the kernel and returns to the
kernel. There's just a small window (on entry) where we have the ASI CR3
but we quickly switch to the full kernel CR3.

alex.


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 13:43         ` Alexandre Chartre
  2019-07-12 13:58           ` Dave Hansen
@ 2019-07-12 14:36           ` Andy Lutomirski
  2019-07-14 18:17             ` Alexander Graf
  1 sibling, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2019-07-12 14:36 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Peter Zijlstra, Thomas Gleixner, Dave Hansen, Paolo Bonzini,
	Radim Krcmar, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, Andrew Lutomirski, kvm list, X86 ML, Linux-MM, LKML,
	Konrad Rzeszutek Wilk, jan.setjeeilers, Liran Alon,
	Jonathan Adams, Alexander Graf, Mike Rapoport, Paul Turner

On Fri, Jul 12, 2019 at 6:45 AM Alexandre Chartre
<alexandre.chartre@oracle.com> wrote:
>
>
> On 7/12/19 2:50 PM, Peter Zijlstra wrote:
> > On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> >
> >> I think that's precisely what makes ASI and PTI different and independent.
> >> PTI is just about switching between userland and kernel page-tables, while
> >> ASI is about switching page-table inside the kernel. You can have ASI without
> >> having PTI. You can also use ASI for kernel threads so for code that won't
> >> be triggered from userland and so which won't involve PTI.
> >
> > PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
> > ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> >
> > See how very similar they are?
> >
> >
> > Furthermore, to recover SMT for userspace (under MDS) we not only need
> > core-scheduling but core-scheduling per address space. And ASI was
> > specifically designed to help mitigate the trainwreck just described.
> >
> > By explicitly exposing (hopefully harmless) part of the kernel to MDS,
> > we reduce the part that needs core-scheduling and thus reduce the rate
> > the SMT siblings need to sync up/schedule.
> >
> > But looking at it that way, it makes no sense to retain 3 address
> > spaces, namely:
> >
> >    user / kernel exposed / kernel private.
> >
> > Specifically, it makes no sense to expose part of the kernel through MDS
> > but not through Meltdown. Therefore we can merge the user and kernel
> > exposed address spaces.
>
> The goal of ASI is to provide a reduced address space which excludes sensitive
> data. A user process (for example a database daemon, a web server, or a vmm
> like qemu) will likely have sensitive data mapped in its user address space.
> Such data shouldn't be mapped with ASI because it can potentially leak to the
> sibling hyperthread. For example, if a hyperthread is running a VM then the
> VM could potentially access sensitive user data if it is mapped on the
> sibling hyperthread with ASI.

So I've proposed the following slightly hackish thing:

Add a mechanism (call it /dev/xpfo).  When you open /dev/xpfo and
fallocate it to some size, you allocate that amount of memory and kick
it out of the kernel direct map.  (And pay the IPI cost unless there
were already cached non-direct-mapped pages ready.)  Then you map
*that* into your VMs.  Now, for a dedicated VM host, you map *all* the
VM private memory from /dev/xpfo.  Pretend it's SEV if you want to
determine which pages can be set up like this.
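
From the VMM side, the sequence described above might look roughly like the
following. This is purely a sketch: /dev/xpfo is only a proposal and does
not exist, the helper name map_guest_private_memory is made up, and
posix_fallocate() stands in for the fallocate call. Kicking the pages out of
the kernel direct map would happen inside the (hypothetical) driver and is
not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the backing node (imagined to be /dev/xpfo),
 * size it, and map it so it can be handed to a VM as guest memory.
 * Returns NULL on any failure.
 */
static void *map_guest_private_memory(const char *path, size_t size)
{
	int fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;

	/* Stand-in for "fallocate it to some size". */
	if (posix_fallocate(fd, 0, (off_t)size) != 0) {
		close(fd);
		return NULL;
	}

	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the fd is closed */

	return mem == MAP_FAILED ? NULL : mem;
}
```

The path is a parameter so the same sequence can be tried against an
ordinary file; only the imagined driver would give it the direct-map
semantics described above.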

Does this get enough of the benefit at a negligible fraction of the
code complexity cost?  (This plus core scheduling, anyway.)

--Andy


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 12:50       ` Peter Zijlstra
  2019-07-12 13:43         ` Alexandre Chartre
  2019-07-12 13:54         ` Dave Hansen
@ 2019-07-12 15:16         ` Thomas Gleixner
  2019-07-12 16:37           ` Alexandre Chartre
  2 siblings, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2019-07-12 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexandre Chartre, Dave Hansen, pbonzini, rkrcmar, mingo, bp,
	hpa, dave.hansen, luto, kvm, x86, linux-mm, linux-kernel,
	konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	Paul Turner

On Fri, 12 Jul 2019, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> 
> > I think that's precisely what makes ASI and PTI different and independent.
> > PTI is just about switching between userland and kernel page-tables, while
> > ASI is about switching page-table inside the kernel. You can have ASI without
> > having PTI. You can also use ASI for kernel threads so for code that won't
> > be triggered from userland and so which won't involve PTI.
> 
> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> 
> See how very similar they are?
> 
> Furthermore, to recover SMT for userspace (under MDS) we not only need
> core-scheduling but core-scheduling per address space. And ASI was
> specifically designed to help mitigate the trainwreck just described.
> 
> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
> we reduce the part that needs core-scheduling and thus reduce the rate
> the SMT siblings need to sync up/schedule.
> 
> But looking at it that way, it makes no sense to retain 3 address
> spaces, namely:
> 
>   user / kernel exposed / kernel private.
> 
> Specifically, it makes no sense to expose part of the kernel through MDS
> but not through Meltdown. Therefore we can merge the user and kernel
> exposed address spaces.
> 
> And then we've fully replaced PTI.
> 
> So no, they're not orthogonal.

Right. If we decide to expose more parts of the kernel mappings then that's
just adding more stuff to the existing user (PTI) map mechanics.

As a consequence the CR3 switching points become different or can be
consolidated and that can be handled right at those switching points
depending on static keys or alternatives as we do today with PTI and other
mitigations.

All of that can do without that obscure "state machine" which is solely
there to duct-tape the complete lack of design. The same applies to that
mapping thing. Just mapping randomly selected parts by sticking them into
an array is a non-maintainable approach. This needs proper separation of
text and data sections, so violations of the mapping constraints can be
statically analyzed. Depending solely on the page fault at run time for
analysis is just bound to lead to hard to diagnose failures in the field.

TBH we all know already that this can be done and that this will solve some
of the issues caused by the speculation mess, so just writing some hastily
cobbled together POC code which explodes just by looking at it, does not
lead to anything else than time waste on all ends.

This first needs a clear definition of protection scope. That scope clearly
defines the required mappings and consequently the transition requirements
which provide the necessary transition points for flipping CR3.

If we have agreed on that, then we can think about the implementation
details.

Thanks,

	tglx


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 13:54         ` Dave Hansen
@ 2019-07-12 15:20           ` Peter Zijlstra
  0 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2019-07-12 15:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alexandre Chartre, Thomas Gleixner, pbonzini, rkrcmar, mingo, bp,
	hpa, dave.hansen, luto, kvm, x86, linux-mm, linux-kernel,
	konrad.wilk, jan.setjeeilers, liran.alon, jwadams, graf, rppt,
	Paul Turner

On Fri, Jul 12, 2019 at 06:54:22AM -0700, Dave Hansen wrote:
> On 7/12/19 5:50 AM, Peter Zijlstra wrote:
> > PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
> > ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> > 
> > See how very similar they are?
> 
> That's an interesting point.
> 
> I'd add that PTI maps a part of kernel space that partially overlaps
> with what ASI wants.

Right, wherever we put the boundary, we need whatever is required to
cross it.

> > But looking at it that way, it makes no sense to retain 3 address
> > spaces, namely:
> > 
> >   user / kernel exposed / kernel private.
> > 
> > Specifically, it makes no sense to expose part of the kernel through MDS
> > but not through Meltdown. Therefore we can merge the user and kernel
> > exposed address spaces.
> > 
> > And then we've fully replaced PTI.
> 
> So, in one address space (PTI/user or ASI), we say, "screw it" and all
> the data mapped is exposed to speculation attacks.  We have to be very
> careful about what we map and expose here.

Yes, which is why, in an earlier email, I've asked for a clear
definition of 'sensitive' :-)

> So, maybe we're not replacing PTI as much as we're growing PTI so that
> we can run more kernel code with the (now inappropriately named) user
> page tables.

Right.


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 14:06       ` Alexandre Chartre
@ 2019-07-12 15:23         ` Thomas Gleixner
  0 siblings, 0 replies; 68+ messages in thread
From: Thomas Gleixner @ 2019-07-12 15:23 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Dave Hansen, pbonzini, rkrcmar, mingo, bp, hpa, dave.hansen,
	luto, peterz, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt

On Fri, 12 Jul 2019, Alexandre Chartre wrote:
> On 7/12/19 3:51 PM, Dave Hansen wrote:
> > BTW, the PTI CR3 writes are not *strictly* about the interrupt coming
> > from user vs. kernel.  It's tricky because there's a window both in the
> > entry and exit code where you are in the kernel but have a userspace CR3
> > value.  You end up needing a CR3 write when you have a userspace CR3
> > value when the interrupt occurred, not only when you interrupt userspace
> > itself.
> > 
> 
> Right. ASI is simpler because it comes from the kernel and returns to the
> kernel. There's just a small window (on entry) where we have the ASI CR3
> but we quickly switch to the full kernel CR3.

That's wrong in several aspects.

   1) You are looking at it purely from the VMM perspective, which is bogus
      because, as you already said, this can/should be extended to other
      scenarios (including kvm ioctl or such).

      So no, it's not just coming from kernel space and returning to it.

      If that were true then the entry code could just stay as is because
      you could handle _ALL_ of that trivially in the atomic VMM
      enter/exit code.

   2) It does not matter how small that window is. If there is a window
      then this needs to be covered, no matter what.

Thanks,

	tglx


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 11:56     ` Alexandre Chartre
  2019-07-12 12:50       ` Peter Zijlstra
@ 2019-07-12 16:00       ` Thomas Gleixner
  1 sibling, 0 replies; 68+ messages in thread
From: Thomas Gleixner @ 2019-07-12 16:00 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Dave Hansen, pbonzini, rkrcmar, mingo, bp, hpa, dave.hansen,
	luto, peterz, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt

On Fri, 12 Jul 2019, Alexandre Chartre wrote:
> On 7/12/19 12:44 PM, Thomas Gleixner wrote:
> > That ASI thing is just PTI on steroids.
> > 
> > So why do we need two versions of the same thing? That's absolutely bonkers
> > and will just introduce subtle bugs and conflicting decisions all over the
> > place.
> > 
> > The need for ASI is very tightly coupled to the need for PTI and there is
> > absolutely no point in keeping them separate.
> > 
> > The only difference vs. interrupts and exceptions is that the PTI logic
> > cares whether they enter from user or from kernel space while ASI only
> > cares about the kernel entry.
> 
> I think that's precisely what makes ASI and PTI different and independent.
> PTI is just about switching between userland and kernel page-tables, while
> ASI is about switching page-table inside the kernel. You can have ASI without
> having PTI. You can also use ASI for kernel threads so for code that won't
> be triggered from userland and so which won't involve PTI.

It's still the same concept. You can argue in circles, but it does not
justify yet another mapping setup which is a different copy of some other
mapping setup. Whether PTI is replaced by ASI or PTI is extended to handle
ASI does not matter at all. Having two similar concepts side by side is a
guarantee for disaster.

> > So why do you want to treat that differently? There is absolutely zero
> > reason to do so. And there is no reason to create a pointlessly different
> > version of PTI which introduces yet another variant of a restricted page
> > table instead of just reusing and extending what's there already.
> > 
> 
> As I've tried to explain, to me PTI and ASI are different and independent.
> PTI manages switching between userland and kernel page-table, and ASI manages
> switching between kernel and a reduced-kernel page-table.

Again. It's the same concept and it does not matter what form of reduced
page tables you use. You always need transition points and in order to make
the transition points work you need reliably mapped bits and pieces.

Also Paul wants to use the same concept for user space so trivial system
calls can do w/o PTI. In some other thread you said yourself that this
could be extended to cover the kvm ioctl, which is clearly a return to user
space.

Are we then going to add another set of randomly sprinkled transition
points and yet another 'state machine' to duct-tape the fallout?

Definitely not going to happen.

Thanks,

	tglx



* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 15:16         ` Thomas Gleixner
@ 2019-07-12 16:37           ` Alexandre Chartre
  2019-07-12 16:45             ` Andy Lutomirski
                               ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-12 16:37 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: Dave Hansen, pbonzini, rkrcmar, mingo, bp, hpa, dave.hansen,
	luto, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt, Paul Turner



On 7/12/19 5:16 PM, Thomas Gleixner wrote:
> On Fri, 12 Jul 2019, Peter Zijlstra wrote:
>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>>
>>> I think that's precisely what makes ASI and PTI different and independent.
>>> PTI is just about switching between userland and kernel page-tables, while
>>> ASI is about switching page-table inside the kernel. You can have ASI without
>>> having PTI. You can also use ASI for kernel threads so for code that won't
>>> be triggered from userland and so which won't involve PTI.
>>
>> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
>> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
>>
>> See how very similar they are?
>>
>> Furthermore, to recover SMT for userspace (under MDS) we not only need
>> core-scheduling but core-scheduling per address space. And ASI was
>> specifically designed to help mitigate the trainwreck just described.
>>
>> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
>> we reduce the part that needs core-scheduling and thus reduce the rate
>> the SMT siblings need to sync up/schedule.
>>
>> But looking at it that way, it makes no sense to retain 3 address
>> spaces, namely:
>>
>>    user / kernel exposed / kernel private.
>>
>> Specifically, it makes no sense to expose part of the kernel through MDS
>> but not through Meltdown. Therefore we can merge the user and kernel
>> exposed address spaces.
>>
>> And then we've fully replaced PTI.
>>
>> So no, they're not orthogonal.
> 
> Right. If we decide to expose more parts of the kernel mappings then that's
> just adding more stuff to the existing user (PTI) map mechanics.
  

If we expose more parts of the kernel mapping by adding them to the existing
user (PTI) map, then we only control the mapping of kernel sensitive data but
we don't control user mapping (with ASI, we exclude all user mappings).

How would you control the mapping of userland sensitive data and exclude them
from the user map? Would you have the application explicitly identify sensitive
data (like Andy suggested with a /dev/xpfo device)?

Thanks,

alex.


> As a consequence the CR3 switching points become different or can be
> consolidated and that can be handled right at those switching points
> depending on static keys or alternatives as we do today with PTI and other
> mitigations.
> 
> All of that can do without that obscure "state machine" which is solely
> there to duct-tape the complete lack of design. The same applies to that
> mapping thing. Just mapping randomly selected parts by sticking them into
> an array is a non-maintainable approach. This needs proper separation of
> text and data sections, so violations of the mapping constraints can be
> statically analyzed. Depending solely on the page fault at run time for
> analysis is just bound to lead to hard to diagnose failures in the field.
> 
> TBH we all know already that this can be done and that this will solve some
> of the issues caused by the speculation mess, so just writing some hastily
> cobbled together POC code which explodes just by looking at it, does not
> lead to anything else than time waste on all ends.
> 
> This first needs a clear definition of protection scope. That scope clearly
> defines the required mappings and consequently the transition requirements
> which provide the necessary transition points for flipping CR3.
> 
> If we have agreed on that, then we can think about the implementation
> details.
> 
> Thanks,
> 
> 	tglx
> 


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 16:37           ` Alexandre Chartre
@ 2019-07-12 16:45             ` Andy Lutomirski
  2019-07-14 17:11               ` Mike Rapoport
  2019-07-12 19:06             ` Peter Zijlstra
  2019-07-12 19:48             ` Thomas Gleixner
  2 siblings, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2019-07-12 16:45 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Thomas Gleixner, Peter Zijlstra, Dave Hansen, pbonzini, rkrcmar,
	mingo, bp, hpa, dave.hansen, luto, kvm, x86, linux-mm,
	linux-kernel, konrad.wilk, jan.setjeeilers, liran.alon, jwadams,
	graf, rppt, Paul Turner



> On Jul 12, 2019, at 10:37 AM, Alexandre Chartre <alexandre.chartre@oracle.com> wrote:
> 
> 
> 
>> On 7/12/19 5:16 PM, Thomas Gleixner wrote:
>>> On Fri, 12 Jul 2019, Peter Zijlstra wrote:
>>>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>>>> 
>>>> I think that's precisely what makes ASI and PTI different and independent.
>>>> PTI is just about switching between userland and kernel page-tables, while
>>>> ASI is about switching page-table inside the kernel. You can have ASI without
>>>> having PTI. You can also use ASI for kernel threads so for code that won't
>>>> be triggered from userland and so which won't involve PTI.
>>> 
>>> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
>>> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
>>> 
>>> See how very similar they are?
>>> 
>>> Furthermore, to recover SMT for userspace (under MDS) we not only need
>>> core-scheduling but core-scheduling per address space. And ASI was
>>> specifically designed to help mitigate the trainwreck just described.
>>> 
>>> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
>>> we reduce the part that needs core-scheduling and thus reduce the rate
>>> the SMT siblings need to sync up/schedule.
>>> 
>>> But looking at it that way, it makes no sense to retain 3 address
>>> spaces, namely:
>>> 
>>>   user / kernel exposed / kernel private.
>>> 
>>> Specifically, it makes no sense to expose part of the kernel through MDS
>>> but not through Meltdown. Therefore we can merge the user and kernel
>>> exposed address spaces.
>>> 
>>> And then we've fully replaced PTI.
>>> 
>>> So no, they're not orthogonal.
>> Right. If we decide to expose more parts of the kernel mappings then that's
>> just adding more stuff to the existing user (PTI) map mechanics.
> 
> If we expose more parts of the kernel mapping by adding them to the existing
> user (PTI) map, then we only control the mapping of kernel sensitive data but
> we don't control user mapping (with ASI, we exclude all user mappings).
> 
> How would you control the mapping of userland sensitive data and exclude them
> from the user map?

As I see it, if we think part of the kernel is okay to leak to VM guests, then we should think it’s okay to leak to userspace and vice versa. At the end of the day, this may just have to come down to an administrator’s choice of how careful the mitigations need to be.

> Would you have the application explicitly identify sensitive
> data (like Andy suggested with a /dev/xpfo device)?

That’s not really the intent of my suggestion. I was suggesting that maybe we don’t need ASI at all if we allow VMs to exclude their memory from the kernel mapping entirely.  Heck, in a setup like this, we can maybe even get away with turning PTI off under very, very controlled circumstances.  I’m not quite sure what to do about the kernel random pools, though.


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 16:37           ` Alexandre Chartre
  2019-07-12 16:45             ` Andy Lutomirski
@ 2019-07-12 19:06             ` Peter Zijlstra
  2019-07-14 15:06               ` Andy Lutomirski
  2019-07-12 19:48             ` Thomas Gleixner
  2 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2019-07-12 19:06 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Thomas Gleixner, Dave Hansen, pbonzini, rkrcmar, mingo, bp, hpa,
	dave.hansen, luto, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt, Paul Turner

On Fri, Jul 12, 2019 at 06:37:47PM +0200, Alexandre Chartre wrote:
> On 7/12/19 5:16 PM, Thomas Gleixner wrote:

> > Right. If we decide to expose more parts of the kernel mappings then that's
> > just adding more stuff to the existing user (PTI) map mechanics.
> 
> If we expose more parts of the kernel mapping by adding them to the existing
> user (PTI) map, then we only control the mapping of kernel sensitive data but
> we don't control user mapping (with ASI, we exclude all user mappings).
> 
> How would you control the mapping of userland sensitive data and exclude them
> from the user map? Would you have the application explicitly identify sensitive
> data (like Andy suggested with a /dev/xpfo device)?

To what purpose do you want to exclude userspace from the kernel
mapping; that is, what are you mitigating against with that?


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 16:37           ` Alexandre Chartre
  2019-07-12 16:45             ` Andy Lutomirski
  2019-07-12 19:06             ` Peter Zijlstra
@ 2019-07-12 19:48             ` Thomas Gleixner
  2019-07-15  8:23               ` Alexandre Chartre
  2 siblings, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2019-07-12 19:48 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Peter Zijlstra, Dave Hansen, pbonzini, rkrcmar, mingo, bp, hpa,
	dave.hansen, luto, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt, Paul Turner

On Fri, 12 Jul 2019, Alexandre Chartre wrote:
> On 7/12/19 5:16 PM, Thomas Gleixner wrote:
> > On Fri, 12 Jul 2019, Peter Zijlstra wrote:
> > > On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> > > And then we've fully replaced PTI.
> > > 
> > > So no, they're not orthogonal.
> > 
> > Right. If we decide to expose more parts of the kernel mappings then that's
> > just adding more stuff to the existing user (PTI) map mechanics.
>  
> If we expose more parts of the kernel mapping by adding them to the existing
> user (PTI) map, then we only control the mapping of kernel sensitive data but
> we don't control user mapping (with ASI, we exclude all user mappings).

What prevents you from adding functionality to do so to the PTI
implementation? Nothing.

Again, the underlying concept is exactly the same:

  1) Create a restricted mapping from an existing mapping

  2) Switch to the restricted mapping when entering a particular execution
     context

  3) Switch to the unrestricted mapping when leaving that execution context

  4) Keep track of the state

The restriction scope is different, but that's conceptually completely
irrelevant. It's a detail which needs to be handled at the implementation
level.

What matters here is the concept and because the concept is the same, this
needs to share the infrastructure for #1 - #4.
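
As a rough illustration of #1 - #4, here is a toy userspace model in Python.
It is not kernel code, and every name in it (RestrictedMapping,
MappingSwitcher, the region labels) is made up for this sketch. The point it
tries to show is that a PTI-like scope and an ASI-like scope can go through
the same create/enter/leave/track machinery and differ only in which regions
they keep:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestrictedMapping:
    """#1: a restricted mapping derived from an existing (full) mapping."""
    name: str
    regions: frozenset

    @classmethod
    def derive(cls, name, full, keep):
        # Only regions present in the full mapping can be carried over.
        missing = set(keep) - set(full)
        if missing:
            raise ValueError(f"not in the full mapping: {sorted(missing)}")
        return cls(name, frozenset(keep))


@dataclass
class MappingSwitcher:
    """#2 - #4: one switch/track mechanism shared by every restriction scope."""
    full: frozenset
    active: RestrictedMapping = None  # #4: None means the full mapping is live

    def enter(self, mapping):
        # #2: switch to the restricted mapping (stands in for a CR3 write)
        if self.active is not None:
            raise RuntimeError("already running on a restricted mapping")
        self.active = mapping

    def leave(self):
        # #3: switch back to the unrestricted mapping
        self.active = None

    def can_access(self, region):
        visible = self.full if self.active is None else self.active.regions
        return region in visible


# Region labels are illustrative only, not an actual kernel layout.
FULL = frozenset({"entry_text", "cpu_entry_area", "kvm_text", "vcpu_data",
                  "direct_map"})

# Same infrastructure, two different restriction scopes.
pti_like = RestrictedMapping.derive(
    "pti", FULL, {"entry_text", "cpu_entry_area"})
asi_like = RestrictedMapping.derive(
    "asi-kvm", FULL, {"entry_text", "cpu_entry_area", "kvm_text", "vcpu_data"})

cpu = MappingSwitcher(FULL)
cpu.enter(asi_like)
in_asi = (cpu.can_access("vcpu_data"), cpu.can_access("direct_map"))
cpu.leave()
after_leave = cpu.can_access("direct_map")
```

In this model the restriction scope is just the `keep` argument to
`derive()`; the switching and state tracking do not know or care which scope
is in use.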

It's obvious that this requires changes to the way PTI works today, but
anything which creates a parallel implementation of any part of the above
#1 - #4 is not going anywhere.

This stuff is way too sensitive and has pretty well understood limitations
and corner cases. So it needs to be designed from the ground up to handle these
properly. Which also means that the possible use cases are going to be
limited.

As I said before, come up with a list of possible usage scenarios and
protection scopes first and please take all the ideas other people have
with this into account. This includes PTI of course.

Once we have that we need to figure out whether these things can actually
coexist and do not contradict each other at the semantical level and
whether the outcome justifies the resulting complexity.

After that we can talk about implementation details.

This problem is not going to be solved with handwaving and an ad hoc
implementation which creates more problems than it solves.

Thanks,

	tglx


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 19:06             ` Peter Zijlstra
@ 2019-07-14 15:06               ` Andy Lutomirski
  2019-07-15 10:33                 ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2019-07-14 15:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexandre Chartre, Thomas Gleixner, Dave Hansen, Paolo Bonzini,
	Radim Krcmar, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, Andrew Lutomirski, kvm list, X86 ML, Linux-MM, LKML,
	Konrad Rzeszutek Wilk, jan.setjeeilers, Liran Alon,
	Jonathan Adams, Alexander Graf, Mike Rapoport, Paul Turner

On Fri, Jul 12, 2019 at 12:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 12, 2019 at 06:37:47PM +0200, Alexandre Chartre wrote:
> > On 7/12/19 5:16 PM, Thomas Gleixner wrote:
>
> > > Right. If we decide to expose more parts of the kernel mappings then that's
> > > just adding more stuff to the existing user (PTI) map mechanics.
> >
> > If we expose more parts of the kernel mapping by adding them to the existing
> > user (PTI) map, then we only control the mapping of kernel sensitive data but
> > we don't control user mapping (with ASI, we exclude all user mappings).
> >
> > How would you control the mapping of userland sensitive data and exclude them
> > from the user map? Would you have the application explicitly identify sensitive
> > data (like Andy suggested with a /dev/xpfo device)?
>
> To what purpose do you want to exclude userspace from the kernel
> mapping; that is, what are you mitigating against with that?

Mutually distrusting user/guest tenants.  Imagine an attack against a
VM hosting provider (GCE, for example).  If the overall system is
well-designed, the host kernel won't possess secrets that are
important to the overall hosting network.  The interesting secrets are
in the memory of other tenants running under the same host.  So, if we
can mostly or completely avoid mapping one tenant's memory in the
host, we reduce the amount of valuable information that could leak via
a speculation (or wild read) attack to another tenant.

The practicality of such a scheme is obviously an open question.
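
To make the argument concrete, a toy model in Python (purely illustrative;
the page numbers, tenant names and the leakable_pages() helper are invented
for this sketch, and it assumes the host's own pages hold nothing valuable,
as described above):

```python
def leakable_pages(host_mapped, owner, attacker):
    """Pages mapped in the host that belong to a tenant other than attacker.

    Host-owned pages are ignored, on the assumption stated above that a
    well-designed host holds no secrets that matter.
    """
    return {page for page in host_mapped
            if owner[page] not in (attacker, "host")}


# Page number -> owner (hypothetical layout).
owner = {0: "host", 1: "tenant_a", 2: "tenant_a", 3: "tenant_b"}

# Today: everything is in the host's mapping.
all_mapped = set(owner)

# Proposed: tenant A's memory is kept out of the host's mapping.
without_tenant_a = {page for page in all_mapped if owner[page] != "tenant_a"}

exposed_before = leakable_pages(all_mapped, owner, attacker="tenant_b")
exposed_after = leakable_pages(without_tenant_a, owner, attacker="tenant_b")
```

The model says nothing about how expensive it is to keep those pages
unmapped, which is the open question.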


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 16:45             ` Andy Lutomirski
@ 2019-07-14 17:11               ` Mike Rapoport
  0 siblings, 0 replies; 68+ messages in thread
From: Mike Rapoport @ 2019-07-14 17:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexandre Chartre, Thomas Gleixner, Peter Zijlstra, Dave Hansen,
	pbonzini, rkrcmar, mingo, bp, hpa, dave.hansen, luto, kvm, x86,
	linux-mm, linux-kernel, konrad.wilk, jan.setjeeilers, liran.alon,
	jwadams, graf, rppt, Paul Turner

On Fri, Jul 12, 2019 at 10:45:06AM -0600, Andy Lutomirski wrote:
> 
> 
> > On Jul 12, 2019, at 10:37 AM, Alexandre Chartre <alexandre.chartre@oracle.com> wrote:
> > 
> > 
> > 
> >> On 7/12/19 5:16 PM, Thomas Gleixner wrote:
> >>> On Fri, 12 Jul 2019, Peter Zijlstra wrote:
> >>>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> >>>> 
> >>>> I think that's precisely what makes ASI and PTI different and independent.
> >>>> PTI is just about switching between userland and kernel page-tables, while
> >>>> ASI is about switching page-table inside the kernel. You can have ASI without
> >>>> having PTI. You can also use ASI for kernel threads so for code that won't
> >>>> be triggered from userland and so which won't involve PTI.
> >>> 
> >>> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
> >>> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> >>> 
> >>> See how very similar they are?
> >>> 
> >>> Furthermore, to recover SMT for userspace (under MDS) we not only need
> >>> core-scheduling but core-scheduling per address space. And ASI was
> >>> specifically designed to help mitigate the trainwreck just described.
> >>> 
> >>> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
> >>> we reduce the part that needs core-scheduling and thus reduce the rate
> >>> the SMT siblings need to sync up/schedule.
> >>> 
> >>> But looking at it that way, it makes no sense to retain 3 address
> >>> spaces, namely:
> >>> 
> >>>   user / kernel exposed / kernel private.
> >>> 
> >>> Specifically, it makes no sense to expose part of the kernel through MDS
> >>> but not through Meltdown. Therefore we can merge the user and kernel
> >>> exposed address spaces.
> >>> 
> >>> And then we've fully replaced PTI.
> >>> 
> >>> So no, they're not orthogonal.
> >> Right. If we decide to expose more parts of the kernel mappings then that's
> >> just adding more stuff to the existing user (PTI) map mechanics.
> > 
> > If we expose more parts of the kernel mapping by adding them to the existing
> > user (PTI) map, then we only control the mapping of kernel sensitive data but
> > we don't control user mapping (with ASI, we exclude all user mappings).
> > 
> > How would you control the mapping of userland sensitive data and exclude them
> > from the user map?
> 
> As I see it, if we think part of the kernel is okay to leak to VM guests,
> then we should think it’s okay to leak to userspace and vice versa. At the end
> of the day, this may just have to come down to an administrator’s choice
> of how careful the mitigations need to be.
> 
> > Would you have the application explicitly identify sensitive
> > data (like Andy suggested with a /dev/xpfo device)?
> 
> That’s not really the intent of my suggestion. I was suggesting that
> maybe we don’t need ASI at all if we allow VMs to exclude their memory
> from the kernel mapping entirely.  Heck, in a setup like this, we can
> maybe even get away with turning PTI off under very, very controlled
> circumstances.  I’m not quite sure what to do about the kernel random
> pools, though.

I think KVM already allows excluding VMs memory from the kernel mapping
with the "new guest mapping interface" [1]. The memory managed by the host
can be restricted with "mem=" and KVM maps/unmaps the guest memory pages
only when needed.

It would be interesting to see if /dev/xpfo or even
madvise(MAKE_MY_MEMORY_PRIVATE) can be made useful for multi-tenant
container hosts.

[1] https://lore.kernel.org/lkml/1548966284-28642-1-git-send-email-karahmed@amazon.de/

-- 
Sincerely yours,
Mike.



* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 14:36           ` Andy Lutomirski
@ 2019-07-14 18:17             ` Alexander Graf
  0 siblings, 0 replies; 68+ messages in thread
From: Alexander Graf @ 2019-07-14 18:17 UTC (permalink / raw)
  To: Andy Lutomirski, Alexandre Chartre
  Cc: Peter Zijlstra, Thomas Gleixner, Dave Hansen, Paolo Bonzini,
	Radim Krcmar, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, kvm list, X86 ML, Linux-MM, LKML,
	Konrad Rzeszutek Wilk, jan.setjeeilers, Liran Alon,
	Jonathan Adams, Mike Rapoport, Paul Turner



On 12.07.19 16:36, Andy Lutomirski wrote:
> On Fri, Jul 12, 2019 at 6:45 AM Alexandre Chartre
> <alexandre.chartre@oracle.com> wrote:
>>
>>
>> On 7/12/19 2:50 PM, Peter Zijlstra wrote:
>>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>>>
>>>> I think that's precisely what makes ASI and PTI different and independent.
>>>> PTI is just about switching between userland and kernel page-tables, while
>>>> ASI is about switching page-table inside the kernel. You can have ASI without
>>>> having PTI. You can also use ASI for kernel threads so for code that won't
>>>> be triggered from userland and so which won't involve PTI.
>>>
>>> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
>>> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
>>>
>>> See how very similar they are?
>>>
>>>
>>> Furthermore, to recover SMT for userspace (under MDS) we not only need
>>> core-scheduling but core-scheduling per address space. And ASI was
>>> specifically designed to help mitigate the trainwreck just described.
>>>
>>> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
>>> we reduce the part that needs core-scheduling and thus reduce the rate
>>> the SMT siblings need to sync up/schedule.
>>>
>>> But looking at it that way, it makes no sense to retain 3 address
>>> spaces, namely:
>>>
>>>     user / kernel exposed / kernel private.
>>>
>>> Specifically, it makes no sense to expose part of the kernel through MDS
>>> but not through Meltdown. Therefore we can merge the user and kernel
>>> exposed address spaces.
>>
>> The goal of ASI is to provide a reduced address space which excludes sensitive
>> data. A user process (for example a database daemon, a web server, or a vmm
>> like qemu) will likely have sensitive data mapped in its user address space.
>> Such data shouldn't be mapped with ASI because it can potentially leak to the
>> sibling hyperthread. For example, if a hyperthread is running a VM then the
>> VM could potentially access user sensitive data if they are mapped on the
>> sibling hyperthread with ASI.
> 
> So I've proposed the following slightly hackish thing:
> 
> Add a mechanism (call it /dev/xpfo).  When you open /dev/xpfo and
> fallocate it to some size, you allocate that amount of memory and kick
> it out of the kernel direct map.  (And pay the IPI cost unless there
> were already cached non-direct-mapped pages ready.)  Then you map
> *that* into your VMs.  Now, for a dedicated VM host, you map *all* the
> VM private memory from /dev/xpfo.  Pretend it's SEV if you want to
> determine which pages can be set up like this.
> 
> Does this get enough of the benefit at a negligible fraction of the
> code complexity cost?  (This plus core scheduling, anyway.)

The problem with that approach is that you lose the ability to run 
legacy workloads that do not support an SEV like model of "guest owned" 
and "host visible" pages, but instead assume you can DMA anywhere.

Without that, your host will have visibility into guest pages via user 
space (QEMU) pages which again are mapped in the kernel direct map, so 
can be exposed via a spectre gadget into a malicious guest.

Also, please keep in mind that even register state of other VMs may be a 
secret that we do not want to leak into other guests.


Alex


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 19:48             ` Thomas Gleixner
@ 2019-07-15  8:23               ` Alexandre Chartre
  2019-07-15  8:28                 ` Thomas Gleixner
  0 siblings, 1 reply; 68+ messages in thread
From: Alexandre Chartre @ 2019-07-15  8:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Dave Hansen, pbonzini, rkrcmar, mingo, bp, hpa,
	dave.hansen, luto, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt, Paul Turner



On 7/12/19 9:48 PM, Thomas Gleixner wrote:
> On Fri, 12 Jul 2019, Alexandre Chartre wrote:
>> On 7/12/19 5:16 PM, Thomas Gleixner wrote:
>>> On Fri, 12 Jul 2019, Peter Zijlstra wrote:
>>>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>>>> And then we've fully replaced PTI.
>>>>
>>>> So no, they're not orthogonal.
>>>
>>> Right. If we decide to expose more parts of the kernel mappings then that's
>>> just adding more stuff to the existing user (PTI) map mechanics.
>>   
>> If we expose more parts of the kernel mapping by adding them to the existing
>> user (PTI) map, then we only control the mapping of kernel sensitive data but
>> we don't control user mapping (with ASI, we exclude all user mappings).
> 
> What prevents you from adding functionality to do so to the PTI
> implementation? Nothing.
> 
> Again, the underlying concept is exactly the same:
> 
>    1) Create a restricted mapping from an existing mapping
> 
>    2) Switch to the restricted mapping when entering a particular execution
>       context
> 
>    3) Switch to the unrestricted mapping when leaving that execution context
> 
>    4) Keep track of the state
> 
> The restriction scope is different, but that's conceptually completely
> irrelevant. It's a detail which needs to be handled at the implementation
> level.
> 
> What matters here is the concept and because the concept is the same, this
> needs to share the infrastructure for #1 - #4.
> 

You are totally right, that's the same concept (page-table creation and switching),
it is just used in different contexts. Sorry it took me that long to realize it,
I was too focused on the use case.


> It's obvious that this requires changes to the way PTI works today, but
> anything which creates a parallel implementation of any part of the above
> #1 - #4 is not going anywhere.
> 
> This stuff is way too sensitive and has pretty well understood limitations
> and corner cases. So it needs to be designed from the ground up to handle these
> properly. Which also means that the possible use cases are going to be
> limited.
>
> As I said before, come up with a list of possible usage scenarios and
> protection scopes first and please take all the ideas other people have
> with this into account. This includes PTI of course.
> 
> Once we have that we need to figure out whether these things can actually
> coexist and do not contradict each other at the semantical level and
> whether the outcome justifies the resulting complexity.
> 
> After that we can talk about implementation details.

Right, that makes perfect sense. I think so far we have the following scenarios:

  - PTI
  - KVM (i.e. VMExit handler isolation)
  - maybe some syscall isolation?

I will look at them in more detail, in particular which mappings they
need and when they need to switch mappings.


And thanks for putting me back on the right track.


alex.

> This problem is not going to be solved with handwaving and an ad hoc
> implementation which creates more problems than it solves.
> 
> Thanks,
> 
> 	tglx
> 


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-15  8:23               ` Alexandre Chartre
@ 2019-07-15  8:28                 ` Thomas Gleixner
  0 siblings, 0 replies; 68+ messages in thread
From: Thomas Gleixner @ 2019-07-15  8:28 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Peter Zijlstra, Dave Hansen, pbonzini, rkrcmar, mingo, bp, hpa,
	dave.hansen, luto, kvm, x86, linux-mm, linux-kernel, konrad.wilk,
	jan.setjeeilers, liran.alon, jwadams, graf, rppt, Paul Turner

Alexandre,

On Mon, 15 Jul 2019, Alexandre Chartre wrote:
> On 7/12/19 9:48 PM, Thomas Gleixner wrote:
> > As I said before, come up with a list of possible usage scenarios and
> > protection scopes first and please take all the ideas other people have
> > with this into account. This includes PTI of course.
> > 
> > Once we have that we need to figure out whether these things can actually
> > coexist and do not contradict each other at the semantical level and
> > whether the outcome justifies the resulting complexity.
> > 
> > After that we can talk about implementation details.
> 
> Right, that makes perfect sense. I think so far we have the following
> scenarios:
> 
>  - PTI
>  - KVM (i.e. VMExit handler isolation)
>  - maybe some syscall isolation?

Vs. the latter you want to talk to Paul Turner. He had some ideas there.

> I will look at them in more detail, in particular which
> mappings they need and when they need to switch mappings.
> 
> And thanks for putting me back on the right track.

That's what maintainers are for :)

Thanks,

	tglx


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-14 15:06               ` Andy Lutomirski
@ 2019-07-15 10:33                 ` Peter Zijlstra
  0 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2019-07-15 10:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexandre Chartre, Thomas Gleixner, Dave Hansen, Paolo Bonzini,
	Radim Krcmar, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, kvm list, X86 ML, Linux-MM, LKML,
	Konrad Rzeszutek Wilk, jan.setjeeilers, Liran Alon,
	Jonathan Adams, Alexander Graf, Mike Rapoport, Paul Turner

On Sun, Jul 14, 2019 at 08:06:12AM -0700, Andy Lutomirski wrote:
> On Fri, Jul 12, 2019 at 12:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Jul 12, 2019 at 06:37:47PM +0200, Alexandre Chartre wrote:
> > > On 7/12/19 5:16 PM, Thomas Gleixner wrote:
> >
> > > > Right. If we decide to expose more parts of the kernel mappings then that's
> > > > just adding more stuff to the existing user (PTI) map mechanics.
> > >
> > > If we expose more parts of the kernel mapping by adding them to the existing
> > > user (PTI) map, then we only control the mapping of kernel sensitive data but
> > > we don't control user mapping (with ASI, we exclude all user mappings).
> > >
> > > How would you control the mapping of userland sensitive data and exclude them
> > > from the user map? Would you have the application explicitly identify sensitive
> > > data (like Andy suggested with a /dev/xpfo device)?
> >
> > To what purpose do you want to exclude userspace from the kernel
> > mapping; that is, what are you mitigating against with that?
> 
> Mutually distrusting user/guest tenants.  Imagine an attack against a
> VM hosting provider (GCE, for example).  If the overall system is
> well-designed, the host kernel won't possess secrets that are
> important to the overall hosting network.  The interesting secrets are
> in the memory of other tenants running under the same host.  So, if we
> can mostly or completely avoid mapping one tenant's memory in the
> host, we reduce the amount of valuable information that could leak via
> a speculation (or wild read) attack to another tenant.
> 
> The practicality of such a scheme is obviously an open question.

Ah, ok. So it's some virt specific nonsense. I'll go on ignoring it then
;-)


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-12 13:46           ` Alexandre Chartre
@ 2019-07-31 16:31             ` Dario Faggioli
  2019-08-22 12:31               ` Alexandre Chartre
  0 siblings, 1 reply; 68+ messages in thread
From: Dario Faggioli @ 2019-07-31 16:31 UTC (permalink / raw)
  To: Alexandre Chartre, Peter Zijlstra
  Cc: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto, kvm,
	x86, linux-mm, linux-kernel, konrad.wilk, jan.setjeeilers,
	liran.alon, jwadams, graf, rppt, Paul Turner


Hello all,

I know this is a bit of an old thread, so apologies for being late to
the party. :-)

I would have a question about this:

> > > On 7/12/19 2:36 PM, Peter Zijlstra wrote:
> > > > On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre
> > > > wrote:
> > > > > On 7/12/19 1:44 PM, Peter Zijlstra wrote:
> > > > > > AFAIK this wants/needs to be combined with core-scheduling
> > > > > > to be
> > > > > > useful, but not a single mention of that is anywhere.
> > > > > 
> > > > > No. This is actually an alternative to core-scheduling.
> > > > > Eventually, ASI
> > > > > will kick all sibling hyperthreads when exiting isolation and
> > > > > it needs to
> > > > > run with the full kernel page-table (note that's currently
> > > > > not in these
> > > > > patches).
> 
I.e., about the fact that ASI is presented as an alternative to
core-scheduling or, at least, as something that will only need to integrate a small
subset of the logic (and of the code) from core-scheduling, as said
here:

> I haven't looked at details about what has been done so far.
> Hopefully, we
> can do something not too complex, or reuse a (small) part of co-
> scheduling.
> 
Now, sticking to virtualization examples, if you don't have core-
scheduling, it means that you can have two vcpus, one from VM A and the
other from VM B, running on the same core, one on thread 0 and the
other one on thread 1, at the same time.

And if VM A's vcpu, running on thread 0, exits, then VM B's vcpu
running in guest mode on thread 1 can read host memory, as it is
speculatively accessed (either "normally" or because of cache load
gadgets) and brought into L1D cache by thread 0. And indeed I do see how
ASI protects us from this attack scenario.

However, when the two VMs' vcpus are both running in guest mode, each
one on a thread of the same core, VM B's vcpu running on thread 1 can
exploit L1TF to peek at and steal secrets that VM A's vcpu, running on
thread 0, is accessing, as they're brought into L1D cache... can't it? 

How can ASI, *without* core-scheduling, prevent this other attack
scenario?

Because I may very well be missing something, but it looks to me that
it can't. In which case, I'm not sure we can call it an "alternative" to
core-scheduling... Or is the second attack scenario that I tried to
describe above not considered interesting?

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


* Re: [RFC v2 00/27] Kernel Address Space Isolation
  2019-07-31 16:31             ` Dario Faggioli
@ 2019-08-22 12:31               ` Alexandre Chartre
  0 siblings, 0 replies; 68+ messages in thread
From: Alexandre Chartre @ 2019-08-22 12:31 UTC (permalink / raw)
  To: dario.faggioli, Peter Zijlstra
  Cc: pbonzini, rkrcmar, tglx, mingo, bp, hpa, dave.hansen, luto, kvm,
	x86, linux-mm, linux-kernel, konrad.wilk, jan.setjeeilers,
	liran.alon, jwadams, graf, rppt, Paul Turner


On 7/31/19 6:31 PM, Dario Faggioli wrote:
> Hello all,
> 
> I know this is a bit of an old thread, so apologies for being late to
> the party. :-)

And sorry for the late reply, I was away for a while.

> I would have a question about this:
> 
>>>> On 7/12/19 2:36 PM, Peter Zijlstra wrote:
>>>>> On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre
>>>>> wrote:
>>>>>> On 7/12/19 1:44 PM, Peter Zijlstra wrote:
>>>>>>> AFAIK this wants/needs to be combined with core-scheduling
>>>>>>> to be
>>>>>>> useful, but not a single mention of that is anywhere.
>>>>>>
>>>>>> No. This is actually an alternative to core-scheduling.
>>>>>> Eventually, ASI
>>>>>> will kick all sibling hyperthreads when exiting isolation and
>>>>>> it needs to
>>>>>> run with the full kernel page-table (note that's currently
>>>>>> not in these
>>>>>> patches).
>>
> I.e., about the fact that ASI is presented as an alternative to
> core-scheduling or, at least, as it will only need to integrate a small
> subset of the logic (and of the code) from core-scheduling, as said
> here:
> 
>> I haven't looked at details about what has been done so far.
>> Hopefully, we
>> can do something not too complex, or reuse a (small) part of co-
>> scheduling.
>>
> Now, sticking to virtualization examples, if you don't have core-
> scheduling, it means that you can have two vcpus, one from VM A and the
> other from VM B, running on the same core, one on thread 0 and the
> other one on thread 1, at the same time.
> 
> And if VM A's vcpu, running on thread 0, exits, then VM B's vcpu
> running in guest mode on thread 1 can read host memory, as it is
> speculatively accessed (either "normally" or because of cache load
> gadgets) and brought into the L1D cache by thread 0. And indeed I do see
> how ASI protects us from this attack scenario.
> 
>
> However, when the two VMs' vcpus are both running in guest mode, each
> one on a thread of the same core, VM B's vcpu running on thread 1 can
> exploit L1TF to peek at and steal secrets that VM A's vcpu, running on
> thread 0, is accessing, as they're brought into L1D cache... can't it?
> 
> How can ASI, *without* core-scheduling, prevent this other attack
> scenario?
>
> Because I may very well be missing something, but it looks to me that
> it can't. In which case, I'm not sure we can call it an "alternative" to
> core-scheduling... Or is the second attack scenario that I tried to
> describe above not considered interesting?
> 

Correct, ASI doesn't prevent this attack scenario. However, this case can
be prevented by pinning each VM to different CPU cores (for example, using
cgroups) so that you never have two different VMs running on CPU threads
of the same CPU core. Of course, this limits the number of VMs you can
run to the number of CPU cores on the system, but we assume this is a
reasonable configuration when you want to have high-performing VMs.
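
As a rough illustration of the pinning idea above (not code from the patch
series), the sketch below groups logical CPUs into cores from their sibling
lists and hands each VM one whole core. The function names parse_cpu_list,
group_by_core and pin_vms_to_cores are hypothetical, and giving exactly one
core to each VM is a simplifying assumption. The sibling strings are assumed
to have the format of
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list.

```python
from typing import Dict, List


def parse_cpu_list(text: str) -> List[int]:
    """Parse a sysfs-style CPU list such as "0,4" or "0-1,8-9"."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def group_by_core(siblings: Dict[int, str]) -> List[List[int]]:
    """Group logical CPUs into cores.

    `siblings` maps a logical CPU number to its sibling-list text
    (assumed input; on a real host it would be read from sysfs).
    """
    cores = {tuple(sorted(parse_cpu_list(text))) for text in siblings.values()}
    return [list(core) for core in sorted(cores)]


def pin_vms_to_cores(vms: List[str], cores: List[List[int]]) -> Dict[str, List[int]]:
    """Give every VM one whole core so no two VMs share sibling threads.

    Hypothetical policy: one core per VM, which caps the number of VMs
    at the number of cores, as noted in the paragraph above.
    """
    if len(vms) > len(cores):
        raise ValueError("more VMs than cores: cannot give each VM its own core")
    return {vm: cores[i] for i, vm in enumerate(vms)}


if __name__ == "__main__":
    # Example topology: 2 cores, 2 hyper-threads each (made-up values).
    example = {0: "0,4", 1: "1,5", 4: "0,4", 5: "1,5"}
    print(pin_vms_to_cores(["vmA", "vmB"], group_by_core(example)))
```

The resulting CPU sets could then be applied with a cpuset cgroup or with
os.sched_setaffinity() on the vcpu threads; that step is left out here.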

Rgds,

alex.

Thread overview: 68+ messages
2019-07-11 14:25 [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 01/26] mm/x86: Introduce kernel address space isolation Alexandre Chartre
2019-07-11 21:33   ` Thomas Gleixner
2019-07-12  7:43     ` Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 02/26] mm/asi: Abort isolation on interrupt, exception and context switch Alexandre Chartre
2019-07-11 20:11   ` Andi Kleen
2019-07-11 20:17     ` Mike Rapoport
2019-07-11 20:41       ` Alexandre Chartre
2019-07-12  0:05   ` Andy Lutomirski
2019-07-12  7:50     ` Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 03/26] mm/asi: Handle page fault due to address space isolation Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 04/26] mm/asi: Functions to track buffers allocated for an ASI page-table Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 05/26] mm/asi: Add ASI page-table entry offset functions Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 06/26] mm/asi: Add ASI page-table entry allocation functions Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 07/26] mm/asi: Add ASI page-table entry set functions Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 08/26] mm/asi: Functions to populate an ASI page-table from a VA range Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 09/26] mm/asi: Helper functions to map module into ASI Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 10/26] mm/asi: Keep track of VA ranges mapped in ASI page-table Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 11/26] mm/asi: Functions to clear ASI page-table entries for a VA range Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 12/26] mm/asi: Function to copy page-table entries for percpu buffer Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 13/26] mm/asi: Add asi_remap() function Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 14/26] mm/asi: Handle ASI mapped range leaks and overlaps Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 15/26] mm/asi: Initialize the ASI page-table with core mappings Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 16/26] mm/asi: Option to map current task into ASI Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 17/26] rcu: Move tree.h static forward declarations to tree.c Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 18/26] rcu: Make percpu rcu_data non-static Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 19/26] mm/asi: Add option to map RCU data Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 20/26] mm/asi: Add option to map cpu_hw_events Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 21/26] mm/asi: Make functions to read cr3/cr4 ASI aware Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 22/26] KVM: x86/asi: Introduce address_space_isolation module parameter Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 23/26] KVM: x86/asi: Introduce KVM address space isolation Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 24/26] KVM: x86/asi: Populate the KVM ASI page-table Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 25/26] KVM: x86/asi: Switch to KVM address space on entry to guest Alexandre Chartre
2019-07-11 14:25 ` [RFC v2 26/26] KVM: x86/asi: Map KVM memslots and IO buses into KVM ASI Alexandre Chartre
2019-07-11 14:40 ` [RFC v2 00/27] Kernel Address Space Isolation Alexandre Chartre
2019-07-11 22:38 ` Dave Hansen
2019-07-12  8:09   ` Alexandre Chartre
2019-07-12 13:51     ` Dave Hansen
2019-07-12 14:06       ` Alexandre Chartre
2019-07-12 15:23         ` Thomas Gleixner
2019-07-12 10:44   ` Thomas Gleixner
2019-07-12 11:56     ` Alexandre Chartre
2019-07-12 12:50       ` Peter Zijlstra
2019-07-12 13:43         ` Alexandre Chartre
2019-07-12 13:58           ` Dave Hansen
2019-07-12 14:36           ` Andy Lutomirski
2019-07-14 18:17             ` Alexander Graf
2019-07-12 13:54         ` Dave Hansen
2019-07-12 15:20           ` Peter Zijlstra
2019-07-12 15:16         ` Thomas Gleixner
2019-07-12 16:37           ` Alexandre Chartre
2019-07-12 16:45             ` Andy Lutomirski
2019-07-14 17:11               ` Mike Rapoport
2019-07-12 19:06             ` Peter Zijlstra
2019-07-14 15:06               ` Andy Lutomirski
2019-07-15 10:33                 ` Peter Zijlstra
2019-07-12 19:48             ` Thomas Gleixner
2019-07-15  8:23               ` Alexandre Chartre
2019-07-15  8:28                 ` Thomas Gleixner
2019-07-12 16:00       ` Thomas Gleixner
2019-07-12 11:44 ` Peter Zijlstra
2019-07-12 12:17   ` Alexandre Chartre
2019-07-12 12:36     ` Peter Zijlstra
2019-07-12 12:47       ` Alexandre Chartre
2019-07-12 13:07         ` Peter Zijlstra
2019-07-12 13:46           ` Alexandre Chartre
2019-07-31 16:31             ` Dario Faggioli
2019-08-22 12:31               ` Alexandre Chartre