linux-kernel.vger.kernel.org archive mirror
* [RFC 00/16] KVM protected memory extension
@ 2020-05-22 12:51 Kirill A. Shutemov
  2020-05-22 12:51 ` [RFC 01/16] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
                   ` (18 more replies)
  0 siblings, 19 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:51 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

== Background / Problem ==

There are a number of hardware features (MKTME, SEV) which protect guest
memory from some unauthorized host access. The patchset proposes a purely
software feature that mitigates some of the same host-side read-only
attacks.


== What does this set mitigate? ==

 - Host kernel "accidental" access to guest data (think speculation)

 - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))

 - Host userspace access to guest data (compromised qemu)

== What does this set NOT mitigate? ==

 - Full host kernel compromise.  The kernel will just map the pages again.

 - Hardware attacks


The patchset is RFC-quality: it works but has known issues that must be
addressed before it can be considered for applying.

We are looking for high-level feedback on the concept.  Some open
questions:

 - This protects from some kernel and host userspace read-only attacks,
   but does not place the host kernel outside the trust boundary. Is it
   still valuable?

 - Can this approach be used to avoid cache-coherency problems with
   hardware encryption schemes that repurpose physical bits?

 - The guest kernel must be modified for this to work.  Is that a deal
   breaker, especially for public clouds?

 - Are the costs of removing pages from the direct map too high to be
   feasible?

== Series Overview ==

The hardware features protect guest data by encrypting it and then
ensuring that only the right guest can decrypt it.  This has the
side-effect of making the kernel direct map and userspace mapping
(QEMU et al) useless.  But this teaches us something very useful:
neither the kernel nor the userspace mappings are really necessary for
normal guest operations.

Instead of using encryption, this series simply unmaps the memory. One
advantage compared to allowing access to ciphertext is that it allows bad
accesses to be caught instead of simply reading garbage.

Protection from physical attacks needs to be provided by some other means.
On Intel platforms, (single-key) Total Memory Encryption (TME) provides
mitigation against physical attacks, such as DIMM interposers sniffing
memory bus traffic.

The patchset modifies both the host and the guest kernel. The guest OS must
enable the feature via a hypercall and mark any memory range that has to be
shared with the host: DMA regions, bounce buffers, etc. SEV does this marking
via a bit in the guest's page table, while this approach uses a hypercall.
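
To make this concrete, here is a minimal guest-side sketch of the hypercall
interface introduced later in the series (KVM_HC_MEM_SHARE/KVM_HC_MEM_UNSHARE
take a gfn and a page count). The share_guest_buffer()/unshare_guest_buffer()
wrappers are hypothetical; in the real patches the hypercalls are issued from
__set_memory_enc_dec():

    #include <linux/mm.h>
    #include <linux/kvm_para.h>     /* KVM_HC_MEM_SHARE/UNSHARE */
    #include <asm/kvm_para.h>       /* kvm_hypercall2() */

    /* Hypothetical helper: ask the host to treat @numpages pages as shared. */
    static int share_guest_buffer(void *vaddr, int numpages)
    {
            unsigned long gfn = __pa(vaddr) >> PAGE_SHIFT;

            return kvm_hypercall2(KVM_HC_MEM_SHARE, gfn, numpages);
    }

    /* Hypothetical helper: make the range private to the guest again. */
    static int unshare_guest_buffer(void *vaddr, int numpages)
    {
            unsigned long gfn = __pa(vaddr) >> PAGE_SHIFT;

            return kvm_hypercall2(KVM_HC_MEM_UNSHARE, gfn, numpages);
    }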

To remove the userspace mapping, we use a trick similar to what NUMA
balancing does: memory that belongs to KVM memory slots is converted to
PROT_NONE. All existing entries are converted with mprotect_fixup() and
newly faulted-in pages get PROT_NONE from the updated vm_page_prot.
The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
VMA must be treated in a special way in the GUP and fault paths. The flag
allows GUP to return the page even though it is mapped with PROT_NONE,
but only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace
access to the memory results in SIGBUS. Any GUP access without FOLL_KVM
results in -EFAULT.
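
For illustration, a host-side sketch of how such a page is reached despite
PROT_NONE. The grab_protected_page() helper is hypothetical; FOLL_KVM and the
GUP call mirror what the series does in copy_from/to_guest():

    #include <linux/mm.h>

    /* Hypothetical helper: pin one page of protected guest memory at @hva. */
    static struct page *grab_protected_page(unsigned long hva, bool write)
    {
            unsigned int flags = FOLL_KVM | (write ? FOLL_WRITE : 0);
            struct page *page;

            /*
             * Without FOLL_KVM this fails with -EFAULT because the VMA is
             * VM_KVM_PROTECTED and mapped PROT_NONE; a plain userspace
             * access to the same range gets SIGBUS.
             */
            if (get_user_pages_unlocked(hva, 1, &page, flags) != 1)
                    return NULL;

            return page;    /* the caller must put_page() when done */
    }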

Any anonymous page faulted into a VM_KVM_PROTECTED VMA gets removed from
the direct mapping with kernel_map_pages(). Note that kernel_map_pages()
only flushes the local TLB. I think it's a reasonable compromise between
security and performance.

Zapping the PTE brings the page back to the direct mapping after it is
cleared. At least for now, we don't remove file-backed pages from the
direct mapping: file-backed pages can be accessed via read/write syscalls,
which adds complexity.
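
Roughly, the direct-map side of a page's life cycle looks like the sketch
below. It assumes kernel_map_pages() is wired up for this purpose, as the
last patch in the series does; both helper names are hypothetical:

    #include <linux/mm.h>

    /* Hypothetical sketch: direct-map handling for one anonymous guest page. */
    static void protect_direct_map(struct page *page)
    {
            /* Unmap from the kernel direct mapping; only the local TLB is flushed. */
            kernel_map_pages(page, 1, 0);
    }

    static void unprotect_direct_map(struct page *page)
    {
            /* On PTE zap: put the page back into the direct mapping once cleared. */
            kernel_map_pages(page, 1, 1);
    }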

Occasionally, the host kernel has to access guest memory that was not made
shared by the guest; for instance, this happens for instruction emulation.
Normally it's done via copy_to/from_user(), which would now fail with
-EFAULT. We introduce a new pair of helpers: copy_to/from_guest(). The new
helpers acquire the page via GUP, map it into the kernel address space with
a kmap_atomic()-style mechanism and only then copy the data.
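
A usage sketch of the read side, assuming a struct kvm carrying the
mem_protected flag from this series. The read_guest_bytes() wrapper is
hypothetical; the same pattern is used by __kvm_read_guest_page() in the
patches below:

    #include <linux/kvm_host.h>
    #include <linux/uaccess.h>

    /* Hypothetical wrapper: read @len bytes of guest memory at host address @hva. */
    static int read_guest_bytes(struct kvm *kvm, unsigned long hva,
                                void *data, int len)
    {
            int r;

            if (kvm->mem_protected)
                    r = copy_from_guest(data, hva, len);    /* GUP + temporary map */
            else
                    r = __copy_from_user(data, (void __user *)hva, len);

            return r ? -EFAULT : 0;
    }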

For some instruction emulation copying is not good enough: cmpxchg
emulation has to have direct access to the guest memory. __kvm_map_gfn()
is modified to accommodate this case.

The patchset is on top of v5.7-rc6 plus this patch:

https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com

== Open Issues ==

Unmapping the pages from the direct mapping brings a few issues that have
not been rectified yet:

 - Touching the direct mapping leads to fragmentation. We need to be able to
   recover from it. I have a buggy patch that aims at recovering 2M/1G pages.
   It has to be fixed and tested properly.

 - Page migration and KSM are not supported yet.

 - Live migration of a guest would require a new flow. Not sure yet what it
   would look like.

 - The feature interferes with NUMA balancing. Not sure yet if it's
   possible to make them work together.

 - Guests have no mechanism to ensure that even a well-behaving host has
   unmapped its private data.  With SEV, for instance, the guest only has
   to trust the hardware to encrypt a page after the C bit is set in a
   guest PTE.  A mechanism for a guest to query the host mapping state, or
   to constantly assert the intent for a page to be Private would be
   valuable.

Kirill A. Shutemov (16):
  x86/mm: Move force_dma_unencrypted() to common code
  x86/kvm: Introduce KVM memory protection feature
  x86/kvm: Make DMA pages shared
  x86/kvm: Use bounce buffers for KVM memory protection
  x86/kvm: Make VirtIO use DMA API in KVM guest
  KVM: Use GUP instead of copy_from/to_user() to access guest memory
  KVM: mm: Introduce VM_KVM_PROTECTED
  KVM: x86: Use GUP for page walk instead of __get_user()
  KVM: Protected memory extension
  KVM: x86: Enabled protected memory extension
  KVM: Rework copy_to/from_guest() to avoid direct mapping
  x86/kvm: Share steal time page with host
  x86/kvmclock: Share hvclock memory with the host
  KVM: Introduce gfn_to_pfn_memslot_protected()
  KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
  KVM: Unmap protected pages from direct mapping

 arch/powerpc/kvm/book3s_64_mmu_hv.c    |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   2 +-
 arch/x86/Kconfig                       |  11 +-
 arch/x86/include/asm/io.h              |   6 +-
 arch/x86/include/asm/kvm_para.h        |   5 +
 arch/x86/include/asm/pgtable_types.h   |   1 +
 arch/x86/include/uapi/asm/kvm_para.h   |   3 +-
 arch/x86/kernel/kvm.c                  |  24 +-
 arch/x86/kernel/kvmclock.c             |   2 +-
 arch/x86/kernel/pci-swiotlb.c          |   3 +-
 arch/x86/kvm/cpuid.c                   |   3 +
 arch/x86/kvm/mmu/mmu.c                 |   6 +-
 arch/x86/kvm/mmu/paging_tmpl.h         |  10 +-
 arch/x86/kvm/x86.c                     |   9 +
 arch/x86/mm/Makefile                   |   2 +
 arch/x86/mm/ioremap.c                  |  16 +-
 arch/x86/mm/mem_encrypt.c              |  50 ----
 arch/x86/mm/mem_encrypt_common.c       |  62 ++++
 arch/x86/mm/pat/set_memory.c           |   8 +
 drivers/virtio/virtio_ring.c           |   4 +
 include/linux/kvm_host.h               |  14 +-
 include/linux/mm.h                     |  12 +
 include/uapi/linux/kvm_para.h          |   5 +-
 mm/gup.c                               |  20 +-
 mm/huge_memory.c                       |  29 +-
 mm/ksm.c                               |   3 +
 mm/memory.c                            |  16 +
 mm/mmap.c                              |   3 +
 mm/mprotect.c                          |   1 +
 mm/rmap.c                              |   4 +
 virt/kvm/async_pf.c                    |   4 +-
 virt/kvm/kvm_main.c                    | 390 +++++++++++++++++++++++--
 32 files changed, 627 insertions(+), 103 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

-- 
2.26.2



* [RFC 01/16] x86/mm: Move force_dma_unencrypted() to common code
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
@ 2020-05-22 12:51 ` Kirill A. Shutemov
  2020-05-22 12:52 ` [RFC 02/16] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:51 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

force_dma_unencrypted() has to return true for a KVM guest with memory
protection enabled. Move it out of the AMD SME code.

Introduce a new config option, X86_MEM_ENCRYPT_COMMON, that has to be
selected by all x86 memory encryption features.

This is preparation for the following patches.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                 |  8 +++++--
 arch/x86/include/asm/io.h        |  4 +++-
 arch/x86/mm/Makefile             |  2 ++
 arch/x86/mm/mem_encrypt.c        | 30 -------------------------
 arch/x86/mm/mem_encrypt_common.c | 38 ++++++++++++++++++++++++++++++++
 5 files changed, 49 insertions(+), 33 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2d3f963fd6f1..bc72bfd89bcf 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1518,12 +1518,16 @@ config X86_CPA_STATISTICS
 	  helps to determine the effectiveness of preserving large and huge
 	  page mappings when mapping protections are changed.
 
+config X86_MEM_ENCRYPT_COMMON
+	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select DYNAMIC_PHYSICAL_MASK
+	def_bool n
+
 config AMD_MEM_ENCRYPT
 	bool "AMD Secure Memory Encryption (SME) support"
 	depends on X86_64 && CPU_SUP_AMD
-	select DYNAMIC_PHYSICAL_MASK
 	select ARCH_USE_MEMREMAP_PROT
-	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select X86_MEM_ENCRYPT_COMMON
 	---help---
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index e1aa17a468a8..c58d52fd7bf2 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -256,10 +256,12 @@ static inline void slow_down_io(void)
 
 #endif
 
-#ifdef CONFIG_AMD_MEM_ENCRYPT
 #include <linux/jump_label.h>
 
 extern struct static_key_false sev_enable_key;
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
 static inline bool sev_key_active(void)
 {
 	return static_branch_unlikely(&sev_enable_key);
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 98f7c6fa2eaa..af8683c053a3 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -49,6 +49,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)	+= mem_encrypt_common.o
+
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index a03614bd3e1a..112304a706f3 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -15,10 +15,6 @@
 #include <linux/dma-direct.h>
 #include <linux/swiotlb.h>
 #include <linux/mem_encrypt.h>
-#include <linux/device.h>
-#include <linux/kernel.h>
-#include <linux/bitops.h>
-#include <linux/dma-mapping.h>
 
 #include <asm/tlbflush.h>
 #include <asm/fixmap.h>
@@ -350,32 +346,6 @@ bool sev_active(void)
 	return sme_me_mask && sev_enabled;
 }
 
-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
-{
-	/*
-	 * For SEV, all DMA must be to unencrypted addresses.
-	 */
-	if (sev_active())
-		return true;
-
-	/*
-	 * For SME, all DMA must be to unencrypted addresses if the
-	 * device does not support DMA to addresses that include the
-	 * encryption mask.
-	 */
-	if (sme_active()) {
-		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
-		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
-						dev->bus_dma_limit);
-
-		if (dma_dev_mask <= dma_enc_mask)
-			return true;
-	}
-
-	return false;
-}
-
 /* Architecture __weak replacement functions */
 void __init mem_encrypt_free_decrypted_mem(void)
 {
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..964e04152417
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/mem_encrypt.h>
+#include <linux/dma-mapping.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+	/*
+	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 */
+	if (sev_active())
+		return true;
+
+	/*
+	 * For SME, all DMA must be to unencrypted addresses if the
+	 * device does not support DMA to addresses that include the
+	 * encryption mask.
+	 */
+	if (sme_active()) {
+		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
+		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
+						dev->bus_dma_limit);
+
+		if (dma_dev_mask <= dma_enc_mask)
+			return true;
+	}
+
+	return false;
+}
-- 
2.26.2



* [RFC 02/16] x86/kvm: Introduce KVM memory protection feature
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
  2020-05-22 12:51 ` [RFC 01/16] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-25 14:58   ` Vitaly Kuznetsov
  2020-05-22 12:52 ` [RFC 03/16] x86/kvm: Make DMA pages shared Kirill A. Shutemov
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

Provide basic helpers, KVM_FEATURE and a hypercall.

Host side doesn't provide the feature yet, so it is dead code for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/kvm_para.h      |  5 +++++
 arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
 arch/x86/kernel/kvm.c                | 16 ++++++++++++++++
 include/uapi/linux/kvm_para.h        |  3 ++-
 4 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 9b4df6eaa11a..3ce84fc07144 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -10,11 +10,16 @@ extern void kvmclock_init(void);
 
 #ifdef CONFIG_KVM_GUEST
 bool kvm_check_and_clear_guest_paused(void);
+bool kvm_mem_protected(void);
 #else
 static inline bool kvm_check_and_clear_guest_paused(void)
 {
 	return false;
 }
+static inline bool kvm_mem_protected(void)
+{
+	return false;
+}
 #endif /* CONFIG_KVM_GUEST */
 
 #define KVM_HYPERCALL \
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 2a8e0b6b9805..c3b499acc98f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -28,9 +28,10 @@
 #define KVM_FEATURE_PV_UNHALT		7
 #define KVM_FEATURE_PV_TLB_FLUSH	9
 #define KVM_FEATURE_ASYNC_PF_VMEXIT	10
-#define KVM_FEATURE_PV_SEND_IPI	11
+#define KVM_FEATURE_PV_SEND_IPI		11
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
+#define KVM_FEATURE_MEM_PROTECTED	14
 
 #define KVM_HINTS_REALTIME      0
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 6efe0410fb72..bda761ca0d26 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -35,6 +35,13 @@
 #include <asm/tlb.h>
 #include <asm/cpuidle_haltpoll.h>
 
+static bool mem_protected;
+
+bool kvm_mem_protected(void)
+{
+	return mem_protected;
+}
+
 static int kvmapf = 1;
 
 static int __init parse_no_kvmapf(char *arg)
@@ -727,6 +734,15 @@ static void __init kvm_init_platform(void)
 {
 	kvmclock_init();
 	x86_platform.apic_post_init = kvm_apic_init;
+
+	if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
+		if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
+			pr_err("Failed to enable KVM memory protection\n");
+			return;
+		}
+
+		mem_protected = true;
+	}
 }
 
 const __initconst struct hypervisor_x86 x86_hyper_kvm = {
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 8b86609849b9..1a216f32e572 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -27,8 +27,9 @@
 #define KVM_HC_MIPS_EXIT_VM		7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
 #define KVM_HC_CLOCK_PAIRING		9
-#define KVM_HC_SEND_IPI		10
+#define KVM_HC_SEND_IPI			10
 #define KVM_HC_SCHED_YIELD		11
+#define KVM_HC_ENABLE_MEM_PROTECTED	12
 
 /*
  * hypercalls use architecture specific
-- 
2.26.2



* [RFC 03/16] x86/kvm: Make DMA pages shared
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
  2020-05-22 12:51 ` [RFC 01/16] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
  2020-05-22 12:52 ` [RFC 02/16] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-22 12:52 ` [RFC 04/16] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

Make force_dma_unencrypted() return true for KVM to get DMA pages mapped
as shared.

__set_memory_enc_dec() now informs the host via a hypercall if the state
of the page has changed from shared to private or back.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                 | 1 +
 arch/x86/mm/mem_encrypt_common.c | 5 +++--
 arch/x86/mm/pat/set_memory.c     | 7 +++++++
 include/uapi/linux/kvm_para.h    | 2 ++
 4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bc72bfd89bcf..86c012582f51 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -799,6 +799,7 @@ config KVM_GUEST
 	depends on PARAVIRT
 	select PARAVIRT_CLOCK
 	select ARCH_CPUIDLE_HALTPOLL
+	select X86_MEM_ENCRYPT_COMMON
 	default y
 	---help---
 	  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 964e04152417..a878e7f246d5 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,14 +10,15 @@
 #include <linux/mm.h>
 #include <linux/mem_encrypt.h>
 #include <linux/dma-mapping.h>
+#include <asm/kvm_para.h>
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
 {
 	/*
-	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 * For SEV and KVM, all DMA must be to unencrypted/shared addresses.
 	 */
-	if (sev_active())
+	if (sev_active() || kvm_mem_protected())
 		return true;
 
 	/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index b8c55a2e402d..6f075766bb94 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,7 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/kvm_para.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -1972,6 +1973,12 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	struct cpa_data cpa;
 	int ret;
 
+	if (kvm_mem_protected()) {
+		unsigned long gfn = __pa(addr) >> PAGE_SHIFT;
+		int call = enc ? KVM_HC_MEM_UNSHARE : KVM_HC_MEM_SHARE;
+		return kvm_hypercall2(call, gfn, numpages);
+	}
+
 	/* Nothing to do if memory encryption is not active */
 	if (!mem_encrypt_active())
 		return 0;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 1a216f32e572..c6d8c988e330 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -30,6 +30,8 @@
 #define KVM_HC_SEND_IPI			10
 #define KVM_HC_SCHED_YIELD		11
 #define KVM_HC_ENABLE_MEM_PROTECTED	12
+#define KVM_HC_MEM_SHARE		13
+#define KVM_HC_MEM_UNSHARE		14
 
 /*
  * hypercalls use architecture specific
-- 
2.26.2



* [RFC 04/16] x86/kvm: Use bounce buffers for KVM memory protection
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 03/16] x86/kvm: Make DMA pages shared Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-22 12:52 ` [RFC 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest Kirill A. Shutemov
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

Mirroring SEV, always use SWIOTLB if KVM memory protection is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/kernel/kvm.c            |  2 ++
 arch/x86/kernel/pci-swiotlb.c    |  3 ++-
 arch/x86/mm/mem_encrypt.c        | 20 --------------------
 arch/x86/mm/mem_encrypt_common.c | 23 +++++++++++++++++++++++
 5 files changed, 28 insertions(+), 21 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 86c012582f51..58dd44a1b92f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -800,6 +800,7 @@ config KVM_GUEST
 	select PARAVIRT_CLOCK
 	select ARCH_CPUIDLE_HALTPOLL
 	select X86_MEM_ENCRYPT_COMMON
+	select SWIOTLB
 	default y
 	---help---
 	  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index bda761ca0d26..f50d65df4412 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -24,6 +24,7 @@
 #include <linux/debugfs.h>
 #include <linux/nmi.h>
 #include <linux/swait.h>
+#include <linux/swiotlb.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -742,6 +743,7 @@ static void __init kvm_init_platform(void)
 		}
 
 		mem_protected = true;
+		swiotlb_force = SWIOTLB_FORCE;
 	}
 }
 
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..814060a6ceb0 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -13,6 +13,7 @@
 #include <asm/dma.h>
 #include <asm/xen/swiotlb-xen.h>
 #include <asm/iommu_table.h>
+#include <asm/kvm_para.h>
 
 int swiotlb __read_mostly;
 
@@ -49,7 +50,7 @@ int __init pci_swiotlb_detect_4gb(void)
 	 * buffers are allocated and used for devices that do not support
 	 * the addressing range required for the encryption mask.
 	 */
-	if (sme_active())
+	if (sme_active() || kvm_mem_protected())
 		swiotlb = 1;
 
 	return swiotlb;
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 112304a706f3..35c748ee3fcb 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -370,23 +370,3 @@ void __init mem_encrypt_free_decrypted_mem(void)
 
 	free_init_pages("unused decrypted", vaddr, vaddr_end);
 }
-
-void __init mem_encrypt_init(void)
-{
-	if (!sme_me_mask)
-		return;
-
-	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-	swiotlb_update_mem_attributes();
-
-	/*
-	 * With SEV, we need to unroll the rep string I/O instructions.
-	 */
-	if (sev_active())
-		static_branch_enable(&sev_enable_key);
-
-	pr_info("AMD %s active\n",
-		sev_active() ? "Secure Encrypted Virtualization (SEV)"
-			     : "Secure Memory Encryption (SME)");
-}
-
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index a878e7f246d5..7900f3788010 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -37,3 +37,26 @@ bool force_dma_unencrypted(struct device *dev)
 
 	return false;
 }
+
+void __init mem_encrypt_init(void)
+{
+	if (!sme_me_mask && !kvm_mem_protected())
+		return;
+
+	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
+	swiotlb_update_mem_attributes();
+
+	/*
+	 * With SEV, we need to unroll the rep string I/O instructions.
+	 */
+	if (sev_active())
+		static_branch_enable(&sev_enable_key);
+
+	if (sme_me_mask) {
+		pr_info("AMD %s active\n",
+			sev_active() ? "Secure Encrypted Virtualization (SEV)"
+			: "Secure Memory Encryption (SME)");
+	} else {
+		pr_info("KVM memory protection enabled\n");
+	}
+}
-- 
2.26.2



* [RFC 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 04/16] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-22 12:52 ` [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory Kirill A. Shutemov
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

VirtIO is the primary way to provide I/O for a KVM guest. All memory that
is used for communication with the host has to be marked as shared.

The easiest way to achieve that is to use the DMA API, which already knows
how to deal with shared memory.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/virtio/virtio_ring.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 58b96baa8d48..bd9c56160107 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -12,6 +12,7 @@
 #include <linux/hrtimer.h>
 #include <linux/dma-mapping.h>
 #include <xen/xen.h>
+#include <asm/kvm_para.h>
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -255,6 +256,9 @@ static bool vring_use_dma_api(struct virtio_device *vdev)
 	if (xen_domain())
 		return true;
 
+	if (kvm_mem_protected())
+		return true;
+
 	return false;
 }
 
-- 
2.26.2



* [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-25 15:08   ` Vitaly Kuznetsov
                     ` (2 more replies)
  2020-05-22 12:52 ` [RFC 07/16] KVM: mm: Introduce VM_KVM_PROTECTED Kirill A. Shutemov
                   ` (12 subsequent siblings)
  18 siblings, 3 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

Add new helpers, copy_from_guest()/copy_to_guest(), to be used if the KVM
memory protection feature is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/kvm_host.h |  4 +++
 virt/kvm/kvm_main.c      | 78 ++++++++++++++++++++++++++++++++++------
 2 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 131cc1527d68..bd0bb600f610 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -503,6 +503,7 @@ struct kvm {
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
+	bool mem_protected;
 };
 
 #define kvm_err(fmt, ...) \
@@ -727,6 +728,9 @@ void kvm_set_pfn_dirty(kvm_pfn_t pfn);
 void kvm_set_pfn_accessed(kvm_pfn_t pfn);
 void kvm_get_pfn(kvm_pfn_t pfn);
 
+int copy_from_guest(void *data, unsigned long hva, int len);
+int copy_to_guest(unsigned long hva, const void *data, int len);
+
 void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache);
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 			int len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 731c1e517716..033471f71dae 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2248,8 +2248,48 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
+int copy_from_guest(void *data, unsigned long hva, int len)
+{
+	int offset = offset_in_page(hva);
+	struct page *page;
+	int npages, seg;
+
+	while ((seg = next_segment(len, offset)) != 0) {
+		npages = get_user_pages_unlocked(hva, 1, &page, 0);
+		if (npages != 1)
+			return -EFAULT;
+		memcpy(data, page_address(page) + offset, seg);
+		put_page(page);
+		len -= seg;
+		hva += seg;
+		offset = 0;
+	}
+
+	return 0;
+}
+
+int copy_to_guest(unsigned long hva, const void *data, int len)
+{
+	int offset = offset_in_page(hva);
+	struct page *page;
+	int npages, seg;
+
+	while ((seg = next_segment(len, offset)) != 0) {
+		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);
+		if (npages != 1)
+			return -EFAULT;
+		memcpy(page_address(page) + offset, data, seg);
+		put_page(page);
+		len -= seg;
+		hva += seg;
+		offset = 0;
+	}
+	return 0;
+}
+
 static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
-				 void *data, int offset, int len)
+				 void *data, int offset, int len,
+				 bool protected)
 {
 	int r;
 	unsigned long addr;
@@ -2257,7 +2297,10 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
 	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
-	r = __copy_from_user(data, (void __user *)addr + offset, len);
+	if (protected)
+		r = copy_from_guest(data, addr + offset, len);
+	else
+		r = __copy_from_user(data, (void __user *)addr + offset, len);
 	if (r)
 		return -EFAULT;
 	return 0;
@@ -2268,7 +2311,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(slot, gfn, data, offset, len,
+				     kvm->mem_protected);
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_page);
 
@@ -2277,7 +2321,8 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(slot, gfn, data, offset, len,
+				     vcpu->kvm->mem_protected);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
 
@@ -2350,7 +2395,8 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
 static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
-			          const void *data, int offset, int len)
+			          const void *data, int offset, int len,
+				  bool protected)
 {
 	int r;
 	unsigned long addr;
@@ -2358,7 +2404,11 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
 	addr = gfn_to_hva_memslot(memslot, gfn);
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
-	r = __copy_to_user((void __user *)addr + offset, data, len);
+
+	if (protected)
+		r = copy_to_guest(addr + offset, data, len);
+	else
+		r = __copy_to_user((void __user *)addr + offset, data, len);
 	if (r)
 		return -EFAULT;
 	mark_page_dirty_in_slot(memslot, gfn);
@@ -2370,7 +2420,8 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(slot, gfn, data, offset, len,
+				      kvm->mem_protected);
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_page);
 
@@ -2379,7 +2430,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(slot, gfn, data, offset, len,
+				      vcpu->kvm->mem_protected);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
 
@@ -2495,7 +2547,10 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 	if (unlikely(!ghc->memslot))
 		return kvm_write_guest(kvm, gpa, data, len);
 
-	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
+	if (kvm->mem_protected)
+		r = copy_to_guest(ghc->hva + offset, data, len);
+	else
+		r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
 	if (r)
 		return -EFAULT;
 	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
@@ -2530,7 +2585,10 @@ int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 	if (unlikely(!ghc->memslot))
 		return kvm_read_guest(kvm, ghc->gpa, data, len);
 
-	r = __copy_from_user(data, (void __user *)ghc->hva, len);
+	if (kvm->mem_protected)
+		r = copy_from_guest(data, ghc->hva, len);
+	else
+		r = __copy_from_user(data, (void __user *)ghc->hva, len);
 	if (r)
 		return -EFAULT;
 
-- 
2.26.2



* [RFC 07/16] KVM: mm: Introduce VM_KVM_PROTECTED
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-26  6:15   ` Mike Rapoport
  2020-05-26  6:40   ` John Hubbard
  2020-05-22 12:52 ` [RFC 08/16] KVM: x86: Use GUP for page walk instead of __get_user() Kirill A. Shutemov
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

Add a new VMA flag that indicates a VMA that is not accessible to
userspace but usable by the kernel with GUP if FOLL_KVM is specified.

FOLL_KVM is only used in the KVM code. The code has to know how to deal
with such pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h  |  8 ++++++++
 mm/gup.c            | 20 ++++++++++++++++----
 mm/huge_memory.c    | 20 ++++++++++++++++----
 mm/memory.c         |  3 +++
 mm/mmap.c           |  3 +++
 virt/kvm/async_pf.c |  4 ++--
 virt/kvm/kvm_main.c |  9 +++++----
 7 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1882eec1752..4f7195365cc0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -329,6 +329,8 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MAPPED_COPY	VM_ARCH_1	/* T if mapped copy of data (nommu mmap) */
 #endif
 
+#define VM_KVM_PROTECTED 0
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
@@ -646,6 +648,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_ACCESS_FLAGS;
 }
 
+static inline bool vma_is_kvm_protected(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_KVM_PROTECTED;
+}
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
@@ -2773,6 +2780,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
+#define FOLL_KVM	0x80000 /* access to VM_KVM_PROTECTED VMAs */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 87a6a59fe667..bd7b9484b35a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -385,10 +385,19 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
  * FOLL_FORCE can write to even unwritable pte's, but only
  * after we've gone through a COW cycle and they are dirty.
  */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+static inline bool can_follow_write_pte(struct vm_area_struct *vma,
+					pte_t pte, unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	if (pte_write(pte))
+		return true;
+
+	if ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte))
+		return true;
+
+	if (!vma_is_kvm_protected(vma) || !(vma->vm_flags & VM_WRITE))
+		return false;
+
+	return (vma->vm_flags & VM_SHARED) || page_mapcount(pte_page(pte)) == 1;
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -431,7 +440,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
+	if ((flags & FOLL_WRITE) && !can_follow_write_pte(vma, pte, flags)) {
 		pte_unmap_unlock(ptep, ptl);
 		return NULL;
 	}
@@ -751,6 +760,9 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 
 	ctx->page_mask = 0;
 
+	if (vma_is_kvm_protected(vma) && (flags & FOLL_KVM))
+		flags &= ~FOLL_NUMA;
+
 	/* make this handle hugepd */
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6ecd1045113b..c3562648a4ef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1518,10 +1518,19 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
  * FOLL_FORCE can write to even unwritable pmd's, but only
  * after we've gone through a COW cycle and they are dirty.
  */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+static inline bool can_follow_write_pmd(struct vm_area_struct *vma,
+					pmd_t pmd, unsigned int flags)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	if (pmd_write(pmd))
+		return true;
+
+	if ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd))
+		return true;
+
+	if (!vma_is_kvm_protected(vma) || !(vma->vm_flags & VM_WRITE))
+		return false;
+
+	return (vma->vm_flags & VM_SHARED) || page_mapcount(pmd_page(pmd)) == 1;
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1534,7 +1543,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
+	if (flags & FOLL_WRITE && !can_follow_write_pmd(vma, *pmd, flags))
 		goto out;
 
 	/* Avoid dumping huge zero page */
@@ -1609,6 +1618,9 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	bool was_writable;
 	int flags = 0;
 
+	if (vma_is_kvm_protected(vma))
+		return VM_FAULT_SIGBUS;
+
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_same(pmd, *vmf->pmd)))
 		goto out_unlock;
diff --git a/mm/memory.c b/mm/memory.c
index f703fe8c8346..d7228db6e4bf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4013,6 +4013,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	bool was_writable = pte_savedwrite(vmf->orig_pte);
 	int flags = 0;
 
+	if (vma_is_kvm_protected(vma))
+		return VM_FAULT_SIGBUS;
+
 	/*
 	 * The "pte" at this point cannot be used safely without
 	 * validation through pte_unmap_same(). It's of NUMA type but
diff --git a/mm/mmap.c b/mm/mmap.c
index f609e9ec4a25..d56c3f6efc99 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -112,6 +112,9 @@ pgprot_t vm_get_page_prot(unsigned long vm_flags)
 				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
 			pgprot_val(arch_vm_get_page_prot(vm_flags)));
 
+	if (vm_flags & VM_KVM_PROTECTED)
+		ret = PAGE_NONE;
+
 	return arch_filter_pgprot(ret);
 }
 EXPORT_SYMBOL(vm_get_page_prot);
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 15e5b037f92d..7663e962510a 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -60,8 +60,8 @@ static void async_pf_execute(struct work_struct *work)
 	 * access remotely.
 	 */
 	down_read(&mm->mmap_sem);
-	get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
-			&locked);
+	get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE | FOLL_KVM, NULL,
+			      NULL, &locked);
 	if (locked)
 		up_read(&mm->mmap_sem);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 033471f71dae..530af95efdf3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1727,7 +1727,7 @@ unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *w
 
 static inline int check_user_page_hwpoison(unsigned long addr)
 {
-	int rc, flags = FOLL_HWPOISON | FOLL_WRITE;
+	int rc, flags = FOLL_HWPOISON | FOLL_WRITE | FOLL_KVM;
 
 	rc = get_user_pages(addr, 1, flags, NULL, NULL);
 	return rc == -EHWPOISON;
@@ -1771,7 +1771,7 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 			   bool *writable, kvm_pfn_t *pfn)
 {
-	unsigned int flags = FOLL_HWPOISON;
+	unsigned int flags = FOLL_HWPOISON | FOLL_KVM;
 	struct page *page;
 	int npages = 0;
 
@@ -2255,7 +2255,7 @@ int copy_from_guest(void *data, unsigned long hva, int len)
 	int npages, seg;
 
 	while ((seg = next_segment(len, offset)) != 0) {
-		npages = get_user_pages_unlocked(hva, 1, &page, 0);
+		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_KVM);
 		if (npages != 1)
 			return -EFAULT;
 		memcpy(data, page_address(page) + offset, seg);
@@ -2275,7 +2275,8 @@ int copy_to_guest(unsigned long hva, const void *data, int len)
 	int npages, seg;
 
 	while ((seg = next_segment(len, offset)) != 0) {
-		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);
+		npages = get_user_pages_unlocked(hva, 1, &page,
+						 FOLL_WRITE | FOLL_KVM);
 		if (npages != 1)
 			return -EFAULT;
 		memcpy(page_address(page) + offset, data, seg);
-- 
2.26.2



* [RFC 08/16] KVM: x86: Use GUP for page walk instead of __get_user()
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 07/16] KVM: mm: Introduce VM_KVM_PROTECTED Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-22 12:52 ` [RFC 09/16] KVM: Protected memory extension Kirill A. Shutemov
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

The user mapping doesn't have the page mapping for protected memory, so
__get_user() cannot be used for the guest page table walk. Use the
GUP-based copy_from_guest() instead.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kvm/mmu/paging_tmpl.h | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 9bdf9b7d9a96..ef0c5bc8ad7e 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -400,8 +400,14 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 			goto error;
 
 		ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
-		if (unlikely(__get_user(pte, ptep_user)))
-			goto error;
+		if (vcpu->kvm->mem_protected) {
+			if (copy_from_guest(&pte, host_addr + offset,
+					    sizeof(pte)))
+				goto error;
+		} else {
+			if (unlikely(__get_user(pte, ptep_user)))
+				goto error;
+		}
 		walker->ptep_user[walker->level - 1] = ptep_user;
 
 		trace_kvm_mmu_paging_element(pte, walker->level);
-- 
2.26.2



* [RFC 09/16] KVM: Protected memory extension
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 08/16] KVM: x86: Use GUP for page walk instead of __get_user() Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-25 15:26   ` Vitaly Kuznetsov
  2020-05-22 12:52 ` [RFC 10/16] KVM: x86: Enabled protected " Kirill A. Shutemov
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

Add infrastructure that handles the protected memory extension.

Arch-specific code has to provide hypercalls and define non-zero
VM_KVM_PROTECTED.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/kvm_host.h |   4 ++
 mm/mprotect.c            |   1 +
 virt/kvm/kvm_main.c      | 131 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 136 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bd0bb600f610..d7072f6d6aa0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -700,6 +700,10 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm);
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 				   struct kvm_memory_slot *slot);
 
+int kvm_protect_all_memory(struct kvm *kvm);
+int kvm_protect_memory(struct kvm *kvm,
+		       unsigned long gfn, unsigned long npages, bool protect);
+
 int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
 			    struct page **pages, int nr_pages);
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 494192ca954b..552be3b4c80a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -505,6 +505,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	vm_unacct_memory(charged);
 	return error;
 }
+EXPORT_SYMBOL_GPL(mprotect_fixup);
 
 /*
  * pkey==-1 when doing a legacy mprotect()
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 530af95efdf3..07d45da5d2aa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -155,6 +155,8 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
+static int protect_memory(unsigned long start, unsigned long end, bool protect);
+
 __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 		unsigned long start, unsigned long end, bool blockable)
 {
@@ -1309,6 +1311,14 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	if (r)
 		goto out_bitmap;
 
+	if (mem->memory_size && kvm->mem_protected) {
+		r = protect_memory(new.userspace_addr,
+				   new.userspace_addr + new.npages * PAGE_SIZE,
+				   true);
+		if (r)
+			goto out_bitmap;
+	}
+
 	if (old.dirty_bitmap && !new.dirty_bitmap)
 		kvm_destroy_dirty_bitmap(&old);
 	return 0;
@@ -2652,6 +2662,127 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
 
+static int protect_memory(unsigned long start, unsigned long end, bool protect)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma, *prev;
+	int ret;
+
+	if (down_write_killable(&mm->mmap_sem))
+		return -EINTR;
+
+	ret = -ENOMEM;
+	vma = find_vma(current->mm, start);
+	if (!vma)
+		goto out;
+
+	ret = -EINVAL;
+	if (vma->vm_start > start)
+		goto out;
+
+	if (start > vma->vm_start)
+		prev = vma;
+	else
+		prev = vma->vm_prev;
+
+	ret = 0;
+	while (true) {
+		unsigned long newflags, tmp;
+
+		tmp = vma->vm_end;
+		if (tmp > end)
+			tmp = end;
+
+		newflags = vma->vm_flags;
+		if (protect)
+			newflags |= VM_KVM_PROTECTED;
+		else
+			newflags &= ~VM_KVM_PROTECTED;
+
+		/* The VMA has been handled as part of other memslot */
+		if (newflags == vma->vm_flags)
+			goto next;
+
+		ret = mprotect_fixup(vma, &prev, start, tmp, newflags);
+		if (ret)
+			goto out;
+
+next:
+		start = tmp;
+		if (start < prev->vm_end)
+			start = prev->vm_end;
+
+		if (start >= end)
+			goto out;
+
+		vma = prev->vm_next;
+		if (!vma || vma->vm_start != start) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+out:
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
+int kvm_protect_memory(struct kvm *kvm,
+		       unsigned long gfn, unsigned long npages, bool protect)
+{
+	struct kvm_memory_slot *memslot;
+	unsigned long start, end;
+	gfn_t numpages;
+
+	if (!VM_KVM_PROTECTED)
+		return -KVM_ENOSYS;
+
+	if (!npages)
+		return 0;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	/* Not backed by memory. It's okay. */
+	if (!memslot)
+		return 0;
+
+	start = gfn_to_hva_many(memslot, gfn, &numpages);
+	end = start + npages * PAGE_SIZE;
+
+	/* XXX: Share range across memory slots? */
+	if (WARN_ON(numpages < npages))
+		return -EINVAL;
+
+	return protect_memory(start, end, protect);
+}
+EXPORT_SYMBOL_GPL(kvm_protect_memory);
+
+int kvm_protect_all_memory(struct kvm *kvm)
+{
+	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;
+	unsigned long start, end;
+	int i, ret = 0;
+
+	if (!VM_KVM_PROTECTED)
+		return -KVM_ENOSYS;
+
+	mutex_lock(&kvm->slots_lock);
+	kvm->mem_protected = true;
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot(memslot, slots) {
+			start = memslot->userspace_addr;
+			end = start + memslot->npages * PAGE_SIZE;
+			ret = protect_memory(start, end, true);
+			if (ret)
+				goto out;
+		}
+	}
+out:
+	mutex_unlock(&kvm->slots_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_protect_all_memory);
+
 void kvm_sigset_activate(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu->sigset_active)
-- 
2.26.2



* [RFC 10/16] KVM: x86: Enabled protected memory extension
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 09/16] KVM: Protected memory extension Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-25 15:26   ` Vitaly Kuznetsov
  2020-05-26  6:16   ` Mike Rapoport
  2020-05-22 12:52 ` [RFC 11/16] KVM: Rework copy_to/from_guest() to avoid direct mapping Kirill A. Shutemov
                   ` (8 subsequent siblings)
  18 siblings, 2 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

Wire up hypercalls for the feature and define VM_KVM_PROTECTED.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig     | 1 +
 arch/x86/kvm/cpuid.c | 3 +++
 arch/x86/kvm/x86.c   | 9 +++++++++
 include/linux/mm.h   | 4 ++++
 4 files changed, 17 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 58dd44a1b92f..420e3947f0c6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -801,6 +801,7 @@ config KVM_GUEST
 	select ARCH_CPUIDLE_HALTPOLL
 	select X86_MEM_ENCRYPT_COMMON
 	select SWIOTLB
+	select ARCH_USES_HIGH_VMA_FLAGS
 	default y
 	---help---
 	  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 901cd1fdecd9..94cc5e45467e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -714,6 +714,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			     (1 << KVM_FEATURE_POLL_CONTROL) |
 			     (1 << KVM_FEATURE_PV_SCHED_YIELD);
 
+		if (VM_KVM_PROTECTED)
+			entry->eax |=(1 << KVM_FEATURE_MEM_PROTECTED);
+
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c17e6eb9ad43..acba0ac07f61 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7598,6 +7598,15 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		kvm_sched_yield(vcpu->kvm, a0);
 		ret = 0;
 		break;
+	case KVM_HC_ENABLE_MEM_PROTECTED:
+		ret = kvm_protect_all_memory(vcpu->kvm);
+		break;
+	case KVM_HC_MEM_SHARE:
+		ret = kvm_protect_memory(vcpu->kvm, a0, a1, false);
+		break;
+	case KVM_HC_MEM_UNSHARE:
+		ret = kvm_protect_memory(vcpu->kvm, a0, a1, true);
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4f7195365cc0..6eb771c14968 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -329,7 +329,11 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MAPPED_COPY	VM_ARCH_1	/* T if mapped copy of data (nommu mmap) */
 #endif
 
+#if defined(CONFIG_X86_64) && defined(CONFIG_KVM)
+#define VM_KVM_PROTECTED VM_HIGH_ARCH_4
+#else
 #define VM_KVM_PROTECTED 0
+#endif
 
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
-- 
2.26.2



* [RFC 11/16] KVM: Rework copy_to/from_guest() to avoid direct mapping
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 10/16] KVM: x86: Enabled protected " Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-22 12:52 ` [RFC 12/16] x86/kvm: Share steal time page with host Kirill A. Shutemov
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

We are going to unmap guest pages from the direct mapping and cannot rely
on it for guest memory access. Use a temporary kmap_atomic()-style mapping
to access guest memory.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 virt/kvm/kvm_main.c | 57 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 07d45da5d2aa..63282def3760 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2258,17 +2258,45 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
+static pte_t **guest_map_ptes;
+static struct vm_struct *guest_map_area;
+
+static void *map_page_atomic(struct page *page)
+{
+	pte_t *pte;
+	void *vaddr;
+
+	preempt_disable();
+	pte = guest_map_ptes[smp_processor_id()];
+	vaddr = guest_map_area->addr + smp_processor_id() * PAGE_SIZE;
+	set_pte(pte, mk_pte(page, PAGE_KERNEL));
+	return vaddr;
+}
+
+static void unmap_page_atomic(void *vaddr)
+{
+	pte_t *pte = guest_map_ptes[smp_processor_id()];
+	set_pte(pte, __pte(0));
+	__flush_tlb_one_kernel((unsigned long)vaddr);
+	preempt_enable();
+}
+
 int copy_from_guest(void *data, unsigned long hva, int len)
 {
 	int offset = offset_in_page(hva);
 	struct page *page;
 	int npages, seg;
+	void *vaddr;
 
 	while ((seg = next_segment(len, offset)) != 0) {
 		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_KVM);
 		if (npages != 1)
 			return -EFAULT;
-		memcpy(data, page_address(page) + offset, seg);
+
+		vaddr = map_page_atomic(page);
+		memcpy(data, vaddr + offset, seg);
+		unmap_page_atomic(vaddr);
+
 		put_page(page);
 		len -= seg;
 		hva += seg;
@@ -2283,13 +2311,18 @@ int copy_to_guest(unsigned long hva, const void *data, int len)
 	int offset = offset_in_page(hva);
 	struct page *page;
 	int npages, seg;
+	void *vaddr;
 
 	while ((seg = next_segment(len, offset)) != 0) {
 		npages = get_user_pages_unlocked(hva, 1, &page,
 						 FOLL_WRITE | FOLL_KVM);
 		if (npages != 1)
 			return -EFAULT;
-		memcpy(page_address(page) + offset, data, seg);
+
+		vaddr = map_page_atomic(page);
+		memcpy(vaddr + offset, data, seg);
+		unmap_page_atomic(vaddr);
+
 		put_page(page);
 		len -= seg;
 		hva += seg;
@@ -4921,6 +4954,18 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	if (r)
 		goto out_free;
 
+	if (VM_KVM_PROTECTED) {
+		guest_map_ptes = kmalloc_array(num_possible_cpus(),
+					       sizeof(pte_t *), GFP_KERNEL);
+		if (!guest_map_ptes)
+			goto out_unreg;
+
+		guest_map_area = alloc_vm_area(PAGE_SIZE * num_possible_cpus(),
+					       guest_map_ptes);
+		if (!guest_map_area)
+			goto out_unreg;
+	}
+
 	kvm_chardev_ops.owner = module;
 	kvm_vm_fops.owner = module;
 	kvm_vcpu_fops.owner = module;
@@ -4944,6 +4989,10 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	return 0;
 
 out_unreg:
+	if (guest_map_area)
+		free_vm_area(guest_map_area);
+	if (guest_map_ptes)
+		kfree(guest_map_ptes);
 	kvm_async_pf_deinit();
 out_free:
 	kmem_cache_destroy(kvm_vcpu_cache);
@@ -4965,6 +5014,10 @@ EXPORT_SYMBOL_GPL(kvm_init);
 
 void kvm_exit(void)
 {
+	if (guest_map_area)
+		free_vm_area(guest_map_area);
+	if (guest_map_ptes)
+		kfree(guest_map_ptes);
 	debugfs_remove_recursive(kvm_debugfs_dir);
 	misc_deregister(&kvm_dev);
 	kmem_cache_destroy(kvm_vcpu_cache);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC 12/16] x86/kvm: Share steal time page with host
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 11/16] KVM: Rework copy_to/from_guest() to avoid direct mapping Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-22 12:52 ` [RFC 13/16] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

struct kvm_steal_time is shared between the guest and the host. Mark it as
shared so the host can access it when memory protection is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/kvm.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index f50d65df4412..b0f445796ed1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -286,11 +286,15 @@ static void kvm_register_steal_time(void)
 {
 	int cpu = smp_processor_id();
 	struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
+	unsigned long phys;
 
 	if (!has_steal_clock)
 		return;
 
-	wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
+	phys = slow_virt_to_phys(st);
+	if (kvm_mem_protected())
+		kvm_hypercall2(KVM_HC_MEM_SHARE, phys >> PAGE_SHIFT, 1);
+	wrmsrl(MSR_KVM_STEAL_TIME, (phys | KVM_MSR_ENABLED));
 	pr_info("kvm-stealtime: cpu %d, msr %llx\n",
 		cpu, (unsigned long long) slow_virt_to_phys(st));
 }
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC 13/16] x86/kvmclock: Share hvclock memory with the host
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 12/16] x86/kvm: Share steal time page with host Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-25 15:22   ` Vitaly Kuznetsov
  2020-05-22 12:52 ` [RFC 14/16] KVM: Introduce gfn_to_pfn_memslot_protected() Kirill A. Shutemov
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

hvclock is shared between the guest and the hypervisor. It has to be
accessible by the host.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/kvmclock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 34b18f6eeb2c..ac6c2abe0d0f 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -253,7 +253,7 @@ static void __init kvmclock_init_mem(void)
 	 * hvclock is shared between the guest and the hypervisor, must
 	 * be mapped decrypted.
 	 */
-	if (sev_active()) {
+	if (sev_active() || kvm_mem_protected()) {
 		r = set_memory_decrypted((unsigned long) hvclock_mem,
 					 1UL << order);
 		if (r) {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC 14/16] KVM: Introduce gfn_to_pfn_memslot_protected()
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 13/16] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-22 12:52 ` [RFC 15/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn() Kirill A. Shutemov
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

The new interface allows detecting whether a page is protected.
A protected page cannot be accessed directly by the host: it has to be
mapped manually.

This is preparation for the next patch.
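
For illustration, the expected usage looks roughly like this (a sketch;
the real caller is added in the next patch):

	bool protected;
	kvm_pfn_t pfn;

	pfn = gfn_to_pfn_memslot_protected(slot, gfn, &protected);
	if (!is_error_noslot_pfn(pfn) && protected) {
		/* no direct access: map the page explicitly before use */
	}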

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
 arch/x86/kvm/mmu/mmu.c                 |  6 +++--
 include/linux/kvm_host.h               |  2 +-
 virt/kvm/kvm_main.c                    | 35 ++++++++++++++++++--------
 5 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 2b35f9bcf892..e9a13ecf812f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -587,7 +587,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	} else {
 		/* Call KVM generic code to do the slow-path check */
 		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
-					   writing, &write_ok);
+					   writing, &write_ok, NULL);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
 		page = NULL;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index aa12cd4078b3..58f8df466a94 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -798,7 +798,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 
 		/* Call KVM generic code to do the slow-path check */
 		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
-					   writing, upgrade_p);
+					   writing, upgrade_p, NULL);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
 		page = NULL;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8071952e9cf2..0fc095a66a3c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4096,7 +4096,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 	async = false;
-	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable);
+	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable,
+				    NULL);
 	if (!async)
 		return false; /* *pfn has correct page already */
 
@@ -4110,7 +4111,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 			return true;
 	}
 
-	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
+	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable,
+				    NULL);
 	return false;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d7072f6d6aa0..eca18ef9b1f4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -724,7 +724,7 @@ kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
 kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
 kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool *async, bool write_fault,
-			       bool *writable);
+			       bool *writable, bool *protected);
 
 void kvm_release_pfn_clean(kvm_pfn_t pfn);
 void kvm_release_pfn_dirty(kvm_pfn_t pfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 63282def3760..8bcf3201304a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1779,9 +1779,10 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
  * 1 indicates success, -errno is returned if error is detected.
  */
 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
-			   bool *writable, kvm_pfn_t *pfn)
+			   bool *writable, bool *protected, kvm_pfn_t *pfn)
 {
 	unsigned int flags = FOLL_HWPOISON | FOLL_KVM;
+	struct vm_area_struct *vma;
 	struct page *page;
 	int npages = 0;
 
@@ -1795,9 +1796,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 	if (async)
 		flags |= FOLL_NOWAIT;
 
-	npages = get_user_pages_unlocked(addr, 1, &page, flags);
-	if (npages != 1)
+	down_read(&current->mm->mmap_sem);
+	npages = get_user_pages(addr, 1, flags, &page, &vma);
+	if (npages != 1) {
+		up_read(&current->mm->mmap_sem);
 		return npages;
+	}
+	if (protected)
+		*protected = vma_is_kvm_protected(vma);
+	up_read(&current->mm->mmap_sem);
 
 	/* map read fault as writable if possible */
 	if (unlikely(!write_fault) && writable) {
@@ -1888,7 +1895,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
  *     whether the mapping is writable.
  */
 static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
-			bool write_fault, bool *writable)
+			bool write_fault, bool *writable, bool *protected)
 {
 	struct vm_area_struct *vma;
 	kvm_pfn_t pfn = 0;
@@ -1903,7 +1910,8 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 	if (atomic)
 		return KVM_PFN_ERR_FAULT;
 
-	npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfn);
+	npages = hva_to_pfn_slow(addr, async, write_fault, writable, protected,
+				 &pfn);
 	if (npages == 1)
 		return pfn;
 
@@ -1937,7 +1945,7 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 
 kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool *async, bool write_fault,
-			       bool *writable)
+			       bool *writable, bool *protected)
 {
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
@@ -1960,7 +1968,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 	}
 
 	return hva_to_pfn(addr, atomic, async, write_fault,
-			  writable);
+			  writable, protected);
 }
 EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 
@@ -1968,19 +1976,26 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
 	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
-				    write_fault, writable);
+				    write_fault, writable, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
 kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL);
+	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
+static kvm_pfn_t gfn_to_pfn_memslot_protected(struct kvm_memory_slot *slot,
+					      gfn_t gfn, bool *protected)
+{
+	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL,
+				    protected);
+}
+
 kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL);
+	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC 15/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (13 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 14/16] KVM: Introduce gfn_to_pfn_memslot_protected() Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-22 12:52 ` [RFC 16/16] KVM: Unmap protected pages from direct mapping Kirill A. Shutemov
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

We cannot access protected pages directly. Use ioremap() to
create a temporary mapping of the page. The mapping is destroyed
on __kvm_unmap_gfn().

The new interface gfn_to_pfn_memslot_protected() is used to detect if
the page is protected.

ioremap_cache_force() is a hack to bypass the IORES_MAP_SYSTEM_RAM check in
the x86 ioremap code. We need a better solution.
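
Callers keep going through the kvm_vcpu_map()/kvm_vcpu_unmap() wrappers
and do not need to know whether the page is protected. A sketch of such a
caller (read_guest_u64() is a made-up name, not part of this patch):

	/* Assumes @gpa does not cross a page boundary */
	static int read_guest_u64(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *val)
	{
		struct kvm_host_map map;

		if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
			return -EFAULT;

		*val = *(u64 *)(map.hva + offset_in_page(gpa));

		/* iounmap() for protected pages, kunmap() otherwise */
		kvm_vcpu_unmap(vcpu, &map, false);
		return 0;
	}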

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/io.h            |  2 ++
 arch/x86/include/asm/pgtable_types.h |  1 +
 arch/x86/mm/ioremap.c                | 16 +++++++++++++---
 include/linux/kvm_host.h             |  1 +
 virt/kvm/kvm_main.c                  | 14 +++++++++++---
 5 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index c58d52fd7bf2..a3e1bfad1026 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -184,6 +184,8 @@ extern void __iomem *ioremap_uc(resource_size_t offset, unsigned long size);
 #define ioremap_uc ioremap_uc
 extern void __iomem *ioremap_cache(resource_size_t offset, unsigned long size);
 #define ioremap_cache ioremap_cache
+extern void __iomem *ioremap_cache_force(resource_size_t offset, unsigned long size);
+#define ioremap_cache_force ioremap_cache_force
 extern void __iomem *ioremap_prot(resource_size_t offset, unsigned long size, unsigned long prot_val);
 #define ioremap_prot ioremap_prot
 extern void __iomem *ioremap_encrypted(resource_size_t phys_addr, unsigned long size);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b6606fe6cfdf..66cc22abda7b 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -147,6 +147,7 @@ enum page_cache_mode {
 	_PAGE_CACHE_MODE_UC       = 3,
 	_PAGE_CACHE_MODE_WT       = 4,
 	_PAGE_CACHE_MODE_WP       = 5,
+	_PAGE_CACHE_MODE_WB_FORCE = 6,
 
 	_PAGE_CACHE_MODE_NUM      = 8
 };
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 18c637c0dc6f..e48fc0e130b2 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -202,9 +202,12 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
 	__ioremap_check_mem(phys_addr, size, &io_desc);
 
 	/*
-	 * Don't allow anybody to remap normal RAM that we're using..
+	 * Don't allow anybody to remap normal RAM that we're using, unless
+	 * _PAGE_CACHE_MODE_WB_FORCE is used.
 	 */
-	if (io_desc.flags & IORES_MAP_SYSTEM_RAM) {
+	if (pcm == _PAGE_CACHE_MODE_WB_FORCE) {
+	    pcm = _PAGE_CACHE_MODE_WB;
+	} else if (io_desc.flags & IORES_MAP_SYSTEM_RAM) {
 		WARN_ONCE(1, "ioremap on RAM at %pa - %pa\n",
 			  &phys_addr, &last_addr);
 		return NULL;
@@ -419,6 +422,13 @@ void __iomem *ioremap_cache(resource_size_t phys_addr, unsigned long size)
 }
 EXPORT_SYMBOL(ioremap_cache);
 
+void __iomem *ioremap_cache_force(resource_size_t phys_addr, unsigned long size)
+{
+	return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WB_FORCE,
+				__builtin_return_address(0), false);
+}
+EXPORT_SYMBOL(ioremap_cache_force);
+
 void __iomem *ioremap_prot(resource_size_t phys_addr, unsigned long size,
 				unsigned long prot_val)
 {
@@ -467,7 +477,7 @@ void iounmap(volatile void __iomem *addr)
 	p = find_vm_area((void __force *)addr);
 
 	if (!p) {
-		printk(KERN_ERR "iounmap: bad address %p\n", addr);
+		printk(KERN_ERR "iounmap: bad address %px\n", addr);
 		dump_stack();
 		return;
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index eca18ef9b1f4..b6944f88033d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -237,6 +237,7 @@ struct kvm_host_map {
 	void *hva;
 	kvm_pfn_t pfn;
 	kvm_pfn_t gfn;
+	bool protected;
 };
 
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8bcf3201304a..71aac117357f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2091,6 +2091,7 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
 	void *hva = NULL;
 	struct page *page = KVM_UNMAPPED_PAGE;
 	struct kvm_memory_slot *slot = __gfn_to_memslot(slots, gfn);
+	bool protected = false;
 	u64 gen = slots->generation;
 
 	if (!map)
@@ -2107,12 +2108,16 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
 	} else {
 		if (atomic)
 			return -EAGAIN;
-		pfn = gfn_to_pfn_memslot(slot, gfn);
+		pfn = gfn_to_pfn_memslot_protected(slot, gfn, &protected);
 	}
 	if (is_error_noslot_pfn(pfn))
 		return -EINVAL;
 
-	if (pfn_valid(pfn)) {
+	if (protected) {
+		if (atomic)
+			return -EAGAIN;
+		hva = ioremap_cache_force(pfn_to_hpa(pfn), PAGE_SIZE);
+	} else if (pfn_valid(pfn)) {
 		page = pfn_to_page(pfn);
 		if (atomic)
 			hva = kmap_atomic(page);
@@ -2133,6 +2138,7 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
 	map->hva = hva;
 	map->pfn = pfn;
 	map->gfn = gfn;
+	map->protected = protected;
 
 	return 0;
 }
@@ -2163,7 +2169,9 @@ static void __kvm_unmap_gfn(struct kvm_memory_slot *memslot,
 	if (!map->hva)
 		return;
 
-	if (map->page != KVM_UNMAPPED_PAGE) {
+	if (map->protected) {
+		iounmap(map->hva);
+	} else if (map->page != KVM_UNMAPPED_PAGE) {
 		if (atomic)
 			kunmap_atomic(map->hva);
 		else
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC 16/16] KVM: Unmap protected pages from direct mapping
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (14 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 15/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn() Kirill A. Shutemov
@ 2020-05-22 12:52 ` Kirill A. Shutemov
  2020-05-26  6:16   ` Mike Rapoport
  2020-05-25  5:27 ` [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 12:52 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

If the protected memory feature is enabled, unmap guest memory from the
kernel's direct mapping.

Migration and KSM are disabled for protected memory as they would require
special treatment.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/pat/set_memory.c |  1 +
 include/linux/kvm_host.h     |  3 ++
 mm/huge_memory.c             |  9 +++++
 mm/ksm.c                     |  3 ++
 mm/memory.c                  | 13 +++++++
 mm/rmap.c                    |  4 ++
 virt/kvm/kvm_main.c          | 74 ++++++++++++++++++++++++++++++++++++
 7 files changed, 107 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 6f075766bb94..13988413af40 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2227,6 +2227,7 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 
 	arch_flush_lazy_mmu_mode();
 }
+EXPORT_SYMBOL_GPL(__kernel_map_pages);
 
 #ifdef CONFIG_HIBERNATION
 bool kernel_page_present(struct page *page)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b6944f88033d..e1d7762b615c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -705,6 +705,9 @@ int kvm_protect_all_memory(struct kvm *kvm);
 int kvm_protect_memory(struct kvm *kvm,
 		       unsigned long gfn, unsigned long npages, bool protect);
 
+void kvm_map_page(struct page *page, int nr_pages);
+void kvm_unmap_page(struct page *page, int nr_pages);
+
 int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
 			    struct page **pages, int nr_pages);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3562648a4ef..d8a444a401cc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,7 @@
 #include <linux/oom.h>
 #include <linux/numa.h>
 #include <linux/page_owner.h>
+#include <linux/kvm_host.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -650,6 +651,10 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		spin_unlock(vmf->ptl);
 		count_vm_event(THP_FAULT_ALLOC);
 		count_memcg_events(memcg, THP_FAULT_ALLOC, 1);
+
+		/* Unmap page from direct mapping */
+		if (vma_is_kvm_protected(vma))
+			kvm_unmap_page(page, HPAGE_PMD_NR);
 	}
 
 	return 0;
@@ -1886,6 +1891,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			page_remove_rmap(page, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
+
+			/* Map the page back to the direct mapping */
+			if (vma_is_kvm_protected(vma))
+				kvm_map_page(page, HPAGE_PMD_NR);
 		} else if (thp_migration_supported()) {
 			swp_entry_t entry;
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 281c00129a2e..942b88782ac2 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -527,6 +527,9 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
 		return NULL;
 	if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
 		return NULL;
+	/* TODO */
+	if (vma_is_kvm_protected(vma))
+		return NULL;
 	return vma;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index d7228db6e4bf..74773229b854 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -71,6 +71,7 @@
 #include <linux/dax.h>
 #include <linux/oom.h>
 #include <linux/numa.h>
+#include <linux/kvm_host.h>
 
 #include <trace/events/kmem.h>
 
@@ -1088,6 +1089,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
 			}
+
+			/* Map the page back to the direct mapping */
+			if (vma_is_anonymous(vma) && vma_is_kvm_protected(vma))
+				kvm_map_page(page, 1);
+
 			rss[mm_counter(page)]--;
 			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
@@ -3312,6 +3318,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct page *page;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	bool set = false;
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -3397,6 +3404,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	page_add_new_anon_rmap(page, vma, vmf->address, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, vma);
+	set = true;
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3404,6 +3412,11 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	/* Unmap page from direct mapping */
+	if (vma_is_kvm_protected(vma) && set)
+		kvm_unmap_page(page, 1);
+
 	return ret;
 release:
 	mem_cgroup_cancel_charge(page, memcg, false);
diff --git a/mm/rmap.c b/mm/rmap.c
index f79a206b271a..a9b2e347d1ab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1709,6 +1709,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg)
 {
+	/* TODO */
+	if (vma_is_kvm_protected(vma))
+		return true;
+
 	return vma_is_temporary_stack(vma);
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 71aac117357f..defc33d3a124 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -51,6 +51,7 @@
 #include <linux/io.h>
 #include <linux/lockdep.h>
 #include <linux/kthread.h>
+#include <linux/pagewalk.h>
 
 #include <asm/processor.h>
 #include <asm/ioctl.h>
@@ -2718,6 +2719,72 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
 
+void kvm_map_page(struct page *page, int nr_pages)
+{
+	int i;
+
+	/* Clear page before returning it to the direct mapping */
+	for (i = 0; i < nr_pages; i++) {
+		void *p = map_page_atomic(page + i);
+		memset(p, 0, PAGE_SIZE);
+		unmap_page_atomic(p);
+	}
+
+	kernel_map_pages(page, nr_pages, 1);
+}
+EXPORT_SYMBOL_GPL(kvm_map_page);
+
+void kvm_unmap_page(struct page *page, int nr_pages)
+{
+	kernel_map_pages(page, nr_pages, 0);
+}
+EXPORT_SYMBOL_GPL(kvm_unmap_page);
+
+static int adjust_direct_mapping_pte_range(pmd_t *pmd, unsigned long addr,
+					   unsigned long end,
+					   struct mm_walk *walk)
+{
+	bool protect = (bool)walk->private;
+	pte_t *pte;
+	struct page *page;
+
+	if (pmd_trans_huge(*pmd)) {
+		page = pmd_page(*pmd);
+		if (is_huge_zero_page(page))
+			return 0;
+		VM_BUG_ON_PAGE(total_mapcount(page) != 1, page);
+		/* XXX: Would it fail with direct device assignment? */
+		VM_BUG_ON_PAGE(page_count(page) != 1, page);
+		kernel_map_pages(page, HPAGE_PMD_NR, !protect);
+		return 0;
+	}
+
+	pte = pte_offset_map(pmd, addr);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		pte_t entry = *pte;
+
+		if (!pte_present(entry))
+			continue;
+
+		if (is_zero_pfn(pte_pfn(entry)))
+			continue;
+
+		page = pte_page(entry);
+
+		VM_BUG_ON_PAGE(page_mapcount(page) != 1, page);
+		/* XXX: Would it fail with direct device assignment? */
+		VM_BUG_ON_PAGE(page_count(page) !=
+			       total_mapcount(compound_head(page)), page);
+		kernel_map_pages(page, 1, !protect);
+	}
+
+	return 0;
+}
+
+static const struct mm_walk_ops adjust_direct_mapping_ops = {
+	.pmd_entry	= adjust_direct_mapping_pte_range,
+};
+
 static int protect_memory(unsigned long start, unsigned long end, bool protect)
 {
 	struct mm_struct *mm = current->mm;
@@ -2763,6 +2830,13 @@ static int protect_memory(unsigned long start, unsigned long end, bool protect)
 		if (ret)
 			goto out;
 
+		if (vma_is_anonymous(vma)) {
+			ret = walk_page_range_novma(mm, start, tmp,
+					    &adjust_direct_mapping_ops, NULL,
+					    (void *) protect);
+			if (ret)
+				goto out;
+		}
 next:
 		start = tmp;
 		if (start < prev->vm_end)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (15 preceding siblings ...)
  2020-05-22 12:52 ` [RFC 16/16] KVM: Unmap protected pages from direct mapping Kirill A. Shutemov
@ 2020-05-25  5:27 ` Kirill A. Shutemov
  2020-05-25 13:47 ` Liran Alon
  2020-06-04 15:15 ` Marc Zyngier
  18 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-25  5:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Mike Rapoport, Alexandre Chartre,
	Marius Hillenbrand

On Fri, May 22, 2020 at 03:51:58PM +0300, Kirill A. Shutemov wrote:
> == Background / Problem ==
> 
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.

CC people who worked on the related patchsets.
 
> == What does this set mitigate? ==
> 
>  - Host kernel ”accidental” access to guest data (think speculation)
> 
>  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> 
>  - Host userspace access to guest data (compromised qemu)
> 
> == What does this set NOT mitigate? ==
> 
>  - Full host kernel compromise.  Kernel will just map the pages again.
> 
>  - Hardware attacks
> 
> 
> The patchset is RFC-quality: it works but has known issues that must be
> addressed before it can be considered for applying.
> 
> We are looking for high-level feedback on the concept.  Some open
> questions:
> 
>  - This protects from some kernel and host userspace read-only attacks,
>    but does not place the host kernel outside the trust boundary. Is it
>    still valuable?
> 
>  - Can this approach be used to avoid cache-coherency problems with
>    hardware encryption schemes that repurpose physical bits?
> 
>  - The guest kernel must be modified for this to work.  Is that a deal
>    breaker, especially for public clouds?
> 
>  - Are the costs of removing pages from the direct map too high to be
>    feasible?
> 
> == Series Overview ==
> 
> The hardware features protect guest data by encrypting it and then
> ensuring that only the right guest can decrypt it.  This has the
> side-effect of making the kernel direct map and userspace mapping
> (QEMU et al) useless.  But, this teaches us something very useful:
> neither the kernel or userspace mappings are really necessary for normal
> guest operations.
> 
> Instead of using encryption, this series simply unmaps the memory. One
> advantage compared to allowing access to ciphertext is that it allows bad
> accesses to be caught instead of simply reading garbage.
> 
> Protection from physical attacks needs to be provided by some other means.
> On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> mitigation against physical attacks, such as DIMM interposers sniffing
> memory bus traffic.
> 
> The patchset modifies both host and guest kernel. The guest OS must enable
> the feature via hypercall and mark any memory range that has to be shared
> with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
> bit in the guest’s page table while this approach uses a hypercall.
> 
> For removing the userspace mapping, use a trick similar to what NUMA
> balancing does: convert memory that belongs to KVM memory slots to
> PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
> 
> Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> flushes local TLB. I think it's a reasonable compromise between security and
> perfromance.
> 
> Zapping the PTE would bring the page back to the direct mapping after clearing.
> At least for now, we don't remove file-backed pages from the direct mapping.
> File-backed pages could be accessed via read/write syscalls. It adds
> complexity.
> 
> Occasionally, host kernel has to access guest memory that was not made
> shared by the guest. For instance, it happens for instruction emulation.
> Normally, it's done via copy_to/from_user() which would fail with -EFAULT
> now. We introduced a new pair of helpers: copy_to/from_guest(). The new
> helpers acquire the page via GUP, map it into kernel address space with
> kmap_atomic()-style mechanism and only then copy the data.
> 
> For some instruction emulation copying is not good enough: cmpxchg
> emulation has to have direct access to the guest memory. __kvm_map_gfn()
> is modified to accommodate the case.
> 
> The patchset is on top of v5.7-rc6 plus this patch:
> 
> https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com
> 
> == Open Issues ==
> 
> Unmapping the pages from direct mapping bring a few of issues that have
> not rectified yet:
> 
>  - Touching direct mapping leads to fragmentation. We need to be able to
>    recover from it. I have a buggy patch that aims at recovering 2M/1G page.
>    It has to be fixed and tested properly
> 
>  - Page migration and KSM is not supported yet.
> 
>  - Live migration of a guest would require a new flow. Not sure yet how it
>    would look like.
> 
>  - The feature interfere with NUMA balancing. Not sure yet if it's
>    possible to make them work together.
> 
>  - Guests have no mechanism to ensure that even a well-behaving host has
>    unmapped its private data.  With SEV, for instance, the guest only has
>    to trust the hardware to encrypt a page after the C bit is set in a
>    guest PTE.  A mechanism for a guest to query the host mapping state, or
>    to constantly assert the intent for a page to be Private would be
>    valuable.
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (16 preceding siblings ...)
  2020-05-25  5:27 ` [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
@ 2020-05-25 13:47 ` Liran Alon
  2020-05-25 14:46   ` Kirill A. Shutemov
  2020-05-26  6:17   ` Mike Rapoport
  2020-06-04 15:15 ` Marc Zyngier
  18 siblings, 2 replies; 62+ messages in thread
From: Liran Alon @ 2020-05-25 13:47 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov


On 22/05/2020 15:51, Kirill A. Shutemov wrote:
> == Background / Problem ==
>
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.
>
>
> == What does this set mitigate? ==
>
>   - Host kernel ”accidental” access to guest data (think speculation)

Just to clarify: this covers any host kernel memory info-leak
vulnerability, not just speculative-execution info-leaks but architectural
ones as well.

In addition, note that removing guest data from the host kernel VA space
also makes guest<->host memory exploits more difficult. E.g. a guest
cannot reuse an already-mapped buffer in kernel VA space for ROP or for
placing valuable guest-controlled code/data in general.

>
>   - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>
>   - Host userspace access to guest data (compromised qemu)

I don't quite understand what the benefit is of preventing userspace VMM
access to guest data while the host kernel can still access it.

QEMU is more easily compromised than the host kernel because its
guest<->host attack surface is larger (e.g. various device emulation).
But such a compromise comes from the guest itself, not from other guests.
This is in contrast to the host kernel attack surface, where an info-leak
can be exploited from one guest to leak another guest's data.
>
> == What does this set NOT mitigate? ==
>
>   - Full host kernel compromise.  Kernel will just map the pages again.
>
>   - Hardware attacks
>
>
> The patchset is RFC-quality: it works but has known issues that must be
> addressed before it can be considered for applying.
>
> We are looking for high-level feedback on the concept.  Some open
> questions:
>
>   - This protects from some kernel and host userspace read-only attacks,
>     but does not place the host kernel outside the trust boundary. Is it
>     still valuable?
I don't currently see a good argument for preventing host userspace
access to guest data while the host kernel can still access it.
But there is definitely a strong benefit in mitigating kernel info-leaks
exploitable from one guest to leak another guest's data.
>
>   - Can this approach be used to avoid cache-coherency problems with
>     hardware encryption schemes that repurpose physical bits?
>
>   - The guest kernel must be modified for this to work.  Is that a deal
>     breaker, especially for public clouds?
>
>   - Are the costs of removing pages from the direct map too high to be
>     feasible?

If I remember correctly, this perf cost was too high when the XPFO
(eXclusive Page Frame Ownership) patch-series was considered.
It created two major perf costs:
1) Removing pages from the direct-map prevented the direct-map from simply
being mapped entirely as 1GB huge-pages.
2) Frequent allocation/free of userspace pages resulted in frequent TLB
invalidations.

Having said that, (1) can be mitigated if guest data is completely
allocated from 1GB hugetlbfs, which guarantees it will not create smaller
holes in the direct-map. And (2) is not relevant for the QEMU/KVM
use-case.

This makes me wonder:
The XPFO patch-series, applied to the QEMU/KVM context, seems to provide
exactly the functionality of this patch-series, with the exception of the
additional "feature" of preventing guest data from also being accessible
to the host userspace VMM, i.e. XPFO will unmap guest pages from the host
kernel direct-map while still keeping them mapped in host userspace VMM
page-tables.

If I understand correctly, this "feature" is what brings most of the extra
complexity of this patch-series compared to XPFO. It requires guest
modifications to explicitly specify to the host which pages can be
accessed by the userspace VMM, it requires changes to add the new
VM_KVM_PROTECTED VMA flag & FOLL_KVM for GUP, and it creates issues with
Live-Migration support.

So if there is no strong convincing argument for the motivation to prevent
userspace VMM access to guest data *while the host kernel can still access
guest data*, I don't see a good reason for using this approach.

Furthermore, I would like to point out that just unmapping guest data from
the kernel direct-map is not sufficient to prevent all guest-to-guest
info-leaks via a kernel memory info-leak vulnerability. This is because
the host kernel VA space has other regions which contain guest-sensitive
data. For example, the KVM per-vCPU struct (which holds vCPU state) is
allocated on the slab and is therefore still leakable.

I recommend you have a look at my (and Alexandre Chartre's) KVM Forum 2019
talk on KVM ASI, which provides extensive background on the various
attempts by the community to mitigate host kernel memory info-leaks
exploitable by a guest to leak other guests' data:
https://static.sched.com/hosted_files/kvmforum2019/34/KVM%20Forum%202019%20KVM%20ASI.pdf

>
> == Series Overview ==
>
> The hardware features protect guest data by encrypting it and then
> ensuring that only the right guest can decrypt it.  This has the
> side-effect of making the kernel direct map and userspace mapping
> (QEMU et al) useless.  But, this teaches us something very useful:
> neither the kernel or userspace mappings are really necessary for normal
> guest operations.
>
> Instead of using encryption, this series simply unmaps the memory. One
> advantage compared to allowing access to ciphertext is that it allows bad
> accesses to be caught instead of simply reading garbage.
>
> Protection from physical attacks needs to be provided by some other means.
> On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> mitigation against physical attacks, such as DIMM interposers sniffing
> memory bus traffic.
>
> The patchset modifies both host and guest kernel. The guest OS must enable
> the feature via hypercall and mark any memory range that has to be shared
> with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
> bit in the guest’s page table while this approach uses a hypercall.
>
> For removing the userspace mapping, use a trick similar to what NUMA
> balancing does: convert memory that belongs to KVM memory slots to
> PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
>
> Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> flushes local TLB. I think it's a reasonable compromise between security and
> perfromance.
>
> Zapping the PTE would bring the page back to the direct mapping after clearing.
> At least for now, we don't remove file-backed pages from the direct mapping.
> File-backed pages could be accessed via read/write syscalls. It adds
> complexity.
>
> Occasionally, host kernel has to access guest memory that was not made
> shared by the guest. For instance, it happens for instruction emulation.
> Normally, it's done via copy_to/from_user() which would fail with -EFAULT
> now. We introduced a new pair of helpers: copy_to/from_guest(). The new
> helpers acquire the page via GUP, map it into kernel address space with
> kmap_atomic()-style mechanism and only then copy the data.
>
> For some instruction emulation copying is not good enough: cmpxchg
> emulation has to have direct access to the guest memory. __kvm_map_gfn()
> is modified to accommodate the case.
>
> The patchset is on top of v5.7-rc6 plus this patch:
>
> https://urldefense.com/v3/__https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com__;!!GqivPVa7Brio!MSTb9DzpOUJMLMaMq-J7QOkopsKIlAYXpIxiu5FwFYfRctwIyNi8zBJWvlt89j8$
>
> == Open Issues ==
>
> Unmapping the pages from direct mapping bring a few of issues that have
> not rectified yet:
>
>   - Touching direct mapping leads to fragmentation. We need to be able to
>     recover from it. I have a buggy patch that aims at recovering 2M/1G page.
>     It has to be fixed and tested properly
As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
will lead to holes in the kernel direct-map which force it to no longer be
mapped as a series of 1GB huge-pages.
This has a non-trivial performance cost. Thus, I am not sure addressing
this use-case is valuable.
>
>   - Page migration and KSM is not supported yet.
>
>   - Live migration of a guest would require a new flow. Not sure yet how it
>     would look like.

Note that the Live-Migration issue is a result of not making guest data
accessible to the host userspace VMM.

-Liran

>
>   - The feature interfere with NUMA balancing. Not sure yet if it's
>     possible to make them work together.
>
>   - Guests have no mechanism to ensure that even a well-behaving host has
>     unmapped its private data.  With SEV, for instance, the guest only has
>     to trust the hardware to encrypt a page after the C bit is set in a
>     guest PTE.  A mechanism for a guest to query the host mapping state, or
>     to constantly assert the intent for a page to be Private would be
>     valuable.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-05-25 13:47 ` Liran Alon
@ 2020-05-25 14:46   ` Kirill A. Shutemov
  2020-05-25 15:56     ` Liran Alon
  2020-05-26  6:17   ` Mike Rapoport
  1 sibling, 1 reply; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-25 14:46 UTC (permalink / raw)
  To: Liran Alon
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
> 
> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
> > == Background / Problem ==
> > 
> > There are a number of hardware features (MKTME, SEV) which protect guest
> > memory from some unauthorized host access. The patchset proposes a purely
> > software feature that mitigates some of the same host-side read-only
> > attacks.
> > 
> > 
> > == What does this set mitigate? ==
> > 
> >   - Host kernel ”accidental” access to guest data (think speculation)
> 
> Just to clarify: This is any host kernel memory info-leak vulnerability. Not
> just speculative execution memory info-leaks. Also architectural ones.
> 
> In addition, note that removing guest data from host kernel VA space also
> makes guest<->host memory exploits more difficult.
> E.g. Guest cannot use already available memory buffer in kernel VA space for
> ROP or placing valuable guest-controlled code/data in general.
> 
> > 
> >   - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> > 
> >   - Host userspace access to guest data (compromised qemu)
> 
> I don't quite understand what is the benefit of preventing userspace VMM
> access to guest data while the host kernel can still access it.

Let me clarify: the guest memory mapped into host userspace is not
accessible by either the host kernel or userspace. The host still has a
way to access it via a new interface: GUP(FOLL_KVM). GUP will give you a
struct page that the kernel has to map (temporarily) if it needs to access
the data. So only blessed codepaths would know how to deal with the
memory.
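
To illustrate, a blessed codepath looks roughly like the copy_from_guest()
helper from 11/16 (simplified to a single page, error handling trimmed):

	struct page *page;
	void *vaddr;

	if (get_user_pages_unlocked(hva, 1, &page, FOLL_KVM) != 1)
		return -EFAULT;

	vaddr = map_page_atomic(page);	/* temporary kernel mapping */
	memcpy(data, vaddr + offset_in_page(hva), len);
	unmap_page_atomic(vaddr);
	put_page(page);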

It can help prevent some host->guest attacks on a compromised host: if a
VM has successfully attacked the host, it cannot attack other VMs as
easily.

It would also help to protect against guest->host attacks by removing one
more place where the guest's data is mapped on the host.

> QEMU is more easily compromised than the host kernel because it's
> guest<->host attack surface is larger (E.g. Various device emulation).
> But this compromise comes from the guest itself. Not other guests. In
> contrast to host kernel attack surface, which an info-leak there can
> be exploited from one guest to leak another guest data.

Consider the case where an unprivileged guest user exploits a bug in QEMU
device emulation to gain access to data it would not normally have access
to within the guest. With the feature, it would only be able to see the
shared regions of guest memory, such as DMA and IO buffers, but not the
rest.

> > 
> > == What does this set NOT mitigate? ==
> > 
> >   - Full host kernel compromise.  Kernel will just map the pages again.
> > 
> >   - Hardware attacks
> > 
> > 
> > The patchset is RFC-quality: it works but has known issues that must be
> > addressed before it can be considered for applying.
> > 
> > We are looking for high-level feedback on the concept.  Some open
> > questions:
> > 
> >   - This protects from some kernel and host userspace read-only attacks,
> >     but does not place the host kernel outside the trust boundary. Is it
> >     still valuable?
> I don't currently see a good argument for preventing host userspace access
> to guest data while host kernel can still access it.
> But there is definitely strong benefit of mitigating kernel info-leaks
> exploitable from one guest to leak another guest data.
> > 
> >   - Can this approach be used to avoid cache-coherency problems with
> >     hardware encryption schemes that repurpose physical bits?
> > 
> >   - The guest kernel must be modified for this to work.  Is that a deal
> >     breaker, especially for public clouds?
> > 
> >   - Are the costs of removing pages from the direct map too high to be
> >     feasible?
> 
> If I remember correctly, this perf cost was too high when considering XPFO
> (eXclusive Page Frame Ownership) patch-series.
> This created two major perf costs:
> 1) Removing pages from direct-map prevented direct-map from simply be
> entirely mapped as 1GB huge-pages.
> 2) Frequent allocation/free of userspace pages resulted in frequent TLB
> invalidations.
> 
> Having said that, (1) can be mitigated in case guest data is completely
> allocated from 1GB hugetlbfs to guarantee it will not
> create smaller holes in direct-map. And (2) is not relevant for QEMU/KVM
> use-case.

I'm too invested in THP to give it up for the ugly hugetlbfs. I think we
can do better :)

> This makes me wonder:
> XPFO patch-series, applied to the context of QEMU/KVM, seems to provide
> exactly the functionality of this patch-series,
> with the exception of the additional "feature" of preventing guest data from
> also being accessible to host userspace VMM.
> i.e. XPFO will unmap guest pages from host kernel direct-map while still
> keeping them mapped in host userspace VMM page-tables.
> 
> If I understand correctly, this "feature" is what brings most of the extra
> complexity of this patch-series compared to XPFO.
> It requires guest modification to explicitly specify to host which pages can
> be accessed by userspace VMM, it requires
> changes to add new VM_KVM_PROTECTED VMA flag & FOLL_KVM for GUP, and it
> creates issues with Live-Migration support.
> 
> So if there is no strong convincing argument for the motivation to prevent
> userspace VMM access to guest data *while host kernel
> can still access guest data*, I don't see a good reason for using this
> approach.

Well, I disagree with you here. See the few points above.

> Furthermore, I would like to point out that just unmapping guest data from
> kernel direct-map is not sufficient to prevent all
> guest-to-guest info-leaks via a kernel memory info-leak vulnerability. This
> is because host kernel VA space have other regions
> which contains guest sensitive data. For example, KVM per-vCPU struct (which
> holds vCPU state) is allocated on slab and therefore
> still leakable.
> 
> I recommend you will have a look at my (and Alexandre Charte) KVM Forum 2019
> talk on KVM ASI which provides extensive background
> on the various attempts done by the community for mitigating host kernel
> memory info-leaks exploitable by guest to leak other guests data:
> https://static.sched.com/hosted_files/kvmforum2019/34/KVM%20Forum%202019%20KVM%20ASI.pdf

Thanks, I'll read up on it.

> > == Series Overview ==
> > 
> > The hardware features protect guest data by encrypting it and then
> > ensuring that only the right guest can decrypt it.  This has the
> > side-effect of making the kernel direct map and userspace mapping
> > (QEMU et al) useless.  But, this teaches us something very useful:
> > neither the kernel or userspace mappings are really necessary for normal
> > guest operations.
> > 
> > Instead of using encryption, this series simply unmaps the memory. One
> > advantage compared to allowing access to ciphertext is that it allows bad
> > accesses to be caught instead of simply reading garbage.
> > 
> > Protection from physical attacks needs to be provided by some other means.
> > On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> > mitigation against physical attacks, such as DIMM interposers sniffing
> > memory bus traffic.
> > 
> > The patchset modifies both host and guest kernel. The guest OS must enable
> > the feature via hypercall and mark any memory range that has to be shared
> > with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
> > bit in the guest’s page table while this approach uses a hypercall.
> > 
> > For removing the userspace mapping, use a trick similar to what NUMA
> > balancing does: convert memory that belongs to KVM memory slots to
> > PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> > the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> > The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> > VMA must be treated in a special way in the GUP and fault paths. The flag
> > allows GUP to return the page even though it is mapped with PROT_NONE, but
> > only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> > to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> > would result in -EFAULT.
> > 
> > Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> > the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> > flushes local TLB. I think it's a reasonable compromise between security and
> > perfromance.
> > 
> > Zapping the PTE would bring the page back to the direct mapping after clearing.
> > At least for now, we don't remove file-backed pages from the direct mapping.
> > File-backed pages could be accessed via read/write syscalls. It adds
> > complexity.
> > 
> > Occasionally, host kernel has to access guest memory that was not made
> > shared by the guest. For instance, it happens for instruction emulation.
> > Normally, it's done via copy_to/from_user() which would fail with -EFAULT
> > now. We introduced a new pair of helpers: copy_to/from_guest(). The new
> > helpers acquire the page via GUP, map it into kernel address space with
> > kmap_atomic()-style mechanism and only then copy the data.
> > 
> > For some instruction emulation copying is not good enough: cmpxchg
> > emulation has to have direct access to the guest memory. __kvm_map_gfn()
> > is modified to accommodate the case.
> > 
> > The patchset is on top of v5.7-rc6 plus this patch:
> > 
> > https://urldefense.com/v3/__https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com__;!!GqivPVa7Brio!MSTb9DzpOUJMLMaMq-J7QOkopsKIlAYXpIxiu5FwFYfRctwIyNi8zBJWvlt89j8$
> > 
> > == Open Issues ==
> > 
> > Unmapping the pages from direct mapping bring a few of issues that have
> > not rectified yet:
> > 
> >   - Touching direct mapping leads to fragmentation. We need to be able to
> >     recover from it. I have a buggy patch that aims at recovering 2M/1G page.
> >     It has to be fixed and tested properly
> As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
> will lead to holes in the kernel direct map which force it to no longer be
> mapped as a series of 1GB huge pages.
> This has a non-trivial performance cost. Thus, I am not sure addressing
> this use-case is valuable.

Here's the buggy patch I've referred to:

http://lore.kernel.org/r/20200416213229.19174-1-kirill.shutemov@linux.intel.com

I plan to get it working right.

> > 
> >   - Page migration and KSM is not supported yet.
> > 
> >   - Live migration of a guest would require a new flow. Not sure yet how it
> >     would look like.
> 
> Note that the Live-Migration issue is a result of not making guest data
> accessible to the host userspace VMM.

Yes, I understand.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 02/16] x86/kvm: Introduce KVM memory protection feature
  2020-05-22 12:52 ` [RFC 02/16] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
@ 2020-05-25 14:58   ` Vitaly Kuznetsov
  2020-05-25 15:15     ` Kirill A. Shutemov
  0 siblings, 1 reply; 62+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-25 14:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> Provide basic helpers, KVM_FEATURE and a hypercall.
>
> Host side doesn't provide the feature yet, so it is dead code for now.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/include/asm/kvm_para.h      |  5 +++++
>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
>  arch/x86/kernel/kvm.c                | 16 ++++++++++++++++
>  include/uapi/linux/kvm_para.h        |  3 ++-
>  4 files changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 9b4df6eaa11a..3ce84fc07144 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -10,11 +10,16 @@ extern void kvmclock_init(void);
>  
>  #ifdef CONFIG_KVM_GUEST
>  bool kvm_check_and_clear_guest_paused(void);
> +bool kvm_mem_protected(void);
>  #else
>  static inline bool kvm_check_and_clear_guest_paused(void)
>  {
>  	return false;
>  }
> +static inline bool kvm_mem_protected(void)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_KVM_GUEST */
>  
>  #define KVM_HYPERCALL \
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 2a8e0b6b9805..c3b499acc98f 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -28,9 +28,10 @@
>  #define KVM_FEATURE_PV_UNHALT		7
>  #define KVM_FEATURE_PV_TLB_FLUSH	9
>  #define KVM_FEATURE_ASYNC_PF_VMEXIT	10
> -#define KVM_FEATURE_PV_SEND_IPI	11
> +#define KVM_FEATURE_PV_SEND_IPI		11

Nit: spurious change

>  #define KVM_FEATURE_POLL_CONTROL	12
>  #define KVM_FEATURE_PV_SCHED_YIELD	13
> +#define KVM_FEATURE_MEM_PROTECTED	14
>  
>  #define KVM_HINTS_REALTIME      0
>  
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 6efe0410fb72..bda761ca0d26 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -35,6 +35,13 @@
>  #include <asm/tlb.h>
>  #include <asm/cpuidle_haltpoll.h>
>  
> +static bool mem_protected;
> +
> +bool kvm_mem_protected(void)
> +{
> +	return mem_protected;
> +}
> +

Honestly, I don't see a need for kvm_mem_protected(), just rename the
bool if you need kvm_ prefix :-)

>  static int kvmapf = 1;
>  
>  static int __init parse_no_kvmapf(char *arg)
> @@ -727,6 +734,15 @@ static void __init kvm_init_platform(void)
>  {
>  	kvmclock_init();
>  	x86_platform.apic_post_init = kvm_apic_init;
> +
> +	if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
> +		if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
> +			pr_err("Failed to enable KVM memory protection\n");
> +			return;
> +		}
> +
> +		mem_protected = true;
> +	}
>  }

Personally, I'd prefer to do this via setting a bit in a KVM-specific
MSR instead. The benefit is that the guest doesn't need to remember if
it enabled the feature or not, it can always read the config MSR. May
come in handy for e.g. kexec/kdump.
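
Roughly something like this on the guest side (just a sketch; the MSR
index, bit layout and function name below are made up for illustration,
this is not an existing interface):

#define MSR_KVM_MEM_PROTECTED		0x4b564d08	/* hypothetical */
#define KVM_MEM_PROTECTED_ENABLE	BIT_ULL(0)

static void __init kvm_enable_mem_protection(void)
{
	u64 val;

	if (!kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED))
		return;

	rdmsrl(MSR_KVM_MEM_PROTECTED, val);
	wrmsrl(MSR_KVM_MEM_PROTECTED, val | KVM_MEM_PROTECTED_ENABLE);

	/* Re-read so that e.g. a kexec'ed kernel sees the current state. */
	rdmsrl(MSR_KVM_MEM_PROTECTED, val);
	mem_protected = val & KVM_MEM_PROTECTED_ENABLE;
}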

>  
>  const __initconst struct hypervisor_x86 x86_hyper_kvm = {
> diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
> index 8b86609849b9..1a216f32e572 100644
> --- a/include/uapi/linux/kvm_para.h
> +++ b/include/uapi/linux/kvm_para.h
> @@ -27,8 +27,9 @@
>  #define KVM_HC_MIPS_EXIT_VM		7
>  #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
>  #define KVM_HC_CLOCK_PAIRING		9
> -#define KVM_HC_SEND_IPI		10
> +#define KVM_HC_SEND_IPI			10

Same spurious change detected.

>  #define KVM_HC_SCHED_YIELD		11
> +#define KVM_HC_ENABLE_MEM_PROTECTED	12
>  
>  /*
>   * hypercalls use architecture specific

-- 
Vitaly


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-05-22 12:52 ` [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory Kirill A. Shutemov
@ 2020-05-25 15:08   ` Vitaly Kuznetsov
  2020-05-25 15:17     ` Kirill A. Shutemov
  2020-05-26  6:14   ` Mike Rapoport
  2020-05-29 15:24   ` Kees Cook
  2 siblings, 1 reply; 62+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-25 15:08 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> New helpers copy_from_guest()/copy_to_guest() to be used if KVM memory
> protection feature is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/kvm_host.h |  4 +++
>  virt/kvm/kvm_main.c      | 78 ++++++++++++++++++++++++++++++++++------
>  2 files changed, 72 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 131cc1527d68..bd0bb600f610 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -503,6 +503,7 @@ struct kvm {
>  	struct srcu_struct srcu;
>  	struct srcu_struct irq_srcu;
>  	pid_t userspace_pid;
> +	bool mem_protected;
>  };
>  
>  #define kvm_err(fmt, ...) \
> @@ -727,6 +728,9 @@ void kvm_set_pfn_dirty(kvm_pfn_t pfn);
>  void kvm_set_pfn_accessed(kvm_pfn_t pfn);
>  void kvm_get_pfn(kvm_pfn_t pfn);
>  
> +int copy_from_guest(void *data, unsigned long hva, int len);
> +int copy_to_guest(unsigned long hva, const void *data, int len);
> +
>  void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache);
>  int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>  			int len);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 731c1e517716..033471f71dae 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2248,8 +2248,48 @@ static int next_segment(unsigned long len, int offset)
>  		return len;
>  }
>  
> +int copy_from_guest(void *data, unsigned long hva, int len)
> +{
> +	int offset = offset_in_page(hva);
> +	struct page *page;
> +	int npages, seg;
> +
> +	while ((seg = next_segment(len, offset)) != 0) {
> +		npages = get_user_pages_unlocked(hva, 1, &page, 0);
> +		if (npages != 1)
> +			return -EFAULT;
> +		memcpy(data, page_address(page) + offset, seg);
> +		put_page(page);
> +		len -= seg;
> +		hva += seg;
> +		offset = 0;
> +	}
> +
> +	return 0;
> +}
> +
> +int copy_to_guest(unsigned long hva, const void *data, int len)
> +{
> +	int offset = offset_in_page(hva);
> +	struct page *page;
> +	int npages, seg;
> +
> +	while ((seg = next_segment(len, offset)) != 0) {
> +		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);
> +		if (npages != 1)
> +			return -EFAULT;
> +		memcpy(page_address(page) + offset, data, seg);
> +		put_page(page);
> +		len -= seg;
> +		hva += seg;
> +		offset = 0;
> +	}
> +	return 0;
> +}
> +
>  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> -				 void *data, int offset, int len)
> +				 void *data, int offset, int len,
> +				 bool protected)
>  {
>  	int r;
>  	unsigned long addr;
> @@ -2257,7 +2297,10 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>  	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
>  	if (kvm_is_error_hva(addr))
>  		return -EFAULT;
> -	r = __copy_from_user(data, (void __user *)addr + offset, len);
> +	if (protected)
> +		r = copy_from_guest(data, addr + offset, len);
> +	else
> +		r = __copy_from_user(data, (void __user *)addr + offset, len);
>  	if (r)
>  		return -EFAULT;
>  	return 0;
> @@ -2268,7 +2311,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>  {
>  	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>  
> -	return __kvm_read_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_read_guest_page(slot, gfn, data, offset, len,
> +				     kvm->mem_protected);
>  }
>  EXPORT_SYMBOL_GPL(kvm_read_guest_page);
>  
> @@ -2277,7 +2321,8 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
>  {
>  	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>  
> -	return __kvm_read_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_read_guest_page(slot, gfn, data, offset, len,
> +				     vcpu->kvm->mem_protected);

Personally, I would've just added a 'struct kvm' pointer to 'struct
kvm_memory_slot' to be able to extract 'mem_protected' info when
needed. This will make the patch much smaller.

>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
>  
> @@ -2350,7 +2395,8 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
>  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
>  
>  static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
> -			          const void *data, int offset, int len)
> +			          const void *data, int offset, int len,
> +				  bool protected)
>  {
>  	int r;
>  	unsigned long addr;
> @@ -2358,7 +2404,11 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
>  	addr = gfn_to_hva_memslot(memslot, gfn);
>  	if (kvm_is_error_hva(addr))
>  		return -EFAULT;
> -	r = __copy_to_user((void __user *)addr + offset, data, len);
> +
> +	if (protected)
> +		r = copy_to_guest(addr + offset, data, len);
> +	else
> +		r = __copy_to_user((void __user *)addr + offset, data, len);

All users of copy_to_guest() will have to have the same 'if (protected)'
check, right? Why not move the check to copy_to/from_guest() then?
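
Something like this, roughly (only a sketch, assuming the helpers can
get at 'struct kvm'; the GUP flags stay as in your patch):

int copy_to_guest(struct kvm *kvm, unsigned long hva,
		  const void *data, int len)
{
	int offset = offset_in_page(hva);
	struct page *page;
	int npages, seg;

	/* Unprotected VMs keep using the plain uaccess path. */
	if (!kvm->mem_protected)
		return __copy_to_user((void __user *)hva, data, len) ?
			-EFAULT : 0;

	while ((seg = next_segment(len, offset)) != 0) {
		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);
		if (npages != 1)
			return -EFAULT;
		memcpy(page_address(page) + offset, data, seg);
		put_page(page);
		len -= seg;
		hva += seg;
		offset = 0;
	}
	return 0;
}

and then the callers do an unconditional

	r = copy_to_guest(kvm, addr + offset, data, len);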

>  	if (r)
>  		return -EFAULT;
>  	mark_page_dirty_in_slot(memslot, gfn);
> @@ -2370,7 +2420,8 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
>  {
>  	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>  
> -	return __kvm_write_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_write_guest_page(slot, gfn, data, offset, len,
> +				      kvm->mem_protected);
>  }
>  EXPORT_SYMBOL_GPL(kvm_write_guest_page);
>  
> @@ -2379,7 +2430,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  {
>  	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>  
> -	return __kvm_write_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_write_guest_page(slot, gfn, data, offset, len,
> +				      vcpu->kvm->mem_protected);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
>  
> @@ -2495,7 +2547,10 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>  	if (unlikely(!ghc->memslot))
>  		return kvm_write_guest(kvm, gpa, data, len);
>  
> -	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
> +	if (kvm->mem_protected)
> +		r = copy_to_guest(ghc->hva + offset, data, len);
> +	else
> +		r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
>  	if (r)
>  		return -EFAULT;
>  	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
> @@ -2530,7 +2585,10 @@ int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>  	if (unlikely(!ghc->memslot))
>  		return kvm_read_guest(kvm, ghc->gpa, data, len);
>  
> -	r = __copy_from_user(data, (void __user *)ghc->hva, len);
> +	if (kvm->mem_protected)
> +		r = copy_from_guest(data, ghc->hva, len);
> +	else
> +		r = __copy_from_user(data, (void __user *)ghc->hva, len);
>  	if (r)
>  		return -EFAULT;

-- 
Vitaly


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 02/16] x86/kvm: Introduce KVM memory protection feature
  2020-05-25 14:58   ` Vitaly Kuznetsov
@ 2020-05-25 15:15     ` Kirill A. Shutemov
  2020-05-27  5:03       ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-25 15:15 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel

On Mon, May 25, 2020 at 04:58:51PM +0200, Vitaly Kuznetsov wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > Provide basic helpers, KVM_FEATURE and a hypercall.
> >
> > Host side doesn't provide the feature yet, so it is dead code for now.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/include/asm/kvm_para.h      |  5 +++++
> >  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
> >  arch/x86/kernel/kvm.c                | 16 ++++++++++++++++
> >  include/uapi/linux/kvm_para.h        |  3 ++-
> >  4 files changed, 25 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index 9b4df6eaa11a..3ce84fc07144 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -10,11 +10,16 @@ extern void kvmclock_init(void);
> >  
> >  #ifdef CONFIG_KVM_GUEST
> >  bool kvm_check_and_clear_guest_paused(void);
> > +bool kvm_mem_protected(void);
> >  #else
> >  static inline bool kvm_check_and_clear_guest_paused(void)
> >  {
> >  	return false;
> >  }
> > +static inline bool kvm_mem_protected(void)
> > +{
> > +	return false;
> > +}
> >  #endif /* CONFIG_KVM_GUEST */
> >  
> >  #define KVM_HYPERCALL \
> > diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> > index 2a8e0b6b9805..c3b499acc98f 100644
> > --- a/arch/x86/include/uapi/asm/kvm_para.h
> > +++ b/arch/x86/include/uapi/asm/kvm_para.h
> > @@ -28,9 +28,10 @@
> >  #define KVM_FEATURE_PV_UNHALT		7
> >  #define KVM_FEATURE_PV_TLB_FLUSH	9
> >  #define KVM_FEATURE_ASYNC_PF_VMEXIT	10
> > -#define KVM_FEATURE_PV_SEND_IPI	11
> > +#define KVM_FEATURE_PV_SEND_IPI		11
> 
> Nit: spurious change
> 

I fixed the indentation while I was there. (Look at the file, not the
diff, to see what I mean.)

> >  #define KVM_FEATURE_POLL_CONTROL	12
> >  #define KVM_FEATURE_PV_SCHED_YIELD	13
> > +#define KVM_FEATURE_MEM_PROTECTED	14
> >  
> >  #define KVM_HINTS_REALTIME      0
> >  
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index 6efe0410fb72..bda761ca0d26 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -35,6 +35,13 @@
> >  #include <asm/tlb.h>
> >  #include <asm/cpuidle_haltpoll.h>
> >  
> > +static bool mem_protected;
> > +
> > +bool kvm_mem_protected(void)
> > +{
> > +	return mem_protected;
> > +}
> > +
> 
> Honestly, I don't see a need for kvm_mem_protected(), just rename the
> bool if you need kvm_ prefix :-)

For !CONFIG_KVM_GUEST it would not be a variable: the helper is a static
inline that returns false. We may want to change it to a static branch or
something in the future.
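
If/when we do that, it would be something like (sketch, needs
<linux/jump_label.h>):

DEFINE_STATIC_KEY_FALSE(kvm_mem_protected_key);

bool kvm_mem_protected(void)
{
	return static_branch_unlikely(&kvm_mem_protected_key);
}

/* in kvm_init_platform(), instead of "mem_protected = true": */
static_branch_enable(&kvm_mem_protected_key);

with the !CONFIG_KVM_GUEST stub staying as is.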

> >  static int kvmapf = 1;
> >  
> >  static int __init parse_no_kvmapf(char *arg)
> > @@ -727,6 +734,15 @@ static void __init kvm_init_platform(void)
> >  {
> >  	kvmclock_init();
> >  	x86_platform.apic_post_init = kvm_apic_init;
> > +
> > +	if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
> > +		if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
> > +			pr_err("Failed to enable KVM memory protection\n");
> > +			return;
> > +		}
> > +
> > +		mem_protected = true;
> > +	}
> >  }
> 
> Personally, I'd prefer to do this via setting a bit in a KVM-specific
> MSR instead. The benefit is that the guest doesn't need to remember if
> it enabled the feature or not, it can always read the config MSR. May
> come in handy for e.g. kexec/kdump.

I think we would need to remember it anyway. Accessing an MSR is somewhat
expensive. But, okay, I can rework it to use an MSR if needed.

Note that we can avoid the enabling altogether if we modify the BIOS to
deal with private/shared memory. Currently the BIOS crashes the system if
we enable the feature from time zero.

> >  const __initconst struct hypervisor_x86 x86_hyper_kvm = {
> > diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
> > index 8b86609849b9..1a216f32e572 100644
> > --- a/include/uapi/linux/kvm_para.h
> > +++ b/include/uapi/linux/kvm_para.h
> > @@ -27,8 +27,9 @@
> >  #define KVM_HC_MIPS_EXIT_VM		7
> >  #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
> >  #define KVM_HC_CLOCK_PAIRING		9
> > -#define KVM_HC_SEND_IPI		10
> > +#define KVM_HC_SEND_IPI			10
> 
> Same spurious change detected.

The same justification :)

> >  #define KVM_HC_SCHED_YIELD		11
> > +#define KVM_HC_ENABLE_MEM_PROTECTED	12
> >  
> >  /*
> >   * hypercalls use architecture specific
> 
> -- 
> Vitaly
> 
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-05-25 15:08   ` Vitaly Kuznetsov
@ 2020-05-25 15:17     ` Kirill A. Shutemov
  2020-06-01 16:35       ` Paolo Bonzini
  0 siblings, 1 reply; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-25 15:17 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel

On Mon, May 25, 2020 at 05:08:43PM +0200, Vitaly Kuznetsov wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > New helpers copy_from_guest()/copy_to_guest() to be used if KVM memory
> > protection feature is enabled.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/kvm_host.h |  4 +++
> >  virt/kvm/kvm_main.c      | 78 ++++++++++++++++++++++++++++++++++------
> >  2 files changed, 72 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 131cc1527d68..bd0bb600f610 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -503,6 +503,7 @@ struct kvm {
> >  	struct srcu_struct srcu;
> >  	struct srcu_struct irq_srcu;
> >  	pid_t userspace_pid;
> > +	bool mem_protected;
> >  };
> >  
> >  #define kvm_err(fmt, ...) \
> > @@ -727,6 +728,9 @@ void kvm_set_pfn_dirty(kvm_pfn_t pfn);
> >  void kvm_set_pfn_accessed(kvm_pfn_t pfn);
> >  void kvm_get_pfn(kvm_pfn_t pfn);
> >  
> > +int copy_from_guest(void *data, unsigned long hva, int len);
> > +int copy_to_guest(unsigned long hva, const void *data, int len);
> > +
> >  void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache);
> >  int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
> >  			int len);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 731c1e517716..033471f71dae 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2248,8 +2248,48 @@ static int next_segment(unsigned long len, int offset)
> >  		return len;
> >  }
> >  
> > +int copy_from_guest(void *data, unsigned long hva, int len)
> > +{
> > +	int offset = offset_in_page(hva);
> > +	struct page *page;
> > +	int npages, seg;
> > +
> > +	while ((seg = next_segment(len, offset)) != 0) {
> > +		npages = get_user_pages_unlocked(hva, 1, &page, 0);
> > +		if (npages != 1)
> > +			return -EFAULT;
> > +		memcpy(data, page_address(page) + offset, seg);
> > +		put_page(page);
> > +		len -= seg;
> > +		hva += seg;
> > +		offset = 0;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int copy_to_guest(unsigned long hva, const void *data, int len)
> > +{
> > +	int offset = offset_in_page(hva);
> > +	struct page *page;
> > +	int npages, seg;
> > +
> > +	while ((seg = next_segment(len, offset)) != 0) {
> > +		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);
> > +		if (npages != 1)
> > +			return -EFAULT;
> > +		memcpy(page_address(page) + offset, data, seg);
> > +		put_page(page);
> > +		len -= seg;
> > +		hva += seg;
> > +		offset = 0;
> > +	}
> > +	return 0;
> > +}
> > +
> >  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > -				 void *data, int offset, int len)
> > +				 void *data, int offset, int len,
> > +				 bool protected)
> >  {
> >  	int r;
> >  	unsigned long addr;
> > @@ -2257,7 +2297,10 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> >  	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
> >  	if (kvm_is_error_hva(addr))
> >  		return -EFAULT;
> > -	r = __copy_from_user(data, (void __user *)addr + offset, len);
> > +	if (protected)
> > +		r = copy_from_guest(data, addr + offset, len);
> > +	else
> > +		r = __copy_from_user(data, (void __user *)addr + offset, len);
> >  	if (r)
> >  		return -EFAULT;
> >  	return 0;
> > @@ -2268,7 +2311,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
> >  {
> >  	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> >  
> > -	return __kvm_read_guest_page(slot, gfn, data, offset, len);
> > +	return __kvm_read_guest_page(slot, gfn, data, offset, len,
> > +				     kvm->mem_protected);
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_read_guest_page);
> >  
> > @@ -2277,7 +2321,8 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
> >  {
> >  	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> >  
> > -	return __kvm_read_guest_page(slot, gfn, data, offset, len);
> > +	return __kvm_read_guest_page(slot, gfn, data, offset, len,
> > +				     vcpu->kvm->mem_protected);
> 
> Personally, I would've just added a 'struct kvm' pointer to 'struct
> kvm_memory_slot' to be able to extract 'mem_protected' info when
> needed. This will make the patch much smaller.

Okay, can do.

The other thing I tried is to have a per-slot flag to indicate that it's
protected. But Sean pointed out that it's an all-or-nothing feature and
having the flag in the slot would be misleading.
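
For reference, roughly what I have in mind (a sketch only, not
necessarily what I'll end up with):

struct kvm_memory_slot {
	...
	/* set in __kvm_set_memory_region() when the slot is created */
	struct kvm *kvm;
};

static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
				 void *data, int offset, int len)
{
	unsigned long addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);

	if (kvm_is_error_hva(addr))
		return -EFAULT;
	if (slot->kvm->mem_protected)
		return copy_from_guest(data, addr + offset, len);
	return __copy_from_user(data, (void __user *)addr + offset, len) ?
		-EFAULT : 0;
}

so the callers no longer need to pass 'mem_protected' down.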

> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
> >  
> > @@ -2350,7 +2395,8 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
> >  
> >  static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
> > -			          const void *data, int offset, int len)
> > +			          const void *data, int offset, int len,
> > +				  bool protected)
> >  {
> >  	int r;
> >  	unsigned long addr;
> > @@ -2358,7 +2404,11 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
> >  	addr = gfn_to_hva_memslot(memslot, gfn);
> >  	if (kvm_is_error_hva(addr))
> >  		return -EFAULT;
> > -	r = __copy_to_user((void __user *)addr + offset, data, len);
> > +
> > +	if (protected)
> > +		r = copy_to_guest(addr + offset, data, len);
> > +	else
> > +		r = __copy_to_user((void __user *)addr + offset, data, len);
> 
> All users of copy_to_guest() will have to have the same 'if (protected)'
> check, right? Why not move the check to copy_to/from_guest() then?

Good point.

> >  	if (r)
> >  		return -EFAULT;
> >  	mark_page_dirty_in_slot(memslot, gfn);
> > @@ -2370,7 +2420,8 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
> >  {
> >  	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> >  
> > -	return __kvm_write_guest_page(slot, gfn, data, offset, len);
> > +	return __kvm_write_guest_page(slot, gfn, data, offset, len,
> > +				      kvm->mem_protected);
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_write_guest_page);
> >  
> > @@ -2379,7 +2430,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> >  {
> >  	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> >  
> > -	return __kvm_write_guest_page(slot, gfn, data, offset, len);
> > +	return __kvm_write_guest_page(slot, gfn, data, offset, len,
> > +				      vcpu->kvm->mem_protected);
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
> >  
> > @@ -2495,7 +2547,10 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
> >  	if (unlikely(!ghc->memslot))
> >  		return kvm_write_guest(kvm, gpa, data, len);
> >  
> > -	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
> > +	if (kvm->mem_protected)
> > +		r = copy_to_guest(ghc->hva + offset, data, len);
> > +	else
> > +		r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
> >  	if (r)
> >  		return -EFAULT;
> >  	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
> > @@ -2530,7 +2585,10 @@ int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
> >  	if (unlikely(!ghc->memslot))
> >  		return kvm_read_guest(kvm, ghc->gpa, data, len);
> >  
> > -	r = __copy_from_user(data, (void __user *)ghc->hva, len);
> > +	if (kvm->mem_protected)
> > +		r = copy_from_guest(data, ghc->hva, len);
> > +	else
> > +		r = __copy_from_user(data, (void __user *)ghc->hva, len);
> >  	if (r)
> >  		return -EFAULT;
> 
> -- 
> Vitaly
> 
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 13/16] x86/kvmclock: Share hvclock memory with the host
  2020-05-22 12:52 ` [RFC 13/16] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
@ 2020-05-25 15:22   ` Vitaly Kuznetsov
  2020-05-25 15:25     ` Kirill A. Shutemov
  0 siblings, 1 reply; 62+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-25 15:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> hvclock is shared between the guest and the hypervisor. It has to be
> accessible by host.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/kernel/kvmclock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 34b18f6eeb2c..ac6c2abe0d0f 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -253,7 +253,7 @@ static void __init kvmclock_init_mem(void)
>  	 * hvclock is shared between the guest and the hypervisor, must
>  	 * be mapped decrypted.
>  	 */
> -	if (sev_active()) {
> +	if (sev_active() || kvm_mem_protected()) {
>  		r = set_memory_decrypted((unsigned long) hvclock_mem,
>  					 1UL << order);
>  		if (r) {

Sorry if I missed something, but we have other structures which a KVM
guest shares with the host,

sev_map_percpu_data():
...
	for_each_possible_cpu(cpu) {
		__set_percpu_decrypted(&per_cpu(apf_reason, cpu), sizeof(apf_reason));
		__set_percpu_decrypted(&per_cpu(steal_time, cpu), sizeof(steal_time));
		__set_percpu_decrypted(&per_cpu(kvm_apic_eoi, cpu), sizeof(kvm_apic_eoi));
	}
...

Do you handle them somehow in the patchset? (I'm probably just blind
failing to see how 'early_set_memory_decrypted()' is wired up)

-- 
Vitaly


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 13/16] x86/kvmclock: Share hvclock memory with the host
  2020-05-25 15:22   ` Vitaly Kuznetsov
@ 2020-05-25 15:25     ` Kirill A. Shutemov
  2020-05-25 15:42       ` Vitaly Kuznetsov
  0 siblings, 1 reply; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-25 15:25 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel

On Mon, May 25, 2020 at 05:22:10PM +0200, Vitaly Kuznetsov wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > hvclock is shared between the guest and the hypervisor. It has to be
> > accessible by host.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/kernel/kvmclock.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index 34b18f6eeb2c..ac6c2abe0d0f 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -253,7 +253,7 @@ static void __init kvmclock_init_mem(void)
> >  	 * hvclock is shared between the guest and the hypervisor, must
> >  	 * be mapped decrypted.
> >  	 */
> > -	if (sev_active()) {
> > +	if (sev_active() || kvm_mem_protected()) {
> >  		r = set_memory_decrypted((unsigned long) hvclock_mem,
> >  					 1UL << order);
> >  		if (r) {
> 
> Sorry if I missed something but we have other structures which KVM guest
> share with the host,
> 
> sev_map_percpu_data():
> ...
> 	for_each_possible_cpu(cpu) {
> 		__set_percpu_decrypted(&per_cpu(apf_reason, cpu), sizeof(apf_reason));
> 		__set_percpu_decrypted(&per_cpu(steal_time, cpu), sizeof(steal_time));
> 		__set_percpu_decrypted(&per_cpu(kvm_apic_eoi, cpu), sizeof(kvm_apic_eoi));
> 	}
> ...
> 
> Do you handle them somehow in the patchset? (I'm probably just blind
> failing to see how 'early_set_memory_decrypted()' is wired up)

I don't handle them yet: I've seen the function, but have not modified it.
I want to understand first why it doesn't blow up for me without the
change. Any clues?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 09/16] KVM: Protected memory extension
  2020-05-22 12:52 ` [RFC 09/16] KVM: Protected memory extension Kirill A. Shutemov
@ 2020-05-25 15:26   ` Vitaly Kuznetsov
  2020-05-25 15:34     ` Kirill A. Shutemov
  0 siblings, 1 reply; 62+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-25 15:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> Add infrastructure that handles protected memory extension.
>
> Arch-specific code has to provide hypercalls and define non-zero
> VM_KVM_PROTECTED.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/kvm_host.h |   4 ++
>  mm/mprotect.c            |   1 +
>  virt/kvm/kvm_main.c      | 131 +++++++++++++++++++++++++++++++++++++++
>  3 files changed, 136 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index bd0bb600f610..d7072f6d6aa0 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -700,6 +700,10 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm);
>  void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
>  				   struct kvm_memory_slot *slot);
>  
> +int kvm_protect_all_memory(struct kvm *kvm);
> +int kvm_protect_memory(struct kvm *kvm,
> +		       unsigned long gfn, unsigned long npages, bool protect);
> +
>  int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
>  			    struct page **pages, int nr_pages);
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 494192ca954b..552be3b4c80a 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -505,6 +505,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
>  	vm_unacct_memory(charged);
>  	return error;
>  }
> +EXPORT_SYMBOL_GPL(mprotect_fixup);
>  
>  /*
>   * pkey==-1 when doing a legacy mprotect()
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 530af95efdf3..07d45da5d2aa 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -155,6 +155,8 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>  static unsigned long long kvm_createvm_count;
>  static unsigned long long kvm_active_vms;
>  
> +static int protect_memory(unsigned long start, unsigned long end, bool protect);
> +
>  __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>  		unsigned long start, unsigned long end, bool blockable)
>  {
> @@ -1309,6 +1311,14 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  	if (r)
>  		goto out_bitmap;
>  
> +	if (mem->memory_size && kvm->mem_protected) {
> +		r = protect_memory(new.userspace_addr,
> +				   new.userspace_addr + new.npages * PAGE_SIZE,
> +				   true);
> +		if (r)
> +			goto out_bitmap;
> +	}
> +
>  	if (old.dirty_bitmap && !new.dirty_bitmap)
>  		kvm_destroy_dirty_bitmap(&old);
>  	return 0;
> @@ -2652,6 +2662,127 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>  
> +static int protect_memory(unsigned long start, unsigned long end, bool protect)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma, *prev;
> +	int ret;
> +
> +	if (down_write_killable(&mm->mmap_sem))
> +		return -EINTR;
> +
> +	ret = -ENOMEM;
> +	vma = find_vma(current->mm, start);
> +	if (!vma)
> +		goto out;
> +
> +	ret = -EINVAL;
> +	if (vma->vm_start > start)
> +		goto out;
> +
> +	if (start > vma->vm_start)
> +		prev = vma;
> +	else
> +		prev = vma->vm_prev;
> +
> +	ret = 0;
> +	while (true) {
> +		unsigned long newflags, tmp;
> +
> +		tmp = vma->vm_end;
> +		if (tmp > end)
> +			tmp = end;
> +
> +		newflags = vma->vm_flags;
> +		if (protect)
> +			newflags |= VM_KVM_PROTECTED;
> +		else
> +			newflags &= ~VM_KVM_PROTECTED;
> +
> +		/* The VMA has been handled as part of other memslot */
> +		if (newflags == vma->vm_flags)
> +			goto next;
> +
> +		ret = mprotect_fixup(vma, &prev, start, tmp, newflags);
> +		if (ret)
> +			goto out;
> +
> +next:
> +		start = tmp;
> +		if (start < prev->vm_end)
> +			start = prev->vm_end;
> +
> +		if (start >= end)
> +			goto out;
> +
> +		vma = prev->vm_next;
> +		if (!vma || vma->vm_start != start) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +	}
> +out:
> +	up_write(&mm->mmap_sem);
> +	return ret;
> +}
> +
> +int kvm_protect_memory(struct kvm *kvm,
> +		       unsigned long gfn, unsigned long npages, bool protect)
> +{
> +	struct kvm_memory_slot *memslot;
> +	unsigned long start, end;
> +	gfn_t numpages;
> +
> +	if (!VM_KVM_PROTECTED)
> +		return -KVM_ENOSYS;
> +
> +	if (!npages)
> +		return 0;
> +
> +	memslot = gfn_to_memslot(kvm, gfn);
> +	/* Not backed by memory. It's okay. */
> +	if (!memslot)
> +		return 0;
> +
> +	start = gfn_to_hva_many(memslot, gfn, &numpages);
> +	end = start + npages * PAGE_SIZE;
> +
> +	/* XXX: Share range across memory slots? */
> +	if (WARN_ON(numpages < npages))
> +		return -EINVAL;
> +
> +	return protect_memory(start, end, protect);
> +}
> +EXPORT_SYMBOL_GPL(kvm_protect_memory);
> +
> +int kvm_protect_all_memory(struct kvm *kvm)
> +{
> +	struct kvm_memslots *slots;
> +	struct kvm_memory_slot *memslot;
> +	unsigned long start, end;
> > +	int i, ret = 0;
> +
> +	if (!VM_KVM_PROTECTED)
> +		return -KVM_ENOSYS;
> +
> +	mutex_lock(&kvm->slots_lock);
> +	kvm->mem_protected = true;

What will happen upon guest reboot? Do we need to unprotect everything
to make sure we'll be able to boot? Also, after the reboot how will the
guest know that it is protected and needs to unprotect things? -> see my
idea about converting KVM_HC_ENABLE_MEM_PROTECTED to a stateful MSR (but
we'll likely have to reset it upon reboot anyway).

> +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +		slots = __kvm_memslots(kvm, i);
> +		kvm_for_each_memslot(memslot, slots) {
> +			start = memslot->userspace_addr;
> +			end = start + memslot->npages * PAGE_SIZE;
> +			ret = protect_memory(start, end, true);
> +			if (ret)
> +				goto out;
> +		}
> +	}
> +out:
> +	mutex_unlock(&kvm->slots_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(kvm_protect_all_memory);
> +
>  void kvm_sigset_activate(struct kvm_vcpu *vcpu)
>  {
>  	if (!vcpu->sigset_active)

-- 
Vitaly


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 10/16] KVM: x86: Enabled protected memory extension
  2020-05-22 12:52 ` [RFC 10/16] KVM: x86: Enabled protected " Kirill A. Shutemov
@ 2020-05-25 15:26   ` Vitaly Kuznetsov
  2020-05-26  6:16   ` Mike Rapoport
  1 sibling, 0 replies; 62+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-25 15:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> Wire up hypercalls for the feature and define VM_KVM_PROTECTED.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/Kconfig     | 1 +
>  arch/x86/kvm/cpuid.c | 3 +++
>  arch/x86/kvm/x86.c   | 9 +++++++++
>  include/linux/mm.h   | 4 ++++
>  4 files changed, 17 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 58dd44a1b92f..420e3947f0c6 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -801,6 +801,7 @@ config KVM_GUEST
>  	select ARCH_CPUIDLE_HALTPOLL
>  	select X86_MEM_ENCRYPT_COMMON
>  	select SWIOTLB
> +	select ARCH_USES_HIGH_VMA_FLAGS
>  	default y
>  	---help---
>  	  This option enables various optimizations for running under the KVM
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 901cd1fdecd9..94cc5e45467e 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -714,6 +714,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  			     (1 << KVM_FEATURE_POLL_CONTROL) |
>  			     (1 << KVM_FEATURE_PV_SCHED_YIELD);
>  
> +		if (VM_KVM_PROTECTED)
> +			entry->eax |=(1 << KVM_FEATURE_MEM_PROTECTED);

Nit: missing space.

> +
>  		if (sched_info_on())
>  			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c17e6eb9ad43..acba0ac07f61 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7598,6 +7598,15 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  		kvm_sched_yield(vcpu->kvm, a0);
>  		ret = 0;
>  		break;
> +	case KVM_HC_ENABLE_MEM_PROTECTED:
> +		ret = kvm_protect_all_memory(vcpu->kvm);
> +		break;
> +	case KVM_HC_MEM_SHARE:
> +		ret = kvm_protect_memory(vcpu->kvm, a0, a1, false);
> +		break;
> +	case KVM_HC_MEM_UNSHARE:
> +		ret = kvm_protect_memory(vcpu->kvm, a0, a1, true);
> +		break;
>  	default:
>  		ret = -KVM_ENOSYS;
>  		break;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4f7195365cc0..6eb771c14968 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -329,7 +329,11 @@ extern unsigned int kobjsize(const void *objp);
>  # define VM_MAPPED_COPY	VM_ARCH_1	/* T if mapped copy of data (nommu mmap) */
>  #endif
>  
> +#if defined(CONFIG_X86_64) && defined(CONFIG_KVM)
> +#define VM_KVM_PROTECTED VM_HIGH_ARCH_4
> +#else
>  #define VM_KVM_PROTECTED 0
> +#endif
>  
>  #ifndef VM_GROWSUP
>  # define VM_GROWSUP	VM_NONE

-- 
Vitaly


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 09/16] KVM: Protected memory extension
  2020-05-25 15:26   ` Vitaly Kuznetsov
@ 2020-05-25 15:34     ` Kirill A. Shutemov
  2020-06-03  1:34       ` Huang, Kai
  0 siblings, 1 reply; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-25 15:34 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel

On Mon, May 25, 2020 at 05:26:37PM +0200, Vitaly Kuznetsov wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > Add infrastructure that handles protected memory extension.
> >
> > Arch-specific code has to provide hypercalls and define non-zero
> > VM_KVM_PROTECTED.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/kvm_host.h |   4 ++
> >  mm/mprotect.c            |   1 +
> >  virt/kvm/kvm_main.c      | 131 +++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 136 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index bd0bb600f610..d7072f6d6aa0 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -700,6 +700,10 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm);
> >  void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> >  				   struct kvm_memory_slot *slot);
> >  
> > +int kvm_protect_all_memory(struct kvm *kvm);
> > +int kvm_protect_memory(struct kvm *kvm,
> > +		       unsigned long gfn, unsigned long npages, bool protect);
> > +
> >  int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
> >  			    struct page **pages, int nr_pages);
> >  
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 494192ca954b..552be3b4c80a 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -505,6 +505,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> >  	vm_unacct_memory(charged);
> >  	return error;
> >  }
> > +EXPORT_SYMBOL_GPL(mprotect_fixup);
> >  
> >  /*
> >   * pkey==-1 when doing a legacy mprotect()
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 530af95efdf3..07d45da5d2aa 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -155,6 +155,8 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
> >  static unsigned long long kvm_createvm_count;
> >  static unsigned long long kvm_active_vms;
> >  
> > +static int protect_memory(unsigned long start, unsigned long end, bool protect);
> > +
> >  __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> >  		unsigned long start, unsigned long end, bool blockable)
> >  {
> > @@ -1309,6 +1311,14 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  	if (r)
> >  		goto out_bitmap;
> >  
> > +	if (mem->memory_size && kvm->mem_protected) {
> > +		r = protect_memory(new.userspace_addr,
> > +				   new.userspace_addr + new.npages * PAGE_SIZE,
> > +				   true);
> > +		if (r)
> > +			goto out_bitmap;
> > +	}
> > +
> >  	if (old.dirty_bitmap && !new.dirty_bitmap)
> >  		kvm_destroy_dirty_bitmap(&old);
> >  	return 0;
> > @@ -2652,6 +2662,127 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
> >  
> > +static int protect_memory(unsigned long start, unsigned long end, bool protect)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +	struct vm_area_struct *vma, *prev;
> > +	int ret;
> > +
> > +	if (down_write_killable(&mm->mmap_sem))
> > +		return -EINTR;
> > +
> > +	ret = -ENOMEM;
> > +	vma = find_vma(current->mm, start);
> > +	if (!vma)
> > +		goto out;
> > +
> > +	ret = -EINVAL;
> > +	if (vma->vm_start > start)
> > +		goto out;
> > +
> > +	if (start > vma->vm_start)
> > +		prev = vma;
> > +	else
> > +		prev = vma->vm_prev;
> > +
> > +	ret = 0;
> > +	while (true) {
> > +		unsigned long newflags, tmp;
> > +
> > +		tmp = vma->vm_end;
> > +		if (tmp > end)
> > +			tmp = end;
> > +
> > +		newflags = vma->vm_flags;
> > +		if (protect)
> > +			newflags |= VM_KVM_PROTECTED;
> > +		else
> > +			newflags &= ~VM_KVM_PROTECTED;
> > +
> > +		/* The VMA has been handled as part of other memslot */
> > +		if (newflags == vma->vm_flags)
> > +			goto next;
> > +
> > +		ret = mprotect_fixup(vma, &prev, start, tmp, newflags);
> > +		if (ret)
> > +			goto out;
> > +
> > +next:
> > +		start = tmp;
> > +		if (start < prev->vm_end)
> > +			start = prev->vm_end;
> > +
> > +		if (start >= end)
> > +			goto out;
> > +
> > +		vma = prev->vm_next;
> > +		if (!vma || vma->vm_start != start) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +	}
> > +out:
> > +	up_write(&mm->mmap_sem);
> > +	return ret;
> > +}
> > +
> > +int kvm_protect_memory(struct kvm *kvm,
> > +		       unsigned long gfn, unsigned long npages, bool protect)
> > +{
> > +	struct kvm_memory_slot *memslot;
> > +	unsigned long start, end;
> > +	gfn_t numpages;
> > +
> > +	if (!VM_KVM_PROTECTED)
> > +		return -KVM_ENOSYS;
> > +
> > +	if (!npages)
> > +		return 0;
> > +
> > +	memslot = gfn_to_memslot(kvm, gfn);
> > +	/* Not backed by memory. It's okay. */
> > +	if (!memslot)
> > +		return 0;
> > +
> > +	start = gfn_to_hva_many(memslot, gfn, &numpages);
> > +	end = start + npages * PAGE_SIZE;
> > +
> > +	/* XXX: Share range across memory slots? */
> > +	if (WARN_ON(numpages < npages))
> > +		return -EINVAL;
> > +
> > +	return protect_memory(start, end, protect);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_protect_memory);
> > +
> > +int kvm_protect_all_memory(struct kvm *kvm)
> > +{
> > +	struct kvm_memslots *slots;
> > +	struct kvm_memory_slot *memslot;
> > +	unsigned long start, end;
> > > +	int i, ret = 0;
> > +
> > +	if (!VM_KVM_PROTECTED)
> > +		return -KVM_ENOSYS;
> > +
> > +	mutex_lock(&kvm->slots_lock);
> > +	kvm->mem_protected = true;
> 
> What will happen upon guest reboot? Do we need to unprotect everything
> to make sure we'll be able to boot? Also, after the reboot how will the
> guest know that it is protected and needs to unprotect things? -> see my
> idea about converting KVM_HC_ENABLE_MEM_PROTECTED to a stateful MSR (but
> we'll likely have to reset it upon reboot anyway).

That's an extremely good question. I have not considered reboot. I tend
to use -no-reboot in my setup.

I'll think about how to deal with reboot. I don't know how it works now
well enough to give a good answer.

There may not be a good solution: unprotecting memory on reboot means we
expose user data. We can wipe the data before unprotecting, but we should
not wipe the BIOS and anything else that is required on reboot. I don't
know.

> > +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > +		slots = __kvm_memslots(kvm, i);
> > +		kvm_for_each_memslot(memslot, slots) {
> > +			start = memslot->userspace_addr;
> > +			end = start + memslot->npages * PAGE_SIZE;
> > +			ret = protect_memory(start, end, true);
> > +			if (ret)
> > +				goto out;
> > +		}
> > +	}
> > +out:
> > +	mutex_unlock(&kvm->slots_lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_protect_all_memory);
> > +
> >  void kvm_sigset_activate(struct kvm_vcpu *vcpu)
> >  {
> >  	if (!vcpu->sigset_active)
> 
> -- 
> Vitaly
> 
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 13/16] x86/kvmclock: Share hvclock memory with the host
  2020-05-25 15:25     ` Kirill A. Shutemov
@ 2020-05-25 15:42       ` Vitaly Kuznetsov
  0 siblings, 0 replies; 62+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-25 15:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Mon, May 25, 2020 at 05:22:10PM +0200, Vitaly Kuznetsov wrote:
>> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>> 
>> > hvclock is shared between the guest and the hypervisor. It has to be
>> > accessible by host.
>> >
>> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > ---
>> >  arch/x86/kernel/kvmclock.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>> > index 34b18f6eeb2c..ac6c2abe0d0f 100644
>> > --- a/arch/x86/kernel/kvmclock.c
>> > +++ b/arch/x86/kernel/kvmclock.c
>> > @@ -253,7 +253,7 @@ static void __init kvmclock_init_mem(void)
>> >  	 * hvclock is shared between the guest and the hypervisor, must
>> >  	 * be mapped decrypted.
>> >  	 */
>> > -	if (sev_active()) {
>> > +	if (sev_active() || kvm_mem_protected()) {
>> >  		r = set_memory_decrypted((unsigned long) hvclock_mem,
>> >  					 1UL << order);
>> >  		if (r) {
>> 
>> Sorry if I missed something but we have other structures which KVM guest
>> share with the host,
>> 
>> sev_map_percpu_data():
>> ...
>> 	for_each_possible_cpu(cpu) {
>> 		__set_percpu_decrypted(&per_cpu(apf_reason, cpu), sizeof(apf_reason));
>> 		__set_percpu_decrypted(&per_cpu(steal_time, cpu), sizeof(steal_time));
>> 		__set_percpu_decrypted(&per_cpu(kvm_apic_eoi, cpu), sizeof(kvm_apic_eoi));
>> 	}
>> ...
>> 
>> Do you handle them somehow in the patchset? (I'm probably just blind
>> failing to see how 'early_set_memory_decrypted()' is wired up)
>
> I don't handle them yet: I've seen the function, but have not modified it.
> I want to understand first why it doesn't blow up for me without the
> change. Any clues?

(if I got the idea of the patchset right) these features are kernel-only
(e.g. QEMU doesn't need to access these areas). E.g. for APF, KVM will do
kvm_write_guest_cached() and this will use FOLL_KVM. Guests should not
rely on that and should mark all shared areas as unprotected.
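
I.e. something in this direction (just a sketch; it assumes
early_set_memory_decrypted() gets wired up to the share hypercall the
same way set_memory_decrypted() is):

static void __init sev_map_percpu_data(void)
{
	int cpu;

	/* Share the PV areas for protected KVM guests too, not only SEV. */
	if (!sev_active() && !kvm_mem_protected())
		return;

	for_each_possible_cpu(cpu) {
		__set_percpu_decrypted(&per_cpu(apf_reason, cpu),
				       sizeof(apf_reason));
		__set_percpu_decrypted(&per_cpu(steal_time, cpu),
				       sizeof(steal_time));
		__set_percpu_decrypted(&per_cpu(kvm_apic_eoi, cpu),
				       sizeof(kvm_apic_eoi));
	}
}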

-- 
Vitaly


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-05-25 14:46   ` Kirill A. Shutemov
@ 2020-05-25 15:56     ` Liran Alon
  0 siblings, 0 replies; 62+ messages in thread
From: Liran Alon @ 2020-05-25 15:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov


On 25/05/2020 17:46, Kirill A. Shutemov wrote:
> On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
>> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
>>> == Background / Problem ==
>>>
>>> There are a number of hardware features (MKTME, SEV) which protect guest
>>> memory from some unauthorized host access. The patchset proposes a purely
>>> software feature that mitigates some of the same host-side read-only
>>> attacks.
>>>
>>>
>>> == What does this set mitigate? ==
>>>
>>>    - Host kernel ”accidental” access to guest data (think speculation)
>> Just to clarify: This is any host kernel memory info-leak vulnerability. Not
>> just speculative execution memory info-leaks. Also architectural ones.
>>
>> In addition, note that removing guest data from host kernel VA space also
>> makes guest<->host memory exploits more difficult.
>> E.g. Guest cannot use already available memory buffer in kernel VA space for
>> ROP or placing valuable guest-controlled code/data in general.
>>
>>>    - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>>>
>>>    - Host userspace access to guest data (compromised qemu)
>> I don't quite understand what is the benefit of preventing userspace VMM
>> access to guest data while the host kernel can still access it.
> Let me clarify: the guest memory mapped into host userspace is not
> accessible by either the host kernel or userspace. The host still has a
> way to access it via a new interface: GUP(FOLL_KVM). The GUP will give
> you a struct page that the kernel has to map (temporarily) if it needs
> to access the data. So only blessed codepaths would know how to deal
> with the memory.
Yes, I understood that. I meant explicit host kernel access.
>
> It can help prevent some host->guest attacks on a compromised host.
> Like, if a VM has successfully attacked the host it cannot attack other
> VMs as easily.

We have mechanisms to sandbox the userspace VMM process for that.

You need to be more specific about which attack scenario you are trying
to address here that is not covered by existing mechanisms, i.e. be
crystal clear about the extra value of not exposing guest data to the
userspace VMM.

>
> It would also help to protect against guest->host attacks by removing
> one more place where the guest's data is mapped on the host.
Because the guest has an explicit interface to request which guest pages
can be mapped in the userspace VMM, the value of this is very small.

The guest already has the ability to map guest-controlled code/data in
the userspace VMM, either via this interface or by forcing the userspace
VMM to create various objects during device emulation handling. The only
extra property this patch-series provides is that only a small portion of
guest pages will be mapped to host userspace instead of all of it,
resulting in smaller regions for exploits that require guessing a virtual
address. But: (a) Userspace VMM device emulation may still allow the
guest to spray the userspace heap with objects containing
guest-controlled data. (b) How is the userspace VMM supposed to limit
which guest pages should not be mapped to the userspace VMM even though
the guest has explicitly requested them to be mapped? (E.g. because they
are valid DMA sources/targets for virtual devices, or because it's a vGPU
frame-buffer.)
>> QEMU is more easily compromised than the host kernel because it's
>> guest<->host attack surface is larger (E.g. Various device emulation).
>> But this compromise comes from the guest itself. Not other guests. In
>> contrast to host kernel attack surface, which an info-leak there can
>> be exploited from one guest to leak another guest data.
> Consider the case when unprivileged guest user exploits bug in a QEMU
> device emulation to gain access to data it cannot normally have access
> within the guest. With the feature it would able to see only other shared
> regions of guest memory such as DMA and IO buffers, but not the rest.
This is a scenario where an unprivileged guest userspace process has
direct access to a virtual device and is able to exploit a bug in the
device emulation handling in a way that allows it to compromise security
*inside* the guest, i.e. leak guest kernel data or other guest userspace
processes' data.

That's true. Good point. This is a very important missing argument from
the cover-letter.

Now the trade-off considered here is crystal clear:
Is the extra complication and perf cost of the mechanism in this
patch-series worth it to protect against the scenario of a userspace VMM
vulnerability, reachable by an unprivileged guest userspace process,
being used to leak other *in-guest* data that is not otherwise accessible
to that process?

-Liran



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-05-22 12:52 ` [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory Kirill A. Shutemov
  2020-05-25 15:08   ` Vitaly Kuznetsov
@ 2020-05-26  6:14   ` Mike Rapoport
  2020-05-26 21:56     ` Kirill A. Shutemov
  2020-05-29 15:24   ` Kees Cook
  2 siblings, 1 reply; 62+ messages in thread
From: Mike Rapoport @ 2020-05-26  6:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Fri, May 22, 2020 at 03:52:04PM +0300, Kirill A. Shutemov wrote:
> New helpers copy_from_guest()/copy_to_guest() to be used if KVM memory
> protection feature is enabled.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/kvm_host.h |  4 +++
>  virt/kvm/kvm_main.c      | 78 ++++++++++++++++++++++++++++++++++------
>  2 files changed, 72 insertions(+), 10 deletions(-)
> 
>  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> -				 void *data, int offset, int len)
> +				 void *data, int offset, int len,
> +				 bool protected)
>  {
>  	int r;
>  	unsigned long addr;
> @@ -2257,7 +2297,10 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>  	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
>  	if (kvm_is_error_hva(addr))
>  		return -EFAULT;
> -	r = __copy_from_user(data, (void __user *)addr + offset, len);
> +	if (protected)
> +		r = copy_from_guest(data, addr + offset, len);
> +	else
> +		r = __copy_from_user(data, (void __user *)addr + offset, len);

Maybe always use copy_{from,to}_guest() and move the 'if (protected)'
there?
If kvm is added to the memory slot, it can then be passed to
copy_{to,from}_guest.
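
Something along these lines, perhaps (completely untested sketch; it
assumes the memslot grows a back-pointer to its kvm, which it doesn't
have today):

static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
				 void *data, int offset, int len)
{
	unsigned long addr;
	int r;

	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
	if (kvm_is_error_hva(addr))
		return -EFAULT;
	/* the 'if (protected)' lives inside the helper now */
	r = copy_from_guest(slot->kvm, data, addr + offset, len);
	if (r)
		return -EFAULT;
	return 0;
}

int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len)
{
	if (!kvm->mem_protected)
		return __copy_from_user(data, (void __user *)hva, len);

	/* ... otherwise the GUP-based loop from this patch ... */
}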

>  	if (r)
>  		return -EFAULT;
>  	return 0;
> @@ -2268,7 +2311,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>  {
>  	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>  
> -	return __kvm_read_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_read_guest_page(slot, gfn, data, offset, len,
> +				     kvm->mem_protected);
>  }
>  EXPORT_SYMBOL_GPL(kvm_read_guest_page);
>  
> @@ -2277,7 +2321,8 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
>  {
>  	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>  
> -	return __kvm_read_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_read_guest_page(slot, gfn, data, offset, len,
> +				     vcpu->kvm->mem_protected);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
>  

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 07/16] KVM: mm: Introduce VM_KVM_PROTECTED
  2020-05-22 12:52 ` [RFC 07/16] KVM: mm: Introduce VM_KVM_PROTECTED Kirill A. Shutemov
@ 2020-05-26  6:15   ` Mike Rapoport
  2020-05-26 22:01     ` Kirill A. Shutemov
  2020-05-26  6:40   ` John Hubbard
  1 sibling, 1 reply; 62+ messages in thread
From: Mike Rapoport @ 2020-05-26  6:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Fri, May 22, 2020 at 03:52:05PM +0300, Kirill A. Shutemov wrote:
> The new VMA flag that indicate a VMA that is not accessible to userspace
> but usable by kernel with GUP if FOLL_KVM is specified.
> 
> The FOLL_KVM is only used in the KVM code. The code has to know how to
> deal with such pages.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/mm.h  |  8 ++++++++
>  mm/gup.c            | 20 ++++++++++++++++----
>  mm/huge_memory.c    | 20 ++++++++++++++++----
>  mm/memory.c         |  3 +++
>  mm/mmap.c           |  3 +++
>  virt/kvm/async_pf.c |  4 ++--
>  virt/kvm/kvm_main.c |  9 +++++----
>  7 files changed, 53 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e1882eec1752..4f7195365cc0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -329,6 +329,8 @@ extern unsigned int kobjsize(const void *objp);
>  # define VM_MAPPED_COPY	VM_ARCH_1	/* T if mapped copy of data (nommu mmap) */
>  #endif
>  
> +#define VM_KVM_PROTECTED 0

With all the ideas about removing pages from the direct map floating
around I wouldn't limit this to KVM.

VM_NOT_IN_DIRECT_MAP would describe such areas better, but I realise
it's very far from perfect and nothing better comes to mind :)


>  #ifndef VM_GROWSUP
>  # define VM_GROWSUP	VM_NONE
>  #endif
> @@ -646,6 +648,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_ACCESS_FLAGS;
>  }
>  
> +static inline bool vma_is_kvm_protected(struct vm_area_struct *vma)

Ditto

> +{
> +	return vma->vm_flags & VM_KVM_PROTECTED;
> +}
> +
>  #ifdef CONFIG_SHMEM
>  /*
>   * The vma_is_shmem is not inline because it is used only by slow
> @@ -2773,6 +2780,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>  #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
>  #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
>  #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
> +#define FOLL_KVM	0x80000 /* access to VM_KVM_PROTECTED VMAs */

Maybe

FOLL_DM		0x80000  /* access  memory dropped from the direct map */

>  /*
>   * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
> diff --git a/mm/gup.c b/mm/gup.c
> index 87a6a59fe667..bd7b9484b35a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c

...

> diff --git a/mm/mmap.c b/mm/mmap.c
> index f609e9ec4a25..d56c3f6efc99 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -112,6 +112,9 @@ pgprot_t vm_get_page_prot(unsigned long vm_flags)
>  				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
>  			pgprot_val(arch_vm_get_page_prot(vm_flags)));
>  
> +	if (vm_flags & VM_KVM_PROTECTED)
> +		ret = PAGE_NONE;

Nit: vma_is_kvm_protected()?

> +
>  	return arch_filter_pgprot(ret);
>  }
>  EXPORT_SYMBOL(vm_get_page_prot);
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 15e5b037f92d..7663e962510a 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -60,8 +60,8 @@ static void async_pf_execute(struct work_struct *work)
>  	 * access remotely.
>  	 */
>  	down_read(&mm->mmap_sem);
> -	get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
> -			&locked);
> +	get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE | FOLL_KVM, NULL,
> +			      NULL, &locked);
>  	if (locked)
>  		up_read(&mm->mmap_sem);
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 033471f71dae..530af95efdf3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1727,7 +1727,7 @@ unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *w
>  
>  static inline int check_user_page_hwpoison(unsigned long addr)
>  {
> -	int rc, flags = FOLL_HWPOISON | FOLL_WRITE;
> +	int rc, flags = FOLL_HWPOISON | FOLL_WRITE | FOLL_KVM;
>  
>  	rc = get_user_pages(addr, 1, flags, NULL, NULL);
>  	return rc == -EHWPOISON;
> @@ -1771,7 +1771,7 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
>  static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>  			   bool *writable, kvm_pfn_t *pfn)
>  {
> -	unsigned int flags = FOLL_HWPOISON;
> +	unsigned int flags = FOLL_HWPOISON | FOLL_KVM;
>  	struct page *page;
>  	int npages = 0;
>  
> @@ -2255,7 +2255,7 @@ int copy_from_guest(void *data, unsigned long hva, int len)
>  	int npages, seg;
>  
>  	while ((seg = next_segment(len, offset)) != 0) {
> -		npages = get_user_pages_unlocked(hva, 1, &page, 0);
> +		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_KVM);
>  		if (npages != 1)
>  			return -EFAULT;
>  		memcpy(data, page_address(page) + offset, seg);
> @@ -2275,7 +2275,8 @@ int copy_to_guest(unsigned long hva, const void *data, int len)
>  	int npages, seg;
>  
>  	while ((seg = next_segment(len, offset)) != 0) {
> -		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);
> +		npages = get_user_pages_unlocked(hva, 1, &page,
> +						 FOLL_WRITE | FOLL_KVM);
>  		if (npages != 1)
>  			return -EFAULT;
>  		memcpy(page_address(page) + offset, data, seg);
> -- 
> 2.26.2
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 10/16] KVM: x86: Enabled protected memory extension
  2020-05-22 12:52 ` [RFC 10/16] KVM: x86: Enabled protected " Kirill A. Shutemov
  2020-05-25 15:26   ` Vitaly Kuznetsov
@ 2020-05-26  6:16   ` Mike Rapoport
  2020-05-26 21:58     ` Kirill A. Shutemov
  1 sibling, 1 reply; 62+ messages in thread
From: Mike Rapoport @ 2020-05-26  6:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Fri, May 22, 2020 at 03:52:08PM +0300, Kirill A. Shutemov wrote:
> Wire up hypercalls for the feature and define VM_KVM_PROTECTED.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/Kconfig     | 1 +
>  arch/x86/kvm/cpuid.c | 3 +++
>  arch/x86/kvm/x86.c   | 9 +++++++++
>  include/linux/mm.h   | 4 ++++
>  4 files changed, 17 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 58dd44a1b92f..420e3947f0c6 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -801,6 +801,7 @@ config KVM_GUEST
>  	select ARCH_CPUIDLE_HALTPOLL
>  	select X86_MEM_ENCRYPT_COMMON
>  	select SWIOTLB
> +	select ARCH_USES_HIGH_VMA_FLAGS
>  	default y
>  	---help---
>  	  This option enables various optimizations for running under the KVM
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 901cd1fdecd9..94cc5e45467e 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -714,6 +714,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  			     (1 << KVM_FEATURE_POLL_CONTROL) |
>  			     (1 << KVM_FEATURE_PV_SCHED_YIELD);
>  
> +		if (VM_KVM_PROTECTED)
> +			entry->eax |=(1 << KVM_FEATURE_MEM_PROTECTED);
> +
>  		if (sched_info_on())
>  			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c17e6eb9ad43..acba0ac07f61 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7598,6 +7598,15 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  		kvm_sched_yield(vcpu->kvm, a0);
>  		ret = 0;
>  		break;
> +	case KVM_HC_ENABLE_MEM_PROTECTED:
> +		ret = kvm_protect_all_memory(vcpu->kvm);
> +		break;
> +	case KVM_HC_MEM_SHARE:
> +		ret = kvm_protect_memory(vcpu->kvm, a0, a1, false);
> +		break;
> +	case KVM_HC_MEM_UNSHARE:
> +		ret = kvm_protect_memory(vcpu->kvm, a0, a1, true);
> +		break;
>  	default:
>  		ret = -KVM_ENOSYS;
>  		break;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4f7195365cc0..6eb771c14968 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -329,7 +329,11 @@ extern unsigned int kobjsize(const void *objp);
>  # define VM_MAPPED_COPY	VM_ARCH_1	/* T if mapped copy of data (nommu mmap) */
>  #endif
>  
> +#if defined(CONFIG_X86_64) && defined(CONFIG_KVM)

This would be better spelled as ARCH_WANTS_PROTECTED_MEMORY, IMHO.

> +#define VM_KVM_PROTECTED VM_HIGH_ARCH_4

Maybe this should be VM_HIGH_ARCH_5 so that powerpc could enable this
feature eventually?
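
i.e. something along these lines (untested; it also assumes a new
VM_HIGH_ARCH_5/VM_HIGH_ARCH_BIT_5 pair gets added next to the existing
ones):

# Kconfig, selected by the arch/KVM code that implements the feature:
config ARCH_WANTS_PROTECTED_MEMORY
	bool

/* include/linux/mm.h: */
#ifdef CONFIG_ARCH_WANTS_PROTECTED_MEMORY
# define VM_KVM_PROTECTED	VM_HIGH_ARCH_5
#else
# define VM_KVM_PROTECTED	0
#endif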

> +#else
>  #define VM_KVM_PROTECTED 0
> +#endif
>  
>  #ifndef VM_GROWSUP
>  # define VM_GROWSUP	VM_NONE
> -- 
> 2.26.2
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 16/16] KVM: Unmap protected pages from direct mapping
  2020-05-22 12:52 ` [RFC 16/16] KVM: Unmap protected pages from direct mapping Kirill A. Shutemov
@ 2020-05-26  6:16   ` Mike Rapoport
  2020-05-26 22:10     ` Kirill A. Shutemov
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Rapoport @ 2020-05-26  6:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Fri, May 22, 2020 at 03:52:14PM +0300, Kirill A. Shutemov wrote:
> If the protected memory feature enabled, unmap guest memory from
> kernel's direct mappings.
> 
> Migration and KSM is disabled for protected memory as it would require a
> special treatment.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/mm/pat/set_memory.c |  1 +
>  include/linux/kvm_host.h     |  3 ++
>  mm/huge_memory.c             |  9 +++++
>  mm/ksm.c                     |  3 ++
>  mm/memory.c                  | 13 +++++++
>  mm/rmap.c                    |  4 ++
>  virt/kvm/kvm_main.c          | 74 ++++++++++++++++++++++++++++++++++++
>  7 files changed, 107 insertions(+)
> 
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 6f075766bb94..13988413af40 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2227,6 +2227,7 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
>  
>  	arch_flush_lazy_mmu_mode();
>  }
> +EXPORT_SYMBOL_GPL(__kernel_map_pages);
>  
>  #ifdef CONFIG_HIBERNATION
>  bool kernel_page_present(struct page *page)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index b6944f88033d..e1d7762b615c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -705,6 +705,9 @@ int kvm_protect_all_memory(struct kvm *kvm);
>  int kvm_protect_memory(struct kvm *kvm,
>  		       unsigned long gfn, unsigned long npages, bool protect);
>  
> +void kvm_map_page(struct page *page, int nr_pages);
> +void kvm_unmap_page(struct page *page, int nr_pages);
> +
>  int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
>  			    struct page **pages, int nr_pages);
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c3562648a4ef..d8a444a401cc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -33,6 +33,7 @@
>  #include <linux/oom.h>
>  #include <linux/numa.h>
>  #include <linux/page_owner.h>
> +#include <linux/kvm_host.h>

This does not seem right... 

>  #include <asm/tlb.h>
>  #include <asm/pgalloc.h>
> @@ -650,6 +651,10 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>  		spin_unlock(vmf->ptl);
>  		count_vm_event(THP_FAULT_ALLOC);
>  		count_memcg_events(memcg, THP_FAULT_ALLOC, 1);
> +
> +		/* Unmap page from direct mapping */
> +		if (vma_is_kvm_protected(vma))
> +			kvm_unmap_page(page, HPAGE_PMD_NR);

... and neither does this.

I think the map/unmap primitives should be a part of the generic mm and
not buried inside KVM.
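
e.g. something like this (untested, names made up), with the zap/fault
paths in mm/memory.c and mm/huge_memory.c calling it directly, so they
don't need kvm_host.h at all:

/* mm/ side, next to the existing kernel_map_pages() machinery: */
void direct_map_pages(struct page *page, int nr_pages)
{
	kernel_map_pages(page, nr_pages, 1);
}

void direct_unmap_pages(struct page *page, int nr_pages)
{
	kernel_map_pages(page, nr_pages, 0);
}

/* e.g. in __do_huge_pmd_anonymous_page(): */
if (vma_is_kvm_protected(vma))
	direct_unmap_pages(page, HPAGE_PMD_NR);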

>  	}
>  
>  	return 0;
> @@ -1886,6 +1891,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			page_remove_rmap(page, true);
>  			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>  			VM_BUG_ON_PAGE(!PageHead(page), page);
> +
> +			/* Map the page back to the direct mapping */
> +			if (vma_is_kvm_protected(vma))
> +				kvm_map_page(page, HPAGE_PMD_NR);
>  		} else if (thp_migration_supported()) {
>  			swp_entry_t entry;
>  
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 281c00129a2e..942b88782ac2 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -527,6 +527,9 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
>  		return NULL;
>  	if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
>  		return NULL;
> +	/* TODO */

Probably this is not something that should be done. For a
security-sensitive environment that wants protected memory, KSM wouldn't
be relevant anyway...

> +	if (vma_is_kvm_protected(vma))
> +		return NULL;
>  	return vma;
>  }
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index d7228db6e4bf..74773229b854 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -71,6 +71,7 @@
>  #include <linux/dax.h>
>  #include <linux/oom.h>
>  #include <linux/numa.h>
> +#include <linux/kvm_host.h>

The same comment as in mm/huge_memory.c. I don't think that generic mm
should depend on KVM.

>  #include <trace/events/kmem.h>
>  
> @@ -1088,6 +1089,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  				    likely(!(vma->vm_flags & VM_SEQ_READ)))
>  					mark_page_accessed(page);
>  			}
> +
> +			/* Map the page back to the direct mapping */
> +			if (vma_is_anonymous(vma) && vma_is_kvm_protected(vma))
> +				kvm_map_page(page, 1);
> +
>  			rss[mm_counter(page)]--;
>  			page_remove_rmap(page, false);
>  			if (unlikely(page_mapcount(page) < 0))
> @@ -3312,6 +3318,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	struct page *page;
>  	vm_fault_t ret = 0;
>  	pte_t entry;
> +	bool set = false;
>  
>  	/* File mapping without ->vm_ops ? */
>  	if (vma->vm_flags & VM_SHARED)
> @@ -3397,6 +3404,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	page_add_new_anon_rmap(page, vma, vmf->address, false);
>  	mem_cgroup_commit_charge(page, memcg, false, false);
>  	lru_cache_add_active_or_unevictable(page, vma);
> +	set = true;
>  setpte:
>  	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>  
> @@ -3404,6 +3412,11 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	update_mmu_cache(vma, vmf->address, vmf->pte);
>  unlock:
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> +
> +	/* Unmap page from direct mapping */
> +	if (vma_is_kvm_protected(vma) && set)
> +		kvm_unmap_page(page, 1);
> +
>  	return ret;
>  release:
>  	mem_cgroup_cancel_charge(page, memcg, false);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f79a206b271a..a9b2e347d1ab 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1709,6 +1709,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  
>  static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg)
>  {
> +	/* TODO */
> +	if (vma_is_kvm_protected(vma))
> +		return true;
> +
>  	return vma_is_temporary_stack(vma);
>  }
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 71aac117357f..defc33d3a124 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -51,6 +51,7 @@
>  #include <linux/io.h>
>  #include <linux/lockdep.h>
>  #include <linux/kthread.h>
> +#include <linux/pagewalk.h>
>  
>  #include <asm/processor.h>
>  #include <asm/ioctl.h>
> @@ -2718,6 +2719,72 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>  
> +void kvm_map_page(struct page *page, int nr_pages)
> +{
> +	int i;
> +
> +	/* Clear page before returning it to the direct mapping */
> +	for (i = 0; i < nr_pages; i++) {
> +		void *p = map_page_atomic(page + i);
> +		memset(p, 0, PAGE_SIZE);
> +		unmap_page_atomic(p);
> +	}
> +
> +	kernel_map_pages(page, nr_pages, 1);
> +}
> +EXPORT_SYMBOL_GPL(kvm_map_page);
> +
> +void kvm_unmap_page(struct page *page, int nr_pages)
> +{
> +	kernel_map_pages(page, nr_pages, 0);
> +}
> +EXPORT_SYMBOL_GPL(kvm_unmap_page);
> +
> +static int adjust_direct_mapping_pte_range(pmd_t *pmd, unsigned long addr,
> +					   unsigned long end,
> +					   struct mm_walk *walk)
> +{
> +	bool protect = (bool)walk->private;
> +	pte_t *pte;
> +	struct page *page;
> +
> +	if (pmd_trans_huge(*pmd)) {
> +		page = pmd_page(*pmd);
> +		if (is_huge_zero_page(page))
> +			return 0;
> +		VM_BUG_ON_PAGE(total_mapcount(page) != 1, page);
> +		/* XXX: Would it fail with direct device assignment? */
> +		VM_BUG_ON_PAGE(page_count(page) != 1, page);
> +		kernel_map_pages(page, HPAGE_PMD_NR, !protect);
> +		return 0;
> +	}
> +
> +	pte = pte_offset_map(pmd, addr);
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		pte_t entry = *pte;
> +
> +		if (!pte_present(entry))
> +			continue;
> +
> +		if (is_zero_pfn(pte_pfn(entry)))
> +			continue;
> +
> +		page = pte_page(entry);
> +
> +		VM_BUG_ON_PAGE(page_mapcount(page) != 1, page);
> +		/* XXX: Would it fail with direct device assignment? */
> +		VM_BUG_ON_PAGE(page_count(page) !=
> +			       total_mapcount(compound_head(page)), page);
> +		kernel_map_pages(page, 1, !protect);
> +	}
> +
> +	return 0;
> +}
> +
> +static const struct mm_walk_ops adjust_direct_mapping_ops = {
> +	.pmd_entry	= adjust_direct_mapping_pte_range,
> +};
> +

All this seems to me an addition to the set_memory APIs rather than KVM.

>  static int protect_memory(unsigned long start, unsigned long end, bool protect)
>  {
>  	struct mm_struct *mm = current->mm;
> @@ -2763,6 +2830,13 @@ static int protect_memory(unsigned long start, unsigned long end, bool protect)
>  		if (ret)
>  			goto out;
>  
> +		if (vma_is_anonymous(vma)) {
> +			ret = walk_page_range_novma(mm, start, tmp,
> +					    &adjust_direct_mapping_ops, NULL,
> +					    (void *) protect);
> +			if (ret)
> +				goto out;
> +		}
>  next:
>  		start = tmp;
>  		if (start < prev->vm_end)
> -- 
> 2.26.2
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-05-25 13:47 ` Liran Alon
  2020-05-25 14:46   ` Kirill A. Shutemov
@ 2020-05-26  6:17   ` Mike Rapoport
  2020-05-26 10:16     ` Liran Alon
  1 sibling, 1 reply; 62+ messages in thread
From: Mike Rapoport @ 2020-05-26  6:17 UTC (permalink / raw)
  To: Liran Alon
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm,
	linux-mm, linux-kernel, Kirill A. Shutemov

On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
> 
> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
> 
> Furthermore, I would like to point out that just unmapping guest data from
> kernel direct-map is not sufficient to prevent all
> guest-to-guest info-leaks via a kernel memory info-leak vulnerability. This
> is because host kernel VA space have other regions
> which contains guest sensitive data. For example, KVM per-vCPU struct (which
> holds vCPU state) is allocated on slab and therefore
> still leakable.

Objects allocated from slab use the direct map, vmalloc() is another story.

> >   - Touching direct mapping leads to fragmentation. We need to be able to
> >     recover from it. I have a buggy patch that aims at recovering 2M/1G page.
> >     It has to be fixed and tested properly
>
> As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
> will lead to holes in kernel direct-map which force it to not be mapped
> anymore as a series of 1GB huge-pages.
> This have non-trivial performance cost. Thus, I am not sure addressing this
> use-case is valuable.

Out of curiosity, do we actually have some numbers for the "non-trivial
performance cost"? For instance for KVM usecase?


-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 07/16] KVM: mm: Introduce VM_KVM_PROTECTED
  2020-05-22 12:52 ` [RFC 07/16] KVM: mm: Introduce VM_KVM_PROTECTED Kirill A. Shutemov
  2020-05-26  6:15   ` Mike Rapoport
@ 2020-05-26  6:40   ` John Hubbard
  2020-05-26 22:04     ` Kirill A. Shutemov
  1 sibling, 1 reply; 62+ messages in thread
From: John Hubbard @ 2020-05-26  6:40 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On 2020-05-22 05:52, Kirill A. Shutemov wrote:
...
> @@ -2773,6 +2780,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>   #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
>   #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
>   #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
> +#define FOLL_KVM	0x80000 /* access to VM_KVM_PROTECTED VMAs */
>   

I grabbed 0x80000 already, for FOLL_FAST_ONLY. :)

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-05-26  6:17   ` Mike Rapoport
@ 2020-05-26 10:16     ` Liran Alon
  2020-05-26 11:38       ` Mike Rapoport
  0 siblings, 1 reply; 62+ messages in thread
From: Liran Alon @ 2020-05-26 10:16 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm,
	linux-mm, linux-kernel, Kirill A. Shutemov


On 26/05/2020 9:17, Mike Rapoport wrote:
> On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
>> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
>>
>> Furthermore, I would like to point out that just unmapping guest data from
>> kernel direct-map is not sufficient to prevent all
>> guest-to-guest info-leaks via a kernel memory info-leak vulnerability. This
>> is because host kernel VA space have other regions
>> which contains guest sensitive data. For example, KVM per-vCPU struct (which
>> holds vCPU state) is allocated on slab and therefore
>> still leakable.
> Objects allocated from slab use the direct map, vmalloc() is another story.
It doesn't matter. This patch series, like XPFO, only removes guest
memory pages from the direct map, not things such as the KVM per-vCPU
structs. That's why Julian & Marius (AWS) created the "Process local
kernel VA region" patch-series, which declares a single PGD entry,
mapping a kernelspace region, to have a different PFN in different tasks.
For more information, see the KVM Forum talk slides I gave in a previous
reply and the related AWS patch-series:
https://patchwork.kernel.org/cover/10990403/
>
>>>    - Touching direct mapping leads to fragmentation. We need to be able to
>>>      recover from it. I have a buggy patch that aims at recovering 2M/1G page.
>>>      It has to be fixed and tested properly
>> As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
>> will lead to holes in kernel direct-map which force it to not be mapped
>> anymore as a series of 1GB huge-pages.
>> This have non-trivial performance cost. Thus, I am not sure addressing this
>> use-case is valuable.
> Out of curiosity, do we actually have some numbers for the "non-trivial
> performance cost"? For instance for KVM usecase?
>
Dig into XPFO mailing-list discussions to find out...
I just remember that this was one of the main concerns regarding XPFO.

-Liran


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-05-26 10:16     ` Liran Alon
@ 2020-05-26 11:38       ` Mike Rapoport
  2020-05-27 15:45         ` Dave Hansen
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Rapoport @ 2020-05-26 11:38 UTC (permalink / raw)
  To: Liran Alon
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm,
	linux-mm, linux-kernel, Kirill A. Shutemov

On Tue, May 26, 2020 at 01:16:14PM +0300, Liran Alon wrote:
> 
> On 26/05/2020 9:17, Mike Rapoport wrote:
> > On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
> > > On 22/05/2020 15:51, Kirill A. Shutemov wrote:
> > > 
> > Out of curiosity, do we actually have some numbers for the "non-trivial
> > performance cost"? For instance for KVM usecase?
> > 
> Dig into XPFO mailing-list discussions to find out...
> I just remember that this was one of the main concerns regarding XPFO.

The XPFO benchmarks measure total XPFO cost, and a huge share of it comes
from TLB shootdowns.

It's not exactly a measurement of the impact of direct map fragmentation
on a workload running inside a virtual machine.

> -Liran

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-05-26  6:14   ` Mike Rapoport
@ 2020-05-26 21:56     ` Kirill A. Shutemov
  0 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-26 21:56 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Tue, May 26, 2020 at 09:14:59AM +0300, Mike Rapoport wrote:
> On Fri, May 22, 2020 at 03:52:04PM +0300, Kirill A. Shutemov wrote:
> > New helpers copy_from_guest()/copy_to_guest() to be used if KVM memory
> > protection feature is enabled.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/kvm_host.h |  4 +++
> >  virt/kvm/kvm_main.c      | 78 ++++++++++++++++++++++++++++++++++------
> >  2 files changed, 72 insertions(+), 10 deletions(-)
> > 
> >  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > -				 void *data, int offset, int len)
> > +				 void *data, int offset, int len,
> > +				 bool protected)
> >  {
> >  	int r;
> >  	unsigned long addr;
> > @@ -2257,7 +2297,10 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> >  	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
> >  	if (kvm_is_error_hva(addr))
> >  		return -EFAULT;
> > -	r = __copy_from_user(data, (void __user *)addr + offset, len);
> > +	if (protected)
> > +		r = copy_from_guest(data, addr + offset, len);
> > +	else
> > +		r = __copy_from_user(data, (void __user *)addr + offset, len);
> 
> Maybe always use copy_{from,to}_guest() and move the 'if (protected)'
> there?
> If kvm is added to memory slot, it cab be the passed to copy_{to,from}_guest.

Right, Vitaly has pointed me to this already.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 10/16] KVM: x86: Enabled protected memory extension
  2020-05-26  6:16   ` Mike Rapoport
@ 2020-05-26 21:58     ` Kirill A. Shutemov
  0 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-26 21:58 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Tue, May 26, 2020 at 09:16:09AM +0300, Mike Rapoport wrote:
> On Fri, May 22, 2020 at 03:52:08PM +0300, Kirill A. Shutemov wrote:
> > Wire up hypercalls for the feature and define VM_KVM_PROTECTED.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/Kconfig     | 1 +
> >  arch/x86/kvm/cpuid.c | 3 +++
> >  arch/x86/kvm/x86.c   | 9 +++++++++
> >  include/linux/mm.h   | 4 ++++
> >  4 files changed, 17 insertions(+)
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 58dd44a1b92f..420e3947f0c6 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -801,6 +801,7 @@ config KVM_GUEST
> >  	select ARCH_CPUIDLE_HALTPOLL
> >  	select X86_MEM_ENCRYPT_COMMON
> >  	select SWIOTLB
> > +	select ARCH_USES_HIGH_VMA_FLAGS
> >  	default y
> >  	---help---
> >  	  This option enables various optimizations for running under the KVM
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index 901cd1fdecd9..94cc5e45467e 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -714,6 +714,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> >  			     (1 << KVM_FEATURE_POLL_CONTROL) |
> >  			     (1 << KVM_FEATURE_PV_SCHED_YIELD);
> >  
> > +		if (VM_KVM_PROTECTED)
> > +			entry->eax |=(1 << KVM_FEATURE_MEM_PROTECTED);
> > +
> >  		if (sched_info_on())
> >  			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> >  
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index c17e6eb9ad43..acba0ac07f61 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -7598,6 +7598,15 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >  		kvm_sched_yield(vcpu->kvm, a0);
> >  		ret = 0;
> >  		break;
> > +	case KVM_HC_ENABLE_MEM_PROTECTED:
> > +		ret = kvm_protect_all_memory(vcpu->kvm);
> > +		break;
> > +	case KVM_HC_MEM_SHARE:
> > +		ret = kvm_protect_memory(vcpu->kvm, a0, a1, false);
> > +		break;
> > +	case KVM_HC_MEM_UNSHARE:
> > +		ret = kvm_protect_memory(vcpu->kvm, a0, a1, true);
> > +		break;
> >  	default:
> >  		ret = -KVM_ENOSYS;
> >  		break;
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 4f7195365cc0..6eb771c14968 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -329,7 +329,11 @@ extern unsigned int kobjsize(const void *objp);
> >  # define VM_MAPPED_COPY	VM_ARCH_1	/* T if mapped copy of data (nommu mmap) */
> >  #endif
> >  
> > +#if defined(CONFIG_X86_64) && defined(CONFIG_KVM)
> 
> This would be better spelled as ARCH_WANTS_PROTECTED_MEMORY, IMHO.

Sure. I thought it was good enough for an RFC :)

> > +#define VM_KVM_PROTECTED VM_HIGH_ARCH_4
> 
> Maybe this should be VM_HIGH_ARCH_5 so that powerpc could enable this
> feature eventually?

Okay-okay.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 07/16] KVM: mm: Introduce VM_KVM_PROTECTED
  2020-05-26  6:15   ` Mike Rapoport
@ 2020-05-26 22:01     ` Kirill A. Shutemov
  0 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-26 22:01 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Tue, May 26, 2020 at 09:15:52AM +0300, Mike Rapoport wrote:
> On Fri, May 22, 2020 at 03:52:05PM +0300, Kirill A. Shutemov wrote:
> > The new VMA flag that indicate a VMA that is not accessible to userspace
> > but usable by kernel with GUP if FOLL_KVM is specified.
> > 
> > The FOLL_KVM is only used in the KVM code. The code has to know how to
> > deal with such pages.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/mm.h  |  8 ++++++++
> >  mm/gup.c            | 20 ++++++++++++++++----
> >  mm/huge_memory.c    | 20 ++++++++++++++++----
> >  mm/memory.c         |  3 +++
> >  mm/mmap.c           |  3 +++
> >  virt/kvm/async_pf.c |  4 ++--
> >  virt/kvm/kvm_main.c |  9 +++++----
> >  7 files changed, 53 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e1882eec1752..4f7195365cc0 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -329,6 +329,8 @@ extern unsigned int kobjsize(const void *objp);
> >  # define VM_MAPPED_COPY	VM_ARCH_1	/* T if mapped copy of data (nommu mmap) */
> >  #endif
> >  
> > +#define VM_KVM_PROTECTED 0
> 
> With all the ideas about removing pages from the direct mapi floating
> around I wouldn't limit this to KVM.
> 
> VM_NOT_IN_DIRECT_MAP would describe such areas better, but I realise
> it's very far from perfect and nothing better does not comes to mind :)

I don't like VM_NOT_IN_DIRECT_MAP.

It's not only about the direct mapping, but about the userspace mapping
as well. For the same reason the other naming proposals don't fit either.

> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index f609e9ec4a25..d56c3f6efc99 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -112,6 +112,9 @@ pgprot_t vm_get_page_prot(unsigned long vm_flags)
> >  				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
> >  			pgprot_val(arch_vm_get_page_prot(vm_flags)));
> >  
> > +	if (vm_flags & VM_KVM_PROTECTED)
> > +		ret = PAGE_NONE;
> 
> Nit: vma_is_kvm_protected()?

Which VMA? :P

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 07/16] KVM: mm: Introduce VM_KVM_PROTECTED
  2020-05-26  6:40   ` John Hubbard
@ 2020-05-26 22:04     ` Kirill A. Shutemov
  0 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-26 22:04 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Mon, May 25, 2020 at 11:40:01PM -0700, John Hubbard wrote:
> On 2020-05-22 05:52, Kirill A. Shutemov wrote:
> ...
> > @@ -2773,6 +2780,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
> >   #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
> >   #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
> >   #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
> > +#define FOLL_KVM	0x80000 /* access to VM_KVM_PROTECTED VMAs */
> 
> I grabbed 0x80000 already, for FOLL_FAST_ONLY. :)

Let's see who gets upstream first :P (Spoiler: you)

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 16/16] KVM: Unmap protected pages from direct mapping
  2020-05-26  6:16   ` Mike Rapoport
@ 2020-05-26 22:10     ` Kirill A. Shutemov
  0 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-05-26 22:10 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Tue, May 26, 2020 at 09:16:38AM +0300, Mike Rapoport wrote:
> On Fri, May 22, 2020 at 03:52:14PM +0300, Kirill A. Shutemov wrote:
> > If the protected memory feature enabled, unmap guest memory from
> > kernel's direct mappings.
> > 
> > Migration and KSM is disabled for protected memory as it would require a
> > special treatment.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/mm/pat/set_memory.c |  1 +
> >  include/linux/kvm_host.h     |  3 ++
> >  mm/huge_memory.c             |  9 +++++
> >  mm/ksm.c                     |  3 ++
> >  mm/memory.c                  | 13 +++++++
> >  mm/rmap.c                    |  4 ++
> >  virt/kvm/kvm_main.c          | 74 ++++++++++++++++++++++++++++++++++++
> >  7 files changed, 107 insertions(+)
> > 
> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index 6f075766bb94..13988413af40 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -2227,6 +2227,7 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
> >  
> >  	arch_flush_lazy_mmu_mode();
> >  }
> > +EXPORT_SYMBOL_GPL(__kernel_map_pages);
> >  
> >  #ifdef CONFIG_HIBERNATION
> >  bool kernel_page_present(struct page *page)
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index b6944f88033d..e1d7762b615c 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -705,6 +705,9 @@ int kvm_protect_all_memory(struct kvm *kvm);
> >  int kvm_protect_memory(struct kvm *kvm,
> >  		       unsigned long gfn, unsigned long npages, bool protect);
> >  
> > +void kvm_map_page(struct page *page, int nr_pages);
> > +void kvm_unmap_page(struct page *page, int nr_pages);
> > +
> >  int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
> >  			    struct page **pages, int nr_pages);
> >  
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c3562648a4ef..d8a444a401cc 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -33,6 +33,7 @@
> >  #include <linux/oom.h>
> >  #include <linux/numa.h>
> >  #include <linux/page_owner.h>
> > +#include <linux/kvm_host.h>
> 
> This does not seem right... 

I agree. I'll try to find a cleaner way to deal with it.

> >  #include <asm/tlb.h>
> >  #include <asm/pgalloc.h>
> > @@ -650,6 +651,10 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> >  		spin_unlock(vmf->ptl);
> >  		count_vm_event(THP_FAULT_ALLOC);
> >  		count_memcg_events(memcg, THP_FAULT_ALLOC, 1);
> > +
> > +		/* Unmap page from direct mapping */
> > +		if (vma_is_kvm_protected(vma))
> > +			kvm_unmap_page(page, HPAGE_PMD_NR);
> 
> ... and neither does this.
> 
> I think the map/unmap primitives shoud be a part of the generic mm and
> not burried inside KVM.

Well, yes. Except kvm_map_page() also clears the page before bringing it
back to the direct mapping. Not sure yet how to deal with it.
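
Maybe the scrubbing could move into the generic helper as well, since
anything that comes back to the direct map from the "hidden" state
probably wants to be cleared anyway. A completely untested sketch (name
made up, using the map_page_atomic() helper from this series):

void direct_map_pages_scrubbed(struct page *page, int nr_pages)
{
	int i;

	/* clear before re-establishing the direct mapping */
	for (i = 0; i < nr_pages; i++) {
		void *p = map_page_atomic(page + i);

		memset(p, 0, PAGE_SIZE);
		unmap_page_atomic(p);
	}

	kernel_map_pages(page, nr_pages, 1);
}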

> >  	return 0;
> > @@ -1886,6 +1891,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  			page_remove_rmap(page, true);
> >  			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> >  			VM_BUG_ON_PAGE(!PageHead(page), page);
> > +
> > +			/* Map the page back to the direct mapping */
> > +			if (vma_is_kvm_protected(vma))
> > +				kvm_map_page(page, HPAGE_PMD_NR);
> >  		} else if (thp_migration_supported()) {
> >  			swp_entry_t entry;
> >  
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 281c00129a2e..942b88782ac2 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -527,6 +527,9 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
> >  		return NULL;
> >  	if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
> >  		return NULL;
> > +	/* TODO */
> 
> Probably this is not something that should be done. For a security
> sensitive environment that wants protected memory, KSM woudn't be
> relevant anyway...

Hm. True.

> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 71aac117357f..defc33d3a124 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -51,6 +51,7 @@
> >  #include <linux/io.h>
> >  #include <linux/lockdep.h>
> >  #include <linux/kthread.h>
> > +#include <linux/pagewalk.h>
> >  
> >  #include <asm/processor.h>
> >  #include <asm/ioctl.h>
> > @@ -2718,6 +2719,72 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
> >  
> > +void kvm_map_page(struct page *page, int nr_pages)
> > +{
> > +	int i;
> > +
> > +	/* Clear page before returning it to the direct mapping */
> > +	for (i = 0; i < nr_pages; i++) {
> > +		void *p = map_page_atomic(page + i);
> > +		memset(p, 0, PAGE_SIZE);
> > +		unmap_page_atomic(p);
> > +	}
> > +
> > +	kernel_map_pages(page, nr_pages, 1);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_map_page);
> > +
> > +void kvm_unmap_page(struct page *page, int nr_pages)
> > +{
> > +	kernel_map_pages(page, nr_pages, 0);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_unmap_page);
> > +
> > +static int adjust_direct_mapping_pte_range(pmd_t *pmd, unsigned long addr,
> > +					   unsigned long end,
> > +					   struct mm_walk *walk)
> > +{
> > +	bool protect = (bool)walk->private;
> > +	pte_t *pte;
> > +	struct page *page;
> > +
> > +	if (pmd_trans_huge(*pmd)) {
> > +		page = pmd_page(*pmd);
> > +		if (is_huge_zero_page(page))
> > +			return 0;
> > +		VM_BUG_ON_PAGE(total_mapcount(page) != 1, page);
> > +		/* XXX: Would it fail with direct device assignment? */
> > +		VM_BUG_ON_PAGE(page_count(page) != 1, page);
> > +		kernel_map_pages(page, HPAGE_PMD_NR, !protect);
> > +		return 0;
> > +	}
> > +
> > +	pte = pte_offset_map(pmd, addr);
> > +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> > +		pte_t entry = *pte;
> > +
> > +		if (!pte_present(entry))
> > +			continue;
> > +
> > +		if (is_zero_pfn(pte_pfn(entry)))
> > +			continue;
> > +
> > +		page = pte_page(entry);
> > +
> > +		VM_BUG_ON_PAGE(page_mapcount(page) != 1, page);
> > +		/* XXX: Would it fail with direct device assignment? */
> > +		VM_BUG_ON_PAGE(page_count(page) !=
> > +			       total_mapcount(compound_head(page)), page);
> > +		kernel_map_pages(page, 1, !protect);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static const struct mm_walk_ops adjust_direct_mapping_ops = {
> > +	.pmd_entry	= adjust_direct_mapping_pte_range,
> > +};
> > +
> 
> All this seem to me an addition to set_memory APIs rather then KVM.

Emm?.. I don't think walking a userspace mapping is a set_memory thing.
And kernel_map_pages() is a VMM interface already.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 02/16] x86/kvm: Introduce KVM memory protection feature
  2020-05-25 15:15     ` Kirill A. Shutemov
@ 2020-05-27  5:03       ` Sean Christopherson
  2020-05-27  8:39         ` Vitaly Kuznetsov
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2020-05-27  5:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Vitaly Kuznetsov, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Paolo Bonzini, Wanpeng Li, Jim Mattson,
	Joerg Roedel

On Mon, May 25, 2020 at 06:15:25PM +0300, Kirill A. Shutemov wrote:
> On Mon, May 25, 2020 at 04:58:51PM +0200, Vitaly Kuznetsov wrote:
> > > @@ -727,6 +734,15 @@ static void __init kvm_init_platform(void)
> > >  {
> > >  	kvmclock_init();
> > >  	x86_platform.apic_post_init = kvm_apic_init;
> > > +
> > > +	if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
> > > +		if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
> > > +			pr_err("Failed to enable KVM memory protection\n");
> > > +			return;
> > > +		}
> > > +
> > > +		mem_protected = true;
> > > +	}
> > >  }
> > 
> > Personally, I'd prefer to do this via setting a bit in a KVM-specific
> > MSR instead. The benefit is that the guest doesn't need to remember if
> > it enabled the feature or not, it can always read the config msr. May
> > come handy for e.g. kexec/kdump.
> 
> I think we would need to remember it anyway. Accessing MSR is somewhat
> expensive. But, okay, I can rework it MSR if needed.

I think Vitaly is talking about the case where the kernel can't easily get
at its cached state, e.g. after booting into a new kernel.  The kernel would
still have an X86_FEATURE bit or whatever, providing a virtual MSR would be
purely for rare slow paths.

That being said, a hypercall plus CPUID bit might be better, e.g. that'd
allow the guest to query the state without risking a #GP.
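
Roughly, on the guest side (sketch only; the feature bit name here is
made up):

/*
 * The host would set a CPUID bit once KVM_HC_ENABLE_MEM_PROTECTED
 * succeeds, so e.g. a kexec'ed kernel can re-discover the state with
 * plain CPUID and no #GP risk.
 */
static bool kvm_mem_protection_enabled(void)
{
	return kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED_ACTIVE);
}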

> Note, that we can avoid the enabling algother, if we modify BIOS to deal
> with private/shared memory. Currently BIOS get system crash if we enable
> the feature from time zero.

Which would mesh better with a CPUID feature bit.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 02/16] x86/kvm: Introduce KVM memory protection feature
  2020-05-27  5:03       ` Sean Christopherson
@ 2020-05-27  8:39         ` Vitaly Kuznetsov
  2020-05-27  8:52           ` Sean Christopherson
  2020-06-03  2:09           ` Huang, Kai
  0 siblings, 2 replies; 62+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-27  8:39 UTC (permalink / raw)
  To: Sean Christopherson, Kirill A. Shutemov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Wanpeng Li, Jim Mattson, Joerg Roedel

Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Mon, May 25, 2020 at 06:15:25PM +0300, Kirill A. Shutemov wrote:
>> On Mon, May 25, 2020 at 04:58:51PM +0200, Vitaly Kuznetsov wrote:
>> > > @@ -727,6 +734,15 @@ static void __init kvm_init_platform(void)
>> > >  {
>> > >  	kvmclock_init();
>> > >  	x86_platform.apic_post_init = kvm_apic_init;
>> > > +
>> > > +	if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
>> > > +		if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
>> > > +			pr_err("Failed to enable KVM memory protection\n");
>> > > +			return;
>> > > +		}
>> > > +
>> > > +		mem_protected = true;
>> > > +	}
>> > >  }
>> > 
>> > Personally, I'd prefer to do this via setting a bit in a KVM-specific
>> > MSR instead. The benefit is that the guest doesn't need to remember if
>> > it enabled the feature or not, it can always read the config msr. May
>> > come handy for e.g. kexec/kdump.
>> 
>> I think we would need to remember it anyway. Accessing MSR is somewhat
>> expensive. But, okay, I can rework it MSR if needed.
>
> I think Vitaly is talking about the case where the kernel can't easily get
> at its cached state, e.g. after booting into a new kernel.  The kernel would
> still have an X86_FEATURE bit or whatever, providing a virtual MSR would be
> purely for rare slow paths.
>
> That being said, a hypercall plus CPUID bit might be better, e.g. that'd
> allow the guest to query the state without risking a #GP.

We have rdmsr_safe() for that! :-) The MSR (and the hypercall, for that
matter) should have an associated CPUID feature bit, of course.

Yes, hypercall + CPUID would do, but normally we treat CPUID data as
static, and in this case we'd make it a dynamically flipping bit.
Especially if we introduce 'KVM_HC_DISABLE_MEM_PROTECTED' later.
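
The MSR flavour could be as simple as this (the MSR index and the enable
bit below are made up, just to sketch the idea):

#define MSR_KVM_MEM_PROTECTED	0x4b564d0a	/* hypothetical index */

static bool kvm_mem_protection_active(void)
{
	u64 val;

	if (!kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED))
		return false;

	/* rdmsrl_safe() keeps a missing MSR from #GP'ing the guest */
	return !rdmsrl_safe(MSR_KVM_MEM_PROTECTED, &val) && (val & 1);
}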

>
>> Note, that we can avoid the enabling algother, if we modify BIOS to deal
>> with private/shared memory. Currently BIOS get system crash if we enable
>> the feature from time zero.
>
> Which would mesh better with a CPUID feature bit.
>

And maybe it would even help us to resolve the 'reboot' problem.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 02/16] x86/kvm: Introduce KVM memory protection feature
  2020-05-27  8:39         ` Vitaly Kuznetsov
@ 2020-05-27  8:52           ` Sean Christopherson
  2020-06-03  2:09           ` Huang, Kai
  1 sibling, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2020-05-27  8:52 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Kirill A. Shutemov, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Paolo Bonzini, Wanpeng Li, Jim Mattson,
	Joerg Roedel

On Wed, May 27, 2020 at 10:39:33AM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
> 
> > On Mon, May 25, 2020 at 06:15:25PM +0300, Kirill A. Shutemov wrote:
> >> On Mon, May 25, 2020 at 04:58:51PM +0200, Vitaly Kuznetsov wrote:
> >> > > @@ -727,6 +734,15 @@ static void __init kvm_init_platform(void)
> >> > >  {
> >> > >  	kvmclock_init();
> >> > >  	x86_platform.apic_post_init = kvm_apic_init;
> >> > > +
> >> > > +	if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
> >> > > +		if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
> >> > > +			pr_err("Failed to enable KVM memory protection\n");
> >> > > +			return;
> >> > > +		}
> >> > > +
> >> > > +		mem_protected = true;
> >> > > +	}
> >> > >  }
> >> > 
> >> > Personally, I'd prefer to do this via setting a bit in a KVM-specific
> >> > MSR instead. The benefit is that the guest doesn't need to remember if
> >> > it enabled the feature or not, it can always read the config msr. May
> >> > come handy for e.g. kexec/kdump.
> >> 
> >> I think we would need to remember it anyway. Accessing MSR is somewhat
> >> expensive. But, okay, I can rework it MSR if needed.
> >
> > I think Vitaly is talking about the case where the kernel can't easily get
> > at its cached state, e.g. after booting into a new kernel.  The kernel would
> > still have an X86_FEATURE bit or whatever, providing a virtual MSR would be
> > purely for rare slow paths.
> >
> > That being said, a hypercall plus CPUID bit might be better, e.g. that'd
> > allow the guest to query the state without risking a #GP.
> 
> We have rdmsr_safe() for that! :-) MSR (and hypercall to that matter)
> should have an associated CPUID feature bit of course.

rdmsr_safe() won't fly in early boot, e.g. verify_cpu.  It probably doesn't
matter for late enabling, but it might save some headache if there's ever a
handoff from vBIOS.

> Yes, hypercall + CPUID would do but normally we treat CPUID data as
> static and in this case we'll make it a dynamically flipping

There are multiple examples of dynamic CPUID, e.g. MWAIT and OSPKE.

> bit. Especially if we introduce 'KVM_HC_DISABLE_MEM_PROTECTED' later.
> 
> >
> >> Note, that we can avoid the enabling algother, if we modify BIOS to deal
> >> with private/shared memory. Currently BIOS get system crash if we enable
> >> the feature from time zero.
> >
> > Which would mesh better with a CPUID feature bit.
> >
> 
> And maybe even help us to resolve 'reboot' problem.
> 
> -- 
> Vitaly
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-05-26 11:38       ` Mike Rapoport
@ 2020-05-27 15:45         ` Dave Hansen
  2020-05-27 21:22           ` Mike Rapoport
  0 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2020-05-27 15:45 UTC (permalink / raw)
  To: Mike Rapoport, Liran Alon
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm,
	linux-mm, linux-kernel, Kirill A. Shutemov

On 5/26/20 4:38 AM, Mike Rapoport wrote:
> On Tue, May 26, 2020 at 01:16:14PM +0300, Liran Alon wrote:
>> On 26/05/2020 9:17, Mike Rapoport wrote:
>>> On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
>>>> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
>>>>
>>> Out of curiosity, do we actually have some numbers for the "non-trivial
>>> performance cost"? For instance for KVM usecase?
>>>
>> Dig into XPFO mailing-list discussions to find out...
>> I just remember that this was one of the main concerns regarding XPFO.
> The XPFO benchmarks measure total XPFO cost, and huge share of it comes
> from TLB shootdowns.

Yes, TLB shootdown when pages transition between owners is huge.  The
XPFO folks did a lot of work to try to optimize some of this overhead
away.  But, it's still a concern.

The concern with XPFO was that it could affect *all* application page
allocation.  This approach cheats a bit and only goes after guest VM
pages.  It's significantly more work to allocate a page and map it into
a guest than it is to, for instance, allocate an anonymous user page.
That means that the *additional* overhead of things like this for guest
memory matter a lot less.

> It's not exactly measurement of the imapct of the direct map
> fragmentation to workload running inside a vitrual machine.

While the VM *itself* is running, there is zero overhead.  The host
direct map is not used at *all*.  The guest and host TLB entries share
the same space in the TLB so there could be some increased pressure on
the TLB, but that's a really secondary effect.  It would also only occur
if the guest exits and the host runs and starts evicting TLB entries.

The other effects I could think of would be when the guest exits and the
host is doing some work for the guest, like emulation or something.  The
host would see worse TLB behavior because the host is using the
(fragmented) direct map.

But, both of those things require VMEXITs.  The more exits, the more
overhead you _might_ observe.  What I've been hearing from KVM folks is
that exits are getting more and more rare and the hardware designers are
working hard to minimize them.

That's especially good news because it means that even if the situation
isn't perfect, it's only bound to get *better* over time, not worse.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-05-27 15:45         ` Dave Hansen
@ 2020-05-27 21:22           ` Mike Rapoport
  0 siblings, 0 replies; 62+ messages in thread
From: Mike Rapoport @ 2020-05-27 21:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Liran Alon, Kirill A. Shutemov, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Wed, May 27, 2020 at 08:45:33AM -0700, Dave Hansen wrote:
> On 5/26/20 4:38 AM, Mike Rapoport wrote:
> > On Tue, May 26, 2020 at 01:16:14PM +0300, Liran Alon wrote:
> >> On 26/05/2020 9:17, Mike Rapoport wrote:
> >>> On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
> >>>> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
> >>>>
> >>> Out of curiosity, do we actually have some numbers for the "non-trivial
> >>> performance cost"? For instance for KVM usecase?
> >>>
> >> Dig into XPFO mailing-list discussions to find out...
> >> I just remember that this was one of the main concerns regarding XPFO.
> >
> > The XPFO benchmarks measure total XPFO cost, and huge share of it comes
> > from TLB shootdowns.
> 
> Yes, TLB shootdown when pages transition between owners is huge.  The
> XPFO folks did a lot of work to try to optimize some of this overhead
> away.  But, it's still a concern.
> 
> The concern with XPFO was that it could affect *all* application page
> allocation.  This approach cheats a bit and only goes after guest VM
> pages.  It's significantly more work to allocate a page and map it into
> a guest than it is to, for instance, allocate an anonymous user page.
> That means that the *additional* overhead of things like this for guest
> memory matters a lot less.
> 
> > It's not exactly a measurement of the impact of direct map
> > fragmentation on a workload running inside a virtual machine.
> 
> While the VM *itself* is running, there is zero overhead.  The host
> direct map is not used at *all*.  The guest and host TLB entries share
> the same space in the TLB so there could be some increased pressure on
> the TLB, but that's a really secondary effect.  It would also only occur
> if the guest exits and the host runs and starts evicting TLB entries.
> 
> The other effects I could think of would be when the guest exits and the
> host is doing some work for the guest, like emulation or something.  The
> host would see worse TLB behavior because the host is using the
> (fragmented) direct map.
> 
> But, both of those things require VMEXITs.  The more exits, the more
> overhead you _might_ observe.  What I've been hearing from KVM folks is
> that exits are getting more and more rare and the hardware designers are
> working hard to minimize them.

Right, when the guest stays in guest mode, there is no overhead. But
guests still exit sometimes, and I was wondering if anybody had measured
the difference in overhead with different page sizes used for the host's
direct map.

My guesstimate is that the overhead will not differ much for most
workloads. But still, it would be interesting to *know* what it is.

> That's especially good news because it means that even if the
> situation
> isn't perfect, it's only bound to get *better* over time, not worse.

The processors have been aggressively improving performance for decades,
and see where we are now because of it ;-)

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-05-22 12:52 ` [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory Kirill A. Shutemov
  2020-05-25 15:08   ` Vitaly Kuznetsov
  2020-05-26  6:14   ` Mike Rapoport
@ 2020-05-29 15:24   ` Kees Cook
  2 siblings, 0 replies; 62+ messages in thread
From: Kees Cook @ 2020-05-29 15:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Fri, May 22, 2020 at 03:52:04PM +0300, Kirill A. Shutemov wrote:
> +int copy_from_guest(void *data, unsigned long hva, int len)
> +{
> +	int offset = offset_in_page(hva);
> +	struct page *page;
> +	int npages, seg;
> +
> +	while ((seg = next_segment(len, offset)) != 0) {
> +		npages = get_user_pages_unlocked(hva, 1, &page, 0);
> +		if (npages != 1)
> +			return -EFAULT;
> +		memcpy(data, page_address(page) + offset, seg);
> +		put_page(page);
> +		len -= seg;
> +		hva += seg;
> +		offset = 0;
> +	}
> +
> +	return 0;
> +}
> +
> +int copy_to_guest(unsigned long hva, const void *data, int len)
> +{
> +	int offset = offset_in_page(hva);
> +	struct page *page;
> +	int npages, seg;
> +
> +	while ((seg = next_segment(len, offset)) != 0) {
> +		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);
> +		if (npages != 1)
> +			return -EFAULT;
> +		memcpy(page_address(page) + offset, data, seg);
> +		put_page(page);
> +		len -= seg;
> +		hva += seg;
> +		offset = 0;
> +	}
> +	return 0;
> +}
> +
>  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> -				 void *data, int offset, int len)
> +				 void *data, int offset, int len,
> +				 bool protected)
>  {
>  	int r;
>  	unsigned long addr;
> @@ -2257,7 +2297,10 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>  	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
>  	if (kvm_is_error_hva(addr))
>  		return -EFAULT;
> -	r = __copy_from_user(data, (void __user *)addr + offset, len);
> +	if (protected)
> +		r = copy_from_guest(data, addr + offset, len);
> +	else
> +		r = __copy_from_user(data, (void __user *)addr + offset, len);
>  	if (r)
>  		return -EFAULT;
>  	return 0;

This ends up removing KASAN and object size tests. Compare to:

__copy_from_user(void *to, const void __user *from, unsigned long n)
{
        might_fault();
        kasan_check_write(to, n);
        check_object_size(to, n, false);
        return raw_copy_from_user(to, from, n);
}

Those will need to get added back. :)

Additionally, I see that copy_from_guest() neither clears the
destination memory on a short read, nor does KVM actually handle the
short read case correctly now. See the notes in uaccess.h:

 * NOTE: only copy_from_user() zero-pads the destination in case of short copy.
 * Neither __copy_from_user() nor __copy_from_user_inatomic() zero anything
 * at all; their callers absolutely must check the return value.

It's not clear to me how the destination buffers get reused, but this has
the potential to leak kernel memory contents. This needs separate
fixing, I think.
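
As a purely illustrative sketch (not part of the posted series), here is one
way both points could be folded in, re-adding the hardening checks and
zero-filling the uncopied tail on a short read, assuming copy_from_guest()
keeps its posted shape and also advances the destination pointer as it walks
pages:

int copy_from_guest(void *data, unsigned long hva, int len)
{
	int offset = offset_in_page(hva);
	struct page *page;
	int npages, seg;

	might_fault();
	kasan_check_write(data, len);
	check_object_size(data, len, false);

	while ((seg = next_segment(len, offset)) != 0) {
		npages = get_user_pages_unlocked(hva, 1, &page, 0);
		if (npages != 1) {
			/* Zero the uncopied tail, like copy_from_user() does. */
			memset(data, 0, len);
			return -EFAULT;
		}
		memcpy(data, page_address(page) + offset, seg);
		put_page(page);
		data += seg;	/* advance the destination as well */
		len -= seg;
		hva += seg;
		offset = 0;
	}

	return 0;
}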

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-05-25 15:17     ` Kirill A. Shutemov
@ 2020-06-01 16:35       ` Paolo Bonzini
  2020-06-02 13:33         ` Kirill A. Shutemov
  0 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2020-06-01 16:35 UTC (permalink / raw)
  To: Kirill A. Shutemov, Vitaly Kuznetsov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Wanpeng Li, Jim Mattson, Joerg Roedel

On 25/05/20 17:17, Kirill A. Shutemov wrote:
>> Personally, I would've just added a 'struct kvm' pointer to 'struct
>> kvm_memory_slot' to be able to extract the 'mem_protected' info when
>> needed. This would make the patch much smaller.
> Okay, can do.
> 
> The other thing I tried was a per-slot flag to indicate that it's
> protected. But Sean pointed out that it's an all-or-nothing feature and having
> the flag in the slot would be misleading.
> 

Perhaps it would be misleading, but it's an optimization.  Saving a
pointer dereference can be worth it, also because there are some places
where we just pass around a memslot and we don't have the struct kvm*.

Also, it's an all-or-nothing feature _now_.  It doesn't have to remain
that way.
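
Just to make the trade-off concrete, a rough sketch (the field names are
assumptions; neither exists in the posted series exactly like this):

/* Back-pointer in the memslot: one extra dereference on paths
 * that only have the slot. */
static inline bool memslot_protected_via_kvm(struct kvm_memory_slot *slot)
{
	return slot->kvm->mem_protected;
}

/* Per-slot flag: no extra dereference, but redundant as long as the
 * feature stays all-or-nothing per VM. */
static inline bool memslot_protected_via_flag(struct kvm_memory_slot *slot)
{
	return slot->mem_protected;
}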

Paolo


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-06-01 16:35       ` Paolo Bonzini
@ 2020-06-02 13:33         ` Kirill A. Shutemov
  0 siblings, 0 replies; 62+ messages in thread
From: Kirill A. Shutemov @ 2020-06-02 13:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vitaly Kuznetsov, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel

On Mon, Jun 01, 2020 at 06:35:22PM +0200, Paolo Bonzini wrote:
> On 25/05/20 17:17, Kirill A. Shutemov wrote:
> >> Personally, I would've just added a 'struct kvm' pointer to 'struct
> >> kvm_memory_slot' to be able to extract the 'mem_protected' info when
> >> needed. This would make the patch much smaller.
> > Okay, can do.
> > 
> > The other thing I tried was a per-slot flag to indicate that it's
> > protected. But Sean pointed out that it's an all-or-nothing feature and having
> > the flag in the slot would be misleading.
> > 
> 
> Perhaps it would be misleading, but it's an optimization.  Saving a
> pointer dereference can be worth it, also because there are some places
> where we just pass around a memslot and we don't have the struct kvm*.

Vitaly proposed adding a struct kvm pointer to the memslot. Do you object
to that?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 09/16] KVM: Protected memory extension
  2020-05-25 15:34     ` Kirill A. Shutemov
@ 2020-06-03  1:34       ` Huang, Kai
  0 siblings, 0 replies; 62+ messages in thread
From: Huang, Kai @ 2020-06-03  1:34 UTC (permalink / raw)
  To: kirill, vkuznets
  Cc: kvm, Kleen, Andi, wad, keescook, aarcange, dave.hansen, luto,
	wanpengli, linux-kernel, kirill.shutemov, pbonzini, linux-mm,
	joro, peterz, jmattson, Christopherson, Sean J, Edgecombe,
	Rick P, rientjes, x86

On Mon, 2020-05-25 at 18:34 +0300, Kirill A. Shutemov wrote:
> On Mon, May 25, 2020 at 05:26:37PM +0200, Vitaly Kuznetsov wrote:
> > "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> > 
> > > Add infrastructure that handles protected memory extension.
> > > 
> > > Arch-specific code has to provide hypercalls and define non-zero
> > > VM_KVM_PROTECTED.
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > ---
> > >  include/linux/kvm_host.h |   4 ++
> > >  mm/mprotect.c            |   1 +
> > >  virt/kvm/kvm_main.c      | 131 +++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 136 insertions(+)
> > > 
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index bd0bb600f610..d7072f6d6aa0 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -700,6 +700,10 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm);
> > >  void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> > >  				   struct kvm_memory_slot *slot);
> > >  
> > > +int kvm_protect_all_memory(struct kvm *kvm);
> > > +int kvm_protect_memory(struct kvm *kvm,
> > > +		       unsigned long gfn, unsigned long npages, bool protect);
> > > +
> > >  int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
> > >  			    struct page **pages, int nr_pages);
> > >  
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 494192ca954b..552be3b4c80a 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -505,6 +505,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct
> > > vm_area_struct **pprev,
> > >  	vm_unacct_memory(charged);
> > >  	return error;
> > >  }
> > > +EXPORT_SYMBOL_GPL(mprotect_fixup);
> > >  
> > >  /*
> > >   * pkey==-1 when doing a legacy mprotect()
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 530af95efdf3..07d45da5d2aa 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -155,6 +155,8 @@ static void kvm_uevent_notify_change(unsigned int
> > > type, struct kvm *kvm);
> > >  static unsigned long long kvm_createvm_count;
> > >  static unsigned long long kvm_active_vms;
> > >  
> > > +static int protect_memory(unsigned long start, unsigned long end, bool
> > > protect);
> > > +
> > >  __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > >  		unsigned long start, unsigned long end, bool blockable)
> > >  {
> > > @@ -1309,6 +1311,14 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > >  	if (r)
> > >  		goto out_bitmap;
> > >  
> > > +	if (mem->memory_size && kvm->mem_protected) {
> > > +		r = protect_memory(new.userspace_addr,
> > > +				   new.userspace_addr + new.npages * PAGE_SIZE,
> > > +				   true);
> > > +		if (r)
> > > +			goto out_bitmap;
> > > +	}
> > > +
> > >  	if (old.dirty_bitmap && !new.dirty_bitmap)
> > >  		kvm_destroy_dirty_bitmap(&old);
> > >  	return 0;
> > > @@ -2652,6 +2662,127 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu
> > > *vcpu, gfn_t gfn)
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
> > >  
> > > +static int protect_memory(unsigned long start, unsigned long end, bool
> > > protect)
> > > +{
> > > +	struct mm_struct *mm = current->mm;
> > > +	struct vm_area_struct *vma, *prev;
> > > +	int ret;
> > > +
> > > +	if (down_write_killable(&mm->mmap_sem))
> > > +		return -EINTR;
> > > +
> > > +	ret = -ENOMEM;
> > > +	vma = find_vma(current->mm, start);
> > > +	if (!vma)
> > > +		goto out;
> > > +
> > > +	ret = -EINVAL;
> > > +	if (vma->vm_start > start)
> > > +		goto out;
> > > +
> > > +	if (start > vma->vm_start)
> > > +		prev = vma;
> > > +	else
> > > +		prev = vma->vm_prev;
> > > +
> > > +	ret = 0;
> > > +	while (true) {
> > > +		unsigned long newflags, tmp;
> > > +
> > > +		tmp = vma->vm_end;
> > > +		if (tmp > end)
> > > +			tmp = end;
> > > +
> > > +		newflags = vma->vm_flags;
> > > +		if (protect)
> > > +			newflags |= VM_KVM_PROTECTED;
> > > +		else
> > > +			newflags &= ~VM_KVM_PROTECTED;
> > > +
> > > +		/* The VMA has been handled as part of other memslot */
> > > +		if (newflags == vma->vm_flags)
> > > +			goto next;
> > > +
> > > +		ret = mprotect_fixup(vma, &prev, start, tmp, newflags);
> > > +		if (ret)
> > > +			goto out;
> > > +
> > > +next:
> > > +		start = tmp;
> > > +		if (start < prev->vm_end)
> > > +			start = prev->vm_end;
> > > +
> > > +		if (start >= end)
> > > +			goto out;
> > > +
> > > +		vma = prev->vm_next;
> > > +		if (!vma || vma->vm_start != start) {
> > > +			ret = -ENOMEM;
> > > +			goto out;
> > > +		}
> > > +	}
> > > +out:
> > > +	up_write(&mm->mmap_sem);
> > > +	return ret;
> > > +}
> > > +
> > > +int kvm_protect_memory(struct kvm *kvm,
> > > +		       unsigned long gfn, unsigned long npages, bool protect)
> > > +{
> > > +	struct kvm_memory_slot *memslot;
> > > +	unsigned long start, end;
> > > +	gfn_t numpages;
> > > +
> > > +	if (!VM_KVM_PROTECTED)
> > > +		return -KVM_ENOSYS;
> > > +
> > > +	if (!npages)
> > > +		return 0;
> > > +
> > > +	memslot = gfn_to_memslot(kvm, gfn);
> > > +	/* Not backed by memory. It's okay. */
> > > +	if (!memslot)
> > > +		return 0;
> > > +
> > > +	start = gfn_to_hva_many(memslot, gfn, &numpages);
> > > +	end = start + npages * PAGE_SIZE;
> > > +
> > > +	/* XXX: Share range across memory slots? */
> > > +	if (WARN_ON(numpages < npages))
> > > +		return -EINVAL;
> > > +
> > > +	return protect_memory(start, end, protect);
> > > +}
> > > +EXPORT_SYMBOL_GPL(kvm_protect_memory);
> > > +
> > > +int kvm_protect_all_memory(struct kvm *kvm)
> > > +{
> > > +	struct kvm_memslots *slots;
> > > +	struct kvm_memory_slot *memslot;
> > > +	unsigned long start, end;
> > > +	int i, ret = 0;;
> > > +
> > > +	if (!VM_KVM_PROTECTED)
> > > +		return -KVM_ENOSYS;
> > > +
> > > +	mutex_lock(&kvm->slots_lock);
> > > +	kvm->mem_protected = true;
> > 
> > What will happen upon guest reboot? Do we need to unprotect everything
> > to make sure we'll be able to boot? Also, after the reboot how will the
> > guest know that it is protected and needs to unprotect things? -> see my
> > idea about converting KVM_HC_ENABLE_MEM_PROTECTED to a stateful MSR (but
> > we'll likely have to reset it upon reboot anyway).
> 
> That's an extremely good question. I have not considered reboot. I tend to use
> -no-reboot in my setup.
> 
> I'll think about how to deal with reboot. I don't know how it works now, so I
> can't give a good answer yet.
> 
> There may not be a good solution: unprotecting memory on reboot means we
> expose user data. We can wipe the data before unprotecting, but we should
> not wipe the BIOS or anything else that is required on reboot. I don't know.

If you let Qemu protect guest memory when creating the VM, rather than asking
the guest kernel to enable it when it boots, you won't have this problem. The
guest kernel then *queries* whether its memory is protected during boot. This
is consistent with SEV as well.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 02/16] x86/kvm: Introduce KVM memory protection feature
  2020-05-27  8:39         ` Vitaly Kuznetsov
  2020-05-27  8:52           ` Sean Christopherson
@ 2020-06-03  2:09           ` Huang, Kai
  2020-06-03 11:14             ` Vitaly Kuznetsov
  1 sibling, 1 reply; 62+ messages in thread
From: Huang, Kai @ 2020-06-03  2:09 UTC (permalink / raw)
  To: kirill, Christopherson, Sean J, vkuznets
  Cc: kvm, wad, Kleen, Andi, luto, aarcange, keescook, dave.hansen,
	wanpengli, kirill.shutemov, linux-kernel, pbonzini, linux-mm,
	joro, peterz, jmattson, Edgecombe, Rick P, rientjes, x86

On Wed, 2020-05-27 at 10:39 +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
> 
> > On Mon, May 25, 2020 at 06:15:25PM +0300, Kirill A. Shutemov wrote:
> > > On Mon, May 25, 2020 at 04:58:51PM +0200, Vitaly Kuznetsov wrote:
> > > > > @@ -727,6 +734,15 @@ static void __init kvm_init_platform(void)
> > > > >  {
> > > > >  	kvmclock_init();
> > > > >  	x86_platform.apic_post_init = kvm_apic_init;
> > > > > +
> > > > > +	if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
> > > > > +		if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
> > > > > +			pr_err("Failed to enable KVM memory
> > > > > protection\n");
> > > > > +			return;
> > > > > +		}
> > > > > +
> > > > > +		mem_protected = true;
> > > > > +	}
> > > > >  }
> > > > 
> > > > Personally, I'd prefer to do this via setting a bit in a KVM-specific
> > > > MSR instead. The benefit is that the guest doesn't need to remember if
> > > > it enabled the feature or not, it can always read the config msr. May
> > > > come handy for e.g. kexec/kdump.
> > > 
> > > I think we would need to remember it anyway. Accessing an MSR is somewhat
> > > expensive. But, okay, I can rework it as an MSR if needed.
> > 
> > I think Vitaly is talking about the case where the kernel can't easily get
> > at its cached state, e.g. after booting into a new kernel.  The kernel would
> > still have an X86_FEATURE bit or whatever, providing a virtual MSR would be
> > purely for rare slow paths.
> > 
> > That being said, a hypercall plus CPUID bit might be better, e.g. that'd
> > allow the guest to query the state without risking a #GP.
> 
> We have rdmsr_safe() for that! :-) MSR (and hypercall to that matter)
> should have an associated CPUID feature bit of course.
> 
> Yes, hypercall + CPUID would do but normally we treat CPUID data as
> static and in this case we'll make it a dynamically flipping
> bit. Especially if we introduce 'KVM_HC_DISABLE_MEM_PROTECTED' later.

Not sure why KVM_HC_DISABLE_MEM_PROTECTED is needed?

> 
> > > Note that we can avoid the enabling altogether if we modify the BIOS to deal
> > > with private/shared memory. Currently the BIOS gets a system crash if we enable
> > > the feature from time zero.
> > 
> > Which would mesh better with a CPUID feature bit.
> > 
> 
> And maybe even help us to resolve 'reboot' problem.

IMO we can ask Qemu to call the hypercall to 'enable' memory protection when
creating the VM, and the guest kernel *queries* whether it is protected via a
CPUID feature bit.
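
A minimal sketch of what the guest side might then look like
(KVM_FEATURE_MEM_PROTECTED_ACTIVE is a made-up name; only
KVM_FEATURE_MEM_PROTECTED and the hypercall exist in the posted series):

static void __init kvm_init_platform(void)
{
	kvmclock_init();
	x86_platform.apic_post_init = kvm_apic_init;

	/*
	 * Qemu already enabled protection at VM creation time; the guest
	 * only asks "am I protected?" instead of flipping the state itself.
	 */
	if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED_ACTIVE))
		mem_protected = true;
}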


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 02/16] x86/kvm: Introduce KVM memory protection feature
  2020-06-03  2:09           ` Huang, Kai
@ 2020-06-03 11:14             ` Vitaly Kuznetsov
  0 siblings, 0 replies; 62+ messages in thread
From: Vitaly Kuznetsov @ 2020-06-03 11:14 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, wad, Kleen, Andi, luto, aarcange, keescook, dave.hansen,
	wanpengli, kirill.shutemov, linux-kernel, pbonzini, linux-mm,
	joro, peterz, jmattson, Edgecombe, Rick P, rientjes, x86, kirill,
	Christopherson, Sean J

"Huang, Kai" <kai.huang@intel.com> writes:

> On Wed, 2020-05-27 at 10:39 +0200, Vitaly Kuznetsov wrote:
>> Sean Christopherson <sean.j.christopherson@intel.com> writes:
>> 
>> > On Mon, May 25, 2020 at 06:15:25PM +0300, Kirill A. Shutemov wrote:
>> > > On Mon, May 25, 2020 at 04:58:51PM +0200, Vitaly Kuznetsov wrote:
>> > > > > @@ -727,6 +734,15 @@ static void __init kvm_init_platform(void)
>> > > > >  {
>> > > > >  	kvmclock_init();
>> > > > >  	x86_platform.apic_post_init = kvm_apic_init;
>> > > > > +
>> > > > > +	if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
>> > > > > +		if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
>> > > > > +			pr_err("Failed to enable KVM memory
>> > > > > protection\n");
>> > > > > +			return;
>> > > > > +		}
>> > > > > +
>> > > > > +		mem_protected = true;
>> > > > > +	}
>> > > > >  }
>> > > > 
>> > > > Personally, I'd prefer to do this via setting a bit in a KVM-specific
>> > > > MSR instead. The benefit is that the guest doesn't need to remember if
>> > > > it enabled the feature or not, it can always read the config msr. May
>> > > > come handy for e.g. kexec/kdump.
>> > > 
>> > > I think we would need to remember it anyway. Accessing an MSR is somewhat
>> > > expensive. But, okay, I can rework it as an MSR if needed.
>> > 
>> > I think Vitaly is talking about the case where the kernel can't easily get
>> > at its cached state, e.g. after booting into a new kernel.  The kernel would
>> > still have an X86_FEATURE bit or whatever, providing a virtual MSR would be
>> > purely for rare slow paths.
>> > 
>> > That being said, a hypercall plus CPUID bit might be better, e.g. that'd
>> > allow the guest to query the state without risking a #GP.
>> 
>> We have rdmsr_safe() for that! :-) MSR (and hypercall to that matter)
>> should have an associated CPUID feature bit of course.
>> 
>> Yes, hypercall + CPUID would do but normally we treat CPUID data as
>> static and in this case we'll make it a dynamically flipping
>> bit. Especially if we introduce 'KVM_HC_DISABLE_MEM_PROTECTED' later.
>
> Not sure why KVM_HC_DISABLE_MEM_PROTECTED is needed?
>

I didn't put much thought into it, but we may need it to support the 'kexec'
case, where no reboot is performed but we either need to pass the data
about which regions are protected from the old kernel to the new one, or
'unprotect everything'.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (17 preceding siblings ...)
  2020-05-25 13:47 ` Liran Alon
@ 2020-06-04 15:15 ` Marc Zyngier
  2020-06-04 15:48   ` Sean Christopherson
  18 siblings, 1 reply; 62+ messages in thread
From: Marc Zyngier @ 2020-06-04 15:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov, kernel-team, will

Hi Kirill,

Thanks for this.

On Fri, 22 May 2020 15:51:58 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> == Background / Problem ==
> 
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.
> 
> 
> == What does this set mitigate? ==
> 
>  - Host kernel ”accidental” access to guest data (think speculation)
> 
>  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> 
>  - Host userspace access to guest data (compromised qemu)
> 
> == What does this set NOT mitigate? ==
> 
>  - Full host kernel compromise.  Kernel will just map the pages again.
> 
>  - Hardware attacks

Just as a heads up, we (the Android kernel team) are currently
involved in something pretty similar for KVM/arm64 in order to bring
some level of confidentiality to guests.

The main idea is to de-privilege the host kernel by wrapping it in its
own nested set of page tables which allows us to remove memory
allocated to guests on a per-page basis. The core hypervisor runs more
or less independently at its own privilege level. It still is KVM
though, as we don't intend to reinvent the wheel.

Will has written a much more lingo-heavy description here:
https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/

This works for one of the virtualization modes that arm64 can use (what
we call non-VHE, or nVHE for short). The other mode (VHE), is much more
similar to what happens on other architectures, where the kernel and
the hypervisor are one single entity. In this case, we cannot use the
same trick with nested page tables, and have to rely on something that
would very much look like what you're proposing.

Note that the two modes of the architecture would benefit from this
work anyway, as I'd like the host to know that we've pulled memory
from under its feet. Since you have done most of the initial work, I
intend to give it a go on arm64 shortly and see what sticks.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-06-04 15:15 ` Marc Zyngier
@ 2020-06-04 15:48   ` Sean Christopherson
  2020-06-04 16:27     ` Marc Zyngier
  2020-06-04 16:35     ` Will Deacon
  0 siblings, 2 replies; 62+ messages in thread
From: Sean Christopherson @ 2020-06-04 15:48 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov, kernel-team, will,
	Jun Nakajima

+Jun

On Thu, Jun 04, 2020 at 04:15:23PM +0100, Marc Zyngier wrote:
> Hi Kirill,
> 
> Thanks for this.
> 
> On Fri, 22 May 2020 15:51:58 +0300
> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> 
> > == Background / Problem ==
> > 
> > There are a number of hardware features (MKTME, SEV) which protect guest
> > memory from some unauthorized host access. The patchset proposes a purely
> > software feature that mitigates some of the same host-side read-only
> > attacks.
> > 
> > 
> > == What does this set mitigate? ==
> > 
> >  - Host kernel ”accidental” access to guest data (think speculation)
> > 
> >  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> > 
> >  - Host userspace access to guest data (compromised qemu)
> > 
> > == What does this set NOT mitigate? ==
> > 
> >  - Full host kernel compromise.  Kernel will just map the pages again.
> > 
> >  - Hardware attacks
> 
> Just as a heads up, we (the Android kernel team) are currently
> involved in something pretty similar for KVM/arm64 in order to bring
> some level of confidentiality to guests.
> 
> The main idea is to de-privilege the host kernel by wrapping it in its
> own nested set of page tables which allows us to remove memory
> allocated to guests on a per-page basis. The core hypervisor runs more
> or less independently at its own privilege level. It still is KVM
> though, as we don't intend to reinvent the wheel.
> 
> Will has written a much more lingo-heavy description here:
> https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/

Pardon my arm64 ignorance...

IIUC, in this mode, the host kernel runs at EL1?  And to switch to a guest
it has to bounce through EL2, which is KVM, or at least a chunk of KVM?
I assume the EL1->EL2->EL1 switch is done by trapping an exception of some
form?

If all of the above are "yes", does KVM already have the necessary logic to
perform the EL1->EL2->EL1 switches, or is that being added as part of the
de-privileging effort?
 
> This works for one of the virtualization modes that arm64 can use (what
> we call non-VHE, or nVHE for short). The other mode (VHE), is much more
> similar to what happens on other architectures, where the kernel and
> the hypervisor are one single entity. In this case, we cannot use the
> same trick with nested page tables, and have to rely on something that
> would very much look like what you're proposing.
> 
> Note that the two modes of the architecture would benefit from this
> work anyway, as I'd like the host to know that we've pulled memory
> from under its feet. Since you have done most of the initial work, I
> intend to give it a go on arm64 shortly and see what sticks.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-06-04 15:48   ` Sean Christopherson
@ 2020-06-04 16:27     ` Marc Zyngier
  2020-06-04 16:35     ` Will Deacon
  1 sibling, 0 replies; 62+ messages in thread
From: Marc Zyngier @ 2020-06-04 16:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov, kernel-team, will,
	Jun Nakajima

Hi Sean,

On 2020-06-04 16:48, Sean Christopherson wrote:
> +Jun
> 
> On Thu, Jun 04, 2020 at 04:15:23PM +0100, Marc Zyngier wrote:
>> Hi Kirill,
>> 
>> Thanks for this.
>> 
>> On Fri, 22 May 2020 15:51:58 +0300
>> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>> 
>> > == Background / Problem ==
>> >
>> > There are a number of hardware features (MKTME, SEV) which protect guest
>> > memory from some unauthorized host access. The patchset proposes a purely
>> > software feature that mitigates some of the same host-side read-only
>> > attacks.
>> >
>> >
>> > == What does this set mitigate? ==
>> >
>> >  - Host kernel ”accidental” access to guest data (think speculation)
>> >
>> >  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>> >
>> >  - Host userspace access to guest data (compromised qemu)
>> >
>> > == What does this set NOT mitigate? ==
>> >
>> >  - Full host kernel compromise.  Kernel will just map the pages again.
>> >
>> >  - Hardware attacks
>> 
>> Just as a heads up, we (the Android kernel team) are currently
>> involved in something pretty similar for KVM/arm64 in order to bring
>> some level of confidentiality to guests.
>> 
>> The main idea is to de-privilege the host kernel by wrapping it in its
>> own nested set of page tables which allows us to remove memory
>> allocated to guests on a per-page basis. The core hypervisor runs more
>> or less independently at its own privilege level. It still is KVM
>> though, as we don't intend to reinvent the wheel.
>> 
>> Will has written a much more lingo-heavy description here:
>> https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/
> 
> Pardon my arm64 ignorance...
> 
> IIUC, in this mode, the host kernel runs at EL1?  And to switch to a 
> guest
> it has to bounce through EL2, which is KVM, or at least a chunk of KVM?
> I assume the EL1->EL2->EL1 switch is done by trapping an exception of 
> some
> form?
> 
> If all of the above are "yes", does KVM already have the necessary 
> logic to
> perform the EL1->EL2->EL1 switches, or is that being added as part of 
> the
> de-privileging effort?

KVM already handles the EL1->EL2->EL1 madness, meaning that from
an exception level perspective, the host kernel is already a guest.
It's just that this guest can directly change the hypervisor's text,
its page tables, and muck with just about everything else.

De-privileging the memory access to non host EL1 memory is where the
ongoing effort is.

          M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-06-04 15:48   ` Sean Christopherson
  2020-06-04 16:27     ` Marc Zyngier
@ 2020-06-04 16:35     ` Will Deacon
  2020-06-04 19:09       ` Nakajima, Jun
  1 sibling, 1 reply; 62+ messages in thread
From: Will Deacon @ 2020-06-04 16:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Kirill A. Shutemov, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi, x86, kvm,
	linux-mm, linux-kernel, Kirill A. Shutemov, kernel-team,
	Jun Nakajima

Hi Sean,

On Thu, Jun 04, 2020 at 08:48:35AM -0700, Sean Christopherson wrote:
> On Thu, Jun 04, 2020 at 04:15:23PM +0100, Marc Zyngier wrote:
> > On Fri, 22 May 2020 15:51:58 +0300
> > "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> > 
> > > == Background / Problem ==
> > > 
> > > There are a number of hardware features (MKTME, SEV) which protect guest
> > > memory from some unauthorized host access. The patchset proposes a purely
> > > software feature that mitigates some of the same host-side read-only
> > > attacks.
> > > 
> > > 
> > > == What does this set mitigate? ==
> > > 
> > >  - Host kernel ”accidental” access to guest data (think speculation)
> > > 
> > >  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> > > 
> > >  - Host userspace access to guest data (compromised qemu)
> > > 
> > > == What does this set NOT mitigate? ==
> > > 
> > >  - Full host kernel compromise.  Kernel will just map the pages again.
> > > 
> > >  - Hardware attacks
> > 
> > Just as a heads up, we (the Android kernel team) are currently
> > involved in something pretty similar for KVM/arm64 in order to bring
> > some level of confidentiality to guests.
> > 
> > The main idea is to de-privilege the host kernel by wrapping it in its
> > own nested set of page tables which allows us to remove memory
> > allocated to guests on a per-page basis. The core hypervisor runs more
> > or less independently at its own privilege level. It still is KVM
> > though, as we don't intend to reinvent the wheel.
> > 
> > Will has written a much more lingo-heavy description here:
> > https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/
> 
> Pardon my arm64 ignorance...

No, not at all!

> IIUC, in this mode, the host kernel runs at EL1?  And to switch to a guest
> it has to bounce through EL2, which is KVM, or at least a chunk of KVM?
> I assume the EL1->EL2->EL1 switch is done by trapping an exception of some
> form?

Yes, and this is actually the way that KVM works on some Arm CPUs today,
as the original virtualisation extensions in the Armv8 architecture do
not make it possible to run the kernel directly at EL2 (for example, there
is only one page-table base register). This was later addressed in the
architecture by the "Virtualisation Host Extensions (VHE)", and so KVM
supports both options.

With non-VHE today, there is a small amount of "world switch" code at
EL2 which is installed by the host kernel and provides a way to transition
between the host and the guest. If the host needs to do something at EL2
(e.g. privileged TLB invalidation), then it makes a hypercall (HVC instruction)
via the kvm_call_hyp() macro (and this ends up just being a function call
for VHE).
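
As a rough illustration (the callee below is one of the existing EL2 entry
points; exact call sites differ), a host-initiated privileged operation looks
like this:

	/*
	 * On nVHE this issues an HVC that lands in the small EL2
	 * world-switch code; on VHE it compiles down to a plain
	 * function call.
	 */
	kvm_call_hyp(__kvm_flush_vm_context);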

> If all of the above are "yes", does KVM already have the necessary logic to
> perform the EL1->EL2->EL1 switches, or is that being added as part of the
> de-privileging effort?

The logic is there as part of the non-VHE support code, but it's not great
from a security angle. For example, the guest stage-2 page-tables are still
allocated by the host, the host has complete access to guest and hypervisor
memory (including hypervisor text) and things like kvm_call_hyp() are a bit
of an open door. We're working on making the EL2 code more self contained,
so that after the host has initialised KVM, it can shut the door and the
hypervisor can install a stage-2 translation over the host, which limits its
access to hypervisor and guest memory. There will clearly be IOMMU work as
well to prevent DMA attacks.

Will

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-06-04 16:35     ` Will Deacon
@ 2020-06-04 19:09       ` Nakajima, Jun
  2020-06-04 21:03         ` Jim Mattson
  0 siblings, 1 reply; 62+ messages in thread
From: Nakajima, Jun @ 2020-06-04 19:09 UTC (permalink / raw)
  To: Will Deacon
  Cc: Christopherson, Sean J, Marc Zyngier, Kirill A. Shutemov,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, kernel-team

> 
> On Jun 4, 2020, at 9:35 AM, Will Deacon <will@kernel.org> wrote:
> 
> Hi Sean,
> 
> On Thu, Jun 04, 2020 at 08:48:35AM -0700, Sean Christopherson wrote:
>> On Thu, Jun 04, 2020 at 04:15:23PM +0100, Marc Zyngier wrote:
>>> On Fri, 22 May 2020 15:51:58 +0300
>>> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>>> 
>>>> == Background / Problem ==
>>>> 
>>>> There are a number of hardware features (MKTME, SEV) which protect guest
>>>> memory from some unauthorized host access. The patchset proposes a purely
>>>> software feature that mitigates some of the same host-side read-only
>>>> attacks.
>>>> 
>>>> 
>>>> == What does this set mitigate? ==
>>>> 
>>>> - Host kernel ”accidental” access to guest data (think speculation)
>>>> 
>>>> - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>>>> 
>>>> - Host userspace access to guest data (compromised qemu)
>>>> 
>>>> == What does this set NOT mitigate? ==
>>>> 
>>>> - Full host kernel compromise.  Kernel will just map the pages again.
>>>> 
>>>> - Hardware attacks
>>> 
>>> Just as a heads up, we (the Android kernel team) are currently
>>> involved in something pretty similar for KVM/arm64 in order to bring
>>> some level of confidentiality to guests.
>>> 
>>> The main idea is to de-privilege the host kernel by wrapping it in its
>>> own nested set of page tables which allows us to remove memory
>>> allocated to guests on a per-page basis. The core hypervisor runs more
>>> or less independently at its own privilege level. It still is KVM
>>> though, as we don't intend to reinvent the wheel.
>>> 
>>> Will has written a much more lingo-heavy description here:
>>> https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/
>> 

We (Intel virtualization team) are also working on a similar thing, prototyping to meet such requirements, i.e. "some level of confidentiality to guests”. Linux/KVM is the host, and Kirill’s patches are helpful when removing the mappings from the host to achieve memory isolation of a guest. But it’s not easy to prove there are no other mappings.

To raise the level of security, our idea is to de-privilege the host kernel just to enforce memory isolation using EPT (Extended Page Table) that virtualizes guest (the host kernel in this case) physical memory; almost everything is passthrough. And the EPT for the host kernel excludes the memory for the guest(s) that has confidential info. So, the host kernel shouldn’t cause VM exits as long as it’s behaving well (CPUID still causes a VM exit, though). 

When control enters KVM, we go back to privileged (hypervisor or root) mode, and it works as it does today. Once a VM exit happens, we stay in root mode as long as the exit can be handled within KVM. If we need to depend on the host kernel, we de-privilege the host kernel again (i.e. VM enter). Yes, it sounds ugly.

There are cleaner (but more expensive) approaches, and we are collecting data at this point. For example, we could run the host kernel (like Xen dom0) on top of a thin? hypervisor that consists of KVM and minimally configured Linux.  

> 
>> IIUC, in this mode, the host kernel runs at EL1?  And to switch to a guest
>> it has to bounce through EL2, which is KVM, or at least a chunk of KVM?
>> I assume the EL1->EL2->EL1 switch is done by trapping an exception of some
>> form?
> 
> Yes, and this is actually the way that KVM works on some Arm CPUs today,
> as the original virtualisation extensions in the Armv8 architecture do
> not make it possible to run the kernel directly at EL2 (for example, there
> is only one page-table base register). This was later addressed in the
> architecture by the "Virtualisation Host Extensions (VHE)", and so KVM
> supports both options.
> 
> With non-VHE today, there is a small amount of "world switch" code at
> EL2 which is installed by the host kernel and provides a way to transition
> between the host and the guest. If the host needs to do something at EL2
> (e.g. privileged TLB invalidation), then it makes a hypercall (HVC instruction)
> via the kvm_call_hyp() macro (and this ends up just being a function call
> for VHE).
> 
>> If all of the above are "yes", does KVM already have the necessary logic to
>> perform the EL1->EL2->EL1 switches, or is that being added as part of the
>> de-privileging effort?
> 
> The logic is there as part of the non-VHE support code, but it's not great
> from a security angle. For example, the guest stage-2 page-tables are still
> allocated by the host, the host has complete access to guest and hypervisor
> memory (including hypervisor text) and things like kvm_call_hyp() are a bit
> of an open door. We're working on making the EL2 code more self contained,
> so that after the host has initialised KVM, it can shut the door and the
> hypervisor can install a stage-2 translation over the host, which limits its
> access to hypervisor and guest memory. There will clearly be IOMMU work as
> well to prevent DMA attacks.

Sounds interesting. 

--- 
Jun
Intel Open Source Technology Center






^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-06-04 19:09       ` Nakajima, Jun
@ 2020-06-04 21:03         ` Jim Mattson
  2020-06-04 23:29           ` Nakajima, Jun
  0 siblings, 1 reply; 62+ messages in thread
From: Jim Mattson @ 2020-06-04 21:03 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: Will Deacon, Christopherson, Sean J, Marc Zyngier,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel,
	David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, kernel-team

On Thu, Jun 4, 2020 at 12:09 PM Nakajima, Jun <jun.nakajima@intel.com> wrote:

> We (Intel virtualization team) are also working on a similar thing, prototyping to meet such requirements, i.e. "some level of confidentiality to guests”. Linux/KVM is the host, and Kirill’s patches are helpful when removing the mappings from the host to achieve memory isolation of a guest. But it’s not easy to prove there are no other mappings.
>
> To raise the level of security, our idea is to de-privilege the host kernel just to enforce memory isolation using EPT (Extended Page Table) that virtualizes guest (the host kernel in this case) physical memory; almost everything is passthrough. And the EPT for the host kernel excludes the memory for the guest(s) that has confidential info. So, the host kernel shouldn’t cause VM exits as long as it’s behaving well (CPUID still causes a VM exit, though).

You're Intel. Can't you just change the CPUID intercept from required
to optional? It seems like this should be in the realm of a small
microcode patch.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 00/16] KVM protected memory extension
  2020-06-04 21:03         ` Jim Mattson
@ 2020-06-04 23:29           ` Nakajima, Jun
  0 siblings, 0 replies; 62+ messages in thread
From: Nakajima, Jun @ 2020-06-04 23:29 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Will Deacon, Christopherson, Sean J, Marc Zyngier,
	Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel,
	David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov, kernel-team

> 
> On Jun 4, 2020, at 2:03 PM, Jim Mattson <jmattson@google.com> wrote:
> 
> On Thu, Jun 4, 2020 at 12:09 PM Nakajima, Jun <jun.nakajima@intel.com> wrote:
> 
>> We (Intel virtualization team) are also working on a similar thing, prototyping to meet such requirements, i.e. "some level of confidentiality to guests”. Linux/KVM is the host, and Kirill’s patches are helpful when removing the mappings from the host to achieve memory isolation of a guest. But it’s not easy to prove there are no other mappings.
>> 
>> To raise the level of security, our idea is to de-privilege the host kernel just to enforce memory isolation using EPT (Extended Page Table) that virtualizes guest (the host kernel in this case) physical memory; almost everything is passthrough. And the EPT for the host kernel excludes the memory for the guest(s) that has confidential info. So, the host kernel shouldn’t cause VM exits as long as it’s behaving well (CPUID still causes a VM exit, though).
> 
> You're Intel. Can't you just change the CPUID intercept from required
> to optional? It seems like this should be in the realm of a small
> microcode patch.

We’ll take a look. Probably it would be helpful even for the bare-metal kernel (e.g. debugging). 
Thanks for the suggestion.

--- 
Jun
Intel Open Source Technology Center



^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2020-06-04 23:29 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-22 12:51 [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
2020-05-22 12:51 ` [RFC 01/16] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
2020-05-22 12:52 ` [RFC 02/16] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
2020-05-25 14:58   ` Vitaly Kuznetsov
2020-05-25 15:15     ` Kirill A. Shutemov
2020-05-27  5:03       ` Sean Christopherson
2020-05-27  8:39         ` Vitaly Kuznetsov
2020-05-27  8:52           ` Sean Christopherson
2020-06-03  2:09           ` Huang, Kai
2020-06-03 11:14             ` Vitaly Kuznetsov
2020-05-22 12:52 ` [RFC 03/16] x86/kvm: Make DMA pages shared Kirill A. Shutemov
2020-05-22 12:52 ` [RFC 04/16] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
2020-05-22 12:52 ` [RFC 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest Kirill A. Shutemov
2020-05-22 12:52 ` [RFC 06/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory Kirill A. Shutemov
2020-05-25 15:08   ` Vitaly Kuznetsov
2020-05-25 15:17     ` Kirill A. Shutemov
2020-06-01 16:35       ` Paolo Bonzini
2020-06-02 13:33         ` Kirill A. Shutemov
2020-05-26  6:14   ` Mike Rapoport
2020-05-26 21:56     ` Kirill A. Shutemov
2020-05-29 15:24   ` Kees Cook
2020-05-22 12:52 ` [RFC 07/16] KVM: mm: Introduce VM_KVM_PROTECTED Kirill A. Shutemov
2020-05-26  6:15   ` Mike Rapoport
2020-05-26 22:01     ` Kirill A. Shutemov
2020-05-26  6:40   ` John Hubbard
2020-05-26 22:04     ` Kirill A. Shutemov
2020-05-22 12:52 ` [RFC 08/16] KVM: x86: Use GUP for page walk instead of __get_user() Kirill A. Shutemov
2020-05-22 12:52 ` [RFC 09/16] KVM: Protected memory extension Kirill A. Shutemov
2020-05-25 15:26   ` Vitaly Kuznetsov
2020-05-25 15:34     ` Kirill A. Shutemov
2020-06-03  1:34       ` Huang, Kai
2020-05-22 12:52 ` [RFC 10/16] KVM: x86: Enabled protected " Kirill A. Shutemov
2020-05-25 15:26   ` Vitaly Kuznetsov
2020-05-26  6:16   ` Mike Rapoport
2020-05-26 21:58     ` Kirill A. Shutemov
2020-05-22 12:52 ` [RFC 11/16] KVM: Rework copy_to/from_guest() to avoid direct mapping Kirill A. Shutemov
2020-05-22 12:52 ` [RFC 12/16] x86/kvm: Share steal time page with host Kirill A. Shutemov
2020-05-22 12:52 ` [RFC 13/16] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
2020-05-25 15:22   ` Vitaly Kuznetsov
2020-05-25 15:25     ` Kirill A. Shutemov
2020-05-25 15:42       ` Vitaly Kuznetsov
2020-05-22 12:52 ` [RFC 14/16] KVM: Introduce gfn_to_pfn_memslot_protected() Kirill A. Shutemov
2020-05-22 12:52 ` [RFC 15/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn() Kirill A. Shutemov
2020-05-22 12:52 ` [RFC 16/16] KVM: Unmap protected pages from direct mapping Kirill A. Shutemov
2020-05-26  6:16   ` Mike Rapoport
2020-05-26 22:10     ` Kirill A. Shutemov
2020-05-25  5:27 ` [RFC 00/16] KVM protected memory extension Kirill A. Shutemov
2020-05-25 13:47 ` Liran Alon
2020-05-25 14:46   ` Kirill A. Shutemov
2020-05-25 15:56     ` Liran Alon
2020-05-26  6:17   ` Mike Rapoport
2020-05-26 10:16     ` Liran Alon
2020-05-26 11:38       ` Mike Rapoport
2020-05-27 15:45         ` Dave Hansen
2020-05-27 21:22           ` Mike Rapoport
2020-06-04 15:15 ` Marc Zyngier
2020-06-04 15:48   ` Sean Christopherson
2020-06-04 16:27     ` Marc Zyngier
2020-06-04 16:35     ` Will Deacon
2020-06-04 19:09       ` Nakajima, Jun
2020-06-04 21:03         ` Jim Mattson
2020-06-04 23:29           ` Nakajima, Jun
