linux-mm.kvack.org archive mirror
* [RFCv2 00/16] KVM protected memory extension
@ 2020-10-20  6:18 Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 01/16] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
                   ` (17 more replies)
  0 siblings, 18 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

== Background / Problem ==

There are a number of hardware features (MKTME, SEV) which protect guest
memory from some unauthorized host access. The patchset proposes a purely
software feature that mitigates some of the same host-side read-only
attacks.


== What does this set mitigate? ==

 - Host kernel "accidental" access to guest data (think speculation)

 - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))

 - Host userspace access to guest data (compromised qemu)

 - Guest privilege escalation via compromised QEMU device emulation

== What does this set NOT mitigate? ==

 - Full host kernel compromise.  Kernel will just map the pages again.

 - Hardware attacks


The second RFC revision addresses /most/ of the feedback.

I still haven't found a good solution for reboot and kexec. Unprotecting
all the memory on such operations defeats the goal of the feature.
Clearing most of the memory before unprotecting what is required for
reboot (or kexec) is tedious and error-prone.
Maybe we should just declare them unsupported?

== Series Overview ==

The hardware features protect guest data by encrypting it and then
ensuring that only the right guest can decrypt it.  This has the
side-effect of making the kernel direct map and userspace mapping
(QEMU et al) useless.  But, this teaches us something very useful:
neither the kernel nor the userspace mappings are really necessary for
normal guest operation.

Instead of using encryption, this series simply unmaps the memory. One
advantage compared to allowing access to ciphertext is that it allows bad
accesses to be caught instead of simply reading garbage.

Protection from physical attacks needs to be provided by some other means.
On Intel platforms, (single-key) Total Memory Encryption (TME) provides
mitigation against physical attacks, such as DIMM interposers sniffing
memory bus traffic.

The patchset modifies both the host and guest kernels. The guest OS must
enable the feature via a hypercall and mark any memory range that has to
be shared with the host: DMA regions, bounce buffers, etc. SEV does this
marking via a bit in the guest's page table, while this approach uses a
hypercall.
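For illustration, here is a tiny user-space model of the guest-side
marking, with kvm_hypercall2() replaced by a recording stub. The
hypercall numbers match the ones introduced later in the series;
set_memory_shared() itself is a hypothetical helper for the sketch, not a
function from the patches:

```c
#include <assert.h>
#include <stdint.h>

#define KVM_HC_MEM_SHARE	13
#define KVM_HC_MEM_UNSHARE	14
#define PAGE_SHIFT		12

/* Stub standing in for the real kvm_hypercall2(); records the request. */
static long last_call, last_gfn, last_npages;

static long kvm_hypercall2(long nr, unsigned long gfn, unsigned long npages)
{
	last_call = nr;
	last_gfn = gfn;
	last_npages = npages;
	return 0;	/* assume the host accepts the request */
}

/*
 * Model of the guest-side conversion: pick the hypercall based on the
 * requested direction (enc=1: back to private, enc=0: share with host).
 */
static long set_memory_shared(unsigned long paddr, int numpages, int enc)
{
	unsigned long gfn = paddr >> PAGE_SHIFT;
	long call = enc ? KVM_HC_MEM_UNSHARE : KVM_HC_MEM_SHARE;

	return kvm_hypercall2(call, gfn, numpages);
}
```

The same share/unshare selection shows up in __set_memory_enc_dec() in
patch 03 of the series.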

To remove the userspace mapping, the series uses a trick similar to what
NUMA balancing does: memory that belongs to KVM memory slots is converted
to PROT_NONE: all existing entries are converted to PROT_NONE with
mprotect() and newly faulted-in pages get PROT_NONE from the updated
vm_page_prot. The new VMA flag -- VM_KVM_PROTECTED -- indicates that the
pages in the VMA must be treated in a special way in the GUP and fault
paths. The flag allows GUP to return the page even though it is mapped
with PROT_NONE, but only if the new GUP flag -- FOLL_KVM -- is specified.
Any userspace access to the memory results in SIGBUS. Any GUP access
without FOLL_KVM results in -EFAULT.
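The access rule can be summarized in a small hypothetical model (plain C,
not kernel code; the flag values are made up for the sketch, only the
decision logic mirrors the description):

```c
#include <assert.h>
#include <errno.h>

#define VM_KVM_PROTECTED	0x1	/* hypothetical value for the sketch */
#define FOLL_KVM		0x2	/* hypothetical value for the sketch */

/*
 * Model of the GUP-path decision: a page in a VM_KVM_PROTECTED VMA is
 * only returned when the caller passes FOLL_KVM; any other GUP access
 * fails with -EFAULT.
 */
static int gup_check_access(unsigned long vm_flags, unsigned int gup_flags)
{
	if (!(vm_flags & VM_KVM_PROTECTED))
		return 0;		/* ordinary VMA: no restriction */
	if (gup_flags & FOLL_KVM)
		return 0;		/* KVM's own access is allowed */
	return -EFAULT;			/* everyone else is rejected */
}
```

(Plain userspace loads/stores never reach GUP; they fault on the
PROT_NONE mapping and get SIGBUS, as described above.)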

Removing the userspace mapping of the guest memory from the QEMU process
helps to address some guest privilege escalation attacks. Consider the
case when an unprivileged guest user exploits a bug in QEMU device
emulation to gain access to data it cannot normally access within the
guest.

Any anonymous page faulted into a VM_KVM_PROTECTED VMA gets removed from
the direct mapping with kernel_map_pages(). Note that kernel_map_pages()
only flushes the local TLB. I think it's a reasonable compromise between
security and performance.

Zapping the PTE brings the page back to the direct mapping after it is
cleared. At least for now, we don't remove file-backed pages from the
direct mapping: file-backed pages could be accessed via read/write
syscalls, and handling that adds complexity.

Occasionally, the host kernel has to access guest memory that was not
made shared by the guest. For instance, this happens for instruction
emulation. Normally, it's done via copy_to/from_user(), which would now
fail with -EFAULT. We introduce a new pair of helpers:
copy_to/from_guest(). The new helpers acquire the page via GUP, map it
into the kernel address space with a kmap_atomic()-style mechanism and
only then copy the data.
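A rough user-space sketch of that pin/map/copy/unmap shape, with GUP and
kmap_atomic() replaced by trivial stubs (none of these helper names are
from the patches; only the flow is the point):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy "guest page" plus stubs for the kernel primitives named above. */
static unsigned char guest_page[4096];

static void *gup_and_kmap(unsigned long offset)
{
	return guest_page + offset;	/* GUP + kmap_atomic() in real code */
}

static void kunmap_and_put(void *addr)
{
	(void)addr;			/* kunmap_atomic() + put_page() */
}

/* Model of copy_from_guest(): pin the page, map it, copy, unmap. */
static int copy_from_guest(void *dst, unsigned long offset, size_t len)
{
	void *src = gup_and_kmap(offset);

	if (!src)
		return -1;
	memcpy(dst, src, len);
	kunmap_and_put(src);
	return 0;
}
```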

For some instruction emulation, copying is not good enough: cmpxchg
emulation has to have direct access to the guest memory. __kvm_map_gfn()
is modified to accommodate this case.

The patchset is on top of v5.9.

Kirill A. Shutemov (16):
  x86/mm: Move force_dma_unencrypted() to common code
  x86/kvm: Introduce KVM memory protection feature
  x86/kvm: Make DMA pages shared
  x86/kvm: Use bounce buffers for KVM memory protection
  x86/kvm: Make VirtIO use DMA API in KVM guest
  x86/kvmclock: Share hvclock memory with the host
  x86/realmode: Share trampoline area if KVM memory protection enabled
  KVM: Use GUP instead of copy_from/to_user() to access guest memory
  KVM: mm: Introduce VM_KVM_PROTECTED
  KVM: x86: Use GUP for page walk instead of __get_user()
  KVM: Protected memory extension
  KVM: x86: Enabled protected memory extension
  KVM: Rework copy_to/from_guest() to avoid direct mapping
  KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
  KVM: Unmap protected pages from direct mapping
  mm: Do not use zero page for VM_KVM_PROTECTED VMAs

 arch/powerpc/kvm/book3s_64_mmu_hv.c    |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   2 +-
 arch/s390/include/asm/pgtable.h        |   2 +-
 arch/x86/Kconfig                       |  11 +-
 arch/x86/include/asm/cpufeatures.h     |   1 +
 arch/x86/include/asm/io.h              |   6 +-
 arch/x86/include/asm/kvm_para.h        |   5 +
 arch/x86/include/asm/pgtable_types.h   |   1 +
 arch/x86/include/uapi/asm/kvm_para.h   |   3 +-
 arch/x86/kernel/kvm.c                  |  20 +++
 arch/x86/kernel/kvmclock.c             |   2 +-
 arch/x86/kernel/pci-swiotlb.c          |   3 +-
 arch/x86/kvm/Kconfig                   |   1 +
 arch/x86/kvm/cpuid.c                   |   3 +-
 arch/x86/kvm/mmu/mmu.c                 |   6 +-
 arch/x86/kvm/mmu/paging_tmpl.h         |  10 +-
 arch/x86/kvm/x86.c                     |   9 +
 arch/x86/mm/Makefile                   |   2 +
 arch/x86/mm/ioremap.c                  |  16 +-
 arch/x86/mm/mem_encrypt.c              |  51 ------
 arch/x86/mm/mem_encrypt_common.c       |  62 +++++++
 arch/x86/mm/pat/set_memory.c           |   7 +
 arch/x86/realmode/init.c               |   7 +-
 drivers/virtio/virtio_ring.c           |   4 +
 include/linux/kvm_host.h               |  11 +-
 include/linux/kvm_types.h              |   1 +
 include/linux/mm.h                     |  21 ++-
 include/uapi/linux/kvm_para.h          |   5 +-
 mm/gup.c                               |  20 ++-
 mm/huge_memory.c                       |  31 +++-
 mm/ksm.c                               |   2 +
 mm/memory.c                            |  18 +-
 mm/mmap.c                              |   3 +
 mm/rmap.c                              |   4 +
 virt/kvm/Kconfig                       |   3 +
 virt/kvm/async_pf.c                    |   2 +-
 virt/kvm/kvm_main.c                    | 238 +++++++++++++++++++++----
 virt/lib/Makefile                      |   1 +
 virt/lib/mem_protected.c               | 193 ++++++++++++++++++++
 39 files changed, 666 insertions(+), 123 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c
 create mode 100644 virt/lib/mem_protected.c

-- 
2.26.2



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFCv2 01/16] x86/mm: Move force_dma_unencrypted() to common code
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 02/16] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

force_dma_unencrypted() has to return true for a KVM guest with memory
protection enabled. Move it out of the AMD SME code.

Introduce a new config option, X86_MEM_ENCRYPT_COMMON, that has to be
selected by all x86 memory encryption features.

This is preparation for the following patches.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                 |  8 +++++--
 arch/x86/include/asm/io.h        |  4 +++-
 arch/x86/mm/Makefile             |  2 ++
 arch/x86/mm/mem_encrypt.c        | 30 -------------------------
 arch/x86/mm/mem_encrypt_common.c | 38 ++++++++++++++++++++++++++++++++
 5 files changed, 49 insertions(+), 33 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..619ebf40e457 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1514,13 +1514,17 @@ config X86_CPA_STATISTICS
 	  helps to determine the effectiveness of preserving large and huge
 	  page mappings when mapping protections are changed.
 
+config X86_MEM_ENCRYPT_COMMON
+	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select DYNAMIC_PHYSICAL_MASK
+	def_bool n
+
 config AMD_MEM_ENCRYPT
 	bool "AMD Secure Memory Encryption (SME) support"
 	depends on X86_64 && CPU_SUP_AMD
 	select DMA_COHERENT_POOL
-	select DYNAMIC_PHYSICAL_MASK
 	select ARCH_USE_MEMREMAP_PROT
-	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index e1aa17a468a8..c58d52fd7bf2 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -256,10 +256,12 @@ static inline void slow_down_io(void)
 
 #endif
 
-#ifdef CONFIG_AMD_MEM_ENCRYPT
 #include <linux/jump_label.h>
 
 extern struct static_key_false sev_enable_key;
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
 static inline bool sev_key_active(void)
 {
 	return static_branch_unlikely(&sev_enable_key);
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)	+= mem_encrypt_common.o
+
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 9f1177edc2e7..4dbdc9dac36b 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -15,10 +15,6 @@
 #include <linux/dma-direct.h>
 #include <linux/swiotlb.h>
 #include <linux/mem_encrypt.h>
-#include <linux/device.h>
-#include <linux/kernel.h>
-#include <linux/bitops.h>
-#include <linux/dma-mapping.h>
 
 #include <asm/tlbflush.h>
 #include <asm/fixmap.h>
@@ -350,32 +346,6 @@ bool sev_active(void)
 	return sme_me_mask && sev_enabled;
 }
 
-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
-{
-	/*
-	 * For SEV, all DMA must be to unencrypted addresses.
-	 */
-	if (sev_active())
-		return true;
-
-	/*
-	 * For SME, all DMA must be to unencrypted addresses if the
-	 * device does not support DMA to addresses that include the
-	 * encryption mask.
-	 */
-	if (sme_active()) {
-		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
-		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
-						dev->bus_dma_limit);
-
-		if (dma_dev_mask <= dma_enc_mask)
-			return true;
-	}
-
-	return false;
-}
-
 void __init mem_encrypt_free_decrypted_mem(void)
 {
 	unsigned long vaddr, vaddr_end, npages;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..964e04152417
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/mem_encrypt.h>
+#include <linux/dma-mapping.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+	/*
+	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 */
+	if (sev_active())
+		return true;
+
+	/*
+	 * For SME, all DMA must be to unencrypted addresses if the
+	 * device does not support DMA to addresses that include the
+	 * encryption mask.
+	 */
+	if (sme_active()) {
+		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
+		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
+						dev->bus_dma_limit);
+
+		if (dma_dev_mask <= dma_enc_mask)
+			return true;
+	}
+
+	return false;
+}
-- 
2.26.2




* [RFCv2 02/16] x86/kvm: Introduce KVM memory protection feature
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 01/16] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 03/16] x86/kvm: Make DMA pages shared Kirill A. Shutemov
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Provide basic helpers, a KVM_FEATURE bit, a CPUID flag and a hypercall.

The host side doesn't provide the feature yet, so it is dead code for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/kvm_para.h      |  5 +++++
 arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
 arch/x86/kernel/kvm.c                | 18 ++++++++++++++++++
 include/uapi/linux/kvm_para.h        |  3 ++-
 5 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 2901d5df4366..a72157137764 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -236,6 +236,7 @@
 #define X86_FEATURE_EPT_AD		( 8*32+17) /* Intel Extended Page Table access-dirty bit */
 #define X86_FEATURE_VMCALL		( 8*32+18) /* "" Hypervisor supports the VMCALL instruction */
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
+#define X86_FEATURE_KVM_MEM_PROTECTED	( 8*32+20) /* KVM memory protection extension */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..74aea18f3130 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -11,11 +11,16 @@ extern void kvmclock_init(void);
 
 #ifdef CONFIG_KVM_GUEST
 bool kvm_check_and_clear_guest_paused(void);
+bool kvm_mem_protected(void);
 #else
 static inline bool kvm_check_and_clear_guest_paused(void)
 {
 	return false;
 }
+static inline bool kvm_mem_protected(void)
+{
+	return false;
+}
 #endif /* CONFIG_KVM_GUEST */
 
 #define KVM_HYPERCALL \
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 812e9b4c1114..defbfc630a9f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -28,10 +28,11 @@
 #define KVM_FEATURE_PV_UNHALT		7
 #define KVM_FEATURE_PV_TLB_FLUSH	9
 #define KVM_FEATURE_ASYNC_PF_VMEXIT	10
-#define KVM_FEATURE_PV_SEND_IPI	11
+#define KVM_FEATURE_PV_SEND_IPI		11
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
 #define KVM_FEATURE_ASYNC_PF_INT	14
+#define KVM_FEATURE_MEM_PROTECTED	15
 
 #define KVM_HINTS_REALTIME      0
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 9663ba31347c..2c1f8952b92a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -37,6 +37,13 @@
 #include <asm/tlb.h>
 #include <asm/cpuidle_haltpoll.h>
 
+static bool mem_protected;
+
+bool kvm_mem_protected(void)
+{
+	return mem_protected;
+}
+
 DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
 
 static int kvmapf = 1;
@@ -742,6 +749,17 @@ static void __init kvm_init_platform(void)
 {
 	kvmclock_init();
 	x86_platform.apic_post_init = kvm_apic_init;
+
+	if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
+		if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
+			pr_err("Failed to enable KVM memory protection\n");
+			return;
+		}
+
+		pr_info("KVM memory protection enabled\n");
+		mem_protected = true;
+		setup_force_cpu_cap(X86_FEATURE_KVM_MEM_PROTECTED);
+	}
 }
 
 const __initconst struct hypervisor_x86 x86_hyper_kvm = {
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 8b86609849b9..1a216f32e572 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -27,8 +27,9 @@
 #define KVM_HC_MIPS_EXIT_VM		7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
 #define KVM_HC_CLOCK_PAIRING		9
-#define KVM_HC_SEND_IPI		10
+#define KVM_HC_SEND_IPI			10
 #define KVM_HC_SCHED_YIELD		11
+#define KVM_HC_ENABLE_MEM_PROTECTED	12
 
 /*
  * hypercalls use architecture specific
-- 
2.26.2




* [RFCv2 03/16] x86/kvm: Make DMA pages shared
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 01/16] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 02/16] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 04/16] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Make force_dma_unencrypted() return true for KVM to get DMA pages mapped
as shared.

__set_memory_enc_dec() now informs the host via a hypercall when the
state of a page changes from shared to private or back.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                 | 1 +
 arch/x86/mm/mem_encrypt_common.c | 5 +++--
 arch/x86/mm/pat/set_memory.c     | 7 +++++++
 include/uapi/linux/kvm_para.h    | 2 ++
 4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 619ebf40e457..cd272e3babbc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -805,6 +805,7 @@ config KVM_GUEST
 	select PARAVIRT_CLOCK
 	select ARCH_CPUIDLE_HALTPOLL
 	select X86_HV_CALLBACK_VECTOR
+	select X86_MEM_ENCRYPT_COMMON
 	default y
 	help
 	  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 964e04152417..a878e7f246d5 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,14 +10,15 @@
 #include <linux/mm.h>
 #include <linux/mem_encrypt.h>
 #include <linux/dma-mapping.h>
+#include <asm/kvm_para.h>
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
 {
 	/*
-	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 * For SEV and KVM, all DMA must be to unencrypted/shared addresses.
 	 */
-	if (sev_active())
+	if (sev_active() || kvm_mem_protected())
 		return true;
 
 	/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index d1b2a889f035..4c49303126c9 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,7 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/kvm_para.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -1977,6 +1978,12 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	struct cpa_data cpa;
 	int ret;
 
+	if (kvm_mem_protected()) {
+		unsigned long gfn = __pa(addr) >> PAGE_SHIFT;
+		int call = enc ? KVM_HC_MEM_UNSHARE : KVM_HC_MEM_SHARE;
+		return kvm_hypercall2(call, gfn, numpages);
+	}
+
 	/* Nothing to do if memory encryption is not active */
 	if (!mem_encrypt_active())
 		return 0;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 1a216f32e572..c6d8c988e330 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -30,6 +30,8 @@
 #define KVM_HC_SEND_IPI			10
 #define KVM_HC_SCHED_YIELD		11
 #define KVM_HC_ENABLE_MEM_PROTECTED	12
+#define KVM_HC_MEM_SHARE		13
+#define KVM_HC_MEM_UNSHARE		14
 
 /*
  * hypercalls use architecture specific
-- 
2.26.2




* [RFCv2 04/16] x86/kvm: Use bounce buffers for KVM memory protection
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 03/16] x86/kvm: Make DMA pages shared Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest Kirill A. Shutemov
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Mirroring SEV, always use SWIOTLB if KVM memory protection is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/kernel/kvm.c            |  2 ++
 arch/x86/kernel/pci-swiotlb.c    |  3 ++-
 arch/x86/mm/mem_encrypt.c        | 21 ---------------------
 arch/x86/mm/mem_encrypt_common.c | 23 +++++++++++++++++++++++
 5 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cd272e3babbc..b22b95517437 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -806,6 +806,7 @@ config KVM_GUEST
 	select ARCH_CPUIDLE_HALTPOLL
 	select X86_HV_CALLBACK_VECTOR
 	select X86_MEM_ENCRYPT_COMMON
+	select SWIOTLB
 	default y
 	help
 	  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 2c1f8952b92a..30bb3d2d6ccd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -26,6 +26,7 @@
 #include <linux/kprobes.h>
 #include <linux/nmi.h>
 #include <linux/swait.h>
+#include <linux/swiotlb.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -759,6 +760,7 @@ static void __init kvm_init_platform(void)
 		pr_info("KVM memory protection enabled\n");
 		mem_protected = true;
 		setup_force_cpu_cap(X86_FEATURE_KVM_MEM_PROTECTED);
+		swiotlb_force = SWIOTLB_FORCE;
 	}
 }
 
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..814060a6ceb0 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -13,6 +13,7 @@
 #include <asm/dma.h>
 #include <asm/xen/swiotlb-xen.h>
 #include <asm/iommu_table.h>
+#include <asm/kvm_para.h>
 
 int swiotlb __read_mostly;
 
@@ -49,7 +50,7 @@ int __init pci_swiotlb_detect_4gb(void)
 	 * buffers are allocated and used for devices that do not support
 	 * the addressing range required for the encryption mask.
 	 */
-	if (sme_active())
+	if (sme_active() || kvm_mem_protected())
 		swiotlb = 1;
 
 	return swiotlb;
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 4dbdc9dac36b..5de64e068b0a 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -369,24 +369,3 @@ void __init mem_encrypt_free_decrypted_mem(void)
 
 	free_init_pages("unused decrypted", vaddr, vaddr_end);
 }
-
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
-{
-	if (!sme_me_mask)
-		return;
-
-	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-	swiotlb_update_mem_attributes();
-
-	/*
-	 * With SEV, we need to unroll the rep string I/O instructions.
-	 */
-	if (sev_active())
-		static_branch_enable(&sev_enable_key);
-
-	pr_info("AMD %s active\n",
-		sev_active() ? "Secure Encrypted Virtualization (SEV)"
-			     : "Secure Memory Encryption (SME)");
-}
-
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index a878e7f246d5..7900f3788010 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -37,3 +37,26 @@ bool force_dma_unencrypted(struct device *dev)
 
 	return false;
 }
+
+void __init mem_encrypt_init(void)
+{
+	if (!sme_me_mask && !kvm_mem_protected())
+		return;
+
+	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
+	swiotlb_update_mem_attributes();
+
+	/*
+	 * With SEV, we need to unroll the rep string I/O instructions.
+	 */
+	if (sev_active())
+		static_branch_enable(&sev_enable_key);
+
+	if (sme_me_mask) {
+		pr_info("AMD %s active\n",
+			sev_active() ? "Secure Encrypted Virtualization (SEV)"
+			: "Secure Memory Encryption (SME)");
+	} else {
+		pr_info("KVM memory protection enabled\n");
+	}
+}
-- 
2.26.2




* [RFCv2 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 04/16] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  8:06   ` Christoph Hellwig
  2020-10-22  3:31   ` Halil Pasic
  2020-10-20  6:18 ` [RFCv2 06/16] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

VirtIO is the primary way to provide I/O for KVM guests. All memory used
for communication with the host has to be marked as shared.

The easiest way to achieve that is to use the DMA API, which already
knows how to deal with shared memory.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/virtio/virtio_ring.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index becc77697960..ace733845d5d 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -12,6 +12,7 @@
 #include <linux/hrtimer.h>
 #include <linux/dma-mapping.h>
 #include <xen/xen.h>
+#include <asm/kvm_para.h>
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -255,6 +256,9 @@ static bool vring_use_dma_api(struct virtio_device *vdev)
 	if (xen_domain())
 		return true;
 
+	if (kvm_mem_protected())
+		return true;
+
 	return false;
 }
 
-- 
2.26.2




* [RFCv2 06/16] x86/kvmclock: Share hvclock memory with the host
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 07/16] x86/realmode: Share trampoline area if KVM memory protection enabled Kirill A. Shutemov
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

hvclock is shared between the guest and the hypervisor, so it has to be
accessible by the host.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/kvmclock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 34b18f6eeb2c..ac6c2abe0d0f 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -253,7 +253,7 @@ static void __init kvmclock_init_mem(void)
 	 * hvclock is shared between the guest and the hypervisor, must
 	 * be mapped decrypted.
 	 */
-	if (sev_active()) {
+	if (sev_active() || kvm_mem_protected()) {
 		r = set_memory_decrypted((unsigned long) hvclock_mem,
 					 1UL << order);
 		if (r) {
-- 
2.26.2



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFCv2 07/16] x86/realmode: Share trampoline area if KVM memory protection enabled
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 06/16] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory Kirill A. Shutemov
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

If KVM memory protection is active, the trampoline area will need to be
in shared memory.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/realmode/init.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 1ed1208931e0..7392940a7f96 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -9,6 +9,7 @@
 #include <asm/realmode.h>
 #include <asm/tlbflush.h>
 #include <asm/crash.h>
+#include <asm/kvm_para.h>
 
 struct real_mode_header *real_mode_header;
 u32 *trampoline_cr4_features;
@@ -55,11 +56,11 @@ static void __init setup_real_mode(void)
 	base = (unsigned char *)real_mode_header;
 
 	/*
-	 * If SME is active, the trampoline area will need to be in
-	 * decrypted memory in order to bring up other processors
+	 * If SME or KVM memory protection is active, the trampoline area will
+	 * need to be in decrypted memory in order to bring up other processors
 	 * successfully. This is not needed for SEV.
 	 */
-	if (sme_active())
+	if (sme_active() || kvm_mem_protected())
 		set_memory_decrypted((unsigned long)base, size >> PAGE_SHIFT);
 
 	memcpy(base, real_mode_blob, size);
-- 
2.26.2



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 07/16] x86/realmode: Share trampoline area if KVM memory protection enabled Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  8:25   ` John Hubbard
  2020-10-20 17:29   ` Ira Weiny
  2020-10-20  6:18 ` [RFCv2 09/16] KVM: mm: Introduce VM_KVM_PROTECTED Kirill A. Shutemov
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Add new helpers, copy_from_guest()/copy_to_guest(), to be used when the
KVM memory protection feature is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/kvm_host.h |  4 ++
 virt/kvm/kvm_main.c      | 90 +++++++++++++++++++++++++++++++---------
 2 files changed, 75 insertions(+), 19 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 05e3c2fb3ef7..380a64613880 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -504,6 +504,7 @@ struct kvm {
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
 	unsigned int max_halt_poll_ns;
+	bool mem_protected;
 };
 
 #define kvm_err(fmt, ...) \
@@ -728,6 +729,9 @@ void kvm_set_pfn_dirty(kvm_pfn_t pfn);
 void kvm_set_pfn_accessed(kvm_pfn_t pfn);
 void kvm_get_pfn(kvm_pfn_t pfn);
 
+int copy_from_guest(void *data, unsigned long hva, int len, bool protected);
+int copy_to_guest(unsigned long hva, const void *data, int len, bool protected);
+
 void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache);
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 			int len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cf88233b819a..a9884cb8c867 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2313,19 +2313,70 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
+int copy_from_guest(void *data, unsigned long hva, int len, bool protected)
+{
+	int offset = offset_in_page(hva);
+	struct page *page;
+	int npages, seg;
+
+	if (!protected)
+		return __copy_from_user(data, (void __user *)hva, len);
+
+	might_fault();
+	kasan_check_write(data, len);
+	check_object_size(data, len, false);
+
+	while ((seg = next_segment(len, offset)) != 0) {
+		npages = get_user_pages_unlocked(hva, 1, &page, 0);
+		if (npages != 1)
+			return -EFAULT;
+		memcpy(data, page_address(page) + offset, seg);
+		put_page(page);
+		len -= seg;
+		hva += seg;
+		offset = 0;
+	}
+
+	return 0;
+}
+
+int copy_to_guest(unsigned long hva, const void *data, int len, bool protected)
+{
+	int offset = offset_in_page(hva);
+	struct page *page;
+	int npages, seg;
+
+	if (!protected)
+		return __copy_to_user((void __user *)hva, data, len);
+
+	might_fault();
+	kasan_check_read(data, len);
+	check_object_size(data, len, true);
+
+	while ((seg = next_segment(len, offset)) != 0) {
+		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);
+		if (npages != 1)
+			return -EFAULT;
+		memcpy(page_address(page) + offset, data, seg);
+		put_page(page);
+		len -= seg;
+		hva += seg;
+		offset = 0;
+	}
+
+	return 0;
+}
+
 static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
-				 void *data, int offset, int len)
+				 void *data, int offset, int len,
+				 bool protected)
 {
-	int r;
 	unsigned long addr;
 
 	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
-	r = __copy_from_user(data, (void __user *)addr + offset, len);
-	if (r)
-		return -EFAULT;
-	return 0;
+	return copy_from_guest(data, addr + offset, len, protected);
 }
 
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
@@ -2333,7 +2384,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(slot, gfn, data, offset, len,
+				     kvm->mem_protected);
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_page);
 
@@ -2342,7 +2394,8 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(slot, gfn, data, offset, len,
+				     vcpu->kvm->mem_protected);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
 
@@ -2415,7 +2468,8 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
 static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
-			          const void *data, int offset, int len)
+			          const void *data, int offset, int len,
+				  bool protected)
 {
 	int r;
 	unsigned long addr;
@@ -2423,7 +2477,8 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
 	addr = gfn_to_hva_memslot(memslot, gfn);
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
-	r = __copy_to_user((void __user *)addr + offset, data, len);
+
+	r = copy_to_guest(addr + offset, data, len, protected);
 	if (r)
 		return -EFAULT;
 	mark_page_dirty_in_slot(memslot, gfn);
@@ -2435,7 +2490,8 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(slot, gfn, data, offset, len,
+				      kvm->mem_protected);
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_page);
 
@@ -2444,7 +2500,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(slot, gfn, data, offset, len,
+				      vcpu->kvm->mem_protected);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
 
@@ -2560,7 +2617,7 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 	if (unlikely(!ghc->memslot))
 		return kvm_write_guest(kvm, gpa, data, len);
 
-	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
+	r = copy_to_guest(ghc->hva + offset, data, len, kvm->mem_protected);
 	if (r)
 		return -EFAULT;
 	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
@@ -2581,7 +2638,6 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 				 unsigned long len)
 {
 	struct kvm_memslots *slots = kvm_memslots(kvm);
-	int r;
 	gpa_t gpa = ghc->gpa + offset;
 
 	BUG_ON(len + offset > ghc->len);
@@ -2597,11 +2653,7 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 	if (unlikely(!ghc->memslot))
 		return kvm_read_guest(kvm, gpa, data, len);
 
-	r = __copy_from_user(data, (void __user *)ghc->hva + offset, len);
-	if (r)
-		return -EFAULT;
-
-	return 0;
+	return copy_from_guest(data, ghc->hva + offset, len, kvm->mem_protected);
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_offset_cached);
 
-- 
2.26.2
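
The copy loop in the patch above splits a guest-virtual range into per-page
segments so that each page can be pinned with GUP and copied individually.
A userspace sketch of that chunking logic (PAGE_SIZE and the helper names
mirror the kernel code but are local stand-ins, not the kernel functions):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Mirrors kernel next_segment(): bytes copyable without crossing a page. */
static int next_segment(unsigned long len, int offset)
{
	if (len > PAGE_SIZE - offset)
		return PAGE_SIZE - offset;
	return len;
}

/*
 * Simulate copy_from_guest()'s protected path: walk the range page by
 * page, as get_user_pages_unlocked() pins exactly one page per iteration.
 * Returns the number of page-sized iterations the copy would take.
 */
static int count_copy_iterations(unsigned long hva, int len)
{
	int offset = hva & (PAGE_SIZE - 1);	/* offset_in_page() */
	int iterations = 0, seg;

	while ((seg = next_segment(len, offset)) != 0) {
		/* real code: pin the page, memcpy seg bytes, put_page() */
		len -= seg;
		hva += seg;
		offset = 0;
		iterations++;
	}
	return iterations;
}
```

A 300-byte copy starting 96 bytes before a page boundary takes two
iterations (96 bytes, then 204), which is why the loop resets offset to
zero after the first segment.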



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFCv2 09/16] KVM: mm: Introduce VM_KVM_PROTECTED
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-21 18:47   ` Edgecombe, Rick P
  2020-10-20  6:18 ` [RFCv2 10/16] KVM: x86: Use GUP for page walk instead of __get_user() Kirill A. Shutemov
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Add a new VMA flag that indicates a VMA not accessible to userspace but
usable by the kernel with GUP if FOLL_KVM is specified.

FOLL_KVM is only used in the KVM code, which has to know how to deal
with such pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h  |  8 ++++++++
 mm/gup.c            | 20 ++++++++++++++++----
 mm/huge_memory.c    | 20 ++++++++++++++++----
 mm/memory.c         |  3 +++
 mm/mmap.c           |  3 +++
 virt/kvm/async_pf.c |  2 +-
 virt/kvm/kvm_main.c |  9 +++++----
 7 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 16b799a0522c..c8d8cdcbc425 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -342,6 +342,8 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MAPPED_COPY	VM_ARCH_1	/* T if mapped copy of data (nommu mmap) */
 #endif
 
+#define VM_KVM_PROTECTED 0
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
@@ -658,6 +660,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_ACCESS_FLAGS;
 }
 
+static inline bool vma_is_kvm_protected(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_KVM_PROTECTED;
+}
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
@@ -2766,6 +2773,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_KVM	0x100000 /* access to VM_KVM_PROTECTED VMAs */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index e869c634cc9a..accf6db0c06f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -384,10 +384,19 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
  * FOLL_FORCE can write to even unwritable pte's, but only
  * after we've gone through a COW cycle and they are dirty.
  */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+static inline bool can_follow_write_pte(struct vm_area_struct *vma,
+					pte_t pte, unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	if (pte_write(pte))
+		return true;
+
+	if ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte))
+		return true;
+
+	if (!vma_is_kvm_protected(vma) || !(vma->vm_flags & VM_WRITE))
+		return false;
+
+	return (vma->vm_flags & VM_SHARED) || page_mapcount(pte_page(pte)) == 1;
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -430,7 +439,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
+	if ((flags & FOLL_WRITE) && !can_follow_write_pte(vma, pte, flags)) {
 		pte_unmap_unlock(ptep, ptl);
 		return NULL;
 	}
@@ -750,6 +759,9 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 
 	ctx->page_mask = 0;
 
+	if (vma_is_kvm_protected(vma) && (flags & FOLL_KVM))
+		flags &= ~FOLL_NUMA;
+
 	/* make this handle hugepd */
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index da397779a6d4..ec8cf9a40cfd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1322,10 +1322,19 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
  * FOLL_FORCE can write to even unwritable pmd's, but only
  * after we've gone through a COW cycle and they are dirty.
  */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+static inline bool can_follow_write_pmd(struct vm_area_struct *vma,
+					pmd_t pmd, unsigned int flags)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	if (pmd_write(pmd))
+		return true;
+
+	if ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd))
+		return true;
+
+	if (!vma_is_kvm_protected(vma) || !(vma->vm_flags & VM_WRITE))
+		return false;
+
+	return (vma->vm_flags & VM_SHARED) || page_mapcount(pmd_page(pmd)) == 1;
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1338,7 +1347,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
+	if (flags & FOLL_WRITE && !can_follow_write_pmd(vma, *pmd, flags))
 		goto out;
 
 	/* Avoid dumping huge zero page */
@@ -1412,6 +1421,9 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	bool was_writable;
 	int flags = 0;
 
+	if (vma_is_kvm_protected(vma))
+		return VM_FAULT_SIGBUS;
+
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_same(pmd, *vmf->pmd)))
 		goto out_unlock;
diff --git a/mm/memory.c b/mm/memory.c
index eeae590e526a..2c9756b4e52f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4165,6 +4165,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	bool was_writable = pte_savedwrite(vmf->orig_pte);
 	int flags = 0;
 
+	if (vma_is_kvm_protected(vma))
+		return VM_FAULT_SIGBUS;
+
 	/*
 	 * The "pte" at this point cannot be used safely without
 	 * validation through pte_unmap_same(). It's of NUMA type but
diff --git a/mm/mmap.c b/mm/mmap.c
index bdd19f5b994e..be699f688b6c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -112,6 +112,9 @@ pgprot_t vm_get_page_prot(unsigned long vm_flags)
 				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
 			pgprot_val(arch_vm_get_page_prot(vm_flags)));
 
+	if (vm_flags & VM_KVM_PROTECTED)
+		ret = PAGE_NONE;
+
 	return arch_filter_pgprot(ret);
 }
 EXPORT_SYMBOL(vm_get_page_prot);
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index dd777688d14a..85a2f99f6e9b 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -61,7 +61,7 @@ static void async_pf_execute(struct work_struct *work)
 	 * access remotely.
 	 */
 	mmap_read_lock(mm);
-	get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, NULL,
+	get_user_pages_remote(mm, addr, 1, FOLL_WRITE | FOLL_KVM, NULL, NULL,
 			&locked);
 	if (locked)
 		mmap_read_unlock(mm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a9884cb8c867..125db5a73e10 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1794,7 +1794,7 @@ unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *w
 
 static inline int check_user_page_hwpoison(unsigned long addr)
 {
-	int rc, flags = FOLL_HWPOISON | FOLL_WRITE;
+	int rc, flags = FOLL_HWPOISON | FOLL_WRITE | FOLL_KVM;
 
 	rc = get_user_pages(addr, 1, flags, NULL, NULL);
 	return rc == -EHWPOISON;
@@ -1836,7 +1836,7 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 			   bool *writable, kvm_pfn_t *pfn)
 {
-	unsigned int flags = FOLL_HWPOISON;
+	unsigned int flags = FOLL_HWPOISON | FOLL_KVM;
 	struct page *page;
 	int npages = 0;
 
@@ -2327,7 +2327,7 @@ int copy_from_guest(void *data, unsigned long hva, int len, bool protected)
 	check_object_size(data, len, false);
 
 	while ((seg = next_segment(len, offset)) != 0) {
-		npages = get_user_pages_unlocked(hva, 1, &page, 0);
+		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_KVM);
 		if (npages != 1)
 			return -EFAULT;
 		memcpy(data, page_address(page) + offset, seg);
@@ -2354,7 +2354,8 @@ int copy_to_guest(unsigned long hva, const void *data, int len, bool protected)
 	check_object_size(data, len, true);
 
 	while ((seg = next_segment(len, offset)) != 0) {
-		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);
+		npages = get_user_pages_unlocked(hva, 1, &page,
+						 FOLL_WRITE | FOLL_KVM);
 		if (npages != 1)
 			return -EFAULT;
 		memcpy(page_address(page) + offset, data, seg);
-- 
2.26.2
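
The rewritten can_follow_write_pte() in the patch above adds a third way a
write can be allowed: a VM_KVM_PROTECTED VMA that is writable by its
vm_flags and whose page is shared or mapped exactly once. The decision
tree can be sketched as a pure userspace predicate (the flag values and
names below are illustrative stand-ins, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel flag bits used in the check. */
#define FOLL_FORCE	0x01
#define FOLL_COW	0x02
#define VM_WRITE	0x04
#define VM_SHARED	0x08
#define VM_KVM_PROT	0x10

/* Mirrors the logic of the patched can_follow_write_pte(). */
static bool can_follow_write(bool pte_write, bool pte_dirty,
			     unsigned int foll_flags,
			     unsigned long vm_flags, int mapcount)
{
	if (pte_write)
		return true;

	/* FOLL_FORCE after a COW cycle may write a dirty, read-only pte. */
	if ((foll_flags & FOLL_FORCE) && (foll_flags & FOLL_COW) && pte_dirty)
		return true;

	/* New case: a KVM-protected VMA that is writable by vm_flags... */
	if (!(vm_flags & VM_KVM_PROT) || !(vm_flags & VM_WRITE))
		return false;

	/* ...and whose page is shared or not mapped anywhere else. */
	return (vm_flags & VM_SHARED) || mapcount == 1;
}
```

A private protected page with mapcount > 1 stays write-protected, so a
pending COW still has to be resolved through the fault path first.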



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFCv2 10/16] KVM: x86: Use GUP for page walk instead of __get_user()
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 09/16] KVM: mm: Introduce VM_KVM_PROTECTED Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 11/16] KVM: Protected memory extension Kirill A. Shutemov
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

For protected memory, the userspace mapping doesn't have the pages
mapped, so __get_user() would fault. Use GUP to access the guest page
table entries instead.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kvm/mmu/paging_tmpl.h | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 4dd6b1e5b8cf..258a6361b9b2 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -397,8 +397,14 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 			goto error;
 
 		ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
-		if (unlikely(__get_user(pte, ptep_user)))
-			goto error;
+		if (vcpu->kvm->mem_protected) {
+			if (copy_from_guest(&pte, host_addr + offset,
+					    sizeof(pte), true))
+				goto error;
+		} else {
+			if (unlikely(__get_user(pte, ptep_user)))
+				goto error;
+		}
 		walker->ptep_user[walker->level - 1] = ptep_user;
 
 		trace_kvm_mmu_paging_element(pte, walker->level);
-- 
2.26.2



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFCv2 11/16] KVM: Protected memory extension
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 10/16] KVM: x86: Use GUP for page walk instead of __get_user() Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  7:17   ` Peter Zijlstra
  2020-10-20  6:18 ` [RFCv2 12/16] KVM: x86: Enabled protected " Kirill A. Shutemov
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Add infrastructure that handles the protected memory extension.

Arch-specific code has to provide the hypercalls and define a non-zero
VM_KVM_PROTECTED.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/kvm_host.h |  4 +++
 virt/kvm/Kconfig         |  3 ++
 virt/kvm/kvm_main.c      | 68 ++++++++++++++++++++++++++++++++++++++
 virt/lib/Makefile        |  1 +
 virt/lib/mem_protected.c | 71 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 147 insertions(+)
 create mode 100644 virt/lib/mem_protected.c

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 380a64613880..6655e8da4555 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -701,6 +701,10 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm);
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 				   struct kvm_memory_slot *slot);
 
+int kvm_protect_all_memory(struct kvm *kvm);
+int kvm_protect_memory(struct kvm *kvm,
+		       unsigned long gfn, unsigned long npages, bool protect);
+
 int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
 			    struct page **pages, int nr_pages);
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 1c37ccd5d402..50d7422386aa 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -63,3 +63,6 @@ config HAVE_KVM_NO_POLL
 
 config KVM_XFER_TO_GUEST_WORK
        bool
+
+config HAVE_KVM_PROTECTED_MEMORY
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 125db5a73e10..4c008c7b4974 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -154,6 +154,8 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
+int __kvm_protect_memory(unsigned long start, unsigned long end, bool protect);
+
 __weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 						   unsigned long start, unsigned long end)
 {
@@ -1371,6 +1373,15 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	if (r)
 		goto out_bitmap;
 
+	if (IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) &&
+	    mem->memory_size && kvm->mem_protected) {
+		r = __kvm_protect_memory(new.userspace_addr,
+					 new.userspace_addr + new.npages * PAGE_SIZE,
+					 true);
+		if (r)
+			goto out_bitmap;
+	}
+
 	if (old.dirty_bitmap && !new.dirty_bitmap)
 		kvm_destroy_dirty_bitmap(&old);
 	return 0;
@@ -2720,6 +2731,63 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
 
+int kvm_protect_memory(struct kvm *kvm,
+		       unsigned long gfn, unsigned long npages, bool protect)
+{
+	struct kvm_memory_slot *memslot;
+	unsigned long start, end;
+	gfn_t numpages;
+
+	if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY))
+		return -KVM_ENOSYS;
+
+	if (!npages)
+		return 0;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	/* Not backed by memory. It's okay. */
+	if (!memslot)
+		return 0;
+
+	start = gfn_to_hva_many(memslot, gfn, &numpages);
+	end = start + npages * PAGE_SIZE;
+
+	/* XXX: Share range across memory slots? */
+	if (WARN_ON(numpages < npages))
+		return -EINVAL;
+
+	return __kvm_protect_memory(start, end, protect);
+}
+EXPORT_SYMBOL_GPL(kvm_protect_memory);
+
+int kvm_protect_all_memory(struct kvm *kvm)
+{
+	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;
+	unsigned long start, end;
+	int i, ret = 0;
+
+	if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY))
+		return -KVM_ENOSYS;
+
+	mutex_lock(&kvm->slots_lock);
+	kvm->mem_protected = true;
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot(memslot, slots) {
+			start = memslot->userspace_addr;
+			end = start + memslot->npages * PAGE_SIZE;
+			ret = __kvm_protect_memory(start, end, true);
+			if (ret)
+				goto out;
+		}
+	}
+out:
+	mutex_unlock(&kvm->slots_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_protect_all_memory);
+
 void kvm_sigset_activate(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu->sigset_active)
diff --git a/virt/lib/Makefile b/virt/lib/Makefile
index bd7f9a78bb6b..d6e50510801f 100644
--- a/virt/lib/Makefile
+++ b/virt/lib/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-$(CONFIG_IRQ_BYPASS_MANAGER) += irqbypass.o
+obj-$(CONFIG_HAVE_KVM_PROTECTED_MEMORY) += mem_protected.o
diff --git a/virt/lib/mem_protected.c b/virt/lib/mem_protected.c
new file mode 100644
index 000000000000..0b01dd74f29c
--- /dev/null
+++ b/virt/lib/mem_protected.c
@@ -0,0 +1,71 @@
+#include <linux/kvm_host.h>
+#include <linux/mm.h>
+#include <linux/pagewalk.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <asm/tlbflush.h>
+
+int __kvm_protect_memory(unsigned long start, unsigned long end, bool protect)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma, *prev;
+	int ret;
+
+	if (mmap_write_lock_killable(mm))
+		return -EINTR;
+
+	ret = -ENOMEM;
+	vma = find_vma(current->mm, start);
+	if (!vma)
+		goto out;
+
+	ret = -EINVAL;
+	if (vma->vm_start > start)
+		goto out;
+
+	if (start > vma->vm_start)
+		prev = vma;
+	else
+		prev = vma->vm_prev;
+
+	ret = 0;
+	while (true) {
+		unsigned long newflags, tmp;
+
+		tmp = vma->vm_end;
+		if (tmp > end)
+			tmp = end;
+
+		newflags = vma->vm_flags;
+		if (protect)
+			newflags |= VM_KVM_PROTECTED;
+		else
+			newflags &= ~VM_KVM_PROTECTED;
+
+		/* The VMA has been handled as part of other memslot */
+		if (newflags == vma->vm_flags)
+			goto next;
+
+		ret = mprotect_fixup(vma, &prev, start, tmp, newflags);
+		if (ret)
+			goto out;
+
+next:
+		start = tmp;
+		if (start < prev->vm_end)
+			start = prev->vm_end;
+
+		if (start >= end)
+			goto out;
+
+		vma = prev->vm_next;
+		if (!vma || vma->vm_start != start) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+out:
+	mmap_write_unlock(mm);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(__kvm_protect_memory);
-- 
2.26.2
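
__kvm_protect_memory() above is essentially an mprotect()-style walk:
visit every VMA overlapping [start, end), flip VM_KVM_PROTECTED via
mprotect_fixup(), and fail if the range has a hole. A userspace sketch of
that walk over a sorted array of ranges (the struct and error codes
mirror the patch; everything else is an assumption for illustration):

```c
#include <assert.h>
#include <errno.h>

#define VM_KVM_PROT 0x1UL

struct range {			/* stand-in for a vm_area_struct */
	unsigned long start, end;
	unsigned long flags;
};

/*
 * Flip VM_KVM_PROT on every range overlapping [start, end).
 * Returns -EINVAL if the range starts in a hole (as the patch does when
 * vma->vm_start > start) and -ENOMEM on a hole found mid-walk.
 */
static int protect_range(struct range *vmas, int n,
			 unsigned long start, unsigned long end, int protect)
{
	int i;

	for (i = 0; i < n && vmas[i].end <= start; i++)
		;
	if (i == n)
		return -ENOMEM;		/* no VMA covers start */
	if (vmas[i].start > start)
		return -EINVAL;		/* range starts in a hole */

	for (; i < n; i++) {
		if (protect)
			vmas[i].flags |= VM_KVM_PROT;
		else
			vmas[i].flags &= ~VM_KVM_PROT;
		if (vmas[i].end >= end)
			return 0;
		/* the next VMA must be contiguous, as in the kernel walk */
		if (i + 1 == n || vmas[i + 1].start != vmas[i].end)
			return -ENOMEM;
	}
	return -ENOMEM;
}
```

Note that a VMA already carrying the requested flag state is simply
skipped by mprotect_fixup() in the real code ("handled as part of other
memslot"); re-setting the bit here is an equivalent no-op.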



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFCv2 12/16] KVM: x86: Enable protected memory extension
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 11/16] KVM: Protected memory extension Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 13/16] KVM: Rework copy_to/from_guest() to avoid direct mapping Kirill A. Shutemov
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Wire up hypercalls for the feature and define VM_KVM_PROTECTED.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig     | 1 +
 arch/x86/kvm/Kconfig | 1 +
 arch/x86/kvm/cpuid.c | 3 ++-
 arch/x86/kvm/x86.c   | 9 +++++++++
 include/linux/mm.h   | 6 ++++++
 5 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b22b95517437..0bcbdadb97d6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -807,6 +807,7 @@ config KVM_GUEST
 	select X86_HV_CALLBACK_VECTOR
 	select X86_MEM_ENCRYPT_COMMON
 	select SWIOTLB
+	select ARCH_USES_HIGH_VMA_FLAGS
 	default y
 	help
 	  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index fbd5bd7a945a..2ea77c2a8029 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -46,6 +46,7 @@ config KVM
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
 	select SRCU
+	select HAVE_KVM_PROTECTED_MEMORY
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 3fd6eec202d7..eed33db032fb 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -746,7 +746,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			     (1 << KVM_FEATURE_PV_SEND_IPI) |
 			     (1 << KVM_FEATURE_POLL_CONTROL) |
 			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
-			     (1 << KVM_FEATURE_ASYNC_PF_INT);
+			     (1 << KVM_FEATURE_ASYNC_PF_INT) |
+			     (1 << KVM_FEATURE_MEM_PROTECTED);
 
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ce856e0ece84..e89ff39204f0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7752,6 +7752,15 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		kvm_sched_yield(vcpu->kvm, a0);
 		ret = 0;
 		break;
+	case KVM_HC_ENABLE_MEM_PROTECTED:
+		ret = kvm_protect_all_memory(vcpu->kvm);
+		break;
+	case KVM_HC_MEM_SHARE:
+		ret = kvm_protect_memory(vcpu->kvm, a0, a1, false);
+		break;
+	case KVM_HC_MEM_UNSHARE:
+		ret = kvm_protect_memory(vcpu->kvm, a0, a1, true);
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c8d8cdcbc425..ee274d27e764 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -304,11 +304,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -342,7 +344,11 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MAPPED_COPY	VM_ARCH_1	/* T if mapped copy of data (nommu mmap) */
 #endif
 
+#if defined(CONFIG_X86_64) && defined(CONFIG_KVM)
+#define VM_KVM_PROTECTED VM_HIGH_ARCH_5
+#else
 #define VM_KVM_PROTECTED 0
+#endif
 
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
-- 
2.26.2
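
Note the polarity in the hypercall dispatch above: KVM_HC_MEM_SHARE
clears protection (protect = false) so the host can access the range
again, while KVM_HC_MEM_UNSHARE re-protects it. A minimal userspace
sketch of that routing (the hypercall numbers below are placeholders,
not the real UAPI values):

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder hypercall numbers; real ones live in uapi/linux/kvm_para.h. */
enum { HC_ENABLE_MEM_PROTECTED = 100, HC_MEM_SHARE, HC_MEM_UNSHARE };

struct vm_state {
	bool all_protected;
	bool range_protected;	/* stands in for one range's state */
};

/* Mirrors the switch cases added to kvm_emulate_hypercall(). */
static int handle_hypercall(struct vm_state *vm, int nr)
{
	switch (nr) {
	case HC_ENABLE_MEM_PROTECTED:
		vm->all_protected = true;
		vm->range_protected = true;
		return 0;
	case HC_MEM_SHARE:		/* share => UNprotect the range */
		vm->range_protected = false;
		return 0;
	case HC_MEM_UNSHARE:		/* unshare => protect it again */
		vm->range_protected = true;
		return 0;
	default:
		return -1;		/* -KVM_ENOSYS in the kernel */
	}
}
```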



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFCv2 13/16] KVM: Rework copy_to/from_guest() to avoid direct mapping
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 12/16] KVM: x86: Enabled protected " Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  6:18 ` [RFCv2 14/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn() Kirill A. Shutemov
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

We are going to unmap guest pages from the direct mapping and can no
longer rely on it for guest memory access. Use a temporary
kmap_atomic()-style mapping to access guest memory instead.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 virt/kvm/kvm_main.c      |  27 ++++++++++-
 virt/lib/mem_protected.c | 101 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4c008c7b4974..9b569b78874a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -51,6 +51,7 @@
 #include <linux/io.h>
 #include <linux/lockdep.h>
 #include <linux/kthread.h>
+#include <linux/pagewalk.h>
 
 #include <asm/processor.h>
 #include <asm/ioctl.h>
@@ -154,6 +155,12 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
+void *kvm_map_page_atomic(struct page *page);
+void kvm_unmap_page_atomic(void *vaddr);
+
+int kvm_init_protected_memory(void);
+void kvm_exit_protected_memory(void);
+
 int __kvm_protect_memory(unsigned long start, unsigned long end, bool protect);
 
 __weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
@@ -2329,6 +2336,7 @@ int copy_from_guest(void *data, unsigned long hva, int len, bool protected)
 	int offset = offset_in_page(hva);
 	struct page *page;
 	int npages, seg;
+	void *vaddr;
 
 	if (!protected)
 		return __copy_from_user(data, (void __user *)hva, len);
@@ -2341,7 +2349,11 @@ int copy_from_guest(void *data, unsigned long hva, int len, bool protected)
 		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_KVM);
 		if (npages != 1)
 			return -EFAULT;
-		memcpy(data, page_address(page) + offset, seg);
+
+		vaddr = kvm_map_page_atomic(page);
+		memcpy(data, vaddr + offset, seg);
+		kvm_unmap_page_atomic(vaddr);
+
 		put_page(page);
 		len -= seg;
 		hva += seg;
@@ -2356,6 +2368,7 @@ int copy_to_guest(unsigned long hva, const void *data, int len, bool protected)
 	int offset = offset_in_page(hva);
 	struct page *page;
 	int npages, seg;
+	void *vaddr;
 
 	if (!protected)
 		return __copy_to_user((void __user *)hva, data, len);
@@ -2369,7 +2382,11 @@ int copy_to_guest(unsigned long hva, const void *data, int len, bool protected)
 						 FOLL_WRITE | FOLL_KVM);
 		if (npages != 1)
 			return -EFAULT;
-		memcpy(page_address(page) + offset, data, seg);
+
+		vaddr = kvm_map_page_atomic(page);
+		memcpy(vaddr + offset, data, seg);
+		kvm_unmap_page_atomic(vaddr);
+
 		put_page(page);
 		len -= seg;
 		hva += seg;
@@ -4945,6 +4962,10 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	if (r)
 		goto out_free;
 
+	if (IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) &&
+	    kvm_init_protected_memory())
+		goto out_unreg;
+
 	kvm_chardev_ops.owner = module;
 	kvm_vm_fops.owner = module;
 	kvm_vcpu_fops.owner = module;
@@ -4968,6 +4989,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	return 0;
 
 out_unreg:
+	kvm_exit_protected_memory();
 	kvm_async_pf_deinit();
 out_free:
 	kmem_cache_destroy(kvm_vcpu_cache);
@@ -4989,6 +5011,7 @@ EXPORT_SYMBOL_GPL(kvm_init);
 
 void kvm_exit(void)
 {
+	kvm_exit_protected_memory();
 	debugfs_remove_recursive(kvm_debugfs_dir);
 	misc_deregister(&kvm_dev);
 	kmem_cache_destroy(kvm_vcpu_cache);
diff --git a/virt/lib/mem_protected.c b/virt/lib/mem_protected.c
index 0b01dd74f29c..1dfe82534242 100644
--- a/virt/lib/mem_protected.c
+++ b/virt/lib/mem_protected.c
@@ -5,6 +5,100 @@
 #include <linux/vmalloc.h>
 #include <asm/tlbflush.h>
 
+static pte_t **guest_map_ptes;
+static struct vm_struct *guest_map_area;
+
+void *kvm_map_page_atomic(struct page *page)
+{
+	pte_t *pte;
+	void *vaddr;
+
+	preempt_disable();
+	pte = guest_map_ptes[smp_processor_id()];
+	vaddr = guest_map_area->addr + smp_processor_id() * PAGE_SIZE;
+	set_pte(pte, mk_pte(page, PAGE_KERNEL));
+	return vaddr;
+}
+EXPORT_SYMBOL_GPL(kvm_map_page_atomic);
+
+void kvm_unmap_page_atomic(void *vaddr)
+{
+	pte_t *pte = guest_map_ptes[smp_processor_id()];
+	set_pte(pte, __pte(0));
+	flush_tlb_one_kernel((unsigned long)vaddr);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(kvm_unmap_page_atomic);
+
+int kvm_init_protected_memory(void)
+{
+	guest_map_ptes = kmalloc_array(num_possible_cpus(),
+				       sizeof(pte_t *), GFP_KERNEL);
+	if (!guest_map_ptes)
+		return -ENOMEM;
+
+	guest_map_area = alloc_vm_area(PAGE_SIZE * num_possible_cpus(),
+				       guest_map_ptes);
+	if (!guest_map_area) {
+		kfree(guest_map_ptes);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_init_protected_memory);
+
+void kvm_exit_protected_memory(void)
+{
+	if (guest_map_area)
+		free_vm_area(guest_map_area);
+	if (guest_map_ptes)
+		kfree(guest_map_ptes);
+}
+EXPORT_SYMBOL_GPL(kvm_exit_protected_memory);
+
+static int adjust_direct_mapping_pte_range(pmd_t *pmd, unsigned long addr,
+					   unsigned long end,
+					   struct mm_walk *walk)
+{
+	bool protect = (bool)walk->private;
+	pte_t *pte;
+	struct page *page;
+
+	if (pmd_trans_huge(*pmd)) {
+		page = pmd_page(*pmd);
+		if (is_huge_zero_page(page))
+			return 0;
+		VM_BUG_ON_PAGE(total_mapcount(page) != 1, page);
+		/* XXX: Would it fail with direct device assignment? */
+		VM_BUG_ON_PAGE(page_count(page) != 1, page);
+		kernel_map_pages(page, HPAGE_PMD_NR, !protect);
+		return 0;
+	}
+
+	pte = pte_offset_map(pmd, addr);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		pte_t entry = *pte;
+
+		if (!pte_present(entry))
+			continue;
+
+		if (is_zero_pfn(pte_pfn(entry)))
+			continue;
+
+		page = pte_page(entry);
+
+		VM_BUG_ON_PAGE(page_mapcount(page) != 1, page);
+		kernel_map_pages(page, 1, !protect);
+	}
+
+	return 0;
+}
+
+static const struct mm_walk_ops adjust_direct_mapping_ops = {
+	.pmd_entry      = adjust_direct_mapping_pte_range,
+};
+
 int __kvm_protect_memory(unsigned long start, unsigned long end, bool protect)
 {
 	struct mm_struct *mm = current->mm;
@@ -50,6 +144,13 @@ int __kvm_protect_memory(unsigned long start, unsigned long end, bool protect)
 		if (ret)
 			goto out;
 
+		if (vma_is_anonymous(vma)) {
+			ret = walk_page_range_novma(mm, start, tmp,
+						    &adjust_direct_mapping_ops, NULL,
+						    (void *) protect);
+			if (ret)
+				goto out;
+		}
 next:
 		start = tmp;
 		if (start < prev->vm_end)
-- 
2.26.2




* [RFCv2 14/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 13/16] KVM: Rework copy_to/from_guest() to avoid direct mapping Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-21 18:50   ` Edgecombe, Rick P
  2020-10-22  3:26   ` Halil Pasic
  2020-10-20  6:18 ` [RFCv2 15/16] KVM: Unmap protected pages from direct mapping Kirill A. Shutemov
                   ` (3 subsequent siblings)
  17 siblings, 2 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

We cannot access protected pages directly. Use ioremap() to
create a temporary mapping of the page instead. The mapping is
destroyed in __kvm_unmap_gfn().

The new interface gfn_to_pfn_memslot_protected() is used to detect if
the page is protected.

ioremap_cache_force() is a hack to bypass the IORES_MAP_SYSTEM_RAM check
in the x86 ioremap code. We need a better solution.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
 arch/x86/include/asm/io.h              |  2 +
 arch/x86/include/asm/pgtable_types.h   |  1 +
 arch/x86/kvm/mmu/mmu.c                 |  6 ++-
 arch/x86/mm/ioremap.c                  | 16 ++++++--
 include/linux/kvm_host.h               |  3 +-
 include/linux/kvm_types.h              |  1 +
 virt/kvm/kvm_main.c                    | 52 +++++++++++++++++++-------
 9 files changed, 63 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 38ea396a23d6..8e06cd3f759c 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -590,7 +590,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
 	} else {
 		/* Call KVM generic code to do the slow-path check */
 		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
-					   writing, &write_ok);
+					   writing, &write_ok, NULL);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
 		page = NULL;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 22a677b18695..6fd4e3f9b66a 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -822,7 +822,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 
 		/* Call KVM generic code to do the slow-path check */
 		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
-					   writing, upgrade_p);
+					   writing, upgrade_p, NULL);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
 		page = NULL;
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index c58d52fd7bf2..a3e1bfad1026 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -184,6 +184,8 @@ extern void __iomem *ioremap_uc(resource_size_t offset, unsigned long size);
 #define ioremap_uc ioremap_uc
 extern void __iomem *ioremap_cache(resource_size_t offset, unsigned long size);
 #define ioremap_cache ioremap_cache
+extern void __iomem *ioremap_cache_force(resource_size_t offset, unsigned long size);
+#define ioremap_cache_force ioremap_cache_force
 extern void __iomem *ioremap_prot(resource_size_t offset, unsigned long size, unsigned long prot_val);
 #define ioremap_prot ioremap_prot
 extern void __iomem *ioremap_encrypted(resource_size_t phys_addr, unsigned long size);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 816b31c68550..4c16a9583786 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -147,6 +147,7 @@ enum page_cache_mode {
 	_PAGE_CACHE_MODE_UC       = 3,
 	_PAGE_CACHE_MODE_WT       = 4,
 	_PAGE_CACHE_MODE_WP       = 5,
+	_PAGE_CACHE_MODE_WB_FORCE = 6,
 
 	_PAGE_CACHE_MODE_NUM      = 8
 };
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 71aa3da2a0b7..162cb285b87b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4058,7 +4058,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	}
 
 	async = false;
-	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable);
+	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable,
+				    NULL);
 	if (!async)
 		return false; /* *pfn has correct page already */
 
@@ -4072,7 +4073,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 			return true;
 	}
 
-	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
+	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable,
+				    NULL);
 	return false;
 }
 
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9e5ccc56f8e0..4409785e294c 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -202,9 +202,12 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
 	__ioremap_check_mem(phys_addr, size, &io_desc);
 
 	/*
-	 * Don't allow anybody to remap normal RAM that we're using..
+	 * Don't allow anybody to remap normal RAM that we're using, unless
+	 * _PAGE_CACHE_MODE_WB_FORCE is used.
 	 */
-	if (io_desc.flags & IORES_MAP_SYSTEM_RAM) {
+	if (pcm == _PAGE_CACHE_MODE_WB_FORCE) {
+		pcm = _PAGE_CACHE_MODE_WB;
+	} else if (io_desc.flags & IORES_MAP_SYSTEM_RAM) {
 		WARN_ONCE(1, "ioremap on RAM at %pa - %pa\n",
 			  &phys_addr, &last_addr);
 		return NULL;
@@ -419,6 +422,13 @@ void __iomem *ioremap_cache(resource_size_t phys_addr, unsigned long size)
 }
 EXPORT_SYMBOL(ioremap_cache);
 
+void __iomem *ioremap_cache_force(resource_size_t phys_addr, unsigned long size)
+{
+	return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WB_FORCE,
+				__builtin_return_address(0), false);
+}
+EXPORT_SYMBOL(ioremap_cache_force);
+
 void __iomem *ioremap_prot(resource_size_t phys_addr, unsigned long size,
 				unsigned long prot_val)
 {
@@ -467,7 +477,7 @@ void iounmap(volatile void __iomem *addr)
 	p = find_vm_area((void __force *)addr);
 
 	if (!p) {
-		printk(KERN_ERR "iounmap: bad address %p\n", addr);
+		printk(KERN_ERR "iounmap: bad address %px\n", addr);
 		dump_stack();
 		return;
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6655e8da4555..0d5f3885747b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -238,6 +238,7 @@ struct kvm_host_map {
 	void *hva;
 	kvm_pfn_t pfn;
 	kvm_pfn_t gfn;
+	bool protected;
 };
 
 /*
@@ -725,7 +726,7 @@ kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
 kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
 kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool *async, bool write_fault,
-			       bool *writable);
+			       bool *writable, bool *protected);
 
 void kvm_release_pfn_clean(kvm_pfn_t pfn);
 void kvm_release_pfn_dirty(kvm_pfn_t pfn);
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index a7580f69dda0..0a8c6426b4f4 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -58,6 +58,7 @@ struct gfn_to_pfn_cache {
 	gfn_t gfn;
 	kvm_pfn_t pfn;
 	bool dirty;
+	bool protected;
 };
 
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9b569b78874a..7c2c764c28c5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1852,9 +1852,10 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
  * 1 indicates success, -errno is returned if error is detected.
  */
 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
-			   bool *writable, kvm_pfn_t *pfn)
+			   bool *writable, bool *protected, kvm_pfn_t *pfn)
 {
 	unsigned int flags = FOLL_HWPOISON | FOLL_KVM;
+	struct vm_area_struct *vma;
 	struct page *page;
 	int npages = 0;
 
@@ -1868,9 +1869,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 	if (async)
 		flags |= FOLL_NOWAIT;
 
-	npages = get_user_pages_unlocked(addr, 1, &page, flags);
-	if (npages != 1)
+	mmap_read_lock(current->mm);
+	npages = get_user_pages(addr, 1, flags, &page, &vma);
+	if (npages != 1) {
+		mmap_read_unlock(current->mm);
 		return npages;
+	}
+	if (protected)
+		*protected = vma_is_kvm_protected(vma);
+	mmap_read_unlock(current->mm);
 
 	/* map read fault as writable if possible */
 	if (unlikely(!write_fault) && writable) {
@@ -1961,7 +1968,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
  *     whether the mapping is writable.
  */
 static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
-			bool write_fault, bool *writable)
+			bool write_fault, bool *writable, bool *protected)
 {
 	struct vm_area_struct *vma;
 	kvm_pfn_t pfn = 0;
@@ -1976,7 +1983,8 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 	if (atomic)
 		return KVM_PFN_ERR_FAULT;
 
-	npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfn);
+	npages = hva_to_pfn_slow(addr, async, write_fault, writable, protected,
+				 &pfn);
 	if (npages == 1)
 		return pfn;
 
@@ -2010,7 +2018,7 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 
 kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool *async, bool write_fault,
-			       bool *writable)
+			       bool *writable, bool *protected)
 {
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
@@ -2033,7 +2041,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 	}
 
 	return hva_to_pfn(addr, atomic, async, write_fault,
-			  writable);
+			  writable, protected);
 }
 EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 
@@ -2041,19 +2049,26 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
 	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
-				    write_fault, writable);
+				    write_fault, writable, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
 kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL);
+	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
+static kvm_pfn_t gfn_to_pfn_memslot_protected(struct kvm_memory_slot *slot,
+					      gfn_t gfn, bool *protected)
+{
+	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL,
+				    protected);
+}
+
 kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL);
+	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
@@ -2134,7 +2149,7 @@ static void kvm_cache_gfn_to_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
 {
 	kvm_release_pfn(cache->pfn, cache->dirty, cache);
 
-	cache->pfn = gfn_to_pfn_memslot(slot, gfn);
+	cache->pfn = gfn_to_pfn_memslot_protected(slot, gfn, &cache->protected);
 	cache->gfn = gfn;
 	cache->dirty = false;
 	cache->generation = gen;
@@ -2149,6 +2164,7 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
 	void *hva = NULL;
 	struct page *page = KVM_UNMAPPED_PAGE;
 	struct kvm_memory_slot *slot = __gfn_to_memslot(slots, gfn);
+	bool protected;
 	u64 gen = slots->generation;
 
 	if (!map)
@@ -2162,15 +2178,20 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
 			kvm_cache_gfn_to_pfn(slot, gfn, cache, gen);
 		}
 		pfn = cache->pfn;
+		protected = cache->protected;
 	} else {
 		if (atomic)
 			return -EAGAIN;
-		pfn = gfn_to_pfn_memslot(slot, gfn);
+		pfn = gfn_to_pfn_memslot_protected(slot, gfn, &protected);
 	}
 	if (is_error_noslot_pfn(pfn))
 		return -EINVAL;
 
-	if (pfn_valid(pfn)) {
+	if (protected) {
+		if (atomic)
+			return -EAGAIN;
+		hva = ioremap_cache_force(pfn_to_hpa(pfn), PAGE_SIZE);
+	} else if (pfn_valid(pfn)) {
 		page = pfn_to_page(pfn);
 		if (atomic)
 			hva = kmap_atomic(page);
@@ -2191,6 +2212,7 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
 	map->hva = hva;
 	map->pfn = pfn;
 	map->gfn = gfn;
+	map->protected = protected;
 
 	return 0;
 }
@@ -2221,7 +2243,9 @@ static void __kvm_unmap_gfn(struct kvm_memory_slot *memslot,
 	if (!map->hva)
 		return;
 
-	if (map->page != KVM_UNMAPPED_PAGE) {
+	if (map->protected) {
+		iounmap(map->hva);
+	} else if (map->page != KVM_UNMAPPED_PAGE) {
 		if (atomic)
 			kunmap_atomic(map->hva);
 		else
-- 
2.26.2




* [RFCv2 15/16] KVM: Unmap protected pages from direct mapping
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (13 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 14/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn() Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  7:12   ` Peter Zijlstra
                     ` (3 more replies)
  2020-10-20  6:18 ` [RFCv2 16/16] mm: Do not use zero page for VM_KVM_PROTECTED VMAs Kirill A. Shutemov
                   ` (2 subsequent siblings)
  17 siblings, 4 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

If the protected memory feature is enabled, unmap guest memory from the
kernel's direct mapping.

Migration and KSM are disabled for protected memory as they would require
special treatment.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h       |  3 +++
 mm/huge_memory.c         |  8 ++++++++
 mm/ksm.c                 |  2 ++
 mm/memory.c              | 12 ++++++++++++
 mm/rmap.c                |  4 ++++
 virt/lib/mem_protected.c | 21 +++++++++++++++++++++
 6 files changed, 50 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ee274d27e764..74efc51e63f0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -671,6 +671,9 @@ static inline bool vma_is_kvm_protected(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_KVM_PROTECTED;
 }
 
+void kvm_map_page(struct page *page, int nr_pages);
+void kvm_unmap_page(struct page *page, int nr_pages);
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec8cf9a40cfd..40974656cb43 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -627,6 +627,10 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		spin_unlock(vmf->ptl);
 		count_vm_event(THP_FAULT_ALLOC);
 		count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+
+		/* Unmap page from direct mapping */
+		if (vma_is_kvm_protected(vma))
+			kvm_unmap_page(page, HPAGE_PMD_NR);
 	}
 
 	return 0;
@@ -1689,6 +1693,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			page_remove_rmap(page, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
+
+			/* Map the page back to the direct mapping */
+			if (vma_is_kvm_protected(vma))
+				kvm_map_page(page, HPAGE_PMD_NR);
 		} else if (thp_migration_supported()) {
 			swp_entry_t entry;
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 9afccc36dbd2..c720e271448f 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -528,6 +528,8 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
 		return NULL;
 	if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
 		return NULL;
+	if (vma_is_kvm_protected(vma))
+		return NULL;
 	return vma;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 2c9756b4e52f..e28bd5f902a7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1245,6 +1245,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
 			}
+
+			/* Map the page back to the direct mapping */
+			if (vma_is_anonymous(vma) && vma_is_kvm_protected(vma))
+				kvm_map_page(page, 1);
+
 			rss[mm_counter(page)]--;
 			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
@@ -3466,6 +3471,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct page *page;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	bool set = false;
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -3554,6 +3560,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, vmf->address, false);
 	lru_cache_add_inactive_or_unevictable(page, vma);
+	set = true;
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3561,6 +3568,11 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	/* Unmap page from direct mapping */
+	if (vma_is_kvm_protected(vma) && set)
+		kvm_unmap_page(page, 1);
+
 	return ret;
 release:
 	put_page(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 9425260774a1..247548d6d24b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1725,6 +1725,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg)
 {
+	/* TODO */
+	if (vma_is_kvm_protected(vma))
+		return true;
+
 	return vma_is_temporary_stack(vma);
 }
 
diff --git a/virt/lib/mem_protected.c b/virt/lib/mem_protected.c
index 1dfe82534242..9d2ef99285e5 100644
--- a/virt/lib/mem_protected.c
+++ b/virt/lib/mem_protected.c
@@ -30,6 +30,27 @@ void kvm_unmap_page_atomic(void *vaddr)
 }
 EXPORT_SYMBOL_GPL(kvm_unmap_page_atomic);
 
+void kvm_map_page(struct page *page, int nr_pages)
+{
+	int i;
+
+	/* Clear page before returning it to the direct mapping */
+	for (i = 0; i < nr_pages; i++) {
+		void *p = kvm_map_page_atomic(page + i);
+		memset(p, 0, PAGE_SIZE);
+		kvm_unmap_page_atomic(p);
+	}
+
+	kernel_map_pages(page, nr_pages, 1);
+}
+EXPORT_SYMBOL_GPL(kvm_map_page);
+
+void kvm_unmap_page(struct page *page, int nr_pages)
+{
+	kernel_map_pages(page, nr_pages, 0);
+}
+EXPORT_SYMBOL_GPL(kvm_unmap_page);
+
 int kvm_init_protected_memory(void)
 {
 	guest_map_ptes = kmalloc_array(num_possible_cpus(),
-- 
2.26.2




* [RFCv2 16/16] mm: Do not use zero page for VM_KVM_PROTECTED VMAs
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (14 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 15/16] KVM: Unmap protected pages from direct mapping Kirill A. Shutemov
@ 2020-10-20  6:18 ` Kirill A. Shutemov
  2020-10-20  7:46 ` [RFCv2 00/16] KVM protected memory extension Vitaly Kuznetsov
  2020-10-21 18:20 ` Andy Lutomirski
  17 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

The presence of zero pages in the mapping would disclose the content of
the mapping. Don't use them if KVM memory protection is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/s390/include/asm/pgtable.h | 2 +-
 include/linux/mm.h              | 4 ++--
 mm/huge_memory.c                | 3 +--
 mm/memory.c                     | 3 +--
 4 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index b55561cc8786..72ca3b3f04cb 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -543,7 +543,7 @@ static inline int mm_alloc_pgste(struct mm_struct *mm)
  * In the case that a guest uses storage keys
  * faults should no longer be backed by zero pages
  */
-#define mm_forbids_zeropage mm_has_pgste
+#define vma_forbids_zeropage(vma) mm_has_pgste(vma->vm_mm)
 static inline int mm_uses_skeys(struct mm_struct *mm)
 {
 #ifdef CONFIG_PGSTE
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74efc51e63f0..ee713b7c2819 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -130,8 +130,8 @@ extern int mmap_rnd_compat_bits __read_mostly;
  * s390 does this to prevent multiplexing of hardware bits
  * related to the physical page in case of virtualization.
  */
-#ifndef mm_forbids_zeropage
-#define mm_forbids_zeropage(X)	(0)
+#ifndef vma_forbids_zeropage
+#define vma_forbids_zeropage(vma) vma_is_kvm_protected(vma)
 #endif
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40974656cb43..383614b24c4f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -709,8 +709,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		return VM_FAULT_OOM;
 	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
 		return VM_FAULT_OOM;
-	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
-			!mm_forbids_zeropage(vma->vm_mm) &&
+	if (!(vmf->flags & FAULT_FLAG_WRITE) && !vma_forbids_zeropage(vma) &&
 			transparent_hugepage_use_zero_page()) {
 		pgtable_t pgtable;
 		struct page *zero_page;
diff --git a/mm/memory.c b/mm/memory.c
index e28bd5f902a7..9907ffe00490 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3495,8 +3495,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return 0;
 
 	/* Use the zero-page for reads */
-	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
-			!mm_forbids_zeropage(vma->vm_mm)) {
+	if (!(vmf->flags & FAULT_FLAG_WRITE) && !vma_forbids_zeropage(vma)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
 						vma->vm_page_prot));
 		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-- 
2.26.2




* Re: [RFCv2 15/16] KVM: Unmap protected pages from direct mapping
  2020-10-20  6:18 ` [RFCv2 15/16] KVM: Unmap protected pages from direct mapping Kirill A. Shutemov
@ 2020-10-20  7:12   ` Peter Zijlstra
  2020-10-20 12:18   ` David Hildenbrand
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2020-10-20  7:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On Tue, Oct 20, 2020 at 09:18:58AM +0300, Kirill A. Shutemov wrote:
> If the protected memory feature enabled, unmap guest memory from
> kernel's direct mappings.
> 
> Migration and KSM is disabled for protected memory as it would require a
> special treatment.

How isn't disabling migration a problem? Are we going to back allocation
by CMA instead?



* Re: [RFCv2 11/16] KVM: Protected memory extension
  2020-10-20  6:18 ` [RFCv2 11/16] KVM: Protected memory extension Kirill A. Shutemov
@ 2020-10-20  7:17   ` Peter Zijlstra
  2020-10-20 12:55     ` Kirill A. Shutemov
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2020-10-20  7:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On Tue, Oct 20, 2020 at 09:18:54AM +0300, Kirill A. Shutemov wrote:
> +int __kvm_protect_memory(unsigned long start, unsigned long end, bool protect)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma, *prev;
> +	int ret;
> +
> +	if (mmap_write_lock_killable(mm))
> +		return -EINTR;
> +
> +	ret = -ENOMEM;
> +	vma = find_vma(current->mm, start);
> +	if (!vma)
> +		goto out;
> +
> +	ret = -EINVAL;
> +	if (vma->vm_start > start)
> +		goto out;
> +
> +	if (start > vma->vm_start)
> +		prev = vma;
> +	else
> +		prev = vma->vm_prev;
> +
> +	ret = 0;
> +	while (true) {
> +		unsigned long newflags, tmp;
> +
> +		tmp = vma->vm_end;
> +		if (tmp > end)
> +			tmp = end;
> +
> +		newflags = vma->vm_flags;
> +		if (protect)
> +			newflags |= VM_KVM_PROTECTED;
> +		else
> +			newflags &= ~VM_KVM_PROTECTED;
> +
> +		/* The VMA has been handled as part of other memslot */
> +		if (newflags == vma->vm_flags)
> +			goto next;
> +
> +		ret = mprotect_fixup(vma, &prev, start, tmp, newflags);
> +		if (ret)
> +			goto out;
> +
> +next:
> +		start = tmp;
> +		if (start < prev->vm_end)
> +			start = prev->vm_end;
> +
> +		if (start >= end)
> +			goto out;
> +
> +		vma = prev->vm_next;
> +		if (!vma || vma->vm_start != start) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +	}
> +out:
> +	mmap_write_unlock(mm);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(__kvm_protect_memory);

Since migration will be disabled after this; should the above not (at
the very least) force compaction before proceeding to lock the pages in?
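
The range walk being discussed can be modelled in userspace with a minimal
sketch (hedged: `struct vma`, `protect_range()` and the flag value below are
simplified stand-ins, not the kernel's types, and the real patch goes through
mprotect_fixup(), which may merge or split VMAs). It shows the coverage
checks: flip the flag on every VMA clamped to [start, end), fail with
-ENOMEM on a hole, -EINVAL if the range starts before the first VMA:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define VM_KVM_PROTECTED 0x1UL  /* stand-in for the real VMA flag */

/* Simplified stand-in for struct vm_area_struct, kept in a sorted array. */
struct vma {
	unsigned long vm_start, vm_end, vm_flags;
};

/*
 * Userspace model of __kvm_protect_memory(): toggle VM_KVM_PROTECTED on
 * every VMA overlapping [start, end). Returns -ENOMEM if the range is not
 * fully covered by VMAs, -EINVAL if it starts before the first matching VMA.
 */
int protect_range(struct vma *vmas, size_t n, unsigned long start,
		  unsigned long end, bool protect)
{
	size_t i = 0;

	/* find_vma(): first VMA with vm_end > start */
	while (i < n && vmas[i].vm_end <= start)
		i++;
	if (i == n)
		return -ENOMEM;
	if (vmas[i].vm_start > start)
		return -EINVAL;

	for (; i < n; i++) {
		if (protect)
			vmas[i].vm_flags |= VM_KVM_PROTECTED;
		else
			vmas[i].vm_flags &= ~VM_KVM_PROTECTED;

		start = vmas[i].vm_end;
		if (start >= end)
			return 0;
		/* a gap between VMAs means part of the range is unmapped */
		if (i + 1 == n || vmas[i + 1].vm_start != start)
			return -ENOMEM;
	}
	return -ENOMEM;
}
```

Like the kernel code, the model applies flags as it walks, so a -ENOMEM
return can leave earlier VMAs already modified.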



* Re: [RFCv2 00/16] KVM protected memory extension
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (15 preceding siblings ...)
  2020-10-20  6:18 ` [RFCv2 16/16] mm: Do not use zero page for VM_KVM_PROTECTED VMAs Kirill A. Shutemov
@ 2020-10-20  7:46 ` Vitaly Kuznetsov
  2020-10-20 13:49   ` Kirill A. Shutemov
  2020-10-21 18:20 ` Andy Lutomirski
  17 siblings, 1 reply; 57+ messages in thread
From: Vitaly Kuznetsov @ 2020-10-20  7:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Wanpeng Li, Jim Mattson, Joerg Roedel

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> == Background / Problem ==
>
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.
>
>
> == What does this set mitigate? ==
>
>  - Host kernel "accidental" access to guest data (think speculation)
>
>  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>
>  - Host userspace access to guest data (compromised qemu)
>
>  - Guest privilege escalation via compromised QEMU device emulation
>
> == What does this set NOT mitigate? ==
>
>  - Full host kernel compromise.  Kernel will just map the pages again.
>
>  - Hardware attacks
>
>
> The second RFC revision addresses /most/ of the feedback.
>
> I still haven't found a good solution for reboot and kexec. Unprotecting
> all the memory on such operations defeats the goal of the feature. Clearing
> most of the memory before unprotecting what is required for reboot (or
> kexec) is tedious and error-prone.
> Maybe we should just declare them unsupported?

Making reboot unsupported is a hard sell. Could you please elaborate on
why you think that "unprotect all" hypercall (or rather a single
hypercall supporting both protecting/unprotecting) defeats the purpose
of the feature?

(Leaving kexec aside for a while) Yes, it is not easy for a guest to
clean up *all* its memory upon reboot, however:
- It may only clean up the most sensitive parts. This should probably be
done even without this new feature and even on bare metal (think about the
next boot target being malicious).
- The attack window shrinks significantly. "Speculative" bugs require
time to exploit, and the window only remains open until the guest boots up
again (a few seconds).

>
> == Series Overview ==
>
> The hardware features protect guest data by encrypting it and then
> ensuring that only the right guest can decrypt it.  This has the
> side-effect of making the kernel direct map and userspace mapping
> (QEMU et al) useless.  But, this teaches us something very useful:
> neither the kernel or userspace mappings are really necessary for normal
> guest operations.
>
> Instead of using encryption, this series simply unmaps the memory. One
> advantage compared to allowing access to ciphertext is that it allows bad
> accesses to be caught instead of simply reading garbage.
>
> Protection from physical attacks needs to be provided by some other means.
> On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> mitigation against physical attacks, such as DIMM interposers sniffing
> memory bus traffic.
>
> The patchset modifies both the host and guest kernels. The guest OS must
> enable the feature via a hypercall and mark any memory range that has to be
> shared with the host: DMA regions, bounce buffers, etc. SEV does this
> marking via a bit in the guest’s page table, while this approach uses a
> hypercall.
>
> For removing the userspace mapping, we use a trick similar to what NUMA
> balancing does: convert memory that belongs to KVM memory slots to
> PROT_NONE: all existing entries are converted to PROT_NONE with mprotect()
> and newly faulted-in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
>
> Removing the userspace mapping of guest memory from the QEMU process can
> help address some guest privilege escalation attacks. Consider the case
> where an unprivileged guest user exploits a bug in QEMU device emulation to
> gain access to data it could not normally access within the guest.
>
> Any anonymous page faulted into a VM_KVM_PROTECTED VMA gets removed from
> the direct mapping with kernel_map_pages(). Note that kernel_map_pages()
> only flushes the local TLB. I think it's a reasonable compromise between
> security and performance.
>
> Zapping the PTE would bring the page back to the direct mapping after
> clearing it. At least for now, we don't remove file-backed pages from the
> direct mapping: file-backed pages can be accessed via read/write syscalls,
> which adds complexity.
>
> Occasionally, the host kernel has to access guest memory that was not made
> shared by the guest; for instance, this happens during instruction emulation.
> Normally, that is done via copy_to/from_user(), which would now fail with
> -EFAULT. We introduced a new pair of helpers: copy_to/from_guest(). The new
> helpers acquire the page via GUP, map it into kernel address space with
> kmap_atomic()-style mechanism and only then copy the data.
>
> For some instruction emulation copying is not good enough: cmpxchg
> emulation has to have direct access to the guest memory. __kvm_map_gfn()
> is modified to accommodate the case.
>
> The patchset is on top of v5.9
>
> Kirill A. Shutemov (16):
>   x86/mm: Move force_dma_unencrypted() to common code
>   x86/kvm: Introduce KVM memory protection feature
>   x86/kvm: Make DMA pages shared
>   x86/kvm: Use bounce buffers for KVM memory protection
>   x86/kvm: Make VirtIO use DMA API in KVM guest
>   x86/kvmclock: Share hvclock memory with the host
>   x86/realmode: Share trampoline area if KVM memory protection enabled
>   KVM: Use GUP instead of copy_from/to_user() to access guest memory
>   KVM: mm: Introduce VM_KVM_PROTECTED
>   KVM: x86: Use GUP for page walk instead of __get_user()
>   KVM: Protected memory extension
>   KVM: x86: Enabled protected memory extension
>   KVM: Rework copy_to/from_guest() to avoid direct mapping
>   KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
>   KVM: Unmap protected pages from direct mapping
>   mm: Do not use zero page for VM_KVM_PROTECTED VMAs
>
>  arch/powerpc/kvm/book3s_64_mmu_hv.c    |   2 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c |   2 +-
>  arch/s390/include/asm/pgtable.h        |   2 +-
>  arch/x86/Kconfig                       |  11 +-
>  arch/x86/include/asm/cpufeatures.h     |   1 +
>  arch/x86/include/asm/io.h              |   6 +-
>  arch/x86/include/asm/kvm_para.h        |   5 +
>  arch/x86/include/asm/pgtable_types.h   |   1 +
>  arch/x86/include/uapi/asm/kvm_para.h   |   3 +-
>  arch/x86/kernel/kvm.c                  |  20 +++
>  arch/x86/kernel/kvmclock.c             |   2 +-
>  arch/x86/kernel/pci-swiotlb.c          |   3 +-
>  arch/x86/kvm/Kconfig                   |   1 +
>  arch/x86/kvm/cpuid.c                   |   3 +-
>  arch/x86/kvm/mmu/mmu.c                 |   6 +-
>  arch/x86/kvm/mmu/paging_tmpl.h         |  10 +-
>  arch/x86/kvm/x86.c                     |   9 +
>  arch/x86/mm/Makefile                   |   2 +
>  arch/x86/mm/ioremap.c                  |  16 +-
>  arch/x86/mm/mem_encrypt.c              |  51 ------
>  arch/x86/mm/mem_encrypt_common.c       |  62 +++++++
>  arch/x86/mm/pat/set_memory.c           |   7 +
>  arch/x86/realmode/init.c               |   7 +-
>  drivers/virtio/virtio_ring.c           |   4 +
>  include/linux/kvm_host.h               |  11 +-
>  include/linux/kvm_types.h              |   1 +
>  include/linux/mm.h                     |  21 ++-
>  include/uapi/linux/kvm_para.h          |   5 +-
>  mm/gup.c                               |  20 ++-
>  mm/huge_memory.c                       |  31 +++-
>  mm/ksm.c                               |   2 +
>  mm/memory.c                            |  18 +-
>  mm/mmap.c                              |   3 +
>  mm/rmap.c                              |   4 +
>  virt/kvm/Kconfig                       |   3 +
>  virt/kvm/async_pf.c                    |   2 +-
>  virt/kvm/kvm_main.c                    | 238 +++++++++++++++++++++----
>  virt/lib/Makefile                      |   1 +
>  virt/lib/mem_protected.c               | 193 ++++++++++++++++++++
>  39 files changed, 666 insertions(+), 123 deletions(-)
>  create mode 100644 arch/x86/mm/mem_encrypt_common.c
>  create mode 100644 virt/lib/mem_protected.c

-- 
Vitaly




* Re: [RFCv2 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest
  2020-10-20  6:18 ` [RFCv2 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest Kirill A. Shutemov
@ 2020-10-20  8:06   ` Christoph Hellwig
  2020-10-20 12:47     ` Kirill A. Shutemov
  2020-10-22  3:31   ` Halil Pasic
  1 sibling, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2020-10-20  8:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, Liran Alon,
	Mike Rapoport, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

NAK.  Any virtio implementation that needs special DMA OPS treatment
needs to set the VIRTIO_F_ACCESS_PLATFORM bit.  The only reason the
Xen hack exists is because it slipped in a long time ago and we can't
fix that any more.



* Re: [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-10-20  6:18 ` [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory Kirill A. Shutemov
@ 2020-10-20  8:25   ` John Hubbard
  2020-10-20 12:51     ` Kirill A. Shutemov
  2020-10-22 11:49     ` Matthew Wilcox
  2020-10-20 17:29   ` Ira Weiny
  1 sibling, 2 replies; 57+ messages in thread
From: John Hubbard @ 2020-10-20  8:25 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On 10/19/20 11:18 PM, Kirill A. Shutemov wrote:
> New helpers copy_from_guest()/copy_to_guest() to be used if KVM memory
> protection feature is enabled.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   include/linux/kvm_host.h |  4 ++
>   virt/kvm/kvm_main.c      | 90 +++++++++++++++++++++++++++++++---------
>   2 files changed, 75 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 05e3c2fb3ef7..380a64613880 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -504,6 +504,7 @@ struct kvm {
>   	struct srcu_struct irq_srcu;
>   	pid_t userspace_pid;
>   	unsigned int max_halt_poll_ns;
> +	bool mem_protected;
>   };
>   
>   #define kvm_err(fmt, ...) \
> @@ -728,6 +729,9 @@ void kvm_set_pfn_dirty(kvm_pfn_t pfn);
>   void kvm_set_pfn_accessed(kvm_pfn_t pfn);
>   void kvm_get_pfn(kvm_pfn_t pfn);
>   
> +int copy_from_guest(void *data, unsigned long hva, int len, bool protected);
> +int copy_to_guest(unsigned long hva, const void *data, int len, bool protected);
> +
>   void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache);
>   int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>   			int len);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cf88233b819a..a9884cb8c867 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2313,19 +2313,70 @@ static int next_segment(unsigned long len, int offset)
>   		return len;
>   }
>   
> +int copy_from_guest(void *data, unsigned long hva, int len, bool protected)
> +{
> +	int offset = offset_in_page(hva);
> +	struct page *page;
> +	int npages, seg;
> +
> +	if (!protected)
> +		return __copy_from_user(data, (void __user *)hva, len);
> +
> +	might_fault();
> +	kasan_check_write(data, len);
> +	check_object_size(data, len, false);
> +
> +	while ((seg = next_segment(len, offset)) != 0) {
> +		npages = get_user_pages_unlocked(hva, 1, &page, 0);
> +		if (npages != 1)
> +			return -EFAULT;
> +		memcpy(data, page_address(page) + offset, seg);

Hi Kirill!

OK, so the copy_from_guest() is a read-only case for gup, which I think is safe
from a gup/pup + filesystem point of view, but see below about copy_to_guest()...

> +		put_page(page);
> +		len -= seg;
> +		hva += seg;
> +		offset = 0;
> +	}
> +
> +	return 0;
> +}
> +
> +int copy_to_guest(unsigned long hva, const void *data, int len, bool protected)
> +{
> +	int offset = offset_in_page(hva);
> +	struct page *page;
> +	int npages, seg;
> +
> +	if (!protected)
> +		return __copy_to_user((void __user *)hva, data, len);
> +
> +	might_fault();
> +	kasan_check_read(data, len);
> +	check_object_size(data, len, true);
> +
> +	while ((seg = next_segment(len, offset)) != 0) {
> +		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);


Should copy_to_guest() use pin_user_pages_unlocked() instead of gup_unlocked?
We wrote a  "Case 5" in Documentation/core-api/pin_user_pages.rst, just for this
situation, I think:


CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------
Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls):
     pin_user_pages()
     write to the data within the pages
     unpin_user_pages()
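
The per-page chunking that this pattern would wrap can be modelled in plain
userspace C (a hedged sketch: the "guest" is just a flat buffer indexed by
hva, plain memcpy stands in for the pinned-page access, and
`copy_to_guest_model()` is an illustrative name, not the kernel helper). It
shows how a copy straddling a page boundary is split into per-page segments:

```c
#include <string.h>

#define PAGE_SIZE 4096UL

static unsigned long offset_in_page(unsigned long addr)
{
	return addr & (PAGE_SIZE - 1);
}

/* As in the quoted patch: bytes to copy without crossing a page boundary. */
static int next_segment(unsigned long len, int offset)
{
	if (len > PAGE_SIZE - offset)
		return PAGE_SIZE - offset;
	return len;
}

/*
 * Model of copy_to_guest(): one page's worth per iteration. The kernel
 * version would pin each page (pin_user_pages per Case 5 above), memcpy
 * into it, then unpin, instead of writing into a flat buffer.
 */
int copy_to_guest_model(char *guest, unsigned long hva, const char *data,
			int len)
{
	int offset = offset_in_page(hva);
	int seg;

	while ((seg = next_segment(len, offset)) != 0) {
		memcpy(guest + hva, data, seg);	/* one "pinned page" chunk */
		data += seg;
		len -= seg;
		hva += seg;
		offset = 0;	/* subsequent segments start page-aligned */
	}
	return 0;
}
```

An 8-byte write starting 3 bytes before a page boundary is split into a
3-byte segment and a 5-byte segment, matching the loop in the patch.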


thanks,
-- 
John Hubbard
NVIDIA

> +		if (npages != 1)
> +			return -EFAULT;
> +		memcpy(page_address(page) + offset, data, seg);
> +		put_page(page);
> +		len -= seg;
> +		hva += seg;
> +		offset = 0;
> +	}
> +
> +	return 0;
> +}
> +
>   static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> -				 void *data, int offset, int len)
> +				 void *data, int offset, int len,
> +				 bool protected)
>   {
> -	int r;
>   	unsigned long addr;
>   
>   	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
>   	if (kvm_is_error_hva(addr))
>   		return -EFAULT;
> -	r = __copy_from_user(data, (void __user *)addr + offset, len);
> -	if (r)
> -		return -EFAULT;
> -	return 0;
> +	return copy_from_guest(data, addr + offset, len, protected);
>   }
>   
>   int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
> @@ -2333,7 +2384,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>   {
>   	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>   
> -	return __kvm_read_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_read_guest_page(slot, gfn, data, offset, len,
> +				     kvm->mem_protected);
>   }
>   EXPORT_SYMBOL_GPL(kvm_read_guest_page);
>   
> @@ -2342,7 +2394,8 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
>   {
>   	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>   
> -	return __kvm_read_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_read_guest_page(slot, gfn, data, offset, len,
> +				     vcpu->kvm->mem_protected);
>   }
>   EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
>   
> @@ -2415,7 +2468,8 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
>   EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
>   
>   static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
> -			          const void *data, int offset, int len)
> +			          const void *data, int offset, int len,
> +				  bool protected)
>   {
>   	int r;
>   	unsigned long addr;
> @@ -2423,7 +2477,8 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
>   	addr = gfn_to_hva_memslot(memslot, gfn);
>   	if (kvm_is_error_hva(addr))
>   		return -EFAULT;
> -	r = __copy_to_user((void __user *)addr + offset, data, len);
> +
> +	r = copy_to_guest(addr + offset, data, len, protected);
>   	if (r)
>   		return -EFAULT;
>   	mark_page_dirty_in_slot(memslot, gfn);
> @@ -2435,7 +2490,8 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
>   {
>   	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>   
> -	return __kvm_write_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_write_guest_page(slot, gfn, data, offset, len,
> +				      kvm->mem_protected);
>   }
>   EXPORT_SYMBOL_GPL(kvm_write_guest_page);
>   
> @@ -2444,7 +2500,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>   {
>   	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>   
> -	return __kvm_write_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_write_guest_page(slot, gfn, data, offset, len,
> +				      vcpu->kvm->mem_protected);
>   }
>   EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
>   
> @@ -2560,7 +2617,7 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>   	if (unlikely(!ghc->memslot))
>   		return kvm_write_guest(kvm, gpa, data, len);
>   
> -	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
> +	r = copy_to_guest(ghc->hva + offset, data, len, kvm->mem_protected);
>   	if (r)
>   		return -EFAULT;
>   	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
> @@ -2581,7 +2638,6 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>   				 unsigned long len)
>   {
>   	struct kvm_memslots *slots = kvm_memslots(kvm);
> -	int r;
>   	gpa_t gpa = ghc->gpa + offset;
>   
>   	BUG_ON(len + offset > ghc->len);
> @@ -2597,11 +2653,7 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>   	if (unlikely(!ghc->memslot))
>   		return kvm_read_guest(kvm, gpa, data, len);
>   
> -	r = __copy_from_user(data, (void __user *)ghc->hva + offset, len);
> -	if (r)
> -		return -EFAULT;
> -
> -	return 0;
> +	return copy_from_guest(data, ghc->hva + offset, len, kvm->mem_protected);
>   }
>   EXPORT_SYMBOL_GPL(kvm_read_guest_offset_cached);
>   
> 




* Re: [RFCv2 15/16] KVM: Unmap protected pages from direct mapping
  2020-10-20  6:18 ` [RFCv2 15/16] KVM: Unmap protected pages from direct mapping Kirill A. Shutemov
  2020-10-20  7:12   ` Peter Zijlstra
@ 2020-10-20 12:18   ` David Hildenbrand
  2020-10-20 13:20     ` David Hildenbrand
  2020-10-26 19:55     ` Tom Lendacky
  2020-10-21 18:49   ` Edgecombe, Rick P
  2020-10-23 12:37   ` Mike Rapoport
  3 siblings, 2 replies; 57+ messages in thread
From: David Hildenbrand @ 2020-10-20 12:18 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Mike Rapoport
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On 20.10.20 08:18, Kirill A. Shutemov wrote:
> If the protected memory feature is enabled, unmap guest memory from the
> kernel's direct mappings.

Gah, ugly. I guess this also defeats compaction, swapping, ... oh gosh.
As if all of the encrypted VM implementations didn't bring us enough
ugliness already (SEV extensions also don't support reboots, but can at
least kexec() IIRC).

Something similar is done with secretmem [1]. And people don't seem to
like fragmenting the direct mapping (including me).

[1] https://lkml.kernel.org/r/20200924132904.1391-1-rppt@kernel.org

-- 
Thanks,

David / dhildenb




* Re: [RFCv2 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest
  2020-10-20  8:06   ` Christoph Hellwig
@ 2020-10-20 12:47     ` Kirill A. Shutemov
  0 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20 12:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi,
	Liran Alon, Mike Rapoport, x86, kvm, linux-mm, linux-kernel

On Tue, Oct 20, 2020 at 09:06:58AM +0100, Christoph Hellwig wrote:
> NAK.  Any virtio implementation that needs special DMA OPS treatment
> needs to set the VIRTIO_F_ACCESS_PLATFORM bit.  The only reason the
> Xen hack exists is because it slipped in a long time ago and we can't
> fix that any more.

Thanks. Will fix.

-- 
 Kirill A. Shutemov



* Re: [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-10-20  8:25   ` John Hubbard
@ 2020-10-20 12:51     ` Kirill A. Shutemov
  2020-10-22 11:49     ` Matthew Wilcox
  1 sibling, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20 12:51 UTC (permalink / raw)
  To: John Hubbard
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi,
	Liran Alon, Mike Rapoport, x86, kvm, linux-mm, linux-kernel

On Tue, Oct 20, 2020 at 01:25:59AM -0700, John Hubbard wrote:
> On 10/19/20 11:18 PM, Kirill A. Shutemov wrote:
> > New helpers copy_from_guest()/copy_to_guest() to be used if KVM memory
> > protection feature is enabled.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >   include/linux/kvm_host.h |  4 ++
> >   virt/kvm/kvm_main.c      | 90 +++++++++++++++++++++++++++++++---------
> >   2 files changed, 75 insertions(+), 19 deletions(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 05e3c2fb3ef7..380a64613880 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -504,6 +504,7 @@ struct kvm {
> >   	struct srcu_struct irq_srcu;
> >   	pid_t userspace_pid;
> >   	unsigned int max_halt_poll_ns;
> > +	bool mem_protected;
> >   };
> >   #define kvm_err(fmt, ...) \
> > @@ -728,6 +729,9 @@ void kvm_set_pfn_dirty(kvm_pfn_t pfn);
> >   void kvm_set_pfn_accessed(kvm_pfn_t pfn);
> >   void kvm_get_pfn(kvm_pfn_t pfn);
> > +int copy_from_guest(void *data, unsigned long hva, int len, bool protected);
> > +int copy_to_guest(unsigned long hva, const void *data, int len, bool protected);
> > +
> >   void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache);
> >   int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
> >   			int len);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index cf88233b819a..a9884cb8c867 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2313,19 +2313,70 @@ static int next_segment(unsigned long len, int offset)
> >   		return len;
> >   }
> > +int copy_from_guest(void *data, unsigned long hva, int len, bool protected)
> > +{
> > +	int offset = offset_in_page(hva);
> > +	struct page *page;
> > +	int npages, seg;
> > +
> > +	if (!protected)
> > +		return __copy_from_user(data, (void __user *)hva, len);
> > +
> > +	might_fault();
> > +	kasan_check_write(data, len);
> > +	check_object_size(data, len, false);
> > +
> > +	while ((seg = next_segment(len, offset)) != 0) {
> > +		npages = get_user_pages_unlocked(hva, 1, &page, 0);
> > +		if (npages != 1)
> > +			return -EFAULT;
> > +		memcpy(data, page_address(page) + offset, seg);
> 
> Hi Kirill!
> 
> OK, so the copy_from_guest() is a read-only case for gup, which I think is safe
> from a gup/pup + filesystem point of view, but see below about copy_to_guest()...
> 
> > +		put_page(page);
> > +		len -= seg;
> > +		hva += seg;
> > +		offset = 0;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int copy_to_guest(unsigned long hva, const void *data, int len, bool protected)
> > +{
> > +	int offset = offset_in_page(hva);
> > +	struct page *page;
> > +	int npages, seg;
> > +
> > +	if (!protected)
> > +		return __copy_to_user((void __user *)hva, data, len);
> > +
> > +	might_fault();
> > +	kasan_check_read(data, len);
> > +	check_object_size(data, len, true);
> > +
> > +	while ((seg = next_segment(len, offset)) != 0) {
> > +		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);
> 
> 
> Should copy_to_guest() use pin_user_pages_unlocked() instead of gup_unlocked?
> We wrote a  "Case 5" in Documentation/core-api/pin_user_pages.rst, just for this
> situation, I think:
> 
> 
> CASE 5: Pinning in order to write to the data within the page
> -------------------------------------------------------------
> Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
> write to a page's data, unpin" can cause a problem. Case 5 may be considered a
> superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
> other words, if the code is neither Case 1 nor Case 2, it may still require
> FOLL_PIN, for patterns like this:
> 
> Correct (uses FOLL_PIN calls):
>     pin_user_pages()
>     write to the data within the pages
>     unpin_user_pages()

Right. I didn't internalize changes in GUP interface yet. Will update.

-- 
 Kirill A. Shutemov



* Re: [RFCv2 11/16] KVM: Protected memory extension
  2020-10-20  7:17   ` Peter Zijlstra
@ 2020-10-20 12:55     ` Kirill A. Shutemov
  0 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20 12:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, Liran Alon,
	Mike Rapoport, x86, kvm, linux-mm, linux-kernel

On Tue, Oct 20, 2020 at 09:17:01AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 20, 2020 at 09:18:54AM +0300, Kirill A. Shutemov wrote:
> > +int __kvm_protect_memory(unsigned long start, unsigned long end, bool protect)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +	struct vm_area_struct *vma, *prev;
> > +	int ret;
> > +
> > +	if (mmap_write_lock_killable(mm))
> > +		return -EINTR;
> > +
> > +	ret = -ENOMEM;
> > +	vma = find_vma(current->mm, start);
> > +	if (!vma)
> > +		goto out;
> > +
> > +	ret = -EINVAL;
> > +	if (vma->vm_start > start)
> > +		goto out;
> > +
> > +	if (start > vma->vm_start)
> > +		prev = vma;
> > +	else
> > +		prev = vma->vm_prev;
> > +
> > +	ret = 0;
> > +	while (true) {
> > +		unsigned long newflags, tmp;
> > +
> > +		tmp = vma->vm_end;
> > +		if (tmp > end)
> > +			tmp = end;
> > +
> > +		newflags = vma->vm_flags;
> > +		if (protect)
> > +			newflags |= VM_KVM_PROTECTED;
> > +		else
> > +			newflags &= ~VM_KVM_PROTECTED;
> > +
> > +		/* The VMA has been handled as part of other memslot */
> > +		if (newflags == vma->vm_flags)
> > +			goto next;
> > +
> > +		ret = mprotect_fixup(vma, &prev, start, tmp, newflags);
> > +		if (ret)
> > +			goto out;
> > +
> > +next:
> > +		start = tmp;
> > +		if (start < prev->vm_end)
> > +			start = prev->vm_end;
> > +
> > +		if (start >= end)
> > +			goto out;
> > +
> > +		vma = prev->vm_next;
> > +		if (!vma || vma->vm_start != start) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +	}
> > +out:
> > +	mmap_write_unlock(mm);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(__kvm_protect_memory);
> 
> Since migration will be disabled after this; should the above not (at
> the very least) force compaction before proceeding to lock the pages in?

Migration has to be implemented instead, before we hit upstream.

BTW, VMs with direct device assignment pin all guest memory today, so
it's not something new in the virtualization world.

-- 
 Kirill A. Shutemov



* Re: [RFCv2 15/16] KVM: Unmap protected pages from direct mapping
  2020-10-20 12:18   ` David Hildenbrand
@ 2020-10-20 13:20     ` David Hildenbrand
  2020-10-21  1:20       ` Edgecombe, Rick P
  2020-10-26 19:55     ` Tom Lendacky
  1 sibling, 1 reply; 57+ messages in thread
From: David Hildenbrand @ 2020-10-20 13:20 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Mike Rapoport
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On 20.10.20 14:18, David Hildenbrand wrote:
> On 20.10.20 08:18, Kirill A. Shutemov wrote:
>> If the protected memory feature is enabled, unmap guest memory from the
>> kernel's direct mappings.
> 
> Gah, ugly. I guess this also defeats compaction, swapping, ... oh gosh.
> As if all of the encrypted VM implementations didn't bring us enough
> ugliness already (SEV extensions also don't support reboots, but can at
> least kexec() IIRC).
> 
> Something similar is done with secretmem [1]. And people don't seem to
> like fragmenting the direct mapping (including me).
> 
> [1] https://lkml.kernel.org/r/20200924132904.1391-1-rppt@kernel.org
> 

I just thought "hey, we might have to replace pud/pmd mappings by page
tables when calling kernel_map_pages()"; this can fail with -ENOMEM, so why
isn't there proper error handling?

Then I dived into __kernel_map_pages() which states:

"The return value is ignored as the calls cannot fail. Large pages for
identity mappings are not used at boot time and hence no memory
allocations during large page split."

I am probably missing something important, but how is calling
kernel_map_pages() safe *after* booting?! I know we use it for
debug_pagealloc(), but using it in a production-ready feature feels
completely irresponsible. What am I missing?


-- 
Thanks,

David / dhildenb




* Re: [RFCv2 00/16] KVM protected memory extension
  2020-10-20  7:46 ` [RFCv2 00/16] KVM protected memory extension Vitaly Kuznetsov
@ 2020-10-20 13:49   ` Kirill A. Shutemov
  2020-10-21 14:46     ` Vitaly Kuznetsov
  0 siblings, 1 reply; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-20 13:49 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Wanpeng Li, Jim Mattson, Joerg Roedel

On Tue, Oct 20, 2020 at 09:46:11AM +0200, Vitaly Kuznetsov wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > == Background / Problem ==
> >
> > There are a number of hardware features (MKTME, SEV) which protect guest
> > memory from some unauthorized host access. The patchset proposes a purely
> > software feature that mitigates some of the same host-side read-only
> > attacks.
> >
> >
> > == What does this set mitigate? ==
> >
> >  - Host kernel ”accidental” access to guest data (think speculation)
> >
> >  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> >
> >  - Host userspace access to guest data (compromised qemu)
> >
> >  - Guest privilege escalation via compromised QEMU device emulation
> >
> > == What does this set NOT mitigate? ==
> >
> >  - Full host kernel compromise.  Kernel will just map the pages again.
> >
> >  - Hardware attacks
> >
> >
> > The second RFC revision addresses /most/ of the feedback.
> >
> > I still haven't found a good solution for reboot and kexec. Unprotecting
> > all the memory on such operations defeats the goal of the feature.
> > Clearing most of the memory before unprotecting what is required for
> > reboot (or kexec) is tedious and error-prone.
> > Maybe we should just declare them unsupported?
> 
> Making reboot unsupported is a hard sell. Could you please elaborate on
> why you think that "unprotect all" hypercall (or rather a single
> hypercall supporting both protecting/unprotecting) defeats the purpose
> of the feature?

If the guest has data it prefers not to leak to the host and uses the
feature for that purpose, having to share all the memory just to get
through a reboot is a very weak point.

> 
> clean up *all* its memory upon reboot, however:
> - It may only clean up the most sensitive parts. This should probably be
> done even without this new feature and even on bare metal (think about
> next boot target being malicious).
> - The attack window shrinks significantly. "Speculative" bugs require
> time to exploit and it will only remain open until it boots up again
> (few seconds).

Maybe it would be cleaner to handle reboot in userspace? If the VM gets
rebooted, just reconstruct it from scratch as if it were a fresh boot.

-- 
 Kirill A. Shutemov



* Re: [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-10-20  6:18 ` [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory Kirill A. Shutemov
  2020-10-20  8:25   ` John Hubbard
@ 2020-10-20 17:29   ` Ira Weiny
  2020-10-22 11:37     ` Kirill A. Shutemov
  1 sibling, 1 reply; 57+ messages in thread
From: Ira Weiny @ 2020-10-20 17:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, Liran Alon,
	Mike Rapoport, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Tue, Oct 20, 2020 at 09:18:51AM +0300, Kirill A. Shutemov wrote:
> New helpers copy_from_guest()/copy_to_guest() to be used if KVM memory
> protection feature is enabled.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/kvm_host.h |  4 ++
>  virt/kvm/kvm_main.c      | 90 +++++++++++++++++++++++++++++++---------
>  2 files changed, 75 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 05e3c2fb3ef7..380a64613880 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -504,6 +504,7 @@ struct kvm {
>  	struct srcu_struct irq_srcu;
>  	pid_t userspace_pid;
>  	unsigned int max_halt_poll_ns;
> +	bool mem_protected;
>  };
>  
>  #define kvm_err(fmt, ...) \
> @@ -728,6 +729,9 @@ void kvm_set_pfn_dirty(kvm_pfn_t pfn);
>  void kvm_set_pfn_accessed(kvm_pfn_t pfn);
>  void kvm_get_pfn(kvm_pfn_t pfn);
>  
> +int copy_from_guest(void *data, unsigned long hva, int len, bool protected);
> +int copy_to_guest(unsigned long hva, const void *data, int len, bool protected);
> +
>  void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache);
>  int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>  			int len);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cf88233b819a..a9884cb8c867 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2313,19 +2313,70 @@ static int next_segment(unsigned long len, int offset)
>  		return len;
>  }
>  
> +int copy_from_guest(void *data, unsigned long hva, int len, bool protected)
> +{
> +	int offset = offset_in_page(hva);
> +	struct page *page;
> +	int npages, seg;
> +
> +	if (!protected)
> +		return __copy_from_user(data, (void __user *)hva, len);
> +
> +	might_fault();
> +	kasan_check_write(data, len);
> +	check_object_size(data, len, false);
> +
> +	while ((seg = next_segment(len, offset)) != 0) {
> +		npages = get_user_pages_unlocked(hva, 1, &page, 0);
> +		if (npages != 1)
> +			return -EFAULT;
> +		memcpy(data, page_address(page) + offset, seg);
> +		put_page(page);
> +		len -= seg;
> +		hva += seg;
> +		offset = 0;

Why is data not updated on each iteration?

> +	}
> +
> +	return 0;
> +}
> +
> +int copy_to_guest(unsigned long hva, const void *data, int len, bool protected)
> +{
> +	int offset = offset_in_page(hva);
> +	struct page *page;
> +	int npages, seg;
> +
> +	if (!protected)
> +		return __copy_to_user((void __user *)hva, data, len);
> +
> +	might_fault();
> +	kasan_check_read(data, len);
> +	check_object_size(data, len, true);
> +
> +	while ((seg = next_segment(len, offset)) != 0) {
> +		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);
> +		if (npages != 1)
> +			return -EFAULT;
> +		memcpy(page_address(page) + offset, data, seg);
> +		put_page(page);
> +		len -= seg;
> +		hva += seg;
> +		offset = 0;

Same question?  Doesn't this result in *data being copied to multiple pages?

Ira

> +	}
> +
> +	return 0;
> +}
> +
>  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> -				 void *data, int offset, int len)
> +				 void *data, int offset, int len,
> +				 bool protected)
>  {
> -	int r;
>  	unsigned long addr;
>  
>  	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
>  	if (kvm_is_error_hva(addr))
>  		return -EFAULT;
> -	r = __copy_from_user(data, (void __user *)addr + offset, len);
> -	if (r)
> -		return -EFAULT;
> -	return 0;
> +	return copy_from_guest(data, addr + offset, len, protected);
>  }
>  
>  int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
> @@ -2333,7 +2384,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>  {
>  	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>  
> -	return __kvm_read_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_read_guest_page(slot, gfn, data, offset, len,
> +				     kvm->mem_protected);
>  }
>  EXPORT_SYMBOL_GPL(kvm_read_guest_page);
>  
> @@ -2342,7 +2394,8 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
>  {
>  	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>  
> -	return __kvm_read_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_read_guest_page(slot, gfn, data, offset, len,
> +				     vcpu->kvm->mem_protected);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
>  
> @@ -2415,7 +2468,8 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
>  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
>  
>  static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
> -			          const void *data, int offset, int len)
> +			          const void *data, int offset, int len,
> +				  bool protected)
>  {
>  	int r;
>  	unsigned long addr;
> @@ -2423,7 +2477,8 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
>  	addr = gfn_to_hva_memslot(memslot, gfn);
>  	if (kvm_is_error_hva(addr))
>  		return -EFAULT;
> -	r = __copy_to_user((void __user *)addr + offset, data, len);
> +
> +	r = copy_to_guest(addr + offset, data, len, protected);
>  	if (r)
>  		return -EFAULT;
>  	mark_page_dirty_in_slot(memslot, gfn);
> @@ -2435,7 +2490,8 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
>  {
>  	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>  
> -	return __kvm_write_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_write_guest_page(slot, gfn, data, offset, len,
> +				      kvm->mem_protected);
>  }
>  EXPORT_SYMBOL_GPL(kvm_write_guest_page);
>  
> @@ -2444,7 +2500,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  {
>  	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>  
> -	return __kvm_write_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_write_guest_page(slot, gfn, data, offset, len,
> +				      vcpu->kvm->mem_protected);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
>  
> @@ -2560,7 +2617,7 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>  	if (unlikely(!ghc->memslot))
>  		return kvm_write_guest(kvm, gpa, data, len);
>  
> -	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
> +	r = copy_to_guest(ghc->hva + offset, data, len, kvm->mem_protected);
>  	if (r)
>  		return -EFAULT;
>  	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
> @@ -2581,7 +2638,6 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>  				 unsigned long len)
>  {
>  	struct kvm_memslots *slots = kvm_memslots(kvm);
> -	int r;
>  	gpa_t gpa = ghc->gpa + offset;
>  
>  	BUG_ON(len + offset > ghc->len);
> @@ -2597,11 +2653,7 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>  	if (unlikely(!ghc->memslot))
>  		return kvm_read_guest(kvm, gpa, data, len);
>  
> -	r = __copy_from_user(data, (void __user *)ghc->hva + offset, len);
> -	if (r)
> -		return -EFAULT;
> -
> -	return 0;
> +	return copy_from_guest(data, ghc->hva + offset, len, kvm->mem_protected);
>  }
>  EXPORT_SYMBOL_GPL(kvm_read_guest_offset_cached);
>  
> -- 
> 2.26.2
> 
> 



* Re: [RFCv2 15/16] KVM: Unmap protected pages from direct mapping
  2020-10-20 13:20     ` David Hildenbrand
@ 2020-10-21  1:20       ` Edgecombe, Rick P
  0 siblings, 0 replies; 57+ messages in thread
From: Edgecombe, Rick P @ 2020-10-21  1:20 UTC (permalink / raw)
  To: peterz, jmattson, Christopherson, Sean J, vkuznets, david,
	dave.hansen, joro, luto, kirill, pbonzini, rppt, wanpengli
  Cc: kvm, linux-kernel, kirill.shutemov, keescook, rppt, linux-mm,
	x86, aarcange, liran.alon, wad, rientjes, Kleen, Andi

On Tue, 2020-10-20 at 15:20 +0200, David Hildenbrand wrote:
> On 20.10.20 14:18, David Hildenbrand wrote:
> > On 20.10.20 08:18, Kirill A. Shutemov wrote:
> > > If the protected memory feature enabled, unmap guest memory from
> > > kernel's direct mappings.
> > 
> > Gah, ugly. I guess this also defeats compaction, swapping, ... oh
> > gosh.
> > As if all of the encrypted VM implementations didn't bring us
> > enough
> > ugliness already (SEV extensions also don't support reboots, but
> > can at
> > least kexec() IIRC).
> > 
> > Something similar is done with secretmem [1]. And people don't seem
> > to
> > like fragmenting the direct mapping (including me).
> > 
> > [1] https://lkml.kernel.org/r/20200924132904.1391-1-rppt@kernel.org
> > 
> 
> I just thought "hey, we might have to replace pud/pmd mappings by page
> tables when calling kernel_map_pages()". This can fail with -ENOMEM, so
> why isn't there proper error handling?
> 
> Then I dived into __kernel_map_pages() which states:
> 
> "The return value is ignored as the calls cannot fail. Large pages
> for
> identity mappings are not used at boot time and hence no memory
> allocations during large page split."
> 
> I am probably missing something important, but how is calling
> kernel_map_pages() safe *after* booting?! I know we use it for
> debug_pagealloc(), but using it in a production-ready feature feels
> completely irresponsible. What am I missing?
> 

My understanding was that DEBUG_PAGEALLOC maps the direct map at 4k
post-boot as well, so that kernel_map_pages() can't fail for it, and to
avoid having to take locks for splits in interrupts. I think you are on
to something though. That function is essentially a set_memory_()
function on x86 at least, and many callers do not check the error
codes. Kees Cook has been pushing to fix this actually:
https://github.com/KSPP/linux/issues/7

As long as a page has been broken to 4k, it should be possible to
remap it without failure (which is what debug page alloc relies on). If
an initial call to restrict permissions needs to break a page and fails
to do so, its remap before freeing the page does not need to succeed
either, since the page never got set with a lower permission. That's my
understanding from x86's cpa code at least. The problem becomes that
the page silently doesn't have its expected protections during use.
Unchecked set_memory_() caching property change failures might result
in functional problems though.

So there is some background if you wanted it, but yes, I agree this
feature should handle the case where the unmap failed, probably by
returning an error on the IOCTL and maybe the hypercall.
kernel_map_pages() also doesn't do a TLB shootdown though, so this
direct map protection is more in the best-effort category in its
current state.


As for unmapping causing direct map fragmentation: the two techniques
floating around, besides breaking pages indiscriminately, seem to be:
1. Group and cache unmapped pages to minimize the damage (like
secretmem).
2. Somehow repair the mappings back to large pages when they are reset
RW (as Kirill tried earlier this year in another series).

I had imagined this usage would eventually want something like that,
but neither helps with the other limitations you mentioned (migration,
etc.).


* Re: [RFCv2 00/16] KVM protected memory extension
  2020-10-20 13:49   ` Kirill A. Shutemov
@ 2020-10-21 14:46     ` Vitaly Kuznetsov
  2020-10-23 11:35       ` Kirill A. Shutemov
  0 siblings, 1 reply; 57+ messages in thread
From: Vitaly Kuznetsov @ 2020-10-21 14:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Wanpeng Li, Jim Mattson, Joerg Roedel

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Tue, Oct 20, 2020 at 09:46:11AM +0200, Vitaly Kuznetsov wrote:
>> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>> 
>> > == Background / Problem ==
>> >
>> > There are a number of hardware features (MKTME, SEV) which protect guest
>> > memory from some unauthorized host access. The patchset proposes a purely
>> > software feature that mitigates some of the same host-side read-only
>> > attacks.
>> >
>> >
>> > == What does this set mitigate? ==
>> >
>> >  - Host kernel ”accidental” access to guest data (think speculation)
>> >
>> >  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>> >
>> >  - Host userspace access to guest data (compromised qemu)
>> >
>> >  - Guest privilege escalation via compromised QEMU device emulation
>> >
>> > == What does this set NOT mitigate? ==
>> >
>> >  - Full host kernel compromise.  Kernel will just map the pages again.
>> >
>> >  - Hardware attacks
>> >
>> >
>> > The second RFC revision addresses /most/ of the feedback.
>> >
>> > I still haven't found a good solution for reboot and kexec. Unprotecting
>> > all the memory on such operations defeats the goal of the feature.
>> > Clearing most of the memory before unprotecting what is required for
>> > reboot (or kexec) is tedious and error-prone.
>> > Maybe we should just declare them unsupported?
>> 
>> Making reboot unsupported is a hard sell. Could you please elaborate on
>> why you think that "unprotect all" hypercall (or rather a single
>> hypercall supporting both protecting/unprotecting) defeats the purpose
>> of the feature?
>
> If the guest has data it prefers not to leak to the host and uses the
> feature for that purpose, having to share all the memory just to get
> through a reboot is a very weak point.
>

My point is that if the guest knows there's something sensitive in its
memory, it should clean that up even today, without your feature,
before rebooting to an unknown target.

>> 
>> clean up *all* its memory upon reboot, however:
>> - It may only clean up the most sensitive parts. This should probably be
>> done even without this new feature and even on bare metal (think about
>> next boot target being malicious).
>> - The attack window shrinks significantly. "Speculative" bugs require
>> time to exploit and it will only remain open until it boots up again
>> (few seconds).
>
> Maybe it would be cleaner to handle reboot in userspace? If the VM gets
> rebooted, just reconstruct it from scratch as if it were a fresh boot.

We are definitely not trying to protect against malicious KVM so maybe
we can do the cleanup there (when protection was enabled) so we can
unprotect everything without risk of a leak?

-- 
Vitaly




* Re: [RFCv2 00/16] KVM protected memory extension
  2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
                   ` (16 preceding siblings ...)
  2020-10-20  7:46 ` [RFCv2 00/16] KVM protected memory extension Vitaly Kuznetsov
@ 2020-10-21 18:20 ` Andy Lutomirski
  2020-10-26 15:29   ` Kirill A. Shutemov
  17 siblings, 1 reply; 57+ messages in thread
From: Andy Lutomirski @ 2020-10-21 18:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, Liran Alon,
	Mike Rapoport, X86 ML, kvm list, Linux-MM, LKML,
	Kirill A. Shutemov

> On Oct 19, 2020, at 11:19 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:

> For removing the userspace mapping, use a trick similar to what NUMA
> balancing does: convert memory that belongs to KVM memory slots to
> PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
>

I definitely like the direction this patchset is going in, and I think
that allowing KVM guests to have memory that is inaccessible to QEMU
is a great idea.

I do wonder, though: do we really want to do this with these PROT_NONE
tricks, or should we actually come up with a way to have KVM guest map
memory that isn't mapped into QEMU's mm_struct at all?  As an example
of the latter, I mean something a bit like this:

https://lkml.kernel.org/r/CALCETrUSUp_7svg8EHNTk3nQ0x9sdzMCU=h8G-Sy6=SODq5GHg@mail.gmail.com

I don't mean to say that this is a requirement of any kind of
protected memory like this, but I do think we should understand the
tradeoffs, in terms of what a full implementation looks like, the
effort and time frames involved, and the maintenance burden of
supporting whatever gets merged going forward.



* Re: [RFCv2 09/16] KVM: mm: Introduce VM_KVM_PROTECTED
  2020-10-20  6:18 ` [RFCv2 09/16] KVM: mm: Introduce VM_KVM_PROTECTED Kirill A. Shutemov
@ 2020-10-21 18:47   ` Edgecombe, Rick P
  2020-10-22 12:01     ` Kirill A. Shutemov
  0 siblings, 1 reply; 57+ messages in thread
From: Edgecombe, Rick P @ 2020-10-21 18:47 UTC (permalink / raw)
  To: peterz, jmattson, Christopherson, Sean J, vkuznets, dave.hansen,
	joro, luto, kirill, pbonzini, wanpengli
  Cc: kvm, linux-kernel, kirill.shutemov, keescook, rppt, linux-mm,
	x86, aarcange, liran.alon, wad, rientjes, Kleen, Andi

On Tue, 2020-10-20 at 09:18 +0300, Kirill A. Shutemov wrote:
>  include/linux/mm.h  |  8 ++++++++
>  mm/gup.c            | 20 ++++++++++++++++----
>  mm/huge_memory.c    | 20 ++++++++++++++++----
>  mm/memory.c         |  3 +++
>  mm/mmap.c           |  3 +++
>  virt/kvm/async_pf.c |  2 +-
>  virt/kvm/kvm_main.c |  9 +++++----
>  7 files changed, 52 insertions(+), 13 deletions(-)

There is another get_user_pages() in
paging_tmpl.h:FNAME(cmpxchg_gpte). 

Also, did you leave off FOLL_KVM in gfn_to_page_many_atomic() on
purpose? It looks like it's only used for pte prefetch, which I guess
is not required.


* Re: [RFCv2 15/16] KVM: Unmap protected pages from direct mapping
  2020-10-20  6:18 ` [RFCv2 15/16] KVM: Unmap protected pages from direct mapping Kirill A. Shutemov
  2020-10-20  7:12   ` Peter Zijlstra
  2020-10-20 12:18   ` David Hildenbrand
@ 2020-10-21 18:49   ` Edgecombe, Rick P
  2020-10-23 12:37   ` Mike Rapoport
  3 siblings, 0 replies; 57+ messages in thread
From: Edgecombe, Rick P @ 2020-10-21 18:49 UTC (permalink / raw)
  To: peterz, jmattson, Christopherson, Sean J, vkuznets, dave.hansen,
	joro, luto, kirill, pbonzini, wanpengli
  Cc: kvm, linux-kernel, kirill.shutemov, keescook, rppt, linux-mm,
	x86, aarcange, liran.alon, wad, rientjes, Kleen, Andi

On Tue, 2020-10-20 at 09:18 +0300, Kirill A. Shutemov wrote:
> If the protected memory feature enabled, unmap guest memory from
> kernel's direct mappings.
> 
> Migration and KSM is disabled for protected memory as it would
> require a
> special treatment.
> 
So do we care about this scenario where a malicious userspace causes a
kernel oops? I'm not sure if it's prevented somehow.

CPU0 (exercising other kernel functionality)	CPU1
						mark page shared
page = get_user_pages(!FOLL_KVM)
						mark page private
kmap(page)
access unmapped page and oops



* Re: [RFCv2 14/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
  2020-10-20  6:18 ` [RFCv2 14/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn() Kirill A. Shutemov
@ 2020-10-21 18:50   ` Edgecombe, Rick P
  2020-10-22 12:06     ` Kirill A. Shutemov
  2020-10-22  3:26   ` Halil Pasic
  1 sibling, 1 reply; 57+ messages in thread
From: Edgecombe, Rick P @ 2020-10-21 18:50 UTC (permalink / raw)
  To: peterz, jmattson, Christopherson, Sean J, vkuznets, dave.hansen,
	joro, luto, kirill, pbonzini, wanpengli
  Cc: kvm, linux-kernel, kirill.shutemov, keescook, rppt, linux-mm,
	x86, aarcange, liran.alon, wad, rientjes, Kleen, Andi

On Tue, 2020-10-20 at 09:18 +0300, Kirill A. Shutemov wrote:
> We cannot access protected pages directly. Use ioremap() to
> create a temporary mapping of the page. The mapping is destroyed
> on __kvm_unmap_gfn().
> 
> The new interface gfn_to_pfn_memslot_protected() is used to detect if
> the page is protected.
> 
> ioremap_cache_force() is a hack to bypass IORES_MAP_SYSTEM_RAM check
> in
> the x86 ioremap code. We need a better solution.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
>  arch/x86/include/asm/io.h              |  2 +
>  arch/x86/include/asm/pgtable_types.h   |  1 +
>  arch/x86/kvm/mmu/mmu.c                 |  6 ++-
>  arch/x86/mm/ioremap.c                  | 16 ++++++--
>  include/linux/kvm_host.h               |  3 +-
>  include/linux/kvm_types.h              |  1 +
>  virt/kvm/kvm_main.c                    | 52 +++++++++++++++++++-----
> --
>  9 files changed, 63 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 38ea396a23d6..8e06cd3f759c 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -590,7 +590,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu
> *vcpu,
>  	} else {
>  		/* Call KVM generic code to do the slow-path check */
>  		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
> -					   writing, &write_ok);
> +					   writing, &write_ok, NULL);
>  		if (is_error_noslot_pfn(pfn))
>  			return -EFAULT;
>  		page = NULL;
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 22a677b18695..6fd4e3f9b66a 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -822,7 +822,7 @@ int kvmppc_book3s_instantiate_page(struct
> kvm_vcpu *vcpu,
>  
>  		/* Call KVM generic code to do the slow-path check */
>  		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
> -					   writing, upgrade_p);
> +					   writing, upgrade_p, NULL);
>  		if (is_error_noslot_pfn(pfn))
>  			return -EFAULT;
>  		page = NULL;
> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index c58d52fd7bf2..a3e1bfad1026 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -184,6 +184,8 @@ extern void __iomem *ioremap_uc(resource_size_t
> offset, unsigned long size);
>  #define ioremap_uc ioremap_uc
>  extern void __iomem *ioremap_cache(resource_size_t offset, unsigned
> long size);
>  #define ioremap_cache ioremap_cache
> +extern void __iomem *ioremap_cache_force(resource_size_t offset,
> unsigned long size);
> +#define ioremap_cache_force ioremap_cache_force
>  extern void __iomem *ioremap_prot(resource_size_t offset, unsigned
> long size, unsigned long prot_val);
>  #define ioremap_prot ioremap_prot
>  extern void __iomem *ioremap_encrypted(resource_size_t phys_addr,
> unsigned long size);
> diff --git a/arch/x86/include/asm/pgtable_types.h
> b/arch/x86/include/asm/pgtable_types.h
> index 816b31c68550..4c16a9583786 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -147,6 +147,7 @@ enum page_cache_mode {
>  	_PAGE_CACHE_MODE_UC       = 3,
>  	_PAGE_CACHE_MODE_WT       = 4,
>  	_PAGE_CACHE_MODE_WP       = 5,
> +	_PAGE_CACHE_MODE_WB_FORCE = 6,
>  
>  	_PAGE_CACHE_MODE_NUM      = 8
>  };
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 71aa3da2a0b7..162cb285b87b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4058,7 +4058,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu,
> bool prefault, gfn_t gfn,
>  	}
>  
>  	async = false;
> -	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write,
> writable);
> +	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write,
> writable,
> +				    NULL);
>  	if (!async)
>  		return false; /* *pfn has correct page already */
>  
> @@ -4072,7 +4073,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu,
> bool prefault, gfn_t gfn,
>  			return true;
>  	}
>  
> -	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write,
> writable);
> +	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write,
> writable,
> +				    NULL);
>  	return false;
>  }
>  
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 9e5ccc56f8e0..4409785e294c 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -202,9 +202,12 @@ __ioremap_caller(resource_size_t phys_addr,
> unsigned long size,
>  	__ioremap_check_mem(phys_addr, size, &io_desc);
>  
>  	/*
> -	 * Don't allow anybody to remap normal RAM that we're using..
> +	 * Don't allow anybody to remap normal RAM that we're using,
> unless
> +	 * _PAGE_CACHE_MODE_WB_FORCE is used.
>  	 */
> -	if (io_desc.flags & IORES_MAP_SYSTEM_RAM) {
> +	if (pcm == _PAGE_CACHE_MODE_WB_FORCE) {
> +	    pcm = _PAGE_CACHE_MODE_WB;
> +	} else if (io_desc.flags & IORES_MAP_SYSTEM_RAM) {
>  		WARN_ONCE(1, "ioremap on RAM at %pa - %pa\n",
>  			  &phys_addr, &last_addr);
>  		return NULL;
> @@ -419,6 +422,13 @@ void __iomem *ioremap_cache(resource_size_t
> phys_addr, unsigned long size)
>  }
>  EXPORT_SYMBOL(ioremap_cache);
>  
> +void __iomem *ioremap_cache_force(resource_size_t phys_addr,
> unsigned long size)
> +{
> +	return __ioremap_caller(phys_addr, size,
> _PAGE_CACHE_MODE_WB_FORCE,
> +				__builtin_return_address(0), false);
> +}
> +EXPORT_SYMBOL(ioremap_cache_force);
> +
>  void __iomem *ioremap_prot(resource_size_t phys_addr, unsigned long
> size,
>  				unsigned long prot_val)
>  {
> @@ -467,7 +477,7 @@ void iounmap(volatile void __iomem *addr)
>  	p = find_vm_area((void __force *)addr);
>  
>  	if (!p) {
> -		printk(KERN_ERR "iounmap: bad address %p\n", addr);
> +		printk(KERN_ERR "iounmap: bad address %px\n", addr);

Unintentional?

>  		dump_stack();
>  		return;
>  	}
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 6655e8da4555..0d5f3885747b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -238,6 +238,7 @@ struct kvm_host_map {
>  	void *hva;
>  	kvm_pfn_t pfn;
>  	kvm_pfn_t gfn;
> +	bool protected;
>  };
>  
>  /*
> @@ -725,7 +726,7 @@ kvm_pfn_t gfn_to_pfn_memslot(struct
> kvm_memory_slot *slot, gfn_t gfn);
>  kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot,
> gfn_t gfn);
>  kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t
> gfn,
>  			       bool atomic, bool *async, bool
> write_fault,
> -			       bool *writable);
> +			       bool *writable, bool *protected);
>  
>  void kvm_release_pfn_clean(kvm_pfn_t pfn);
>  void kvm_release_pfn_dirty(kvm_pfn_t pfn);
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index a7580f69dda0..0a8c6426b4f4 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -58,6 +58,7 @@ struct gfn_to_pfn_cache {
>  	gfn_t gfn;
>  	kvm_pfn_t pfn;
>  	bool dirty;
> +	bool protected;
>  };
>  
>  #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9b569b78874a..7c2c764c28c5 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1852,9 +1852,10 @@ static bool hva_to_pfn_fast(unsigned long
> addr, bool write_fault,
>   * 1 indicates success, -errno is returned if error is detected.
>   */
>  static int hva_to_pfn_slow(unsigned long addr, bool *async, bool
> write_fault,
> -			   bool *writable, kvm_pfn_t *pfn)
> +			   bool *writable, bool *protected, kvm_pfn_t
> *pfn)
>  {
>  	unsigned int flags = FOLL_HWPOISON | FOLL_KVM;
> +	struct vm_area_struct *vma;
>  	struct page *page;
>  	int npages = 0;
>  
> @@ -1868,9 +1869,15 @@ static int hva_to_pfn_slow(unsigned long addr,
> bool *async, bool write_fault,
>  	if (async)
>  		flags |= FOLL_NOWAIT;
>  
> -	npages = get_user_pages_unlocked(addr, 1, &page, flags);
> -	if (npages != 1)
> +	mmap_read_lock(current->mm);
> +	npages = get_user_pages(addr, 1, flags, &page, &vma);
> +	if (npages != 1) {
> +		mmap_read_unlock(current->mm);
>  		return npages;
> +	}
> +	if (protected)
> +		*protected = vma_is_kvm_protected(vma);
> +	mmap_read_unlock(current->mm);
>  
>  	/* map read fault as writable if possible */
>  	if (unlikely(!write_fault) && writable) {
> @@ -1961,7 +1968,7 @@ static int hva_to_pfn_remapped(struct
> vm_area_struct *vma,
>   *     whether the mapping is writable.
>   */
>  static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool
> *async,
> -			bool write_fault, bool *writable)
> +			bool write_fault, bool *writable, bool
> *protected)
>  {
>  	struct vm_area_struct *vma;
>  	kvm_pfn_t pfn = 0;
> @@ -1976,7 +1983,8 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
>  	if (atomic)
>  		return KVM_PFN_ERR_FAULT;
>  
> -	npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfn);
> +	npages = hva_to_pfn_slow(addr, async, write_fault, writable, protected,
> +				 &pfn);
>  	if (npages == 1)
>  		return pfn;
>  
> @@ -2010,7 +2018,7 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
>  
>  kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
>  			       bool atomic, bool *async, bool write_fault,
> -			       bool *writable)
> +			       bool *writable, bool *protected)
>  {
>  	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
>  
> @@ -2033,7 +2041,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
>  	}
>  
>  	return hva_to_pfn(addr, atomic, async, write_fault,
> -			  writable);
> +			  writable, protected);
>  }
>  EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>  
> @@ -2041,19 +2049,26 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		      bool *writable)
>  {
>  	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
> -				    write_fault, writable);
> +				    write_fault, writable, NULL);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
>  
>  kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> -	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL);
> +	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
>  
> +static kvm_pfn_t gfn_to_pfn_memslot_protected(struct kvm_memory_slot *slot,
> +					      gfn_t gfn, bool *protected)
> +{
> +	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL,
> +				    protected);
> +}
> +
>  kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> -	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL);
> +	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL, NULL);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
>  
> @@ -2134,7 +2149,7 @@ static void kvm_cache_gfn_to_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
>  {
>  	kvm_release_pfn(cache->pfn, cache->dirty, cache);
>  
> -	cache->pfn = gfn_to_pfn_memslot(slot, gfn);
> +	cache->pfn = gfn_to_pfn_memslot_protected(slot, gfn, &cache->protected);
>  	cache->gfn = gfn;
>  	cache->dirty = false;
>  	cache->generation = gen;
> @@ -2149,6 +2164,7 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
>  	void *hva = NULL;
>  	struct page *page = KVM_UNMAPPED_PAGE;
>  	struct kvm_memory_slot *slot = __gfn_to_memslot(slots, gfn);
> +	bool protected;
>  	u64 gen = slots->generation;
>  
>  	if (!map)
> @@ -2162,15 +2178,20 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
>  			kvm_cache_gfn_to_pfn(slot, gfn, cache, gen);
>  		}
>  		pfn = cache->pfn;
> +		protected = cache->protected;
>  	} else {
>  		if (atomic)
>  			return -EAGAIN;
> -		pfn = gfn_to_pfn_memslot(slot, gfn);
> +		pfn = gfn_to_pfn_memslot_protected(slot, gfn, &protected);
>  	}
>  	if (is_error_noslot_pfn(pfn))
>  		return -EINVAL;
>  
> -	if (pfn_valid(pfn)) {
> +	if (protected) {
> +		if (atomic)
> +			return -EAGAIN;
> +		hva = ioremap_cache_force(pfn_to_hpa(pfn), PAGE_SIZE);
> +	} else if (pfn_valid(pfn)) {
>  		page = pfn_to_page(pfn);
>  		if (atomic)
>  			hva = kmap_atomic(page);

I think the page could have got unmapped since the gup via the
hypercall on another CPU. It could be an avenue for the guest to crash
the host.

> @@ -2191,6 +2212,7 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
>  	map->hva = hva;
>  	map->pfn = pfn;
>  	map->gfn = gfn;
> +	map->protected = protected;
>  
>  	return 0;
>  }
> @@ -2221,7 +2243,9 @@ static void __kvm_unmap_gfn(struct kvm_memory_slot *memslot,
>  	if (!map->hva)
>  		return;
>  
> -	if (map->page != KVM_UNMAPPED_PAGE) {
> +	if (map->protected) {
> +		iounmap(map->hva);
> +	} else if (map->page != KVM_UNMAPPED_PAGE) {
>  		if (atomic)
>  			kunmap_atomic(map->hva);
>  		else


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFCv2 14/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
  2020-10-20  6:18 ` [RFCv2 14/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn() Kirill A. Shutemov
  2020-10-21 18:50   ` Edgecombe, Rick P
@ 2020-10-22  3:26   ` Halil Pasic
  2020-10-22 12:07     ` Kirill A. Shutemov
  1 sibling, 1 reply; 57+ messages in thread
From: Halil Pasic @ 2020-10-22  3:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, Liran Alon,
	Mike Rapoport, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Tue, 20 Oct 2020 09:18:57 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> We cannot access protected pages directly. Use ioremap() to
> create a temporary mapping of the page. The mapping is destroyed
> on __kvm_unmap_gfn().
> 
> The new interface gfn_to_pfn_memslot_protected() is used to detect if
> the page is protected.
> 
> ioremap_cache_force() is a hack to bypass IORES_MAP_SYSTEM_RAM check in
> the x86 ioremap code. We need a better solution.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
>  arch/x86/include/asm/io.h              |  2 +
>  arch/x86/include/asm/pgtable_types.h   |  1 +
>  arch/x86/kvm/mmu/mmu.c                 |  6 ++-
>  arch/x86/mm/ioremap.c                  | 16 ++++++--
>  include/linux/kvm_host.h               |  3 +-
>  include/linux/kvm_types.h              |  1 +
>  virt/kvm/kvm_main.c                    | 52 +++++++++++++++++++-------
>  9 files changed, 63 insertions(+), 22 deletions(-)
> 

You declare ioremap_cache_force() in arch/x86/include/asm/io.h and
define it in arch/x86/mm/ioremap.c, which is architecture-specific code,
but use it in __kvm_map_gfn() in virt/kvm/kvm_main.c, which is common
code.

Thus your series breaks the build for the s390 architecture. Have you
tried to (cross) compile for s390?

Regards,
Halil



* Re: [RFCv2 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest
  2020-10-20  6:18 ` [RFCv2 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest Kirill A. Shutemov
  2020-10-20  8:06   ` Christoph Hellwig
@ 2020-10-22  3:31   ` Halil Pasic
  1 sibling, 0 replies; 57+ messages in thread
From: Halil Pasic @ 2020-10-22  3:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, Liran Alon,
	Mike Rapoport, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Tue, 20 Oct 2020 09:18:48 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> VirtIO for KVM is a primary way to provide IO. All memory that used for
> communication with the host has to be marked as shared.
> 
> The easiest way to archive that is to use DMA API that already knows how
> to deal with shared memory.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  drivers/virtio/virtio_ring.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index becc77697960..ace733845d5d 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -12,6 +12,7 @@
>  #include <linux/hrtimer.h>
>  #include <linux/dma-mapping.h>
>  #include <xen/xen.h>
> +#include <asm/kvm_para.h>
>  
>  #ifdef DEBUG
>  /* For development, we want to crash whenever the ring is screwed. */
> @@ -255,6 +256,9 @@ static bool vring_use_dma_api(struct virtio_device *vdev)
>  	if (xen_domain())
>  		return true;
>  
> +	if (kvm_mem_protected())
> +		return true;
> +

I guess it does not matter because of Christoph's comment, but this
breaks the build for s390, because there is no kvm_mem_protected() for
s390.

Regards,
Halil

>  	return false;
>  }
>  




* Re: [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-10-20 17:29   ` Ira Weiny
@ 2020-10-22 11:37     ` Kirill A. Shutemov
  0 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-22 11:37 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, Liran Alon,
	Mike Rapoport, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Tue, Oct 20, 2020 at 10:29:44AM -0700, Ira Weiny wrote:
> On Tue, Oct 20, 2020 at 09:18:51AM +0300, Kirill A. Shutemov wrote:
> > New helpers copy_from_guest()/copy_to_guest() to be used if KVM memory
> > protection feature is enabled.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/kvm_host.h |  4 ++
> >  virt/kvm/kvm_main.c      | 90 +++++++++++++++++++++++++++++++---------
> >  2 files changed, 75 insertions(+), 19 deletions(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 05e3c2fb3ef7..380a64613880 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -504,6 +504,7 @@ struct kvm {
> >  	struct srcu_struct irq_srcu;
> >  	pid_t userspace_pid;
> >  	unsigned int max_halt_poll_ns;
> > +	bool mem_protected;
> >  };
> >  
> >  #define kvm_err(fmt, ...) \
> > @@ -728,6 +729,9 @@ void kvm_set_pfn_dirty(kvm_pfn_t pfn);
> >  void kvm_set_pfn_accessed(kvm_pfn_t pfn);
> >  void kvm_get_pfn(kvm_pfn_t pfn);
> >  
> > +int copy_from_guest(void *data, unsigned long hva, int len, bool protected);
> > +int copy_to_guest(unsigned long hva, const void *data, int len, bool protected);
> > +
> >  void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache);
> >  int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
> >  			int len);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index cf88233b819a..a9884cb8c867 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2313,19 +2313,70 @@ static int next_segment(unsigned long len, int offset)
> >  		return len;
> >  }
> >  
> > +int copy_from_guest(void *data, unsigned long hva, int len, bool protected)
> > +{
> > +	int offset = offset_in_page(hva);
> > +	struct page *page;
> > +	int npages, seg;
> > +
> > +	if (!protected)
> > +		return __copy_from_user(data, (void __user *)hva, len);
> > +
> > +	might_fault();
> > +	kasan_check_write(data, len);
> > +	check_object_size(data, len, false);
> > +
> > +	while ((seg = next_segment(len, offset)) != 0) {
> > +		npages = get_user_pages_unlocked(hva, 1, &page, 0);
> > +		if (npages != 1)
> > +			return -EFAULT;
> > +		memcpy(data, page_address(page) + offset, seg);
> > +		put_page(page);
> > +		len -= seg;
> > +		hva += seg;
> > +		offset = 0;
> 
> Why is data not updated on each iteration?

Ouch. I guess no caller actually steps over a page boundary. Will fix.

-- 
 Kirill A. Shutemov



* Re: [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-10-20  8:25   ` John Hubbard
  2020-10-20 12:51     ` Kirill A. Shutemov
@ 2020-10-22 11:49     ` Matthew Wilcox
  2020-10-22 19:58       ` John Hubbard
  1 sibling, 1 reply; 57+ messages in thread
From: Matthew Wilcox @ 2020-10-22 11:49 UTC (permalink / raw)
  To: John Hubbard
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi,
	Liran Alon, Mike Rapoport, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Tue, Oct 20, 2020 at 01:25:59AM -0700, John Hubbard wrote:
> Should copy_to_guest() use pin_user_pages_unlocked() instead of gup_unlocked?
> We wrote a  "Case 5" in Documentation/core-api/pin_user_pages.rst, just for this
> situation, I think:
> 
> 
> CASE 5: Pinning in order to write to the data within the page
> -------------------------------------------------------------
> Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
> write to a page's data, unpin" can cause a problem. Case 5 may be considered a
> superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
> other words, if the code is neither Case 1 nor Case 2, it may still require
> FOLL_PIN, for patterns like this:
> 
> Correct (uses FOLL_PIN calls):
>     pin_user_pages()
>     write to the data within the pages
>     unpin_user_pages()

Case 5 is crap though.  That bug should have been fixed by getting
the locking right.  ie:

	get_user_pages_fast();
	lock_page();
	kmap();
	set_bit();
	kunmap();
	set_page_dirty()
	unlock_page();

I should have vetoed that patch at the time, but I was busy with other things.



* Re: [RFCv2 09/16] KVM: mm: Introduce VM_KVM_PROTECTED
  2020-10-21 18:47   ` Edgecombe, Rick P
@ 2020-10-22 12:01     ` Kirill A. Shutemov
  0 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-22 12:01 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: peterz, jmattson, Christopherson, Sean J, vkuznets, dave.hansen,
	joro, luto, pbonzini, wanpengli, kvm, linux-kernel,
	kirill.shutemov, keescook, rppt, linux-mm, x86, aarcange,
	liran.alon, wad, rientjes, Kleen, Andi

On Wed, Oct 21, 2020 at 06:47:32PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2020-10-20 at 09:18 +0300, Kirill A. Shutemov wrote:
> >  include/linux/mm.h  |  8 ++++++++
> >  mm/gup.c            | 20 ++++++++++++++++----
> >  mm/huge_memory.c    | 20 ++++++++++++++++----
> >  mm/memory.c         |  3 +++
> >  mm/mmap.c           |  3 +++
> >  virt/kvm/async_pf.c |  2 +-
> >  virt/kvm/kvm_main.c |  9 +++++----
> >  7 files changed, 52 insertions(+), 13 deletions(-)
> 
> There is another get_user_pages() in
> paging_tmpl.h:FNAME(cmpxchg_gpte). 

Right, I will look into it. I wonder why I didn't hit it.

> Also, did you leave off FOLL_KVM in gfn_to_page_many_atomic() on
> purpose? It looks like its only used for pte prefetch which I guess is
> not required.

FOLL_FAST_ONLY is going to fail anyway. The caller has to handle it
gracefully.

-- 
 Kirill A. Shutemov



* Re: [RFCv2 14/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
  2020-10-21 18:50   ` Edgecombe, Rick P
@ 2020-10-22 12:06     ` Kirill A. Shutemov
  2020-10-22 16:59       ` Edgecombe, Rick P
  0 siblings, 1 reply; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-22 12:06 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: peterz, jmattson, Christopherson, Sean J, vkuznets, dave.hansen,
	joro, luto, pbonzini, wanpengli, kvm, linux-kernel,
	kirill.shutemov, keescook, rppt, linux-mm, x86, aarcange,
	liran.alon, wad, rientjes, Kleen, Andi

On Wed, Oct 21, 2020 at 06:50:28PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2020-10-20 at 09:18 +0300, Kirill A. Shutemov wrote:
> > @@ -467,7 +477,7 @@ void iounmap(volatile void __iomem *addr)
> >  	p = find_vm_area((void __force *)addr);
> >  
> >  	if (!p) {
> > -		printk(KERN_ERR "iounmap: bad address %p\n", addr);
> > +		printk(KERN_ERR "iounmap: bad address %px\n", addr);
> 
> Unintentional?

Yep. Will fix.

> > @@ -2162,15 +2178,20 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
> >  			kvm_cache_gfn_to_pfn(slot, gfn, cache, gen);
> >  		}
> >  		pfn = cache->pfn;
> > +		protected = cache->protected;
> >  	} else {
> >  		if (atomic)
> >  			return -EAGAIN;
> > -		pfn = gfn_to_pfn_memslot(slot, gfn);
> > +		pfn = gfn_to_pfn_memslot_protected(slot, gfn, &protected);
> >  	}
> >  	if (is_error_noslot_pfn(pfn))
> >  		return -EINVAL;
> >  
> > -	if (pfn_valid(pfn)) {
> > +	if (protected) {
> > +		if (atomic)
> > +			return -EAGAIN;
> > +		hva = ioremap_cache_force(pfn_to_hpa(pfn), PAGE_SIZE);
> > +	} else if (pfn_valid(pfn)) {
> >  		page = pfn_to_page(pfn);
> >  		if (atomic)
> >  			hva = kmap_atomic(page);
> 
> I think the page could have got unmapped since the gup via the
> hypercall on another CPU. It could be an avenue for the guest to crash
> the host.

Hm.. I'm not sure I follow. Could you elaborate on what scenario you have
in mind?

-- 
 Kirill A. Shutemov



* Re: [RFCv2 14/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
  2020-10-22  3:26   ` Halil Pasic
@ 2020-10-22 12:07     ` Kirill A. Shutemov
  0 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-22 12:07 UTC (permalink / raw)
  To: Halil Pasic
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, Liran Alon,
	Mike Rapoport, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Thu, Oct 22, 2020 at 05:26:47AM +0200, Halil Pasic wrote:
> On Tue, 20 Oct 2020 09:18:57 +0300
> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> 
> > We cannot access protected pages directly. Use ioremap() to
> > create a temporary mapping of the page. The mapping is destroyed
> > on __kvm_unmap_gfn().
> > 
> > The new interface gfn_to_pfn_memslot_protected() is used to detect if
> > the page is protected.
> > 
> > ioremap_cache_force() is a hack to bypass IORES_MAP_SYSTEM_RAM check in
> > the x86 ioremap code. We need a better solution.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 +-
> >  arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
> >  arch/x86/include/asm/io.h              |  2 +
> >  arch/x86/include/asm/pgtable_types.h   |  1 +
> >  arch/x86/kvm/mmu/mmu.c                 |  6 ++-
> >  arch/x86/mm/ioremap.c                  | 16 ++++++--
> >  include/linux/kvm_host.h               |  3 +-
> >  include/linux/kvm_types.h              |  1 +
> >  virt/kvm/kvm_main.c                    | 52 +++++++++++++++++++-------
> >  9 files changed, 63 insertions(+), 22 deletions(-)
> > 
> 
> You declare ioremap_cache_force() arch/x86/include/asm/io.h  in and
> define it in arch/x86/mm/ioremap.c which is architecture specific code,
> but use it in __kvm_map_gfn() in virt/kvm/kvm_main.c which is common
> code.
> 
> Thus your series breaks the build for the s390 architecture. Have you
> tried to (cross) compile for s390?

Obviously not. I've already got reports from 0day and am going to fix it.

Thanks.

-- 
 Kirill A. Shutemov



* Re: [RFCv2 14/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
  2020-10-22 12:06     ` Kirill A. Shutemov
@ 2020-10-22 16:59       ` Edgecombe, Rick P
  2020-10-23 10:36         ` Kirill A. Shutemov
  0 siblings, 1 reply; 57+ messages in thread
From: Edgecombe, Rick P @ 2020-10-22 16:59 UTC (permalink / raw)
  To: kirill
  Cc: kvm, linux-kernel, jmattson, peterz, kirill.shutemov,
	Christopherson, Sean J, vkuznets, linux-mm, dave.hansen, joro,
	x86, aarcange, keescook, luto, pbonzini, rientjes, liran.alon,
	rppt, wad, wanpengli, Kleen, Andi

On Thu, 2020-10-22 at 15:06 +0300, Kirill A. Shutemov wrote:
> > I think the page could have got unmapped since the gup via the
> > hypercall on another CPU. It could be an avenue for the guest to
> > crash
> > the host.
> 
> Hm.. I'm not sure I follow. Could you elaborate on what scenario you
> have
> in mind?

Kind of similar scenario as the userspace triggered oops. My
understanding is that the protected status was gathered along with the
gup, but after the mm gets unlocked, nothing stops the page
transitioning to unmapped(?). At which point kmap() from a previous gup
with !protected, would go down the regular kmap() route and return an
address to an unmapped page.

So the guest kernel could start with a page mapped as shared via the
hypercall. Then trigger one of the PV MSR's that kmap() on CPU0. On
CPU1, after the gup on CPU0, it could transition the page to
private/unmapped via the hypercall. So the hva_to_pfn() would find
!protected, but by the time the kmap() happened the page would have
been unmapped. Am I missing something?



* Re: [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-10-22 11:49     ` Matthew Wilcox
@ 2020-10-22 19:58       ` John Hubbard
  2020-10-26  4:21         ` Matthew Wilcox
  0 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2020-10-22 19:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi,
	Liran Alon, Mike Rapoport, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On 10/22/20 4:49 AM, Matthew Wilcox wrote:
> On Tue, Oct 20, 2020 at 01:25:59AM -0700, John Hubbard wrote:
>> Should copy_to_guest() use pin_user_pages_unlocked() instead of gup_unlocked?
>> We wrote a  "Case 5" in Documentation/core-api/pin_user_pages.rst, just for this
>> situation, I think:
>>
>>
>> CASE 5: Pinning in order to write to the data within the page
>> -------------------------------------------------------------
>> Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
>> write to a page's data, unpin" can cause a problem. Case 5 may be considered a
>> superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
>> other words, if the code is neither Case 1 nor Case 2, it may still require
>> FOLL_PIN, for patterns like this:
>>
>> Correct (uses FOLL_PIN calls):
>>      pin_user_pages()
>>      write to the data within the pages
>>      unpin_user_pages()
> 
> Case 5 is crap though.  That bug should have been fixed by getting
> the locking right.  ie:
> 
> 	get_user_pages_fast();
> 	lock_page();
> 	kmap();
> 	set_bit();
> 	kunmap();
> 	set_page_dirty()
> 	unlock_page();
> 
> I should have vetoed that patch at the time, but I was busy with other things.
> 

It does seem like lock_page() is better, for now at least, because it
forces the kind of synchronization with file system writeback that is
yet to be implemented for pin_user_pages().

Long term though, Case 5 provides an alternative way to do this
pattern--without using lock_page(). Also, note that Case 5, *in
general*, need not be done page-at-a-time, unlike the lock_page()
approach. Therefore, Case 5 might potentially help at some call sites,
either for deadlock avoidance or performance improvements.

In other words, once the other half of the pin_user_pages() plan is
implemented, either of these approaches should work.

Or, are you thinking that there is never a situation in which Case 5 is
valid?


thanks,
-- 
John Hubbard
NVIDIA



* Re: [RFCv2 14/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
  2020-10-22 16:59       ` Edgecombe, Rick P
@ 2020-10-23 10:36         ` Kirill A. Shutemov
  0 siblings, 0 replies; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-23 10:36 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm, linux-kernel, jmattson, peterz, kirill.shutemov,
	Christopherson, Sean J, vkuznets, linux-mm, dave.hansen, joro,
	x86, aarcange, keescook, luto, pbonzini, rientjes, liran.alon,
	rppt, wad, wanpengli, Kleen, Andi

On Thu, Oct 22, 2020 at 04:59:49PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2020-10-22 at 15:06 +0300, Kirill A. Shutemov wrote:
> > > I think the page could have got unmapped since the gup via the
> > > hypercall on another CPU. It could be an avenue for the guest to
> > > crash
> > > the host.
> > 
> > Hm.. I'm not sure I follow. Could you elaborate on what scenario you
> > have
> > in mind?
> 
> Kind of similar scenario as the userspace triggered oops. My
> understanding is that the protected status was gathered along with the
> gup, but after the mm gets unlocked, nothing stops the page
> transitioning to unmapped(?). At which point kmap() from a previous gup
> with !protected, would go down the regular kmap() route and return an
> address to an unmapped page.
> 
> So the guest kernel could start with a page mapped as shared via the
> hypercall. Then trigger one of the PV MSR's that kmap() on CPU0. On
> CPU1, after the gup on CPU0, it could transitioned the page to
> private/unmapped via the hypercall. So the hva_to_pfn() would find
> !protected, but by the time the kmap() happened the page would have
> been unmapped. Am I missing something?

We need to fail protection enabling if a page is pinned. That's the only
option I see. But it might be a pain to debug.

-- 
 Kirill A. Shutemov



* Re: [RFCv2 00/16] KVM protected memory extension
  2020-10-21 14:46     ` Vitaly Kuznetsov
@ 2020-10-23 11:35       ` Kirill A. Shutemov
  2020-10-23 12:01         ` Vitaly Kuznetsov
  0 siblings, 1 reply; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-23 11:35 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Wanpeng Li, Jim Mattson, Joerg Roedel

On Wed, Oct 21, 2020 at 04:46:48PM +0200, Vitaly Kuznetsov wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > On Tue, Oct 20, 2020 at 09:46:11AM +0200, Vitaly Kuznetsov wrote:
> >> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> >> 
> >> > == Background / Problem ==
> >> >
> >> > There are a number of hardware features (MKTME, SEV) which protect guest
> >> > memory from some unauthorized host access. The patchset proposes a purely
> >> > software feature that mitigates some of the same host-side read-only
> >> > attacks.
> >> >
> >> >
> >> > == What does this set mitigate? ==
> >> >
> >> >  - Host kernel ”accidental” access to guest data (think speculation)
> >> >
> >> >  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> >> >
> >> >  - Host userspace access to guest data (compromised qemu)
> >> >
> >> >  - Guest privilege escalation via compromised QEMU device emulation
> >> >
> >> > == What does this set NOT mitigate? ==
> >> >
> >> >  - Full host kernel compromise.  Kernel will just map the pages again.
> >> >
> >> >  - Hardware attacks
> >> >
> >> >
> >> > The second RFC revision addresses /most/ of the feedback.
> >> >
> >> > I still didn't found a good solution to reboot and kexec. Unprotect all
> >> > the memory on such operations defeat the goal of the feature. Clearing up
> >> > most of the memory before unprotecting what is required for reboot (or
> >> > kexec) is tedious and error-prone.
> >> > Maybe we should just declare them unsupported?
> >> 
> >> Making reboot unsupported is a hard sell. Could you please elaborate on
> >> why you think that "unprotect all" hypercall (or rather a single
> >> hypercall supporting both protecting/unprotecting) defeats the purpose
> >> of the feature?
> >
> > If guest has some data that it prefers not to leak to the host and use the
> > feature for the purpose, share all the memory to get through reboot is a
> > very weak point.
> >
> 
> My point that if it knows that there's something sensitive in its
> memory it should clean it up even today without your feature before
> rebooting to an unknown target.

It's unrealistic to expect everybody to do the right thing.

> >> clean up *all* its memory upon reboot, however:
> >> - It may only clean up the most sensitive parts. This should probably be
> >> done even without this new feature and even on bare metal (think about
> >> next boot target being malicious).
> >> - The attack window shrinks significantly. "Speculative" bugs require
> >> time to exploit and it will only remain open until it boots up again
> >> (few seconds).
> >
> > Maybe it would be cleaner to handle reboot in userspace? If we got the VM
> > rebooted, just reconstruct it from scratch as if it would be new boot.
> 
> We are definitely not trying to protect against malicious KVM so maybe
> we can do the cleanup there (when protection was enabled) so we can
> unprotect everything without risk of a leak?

Do you have any particular codepath in mind? I didn't find anything
suitable so far.

-- 
 Kirill A. Shutemov



* Re: [RFCv2 00/16] KVM protected memory extension
  2020-10-23 11:35       ` Kirill A. Shutemov
@ 2020-10-23 12:01         ` Vitaly Kuznetsov
  0 siblings, 0 replies; 57+ messages in thread
From: Vitaly Kuznetsov @ 2020-10-23 12:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Wanpeng Li, Jim Mattson, Joerg Roedel

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Wed, Oct 21, 2020 at 04:46:48PM +0200, Vitaly Kuznetsov wrote:
>> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>> 
>> > Maybe it would be cleaner to handle reboot in userspace? If we got the VM
>> > rebooted, just reconstruct it from scratch as if it would be new boot.
>> 
>> We are definitely not trying to protect against malicious KVM so maybe
>> we can do the cleanup there (when protection was enabled) so we can
>> unprotect everything without risk of a leak?
>
> Do you have any particular codepath in mind? I didn't find anything
> suitable so far.

I didn't put much thought into it, but e.g. on x86, what if we put this
in kvm_vcpu_reset() under an 'if (kvm_vcpu_is_bsp())' condition?

The main problem I see is that we can't clean up *all* memory;
e.g. firmware-related stuff should stay intact, and this contradicts your
KVM_HC_ENABLE_MEM_PROTECTED, which protects everything. We could probably
get rid of it, leaving only KVM_HC_MEM_SHARE/KVM_HC_MEM_UNSHARE, and shift
responsibility for defining what can be cleaned up to the guest kernel
(stating in the doc that all protected memory will get wiped out on
reboot).

-- 
Vitaly




* Re: [RFCv2 15/16] KVM: Unmap protected pages from direct mapping
  2020-10-20  6:18 ` [RFCv2 15/16] KVM: Unmap protected pages from direct mapping Kirill A. Shutemov
                     ` (2 preceding siblings ...)
  2020-10-21 18:49   ` Edgecombe, Rick P
@ 2020-10-23 12:37   ` Mike Rapoport
  2020-10-23 16:32     ` Sean Christopherson
  3 siblings, 1 reply; 57+ messages in thread
From: Mike Rapoport @ 2020-10-23 12:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, Liran Alon, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On Tue, Oct 20, 2020 at 09:18:58AM +0300, Kirill A. Shutemov wrote:
> If the protected memory feature enabled, unmap guest memory from
> kernel's direct mappings.
> 
> Migration and KSM is disabled for protected memory as it would require a
> special treatment.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/mm.h       |  3 +++
>  mm/huge_memory.c         |  8 ++++++++
>  mm/ksm.c                 |  2 ++
>  mm/memory.c              | 12 ++++++++++++
>  mm/rmap.c                |  4 ++++
>  virt/lib/mem_protected.c | 21 +++++++++++++++++++++
>  6 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ee274d27e764..74efc51e63f0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -671,6 +671,9 @@ static inline bool vma_is_kvm_protected(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_KVM_PROTECTED;
>  }
>  
> +void kvm_map_page(struct page *page, int nr_pages);
> +void kvm_unmap_page(struct page *page, int nr_pages);

This still does not seem right ;-)

And I still think that map/unmap primitives should be a part of the
generic mm rather than exported by KVM.

> +
>  #ifdef CONFIG_SHMEM
>  /*
>   * The vma_is_shmem is not inline because it is used only by slow
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ec8cf9a40cfd..40974656cb43 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -627,6 +627,10 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>  		spin_unlock(vmf->ptl);
>  		count_vm_event(THP_FAULT_ALLOC);
>  		count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
> +
> +		/* Unmap page from direct mapping */
> +		if (vma_is_kvm_protected(vma))
> +			kvm_unmap_page(page, HPAGE_PMD_NR);
>  	}
>  
>  	return 0;
> @@ -1689,6 +1693,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			page_remove_rmap(page, true);
>  			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>  			VM_BUG_ON_PAGE(!PageHead(page), page);
> +
> +			/* Map the page back to the direct mapping */
> +			if (vma_is_kvm_protected(vma))
> +				kvm_map_page(page, HPAGE_PMD_NR);
>  		} else if (thp_migration_supported()) {
>  			swp_entry_t entry;
>  
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 9afccc36dbd2..c720e271448f 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -528,6 +528,8 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
>  		return NULL;
>  	if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
>  		return NULL;
> +	if (vma_is_kvm_protected(vma))
> +		return NULL;
>  	return vma;
>  }
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 2c9756b4e52f..e28bd5f902a7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1245,6 +1245,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  				    likely(!(vma->vm_flags & VM_SEQ_READ)))
>  					mark_page_accessed(page);
>  			}
> +
> +			/* Map the page back to the direct mapping */
> +			if (vma_is_anonymous(vma) && vma_is_kvm_protected(vma))
> +				kvm_map_page(page, 1);
> +
>  			rss[mm_counter(page)]--;
>  			page_remove_rmap(page, false);
>  			if (unlikely(page_mapcount(page) < 0))
> @@ -3466,6 +3471,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	struct page *page;
>  	vm_fault_t ret = 0;
>  	pte_t entry;
> +	bool set = false;
>  
>  	/* File mapping without ->vm_ops ? */
>  	if (vma->vm_flags & VM_SHARED)
> @@ -3554,6 +3560,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  	page_add_new_anon_rmap(page, vma, vmf->address, false);
>  	lru_cache_add_inactive_or_unevictable(page, vma);
> +	set = true;
>  setpte:
>  	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>  
> @@ -3561,6 +3568,11 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	update_mmu_cache(vma, vmf->address, vmf->pte);
>  unlock:
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> +
> +	/* Unmap page from direct mapping */
> +	if (vma_is_kvm_protected(vma) && set)
> +		kvm_unmap_page(page, 1);
> +
>  	return ret;
>  release:
>  	put_page(page);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9425260774a1..247548d6d24b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1725,6 +1725,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  
>  static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg)
>  {
> +	/* TODO */
> +	if (vma_is_kvm_protected(vma))
> +		return true;
> +
>  	return vma_is_temporary_stack(vma);
>  }
>  
> diff --git a/virt/lib/mem_protected.c b/virt/lib/mem_protected.c
> index 1dfe82534242..9d2ef99285e5 100644
> --- a/virt/lib/mem_protected.c
> +++ b/virt/lib/mem_protected.c
> @@ -30,6 +30,27 @@ void kvm_unmap_page_atomic(void *vaddr)
>  }
>  EXPORT_SYMBOL_GPL(kvm_unmap_page_atomic);
>  
> +void kvm_map_page(struct page *page, int nr_pages)
> +{
> +	int i;
> +
> +	/* Clear page before returning it to the direct mapping */
> +	for (i = 0; i < nr_pages; i++) {
> +		void *p = kvm_map_page_atomic(page + i);
> +		memset(p, 0, PAGE_SIZE);
> +		kvm_unmap_page_atomic(p);
> +	}
> +
> +	kernel_map_pages(page, nr_pages, 1);
> +}
> +EXPORT_SYMBOL_GPL(kvm_map_page);
> +
> +void kvm_unmap_page(struct page *page, int nr_pages)
> +{
> +	kernel_map_pages(page, nr_pages, 0);
> +}
> +EXPORT_SYMBOL_GPL(kvm_unmap_page);
> +
>  int kvm_init_protected_memory(void)
>  {
>  	guest_map_ptes = kmalloc_array(num_possible_cpus(),
> -- 
> 2.26.2
> 
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFCv2 15/16] KVM: Unmap protected pages from direct mapping
  2020-10-23 12:37   ` Mike Rapoport
@ 2020-10-23 16:32     ` Sean Christopherson
  0 siblings, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2020-10-23 16:32 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, Liran Alon, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On Fri, Oct 23, 2020 at 03:37:12PM +0300, Mike Rapoport wrote:
> On Tue, Oct 20, 2020 at 09:18:58AM +0300, Kirill A. Shutemov wrote:
> > If the protected memory feature enabled, unmap guest memory from
> > kernel's direct mappings.
> > 
> > Migration and KSM is disabled for protected memory as it would require a
> > special treatment.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/mm.h       |  3 +++
> >  mm/huge_memory.c         |  8 ++++++++
> >  mm/ksm.c                 |  2 ++
> >  mm/memory.c              | 12 ++++++++++++
> >  mm/rmap.c                |  4 ++++
> >  virt/lib/mem_protected.c | 21 +++++++++++++++++++++
> >  6 files changed, 50 insertions(+)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ee274d27e764..74efc51e63f0 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -671,6 +671,9 @@ static inline bool vma_is_kvm_protected(struct vm_area_struct *vma)
> >  	return vma->vm_flags & VM_KVM_PROTECTED;
> >  }
> >  
> > +void kvm_map_page(struct page *page, int nr_pages);
> > +void kvm_unmap_page(struct page *page, int nr_pages);
> 
> This still does not seem right ;-)
> 
> And I still think that map/unmap primitives shoud be a part of the
> generic mm rather than exported by KVM.

Ya, and going a step further, I suspect it will be cleaner in the long run if
the kernel does not automatically map or unmap when converting between private
and shared/public memory.  Conversions will be rare in a well-behaved guest, so
exiting to userspace and forcing userspace to do the unmap->map would not be a
performance bottleneck.  In theory, userspace could also maintain separate
pools for private vs. public mappings, though I doubt any VMM will do that in
practice.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-10-22 19:58       ` John Hubbard
@ 2020-10-26  4:21         ` Matthew Wilcox
  2020-10-26  4:44           ` John Hubbard
  0 siblings, 1 reply; 57+ messages in thread
From: Matthew Wilcox @ 2020-10-26  4:21 UTC (permalink / raw)
  To: John Hubbard
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi,
	Liran Alon, Mike Rapoport, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Thu, Oct 22, 2020 at 12:58:14PM -0700, John Hubbard wrote:
> On 10/22/20 4:49 AM, Matthew Wilcox wrote:
> > On Tue, Oct 20, 2020 at 01:25:59AM -0700, John Hubbard wrote:
> > > Should copy_to_guest() use pin_user_pages_unlocked() instead of gup_unlocked?
> > > We wrote a  "Case 5" in Documentation/core-api/pin_user_pages.rst, just for this
> > > situation, I think:
> > > 
> > > 
> > > CASE 5: Pinning in order to write to the data within the page
> > > -------------------------------------------------------------
> > > Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
> > > write to a page's data, unpin" can cause a problem. Case 5 may be considered a
> > > superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
> > > other words, if the code is neither Case 1 nor Case 2, it may still require
> > > FOLL_PIN, for patterns like this:
> > > 
> > > Correct (uses FOLL_PIN calls):
> > >      pin_user_pages()
> > >      write to the data within the pages
> > >      unpin_user_pages()
> > 
> > Case 5 is crap though.  That bug should have been fixed by getting
> > the locking right.  ie:
> > 
> > 	get_user_pages_fast();
> > 	lock_page();
> > 	kmap();
> > 	set_bit();
> > 	kunmap();
> > 	set_page_dirty()
> > 	unlock_page();
> > 
> > I should have vetoed that patch at the time, but I was busy with other things.
> 
> It does seem like lock_page() is better, for now at least, because it
> forces the kind of synchronization with file system writeback that is
> still yet to be implemented for pin_user_pages().
> 
> Long term though, Case 5 provides an alternative way to do this
> pattern--without using lock_page(). Also, note that Case 5, *in
> general*, need not be done page-at-a-time, unlike the lock_page()
> approach. Therefore, Case 5 might potentially help at some call sites,
> either for deadlock avoidance or performance improvements.
> 
> In other words, once the other half of the pin_user_pages() plan is
> implemented, either of these approaches should work.
> 
> Or, are you thinking that there is never a situation in which Case 5 is
> valid?

I don't think the page pinning approach is ever valid.  For file
mappings, the infiniband registration should take a lease on the inode,
blocking truncation.  DAX shouldn't be using struct pages at all, so
there shouldn't be anything there to pin.

It's been five years since DAX was merged, and page pinning still
doesn't work.  How much longer before the people who are pushing it
realise that it's fundamentally flawed?


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-10-26  4:21         ` Matthew Wilcox
@ 2020-10-26  4:44           ` John Hubbard
  2020-10-26 13:28             ` Matthew Wilcox
  0 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2020-10-26  4:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi,
	Liran Alon, Mike Rapoport, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On 10/25/20 9:21 PM, Matthew Wilcox wrote:
> On Thu, Oct 22, 2020 at 12:58:14PM -0700, John Hubbard wrote:
>> On 10/22/20 4:49 AM, Matthew Wilcox wrote:
>>> On Tue, Oct 20, 2020 at 01:25:59AM -0700, John Hubbard wrote:
>>>> Should copy_to_guest() use pin_user_pages_unlocked() instead of gup_unlocked?
>>>> We wrote a  "Case 5" in Documentation/core-api/pin_user_pages.rst, just for this
>>>> situation, I think:
>>>>
>>>>
>>>> CASE 5: Pinning in order to write to the data within the page
>>>> -------------------------------------------------------------
>>>> Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
>>>> write to a page's data, unpin" can cause a problem. Case 5 may be considered a
>>>> superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
>>>> other words, if the code is neither Case 1 nor Case 2, it may still require
>>>> FOLL_PIN, for patterns like this:
>>>>
>>>> Correct (uses FOLL_PIN calls):
>>>>       pin_user_pages()
>>>>       write to the data within the pages
>>>>       unpin_user_pages()
>>>
>>> Case 5 is crap though.  That bug should have been fixed by getting
>>> the locking right.  ie:
>>>
>>> 	get_user_pages_fast();
>>> 	lock_page();
>>> 	kmap();
>>> 	set_bit();
>>> 	kunmap();
>>> 	set_page_dirty()
>>> 	unlock_page();
>>>
>>> I should have vetoed that patch at the time, but I was busy with other things.
>>
>> It does seem like lock_page() is better, for now at least, because it
>> forces the kind of synchronization with file system writeback that is
>> still yet to be implemented for pin_user_pages().
>>
>> Long term though, Case 5 provides an alternative way to do this
>> pattern--without using lock_page(). Also, note that Case 5, *in
>> general*, need not be done page-at-a-time, unlike the lock_page()
>> approach. Therefore, Case 5 might potentially help at some call sites,
>> either for deadlock avoidance or performance improvements.
>>
>> In other words, once the other half of the pin_user_pages() plan is
>> implemented, either of these approaches should work.
>>
>> Or, are you thinking that there is never a situation in which Case 5 is
>> valid?
> 
> I don't think the page pinning approach is ever valid.  For file

Could you qualify that? Surely you don't mean that the entire pin_user_pages
story is a waste of time--I would have expected you to make more noise
earlier if you thought that, yes?

> mappings, the infiniband registration should take a lease on the inode,
> blocking truncation.  DAX shouldn't be using struct pages at all, so
> there shouldn't be anything there to pin.
> 
> It's been five years since DAX was merged, and page pinning still
> doesn't work.  How much longer before the people who are pushing it
> realise that it's fundamentally flawed?
> 

Is this a separate rant about *only* DAX, or is general RDMA in your sights
too? :)



thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-10-26  4:44           ` John Hubbard
@ 2020-10-26 13:28             ` Matthew Wilcox
  2020-10-26 14:16               ` Jason Gunthorpe
  2020-10-26 20:52               ` John Hubbard
  0 siblings, 2 replies; 57+ messages in thread
From: Matthew Wilcox @ 2020-10-26 13:28 UTC (permalink / raw)
  To: John Hubbard
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi,
	Liran Alon, Mike Rapoport, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Sun, Oct 25, 2020 at 09:44:07PM -0700, John Hubbard wrote:
> On 10/25/20 9:21 PM, Matthew Wilcox wrote:
> > I don't think the page pinning approach is ever valid.  For file
> 
> Could you qualify that? Surely you don't mean that the entire pin_user_pages
> story is a waste of time--I would have expected you to make more noise
> earlier if you thought that, yes?

I do think page pinning is the wrong approach for everything.  I did say
so at the time, and I continue to say so when the opportunity presents
itself.  But shouting about it constantly only annoys people, so I don't
generally bother.  I have other things to work on, and they're productive,
so I don't need to spend my time arguing.

> > It's been five years since DAX was merged, and page pinning still
> > doesn't work.  How much longer before the people who are pushing it
> > realise that it's fundamentally flawed?
> 
> Is this a separate rant about *only* DAX, or is general RDMA in your sights
> too? :)

This is a case where it's not RDMA's _fault_ that there's no good API
for it to do what it needs to do.  There's a lot of work needed to wean
Linux device drivers off their assumption that there's a struct page
for every byte of memory.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-10-26 13:28             ` Matthew Wilcox
@ 2020-10-26 14:16               ` Jason Gunthorpe
  2020-10-26 20:52               ` John Hubbard
  1 sibling, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2020-10-26 14:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: John Hubbard, Kirill A. Shutemov, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On Mon, Oct 26, 2020 at 01:28:30PM +0000, Matthew Wilcox wrote:

> > > It's been five years since DAX was merged, and page pinning still
> > > doesn't work.  How much longer before the people who are pushing it
> > > realise that it's fundamentally flawed?
> > 
> > Is this a separate rant about *only* DAX, or is general RDMA in your sights
> > too? :)
> 
> This is a case where it's not RDMA's _fault_ that there's no good API
> for it to do what it needs to do.  There's a lot of work needed to wean
> Linux device drivers off their assumption that there's a struct page
> for every byte of memory.

People who care seem to have just given up and are using RDMA ODP, so
I'm not optimistic this DAX issue will ever be solved. I've also
almost removed all the struct page references from this flow in RDMA,
so if there is some way that helps it is certainly doable.

Regardless of DAX the pinning indication is now being used during
fork() for some good reasons, and seems to make sense in other use
cases. It just doesn't seem like a way to solve the DAX issue.

More or less it seems to mean that pages pinned cannot be write
protected and more broadly the kernel should not change the PTEs for
those pages independently of the application. I.e. the more aggressive
COW on fork() caused data corruption regressions...

I wonder if the point here is that some page owners can't/won't
support DMA pinning and should just be blocked completely for them.

I'd much rather have write access pin_user_pages() just fail than oops
the kernel on ext4 owned VMAs, for instance.

Jason


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFCv2 00/16] KVM protected memory extension
  2020-10-21 18:20 ` Andy Lutomirski
@ 2020-10-26 15:29   ` Kirill A. Shutemov
  2020-10-26 23:58     ` Andy Lutomirski
  0 siblings, 1 reply; 57+ messages in thread
From: Kirill A. Shutemov @ 2020-10-26 15:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, Liran Alon,
	Mike Rapoport, X86 ML, kvm list, Linux-MM, LKML,
	Kirill A. Shutemov

On Wed, Oct 21, 2020 at 11:20:56AM -0700, Andy Lutomirski wrote:
> > On Oct 19, 2020, at 11:19 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > For removing the userspace mapping, use a trick similar to what NUMA
> > balancing does: convert memory that belongs to KVM memory slots to
> > PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> > the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> > The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> > VMA must be treated in a special way in the GUP and fault paths. The flag
> > allows GUP to return the page even though it is mapped with PROT_NONE, but
> > only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> > to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> > would result in -EFAULT.
> >
> 
> I definitely like the direction this patchset is going in, and I think
> that allowing KVM guests to have memory that is inaccessible to QEMU
> is a great idea.
> 
> I do wonder, though: do we really want to do this with these PROT_NONE
> tricks, or should we actually come up with a way to have KVM guest map
> memory that isn't mapped into QEMU's mm_struct at all?  As an example
> of the latter, I mean something a bit like this:
> 
> https://lkml.kernel.org/r/CALCETrUSUp_7svg8EHNTk3nQ0x9sdzMCU=h8G-Sy6=SODq5GHg@mail.gmail.com
> 
> I don't mean to say that this is a requirement of any kind of
> protected memory like this, but I do think we should understand the
> tradeoffs, in terms of what a full implementation looks like, the
> effort and time frames involved, and the maintenance burden of
> supporting whatever gets merged going forward.

I considered the PROT_NONE trick neat. Completely removing the mapping
from QEMU would require more changes to KVM, and I'm not really familiar
with it.

About tradeoffs: the trick interferes with AutoNUMA. I didn't put much
thought into how we can get them to work together. Need to look into it.

Do you see other tradeoffs?

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFCv2 15/16] KVM: Unmap protected pages from direct mapping
  2020-10-20 12:18   ` David Hildenbrand
  2020-10-20 13:20     ` David Hildenbrand
@ 2020-10-26 19:55     ` Tom Lendacky
  1 sibling, 0 replies; 57+ messages in thread
From: Tom Lendacky @ 2020-10-26 19:55 UTC (permalink / raw)
  To: David Hildenbrand, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Mike Rapoport
  Cc: David Rientjes, Andrea Arcangeli, Kees Cook, Will Drewry,
	Edgecombe, Rick P, Kleen, Andi, Liran Alon, Mike Rapoport, x86,
	kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On 10/20/20 7:18 AM, David Hildenbrand wrote:
> On 20.10.20 08:18, Kirill A. Shutemov wrote:
>> If the protected memory feature enabled, unmap guest memory from
>> kernel's direct mappings.
> 
> Gah, ugly. I guess this also defeats compaction, swapping, ... oh gosh.
> As if all of the encrypted VM implementations didn't bring us enough
> ugliness already (SEV extensions also don't support reboots, but can at
> least kexec() IIRC).

SEV does support reboot. SEV-ES using Qemu doesn't support reboot because
of the way Qemu resets the vCPU state. If Qemu could relaunch the guest
through the SEV APIs to reset the vCPU state, then a "reboot" would be
possible.

SEV does support kexec, SEV-ES does not at the moment.

Thanks,
Tom

> 
> Something similar is done with secretmem [1]. And people don't seem to
> like fragmenting the direct mapping (including me).
> 
> [1] https://lkml.kernel.org/r/20200924132904.1391-1-rppt@kernel.org
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory
  2020-10-26 13:28             ` Matthew Wilcox
  2020-10-26 14:16               ` Jason Gunthorpe
@ 2020-10-26 20:52               ` John Hubbard
  1 sibling, 0 replies; 57+ messages in thread
From: John Hubbard @ 2020-10-26 20:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Rientjes, Andrea Arcangeli,
	Kees Cook, Will Drewry, Edgecombe, Rick P, Kleen, Andi,
	Liran Alon, Mike Rapoport, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

On 10/26/20 6:28 AM, Matthew Wilcox wrote:
> On Sun, Oct 25, 2020 at 09:44:07PM -0700, John Hubbard wrote:
>> On 10/25/20 9:21 PM, Matthew Wilcox wrote:
>>> I don't think the page pinning approach is ever valid.  For file
>>
>> Could you qualify that? Surely you don't mean that the entire pin_user_pages
>> story is a waste of time--I would have expected you to make more noise
>> earlier if you thought that, yes?
> 
> I do think page pinning is the wrong approach for everything.  I did say


Not *everything*, just "pinning for DMA", right? Because I don't recall
any viable solutions for Direct IO that avoided gup/pup!

Also, back to Case 5: I *could* create a small patchset to change over
the very few Case 5 call sites to use "gup, lock_page(), write to
page...etc", instead of pup. And also, update pin_user_pages.rst to
recommend that approach in similar situations. After all, it's not
really a long-term DMA pin, which is really what pin_user_pages*() is
intended for.

Would that be something you'd like to see happen? It's certainly easy
enough to fix that up. And your retroactive NAK is sufficient motivation
to do so.


> so at the time, and I continue to say so when the opportunity presents
> itself.  But shouting about it constantly only annoys people, so I don't
> generally bother.  I have other things to work on, and they're productive,
> so I don't need to spend my time arguing.


Sure. As a practical matter, I've assumed that page pinning is not going
to go away any time soon, so I want it to work properly while it's here.
But if there is a viable way to eventually replace dma-pinning with
something better, then let's keep thinking about it. I'm glad to help in
that area.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFCv2 00/16] KVM protected memory extension
  2020-10-26 15:29   ` Kirill A. Shutemov
@ 2020-10-26 23:58     ` Andy Lutomirski
  0 siblings, 0 replies; 57+ messages in thread
From: Andy Lutomirski @ 2020-10-26 23:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
	Will Drewry, Edgecombe, Rick P, Kleen, Andi, Liran Alon,
	Mike Rapoport, X86 ML, kvm list, Linux-MM, LKML,
	Kirill A. Shutemov

On Mon, Oct 26, 2020 at 8:29 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Wed, Oct 21, 2020 at 11:20:56AM -0700, Andy Lutomirski wrote:
> > > On Oct 19, 2020, at 11:19 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > > For removing the userspace mapping, use a trick similar to what NUMA
> > > balancing does: convert memory that belongs to KVM memory slots to
> > > PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> > > the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> > > The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> > > VMA must be treated in a special way in the GUP and fault paths. The flag
> > > allows GUP to return the page even though it is mapped with PROT_NONE, but
> > > only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> > > to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> > > would result in -EFAULT.
> > >
> >
> > I definitely like the direction this patchset is going in, and I think
> > that allowing KVM guests to have memory that is inaccessible to QEMU
> > is a great idea.
> >
> > I do wonder, though: do we really want to do this with these PROT_NONE
> > tricks, or should we actually come up with a way to have KVM guest map
> > memory that isn't mapped into QEMU's mm_struct at all?  As an example
> > of the latter, I mean something a bit like this:
> >
> > https://lkml.kernel.org/r/CALCETrUSUp_7svg8EHNTk3nQ0x9sdzMCU=h8G-Sy6=SODq5GHg@mail.gmail.com
> >
> > I don't mean to say that this is a requirement of any kind of
> > protected memory like this, but I do think we should understand the
> > tradeoffs, in terms of what a full implementation looks like, the
> > effort and time frames involved, and the maintenance burden of
> > supporting whatever gets merged going forward.
>
> I considered the PROT_NONE trick neat. Completely removing the mapping
> from QEMU would require more changes to KVM, and I'm not really familiar
> with it.

I think it's neat.  The big tradeoff I'm concerned about is that it
will likely become ABI once it lands.  That is, if this series lands,
then we will always have to support the case in which QEMU has a
special non-present mapping that is nonetheless reflected as present
in a guest.  This is a bizarre state of affairs, it may become
obsolete if a better API ever shows up, and it might end up placing
constraints on the Linux VM that we don't love going forward.

I don't think my proposal in the referenced thread above is that crazy
or that difficult to implement.  The basic idea is to have a way to
create an mm_struct that is not loaded in CR3 anywhere.  Instead, KVM
will reference it, much as it currently references QEMU's mm_struct,
to mirror mappings into the guest.  This means it would be safe to
have "protected" memory mapped into the special mm_struct because
nothing other than KVM will ever reference the PTEs.  But I think that
someone who really understands the KVM memory mapping code should
chime in.

>
> About tradeoffs: the trick interferes with AutoNUMA. I didn't put much
> thought into how we can get it work together. Need to look into it.
>
> Do you see other tradeoffs?
>
> --
>  Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2020-10-26 23:58 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-20  6:18 [RFCv2 00/16] KVM protected memory extension Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 01/16] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 02/16] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 03/16] x86/kvm: Make DMA pages shared Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 04/16] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 05/16] x86/kvm: Make VirtIO use DMA API in KVM guest Kirill A. Shutemov
2020-10-20  8:06   ` Christoph Hellwig
2020-10-20 12:47     ` Kirill A. Shutemov
2020-10-22  3:31   ` Halil Pasic
2020-10-20  6:18 ` [RFCv2 06/16] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 07/16] x86/realmode: Share trampoline area if KVM memory protection enabled Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 08/16] KVM: Use GUP instead of copy_from/to_user() to access guest memory Kirill A. Shutemov
2020-10-20  8:25   ` John Hubbard
2020-10-20 12:51     ` Kirill A. Shutemov
2020-10-22 11:49     ` Matthew Wilcox
2020-10-22 19:58       ` John Hubbard
2020-10-26  4:21         ` Matthew Wilcox
2020-10-26  4:44           ` John Hubbard
2020-10-26 13:28             ` Matthew Wilcox
2020-10-26 14:16               ` Jason Gunthorpe
2020-10-26 20:52               ` John Hubbard
2020-10-20 17:29   ` Ira Weiny
2020-10-22 11:37     ` Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 09/16] KVM: mm: Introduce VM_KVM_PROTECTED Kirill A. Shutemov
2020-10-21 18:47   ` Edgecombe, Rick P
2020-10-22 12:01     ` Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 10/16] KVM: x86: Use GUP for page walk instead of __get_user() Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 11/16] KVM: Protected memory extension Kirill A. Shutemov
2020-10-20  7:17   ` Peter Zijlstra
2020-10-20 12:55     ` Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 12/16] KVM: x86: Enabled protected " Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 13/16] KVM: Rework copy_to/from_guest() to avoid direct mapping Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 14/16] KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn() Kirill A. Shutemov
2020-10-21 18:50   ` Edgecombe, Rick P
2020-10-22 12:06     ` Kirill A. Shutemov
2020-10-22 16:59       ` Edgecombe, Rick P
2020-10-23 10:36         ` Kirill A. Shutemov
2020-10-22  3:26   ` Halil Pasic
2020-10-22 12:07     ` Kirill A. Shutemov
2020-10-20  6:18 ` [RFCv2 15/16] KVM: Unmap protected pages from direct mapping Kirill A. Shutemov
2020-10-20  7:12   ` Peter Zijlstra
2020-10-20 12:18   ` David Hildenbrand
2020-10-20 13:20     ` David Hildenbrand
2020-10-21  1:20       ` Edgecombe, Rick P
2020-10-26 19:55     ` Tom Lendacky
2020-10-21 18:49   ` Edgecombe, Rick P
2020-10-23 12:37   ` Mike Rapoport
2020-10-23 16:32     ` Sean Christopherson
2020-10-20  6:18 ` [RFCv2 16/16] mm: Do not use zero page for VM_KVM_PROTECTED VMAs Kirill A. Shutemov
2020-10-20  7:46 ` [RFCv2 00/16] KVM protected memory extension Vitaly Kuznetsov
2020-10-20 13:49   ` Kirill A. Shutemov
2020-10-21 14:46     ` Vitaly Kuznetsov
2020-10-23 11:35       ` Kirill A. Shutemov
2020-10-23 12:01         ` Vitaly Kuznetsov
2020-10-21 18:20 ` Andy Lutomirski
2020-10-26 15:29   ` Kirill A. Shutemov
2020-10-26 23:58     ` Andy Lutomirski
