KVM Archive on lore.kernel.org
* [RFCv2 00/13] TDX and guest memory unmapping
@ 2021-04-16 15:40 Kirill A. Shutemov
  2021-04-16 15:40 ` [RFCv2 01/13] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
                   ` (13 more replies)
  0 siblings, 14 replies; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:40 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

TDX integrity check failures may lead to a system shutdown, so the host
kernel must not allow any writes to TD-private memory. This requirement
clashes with the KVM design: KVM expects the guest memory to be mapped
into host userspace (e.g. QEMU).

This patchset aims to start a discussion on how we can approach the issue.

The core of the change is in the last patch. Please see the more detailed
description of the issue and the proposed solution there.

TODO:
  - THP support
  - Clarify semantics wrt. page cache

v2:
  - Unpoison page on free
  - Authorize access to the page: only the KVM instance that poisoned the
    page is allowed to touch it
  - FOLL_ALLOW_POISONED is implemented

The patchset can also be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git kvm-unmapped-poison

Kirill A. Shutemov (13):
  x86/mm: Move force_dma_unencrypted() to common code
  x86/kvm: Introduce KVM memory protection feature
  x86/kvm: Make DMA pages shared
  x86/kvm: Use bounce buffers for KVM memory protection
  x86/kvmclock: Share hvclock memory with the host
  x86/realmode: Share trampoline area if KVM memory protection enabled
  mm: Add hwpoison_entry_to_pfn() and hwpoison_entry_to_page()
  mm/gup: Add FOLL_ALLOW_POISONED
  shmem: Fail shmem_getpage_gfp() on poisoned pages
  mm: Keep page reference for hwpoison entries
  mm: Replace hwpoison entry with present PTE if page got unpoisoned
  KVM: passdown struct kvm to hva_to_pfn_slow()
  KVM: unmap guest memory using poisoned pages

 arch/powerpc/kvm/book3s_64_mmu_hv.c    |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   2 +-
 arch/x86/Kconfig                       |   9 +-
 arch/x86/include/asm/cpufeatures.h     |   1 +
 arch/x86/include/asm/io.h              |   4 +-
 arch/x86/include/asm/kvm_para.h        |   5 +
 arch/x86/include/asm/mem_encrypt.h     |   7 +-
 arch/x86/include/uapi/asm/kvm_para.h   |   3 +-
 arch/x86/kernel/kvm.c                  |  19 +++
 arch/x86/kernel/kvmclock.c             |   2 +-
 arch/x86/kernel/pci-swiotlb.c          |   3 +-
 arch/x86/kvm/Kconfig                   |   1 +
 arch/x86/kvm/cpuid.c                   |   3 +-
 arch/x86/kvm/mmu/mmu.c                 |  20 ++-
 arch/x86/kvm/mmu/paging_tmpl.h         |  10 +-
 arch/x86/kvm/x86.c                     |   6 +
 arch/x86/mm/Makefile                   |   2 +
 arch/x86/mm/mem_encrypt.c              |  74 ---------
 arch/x86/mm/mem_encrypt_common.c       |  87 ++++++++++
 arch/x86/mm/pat/set_memory.c           |  10 ++
 arch/x86/realmode/init.c               |   7 +-
 include/linux/kvm_host.h               |  31 +++-
 include/linux/mm.h                     |   1 +
 include/linux/swapops.h                |  20 +++
 include/uapi/linux/kvm_para.h          |   5 +-
 mm/gup.c                               |  29 ++--
 mm/memory.c                            |  44 ++++-
 mm/page_alloc.c                        |   7 +
 mm/rmap.c                              |   2 +-
 mm/shmem.c                             |   7 +
 virt/Makefile                          |   2 +-
 virt/kvm/Kconfig                       |   4 +
 virt/kvm/Makefile                      |   1 +
 virt/kvm/kvm_main.c                    | 212 +++++++++++++++++++------
 virt/kvm/mem_protected.c               | 110 +++++++++++++
 35 files changed, 593 insertions(+), 159 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c
 create mode 100644 virt/kvm/Makefile
 create mode 100644 virt/kvm/mem_protected.c

-- 
2.26.3



* [RFCv2 01/13] x86/mm: Move force_dma_unencrypted() to common code
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
@ 2021-04-16 15:40 ` Kirill A. Shutemov
  2021-04-16 15:40 ` [RFCv2 02/13] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:40 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

force_dma_unencrypted() has to return true for a KVM guest with memory
protection enabled. Move it out of the AMD SME code.

Introduce a new config option, X86_MEM_ENCRYPT_COMMON, that has to be
selected by all x86 memory encryption features.

This is preparation for the following patches.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                 |  7 +++++-
 arch/x86/include/asm/io.h        |  4 +++-
 arch/x86/mm/Makefile             |  2 ++
 arch/x86/mm/mem_encrypt.c        | 30 -------------------------
 arch/x86/mm/mem_encrypt_common.c | 38 ++++++++++++++++++++++++++++++++
 5 files changed, 49 insertions(+), 32 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 21f851179ff0..2b4ce1722dbd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1520,14 +1520,19 @@ config X86_CPA_STATISTICS
 	  helps to determine the effectiveness of preserving large and huge
 	  page mappings when mapping protections are changed.
 
+config X86_MEM_ENCRYPT_COMMON
+	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select DYNAMIC_PHYSICAL_MASK
+	def_bool n
+
 config AMD_MEM_ENCRYPT
 	bool "AMD Secure Memory Encryption (SME) support"
 	depends on X86_64 && CPU_SUP_AMD
 	select DMA_COHERENT_POOL
-	select DYNAMIC_PHYSICAL_MASK
 	select ARCH_USE_MEMREMAP_PROT
 	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
 	select INSTRUCTION_DECODER
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d726459d08e5..6dc51b31cb0e 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -256,10 +256,12 @@ static inline void slow_down_io(void)
 
 #endif
 
-#ifdef CONFIG_AMD_MEM_ENCRYPT
 #include <linux/jump_label.h>
 
 extern struct static_key_false sev_enable_key;
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
 static inline bool sev_key_active(void)
 {
 	return static_branch_unlikely(&sev_enable_key);
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)	+= mem_encrypt_common.o
+
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index c3d5f0236f35..9ca477b9b8ba 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -15,10 +15,6 @@
 #include <linux/dma-direct.h>
 #include <linux/swiotlb.h>
 #include <linux/mem_encrypt.h>
-#include <linux/device.h>
-#include <linux/kernel.h>
-#include <linux/bitops.h>
-#include <linux/dma-mapping.h>
 
 #include <asm/tlbflush.h>
 #include <asm/fixmap.h>
@@ -390,32 +386,6 @@ bool noinstr sev_es_active(void)
 	return sev_status & MSR_AMD64_SEV_ES_ENABLED;
 }
 
-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
-{
-	/*
-	 * For SEV, all DMA must be to unencrypted addresses.
-	 */
-	if (sev_active())
-		return true;
-
-	/*
-	 * For SME, all DMA must be to unencrypted addresses if the
-	 * device does not support DMA to addresses that include the
-	 * encryption mask.
-	 */
-	if (sme_active()) {
-		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
-		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
-						dev->bus_dma_limit);
-
-		if (dma_dev_mask <= dma_enc_mask)
-			return true;
-	}
-
-	return false;
-}
-
 void __init mem_encrypt_free_decrypted_mem(void)
 {
 	unsigned long vaddr, vaddr_end, npages;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..dd791352f73f
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/mem_encrypt.h>
+#include <linux/dma-direct.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+	/*
+	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 */
+	if (sev_active())
+		return true;
+
+	/*
+	 * For SME, all DMA must be to unencrypted addresses if the
+	 * device does not support DMA to addresses that include the
+	 * encryption mask.
+	 */
+	if (sme_active()) {
+		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
+		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
+						dev->bus_dma_limit);
+
+		if (dma_dev_mask <= dma_enc_mask)
+			return true;
+	}
+
+	return false;
+}
-- 
2.26.3



* [RFCv2 02/13] x86/kvm: Introduce KVM memory protection feature
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
  2021-04-16 15:40 ` [RFCv2 01/13] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
@ 2021-04-16 15:40 ` Kirill A. Shutemov
  2021-04-16 16:10   ` Borislav Petkov
  2021-04-16 15:40 ` [RFCv2 03/13] x86/kvm: Make DMA pages shared Kirill A. Shutemov
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:40 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Provide basic helpers, a KVM_FEATURE bit, a CPUID flag and a hypercall.

The host side doesn't provide the feature yet, so it is dead code for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/kvm_para.h      |  5 +++++
 arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
 arch/x86/kernel/kvm.c                | 17 +++++++++++++++++
 include/uapi/linux/kvm_para.h        |  3 ++-
 5 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 84b887825f12..d8f3d2619913 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
 #define X86_FEATURE_SEV_ES		( 8*32+20) /* AMD Secure Encrypted Virtualization - Encrypted State */
 #define X86_FEATURE_VM_PAGE_FLUSH	( 8*32+21) /* "" VM Page Flush MSR is supported */
+#define X86_FEATURE_KVM_MEM_PROTECTED	( 8*32+22) /* "" KVM memory protection extension */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..74aea18f3130 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -11,11 +11,16 @@ extern void kvmclock_init(void);
 
 #ifdef CONFIG_KVM_GUEST
 bool kvm_check_and_clear_guest_paused(void);
+bool kvm_mem_protected(void);
 #else
 static inline bool kvm_check_and_clear_guest_paused(void)
 {
 	return false;
 }
+static inline bool kvm_mem_protected(void)
+{
+	return false;
+}
 #endif /* CONFIG_KVM_GUEST */
 
 #define KVM_HYPERCALL \
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 950afebfba88..8d32c41861c9 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -28,11 +28,12 @@
 #define KVM_FEATURE_PV_UNHALT		7
 #define KVM_FEATURE_PV_TLB_FLUSH	9
 #define KVM_FEATURE_ASYNC_PF_VMEXIT	10
-#define KVM_FEATURE_PV_SEND_IPI	11
+#define KVM_FEATURE_PV_SEND_IPI		11
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
 #define KVM_FEATURE_ASYNC_PF_INT	14
 #define KVM_FEATURE_MSI_EXT_DEST_ID	15
+#define KVM_FEATURE_MEM_PROTECTED	16
 
 #define KVM_HINTS_REALTIME      0
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5e78e01ca3b4..aed6034fcac1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -39,6 +39,13 @@
 #include <asm/ptrace.h>
 #include <asm/svm.h>
 
+static bool mem_protected;
+
+bool kvm_mem_protected(void)
+{
+	return mem_protected;
+}
+
 DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
 
 static int kvmapf = 1;
@@ -749,6 +756,16 @@ static void __init kvm_init_platform(void)
 {
 	kvmclock_init();
 	x86_platform.apic_post_init = kvm_apic_init;
+
+	if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
+		if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
+			panic("Failed to enable KVM memory protection");
+		}
+
+		pr_info("KVM memory protection enabled\n");
+		mem_protected = true;
+		setup_force_cpu_cap(X86_FEATURE_KVM_MEM_PROTECTED);
+	}
 }
 
 #if defined(CONFIG_AMD_MEM_ENCRYPT)
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 8b86609849b9..1a216f32e572 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -27,8 +27,9 @@
 #define KVM_HC_MIPS_EXIT_VM		7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
 #define KVM_HC_CLOCK_PAIRING		9
-#define KVM_HC_SEND_IPI		10
+#define KVM_HC_SEND_IPI			10
 #define KVM_HC_SCHED_YIELD		11
+#define KVM_HC_ENABLE_MEM_PROTECTED	12
 
 /*
  * hypercalls use architecture specific
-- 
2.26.3



* [RFCv2 03/13] x86/kvm: Make DMA pages shared
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
  2021-04-16 15:40 ` [RFCv2 01/13] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
  2021-04-16 15:40 ` [RFCv2 02/13] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
@ 2021-04-16 15:40 ` Kirill A. Shutemov
  2021-04-16 15:40 ` [RFCv2 04/13] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:40 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Make force_dma_unencrypted() return true for protected KVM guests to get
DMA pages mapped as shared.

__set_memory_enc_dec() now informs the host via hypercall if the state
of the page has changed from shared to private or back.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/mm/mem_encrypt_common.c |  5 +++--
 arch/x86/mm/pat/set_memory.c     | 10 ++++++++++
 include/uapi/linux/kvm_para.h    |  1 +
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2b4ce1722dbd..d197b3beb904 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -811,6 +811,7 @@ config KVM_GUEST
 	select PARAVIRT_CLOCK
 	select ARCH_CPUIDLE_HALTPOLL
 	select X86_HV_CALLBACK_VECTOR
+	select X86_MEM_ENCRYPT_COMMON
 	default y
 	help
 	  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index dd791352f73f..6bf0718bb72a 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,14 +10,15 @@
 #include <linux/mm.h>
 #include <linux/mem_encrypt.h>
 #include <linux/dma-direct.h>
+#include <asm/kvm_para.h>
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
 {
 	/*
-	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 * For SEV and KVM, all DMA must be to unencrypted/shared addresses.
 	 */
-	if (sev_active())
+	if (sev_active() || kvm_mem_protected())
 		return true;
 
 	/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..4b312d80033d 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,7 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/kvm_para.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -1977,6 +1978,15 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	struct cpa_data cpa;
 	int ret;
 
+	if (kvm_mem_protected()) {
+		/* TODO: Unsharing memory back */
+		if (WARN_ON_ONCE(enc))
+			return -ENOSYS;
+
+		return kvm_hypercall2(KVM_HC_MEM_SHARE,
+				      __pa(addr) >> PAGE_SHIFT, numpages);
+	}
+
 	/* Nothing to do if memory encryption is not active */
 	if (!mem_encrypt_active())
 		return 0;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 1a216f32e572..09d36683ee0a 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -30,6 +30,7 @@
 #define KVM_HC_SEND_IPI			10
 #define KVM_HC_SCHED_YIELD		11
 #define KVM_HC_ENABLE_MEM_PROTECTED	12
+#define KVM_HC_MEM_SHARE		13
 
 /*
  * hypercalls use architecture specific
-- 
2.26.3



* [RFCv2 04/13] x86/kvm: Use bounce buffers for KVM memory protection
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2021-04-16 15:40 ` [RFCv2 03/13] x86/kvm: Make DMA pages shared Kirill A. Shutemov
@ 2021-04-16 15:40 ` Kirill A. Shutemov
  2021-04-16 16:21   ` Dave Hansen
  2021-04-16 15:40 ` [RFCv2 05/13] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:40 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Mirror SEV: always use SWIOTLB if KVM memory protection is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/mem_encrypt.h |  7 +++--
 arch/x86/kernel/kvm.c              |  2 ++
 arch/x86/kernel/pci-swiotlb.c      |  3 +-
 arch/x86/mm/mem_encrypt.c          | 44 ---------------------------
 arch/x86/mm/mem_encrypt_common.c   | 48 ++++++++++++++++++++++++++++++
 6 files changed, 57 insertions(+), 48 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d197b3beb904..c51d14db5620 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -812,6 +812,7 @@ config KVM_GUEST
 	select ARCH_CPUIDLE_HALTPOLL
 	select X86_HV_CALLBACK_VECTOR
 	select X86_MEM_ENCRYPT_COMMON
+	select SWIOTLB
 	default y
 	help
 	  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index 31c4df123aa0..a748b30c2f23 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -47,10 +47,8 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size);
 
 void __init mem_encrypt_free_decrypted_mem(void);
 
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void);
-
 void __init sev_es_init_vc_handling(void);
+
 bool sme_active(void);
 bool sev_active(void);
 bool sev_es_active(void);
@@ -91,6 +89,9 @@ static inline void mem_encrypt_free_decrypted_mem(void) { }
 
 #endif	/* CONFIG_AMD_MEM_ENCRYPT */
 
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void);
+
 /*
  * The __sme_pa() and __sme_pa_nodebug() macros are meant for use when
  * writing to or comparing values from the cr3 register.  Having the
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index aed6034fcac1..ba179f5ca198 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -26,6 +26,7 @@
 #include <linux/kprobes.h>
 #include <linux/nmi.h>
 #include <linux/swait.h>
+#include <linux/swiotlb.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -765,6 +766,7 @@ static void __init kvm_init_platform(void)
 		pr_info("KVM memory protection enabled\n");
 		mem_protected = true;
 		setup_force_cpu_cap(X86_FEATURE_KVM_MEM_PROTECTED);
+		swiotlb_force = SWIOTLB_FORCE;
 	}
 }
 
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..814060a6ceb0 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -13,6 +13,7 @@
 #include <asm/dma.h>
 #include <asm/xen/swiotlb-xen.h>
 #include <asm/iommu_table.h>
+#include <asm/kvm_para.h>
 
 int swiotlb __read_mostly;
 
@@ -49,7 +50,7 @@ int __init pci_swiotlb_detect_4gb(void)
 	 * buffers are allocated and used for devices that do not support
 	 * the addressing range required for the encryption mask.
 	 */
-	if (sme_active())
+	if (sme_active() || kvm_mem_protected())
 		swiotlb = 1;
 
 	return swiotlb;
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 9ca477b9b8ba..3478f20fb46f 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -409,47 +409,3 @@ void __init mem_encrypt_free_decrypted_mem(void)
 
 	free_init_pages("unused decrypted", vaddr, vaddr_end);
 }
-
-static void print_mem_encrypt_feature_info(void)
-{
-	pr_info("AMD Memory Encryption Features active:");
-
-	/* Secure Memory Encryption */
-	if (sme_active()) {
-		/*
-		 * SME is mutually exclusive with any of the SEV
-		 * features below.
-		 */
-		pr_cont(" SME\n");
-		return;
-	}
-
-	/* Secure Encrypted Virtualization */
-	if (sev_active())
-		pr_cont(" SEV");
-
-	/* Encrypted Register State */
-	if (sev_es_active())
-		pr_cont(" SEV-ES");
-
-	pr_cont("\n");
-}
-
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
-{
-	if (!sme_me_mask)
-		return;
-
-	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-	swiotlb_update_mem_attributes();
-
-	/*
-	 * With SEV, we need to unroll the rep string I/O instructions.
-	 */
-	if (sev_active())
-		static_branch_enable(&sev_enable_key);
-
-	print_mem_encrypt_feature_info();
-}
-
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 6bf0718bb72a..351b77361a5d 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -11,6 +11,7 @@
 #include <linux/mem_encrypt.h>
 #include <linux/dma-direct.h>
 #include <asm/kvm_para.h>
+#include <asm/mem_encrypt.h>
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
@@ -37,3 +38,50 @@ bool force_dma_unencrypted(struct device *dev)
 
 	return false;
 }
+
+static void print_mem_encrypt_feature_info(void)
+{
+	if (kvm_mem_protected()) {
+		pr_info("KVM memory protection enabled\n");
+		return;
+	}
+
+	pr_info("AMD Memory Encryption Features active:");
+
+	/* Secure Memory Encryption */
+	if (sme_active()) {
+		/*
+		 * SME is mutually exclusive with any of the SEV
+		 * features below.
+		 */
+		pr_cont(" SME\n");
+		return;
+	}
+
+	/* Secure Encrypted Virtualization */
+	if (sev_active())
+		pr_cont(" SEV");
+
+	/* Encrypted Register State */
+	if (sev_es_active())
+		pr_cont(" SEV-ES");
+
+	pr_cont("\n");
+}
+
+void __init mem_encrypt_init(void)
+{
+	if (!sme_me_mask && !kvm_mem_protected())
+		return;
+
+	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
+	swiotlb_update_mem_attributes();
+
+	/*
+	 * With SEV, we need to unroll the rep string I/O instructions.
+	 */
+	if (sev_active())
+		static_branch_enable(&sev_enable_key);
+
+	print_mem_encrypt_feature_info();
+}
-- 
2.26.3



* [RFCv2 05/13] x86/kvmclock: Share hvclock memory with the host
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2021-04-16 15:40 ` [RFCv2 04/13] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
@ 2021-04-16 15:40 ` Kirill A. Shutemov
  2021-04-16 15:40 ` [RFCv2 06/13] x86/realmode: Share trampoline area if KVM memory protection enabled Kirill A. Shutemov
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:40 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

hvclock is shared between the guest and the hypervisor. It has to be
accessible by the host.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/kvmclock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index aa593743acf6..3d004b278dba 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -252,7 +252,7 @@ static void __init kvmclock_init_mem(void)
 	 * hvclock is shared between the guest and the hypervisor, must
 	 * be mapped decrypted.
 	 */
-	if (sev_active()) {
+	if (sev_active() || kvm_mem_protected()) {
 		r = set_memory_decrypted((unsigned long) hvclock_mem,
 					 1UL << order);
 		if (r) {
-- 
2.26.3



* [RFCv2 06/13] x86/realmode: Share trampoline area if KVM memory protection enabled
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2021-04-16 15:40 ` [RFCv2 05/13] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
@ 2021-04-16 15:40 ` Kirill A. Shutemov
  2021-04-19 16:49   ` Dave Hansen
  2021-04-16 15:41 ` [RFCv2 07/13] mm: Add hwpoison_entry_to_pfn() and hwpoison_entry_to_page() Kirill A. Shutemov
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:40 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

If KVM memory protection is active, the trampoline area will need to be
in shared memory.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/realmode/init.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 22fda7d99159..f3b54b5da693 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -10,6 +10,7 @@
 #include <asm/tlbflush.h>
 #include <asm/crash.h>
 #include <asm/sev-es.h>
+#include <asm/kvm_para.h>
 
 struct real_mode_header *real_mode_header;
 u32 *trampoline_cr4_features;
@@ -75,11 +76,11 @@ static void __init setup_real_mode(void)
 	base = (unsigned char *)real_mode_header;
 
 	/*
-	 * If SME is active, the trampoline area will need to be in
-	 * decrypted memory in order to bring up other processors
+	 * If SME or KVM memory protection is active, the trampoline area will
+	 * need to be in decrypted memory in order to bring up other processors
 	 * successfully. This is not needed for SEV.
 	 */
-	if (sme_active())
+	if (sme_active() || kvm_mem_protected())
 		set_memory_decrypted((unsigned long)base, size >> PAGE_SHIFT);
 
 	memcpy(base, real_mode_blob, size);
-- 
2.26.3



* [RFCv2 07/13] mm: Add hwpoison_entry_to_pfn() and hwpoison_entry_to_page()
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2021-04-16 15:40 ` [RFCv2 06/13] x86/realmode: Share trampoline area if KVM memory protection enabled Kirill A. Shutemov
@ 2021-04-16 15:41 ` Kirill A. Shutemov
  2021-04-16 15:41 ` [RFCv2 08/13] mm/gup: Add FOLL_ALLOW_POISONED Kirill A. Shutemov
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:41 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Add helpers to convert a hwpoison swap entry to a pfn and to a struct page.
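
The following patches use the new helpers when they find a non-present PTE
that carries a hwpoison entry, roughly like the hypothetical helper below
(illustration only, not part of this patch):

static struct page *poisoned_pte_page(pte_t pte)
{
	swp_entry_t entry;
	struct page *page;

	if (pte_present(pte) || pte_none(pte))
		return NULL;

	entry = pte_to_swp_entry(pte);
	if (!is_hwpoison_entry(entry))
		return NULL;

	/* Recover the page behind the poisoned mapping and take a reference */
	page = hwpoison_entry_to_page(entry);
	get_page(page);
	return page;
}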

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/swapops.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d9b7c9132c2f..520589b12fb3 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -323,6 +323,16 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
 	return swp_type(entry) == SWP_HWPOISON;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+	return swp_offset(entry);
+}
+
+static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
+{
+	return pfn_to_page(hwpoison_entry_to_pfn(entry));
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 	atomic_long_inc(&num_poisoned_pages);
@@ -345,6 +355,16 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
 	return 0;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+	return 0;
+}
+
+static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
+{
+	return NULL;
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 }
-- 
2.26.3



* [RFCv2 08/13] mm/gup: Add FOLL_ALLOW_POISONED
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2021-04-16 15:41 ` [RFCv2 07/13] mm: Add hwpoison_entry_to_pfn() and hwpoison_entry_to_page() Kirill A. Shutemov
@ 2021-04-16 15:41 ` Kirill A. Shutemov
  2021-04-16 15:41 ` [RFCv2 09/13] shmem: Fail shmem_getpage_gfp() on poisoned pages Kirill A. Shutemov
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:41 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

The new flag allows the caller to bypass the hwpoison check and get a
reference on the page even if it is poisoned.
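
A (hypothetical) caller that needs such a reference would pass the new flag
together with the usual GUP flags; a minimal sketch, where 'hva' stands for
the host virtual address of the page:

	struct page *page;
	long ret;

	/* Pin one page at hva, tolerating hwpoison */
	ret = get_user_pages(hva, 1, FOLL_WRITE | FOLL_ALLOW_POISONED,
			     &page, NULL);
	if (ret != 1)
		return -EFAULT;

	/* ... */
	put_page(page);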

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 29 ++++++++++++++++++-----------
 2 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..378a94481fd1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2802,6 +2802,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_ALLOW_POISONED 0x100000 /* bypass poison check */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index e4c224cd9661..dd3b79b03eb5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -384,22 +384,29 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
 	pte = *ptep;
 	if (!pte_present(pte)) {
-		swp_entry_t entry;
+		swp_entry_t entry = pte_to_swp_entry(pte);
+
+		if (pte_none(pte))
+			goto no_page;
+
 		/*
 		 * KSM's break_ksm() relies upon recognizing a ksm page
 		 * even while it is being migrated, so for that case we
 		 * need migration_entry_wait().
 		 */
-		if (likely(!(flags & FOLL_MIGRATION)))
-			goto no_page;
-		if (pte_none(pte))
-			goto no_page;
-		entry = pte_to_swp_entry(pte);
-		if (!is_migration_entry(entry))
-			goto no_page;
-		pte_unmap_unlock(ptep, ptl);
-		migration_entry_wait(mm, pmd, address);
-		goto retry;
+		if (is_migration_entry(entry) && (flags & FOLL_MIGRATION)) {
+			pte_unmap_unlock(ptep, ptl);
+			migration_entry_wait(mm, pmd, address);
+			goto retry;
+		}
+
+		if (is_hwpoison_entry(entry) && (flags & FOLL_ALLOW_POISONED)) {
+			page = hwpoison_entry_to_page(entry);
+			get_page(page);
+			goto out;
+		}
+
+		goto no_page;
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-- 
2.26.3



* [RFCv2 09/13] shmem: Fail shmem_getpage_gfp() on poisoned pages
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2021-04-16 15:41 ` [RFCv2 08/13] mm/gup: Add FOLL_ALLOW_POISONED Kirill A. Shutemov
@ 2021-04-16 15:41 ` Kirill A. Shutemov
  2021-04-16 15:41 ` [RFCv2 10/13] mm: Keep page reference for hwpoison entries Kirill A. Shutemov
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:41 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Forbid access to poisoned pages.

TODO: A more fine-grained approach is probably needed. It should be
allowed to fault these pages in as hwpoison entries.

Not-Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/shmem.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 7c6b6d8f6c39..d29a0c9be19c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1832,6 +1832,13 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 
 	if (page)
 		hindex = page->index;
+
+	if (page && PageHWPoison(page)) {
+		unlock_page(page);
+		put_page(page);
+		return -EIO;
+	}
+
 	if (page && sgp == SGP_WRITE)
 		mark_page_accessed(page);
 
-- 
2.26.3



* [RFCv2 10/13] mm: Keep page reference for hwpoison entries
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2021-04-16 15:41 ` [RFCv2 09/13] shmem: Fail shmem_getpage_gfp() on poisoned pages Kirill A. Shutemov
@ 2021-04-16 15:41 ` Kirill A. Shutemov
  2021-04-16 15:41 ` [RFCv2 11/13] mm: Replace hwpoison entry with present PTE if page got unpoisoned Kirill A. Shutemov
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:41 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On migration we drop the reference to a page once its PTE gets replaced
with a migration entry. The migration code keeps the last reference on the
page and drops it once migration is complete.

We do the same for hwpoison entries, but it doesn't work: nobody keeps the
reference once the page is remapped with hwpoison entries.

Keep the page referenced when setting up hwpoison entries, copy the
reference on fork(), and drop it on zap().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 6 ++++++
 mm/rmap.c   | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..b15b0c582186 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -767,6 +767,9 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pte = pte_swp_mkuffd_wp(pte);
 			set_pte_at(src_mm, addr, src_pte, pte);
 		}
+	} else if (is_hwpoison_entry(entry)) {
+		page = hwpoison_entry_to_page(entry);
+		get_page(page);
 	}
 	set_pte_at(dst_mm, addr, dst_pte, pte);
 	return 0;
@@ -1305,6 +1308,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			page = migration_entry_to_page(entry);
 			rss[mm_counter(page)]--;
+
+		} else if (is_hwpoison_entry(entry)) {
+			put_page(hwpoison_entry_to_page(entry));
 		}
 		if (unlikely(!free_swap_and_cache(entry)))
 			print_bad_pte(vma, addr, ptent, NULL);
diff --git a/mm/rmap.c b/mm/rmap.c
index 08c56aaf72eb..f08d1fc28522 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1575,7 +1575,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				dec_mm_counter(mm, mm_counter(page));
 				set_pte_at(mm, address, pvmw.pte, pteval);
 			}
-
+			get_page(page);
 		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
 			/*
 			 * The guest indicated that the page content is of no
-- 
2.26.3



* [RFCv2 11/13] mm: Replace hwpoison entry with present PTE if page got unpoisoned
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2021-04-16 15:41 ` [RFCv2 10/13] mm: Keep page reference for hwpoison entries Kirill A. Shutemov
@ 2021-04-16 15:41 ` Kirill A. Shutemov
  2021-04-16 15:41 ` [RFCv2 12/13] KVM: passdown struct kvm to hva_to_pfn_slow() Kirill A. Shutemov
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:41 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

If the page got unpoisoned, we can replace the hwpoison entry with a
present PTE on page fault instead of delivering SIGBUS.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index b15b0c582186..56f93e8e98f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3280,7 +3280,43 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			vmf->page = device_private_entry_to_page(entry);
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
 		} else if (is_hwpoison_entry(entry)) {
-			ret = VM_FAULT_HWPOISON;
+			page = hwpoison_entry_to_page(entry);
+
+			locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+			if (!locked) {
+				ret = VM_FAULT_RETRY;
+				goto out;
+			}
+
+			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+						       vmf->address, &vmf->ptl);
+
+			if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+				ret = 0;
+			} else if (PageHWPoison(page)) {
+				ret = VM_FAULT_HWPOISON;
+			} else {
+				/*
+				 * The page is unpoisoned. Replace hwpoison
+				 * entry with a present PTE.
+				 */
+
+				inc_mm_counter(vma->vm_mm, mm_counter(page));
+				pte = mk_pte(page, vma->vm_page_prot);
+
+				if (PageAnon(page)) {
+					page_add_anon_rmap(page, vma,
+							   vmf->address, false);
+				} else {
+					page_add_file_rmap(page, false);
+				}
+
+				set_pte_at(vma->vm_mm, vmf->address,
+					   vmf->pte, pte);
+			}
+
+			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			unlock_page(page);
 		} else {
 			print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
-- 
2.26.3



* [RFCv2 12/13] KVM: passdown struct kvm to hva_to_pfn_slow()
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2021-04-16 15:41 ` [RFCv2 11/13] mm: Replace hwpoison entry with present PTE if page got unpoisoned Kirill A. Shutemov
@ 2021-04-16 15:41 ` Kirill A. Shutemov
  2021-04-16 15:41 ` [RFCv2 13/13] KVM: unmap guest memory using poisoned pages Kirill A. Shutemov
  2021-04-16 16:46 ` [RFCv2 00/13] TDX and guest memory unmapping Matthew Wilcox
  13 siblings, 0 replies; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:41 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Make the struct kvm pointer available within hva_to_pfn_slow(). This is
preparation for the next patch.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
 arch/x86/kvm/mmu/mmu.c                 |  8 +++--
 include/linux/kvm_host.h               | 12 ++++---
 virt/kvm/kvm_main.c                    | 49 ++++++++++++++------------
 5 files changed, 41 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 38ea396a23d6..86781ff76fcb 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -589,7 +589,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
 		write_ok = true;
 	} else {
 		/* Call KVM generic code to do the slow-path check */
-		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+		pfn = __gfn_to_pfn_memslot(kvm, memslot, gfn, false, NULL,
 					   writing, &write_ok);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index bb35490400e9..319a1a99153f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -821,7 +821,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 		unsigned long pfn;
 
 		/* Call KVM generic code to do the slow-path check */
-		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+		pfn = __gfn_to_pfn_memslot(kvm, memslot, gfn, false, NULL,
 					   writing, upgrade_p);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d16481aa29d..3b97342af026 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2687,7 +2687,7 @@ static kvm_pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 	if (!slot)
 		return KVM_PFN_ERR_FAULT;
 
-	return gfn_to_pfn_memslot_atomic(slot, gfn);
+	return gfn_to_pfn_memslot_atomic(vcpu->kvm, slot, gfn);
 }
 
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
@@ -3672,7 +3672,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	}
 
 	async = false;
-	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable);
+	*pfn = __gfn_to_pfn_memslot(vcpu->kvm, slot, gfn, false,
+				    &async, write, writable);
 	if (!async)
 		return false; /* *pfn has correct page already */
 
@@ -3686,7 +3687,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 			return true;
 	}
 
-	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
+	*pfn = __gfn_to_pfn_memslot(vcpu->kvm, slot, gfn, false,
+				    NULL, write, writable);
 	return false;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f3b1013fb22c..fadaccb95a4c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -725,11 +725,13 @@ void kvm_set_page_accessed(struct page *page);
 kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable);
-kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
-kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
-kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
-			       bool atomic, bool *async, bool write_fault,
-			       bool *writable);
+kvm_pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
+			     struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm *kvm,
+				    struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t __gfn_to_pfn_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
+			       gfn_t gfn, bool atomic, bool *async,
+			       bool write_fault, bool *writable);
 
 void kvm_release_pfn_clean(kvm_pfn_t pfn);
 void kvm_release_pfn_dirty(kvm_pfn_t pfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8367d88ce39b..3e2dbaef7979 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2009,9 +2009,9 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 	return pfn;
 }
 
-kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
-			       bool atomic, bool *async, bool write_fault,
-			       bool *writable)
+kvm_pfn_t __gfn_to_pfn_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
+			       gfn_t gfn, bool atomic, bool *async,
+			       bool write_fault, bool *writable)
 {
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
@@ -2041,38 +2041,43 @@ EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
-	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
-				    write_fault, writable);
+	return __gfn_to_pfn_memslot(kvm, gfn_to_memslot(kvm, gfn), gfn, false,
+				    NULL, write_fault, writable);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
-kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
+			     struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL);
+	return __gfn_to_pfn_memslot(kvm, slot, gfn, false, NULL, true, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
-kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm *kvm,
+				    struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL);
+	return __gfn_to_pfn_memslot(kvm, slot, gfn, true, NULL, true, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
 kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	return gfn_to_pfn_memslot_atomic(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
+	return gfn_to_pfn_memslot_atomic(vcpu->kvm,
+					 kvm_vcpu_gfn_to_memslot(vcpu, gfn),
+					 gfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_atomic);
 
 kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 {
-	return gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn);
+	return gfn_to_pfn_memslot(kvm, gfn_to_memslot(kvm, gfn), gfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
 kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	return gfn_to_pfn_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
+	return gfn_to_pfn_memslot(vcpu->kvm,
+				  kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn);
 
@@ -2130,21 +2135,21 @@ void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache)
 		kvm_release_pfn_clean(pfn);
 }
 
-static void kvm_cache_gfn_to_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
+static void kvm_cache_gfn_to_pfn(struct kvm *kvm, 
+				 struct kvm_memory_slot *slot, gfn_t gfn,
 				 struct gfn_to_pfn_cache *cache, u64 gen)
 {
 	kvm_release_pfn(cache->pfn, cache->dirty, cache);
 
-	cache->pfn = gfn_to_pfn_memslot(slot, gfn);
+	cache->pfn = gfn_to_pfn_memslot(kvm, slot, gfn);
 	cache->gfn = gfn;
 	cache->dirty = false;
 	cache->generation = gen;
 }
 
-static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
-			 struct kvm_host_map *map,
-			 struct gfn_to_pfn_cache *cache,
-			 bool atomic)
+static int __kvm_map_gfn(struct kvm *kvm, struct kvm_memslots *slots,
+			 gfn_t gfn, struct kvm_host_map *map,
+			 struct gfn_to_pfn_cache *cache, bool atomic)
 {
 	kvm_pfn_t pfn;
 	void *hva = NULL;
@@ -2160,13 +2165,13 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
 			cache->generation != gen) {
 			if (atomic)
 				return -EAGAIN;
-			kvm_cache_gfn_to_pfn(slot, gfn, cache, gen);
+			kvm_cache_gfn_to_pfn(kvm, slot, gfn, cache, gen);
 		}
 		pfn = cache->pfn;
 	} else {
 		if (atomic)
 			return -EAGAIN;
-		pfn = gfn_to_pfn_memslot(slot, gfn);
+		pfn = gfn_to_pfn_memslot(kvm, slot, gfn);
 	}
 	if (is_error_noslot_pfn(pfn))
 		return -EINVAL;
@@ -2199,14 +2204,14 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
 int kvm_map_gfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map,
 		struct gfn_to_pfn_cache *cache, bool atomic)
 {
-	return __kvm_map_gfn(kvm_memslots(vcpu->kvm), gfn, map,
+	return __kvm_map_gfn(vcpu->kvm, kvm_memslots(vcpu->kvm), gfn, map,
 			cache, atomic);
 }
 EXPORT_SYMBOL_GPL(kvm_map_gfn);
 
 int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
 {
-	return __kvm_map_gfn(kvm_vcpu_memslots(vcpu), gfn, map,
+	return __kvm_map_gfn(vcpu->kvm, kvm_vcpu_memslots(vcpu), gfn, map,
 		NULL, false);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_map);
-- 
2.26.3



* [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2021-04-16 15:41 ` [RFCv2 12/13] KVM: passdown struct kvm to hva_to_pfn_slow() Kirill A. Shutemov
@ 2021-04-16 15:41 ` Kirill A. Shutemov
  2021-04-16 17:30   ` Sean Christopherson
  2021-04-16 16:46 ` [RFCv2 00/13] TDX and guest memory unmapping Matthew Wilcox
  13 siblings, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-16 15:41 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

TDX architecture aims to provide resiliency against confidentiality and
integrity attacks. Towards this goal, the TDX architecture helps enforce
the enabling of memory integrity for all TD-private memory.

The CPU memory controller computes the integrity check value (MAC) for
the data (cache line) during writes, and it stores the MAC with the
memory as meta-data. A 28-bit MAC is stored in the ECC bits.

Checking of memory integrity is performed during memory reads. If the
integrity check fails, the CPU poisons the cache line.

On a subsequent consumption (read) of the poisoned data by software,
there are two possible scenarios:

 - The core determines that execution can continue and treats the poison
   with exception semantics, signaled as a #MCE

 - The core determines that execution cannot continue and does an
   unbreakable shutdown

For more details, see Chapter 14 of Intel TDX Module EAS[1]

As some integrity check failures may lead to a system shutdown, the host
kernel must not allow any writes to TD-private memory. This requirement
clashes with the KVM design: KVM expects the guest memory to be mapped
into host userspace (e.g. QEMU).

This patch aims to start a discussion on how we can approach the issue.

For now I intentionally keep TDX out of the picture here and try to find
a generic way to unmap KVM guest memory from host userspace. Hopefully,
it makes the patch more approachable, and anyone can try it out.

To the proposal:

Looking into existing codepaths, I've discovered that we already have the
semantics we want. That's PG_hwpoison'ed pages and SWP_HWPOISON swap
entries in page tables:

  - If an application touches a page mapped with the SWP_HWPOISON, it will
    get SIGBUS.

  - GUP will fail with -EFAULT;

Accessing the poisoned memory via the page cache doesn't match the
required semantics right now, but it shouldn't be too hard to make it
work: access to poisoned dirty pages should give -EIO or -EHWPOISON.

My idea is that we can mark the page as poisoned when we make it
TD-private and replace all PTEs that map the page with SWP_HWPOISON
entries.

TODO: THP support is missing.

[1] https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
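
To make this more concrete: converting a page to TD-private boils down to
something like the sketch below. This is a simplified illustration of the
approach (it assumes the page is already locked and ignores error
handling); the code added by this patch in virt/kvm/mem_protected.c also
tracks which KVM instance poisoned the page, so only that instance may
touch it again.

static bool protect_page(struct page *page)	/* name is illustrative */
{
	/*
	 * Mark the page as poisoned and replace every PTE that maps it
	 * with a SWP_HWPOISON entry: userspace access then gets SIGBUS
	 * and GUP fails with -EFAULT, matching the semantics above.
	 */
	SetPageHWPoison(page);
	return try_to_unmap(page, TTU_IGNORE_MLOCK);
}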

Not-signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kvm/Kconfig           |   1 +
 arch/x86/kvm/cpuid.c           |   3 +-
 arch/x86/kvm/mmu/mmu.c         |  12 ++-
 arch/x86/kvm/mmu/paging_tmpl.h |  10 +-
 arch/x86/kvm/x86.c             |   6 ++
 include/linux/kvm_host.h       |  19 ++++
 include/uapi/linux/kvm_para.h  |   1 +
 mm/page_alloc.c                |   7 ++
 virt/Makefile                  |   2 +-
 virt/kvm/Kconfig               |   4 +
 virt/kvm/Makefile              |   1 +
 virt/kvm/kvm_main.c            | 163 ++++++++++++++++++++++++++++-----
 virt/kvm/mem_protected.c       | 110 ++++++++++++++++++++++
 13 files changed, 311 insertions(+), 28 deletions(-)
 create mode 100644 virt/kvm/Makefile
 create mode 100644 virt/kvm/mem_protected.c

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 7ac592664c52..b7db1c455e7c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -46,6 +46,7 @@ config KVM
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
 	select SRCU
+	select HAVE_KVM_PROTECTED_MEMORY
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 38172ca627d3..1457692c1080 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -796,7 +796,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			     (1 << KVM_FEATURE_PV_SEND_IPI) |
 			     (1 << KVM_FEATURE_POLL_CONTROL) |
 			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
-			     (1 << KVM_FEATURE_ASYNC_PF_INT);
+			     (1 << KVM_FEATURE_ASYNC_PF_INT) |
+			     (1 << KVM_FEATURE_MEM_PROTECTED);
 
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3b97342af026..3fa76693edcd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -43,6 +43,7 @@
 #include <linux/hash.h>
 #include <linux/kern_levels.h>
 #include <linux/kthread.h>
+#include <linux/rmap.h>
 
 #include <asm/page.h>
 #include <asm/memtype.h>
@@ -2758,7 +2759,8 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 	if (sp->role.level > PG_LEVEL_4K)
 		return;
 
-	__direct_pte_prefetch(vcpu, sp, sptep);
+	if (!vcpu->kvm->mem_protected)
+		__direct_pte_prefetch(vcpu, sp, sptep);
 }
 
 static int host_pfn_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
@@ -3725,6 +3727,14 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
 		return r;
 
+	if (vcpu->kvm->mem_protected && unlikely(!is_noslot_pfn(pfn))) {
+		if (!kvm_protect_pfn(vcpu->kvm, gfn, pfn)) {
+			unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gfn);
+			kvm_send_hwpoison_signal(hva, current);
+			return RET_PF_RETRY;
+		}
+	}
+
 	r = RET_PF_RETRY;
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 50e268eb8e1a..26b0494a1207 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -397,8 +397,14 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 			goto error;
 
 		ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
-		if (unlikely(__get_user(pte, ptep_user)))
-			goto error;
+		if (vcpu->kvm->mem_protected) {
+			if (copy_from_guest(vcpu->kvm, &pte, host_addr + offset,
+					    sizeof(pte)))
+				goto error;
+		} else {
+			if (unlikely(__get_user(pte, ptep_user)))
+				goto error;
+		}
 		walker->ptep_user[walker->level - 1] = ptep_user;
 
 		trace_kvm_mmu_paging_element(pte, walker->level);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b404e4d7dd8..f8183386abe7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8170,6 +8170,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		kvm_sched_yield(vcpu->kvm, a0);
 		ret = 0;
 		break;
+	case KVM_HC_ENABLE_MEM_PROTECTED:
+		ret = kvm_protect_memory(vcpu->kvm);
+		break;
+	case KVM_HC_MEM_SHARE:
+		ret = kvm_share_memory(vcpu->kvm, a0, a1);
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fadaccb95a4c..cd2374802702 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -436,6 +436,8 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 }
 #endif
 
+#define KVM_NR_SHARED_RANGES 32
+
 /*
  * Note:
  * memslots are not sorted by id anymore, please use id_to_memslot()
@@ -513,6 +515,10 @@ struct kvm {
 	pid_t userspace_pid;
 	unsigned int max_halt_poll_ns;
 	u32 dirty_ring_size;
+	bool mem_protected;
+	void *id;
+	int nr_shared_ranges;
+	struct range shared_ranges[KVM_NR_SHARED_RANGES];
 };
 
 #define kvm_err(fmt, ...) \
@@ -709,6 +715,16 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm);
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 				   struct kvm_memory_slot *slot);
 
+int kvm_protect_memory(struct kvm *kvm);
+
+void __kvm_share_memory(struct kvm *kvm, unsigned long start, unsigned long end);
+int kvm_share_memory(struct kvm *kvm, unsigned long gfn, unsigned long npages);
+
+bool kvm_protect_pfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn);
+void kvm_share_pfn(struct kvm *kvm, kvm_pfn_t pfn);
+
+bool kvm_page_allowed(struct kvm *kvm, struct page *page);
+
 int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
 			    struct page **pages, int nr_pages);
 
@@ -718,6 +734,9 @@ unsigned long gfn_to_hva_prot(struct kvm *kvm, gfn_t gfn, bool *writable);
 unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
 unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
 				      bool *writable);
+int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len);
+int copy_to_guest(struct kvm *kvm, unsigned long hva, const void *data, int len);
+
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 void kvm_set_page_accessed(struct page *page);
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 09d36683ee0a..743e621111f0 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -17,6 +17,7 @@
 #define KVM_E2BIG		E2BIG
 #define KVM_EPERM		EPERM
 #define KVM_EOPNOTSUPP		95
+#define KVM_EINTR		EINTR
 
 #define KVM_HC_VAPIC_POLL_IRQ		1
 #define KVM_HC_MMU_OP			2
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d5b6f7..5d05129e8b0c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1224,6 +1224,13 @@ static __always_inline bool free_pages_prepare(struct page *page,
 
 	trace_mm_page_free(page, order);
 
+	if (IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) &&
+	    unlikely(PageHWPoison(page))) {
+		void kvm_unpoison_page(struct page *page);
+
+		kvm_unpoison_page(page);
+	}
+
 	if (unlikely(PageHWPoison(page)) && !order) {
 		/*
 		 * Do not let hwpoison pages hit pcplists/buddy
diff --git a/virt/Makefile b/virt/Makefile
index 1cfea9436af9..a931049086a3 100644
--- a/virt/Makefile
+++ b/virt/Makefile
@@ -1,2 +1,2 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-y	+= lib/
+obj-y	+= lib/ kvm/
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 1c37ccd5d402..5f28048718f4 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -63,3 +63,7 @@ config HAVE_KVM_NO_POLL
 
 config KVM_XFER_TO_GUEST_WORK
        bool
+
+config HAVE_KVM_PROTECTED_MEMORY
+       bool
+       select MEMORY_FAILURE
diff --git a/virt/kvm/Makefile b/virt/kvm/Makefile
new file mode 100644
index 000000000000..f10c62b98968
--- /dev/null
+++ b/virt/kvm/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_HAVE_KVM_PROTECTED_MEMORY) += mem_protected.o
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3e2dbaef7979..9cbd3716b3e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -51,6 +51,7 @@
 #include <linux/io.h>
 #include <linux/lockdep.h>
 #include <linux/kthread.h>
+#include <linux/rmap.h>
 
 #include <asm/processor.h>
 #include <asm/ioctl.h>
@@ -741,6 +742,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	struct kvm *kvm = kvm_arch_alloc_vm();
 	int r = -ENOMEM;
 	int i;
+	static long id = 0;
 
 	if (!kvm)
 		return ERR_PTR(-ENOMEM);
@@ -803,6 +805,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
 
 	mutex_lock(&kvm_lock);
 	list_add(&kvm->vm_list, &vm_list);
+	kvm->id = xa_mk_value(id++);
+	if (id < 0)
+		id = 0;
 	mutex_unlock(&kvm_lock);
 
 	preempt_notifier_inc();
@@ -1852,7 +1857,8 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
  * The slow path to get the pfn of the specified host virtual address,
  * 1 indicates success, -errno is returned if error is detected.
  */
-static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
+static int hva_to_pfn_slow(struct kvm *kvm, unsigned long addr,
+			   bool *async, bool write_fault,
 			   bool *writable, kvm_pfn_t *pfn)
 {
 	unsigned int flags = FOLL_HWPOISON;
@@ -1868,11 +1874,17 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 		flags |= FOLL_WRITE;
 	if (async)
 		flags |= FOLL_NOWAIT;
+	if (kvm->mem_protected)
+		flags |= FOLL_ALLOW_POISONED;
 
 	npages = get_user_pages_unlocked(addr, 1, &page, flags);
 	if (npages != 1)
 		return npages;
 
+	if (IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) &&
+	    kvm->mem_protected && !kvm_page_allowed(kvm, page))
+		return -EHWPOISON;
+
 	/* map read fault as writable if possible */
 	if (unlikely(!write_fault) && writable) {
 		struct page *wpage;
@@ -1961,8 +1973,9 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
  * 2): @write_fault = false && @writable, @writable will tell the caller
  *     whether the mapping is writable.
  */
-static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
-			bool write_fault, bool *writable)
+static kvm_pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr,
+			    bool atomic, bool *async,
+			    bool write_fault, bool *writable)
 {
 	struct vm_area_struct *vma;
 	kvm_pfn_t pfn = 0;
@@ -1977,7 +1990,7 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 	if (atomic)
 		return KVM_PFN_ERR_FAULT;
 
-	npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfn);
+	npages = hva_to_pfn_slow(kvm, addr, async, write_fault, writable, &pfn);
 	if (npages == 1)
 		return pfn;
 
@@ -2033,8 +2046,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 		writable = NULL;
 	}
 
-	return hva_to_pfn(addr, atomic, async, write_fault,
-			  writable);
+	return hva_to_pfn(kvm, addr, atomic, async, write_fault, writable);
 }
 EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 
@@ -2338,19 +2350,93 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
-static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
-				 void *data, int offset, int len)
+int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len)
+{
+	int offset = offset_in_page(hva);
+	struct page *page;
+	int npages, seg;
+	void *vaddr;
+
+	if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
+	    !kvm->mem_protected) {
+		return __copy_from_user(data, (void __user *)hva, len);
+	}
+
+	might_fault();
+	kasan_check_write(data, len);
+	check_object_size(data, len, false);
+
+	while ((seg = next_segment(len, offset)) != 0) {
+		npages = get_user_pages_unlocked(hva, 1, &page,
+						 FOLL_ALLOW_POISONED);
+		if (npages != 1)
+			return -EFAULT;
+
+		if (!kvm_page_allowed(kvm, page))
+			return -EFAULT;
+
+		vaddr = kmap_atomic(page);
+		memcpy(data, vaddr + offset, seg);
+		kunmap_atomic(vaddr);
+
+		put_page(page);
+		len -= seg;
+		hva += seg;
+		data += seg;
+		offset = 0;
+	}
+
+	return 0;
+}
+
+int copy_to_guest(struct kvm *kvm, unsigned long hva, const void *data, int len)
+{
+	int offset = offset_in_page(hva);
+	struct page *page;
+	int npages, seg;
+	void *vaddr;
+
+	if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
+	    !kvm->mem_protected) {
+		return __copy_to_user((void __user *)hva, data, len);
+	}
+
+	might_fault();
+	kasan_check_read(data, len);
+	check_object_size(data, len, true);
+
+	while ((seg = next_segment(len, offset)) != 0) {
+		npages = get_user_pages_unlocked(hva, 1, &page,
+						 FOLL_WRITE | FOLL_ALLOW_POISONED);
+		if (npages != 1)
+			return -EFAULT;
+
+		if (!kvm_page_allowed(kvm, page))
+			return -EFAULT;
+
+		vaddr = kmap_atomic(page);
+		memcpy(vaddr + offset, data, seg);
+		kunmap_atomic(vaddr);
+
+		put_page(page);
+		len -= seg;
+		hva += seg;
+		data += seg;
+		offset = 0;
+	}
+
+	return 0;
+}
+
+static int __kvm_read_guest_page(struct kvm *kvm, struct kvm_memory_slot *slot,
+				 gfn_t gfn, void *data, int offset, int len)
 {
-	int r;
 	unsigned long addr;
 
 	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
-	r = __copy_from_user(data, (void __user *)addr + offset, len);
-	if (r)
-		return -EFAULT;
-	return 0;
+	return copy_from_guest(kvm, data, addr + offset, len);
 }
 
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
@@ -2358,7 +2444,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_page);
 
@@ -2367,7 +2453,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
 
@@ -2449,7 +2535,8 @@ static int __kvm_write_guest_page(struct kvm *kvm,
 	addr = gfn_to_hva_memslot(memslot, gfn);
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
-	r = __copy_to_user((void __user *)addr + offset, data, len);
+
+	r = copy_to_guest(kvm, addr + offset, data, len);
 	if (r)
 		return -EFAULT;
 	mark_page_dirty_in_slot(kvm, memslot, gfn);
@@ -2586,7 +2673,7 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 	if (unlikely(!ghc->memslot))
 		return kvm_write_guest(kvm, gpa, data, len);
 
-	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
+	r = copy_to_guest(kvm, ghc->hva + offset, data, len);
 	if (r)
 		return -EFAULT;
 	mark_page_dirty_in_slot(kvm, ghc->memslot, gpa >> PAGE_SHIFT);
@@ -2607,7 +2694,6 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 				 unsigned long len)
 {
 	struct kvm_memslots *slots = kvm_memslots(kvm);
-	int r;
 	gpa_t gpa = ghc->gpa + offset;
 
 	BUG_ON(len + offset > ghc->len);
@@ -2623,11 +2709,7 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 	if (unlikely(!ghc->memslot))
 		return kvm_read_guest(kvm, gpa, data, len);
 
-	r = __copy_from_user(data, (void __user *)ghc->hva + offset, len);
-	if (r)
-		return -EFAULT;
-
-	return 0;
+	return copy_from_guest(kvm, data, ghc->hva + offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_offset_cached);
 
@@ -2693,6 +2775,41 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
 
+int kvm_protect_memory(struct kvm *kvm)
+{
+	if (mmap_write_lock_killable(kvm->mm))
+		return -KVM_EINTR;
+
+	kvm->mem_protected = true;
+	kvm_arch_flush_shadow_all(kvm);
+	mmap_write_unlock(kvm->mm);
+
+	return 0;
+}
+
+int kvm_share_memory(struct kvm *kvm, unsigned long gfn, unsigned long npages)
+{
+	unsigned long end = gfn + npages;
+
+	if (!npages || !IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY))
+		return 0;
+
+	__kvm_share_memory(kvm, gfn, end);
+
+	for (; gfn < end; gfn++) {
+		struct page *page = gfn_to_page(kvm, gfn);
+		unsigned long pfn = page_to_pfn(page);
+
+		if (page == KVM_ERR_PTR_BAD_PAGE)
+			continue;
+
+		kvm_share_pfn(kvm, pfn);
+		put_page(page);
+	}
+
+	return 0;
+}
+
 void kvm_sigset_activate(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu->sigset_active)
diff --git a/virt/kvm/mem_protected.c b/virt/kvm/mem_protected.c
new file mode 100644
index 000000000000..81882bd3232b
--- /dev/null
+++ b/virt/kvm/mem_protected.c
@@ -0,0 +1,110 @@
+#include <linux/kvm_host.h>
+#include <linux/mm.h>
+#include <linux/rmap.h>
+
+static DEFINE_XARRAY(kvm_pfn_map);
+
+static bool gfn_is_shared(struct kvm *kvm, unsigned long gfn)
+{
+	bool ret = false;
+	int i;
+
+	spin_lock(&kvm->mmu_lock);
+	for (i = 0; i < kvm->nr_shared_ranges; i++) {
+		if (gfn < kvm->shared_ranges[i].start)
+			continue;
+		if (gfn >= kvm->shared_ranges[i].end)
+			continue;
+
+		ret = true;
+		break;
+	}
+	spin_unlock(&kvm->mmu_lock);
+
+	return ret;
+}
+
+bool kvm_protect_pfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	bool ret = true;
+
+	if (gfn_is_shared(kvm, gfn))
+		return true;
+
+	if (is_zero_pfn(pfn))
+		return true;
+
+	/* Only anonymous and shmem/tmpfs pages are supported */
+	if (!PageSwapBacked(page))
+		return false;
+
+	lock_page(page);
+
+	/* Recheck gfn_is_shared() under page lock */
+	if (gfn_is_shared(kvm, gfn))
+		goto out;
+
+	if (!TestSetPageHWPoison(page)) {
+		try_to_unmap(page, TTU_IGNORE_MLOCK);
+		xa_store(&kvm_pfn_map, pfn, kvm->id, GFP_KERNEL);
+	} else if (xa_load(&kvm_pfn_map, pfn) != kvm->id) {
+		ret = false;
+	}
+out:
+	unlock_page(page);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_protect_pfn);
+
+void __kvm_share_memory(struct kvm *kvm,
+			unsigned long start, unsigned long end)
+{
+	/*
+	 * Out of slots.
+	 * Still worth to proceed: the new range may merge with an existing
+	 * one.
+	 */
+	WARN_ON_ONCE(kvm->nr_shared_ranges == ARRAY_SIZE(kvm->shared_ranges));
+
+	spin_lock(&kvm->mmu_lock);
+	kvm->nr_shared_ranges = add_range_with_merge(kvm->shared_ranges,
+						ARRAY_SIZE(kvm->shared_ranges),
+						kvm->nr_shared_ranges,
+						start, end);
+	kvm->nr_shared_ranges = clean_sort_range(kvm->shared_ranges,
+					    ARRAY_SIZE(kvm->shared_ranges));
+	spin_unlock(&kvm->mmu_lock);
+}
+EXPORT_SYMBOL(__kvm_share_memory);
+
+void kvm_share_pfn(struct kvm *kvm, kvm_pfn_t pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	lock_page(page);
+	if (xa_load(&kvm_pfn_map, pfn) == kvm->id) {
+		xa_erase(&kvm_pfn_map, pfn);
+		ClearPageHWPoison(page);
+	}
+	unlock_page(page);
+}
+EXPORT_SYMBOL(kvm_share_pfn);
+
+void kvm_unpoison_page(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	if (xa_erase(&kvm_pfn_map, pfn))
+		ClearPageHWPoison(page);
+}
+
+bool kvm_page_allowed(struct kvm *kvm, struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	if (!PageHWPoison(page))
+		return true;
+
+	return xa_load(&kvm_pfn_map, pfn) == kvm->id;
+}
-- 
2.26.3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 02/13] x86/kvm: Introduce KVM memory protection feature
  2021-04-16 15:40 ` [RFCv2 02/13] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
@ 2021-04-16 16:10   ` Borislav Petkov
  2021-04-19 10:10     ` Kirill A. Shutemov
  0 siblings, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2021-04-16 16:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, Erdem Aktas,
	Steve Rutherford, Peter Gonda, David Hildenbrand, x86, kvm,
	linux-mm, linux-kernel

On Fri, Apr 16, 2021 at 06:40:55PM +0300, Kirill A. Shutemov wrote:
> Provide basic helpers, KVM_FEATURE, CPUID flag and a hypercall.
> 
> Host side doesn't provide the feature yet, so it is a dead code for now.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/include/asm/cpufeatures.h   |  1 +
>  arch/x86/include/asm/kvm_para.h      |  5 +++++
>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
>  arch/x86/kernel/kvm.c                | 17 +++++++++++++++++
>  include/uapi/linux/kvm_para.h        |  3 ++-
>  5 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 84b887825f12..d8f3d2619913 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -238,6 +238,7 @@
>  #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
>  #define X86_FEATURE_SEV_ES		( 8*32+20) /* AMD Secure Encrypted Virtualization - Encrypted State */
>  #define X86_FEATURE_VM_PAGE_FLUSH	( 8*32+21) /* "" VM Page Flush MSR is supported */
> +#define X86_FEATURE_KVM_MEM_PROTECTED	( 8*32+22) /* "" KVM memory protection extension */

That feature bit is still unused.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 04/13] x86/kvm: Use bounce buffers for KVM memory protection
  2021-04-16 15:40 ` [RFCv2 04/13] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
@ 2021-04-16 16:21   ` Dave Hansen
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2021-04-16 16:21 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel

On 4/16/21 8:40 AM, Kirill A. Shutemov wrote:
> Mirror SEV, use SWIOTLB always if KVM memory protection is enabled.
...
>  arch/x86/mm/mem_encrypt.c          | 44 ---------------------------
>  arch/x86/mm/mem_encrypt_common.c   | 48 ++++++++++++++++++++++++++++++

The changelog needs to at least mention what's going on here.  It doesn't
prepare me at all for having code move around.

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index d197b3beb904..c51d14db5620 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -812,6 +812,7 @@ config KVM_GUEST
>  	select ARCH_CPUIDLE_HALTPOLL
>  	select X86_HV_CALLBACK_VECTOR
>  	select X86_MEM_ENCRYPT_COMMON
> +	select SWIOTLB
>  	default y
>  	help
>  	  This option enables various optimizations for running under the KVM

So, this feature is always compiled in with KVM.  Could you say a couple
of things about that?  Why did you decide not to have a Kconfig option for it?

> diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
> index 31c4df123aa0..a748b30c2f23 100644
> --- a/arch/x86/include/asm/mem_encrypt.h
> +++ b/arch/x86/include/asm/mem_encrypt.h
> @@ -47,10 +47,8 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size);
>  
>  void __init mem_encrypt_free_decrypted_mem(void);
>  
> -/* Architecture __weak replacement functions */
> -void __init mem_encrypt_init(void);
> -
>  void __init sev_es_init_vc_handling(void);
> +
>  bool sme_active(void);
>  bool sev_active(void);
>  bool sev_es_active(void);
> @@ -91,6 +89,9 @@ static inline void mem_encrypt_free_decrypted_mem(void) { }
>  
>  #endif	/* CONFIG_AMD_MEM_ENCRYPT */
>  
> +/* Architecture __weak replacement functions */
> +void __init mem_encrypt_init(void);

FWIW, I'd rather have the code movement in separate patches from the
functional changes.

>  /*
>   * The __sme_pa() and __sme_pa_nodebug() macros are meant for use when
>   * writing to or comparing values from the cr3 register.  Having the
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index aed6034fcac1..ba179f5ca198 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -26,6 +26,7 @@
>  #include <linux/kprobes.h>
>  #include <linux/nmi.h>
>  #include <linux/swait.h>
> +#include <linux/swiotlb.h>
>  #include <asm/timer.h>
>  #include <asm/cpu.h>
>  #include <asm/traps.h>
> @@ -765,6 +766,7 @@ static void __init kvm_init_platform(void)
>  		pr_info("KVM memory protection enabled\n");
>  		mem_protected = true;
>  		setup_force_cpu_cap(X86_FEATURE_KVM_MEM_PROTECTED);
> +		swiotlb_force = SWIOTLB_FORCE;
>  	}
>  }
>  
> diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
> index c2cfa5e7c152..814060a6ceb0 100644
> --- a/arch/x86/kernel/pci-swiotlb.c
> +++ b/arch/x86/kernel/pci-swiotlb.c
> @@ -13,6 +13,7 @@
>  #include <asm/dma.h>
>  #include <asm/xen/swiotlb-xen.h>
>  #include <asm/iommu_table.h>
> +#include <asm/kvm_para.h>
>  
>  int swiotlb __read_mostly;
>  
> @@ -49,7 +50,7 @@ int __init pci_swiotlb_detect_4gb(void)
>  	 * buffers are allocated and used for devices that do not support
>  	 * the addressing range required for the encryption mask.
>  	 */
> -	if (sme_active())
> +	if (sme_active() || kvm_mem_protected())
>  		swiotlb = 1;
>  
>  	return swiotlb;

While I don't doubt you got it right, it would be nice to also explain
in the changelog why you manipulate both 'swiotlb_force' and 'swiotlb'.

> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 9ca477b9b8ba..3478f20fb46f 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -409,47 +409,3 @@ void __init mem_encrypt_free_decrypted_mem(void)
>  
>  	free_init_pages("unused decrypted", vaddr, vaddr_end);
>  }
> -
> -static void print_mem_encrypt_feature_info(void)
> -{
> -	pr_info("AMD Memory Encryption Features active:");
> -
> -	/* Secure Memory Encryption */
> -	if (sme_active()) {
> -		/*
> -		 * SME is mutually exclusive with any of the SEV
> -		 * features below.
> -		 */
> -		pr_cont(" SME\n");
> -		return;
> -	}
> -
> -	/* Secure Encrypted Virtualization */
> -	if (sev_active())
> -		pr_cont(" SEV");
> -
> -	/* Encrypted Register State */
> -	if (sev_es_active())
> -		pr_cont(" SEV-ES");
> -
> -	pr_cont("\n");
> -}
> -
> -/* Architecture __weak replacement functions */
> -void __init mem_encrypt_init(void)
> -{
> -	if (!sme_me_mask)
> -		return;
> -
> -	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
> -	swiotlb_update_mem_attributes();
> -
> -	/*
> -	 * With SEV, we need to unroll the rep string I/O instructions.
> -	 */
> -	if (sev_active())
> -		static_branch_enable(&sev_enable_key);
> -
> -	print_mem_encrypt_feature_info();
> -}
> -
> diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
> index 6bf0718bb72a..351b77361a5d 100644
> --- a/arch/x86/mm/mem_encrypt_common.c
> +++ b/arch/x86/mm/mem_encrypt_common.c
> @@ -11,6 +11,7 @@
>  #include <linux/mem_encrypt.h>
>  #include <linux/dma-direct.h>
>  #include <asm/kvm_para.h>
> +#include <asm/mem_encrypt.h>
>  
>  /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
>  bool force_dma_unencrypted(struct device *dev)
> @@ -37,3 +38,50 @@ bool force_dma_unencrypted(struct device *dev)
>  
>  	return false;
>  }
> +
> +static void print_mem_encrypt_feature_info(void)
> +{
> +	if (kvm_mem_protected()) {
> +		pr_info("KVM memory protection enabled\n");
> +		return;
> +	}

I understand that they're touching similar areas of code, but I'm a bit
unnerved with memory protection being in all these "encryption"
functions and files.

I think some thoughtful renaming is in order.

> +	pr_info("AMD Memory Encryption Features active:");
> +
> +	/* Secure Memory Encryption */
> +	if (sme_active()) {
> +		/*
> +		 * SME is mutually exclusive with any of the SEV
> +		 * features below.
> +		 */
> +		pr_cont(" SME\n");
> +		return;
> +	}
> +
> +	/* Secure Encrypted Virtualization */
> +	if (sev_active())
> +		pr_cont(" SEV");
> +
> +	/* Encrypted Register State */
> +	if (sev_es_active())
> +		pr_cont(" SEV-ES");
> +
> +	pr_cont("\n");
> +}

This, for instance, really shouldn't be in common code.  It should be in
an AMD-specific area.

> +void __init mem_encrypt_init(void)
> +{
> +	if (!sme_me_mask && !kvm_mem_protected())
> +		return;
> +
> +	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
> +	swiotlb_update_mem_attributes();
> +
> +	/*
> +	 * With SEV, we need to unroll the rep string I/O instructions.
> +	 */
> +	if (sev_active())
> +		static_branch_enable(&sev_enable_key);
> +
> +	print_mem_encrypt_feature_info();
> +}

This function is called like this:

>         /*
>          * This needs to be called before any devices perform DMA
>          * operations that might use the SWIOTLB bounce buffers. It will
>          * mark the bounce buffers as decrypted so that their usage will
>          * not cause "plain-text" data to be decrypted when accessed.
>          */
>         mem_encrypt_init();

So, maybe this should be x86_swiotlb_init() or something.  Then, move
the print_mem_encrypt_feature_info() elsewhere, probably back out to
mem_init().  Maybe even just call it print_arch_mem_features() or something.
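
Roughly this shape, i.e. keep the SWIOTLB part here and move the printing
out to mem_init() (names as suggested above, purely illustrative):

	/* was the __weak mem_encrypt_init(); only touches SWIOTLB now */
	void __init x86_swiotlb_init(void)
	{
		if (!sme_me_mask && !kvm_mem_protected())
			return;

		/* Mark the SWIOTLB bounce buffers as decrypted/shared */
		swiotlb_update_mem_attributes();

		if (sev_active())
			static_branch_enable(&sev_enable_key);
	}

	/* and this gets called from mem_init() instead */
	void __init print_arch_mem_features(void);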

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 00/13] TDX and guest memory unmapping
  2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2021-04-16 15:41 ` [RFCv2 13/13] KVM: unmap guest memory using poisoned pages Kirill A. Shutemov
@ 2021-04-16 16:46 ` Matthew Wilcox
  13 siblings, 0 replies; 31+ messages in thread
From: Matthew Wilcox @ 2021-04-16 16:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, Erdem Aktas,
	Steve Rutherford, Peter Gonda, David Hildenbrand, x86, kvm,
	linux-mm, linux-kernel

On Fri, Apr 16, 2021 at 06:40:53PM +0300, Kirill A. Shutemov wrote:
> TDX integrity check failures may lead to system shutdown host kernel must
> not allow any writes to TD-private memory. This requirment clashes with
> KVM design: KVM expects the guest memory to be mapped into host userspace
> (e.g. QEMU).
> 
> This patchset aims to start discussion on how we can approach the issue.
> 
> The core of the change is in the last patch. Please see more detailed
> description of the issue and proposoal of the solution there.

This seems to have some parallels with s390's arch_make_page_accessible().
Is there any chance to combine the two, so we don't end up with duplicated
hooks all over the MM for this kind of thing?

https://patchwork.kernel.org/project/kvm/cover/20200214222658.12946-1-borntraeger@de.ibm.com/

and recent THP/Folio-related discussion:
https://lore.kernel.org/linux-mm/20210409194059.GW2531743@casper.infradead.org/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
  2021-04-16 15:41 ` [RFCv2 13/13] KVM: unmap guest memory using poisoned pages Kirill A. Shutemov
@ 2021-04-16 17:30   ` Sean Christopherson
  2021-04-19 11:32     ` Xiaoyao Li
  2021-04-19 14:26     ` Kirill A. Shutemov
  0 siblings, 2 replies; 31+ messages in thread
From: Sean Christopherson @ 2021-04-16 17:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Jim Mattson,
	David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel

On Fri, Apr 16, 2021, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1b404e4d7dd8..f8183386abe7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8170,6 +8170,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  		kvm_sched_yield(vcpu->kvm, a0);
>  		ret = 0;
>  		break;
> +	case KVM_HC_ENABLE_MEM_PROTECTED:
> +		ret = kvm_protect_memory(vcpu->kvm);
> +		break;
> +	case KVM_HC_MEM_SHARE:
> +		ret = kvm_share_memory(vcpu->kvm, a0, a1);

Can you take a look at a proposed hypercall interface for SEV live migration and
holler if you (or anyone else) thinks it will have extensibility issues?

https://lkml.kernel.org/r/93d7f2c2888315adc48905722574d89699edde33.1618498113.git.ashish.kalra@amd.com

> +		break;
>  	default:
>  		ret = -KVM_ENOSYS;
>  		break;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fadaccb95a4c..cd2374802702 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -436,6 +436,8 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  }
>  #endif
>  
> +#define KVM_NR_SHARED_RANGES 32
> +
>  /*
>   * Note:
>   * memslots are not sorted by id anymore, please use id_to_memslot()
> @@ -513,6 +515,10 @@ struct kvm {
>  	pid_t userspace_pid;
>  	unsigned int max_halt_poll_ns;
>  	u32 dirty_ring_size;
> +	bool mem_protected;
> +	void *id;
> +	int nr_shared_ranges;
> +	struct range shared_ranges[KVM_NR_SHARED_RANGES];

Hard no for me.  IMO, anything that requires KVM to track shared/pinned pages in
a separate tree/array is a non-starter.  More specific to TDX #MCs, KVM should not
be the canonical reference for the state of a page.

>  };
>  
>  #define kvm_err(fmt, ...) \

...

> @@ -1868,11 +1874,17 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>  		flags |= FOLL_WRITE;
>  	if (async)
>  		flags |= FOLL_NOWAIT;
> +	if (kvm->mem_protected)
> +		flags |= FOLL_ALLOW_POISONED;

This is unsafe, only the flows that are mapping the PFN into the guest should
use ALLOW_POISONED, e.g. __kvm_map_gfn() should fail on a poisoned page.
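
E.g. something along these lines (rough sketch, untested):

	/* in __kvm_map_gfn(), before setting up the mapping */
	if (pfn_valid(pfn) && PageHWPoison(pfn_to_page(pfn)))
		return -EFAULT;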

>  
>  	npages = get_user_pages_unlocked(addr, 1, &page, flags);
>  	if (npages != 1)
>  		return npages;
>  
> +	if (IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) &&
> +	    kvm->mem_protected && !kvm_page_allowed(kvm, page))
> +		return -EHWPOISON;
> +
>  	/* map read fault as writable if possible */
>  	if (unlikely(!write_fault) && writable) {
>  		struct page *wpage;

...

> @@ -2338,19 +2350,93 @@ static int next_segment(unsigned long len, int offset)
>  		return len;
>  }
>  
> -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> -				 void *data, int offset, int len)
> +int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len)
> +{
> +	int offset = offset_in_page(hva);
> +	struct page *page;
> +	int npages, seg;
> +	void *vaddr;
> +
> +	if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
> +	    !kvm->mem_protected) {
> +		return __copy_from_user(data, (void __user *)hva, len);
> +	}
> +
> +	might_fault();
> +	kasan_check_write(data, len);
> +	check_object_size(data, len, false);
> +
> +	while ((seg = next_segment(len, offset)) != 0) {
> +		npages = get_user_pages_unlocked(hva, 1, &page,
> +						 FOLL_ALLOW_POISONED);
> +		if (npages != 1)
> +			return -EFAULT;
> +
> +		if (!kvm_page_allowed(kvm, page))
> +			return -EFAULT;
> +
> +		vaddr = kmap_atomic(page);
> +		memcpy(data, vaddr + offset, seg);
> +		kunmap_atomic(vaddr);

Why is KVM allowed to access a poisoned page?  I would expect shared pages to
_not_ be poisoned.  Except for pure software emulation of SEV, KVM can't access
guest private memory.

> +
> +		put_page(page);
> +		len -= seg;
> +		hva += seg;
> +		data += seg;
> +		offset = 0;
> +	}
> +
> +	return 0;
> +}

...
  
> @@ -2693,6 +2775,41 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>  
> +int kvm_protect_memory(struct kvm *kvm)
> +{
> +	if (mmap_write_lock_killable(kvm->mm))
> +		return -KVM_EINTR;
> +
> +	kvm->mem_protected = true;
> +	kvm_arch_flush_shadow_all(kvm);
> +	mmap_write_unlock(kvm->mm);
> +
> +	return 0;
> +}
> +
> +int kvm_share_memory(struct kvm *kvm, unsigned long gfn, unsigned long npages)
> +{
> +	unsigned long end = gfn + npages;
> +
> +	if (!npages || !IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY))
> +		return 0;
> +
> +	__kvm_share_memory(kvm, gfn, end);
> +
> +	for (; gfn < end; gfn++) {
> +		struct page *page = gfn_to_page(kvm, gfn);
> +		unsigned long pfn = page_to_pfn(page);
> +
> +		if (page == KVM_ERR_PTR_BAD_PAGE)
> +			continue;
> +
> +		kvm_share_pfn(kvm, pfn);

I like the idea of using "special" PTE value to denote guest private memory,
e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM involved in the
manipulation of the special flag/value.

Today, userspace owns the gfn->hva translations and the kernel effectively owns
the hva->pfn translations (with input from userspace).  KVM just connects the
dots.

Having KVM own the shared/private transitions means KVM is now part owner of the
entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary MMU
and a co-owner of the primary MMU.  This creates locking madness, e.g. KVM taking
mmap_sem for write, mmu_lock under page lock, etc..., and also takes control away
from userspace.  E.g. userspace strategy could be to use a separate backing/pool
for shared memory and change the gfn->hva translation (memslots) in reaction to
a shared/private conversion.  Automatically swizzling things in KVM takes away
that option.

IMO, KVM should be entirely "passive" in this process, e.g. the guest shares or
protects memory, userspace calls into the kernel to change state, and the kernel
manages the page tables to prevent bad actors.  KVM simply does the plumbing for
the guest page tables.

> +		put_page(page);
> +	}
> +
> +	return 0;
> +}
> +
>  void kvm_sigset_activate(struct kvm_vcpu *vcpu)
>  {
>  	if (!vcpu->sigset_active)
> diff --git a/virt/kvm/mem_protected.c b/virt/kvm/mem_protected.c
> new file mode 100644
> index 000000000000..81882bd3232b
> --- /dev/null
> +++ b/virt/kvm/mem_protected.c
> @@ -0,0 +1,110 @@
> +#include <linux/kvm_host.h>
> +#include <linux/mm.h>
> +#include <linux/rmap.h>
> +
> +static DEFINE_XARRAY(kvm_pfn_map);
> +
> +static bool gfn_is_shared(struct kvm *kvm, unsigned long gfn)
> +{
> +	bool ret = false;
> +	int i;
> +
> +	spin_lock(&kvm->mmu_lock);

Taking mmu_lock for write in the page fault path is a non-starter.  The TDP MMU
is being converted to support concurrent faults, which is a foundational pillar
of its scalability.

> +	for (i = 0; i < kvm->nr_shared_ranges; i++) {
> +		if (gfn < kvm->shared_ranges[i].start)
> +			continue;
> +		if (gfn >= kvm->shared_ranges[i].end)
> +			continue;
> +
> +		ret = true;
> +		break;
> +	}
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	return ret;
> +}


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 02/13] x86/kvm: Introduce KVM memory protection feature
  2021-04-16 16:10   ` Borislav Petkov
@ 2021-04-19 10:10     ` Kirill A. Shutemov
  0 siblings, 0 replies; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-19 10:10 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, Erdem Aktas,
	Steve Rutherford, Peter Gonda, David Hildenbrand, x86, kvm,
	linux-mm, linux-kernel

On Fri, Apr 16, 2021 at 06:10:30PM +0200, Borislav Petkov wrote:
> On Fri, Apr 16, 2021 at 06:40:55PM +0300, Kirill A. Shutemov wrote:
> > Provide basic helpers, KVM_FEATURE, CPUID flag and a hypercall.
> > 
> > Host side doesn't provide the feature yet, so it is a dead code for now.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/include/asm/cpufeatures.h   |  1 +
> >  arch/x86/include/asm/kvm_para.h      |  5 +++++
> >  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
> >  arch/x86/kernel/kvm.c                | 17 +++++++++++++++++
> >  include/uapi/linux/kvm_para.h        |  3 ++-
> >  5 files changed, 27 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index 84b887825f12..d8f3d2619913 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -238,6 +238,7 @@
> >  #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
> >  #define X86_FEATURE_SEV_ES		( 8*32+20) /* AMD Secure Encrypted Virtualization - Encrypted State */
> >  #define X86_FEATURE_VM_PAGE_FLUSH	( 8*32+21) /* "" VM Page Flush MSR is supported */
> > +#define X86_FEATURE_KVM_MEM_PROTECTED	( 8*32+22) /* "" KVM memory protection extension */
> 
> That feature bit is still unused.

Sorry. Will drop it.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
  2021-04-16 17:30   ` Sean Christopherson
@ 2021-04-19 11:32     ` Xiaoyao Li
  2021-04-19 14:26     ` Kirill A. Shutemov
  1 sibling, 0 replies; 31+ messages in thread
From: Xiaoyao Li @ 2021-04-19 11:32 UTC (permalink / raw)
  To: Sean Christopherson, Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Jim Mattson,
	David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel

On 4/17/2021 1:30 AM, Sean Christopherson wrote:
> On Fri, Apr 16, 2021, Kirill A. Shutemov wrote:
[...]
>> index fadaccb95a4c..cd2374802702 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -436,6 +436,8 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>>   }
>>   #endif
>>   
>> +#define KVM_NR_SHARED_RANGES 32
>> +
>>   /*
>>    * Note:
>>    * memslots are not sorted by id anymore, please use id_to_memslot()
>> @@ -513,6 +515,10 @@ struct kvm {
>>   	pid_t userspace_pid;
>>   	unsigned int max_halt_poll_ns;
>>   	u32 dirty_ring_size;
>> +	bool mem_protected;
>> +	void *id;
>> +	int nr_shared_ranges;
>> +	struct range shared_ranges[KVM_NR_SHARED_RANGES];
> 
> Hard no for me.  IMO, anything that requires KVM to track shared/pinned pages in
> a separate tree/array is non-starter.  More specific to TDX #MCs, KVM should not
> be the canonical reference for the state of a page.
> 

Do you mean something in struct page to track if the page is shared or 
private?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
  2021-04-16 17:30   ` Sean Christopherson
  2021-04-19 11:32     ` Xiaoyao Li
@ 2021-04-19 14:26     ` Kirill A. Shutemov
  2021-04-19 16:01       ` Sean Christopherson
  1 sibling, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-19 14:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Jim Mattson, David Rientjes, Edgecombe, Rick P, Kleen, Andi,
	Yamahata, Isaku, Erdem Aktas, Steve Rutherford, Peter Gonda,
	David Hildenbrand, x86, kvm, linux-mm, linux-kernel

On Fri, Apr 16, 2021 at 05:30:30PM +0000, Sean Christopherson wrote:
> On Fri, Apr 16, 2021, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1b404e4d7dd8..f8183386abe7 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -8170,6 +8170,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >  		kvm_sched_yield(vcpu->kvm, a0);
> >  		ret = 0;
> >  		break;
> > +	case KVM_HC_ENABLE_MEM_PROTECTED:
> > +		ret = kvm_protect_memory(vcpu->kvm);
> > +		break;
> > +	case KVM_HC_MEM_SHARE:
> > +		ret = kvm_share_memory(vcpu->kvm, a0, a1);
> 
> Can you take a look at a proposed hypercall interface for SEV live migration and
> holler if you (or anyone else) thinks it will have extensibility issues?
> 
> https://lkml.kernel.org/r/93d7f2c2888315adc48905722574d89699edde33.1618498113.git.ashish.kalra@amd.com

Will look closer. Thanks.

> > @@ -1868,11 +1874,17 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> >  		flags |= FOLL_WRITE;
> >  	if (async)
> >  		flags |= FOLL_NOWAIT;
> > +	if (kvm->mem_protected)
> > +		flags |= FOLL_ALLOW_POISONED;
> 
> This is unsafe, only the flows that are mapping the PFN into the guest should
> use ALLOW_POISONED, e.g. __kvm_map_gfn() should fail on a poisoned page.

That's true for TDX. I prototyped with pure KVM with minimal modifications
to the guest. We had to be more permissive for that reason. It will go
away for TDX.

> > -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > -				 void *data, int offset, int len)
> > +int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len)
> > +{
> > +	int offset = offset_in_page(hva);
> > +	struct page *page;
> > +	int npages, seg;
> > +	void *vaddr;
> > +
> > +	if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
> > +	    !kvm->mem_protected) {
> > +		return __copy_from_user(data, (void __user *)hva, len);
> > +	}
> > +
> > +	might_fault();
> > +	kasan_check_write(data, len);
> > +	check_object_size(data, len, false);
> > +
> > +	while ((seg = next_segment(len, offset)) != 0) {
> > +		npages = get_user_pages_unlocked(hva, 1, &page,
> > +						 FOLL_ALLOW_POISONED);
> > +		if (npages != 1)
> > +			return -EFAULT;
> > +
> > +		if (!kvm_page_allowed(kvm, page))
> > +			return -EFAULT;
> > +
> > +		vaddr = kmap_atomic(page);
> > +		memcpy(data, vaddr + offset, seg);
> > +		kunmap_atomic(vaddr);
> 
> Why is KVM allowed to access a poisoned page?  I would expect shared pages to
> _not_ be poisoned.  Except for pure software emulation of SEV, KVM can't access
> guest private memory.

Again, it's not going to be in the TDX implementation.


> I like the idea of using "special" PTE value to denote guest private memory,
> e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM involved in the
> manipulation of the special flag/value.
> 
> Today, userspace owns the gfn->hva translations and the kernel effectively owns
> the hva->pfn translations (with input from userspace).  KVM just connects the
> dots.
> 
> Having KVM own the shared/private transitions means KVM is now part owner of the
> entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary MMU
> and a co-owner of the primary MMU.  This creates locking madness, e.g. KVM taking
> mmap_sem for write, mmu_lock under page lock, etc..., and also takes control away
> from userspace.  E.g. userspace strategy could be to use a separate backing/pool
> for shared memory and change the gfn->hva translation (memslots) in reaction to
> a shared/private conversion.  Automatically swizzling things in KVM takes away
> that option.
> 
> IMO, KVM should be entirely "passive" in this process, e.g. the guest shares or
> protects memory, userspace calls into the kernel to change state, and the kernel
> manages the page tables to prevent bad actors.  KVM simply does the plumbing for
> the guest page tables.

That's a new perspective for me. Very interesting.

Let's see how it can look like:

 - KVM only allows poisoned pages (or whatever flag we end up using for
   protection) in the private mappings. SIGBUS otherwise.

 - Poisoned pages must be tied to the KVM instance to be allowed in the
   private mappings. Like kvm->id in the current prototype. SIGBUS
   otherwise.

 - Pages get poisoned on fault-in if the VMA has a new vmflag set.

 - Fault-in of a poisoned page leads to a hwpoison entry. Userspace cannot
   access such pages.

 - Poisoned pages produced this way get unpoisoned on free.

 - The new VMA flag is set by userspace. mprotect(2)?

 - Add a new GUP flag to retrieve such pages from the userspace mapping.
   Used only for private mapping population.

 - Shared gfn ranges are managed by userspace, based on hypercalls from the
   guest.

 - Shared mappings get populated via normal VMA. Any poisoned pages here
   would lead to SIGBUS.

So far it looks pretty straight-forward.
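
To make the fault-in step concrete, something like this (very rough, and
the VM_KVM_PROTECTED name is made up):

	/* fault-in of a page into a marked VMA: poison it instead of mapping */
	if (vma->vm_flags & VM_KVM_PROTECTED) {
		if (TestSetPageHWPoison(page))
			return VM_FAULT_HWPOISON;	/* poisoned by somebody else */
		set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
			   swp_entry_to_pte(make_hwpoison_entry(page)));
		return 0;
	}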

The only thing that I don't understand is at what point the page gets tied
to the KVM instance. Currently we do it just before populating shadow
entries, but that would not work with the new scheme: as we poison pages
on fault-in, they may never get inserted into shadow entries. That's not
good, as we rely on that info to unpoison the page on free.

Maybe we should tie the VMA to the KVM instance when setting the vmflag?
I dunno.

Any comments?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
  2021-04-19 14:26     ` Kirill A. Shutemov
@ 2021-04-19 16:01       ` Sean Christopherson
  2021-04-19 16:40         ` Kirill A. Shutemov
  0 siblings, 1 reply; 31+ messages in thread
From: Sean Christopherson @ 2021-04-19 16:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Jim Mattson, David Rientjes, Edgecombe, Rick P, Kleen, Andi,
	Yamahata, Isaku, Erdem Aktas, Steve Rutherford, Peter Gonda,
	David Hildenbrand, x86, kvm, linux-mm, linux-kernel

On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> On Fri, Apr 16, 2021 at 05:30:30PM +0000, Sean Christopherson wrote:
> > I like the idea of using "special" PTE value to denote guest private memory,
> > e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM involved in the
> > manipulation of the special flag/value.
> > 
> > Today, userspace owns the gfn->hva translations and the kernel effectively owns
> > the hva->pfn translations (with input from userspace).  KVM just connects the
> > dots.
> > 
> > Having KVM own the shared/private transitions means KVM is now part owner of the
> > entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary MMU
> > and a co-owner of the primary MMU.  This creates locking madness, e.g. KVM taking
> > mmap_sem for write, mmu_lock under page lock, etc..., and also takes control away
> > from userspace.  E.g. userspace strategy could be to use a separate backing/pool
> > for shared memory and change the gfn->hva translation (memslots) in reaction to
> > a shared/private conversion.  Automatically swizzling things in KVM takes away
> > that option.
> > 
> > IMO, KVM should be entirely "passive" in this process, e.g. the guest shares or
> > protects memory, userspace calls into the kernel to change state, and the kernel
> > manages the page tables to prevent bad actors.  KVM simply does the plumbing for
> > the guest page tables.
> 
> That's a new perspective for me. Very interesting.
> 
> Let's see how it can look like:
> 
>  - KVM only allows poisoned pages (or whatever flag we end up using for
>    protection) in the private mappings. SIGBUS otherwise.
> 
>  - Poisoned pages must be tied to the KVM instance to be allowed in the
>    private mappings. Like kvm->id in the current prototype. SIGBUS
>    otherwise.
> 
>  - Pages get poisoned on fault in if the VMA has a new vmflag set.
> 
>  - Fault in of a poisoned page leads to hwpoison entry. Userspace cannot
>    access such pages.
> 
>  - Poisoned pages produced this way get unpoisoned on free.
> 
>  - The new VMA flag set by userspace. mprotect(2)?

Ya, or mmap(), though I'm not entirely sure a VMA flag would suffice.  The
notion of the page being private is tied to the PFN, which would suggest "struct
page" needs to be involved.

But fundamentally the private pages are, well, private.  They can't be shared
across processes, so I think we could (should?) require the VMA to always be
MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
enough to prevent userspace and unaware kernel code from acquiring a reference
to the underlying page?

>  - Add a new GUP flag to retrive such pages from the userspace mapping.
>    Used only for private mapping population.

>  - Shared gfn ranges managed by userspace, based on hypercalls from the
>    guest.
> 
>  - Shared mappings get populated via normal VMA. Any poisoned pages here
>    would lead to SIGBUS.
> 
> So far it looks pretty straight-forward.
> 
> The only thing that I don't understand is at way point the page gets tied
> to the KVM instance. Currently we do it just before populating shadow
> entries, but it would not work with the new scheme: as we poison pages
> on fault it they may never get inserted into shadow entries. That's not
> good as we rely on the info to unpoison page on free.

Can you elaborate on what you mean by "unpoison"?  If the page is never actually
mapped into the guest, then its poisoned status is nothing more than a software
flag, i.e. nothing extra needs to be done on free.  If the page is mapped into
the guest, then KVM can be made responsible for reinitializing the page with
keyid=0 when the page is removed from the guest.

The TDX Module prevents mapping the same PFN into multiple guests, so the kernel
doesn't actually have to care _which_ KVM instance(s) is associated with a page,
it only needs to prevent installing valid PTEs in the host page tables.

> Maybe we should tie VMA to the KVM instance on setting the vmflags?
> I donno.
> 
> Any comments?
> 
> -- 
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
  2021-04-19 16:01       ` Sean Christopherson
@ 2021-04-19 16:40         ` Kirill A. Shutemov
  2021-04-19 18:09           ` Sean Christopherson
  0 siblings, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-19 16:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Jim Mattson, David Rientjes, Edgecombe, Rick P, Kleen, Andi,
	Yamahata, Isaku, Erdem Aktas, Steve Rutherford, Peter Gonda,
	David Hildenbrand, x86, kvm, linux-mm, linux-kernel

On Mon, Apr 19, 2021 at 04:01:46PM +0000, Sean Christopherson wrote:
> On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > On Fri, Apr 16, 2021 at 05:30:30PM +0000, Sean Christopherson wrote:
> > > I like the idea of using "special" PTE value to denote guest private memory,
> > > e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM involved in the
> > > manipulation of the special flag/value.
> > > 
> > > Today, userspace owns the gfn->hva translations and the kernel effectively owns
> > > the hva->pfn translations (with input from userspace).  KVM just connects the
> > > dots.
> > > 
> > > Having KVM own the shared/private transitions means KVM is now part owner of the
> > > entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary MMU
> > > and a co-owner of the primary MMU.  This creates locking madness, e.g. KVM taking
> > > mmap_sem for write, mmu_lock under page lock, etc..., and also takes control away
> > > from userspace.  E.g. userspace strategy could be to use a separate backing/pool
> > > for shared memory and change the gfn->hva translation (memslots) in reaction to
> > > a shared/private conversion.  Automatically swizzling things in KVM takes away
> > > that option.
> > > 
> > > IMO, KVM should be entirely "passive" in this process, e.g. the guest shares or
> > > protects memory, userspace calls into the kernel to change state, and the kernel
> > > manages the page tables to prevent bad actors.  KVM simply does the plumbing for
> > > the guest page tables.
> > 
> > That's a new perspective for me. Very interesting.
> > 
> > Let's see how it can look like:
> > 
> >  - KVM only allows poisoned pages (or whatever flag we end up using for
> >    protection) in the private mappings. SIGBUS otherwise.
> > 
> >  - Poisoned pages must be tied to the KVM instance to be allowed in the
> >    private mappings. Like kvm->id in the current prototype. SIGBUS
> >    otherwise.
> > 
> >  - Pages get poisoned on fault in if the VMA has a new vmflag set.
> > 
> >  - Fault in of a poisoned page leads to hwpoison entry. Userspace cannot
> >    access such pages.
> > 
> >  - Poisoned pages produced this way get unpoisoned on free.
> > 
> >  - The new VMA flag set by userspace. mprotect(2)?
> 
> Ya, or mmap(), though I'm not entirely sure a VMA flag would suffice.  The
> notion of the page being private is tied to the PFN, which would suggest "struct
> page" needs to be involved.

PG_hwpoison will be set on the page, so it's tied to pfn.

> But fundamentally the private pages, are well, private.  They can't be shared
> across processes, so I think we could (should?) require the VMA to always be
> MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
> enough to prevent userspace and unaware kernel code from acquiring a reference
> to the underlying page?

Shared pages should be fine too (you folks wanted tmpfs support).

The poisoned pages must be useless outside of the process with the blessed
struct kvm. See kvm_pfn_map in the patch.

> >  - Add a new GUP flag to retrive such pages from the userspace mapping.
> >    Used only for private mapping population.
> 
> >  - Shared gfn ranges managed by userspace, based on hypercalls from the
> >    guest.
> > 
> >  - Shared mappings get populated via normal VMA. Any poisoned pages here
> >    would lead to SIGBUS.
> > 
> > So far it looks pretty straight-forward.
> > 
> > The only thing that I don't understand is at way point the page gets tied
> > to the KVM instance. Currently we do it just before populating shadow
> > entries, but it would not work with the new scheme: as we poison pages
> > on fault it they may never get inserted into shadow entries. That's not
> > good as we rely on the info to unpoison page on free.
> 
> Can you elaborate on what you mean by "unpoison"?  If the page is never actually
> mapped into the guest, then its poisoned status is nothing more than a software
> flag, i.e. nothing extra needs to be done on free.

Normally, the poisoned flag is preserved for freed pages as it usually
indicates a hardware issue. In this case we need to return the page to
normal circulation. So we need a way to differentiate the two kinds of page
poison. The current patch does this by adding the page's pfn to kvm_pfn_map.
But this will not work if we uncouple poisoning from populating the shadow PTE.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 06/13] x86/realmode: Share trampoline area if KVM memory protection enabled
  2021-04-16 15:40 ` [RFCv2 06/13] x86/realmode: Share trampoline area if KVM memory protection enabled Kirill A. Shutemov
@ 2021-04-19 16:49   ` Dave Hansen
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2021-04-19 16:49 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	Erdem Aktas, Steve Rutherford, Peter Gonda, David Hildenbrand,
	x86, kvm, linux-mm, linux-kernel

On 4/16/21 8:40 AM, Kirill A. Shutemov wrote:
>  	/*
> -	 * If SME is active, the trampoline area will need to be in
> -	 * decrypted memory in order to bring up other processors
> +	 * If SME or KVM memory protection is active, the trampoline area will
> +	 * need to be in decrypted memory in order to bring up other processors
>  	 * successfully. This is not needed for SEV.
>  	 */
> -	if (sme_active())
> +	if (sme_active() || kvm_mem_protected())
>  		set_memory_decrypted((unsigned long)base, size >> PAGE_SHIFT);

Could you take a look at all the places you've added these:

	if (foo() || kvm_mem_protected())
		bar();

spots and see if some refactoring is in order?

I suspect that some thought about what the high-level commonalities are,
plus some thoughtful helper function names would go a long way to making
this whole thing understandable.
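
For instance, a single helper along these lines (the name is just a
strawman) would make the intent obvious at the call sites:

	/* covers both the SME and the KVM-memory-protection cases */
	static inline bool host_mem_needs_sharing(void)
	{
		return sme_active() || kvm_mem_protected();
	}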


set_memory_decrypted() as a name needs to go.  It almost needs to be
something like:

	set_memory_sharing()

or something.  The "sharing" would be between the kernel and devices
(for SME), or the guest kernel and host kernel for protected memory.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
  2021-04-19 16:40         ` Kirill A. Shutemov
@ 2021-04-19 18:09           ` Sean Christopherson
  2021-04-19 18:12             ` David Hildenbrand
  2021-04-19 18:53             ` Kirill A. Shutemov
  0 siblings, 2 replies; 31+ messages in thread
From: Sean Christopherson @ 2021-04-19 18:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Jim Mattson, David Rientjes, Edgecombe, Rick P, Kleen, Andi,
	Yamahata, Isaku, Erdem Aktas, Steve Rutherford, Peter Gonda,
	David Hildenbrand, x86, kvm, linux-mm, linux-kernel

On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> On Mon, Apr 19, 2021 at 04:01:46PM +0000, Sean Christopherson wrote:
> > But fundamentally the private pages, are well, private.  They can't be shared
> > across processes, so I think we could (should?) require the VMA to always be
> > MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
> > enough to prevent userspace and unaware kernel code from acquiring a reference
> > to the underlying page?
> 
> Shared pages should be fine too (you folks wanted tmpfs support).

Is that a conflict though?  If the private->shared conversion request is kicked
out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?

Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
shared, and dirty data can't be written back to the file.

> The poisoned pages must be useless outside of the process with the blessed
> struct kvm. See kvm_pfn_map in the patch.

The big requirement for kernel TDX support is that the pages are useless in the
host.  Regarding the guest, for TDX, the TDX Module guarantees that at most a
single KVM guest can have access to a page at any given time.  I believe the RMP
provides the same guarantees for SEV-SNP.

SEV/SEV-ES could still end up with corruption if multiple guests map the same
private page, but that's obviously not the end of the world since it's the status
quo today.  Living with that shortcoming might be a worthy tradeoff if punting
mutual exclusion between guests to firmware/hardware allows us to simplify the
kernel implementation.

> > >  - Add a new GUP flag to retrieve such pages from the userspace mapping.
> > >    Used only for private mapping population.
> > 
> > >  - Shared gfn ranges managed by userspace, based on hypercalls from the
> > >    guest.
> > > 
> > >  - Shared mappings get populated via normal VMA. Any poisoned pages here
> > >    would lead to SIGBUS.
> > > 
> > > So far it looks pretty straight-forward.
> > > 
> > > The only thing that I don't understand is at what point the page gets tied
> > > to the KVM instance. Currently we do it just before populating shadow
> > > entries, but it would not work with the new scheme: as we poison pages
> > > on fault, they may never get inserted into shadow entries. That's not
> > > good, as we rely on that info to unpoison the page on free.
> > 
> > Can you elaborate on what you mean by "unpoison"?  If the page is never actually
> > mapped into the guest, then its poisoned status is nothing more than a software
> > flag, i.e. nothing extra needs to be done on free.
> 
> > Normally, the poisoned flag is preserved for freed pages, as it usually
> > indicates a hardware issue. In this case we need to return the page to normal
> > circulation, so we need a way to differentiate the two kinds of page poison.
> > The current patch does this by adding the page's pfn to kvm_pfn_map. But that
> > will not work if we uncouple poisoning from adding to the shadow PTE.

Why use PG_hwpoison then?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
  2021-04-19 18:09           ` Sean Christopherson
@ 2021-04-19 18:12             ` David Hildenbrand
  2021-04-19 18:53             ` Kirill A. Shutemov
  1 sibling, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2021-04-19 18:12 UTC (permalink / raw)
  To: Sean Christopherson, Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Jim Mattson, David Rientjes, Edgecombe, Rick P, Kleen, Andi,
	Yamahata, Isaku, Erdem Aktas, Steve Rutherford, Peter Gonda, x86,
	kvm, linux-mm, linux-kernel

On 19.04.21 20:09, Sean Christopherson wrote:
> On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
>> On Mon, Apr 19, 2021 at 04:01:46PM +0000, Sean Christopherson wrote:
>>> But fundamentally the private pages, are well, private.  They can't be shared
>>> across processes, so I think we could (should?) require the VMA to always be
>>> MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
>>> enough to prevent userspace and unaware kernel code from acquiring a reference
>>> to the underlying page?
>>
>> Shared pages should be fine too (you folks wanted tmpfs support).
> 
> Is that a conflict though?  If the private->shared conversion request is kicked
> out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?
> 
> Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
> shared, and dirty data can't be written back to the file.
> 
>> The poisoned pages must be useless outside of the process with the blessed
>> struct kvm. See kvm_pfn_map in the patch.
> 
> The big requirement for kernel TDX support is that the pages are useless in the
> host.  Regarding the guest, for TDX, the TDX Module guarantees that at most a
> single KVM guest can have access to a page at any given time.  I believe the RMP
> provides the same guarantees for SEV-SNP.
> 
> SEV/SEV-ES could still end up with corruption if multiple guests map the same
> private page, but that's obviously not the end of the world since it's the status
> quo today.  Living with that shortcoming might be a worthy tradeoff if punting
> mutual exclusion between guests to firmware/hardware allows us to simplify the
> kernel implementation.
> 
>>>>   - Add a new GUP flag to retrieve such pages from the userspace mapping.
>>>>     Used only for private mapping population.
>>>
>>>>   - Shared gfn ranges managed by userspace, based on hypercalls from the
>>>>     guest.
>>>>
>>>>   - Shared mappings get populated via normal VMA. Any poisoned pages here
>>>>     would lead to SIGBUS.
>>>>
>>>> So far it looks pretty straight-forward.
>>>>
>>>> The only thing that I don't understand is at what point the page gets tied
>>>> to the KVM instance. Currently we do it just before populating shadow
>>>> entries, but it would not work with the new scheme: as we poison pages
>>>> on fault, they may never get inserted into shadow entries. That's not
>>>> good, as we rely on that info to unpoison the page on free.
>>>
>>> Can you elaborate on what you mean by "unpoison"?  If the page is never actually
>>> mapped into the guest, then its poisoned status is nothing more than a software
>>> flag, i.e. nothing extra needs to be done on free.
>>
>> Normally, the poisoned flag is preserved for freed pages, as it usually
>> indicates a hardware issue. In this case we need to return the page to normal
>> circulation, so we need a way to differentiate the two kinds of page poison.
>> The current patch does this by adding the page's pfn to kvm_pfn_map. But that
>> will not work if we uncouple poisoning from adding to the shadow PTE.
> 
> Why use PG_hwpoison then?
> 

I already raised that reusing PG_hwpoison is not what we want. And I
repeat: to me this all looks like a big hack; some of the things you (Sean)
propose sound cleaner, at least to me.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
  2021-04-19 18:09           ` Sean Christopherson
  2021-04-19 18:12             ` David Hildenbrand
@ 2021-04-19 18:53             ` Kirill A. Shutemov
  2021-04-19 20:09               ` Sean Christopherson
  1 sibling, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-19 18:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Jim Mattson, David Rientjes, Edgecombe, Rick P, Kleen, Andi,
	Yamahata, Isaku, Erdem Aktas, Steve Rutherford, Peter Gonda,
	David Hildenbrand, x86, kvm, linux-mm, linux-kernel

On Mon, Apr 19, 2021 at 06:09:29PM +0000, Sean Christopherson wrote:
> On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > On Mon, Apr 19, 2021 at 04:01:46PM +0000, Sean Christopherson wrote:
> > > But fundamentally the private pages, are well, private.  They can't be shared
> > > across processes, so I think we could (should?) require the VMA to always be
> > > MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
> > > enough to prevent userspace and unaware kernel code from acquiring a reference
> > > to the underlying page?
> > 
> > Shared pages should be fine too (you folks wanted tmpfs support).
> 
> Is that a conflict though?  If the private->shared conversion request is kicked
> out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?
> 
> Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
> shared, and dirty data can't be written back to the file.

It can be remapped, but faulting in the page would produce a hwpoison entry.
I don't see another way to make Google's use-case with tmpfs-backed guest
memory work.

> > The poisoned pages must be useless outside of the process with the blessed
> > struct kvm. See kvm_pfn_map in the patch.
> 
> The big requirement for kernel TDX support is that the pages are useless in the
> host.  Regarding the guest, for TDX, the TDX Module guarantees that at most a
> single KVM guest can have access to a page at any given time.  I believe the RMP
> provides the same guarantees for SEV-SNP.
> 
> SEV/SEV-ES could still end up with corruption if multiple guests map the same
> private page, but that's obviously not the end of the world since it's the status
> quo today.  Living with that shortcoming might be a worthy tradeoff if punting
> mutual exclusion between guests to firmware/hardware allows us to simplify the
> kernel implementation.

The critical question is whether we ever need to translate hva->pfn after
the page is added to guest private memory. I believe we do, but I never
checked. And that's the reason we need to keep hwpoison entries around,
which encode the pfn.

If we don't, it would simplify the solution: kvm_pfn_map is not needed.
A single bit per page would be enough.

> > > >  - Add a new GUP flag to retrieve such pages from the userspace mapping.
> > > >    Used only for private mapping population.
> > > 
> > > >  - Shared gfn ranges managed by userspace, based on hypercalls from the
> > > >    guest.
> > > > 
> > > >  - Shared mappings get populated via normal VMA. Any poisoned pages here
> > > >    would lead to SIGBUS.
> > > > 
> > > > So far it looks pretty straight-forward.
> > > > 
> > > > The only thing that I don't understand is at what point the page gets tied
> > > > to the KVM instance. Currently we do it just before populating shadow
> > > > entries, but it would not work with the new scheme: as we poison pages
> > > > on fault, they may never get inserted into shadow entries. That's not
> > > > good, as we rely on that info to unpoison the page on free.
> > > 
> > > Can you elaborate on what you mean by "unpoison"?  If the page is never actually
> > > mapped into the guest, then its poisoned status is nothing more than a software
> > > flag, i.e. nothing extra needs to be done on free.
> > 
> > Normally, the poisoned flag is preserved for freed pages, as it usually
> > indicates a hardware issue. In this case we need to return the page to normal
> > circulation, so we need a way to differentiate the two kinds of page poison.
> > The current patch does this by adding the page's pfn to kvm_pfn_map. But that
> > will not work if we uncouple poisoning from adding to the shadow PTE.
> 
> Why use PG_hwpoison then?

Page flags are scarce. I don't want to occupy a new one until I'm sure I
must.

And we can re-use existing infrastructure to SIGBUS on access to such
pages.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
  2021-04-19 18:53             ` Kirill A. Shutemov
@ 2021-04-19 20:09               ` Sean Christopherson
  2021-04-19 22:57                 ` Kirill A. Shutemov
  0 siblings, 1 reply; 31+ messages in thread
From: Sean Christopherson @ 2021-04-19 20:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Jim Mattson, David Rientjes, Edgecombe, Rick P, Kleen, Andi,
	Yamahata, Isaku, Erdem Aktas, Steve Rutherford, Peter Gonda,
	David Hildenbrand, x86, kvm, linux-mm, linux-kernel

On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> On Mon, Apr 19, 2021 at 06:09:29PM +0000, Sean Christopherson wrote:
> > On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > > On Mon, Apr 19, 2021 at 04:01:46PM +0000, Sean Christopherson wrote:
> > > > But fundamentally the private pages, are well, private.  They can't be shared
> > > > across processes, so I think we could (should?) require the VMA to always be
> > > > MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
> > > > enough to prevent userspace and unaware kernel code from acquiring a reference
> > > > to the underlying page?
> > > 
> > > Shared pages should be fine too (you folks wanted tmpfs support).
> > 
> > Is that a conflict though?  If the private->shared conversion request is kicked
> > out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?
> > 
> > Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
> > shared, and dirty data can't be written back to the file.
> 
> It can be remapped, but faulting in the page would produce a hwpoison entry.

It sounds like you're thinking the whole tmpfs file is poisoned.  My thought is
that userspace would need to do something like the following for guest private
memory:

	mmap(NULL, guest_size, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_GUEST_ONLY, fd, 0);

The MAP_GUEST_ONLY would be used by the kernel to ensure the resulting VMA can
only point at private/poisoned memory, e.g. on fault, the associated PFN would
be tagged with PG_hwpoison or whatever.  @fd in this case could point at tmpfs,
but I don't think it's a hard requirement.

On conversion to shared, userspace could then do:

	munmap(<addr>, <size>)
	mmap(<addr>, <size>, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED_NOREPLACE, fd, <offset>);

or

	mmap(<addr>, <size>, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, <offset>);

or

	ioctl(kvm, KVM_SET_USER_MEMORY_REGION, <delete private range>);
	mmap(NULL, <size>, PROT_READ|PROT_WRITE, MAP_SHARED, fd, <offset>);
	ioctl(kvm, KVM_SET_USER_MEMORY_REGION, <add shared range>);

Combinations would also work, e.g. unmap the private range and move the memslot.
The private and shared memory regions could also be backed differently, e.g.
tmpfs for shared memory, anonymous for private memory.
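
Putting the memslot variant together, the VMM side could be roughly (just a
sketch; slot_id/gpa/size/fd/offset/kvm_vm_fd are the obvious locals, and
error handling is omitted):

	struct kvm_userspace_memory_region region = {
		.slot            = slot_id,
		.guest_phys_addr = gpa,
		.memory_size     = 0,	/* size 0 deletes the memslot */
	};
	void *shared;

	/* Drop the private memslot... */
	ioctl(kvm_vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

	/* ...map the file normally... */
	shared = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);

	/* ...and add it back as a shared memslot. */
	region.memory_size    = size;
	region.userspace_addr = (__u64)(unsigned long)shared;
	ioctl(kvm_vm_fd, KVM_SET_USER_MEMORY_REGION, &region);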

> I don't see another way to make Google's use-case with tmpfs-backed guest
> memory work.

The underlying use-case is to be able to access guest memory from more than one
process, e.g. so that communication with the guest isn't limited to the VMM
process associated with the KVM instance.  By definition, guest private memory
can't be accessed by the host; I don't see how anyone, Google included, can have
any real requirements about accessing guest private memory from multiple
processes.

> > > The poisoned pages must be useless outside of the process with the blessed
> > > struct kvm. See kvm_pfn_map in the patch.
> > 
> > The big requirement for kernel TDX support is that the pages are useless in the
> > host.  Regarding the guest, for TDX, the TDX Module guarantees that at most a
> > single KVM guest can have access to a page at any given time.  I believe the RMP
> > provides the same guarantees for SEV-SNP.
> > 
> > SEV/SEV-ES could still end up with corruption if multiple guests map the same
> > private page, but that's obviously not the end of the world since it's the status
> > quo today.  Living with that shortcoming might be a worthy tradeoff if punting
> > mutual exclusion between guests to firmware/hardware allows us to simplify the
> > kernel implementation.
> 
> The critical question is whether we ever need to translate hva->pfn after
> the page is added to guest private memory. I believe we do, but I never
> checked. And that's the reason we need to keep hwpoison entries around,
> which encode the pfn.

As proposed in the TDX RFC, KVM would "need" the hva->pfn translation if the
guest private EPT entry was zapped, e.g. by NUMA balancing (which will fail on
the backend).  But in that case, KVM still has the original PFN; the "new"
translation becomes a sanity check to make sure that the zapped translation
wasn't moved unexpectedly.

Regardless, I don't see what that has to do with kvm_pfn_map.  At some point,
gup() has to fault in the page or look at the host PTE value.  For the latter,
at least on x86, we can throw info into the PTE itself to tag it as guest-only.
No matter what implementation we settle on, I think we've failed if we end up in
a situation where the primary MMU has pages it doesn't know are guest-only.
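
e.g. something along these lines -- purely illustrative, _PAGE_GUEST_ONLY
does not exist and the choice of software bit would need to be checked
against what's already taken:

	#define _PAGE_GUEST_ONLY	_PAGE_SOFTW3	/* assumes this sw bit is free */

	/* Tag a host PTE as backing guest-private memory. */
	static inline pte_t pte_mkguest_only(pte_t pte)
	{
		return pte_set_flags(pte, _PAGE_GUEST_ONLY);
	}

	static inline bool pte_guest_only(pte_t pte)
	{
		return pte_flags(pte) & _PAGE_GUEST_ONLY;
	}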

> If we don't, it would simplify the solution: kvm_pfn_map is not needed.
> A single bit per page would be enough.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
  2021-04-19 20:09               ` Sean Christopherson
@ 2021-04-19 22:57                 ` Kirill A. Shutemov
  2021-04-20 17:13                   ` Sean Christopherson
  0 siblings, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2021-04-19 22:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Jim Mattson, David Rientjes, Edgecombe, Rick P, Kleen, Andi,
	Yamahata, Isaku, Erdem Aktas, Steve Rutherford, Peter Gonda,
	David Hildenbrand, x86, kvm, linux-mm, linux-kernel

On Mon, Apr 19, 2021 at 08:09:13PM +0000, Sean Christopherson wrote:
> On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > On Mon, Apr 19, 2021 at 06:09:29PM +0000, Sean Christopherson wrote:
> > > On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > > > On Mon, Apr 19, 2021 at 04:01:46PM +0000, Sean Christopherson wrote:
> > > > > But fundamentally the private pages, are well, private.  They can't be shared
> > > > > across processes, so I think we could (should?) require the VMA to always be
> > > > > MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
> > > > > enough to prevent userspace and unaware kernel code from acquiring a reference
> > > > > to the underlying page?
> > > > 
> > > > Shared pages should be fine too (you folks wanted tmpfs support).
> > > 
> > > Is that a conflict though?  If the private->shared conversion request is kicked
> > > out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?
> > > 
> > > Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
> > > shared, and dirty data can't be written back to the file.
> > 
> > It can be remapped, but faulting in the page would produce a hwpoison entry.
> 
> It sounds like you're thinking the whole tmpfs file is poisoned.

No. The file is not poisoned. Pages get poisoned when they are faulted into
a flagged VMA. Different VMs can use the same file as long as the offsets do
not overlap.

> My thought is that userspace would need to do something like the following
> for guest private memory:
> 
> 	mmap(NULL, guest_size, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_GUEST_ONLY, fd, 0);
> 
> The MAP_GUEST_ONLY would be used by the kernel to ensure the resulting VMA can
> only point at private/poisoned memory, e.g. on fault, the associated PFN would
> be tagged with PG_hwpoison or whatever.  @fd in this case could point at tmpfs,
> but I don't think it's a hard requirement.
> 
> On conversion to shared, userspace could then do:
> 
> 	munmap(<addr>, <size>)
> 	mmap(<addr>, <size>, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED_NOREPLACE, fd, <offset>);
> 
> or
> 
> 	mmap(<addr>, <size>, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, <offset>);
> 

I played with this variant before, but initiated from the kernel. It should
work fine.

> or
> 
> 	ioctl(kvm, KVM_SET_USER_MEMORY_REGION, <delete private range>);
> 	mmap(NULL, <size>, PROT_READ|PROT_WRITE, MAP_SHARED, fd, <offset>);
> 	ioctl(kvm, KVM_SET_USER_MEMORY_REGION, <add shared range>);
> 
> Combinations would also work, e.g. unmap the private range and move the memslot.
> The private and shared memory regions could also be backed differently, e.g.
> tmpfs for shared memory, anonymous for private memory.

Right. The kernel has to be flexible enough to support any of these schemes.

> 
> > I don't see other way to make Google's use-case with tmpfs-backed guest
> > memory work.
> 
> The underlying use-case is to be able to access guest memory from more than one
> process, e.g. so that communication with the guest isn't limited to the VMM
> process associated with the KVM instance.  By definition, guest private memory
> can't be accessed by the host; I don't see how anyone, Google included, can have
> any real requirements about accessing guest private memory from multiple
> processes.
> 
> > > > The poisoned pages must be useless outside of the process with the blessed
> > > > struct kvm. See kvm_pfn_map in the patch.
> > > 
> > > The big requirement for kernel TDX support is that the pages are useless in the
> > > host.  Regarding the guest, for TDX, the TDX Module guarantees that at most a
> > > single KVM guest can have access to a page at any given time.  I believe the RMP
> > > provides the same guarantees for SEV-SNP.
> > > 
> > > SEV/SEV-ES could still end up with corruption if multiple guests map the same
> > > private page, but that's obviously not the end of the world since it's the status
> > > quo today.  Living with that shortcoming might be a worthy tradeoff if punting
> > > mutual exclusion between guests to firmware/hardware allows us to simplify the
> > > kernel implementation.
> > 
> > The critical question is whether we ever need to translate hva->pfn after
> > the page is added to guest private memory. I believe we do, but I never
> > checked. And that's the reason we need to keep hwpoison entries around,
> > which encode the pfn.
> 
> As proposed in the TDX RFC, KVM would "need" the hva->pfn translation if the
> guest private EPT entry was zapped, e.g. by NUMA balancing (which will fail on
> the backend).  But in that case, KVM still has the original PFN; the "new"
> translation becomes a sanity check to make sure that the zapped translation
> wasn't moved unexpectedly.
> 
> Regardless, I don't see what that has to do with kvm_pfn_map.  At some point,
> gup() has to fault in the page or look at the host PTE value.  For the latter,
> at least on x86, we can throw info into the PTE itself to tag it as guest-only.
> No matter what implementation we settle on, I think we've failed if we end up in
> a situation where the primary MMU has pages it doesn't know are guest-only.

I'm trying to understand whether it's a problem if KVM sees a guest-only PTE
that belongs to another VM, e.g. when two VMs try to use the same tmpfs file
as guest memory. We cannot insert the pfn into two TD/SEV guests at once, but
can it cause other problems? I'm not sure.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
  2021-04-19 22:57                 ` Kirill A. Shutemov
@ 2021-04-20 17:13                   ` Sean Christopherson
  0 siblings, 0 replies; 31+ messages in thread
From: Sean Christopherson @ 2021-04-20 17:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Jim Mattson, David Rientjes, Edgecombe, Rick P, Kleen, Andi,
	Yamahata, Isaku, Erdem Aktas, Steve Rutherford, Peter Gonda,
	David Hildenbrand, x86, kvm, linux-mm, linux-kernel

On Tue, Apr 20, 2021, Kirill A. Shutemov wrote:
> On Mon, Apr 19, 2021 at 08:09:13PM +0000, Sean Christopherson wrote:
> > On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > > The critical question is whether we ever need to translate hva->pfn after
> > > the page is added to guest private memory. I believe we do, but I never
> > > checked. And that's the reason we need to keep hwpoison entries around,
> > > which encode the pfn.
> > 
> > As proposed in the TDX RFC, KVM would "need" the hva->pfn translation if the
> > guest private EPT entry was zapped, e.g. by NUMA balancing (which will fail on
> > the backend).  But in that case, KVM still has the original PFN; the "new"
> > translation becomes a sanity check to make sure that the zapped translation
> > wasn't moved unexpectedly.
> > 
> > Regardless, I don't see what that has to do with kvm_pfn_map.  At some point,
> > gup() has to fault in the page or look at the host PTE value.  For the latter,
> > at least on x86, we can throw info into the PTE itself to tag it as guest-only.
> > No matter what implementation we settle on, I think we've failed if we end up in
> > a situation where the primary MMU has pages it doesn't know are guest-only.
> 
> I'm trying to understand whether it's a problem if KVM sees a guest-only PTE
> that belongs to another VM, e.g. when two VMs try to use the same tmpfs file
> as guest memory. We cannot insert the pfn into two TD/SEV guests at once, but
> can it cause other problems? I'm not sure.

For TDX and SNP, "firmware" will prevent assigning the same PFN to multiple VMs.

For SEV and SEV-ES, the PSP (what I'm calling "firmware") will not prevent
assigning the same page to multiple guests.  But the failure mode in that case,
assuming the guests have different ASIDs, is limited to corruption of the guest.

On the other hand, for SEV/SEV-ES it's not invalid to assign the same ASID to
multiple guests (there's an in-flight patch to do exactly that[*]), and sharing
PFNs between guests with the same ASID would also be valid.  In other words, if
we want to enforce PFN association in the kernel, I think the association should
be per-ASID, not per-KVM guest.
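
If the kernel did track it, conceptually it would be something like this
(sketch only; an xarray keyed by pfn, with ASID 0 treated as "unclaimed"):

	static DEFINE_XARRAY(private_pfn_asid);		/* pfn -> owning ASID */

	/* Enforce per-ASID, not per-VM, ownership of private pfns. */
	static int claim_private_pfn(unsigned long pfn, u32 asid)
	{
		u32 owner = xa_to_value(xa_load(&private_pfn_asid, pfn));

		if (owner && owner != asid)
			return -EBUSY;

		return xa_err(xa_store(&private_pfn_asid, pfn,
				       xa_mk_value(asid), GFP_KERNEL));
	}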

So, I don't think we _need_ to rely on the TDX/SNP behavior, but if leveraging
firmware to handle those checks means avoiding additional complexity in the
kernel, then I think it's worth leaning on firmware even if it means SEV/SEV-ES
don't enjoy the same level of robustness.

[*] https://lkml.kernel.org/r/20210408223214.2582277-1-natet@google.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFCv2 00/13] TDX and guest memory unmapping
@ 2021-04-22 14:36 Jue Wang
  0 siblings, 0 replies; 31+ messages in thread
From: Jue Wang @ 2021-04-22 14:36 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: andi.kleen, dave.hansen, david, erdemaktas, isaku.yamahata,
	jmattson, kvm, linux-kernel, linux-mm, luto, peterz, pgonda,
	rick.p.edgecombe, rientjes, seanjc, srutherford, x86

On Fri, 16 Apr 2021 18:40:53 +0300, Kirill A. Shutemov wrote:

> TDX integrity check failures may lead to system shutdown, so the host kernel
> must not allow any writes to TD-private memory. This requirement clashes with
> the KVM design: KVM expects the guest memory to be mapped into host userspace
> (e.g. QEMU).

> This patchset aims to start discussion on how we can approach the issue.

Hi Kirill,

Some potential food for thought:

Repurposing Linux page hwpoison semantics for TDX-private memory protection
is smart. However, treating PG_hwpoison or a hwpoison swap pte differently
when kvm->mem_protected=true implicitly disables the original capability of
page hwpoison: protecting the whole system from known-corrupted physical
memory and giving user space applications an opportunity to recover from
physical memory corruption.

Have you considered introducing a set of similar but independent page/pte
semantics for TDX private memory protection purposes?
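
For example, something shaped like the existing hwpoison swap entries but
kept separate from them -- SWP_GUEST_PRIVATE below is purely illustrative
and does not exist:

	/*
	 * A dedicated non-present entry type, so real memory-failure
	 * handling keeps its current semantics.
	 */
	static inline swp_entry_t make_guest_private_entry(struct page *page)
	{
		return swp_entry(SWP_GUEST_PRIVATE, page_to_pfn(page));
	}

	static inline bool is_guest_private_entry(swp_entry_t entry)
	{
		return swp_type(entry) == SWP_GUEST_PRIVATE;
	}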

Best regards,
-Jue

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, back to index

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-16 15:40 [RFCv2 00/13] TDX and guest memory unmapping Kirill A. Shutemov
2021-04-16 15:40 ` [RFCv2 01/13] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
2021-04-16 15:40 ` [RFCv2 02/13] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
2021-04-16 16:10   ` Borislav Petkov
2021-04-19 10:10     ` Kirill A. Shutemov
2021-04-16 15:40 ` [RFCv2 03/13] x86/kvm: Make DMA pages shared Kirill A. Shutemov
2021-04-16 15:40 ` [RFCv2 04/13] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
2021-04-16 16:21   ` Dave Hansen
2021-04-16 15:40 ` [RFCv2 05/13] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
2021-04-16 15:40 ` [RFCv2 06/13] x86/realmode: Share trampoline area if KVM memory protection enabled Kirill A. Shutemov
2021-04-19 16:49   ` Dave Hansen
2021-04-16 15:41 ` [RFCv2 07/13] mm: Add hwpoison_entry_to_pfn() and hwpoison_entry_to_page() Kirill A. Shutemov
2021-04-16 15:41 ` [RFCv2 08/13] mm/gup: Add FOLL_ALLOW_POISONED Kirill A. Shutemov
2021-04-16 15:41 ` [RFCv2 09/13] shmem: Fail shmem_getpage_gfp() on poisoned pages Kirill A. Shutemov
2021-04-16 15:41 ` [RFCv2 10/13] mm: Keep page reference for hwpoison entries Kirill A. Shutemov
2021-04-16 15:41 ` [RFCv2 11/13] mm: Replace hwpoison entry with present PTE if page got unpoisoned Kirill A. Shutemov
2021-04-16 15:41 ` [RFCv2 12/13] KVM: passdown struct kvm to hva_to_pfn_slow() Kirill A. Shutemov
2021-04-16 15:41 ` [RFCv2 13/13] KVM: unmap guest memory using poisoned pages Kirill A. Shutemov
2021-04-16 17:30   ` Sean Christopherson
2021-04-19 11:32     ` Xiaoyao Li
2021-04-19 14:26     ` Kirill A. Shutemov
2021-04-19 16:01       ` Sean Christopherson
2021-04-19 16:40         ` Kirill A. Shutemov
2021-04-19 18:09           ` Sean Christopherson
2021-04-19 18:12             ` David Hildenbrand
2021-04-19 18:53             ` Kirill A. Shutemov
2021-04-19 20:09               ` Sean Christopherson
2021-04-19 22:57                 ` Kirill A. Shutemov
2021-04-20 17:13                   ` Sean Christopherson
2021-04-16 16:46 ` [RFCv2 00/13] TDX and guest memory unmapping Matthew Wilcox
2021-04-22 14:36 Jue Wang
