linux-mm.kvack.org archive mirror
* [RFCv1 0/7] TDX and guest memory unmapping
@ 2021-04-02 15:26 Kirill A. Shutemov
  2021-04-02 15:26 ` [RFCv1 1/7] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-02 15:26 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

TDX integrity check failures may lead to a system shutdown, so the host
kernel must not allow any writes to TD-private memory. This requirement
clashes with the KVM design: KVM expects the guest memory to be mapped
into host userspace (e.g. QEMU).

This patchset aims to start discussion on how we can approach the issue.

The core of the change is in the last patch. Please see a more detailed
description of the issue and the proposed solution there.

The patchset can also be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git kvm-unmapped-poison

Kirill A. Shutemov (7):
  x86/mm: Move force_dma_unencrypted() to common code
  x86/kvm: Introduce KVM memory protection feature
  x86/kvm: Make DMA pages shared
  x86/kvm: Use bounce buffers for KVM memory protection
  x86/kvmclock: Share hvclock memory with the host
  x86/realmode: Share trampoline area if KVM memory protection enabled
  KVM: unmap guest memory using poisoned pages

 arch/x86/Kconfig                     |   9 +-
 arch/x86/include/asm/cpufeatures.h   |   1 +
 arch/x86/include/asm/io.h            |   4 +-
 arch/x86/include/asm/kvm_para.h      |   5 +
 arch/x86/include/asm/mem_encrypt.h   |   7 +-
 arch/x86/include/uapi/asm/kvm_para.h |   3 +-
 arch/x86/kernel/kvm.c                |  20 ++++
 arch/x86/kernel/kvmclock.c           |   2 +-
 arch/x86/kernel/pci-swiotlb.c        |   3 +-
 arch/x86/kvm/Kconfig                 |   1 +
 arch/x86/kvm/cpuid.c                 |   3 +-
 arch/x86/kvm/mmu/mmu.c               |  15 ++-
 arch/x86/kvm/mmu/paging_tmpl.h       |  10 +-
 arch/x86/kvm/x86.c                   |   6 +
 arch/x86/mm/Makefile                 |   2 +
 arch/x86/mm/mem_encrypt.c            |  74 ------------
 arch/x86/mm/mem_encrypt_common.c     |  87 ++++++++++++++
 arch/x86/mm/pat/set_memory.c         |  10 ++
 arch/x86/realmode/init.c             |   7 +-
 include/linux/kvm_host.h             |  12 ++
 include/linux/swapops.h              |  20 ++++
 include/uapi/linux/kvm_para.h        |   5 +-
 mm/gup.c                             |  31 +++--
 mm/memory.c                          |  45 +++++++-
 mm/page_vma_mapped.c                 |   8 +-
 mm/rmap.c                            |   2 +-
 mm/shmem.c                           |   7 ++
 virt/kvm/Kconfig                     |   3 +
 virt/kvm/kvm_main.c                  | 164 ++++++++++++++++++++++++---
 29 files changed, 442 insertions(+), 124 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

-- 
2.26.3



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFCv1 1/7] x86/mm: Move force_dma_unencrypted() to common code
  2021-04-02 15:26 [RFCv1 0/7] TDX and guest memory unmapping Kirill A. Shutemov
@ 2021-04-02 15:26 ` Kirill A. Shutemov
  2021-04-02 15:26 ` [RFCv1 2/7] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-02 15:26 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

force_dma_unencrypted() has to return true for a KVM guest with memory
protection enabled. Move it out of the AMD SME code.

Introduce a new config option, X86_MEM_ENCRYPT_COMMON, that has to be
selected by all x86 memory encryption features.

This is preparation for the following patches.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                 |  7 +++++-
 arch/x86/include/asm/io.h        |  4 +++-
 arch/x86/mm/Makefile             |  2 ++
 arch/x86/mm/mem_encrypt.c        | 30 -------------------------
 arch/x86/mm/mem_encrypt_common.c | 38 ++++++++++++++++++++++++++++++++
 5 files changed, 49 insertions(+), 32 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 21f851179ff0..2b4ce1722dbd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1520,14 +1520,19 @@ config X86_CPA_STATISTICS
 	  helps to determine the effectiveness of preserving large and huge
 	  page mappings when mapping protections are changed.
 
+config X86_MEM_ENCRYPT_COMMON
+	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select DYNAMIC_PHYSICAL_MASK
+	def_bool n
+
 config AMD_MEM_ENCRYPT
 	bool "AMD Secure Memory Encryption (SME) support"
 	depends on X86_64 && CPU_SUP_AMD
 	select DMA_COHERENT_POOL
-	select DYNAMIC_PHYSICAL_MASK
 	select ARCH_USE_MEMREMAP_PROT
 	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
 	select INSTRUCTION_DECODER
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d726459d08e5..6dc51b31cb0e 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -256,10 +256,12 @@ static inline void slow_down_io(void)
 
 #endif
 
-#ifdef CONFIG_AMD_MEM_ENCRYPT
 #include <linux/jump_label.h>
 
 extern struct static_key_false sev_enable_key;
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
 static inline bool sev_key_active(void)
 {
 	return static_branch_unlikely(&sev_enable_key);
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)	+= mem_encrypt_common.o
+
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index c3d5f0236f35..9ca477b9b8ba 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -15,10 +15,6 @@
 #include <linux/dma-direct.h>
 #include <linux/swiotlb.h>
 #include <linux/mem_encrypt.h>
-#include <linux/device.h>
-#include <linux/kernel.h>
-#include <linux/bitops.h>
-#include <linux/dma-mapping.h>
 
 #include <asm/tlbflush.h>
 #include <asm/fixmap.h>
@@ -390,32 +386,6 @@ bool noinstr sev_es_active(void)
 	return sev_status & MSR_AMD64_SEV_ES_ENABLED;
 }
 
-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
-{
-	/*
-	 * For SEV, all DMA must be to unencrypted addresses.
-	 */
-	if (sev_active())
-		return true;
-
-	/*
-	 * For SME, all DMA must be to unencrypted addresses if the
-	 * device does not support DMA to addresses that include the
-	 * encryption mask.
-	 */
-	if (sme_active()) {
-		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
-		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
-						dev->bus_dma_limit);
-
-		if (dma_dev_mask <= dma_enc_mask)
-			return true;
-	}
-
-	return false;
-}
-
 void __init mem_encrypt_free_decrypted_mem(void)
 {
 	unsigned long vaddr, vaddr_end, npages;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..dd791352f73f
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/mem_encrypt.h>
+#include <linux/dma-direct.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+	/*
+	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 */
+	if (sev_active())
+		return true;
+
+	/*
+	 * For SME, all DMA must be to unencrypted addresses if the
+	 * device does not support DMA to addresses that include the
+	 * encryption mask.
+	 */
+	if (sme_active()) {
+		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
+		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
+						dev->bus_dma_limit);
+
+		if (dma_dev_mask <= dma_enc_mask)
+			return true;
+	}
+
+	return false;
+}
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFCv1 2/7] x86/kvm: Introduce KVM memory protection feature
  2021-04-02 15:26 [RFCv1 0/7] TDX and guest memory unmapping Kirill A. Shutemov
  2021-04-02 15:26 ` [RFCv1 1/7] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
@ 2021-04-02 15:26 ` Kirill A. Shutemov
  2021-04-08  9:52   ` Borislav Petkov
  2021-04-02 15:26 ` [RFCv1 3/7] x86/kvm: Make DMA pages shared Kirill A. Shutemov
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-02 15:26 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Provide basic helpers, KVM_FEATURE, CPUID flag and a hypercall.

The host side doesn't provide the feature yet, so it is dead code for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/kvm_para.h      |  5 +++++
 arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
 arch/x86/kernel/kvm.c                | 18 ++++++++++++++++++
 include/uapi/linux/kvm_para.h        |  3 ++-
 5 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 84b887825f12..5b6f23e6edc4 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
 #define X86_FEATURE_SEV_ES		( 8*32+20) /* AMD Secure Encrypted Virtualization - Encrypted State */
 #define X86_FEATURE_VM_PAGE_FLUSH	( 8*32+21) /* "" VM Page Flush MSR is supported */
+#define X86_FEATURE_KVM_MEM_PROTECTED	( 8*32+22) /* KVM memory protection extension */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..74aea18f3130 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -11,11 +11,16 @@ extern void kvmclock_init(void);
 
 #ifdef CONFIG_KVM_GUEST
 bool kvm_check_and_clear_guest_paused(void);
+bool kvm_mem_protected(void);
 #else
 static inline bool kvm_check_and_clear_guest_paused(void)
 {
 	return false;
 }
+static inline bool kvm_mem_protected(void)
+{
+	return false;
+}
 #endif /* CONFIG_KVM_GUEST */
 
 #define KVM_HYPERCALL \
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 950afebfba88..8d32c41861c9 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -28,11 +28,12 @@
 #define KVM_FEATURE_PV_UNHALT		7
 #define KVM_FEATURE_PV_TLB_FLUSH	9
 #define KVM_FEATURE_ASYNC_PF_VMEXIT	10
-#define KVM_FEATURE_PV_SEND_IPI	11
+#define KVM_FEATURE_PV_SEND_IPI		11
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
 #define KVM_FEATURE_ASYNC_PF_INT	14
 #define KVM_FEATURE_MSI_EXT_DEST_ID	15
+#define KVM_FEATURE_MEM_PROTECTED	16
 
 #define KVM_HINTS_REALTIME      0
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5e78e01ca3b4..e6989e1b74eb 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -39,6 +39,13 @@
 #include <asm/ptrace.h>
 #include <asm/svm.h>
 
+static bool mem_protected;
+
+bool kvm_mem_protected(void)
+{
+	return mem_protected;
+}
+
 DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
 
 static int kvmapf = 1;
@@ -749,6 +756,17 @@ static void __init kvm_init_platform(void)
 {
 	kvmclock_init();
 	x86_platform.apic_post_init = kvm_apic_init;
+
+	if (kvm_para_has_feature(KVM_FEATURE_MEM_PROTECTED)) {
+		if (kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED)) {
+			pr_err("Failed to enable KVM memory protection\n");
+			return;
+		}
+
+		pr_info("KVM memory protection enabled\n");
+		mem_protected = true;
+		setup_force_cpu_cap(X86_FEATURE_KVM_MEM_PROTECTED);
+	}
 }
 
 #if defined(CONFIG_AMD_MEM_ENCRYPT)
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 8b86609849b9..1a216f32e572 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -27,8 +27,9 @@
 #define KVM_HC_MIPS_EXIT_VM		7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
 #define KVM_HC_CLOCK_PAIRING		9
-#define KVM_HC_SEND_IPI		10
+#define KVM_HC_SEND_IPI			10
 #define KVM_HC_SCHED_YIELD		11
+#define KVM_HC_ENABLE_MEM_PROTECTED	12
 
 /*
  * hypercalls use architecture specific
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFCv1 3/7] x86/kvm: Make DMA pages shared
  2021-04-02 15:26 [RFCv1 0/7] TDX and guest memory unmapping Kirill A. Shutemov
  2021-04-02 15:26 ` [RFCv1 1/7] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
  2021-04-02 15:26 ` [RFCv1 2/7] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
@ 2021-04-02 15:26 ` Kirill A. Shutemov
  2021-04-02 15:26 ` [RFCv1 4/7] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-02 15:26 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Make force_dma_unencrypted() return true for KVM to get DMA pages mapped
as shared.

__set_memory_enc_dec() now informs the host via a hypercall when the state
of a page changes between private and shared. For now only the
private-to-shared conversion is implemented; the reverse direction is left
as a TODO.
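
For reference, a typical guest-side caller would look roughly like this
(an illustrative sketch, not part of the patch; error handling is omitted
and the allocation is made up):

	unsigned long vaddr = __get_free_page(GFP_KERNEL);

	/*
	 * Under KVM memory protection this ends up in __set_memory_enc_dec(),
	 * which issues the KVM_HC_MEM_SHARE hypercall for the page.
	 */
	if (vaddr)
		set_memory_decrypted(vaddr, 1);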

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/mm/mem_encrypt_common.c |  5 +++--
 arch/x86/mm/pat/set_memory.c     | 10 ++++++++++
 include/uapi/linux/kvm_para.h    |  1 +
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2b4ce1722dbd..d197b3beb904 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -811,6 +811,7 @@ config KVM_GUEST
 	select PARAVIRT_CLOCK
 	select ARCH_CPUIDLE_HALTPOLL
 	select X86_HV_CALLBACK_VECTOR
+	select X86_MEM_ENCRYPT_COMMON
 	default y
 	help
 	  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index dd791352f73f..6bf0718bb72a 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,14 +10,15 @@
 #include <linux/mm.h>
 #include <linux/mem_encrypt.h>
 #include <linux/dma-direct.h>
+#include <asm/kvm_para.h>
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
 {
 	/*
-	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 * For SEV and KVM, all DMA must be to unencrypted/shared addresses.
 	 */
-	if (sev_active())
+	if (sev_active() || kvm_mem_protected())
 		return true;
 
 	/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..4b312d80033d 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,7 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/kvm_para.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -1977,6 +1978,15 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	struct cpa_data cpa;
 	int ret;
 
+	if (kvm_mem_protected()) {
+		/* TODO: Unsharing memory back */
+		if (WARN_ON_ONCE(enc))
+			return -ENOSYS;
+
+		return kvm_hypercall2(KVM_HC_MEM_SHARE,
+				      __pa(addr) >> PAGE_SHIFT, numpages);
+	}
+
 	/* Nothing to do if memory encryption is not active */
 	if (!mem_encrypt_active())
 		return 0;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 1a216f32e572..09d36683ee0a 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -30,6 +30,7 @@
 #define KVM_HC_SEND_IPI			10
 #define KVM_HC_SCHED_YIELD		11
 #define KVM_HC_ENABLE_MEM_PROTECTED	12
+#define KVM_HC_MEM_SHARE		13
 
 /*
  * hypercalls use architecture specific
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFCv1 4/7] x86/kvm: Use bounce buffers for KVM memory protection
  2021-04-02 15:26 [RFCv1 0/7] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2021-04-02 15:26 ` [RFCv1 3/7] x86/kvm: Make DMA pages shared Kirill A. Shutemov
@ 2021-04-02 15:26 ` Kirill A. Shutemov
  2021-04-02 15:26 ` [RFCv1 5/7] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-02 15:26 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

Mirroring SEV, always use SWIOTLB if KVM memory protection is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/mem_encrypt.h |  7 +++--
 arch/x86/kernel/kvm.c              |  2 ++
 arch/x86/kernel/pci-swiotlb.c      |  3 +-
 arch/x86/mm/mem_encrypt.c          | 44 ---------------------------
 arch/x86/mm/mem_encrypt_common.c   | 48 ++++++++++++++++++++++++++++++
 6 files changed, 57 insertions(+), 48 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d197b3beb904..c51d14db5620 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -812,6 +812,7 @@ config KVM_GUEST
 	select ARCH_CPUIDLE_HALTPOLL
 	select X86_HV_CALLBACK_VECTOR
 	select X86_MEM_ENCRYPT_COMMON
+	select SWIOTLB
 	default y
 	help
 	  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index 31c4df123aa0..a748b30c2f23 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -47,10 +47,8 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size);
 
 void __init mem_encrypt_free_decrypted_mem(void);
 
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void);
-
 void __init sev_es_init_vc_handling(void);
+
 bool sme_active(void);
 bool sev_active(void);
 bool sev_es_active(void);
@@ -91,6 +89,9 @@ static inline void mem_encrypt_free_decrypted_mem(void) { }
 
 #endif	/* CONFIG_AMD_MEM_ENCRYPT */
 
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void);
+
 /*
  * The __sme_pa() and __sme_pa_nodebug() macros are meant for use when
  * writing to or comparing values from the cr3 register.  Having the
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e6989e1b74eb..45aee29e4294 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -26,6 +26,7 @@
 #include <linux/kprobes.h>
 #include <linux/nmi.h>
 #include <linux/swait.h>
+#include <linux/swiotlb.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -766,6 +767,7 @@ static void __init kvm_init_platform(void)
 		pr_info("KVM memory protection enabled\n");
 		mem_protected = true;
 		setup_force_cpu_cap(X86_FEATURE_KVM_MEM_PROTECTED);
+		swiotlb_force = SWIOTLB_FORCE;
 	}
 }
 
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..814060a6ceb0 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -13,6 +13,7 @@
 #include <asm/dma.h>
 #include <asm/xen/swiotlb-xen.h>
 #include <asm/iommu_table.h>
+#include <asm/kvm_para.h>
 
 int swiotlb __read_mostly;
 
@@ -49,7 +50,7 @@ int __init pci_swiotlb_detect_4gb(void)
 	 * buffers are allocated and used for devices that do not support
 	 * the addressing range required for the encryption mask.
 	 */
-	if (sme_active())
+	if (sme_active() || kvm_mem_protected())
 		swiotlb = 1;
 
 	return swiotlb;
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 9ca477b9b8ba..3478f20fb46f 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -409,47 +409,3 @@ void __init mem_encrypt_free_decrypted_mem(void)
 
 	free_init_pages("unused decrypted", vaddr, vaddr_end);
 }
-
-static void print_mem_encrypt_feature_info(void)
-{
-	pr_info("AMD Memory Encryption Features active:");
-
-	/* Secure Memory Encryption */
-	if (sme_active()) {
-		/*
-		 * SME is mutually exclusive with any of the SEV
-		 * features below.
-		 */
-		pr_cont(" SME\n");
-		return;
-	}
-
-	/* Secure Encrypted Virtualization */
-	if (sev_active())
-		pr_cont(" SEV");
-
-	/* Encrypted Register State */
-	if (sev_es_active())
-		pr_cont(" SEV-ES");
-
-	pr_cont("\n");
-}
-
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
-{
-	if (!sme_me_mask)
-		return;
-
-	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-	swiotlb_update_mem_attributes();
-
-	/*
-	 * With SEV, we need to unroll the rep string I/O instructions.
-	 */
-	if (sev_active())
-		static_branch_enable(&sev_enable_key);
-
-	print_mem_encrypt_feature_info();
-}
-
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 6bf0718bb72a..351b77361a5d 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -11,6 +11,7 @@
 #include <linux/mem_encrypt.h>
 #include <linux/dma-direct.h>
 #include <asm/kvm_para.h>
+#include <asm/mem_encrypt.h>
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
@@ -37,3 +38,50 @@ bool force_dma_unencrypted(struct device *dev)
 
 	return false;
 }
+
+static void print_mem_encrypt_feature_info(void)
+{
+	if (kvm_mem_protected()) {
+		pr_info("KVM memory protection enabled\n");
+		return;
+	}
+
+	pr_info("AMD Memory Encryption Features active:");
+
+	/* Secure Memory Encryption */
+	if (sme_active()) {
+		/*
+		 * SME is mutually exclusive with any of the SEV
+		 * features below.
+		 */
+		pr_cont(" SME\n");
+		return;
+	}
+
+	/* Secure Encrypted Virtualization */
+	if (sev_active())
+		pr_cont(" SEV");
+
+	/* Encrypted Register State */
+	if (sev_es_active())
+		pr_cont(" SEV-ES");
+
+	pr_cont("\n");
+}
+
+void __init mem_encrypt_init(void)
+{
+	if (!sme_me_mask && !kvm_mem_protected())
+		return;
+
+	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
+	swiotlb_update_mem_attributes();
+
+	/*
+	 * With SEV, we need to unroll the rep string I/O instructions.
+	 */
+	if (sev_active())
+		static_branch_enable(&sev_enable_key);
+
+	print_mem_encrypt_feature_info();
+}
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFCv1 5/7] x86/kvmclock: Share hvclock memory with the host
  2021-04-02 15:26 [RFCv1 0/7] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2021-04-02 15:26 ` [RFCv1 4/7] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
@ 2021-04-02 15:26 ` Kirill A. Shutemov
  2021-04-02 15:26 ` [RFCv1 6/7] x86/realmode: Share trampoline area if KVM memory protection enabled Kirill A. Shutemov
  2021-04-02 15:26 ` [RFCv1 7/7] KVM: unmap guest memory using poisoned pages Kirill A. Shutemov
  6 siblings, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-02 15:26 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

hvclock is shared between the guest and the hypervisor, so it has to be
accessible by the host.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/kvmclock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index aa593743acf6..3d004b278dba 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -252,7 +252,7 @@ static void __init kvmclock_init_mem(void)
 	 * hvclock is shared between the guest and the hypervisor, must
 	 * be mapped decrypted.
 	 */
-	if (sev_active()) {
+	if (sev_active() || kvm_mem_protected()) {
 		r = set_memory_decrypted((unsigned long) hvclock_mem,
 					 1UL << order);
 		if (r) {
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFCv1 6/7] x86/realmode: Share trampoline area if KVM memory protection enabled
  2021-04-02 15:26 [RFCv1 0/7] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2021-04-02 15:26 ` [RFCv1 5/7] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
@ 2021-04-02 15:26 ` Kirill A. Shutemov
  2021-04-02 15:26 ` [RFCv1 7/7] KVM: unmap guest memory using poisoned pages Kirill A. Shutemov
  6 siblings, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-02 15:26 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

If KVM memory protection is active, the trampoline area will need to be
in shared memory.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/realmode/init.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 22fda7d99159..f3b54b5da693 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -10,6 +10,7 @@
 #include <asm/tlbflush.h>
 #include <asm/crash.h>
 #include <asm/sev-es.h>
+#include <asm/kvm_para.h>
 
 struct real_mode_header *real_mode_header;
 u32 *trampoline_cr4_features;
@@ -75,11 +76,11 @@ static void __init setup_real_mode(void)
 	base = (unsigned char *)real_mode_header;
 
 	/*
-	 * If SME is active, the trampoline area will need to be in
-	 * decrypted memory in order to bring up other processors
+	 * If SME or KVM memory protection is active, the trampoline area will
+	 * need to be in decrypted memory in order to bring up other processors
 	 * successfully. This is not needed for SEV.
 	 */
-	if (sme_active())
+	if (sme_active() || kvm_mem_protected())
 		set_memory_decrypted((unsigned long)base, size >> PAGE_SHIFT);
 
 	memcpy(base, real_mode_blob, size);
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-02 15:26 [RFCv1 0/7] TDX and guest memory unmapping Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2021-04-02 15:26 ` [RFCv1 6/7] x86/realmode: Share trampoline area if KVM memory protection enabled Kirill A. Shutemov
@ 2021-04-02 15:26 ` Kirill A. Shutemov
  2021-04-06  7:44   ` David Hildenbrand
  2021-04-07 14:55   ` David Hildenbrand
  6 siblings, 2 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-02 15:26 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

The TDX architecture aims to provide resiliency against confidentiality and
integrity attacks. Towards this goal, it enforces memory integrity for all
TD-private memory.

The CPU memory controller computes the integrity check value (MAC) for
the data (cache line) during writes, and it stores the MAC with the
memory as meta-data. A 28-bit MAC is stored in the ECC bits.

Memory integrity is checked during memory reads. If the integrity check
fails, the CPU poisons the cache line.

On a subsequent consumption (read) of the poisoned data by software,
there are two possible scenarios:

 - Core determines that execution can continue, and it treats the poison
   with exception semantics, signaled as a #MCE

 - Core determines that execution cannot continue, and it does an
   unbreakable shutdown

For more details, see Chapter 14 of the Intel TDX Module EAS [1].

As some integrity check failures may lead to a system shutdown, the host
kernel must not allow any writes to TD-private memory. This requirement
clashes with the KVM design: KVM expects the guest memory to be mapped into
host userspace (e.g. QEMU).

This patch aims to start discussion on how we can approach the issue.

For now I intentionally keep TDX out of the picture here and try to find a
generic way to unmap KVM guest memory from host userspace. Hopefully, this
makes the patch more approachable and lets anyone try it out.

To the proposal:

Looking into existing code paths, I've discovered that we already have the
semantics we want. That's PG_hwpoison'ed pages and SWP_HWPOISON swap
entries in page tables:

  - If an application touches a page mapped with SWP_HWPOISON, it will
    get SIGBUS.

  - GUP will fail with -EFAULT.

Accessing the poisoned memory via the page cache doesn't match the required
semantics right now, but it shouldn't be too hard to make it work: access
to poisoned dirty pages should give -EIO or -EHWPOISON.

My idea is that we can mark a page as poisoned when we make it TD-private
and replace all PTEs that map the page with SWP_HWPOISON entries.
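
As a rough illustration of the conversion step, here is a simplified,
hypothetical sketch of what the KVM MMU change below does (locking against
the shared-ranges check is omitted):

	static void poison_and_unmap(struct page *page)
	{
		lock_page(page);
		/*
		 * Mark the page PG_hwpoison; try_to_unmap() then replaces
		 * every PTE that maps it with a SWP_HWPOISON swap entry.
		 */
		if (!TestSetPageHWPoison(page))
			try_to_unmap(page, TTU_IGNORE_MLOCK);
		unlock_page(page);
	}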

The patch is proof-of-concept and has known issues:

  - Limited to swap-backed pages for now: anon or tmpfs/shmem

  - No THP support

  - Need a new FOLL_XXX flag to access such pages from KVM code.

  - Page unpoisoning is not implemented. It proved to be more difficult
    than I expected. I'm looking into a solution.

  - Poisoned pages must be tied to the KVM instance, and another KVM must
    not be able to map the page into a guest.

[1] https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf

Not-signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kvm/Kconfig           |   1 +
 arch/x86/kvm/cpuid.c           |   3 +-
 arch/x86/kvm/mmu/mmu.c         |  15 ++-
 arch/x86/kvm/mmu/paging_tmpl.h |  10 +-
 arch/x86/kvm/x86.c             |   6 ++
 include/linux/kvm_host.h       |  12 +++
 include/linux/swapops.h        |  20 ++++
 include/uapi/linux/kvm_para.h  |   1 +
 mm/gup.c                       |  31 ++++---
 mm/memory.c                    |  45 ++++++++-
 mm/page_vma_mapped.c           |   8 +-
 mm/rmap.c                      |   2 +-
 mm/shmem.c                     |   7 ++
 virt/kvm/Kconfig               |   3 +
 virt/kvm/kvm_main.c            | 164 +++++++++++++++++++++++++++++----
 15 files changed, 290 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 7ac592664c52..b7db1c455e7c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -46,6 +46,7 @@ config KVM
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
 	select SRCU
+	select HAVE_KVM_PROTECTED_MEMORY
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 38172ca627d3..1457692c1080 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -796,7 +796,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			     (1 << KVM_FEATURE_PV_SEND_IPI) |
 			     (1 << KVM_FEATURE_POLL_CONTROL) |
 			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
-			     (1 << KVM_FEATURE_ASYNC_PF_INT);
+			     (1 << KVM_FEATURE_ASYNC_PF_INT) |
+			     (1 << KVM_FEATURE_MEM_PROTECTED);
 
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d16481aa29d..53a69c8c59f1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -43,6 +43,7 @@
 #include <linux/hash.h>
 #include <linux/kern_levels.h>
 #include <linux/kthread.h>
+#include <linux/rmap.h>
 
 #include <asm/page.h>
 #include <asm/memtype.h>
@@ -2758,7 +2759,8 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 	if (sp->role.level > PG_LEVEL_4K)
 		return;
 
-	__direct_pte_prefetch(vcpu, sp, sptep);
+	if (!vcpu->kvm->mem_protected)
+		__direct_pte_prefetch(vcpu, sp, sptep);
 }
 
 static int host_pfn_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
@@ -3723,6 +3725,17 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
 		return r;
 
+	if (vcpu->kvm->mem_protected && unlikely(!is_noslot_pfn(pfn)) &&
+	    !gfn_is_shared(vcpu->kvm, gfn)) {
+		struct page *page = pfn_to_page(pfn);
+		lock_page(page);
+		VM_BUG_ON_PAGE(!PageSwapBacked(page) && !PageReserved(page), page);
+		/* Recheck gfn_is_shared() under page lock */
+		if (!gfn_is_shared(vcpu->kvm, gfn) && !TestSetPageHWPoison(page))
+			try_to_unmap(page, TTU_IGNORE_MLOCK);
+		unlock_page(page);
+	}
+
 	r = RET_PF_RETRY;
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 50e268eb8e1a..26b0494a1207 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -397,8 +397,14 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 			goto error;
 
 		ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
-		if (unlikely(__get_user(pte, ptep_user)))
-			goto error;
+		if (vcpu->kvm->mem_protected) {
+			if (copy_from_guest(vcpu->kvm, &pte, host_addr + offset,
+					    sizeof(pte)))
+				goto error;
+		} else {
+			if (unlikely(__get_user(pte, ptep_user)))
+				goto error;
+		}
 		walker->ptep_user[walker->level - 1] = ptep_user;
 
 		trace_kvm_mmu_paging_element(pte, walker->level);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b404e4d7dd8..f8183386abe7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8170,6 +8170,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		kvm_sched_yield(vcpu->kvm, a0);
 		ret = 0;
 		break;
+	case KVM_HC_ENABLE_MEM_PROTECTED:
+		ret = kvm_protect_memory(vcpu->kvm);
+		break;
+	case KVM_HC_MEM_SHARE:
+		ret = kvm_share_memory(vcpu->kvm, a0, a1);
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f3b1013fb22c..f941bcbefb79 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -436,6 +436,8 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 }
 #endif
 
+#define KVM_NR_SHARED_RANGES 32
+
 /*
  * Note:
  * memslots are not sorted by id anymore, please use id_to_memslot()
@@ -513,6 +515,9 @@ struct kvm {
 	pid_t userspace_pid;
 	unsigned int max_halt_poll_ns;
 	u32 dirty_ring_size;
+	bool mem_protected;
+	int nr_shared_ranges;
+	struct range shared_ranges[KVM_NR_SHARED_RANGES];
 };
 
 #define kvm_err(fmt, ...) \
@@ -709,6 +714,10 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm);
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 				   struct kvm_memory_slot *slot);
 
+int kvm_protect_memory(struct kvm *kvm);
+int kvm_share_memory(struct kvm *kvm, unsigned long gfn, unsigned long npages);
+bool gfn_is_shared(struct kvm *kvm, unsigned long gfn);
+
 int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
 			    struct page **pages, int nr_pages);
 
@@ -718,6 +727,9 @@ unsigned long gfn_to_hva_prot(struct kvm *kvm, gfn_t gfn, bool *writable);
 unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
 unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
 				      bool *writable);
+int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len);
+int copy_to_guest(struct kvm *kvm, unsigned long hva, const void *data, int len);
+
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 void kvm_set_page_accessed(struct page *page);
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d9b7c9132c2f..520589b12fb3 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -323,6 +323,16 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
 	return swp_type(entry) == SWP_HWPOISON;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+	return swp_offset(entry);
+}
+
+static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
+{
+	return pfn_to_page(hwpoison_entry_to_pfn(entry));
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 	atomic_long_inc(&num_poisoned_pages);
@@ -345,6 +355,16 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
 	return 0;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+	return 0;
+}
+
+static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
+{
+	return NULL;
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 }
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 09d36683ee0a..743e621111f0 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -17,6 +17,7 @@
 #define KVM_E2BIG		E2BIG
 #define KVM_EPERM		EPERM
 #define KVM_EOPNOTSUPP		95
+#define KVM_EINTR		EINTR
 
 #define KVM_HC_VAPIC_POLL_IRQ		1
 #define KVM_HC_MMU_OP			2
diff --git a/mm/gup.c b/mm/gup.c
index e4c224cd9661..ce4fdf213455 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -384,22 +384,31 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
 	pte = *ptep;
 	if (!pte_present(pte)) {
-		swp_entry_t entry;
+		swp_entry_t entry = pte_to_swp_entry(pte);
+
+		if (pte_none(pte))
+			goto no_page;
+
 		/*
 		 * KSM's break_ksm() relies upon recognizing a ksm page
 		 * even while it is being migrated, so for that case we
 		 * need migration_entry_wait().
 		 */
-		if (likely(!(flags & FOLL_MIGRATION)))
-			goto no_page;
-		if (pte_none(pte))
-			goto no_page;
-		entry = pte_to_swp_entry(pte);
-		if (!is_migration_entry(entry))
-			goto no_page;
-		pte_unmap_unlock(ptep, ptl);
-		migration_entry_wait(mm, pmd, address);
-		goto retry;
+		if (is_migration_entry(entry) && (flags & FOLL_MIGRATION)) {
+			pte_unmap_unlock(ptep, ptl);
+			migration_entry_wait(mm, pmd, address);
+			goto retry;
+		}
+
+		if (is_hwpoison_entry(entry)) {
+			page = hwpoison_entry_to_page(entry);
+			if (PageHWPoison(page) /* && (flags & FOLL_ALLOW_POISONED) */) {
+				get_page(page);
+				goto out;
+			}
+		}
+
+		goto no_page;
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..524dce15a087 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -767,6 +767,9 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pte = pte_swp_mkuffd_wp(pte);
 			set_pte_at(src_mm, addr, src_pte, pte);
 		}
+	} else if (is_hwpoison_entry(entry)) {
+		page = hwpoison_entry_to_page(entry);
+		get_page(page);
 	}
 	set_pte_at(dst_mm, addr, dst_pte, pte);
 	return 0;
@@ -1305,6 +1308,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			page = migration_entry_to_page(entry);
 			rss[mm_counter(page)]--;
+
+		} else if (is_hwpoison_entry(entry)) {
+			put_page(hwpoison_entry_to_page(entry));
 		}
 		if (unlikely(!free_swap_and_cache(entry)))
 			print_bad_pte(vma, addr, ptent, NULL);
@@ -3274,7 +3280,43 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			vmf->page = device_private_entry_to_page(entry);
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
 		} else if (is_hwpoison_entry(entry)) {
-			ret = VM_FAULT_HWPOISON;
+			page = hwpoison_entry_to_page(entry);
+
+			locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+			if (!locked) {
+				ret = VM_FAULT_RETRY;
+				goto out;
+			}
+
+			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+						       vmf->address, &vmf->ptl);
+
+			if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+				ret = 0;
+			} else if (PageHWPoison(page)) {
+				ret = VM_FAULT_HWPOISON;
+			} else {
+				/*
+				 * The page is unpoisoned. Replace hwpoison
+				 * entry with a present PTE.
+				 */
+
+				inc_mm_counter(vma->vm_mm, mm_counter(page));
+				pte = mk_pte(page, vma->vm_page_prot);
+
+				if (PageAnon(page)) {
+					page_add_anon_rmap(page, vma,
+							   vmf->address, false);
+				} else {
+					page_add_file_rmap(page, false);
+				}
+
+				set_pte_at(vma->vm_mm, vmf->address,
+					   vmf->pte, pte);
+			}
+
+			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			unlock_page(page);
 		} else {
 			print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
@@ -3282,7 +3324,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out;
 	}
 
-
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	page = lookup_swap_cache(entry, vma, vmf->address);
 	swapcache = page;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 86e3a3688d59..8fffae175104 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -93,10 +93,12 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 			return false;
 		entry = pte_to_swp_entry(*pvmw->pte);
 
-		if (!is_migration_entry(entry))
+		if (is_migration_entry(entry))
+			pfn = migration_entry_to_pfn(entry);
+		else if (is_hwpoison_entry(entry))
+			pfn = hwpoison_entry_to_pfn(entry);
+		else
 			return false;
-
-		pfn = migration_entry_to_pfn(entry);
 	} else if (is_swap_pte(*pvmw->pte)) {
 		swp_entry_t entry;
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 08c56aaf72eb..f08d1fc28522 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1575,7 +1575,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				dec_mm_counter(mm, mm_counter(page));
 				set_pte_at(mm, address, pvmw.pte, pteval);
 			}
-
+			get_page(page);
 		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
 			/*
 			 * The guest indicated that the page content is of no
diff --git a/mm/shmem.c b/mm/shmem.c
index 7c6b6d8f6c39..d29a0c9be19c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1832,6 +1832,13 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 
 	if (page)
 		hindex = page->index;
+
+	if (page && PageHWPoison(page)) {
+		unlock_page(page);
+		put_page(page);
+		return -EIO;
+	}
+
 	if (page && sgp == SGP_WRITE)
 		mark_page_accessed(page);
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 1c37ccd5d402..50d7422386aa 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -63,3 +63,6 @@ config HAVE_KVM_NO_POLL
 
 config KVM_XFER_TO_GUEST_WORK
        bool
+
+config HAVE_KVM_PROTECTED_MEMORY
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8367d88ce39b..f182c54bfa34 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -51,6 +51,7 @@
 #include <linux/io.h>
 #include <linux/lockdep.h>
 #include <linux/kthread.h>
+#include <linux/rmap.h>
 
 #include <asm/processor.h>
 #include <asm/ioctl.h>
@@ -2333,19 +2334,85 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
-static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
-				 void *data, int offset, int len)
+int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len)
+{
+	int offset = offset_in_page(hva);
+	struct page *page;
+	int npages, seg;
+	void *vaddr;
+
+	if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
+	    !kvm->mem_protected) {
+		return __copy_from_user(data, (void __user *)hva, len);
+	}
+
+	might_fault();
+	kasan_check_write(data, len);
+	check_object_size(data, len, false);
+
+	while ((seg = next_segment(len, offset)) != 0) {
+		npages = get_user_pages_unlocked(hva, 1, &page, 0);
+		if (npages != 1)
+			return -EFAULT;
+
+		vaddr = kmap_atomic(page);
+		memcpy(data, vaddr + offset, seg);
+		kunmap_atomic(vaddr);
+
+		put_page(page);
+		len -= seg;
+		hva += seg;
+		data += seg;
+		offset = 0;
+	}
+
+	return 0;
+}
+
+int copy_to_guest(struct kvm *kvm, unsigned long hva, const void *data, int len)
+{
+	int offset = offset_in_page(hva);
+	struct page *page;
+	int npages, seg;
+	void *vaddr;
+
+	if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
+	    !kvm->mem_protected) {
+		return __copy_to_user((void __user *)hva, data, len);
+	}
+
+	might_fault();
+	kasan_check_read(data, len);
+	check_object_size(data, len, true);
+
+	while ((seg = next_segment(len, offset)) != 0) {
+		npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE);
+		if (npages != 1)
+			return -EFAULT;
+
+		vaddr = kmap_atomic(page);
+		memcpy(vaddr + offset, data, seg);
+		kunmap_atomic(vaddr);
+
+		put_page(page);
+		len -= seg;
+		hva += seg;
+		data += seg;
+		offset = 0;
+	}
+
+	return 0;
+}
+
+static int __kvm_read_guest_page(struct kvm *kvm, struct kvm_memory_slot *slot,
+				 gfn_t gfn, void *data, int offset, int len)
 {
-	int r;
 	unsigned long addr;
 
 	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
-	r = __copy_from_user(data, (void __user *)addr + offset, len);
-	if (r)
-		return -EFAULT;
-	return 0;
+	return copy_from_guest(kvm, data, addr + offset, len);
 }
 
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
@@ -2353,7 +2420,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_page);
 
@@ -2362,7 +2429,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
 
@@ -2444,7 +2511,8 @@ static int __kvm_write_guest_page(struct kvm *kvm,
 	addr = gfn_to_hva_memslot(memslot, gfn);
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
-	r = __copy_to_user((void __user *)addr + offset, data, len);
+
+	r = copy_to_guest(kvm, addr + offset, data, len);
 	if (r)
 		return -EFAULT;
 	mark_page_dirty_in_slot(kvm, memslot, gfn);
@@ -2581,7 +2649,7 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 	if (unlikely(!ghc->memslot))
 		return kvm_write_guest(kvm, gpa, data, len);
 
-	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
+	r = copy_to_guest(kvm, ghc->hva + offset, data, len);
 	if (r)
 		return -EFAULT;
 	mark_page_dirty_in_slot(kvm, ghc->memslot, gpa >> PAGE_SHIFT);
@@ -2602,7 +2670,6 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 				 unsigned long len)
 {
 	struct kvm_memslots *slots = kvm_memslots(kvm);
-	int r;
 	gpa_t gpa = ghc->gpa + offset;
 
 	BUG_ON(len + offset > ghc->len);
@@ -2618,11 +2685,7 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 	if (unlikely(!ghc->memslot))
 		return kvm_read_guest(kvm, gpa, data, len);
 
-	r = __copy_from_user(data, (void __user *)ghc->hva + offset, len);
-	if (r)
-		return -EFAULT;
-
-	return 0;
+	return copy_from_guest(kvm, data, ghc->hva + offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_offset_cached);
 
@@ -2688,6 +2751,73 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
 
+int kvm_protect_memory(struct kvm *kvm)
+{
+	if (mmap_write_lock_killable(kvm->mm))
+		return -KVM_EINTR;
+	kvm->mem_protected = true;
+	kvm_arch_flush_shadow_all(kvm);
+	mmap_write_unlock(kvm->mm);
+
+	return 0;
+}
+
+bool gfn_is_shared(struct kvm *kvm, unsigned long gfn)
+{
+	bool ret = false;
+	int i;
+
+	spin_lock(&kvm->mmu_lock);
+	for (i = 0; i < kvm->nr_shared_ranges; i++) {
+		if (gfn < kvm->shared_ranges[i].start)
+			continue;
+		if (gfn >= kvm->shared_ranges[i].end)
+			continue;
+
+		ret = true;
+		break;
+	}
+	spin_unlock(&kvm->mmu_lock);
+
+	return ret;
+}
+
+int kvm_share_memory(struct kvm *kvm, unsigned long gfn, unsigned long npages)
+{
+	unsigned long end = gfn + npages;
+
+	if (!npages)
+		return 0;
+
+	/*
+	 * Out of slots.
+	 * Still worth to proceed: the new range may merge with an existing
+	 * one.
+	 */
+	WARN_ON_ONCE(kvm->nr_shared_ranges == ARRAY_SIZE(kvm->shared_ranges));
+
+	spin_lock(&kvm->mmu_lock);
+	kvm->nr_shared_ranges = add_range_with_merge(kvm->shared_ranges,
+						ARRAY_SIZE(kvm->shared_ranges),
+						kvm->nr_shared_ranges, gfn, end);
+	kvm->nr_shared_ranges = clean_sort_range(kvm->shared_ranges,
+					    ARRAY_SIZE(kvm->shared_ranges));
+	spin_unlock(&kvm->mmu_lock);
+
+	for (; gfn < end; gfn++) {
+		struct page *page = gfn_to_page(kvm, gfn);
+
+		if (page == KVM_ERR_PTR_BAD_PAGE)
+			continue;
+		lock_page(page);
+		ClearPageHWPoison(page);
+		unlock_page(page);
+		put_page(page);
+	}
+
+	return 0;
+}
+
 void kvm_sigset_activate(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu->sigset_active)
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-02 15:26 ` [RFCv1 7/7] KVM: unmap guest memory using poisoned pages Kirill A. Shutemov
@ 2021-04-06  7:44   ` David Hildenbrand
  2021-04-06 10:50     ` Kirill A. Shutemov
  2021-04-06 14:33     ` Dave Hansen
  2021-04-07 14:55   ` David Hildenbrand
  1 sibling, 2 replies; 27+ messages in thread
From: David Hildenbrand @ 2021-04-06  7:44 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On 02.04.21 17:26, Kirill A. Shutemov wrote:
> TDX architecture aims to provide resiliency against confidentiality and
> integrity attacks. Towards this goal, the TDX architecture helps enforce
> the enabling of memory integrity for all TD-private memory.
> 
> The CPU memory controller computes the integrity check value (MAC) for
> the data (cache line) during writes, and it stores the MAC with the
> memory as meta-data. A 28-bit MAC is stored in the ECC bits.
> 
> Checking of memory integrity is performed during memory reads. If
> integrity check fails, CPU poisones cache line.
> 
> On a subsequent consumption (read) of the poisoned data by software,
> there are two possible scenarios:
> 
>   - Core determines that the execution can continue and it treats
>     poison with exception semantics signaled as a #MCE
> 
>   - Core determines execution cannot continue,and it does an unbreakable
>     shutdown
> 
> For more details, see Chapter 14 of Intel TDX Module EAS[1]
> 
> As some of integrity check failures may lead to system shutdown host
> kernel must not allow any writes to TD-private memory. This requirment
> clashes with KVM design: KVM expects the guest memory to be mapped into
> host userspace (e.g. QEMU).

So what you are saying is that if QEMU were to write to such memory, it 
could crash the kernel? What a broken design.

"As some of integrity check failures may lead to system shutdown host" 
-- usually we expect to recover from an MCE by killing the affected 
process, which would be the right thing to do here.

How can it happen that "Core determines execution cannot continue,and it 
does an unbreakable shutdown". Who is "Core"? CPU "core", MM "core" ? 
And why would it decide to do a shutdown instead of just killing the 
process?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-06  7:44   ` David Hildenbrand
@ 2021-04-06 10:50     ` Kirill A. Shutemov
  2021-04-06 14:33     ` Dave Hansen
  1 sibling, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-06 10:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm, linux-mm,
	linux-kernel

On Tue, Apr 06, 2021 at 09:44:07AM +0200, David Hildenbrand wrote:
> On 02.04.21 17:26, Kirill A. Shutemov wrote:
> > TDX architecture aims to provide resiliency against confidentiality and
> > integrity attacks. Towards this goal, the TDX architecture helps enforce
> > the enabling of memory integrity for all TD-private memory.
> > 
> > The CPU memory controller computes the integrity check value (MAC) for
> > the data (cache line) during writes, and it stores the MAC with the
> > memory as meta-data. A 28-bit MAC is stored in the ECC bits.
> > 
> > Checking of memory integrity is performed during memory reads. If
> > integrity check fails, CPU poisones cache line.
> > 
> > On a subsequent consumption (read) of the poisoned data by software,
> > there are two possible scenarios:
> > 
> >   - Core determines that the execution can continue and it treats
> >     poison with exception semantics signaled as a #MCE
> > 
> >   - Core determines execution cannot continue,and it does an unbreakable
> >     shutdown
> > 
> > For more details, see Chapter 14 of Intel TDX Module EAS[1]
> > 
> > As some of integrity check failures may lead to system shutdown host
> > kernel must not allow any writes to TD-private memory. This requirment
> > clashes with KVM design: KVM expects the guest memory to be mapped into
> > host userspace (e.g. QEMU).
> 
> So what you are saying is that if QEMU would write to such memory, it could
> crash the kernel? What a broken design.

Cannot disagree. #MCE for an integrity check failure is very questionable.
But I'm not a CPU engineer.

> "As some of integrity check failures may lead to system shutdown host" --
> usually we expect to recover from an MCE by killing the affected process,
> which would be the right thing to do here.

In most cases that's what happens.

> How can it happen that "Core determines execution cannot continue,and it
> does an unbreakable shutdown". Who is "Core"? CPU "core", MM "core" ?

CPU core.

> And why would it decide to do a shutdown instead of just killing the
> process?

<As I said, I'm not a CPU engineer. Below is my understanding of the issue.>

If the CPU is handling a long-flow instruction (one that involves microcode
and multiple memory accesses), consuming poison somewhere in the middle
leaves the CPU unable to get back into a sane state, and the only option is
a system shutdown.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-06  7:44   ` David Hildenbrand
  2021-04-06 10:50     ` Kirill A. Shutemov
@ 2021-04-06 14:33     ` Dave Hansen
  2021-04-06 14:57       ` David Hildenbrand
  2021-04-06 17:52       ` Tom Lendacky
  1 sibling, 2 replies; 27+ messages in thread
From: Dave Hansen @ 2021-04-06 14:33 UTC (permalink / raw)
  To: David Hildenbrand, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Sean Christopherson,
	Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On 4/6/21 12:44 AM, David Hildenbrand wrote:
> On 02.04.21 17:26, Kirill A. Shutemov wrote:
>> TDX architecture aims to provide resiliency against confidentiality and
>> integrity attacks. Towards this goal, the TDX architecture helps enforce
>> the enabling of memory integrity for all TD-private memory.
>>
>> The CPU memory controller computes the integrity check value (MAC) for
>> the data (cache line) during writes, and it stores the MAC with the
>> memory as meta-data. A 28-bit MAC is stored in the ECC bits.
>>
>> Checking of memory integrity is performed during memory reads. If
>> integrity check fails, CPU poisones cache line.
>>
>> On a subsequent consumption (read) of the poisoned data by software,
>> there are two possible scenarios:
>>
>>   - Core determines that the execution can continue and it treats
>>     poison with exception semantics signaled as a #MCE
>>
>>   - Core determines execution cannot continue,and it does an unbreakable
>>     shutdown
>>
>> For more details, see Chapter 14 of Intel TDX Module EAS[1]
>>
>> As some of integrity check failures may lead to system shutdown host
>> kernel must not allow any writes to TD-private memory. This requirment
>> clashes with KVM design: KVM expects the guest memory to be mapped into
>> host userspace (e.g. QEMU).
> 
> So what you are saying is that if QEMU would write to such memory, it
> could crash the kernel? What a broken design.

IMNHO, the broken design is mapping the memory to userspace in the first
place.  Why the heck would you actually expose something with the MMU to
a context that can't possibly meaningfully access or safely write to it?

This started with SEV.  QEMU creates normal memory mappings with the SEV
C-bit (encryption) disabled.  The kernel plumbs those into NPT, but when
those are instantiated, they have the C-bit set.  So, we have mismatched
mappings.  Where does that lead?  The two mappings not only differ in
the encryption bit, causing one side to read gibberish if the other
writes: they're not even cache coherent.

That's the situation *TODAY*, even ignoring TDX.

BTW, I'm pretty sure I know the answer to the "why would you expose this
to userspace" question: it's what QEMU/KVM did already for
non-encrypted memory, so this was the quickest way to get SEV working.

So, I don't like the #MC either.  But, this series is a step in the
right direction for TDX *AND* SEV.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-06 14:33     ` Dave Hansen
@ 2021-04-06 14:57       ` David Hildenbrand
  2021-04-07 13:16         ` Kirill A. Shutemov
  2021-04-06 17:52       ` Tom Lendacky
  1 sibling, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2021-04-06 14:57 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On 06.04.21 16:33, Dave Hansen wrote:
> On 4/6/21 12:44 AM, David Hildenbrand wrote:
>> On 02.04.21 17:26, Kirill A. Shutemov wrote:
>>> TDX architecture aims to provide resiliency against confidentiality and
>>> integrity attacks. Towards this goal, the TDX architecture helps enforce
>>> the enabling of memory integrity for all TD-private memory.
>>>
>>> The CPU memory controller computes the integrity check value (MAC) for
>>> the data (cache line) during writes, and it stores the MAC with the
>>> memory as meta-data. A 28-bit MAC is stored in the ECC bits.
>>>
>>> Checking of memory integrity is performed during memory reads. If
>>> integrity check fails, CPU poisones cache line.
>>>
>>> On a subsequent consumption (read) of the poisoned data by software,
>>> there are two possible scenarios:
>>>
>>>    - Core determines that the execution can continue and it treats
>>>      poison with exception semantics signaled as a #MCE
>>>
>>>    - Core determines execution cannot continue,and it does an unbreakable
>>>      shutdown
>>>
>>> For more details, see Chapter 14 of Intel TDX Module EAS[1]
>>>
>>> As some of integrity check failures may lead to system shutdown host
>>> kernel must not allow any writes to TD-private memory. This requirment
>>> clashes with KVM design: KVM expects the guest memory to be mapped into
>>> host userspace (e.g. QEMU).
>>
>> So what you are saying is that if QEMU would write to such memory, it
>> could crash the kernel? What a broken design.
> 
> IMNHO, the broken design is mapping the memory to userspace in the first
> place.  Why the heck would you actually expose something with the MMU to
> a context that can't possibly meaningfully access or safely write to it?

I'd say the broken design is being able to crash the machine via a 
simple memory write, instead of only crashing a single process in case 
you're doing something nasty. From the evaluation of the problem it 
feels like this was a CPU design workaround: instead of properly 
cleaning up when it gets tricky within the core, just crash the machine. 
And that's a CPU "feature", not a kernel "feature". Now we have to fix 
broken HW in the kernel - once again.

However, you raise a valid point: it does not make too much sense to
map this into user space. Not arguing against that; but crashing the 
machine is just plain ugly.

I wonder: why do we even *want* a VMA/mmap describing that memory? 
Sounds like: for hacking support for that memory type into QEMU/KVM.

This all feels wrong, but I cannot really tell how it could be better. 
That memory can really only be used (right now?) with hardware 
virtualization from some point on. From that point on (right from the 
start?), there should be no VMA/mmap/page tables for user space anymore.

Or am I missing something? Is there still valid user space access?

> 
> This started with SEV.  QEMU creates normal memory mappings with the SEV
> C-bit (encryption) disabled.  The kernel plumbs those into NPT, but when
> those are instantiated, they have the C-bit set.  So, we have mismatched
> mappings.  Where does that lead?  The two mappings not only differ in
> the encryption bit, causing one side to read gibberish if the other
> writes: they're not even cache coherent.
> 
> That's the situation *TODAY*, even ignoring TDX.
> 
> BTW, I'm pretty sure I know the answer to the "why would you expose this
> to userspace" question: it's what QEMU/KVM did alreadhy for
> non-encrypted memory, so this was the quickest way to get SEV working.
> 

Yes, I guess so. It was the fastest way to "hack" it into QEMU.

Would we ever even want a VMA/mmap/process page tables for that memory? 
How could user space ever do something *not so nasty* with that memory 
(in the current context of VMs)?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-06 14:33     ` Dave Hansen
  2021-04-06 14:57       ` David Hildenbrand
@ 2021-04-06 17:52       ` Tom Lendacky
  1 sibling, 0 replies; 27+ messages in thread
From: Tom Lendacky @ 2021-04-06 17:52 UTC (permalink / raw)
  To: Dave Hansen, David Hildenbrand, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Sean Christopherson,
	Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov

On 4/6/21 9:33 AM, Dave Hansen wrote:
> On 4/6/21 12:44 AM, David Hildenbrand wrote:
>> On 02.04.21 17:26, Kirill A. Shutemov wrote:
>>> TDX architecture aims to provide resiliency against confidentiality and
>>> integrity attacks. Towards this goal, the TDX architecture helps enforce
>>> the enabling of memory integrity for all TD-private memory.
>>>
>>> The CPU memory controller computes the integrity check value (MAC) for
>>> the data (cache line) during writes, and it stores the MAC with the
>>> memory as meta-data. A 28-bit MAC is stored in the ECC bits.
>>>
>>> Checking of memory integrity is performed during memory reads. If
>>> integrity check fails, CPU poisones cache line.
>>>
>>> On a subsequent consumption (read) of the poisoned data by software,
>>> there are two possible scenarios:
>>>
>>>   - Core determines that the execution can continue and it treats
>>>     poison with exception semantics signaled as a #MCE
>>>
>>>   - Core determines execution cannot continue,and it does an unbreakable
>>>     shutdown
>>>
>>> For more details, see Chapter 14 of Intel TDX Module EAS[1]
>>>
>>> As some of integrity check failures may lead to system shutdown host
>>> kernel must not allow any writes to TD-private memory. This requirment
>>> clashes with KVM design: KVM expects the guest memory to be mapped into
>>> host userspace (e.g. QEMU).
>>
>> So what you are saying is that if QEMU would write to such memory, it
>> could crash the kernel? What a broken design.
> 
> IMNHO, the broken design is mapping the memory to userspace in the first
> place.  Why the heck would you actually expose something with the MMU to
> a context that can't possibly meaningfully access or safely write to it?
> 
> This started with SEV.  QEMU creates normal memory mappings with the SEV
> C-bit (encryption) disabled.  The kernel plumbs those into NPT, but when
> those are instantiated, they have the C-bit set.  So, we have mismatched
> mappings.  Where does that lead?  The two mappings not only differ in
> the encryption bit, causing one side to read gibberish if the other
> writes: they're not even cache coherent.

QEMU is running on the hypervisor side, so even if the C-bit is set for
its memory mappings, it would use the hypervisor key to access the memory,
not the guest key. So it doesn't matter from a QEMU perspective whether it
creates mappings with or without the C-bit. The C-bit in the NPT is only
used if the guest is accessing the memory as shared/un-encrypted, in which
case the hypervisor key is then used.

The latest EPYC hardware provides cache coherency for encrypted /
non-encrypted accesses (X86_FEATURE_SME_COHERENT).

> 
> That's the situation *TODAY*, even ignoring TDX.
> 
> BTW, I'm pretty sure I know the answer to the "why would you expose this
> to userspace" question: it's what QEMU/KVM did alreadhy for
> non-encrypted memory, so this was the quickest way to get SEV working.
> 
> So, I don't like the #MC either.  But, this series is a step in the
> right direction for TDX *AND* SEV.

So, yes, this is a step in the right direction.

Thanks,
Tom

> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-06 14:57       ` David Hildenbrand
@ 2021-04-07 13:16         ` Kirill A. Shutemov
  2021-04-07 13:31           ` Christophe de Dinechin
  2021-04-07 14:09           ` David Hildenbrand
  0 siblings, 2 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-07 13:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Tue, Apr 06, 2021 at 04:57:46PM +0200, David Hildenbrand wrote:
> On 06.04.21 16:33, Dave Hansen wrote:
> > On 4/6/21 12:44 AM, David Hildenbrand wrote:
> > > On 02.04.21 17:26, Kirill A. Shutemov wrote:
> > > > TDX architecture aims to provide resiliency against confidentiality and
> > > > integrity attacks. Towards this goal, the TDX architecture helps enforce
> > > > the enabling of memory integrity for all TD-private memory.
> > > > 
> > > > The CPU memory controller computes the integrity check value (MAC) for
> > > > the data (cache line) during writes, and it stores the MAC with the
> > > > memory as meta-data. A 28-bit MAC is stored in the ECC bits.
> > > > 
> > > > Checking of memory integrity is performed during memory reads. If
> > > > integrity check fails, CPU poisones cache line.
> > > > 
> > > > On a subsequent consumption (read) of the poisoned data by software,
> > > > there are two possible scenarios:
> > > > 
> > > >    - Core determines that the execution can continue and it treats
> > > >      poison with exception semantics signaled as a #MCE
> > > > 
> > > >    - Core determines execution cannot continue,and it does an unbreakable
> > > >      shutdown
> > > > 
> > > > For more details, see Chapter 14 of Intel TDX Module EAS[1]
> > > > 
> > > > As some of integrity check failures may lead to system shutdown host
> > > > kernel must not allow any writes to TD-private memory. This requirment
> > > > clashes with KVM design: KVM expects the guest memory to be mapped into
> > > > host userspace (e.g. QEMU).
> > > 
> > > So what you are saying is that if QEMU would write to such memory, it
> > > could crash the kernel? What a broken design.
> > 
> > IMNHO, the broken design is mapping the memory to userspace in the first
> > place.  Why the heck would you actually expose something with the MMU to
> > a context that can't possibly meaningfully access or safely write to it?
> 
> I'd say the broken design is being able to crash the machine via a simple
> memory write, instead of only crashing a single process in case you're doing
> something nasty. From the evaluation of the problem it feels like this was a
> CPU design workaround: instead of properly cleaning up when it gets tricky
> within the core, just crash the machine. And that's a CPU "feature", not a
> kernel "feature". Now we have to fix broken HW in the kernel - once again.
> 
> However, you raise a valid point: it does not make too much sense to to map
> this into user space. Not arguing against that; but crashing the machine is
> just plain ugly.
> 
> I wonder: why do we even *want* a VMA/mmap describing that memory? Sounds
> like: for hacking support for that memory type into QEMU/KVM.
> 
> This all feels wrong, but I cannot really tell how it could be better. That
> memory can really only be used (right now?) with hardware virtualization
> from some point on. From that point on (right from the start?), there should
> be no VMA/mmap/page tables for user space anymore.
> 
> Or am I missing something? Is there still valid user space access?

There is. For IO (e.g. virtio) the guest marks a range of memory as shared
(or unencrypted for AMD SEV). The range is not pre-defined.
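
For reference, this is roughly the SEV guest-side conversion for a DMA
page today (simplified, illustrative sketch; not a quote from this
series):

	#include <linux/mm.h>
	#include <linux/set_memory.h>

	/* Share one page with the host: on SEV this clears the C-bit in
	 * the guest mapping (and flushes), so both sides access the page
	 * unencrypted from now on. */
	static int share_page_with_host(struct page *page)
	{
		unsigned long vaddr = (unsigned long)page_address(page);

		return set_memory_decrypted(vaddr, 1);
	}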

> > This started with SEV.  QEMU creates normal memory mappings with the SEV
> > C-bit (encryption) disabled.  The kernel plumbs those into NPT, but when
> > those are instantiated, they have the C-bit set.  So, we have mismatched
> > mappings.  Where does that lead?  The two mappings not only differ in
> > the encryption bit, causing one side to read gibberish if the other
> > writes: they're not even cache coherent.
> > 
> > That's the situation *TODAY*, even ignoring TDX.
> > 
> > BTW, I'm pretty sure I know the answer to the "why would you expose this
> > to userspace" question: it's what QEMU/KVM did alreadhy for
> > non-encrypted memory, so this was the quickest way to get SEV working.
> > 
> 
> Yes, I guess so. It was the fastest way to "hack" it into QEMU.
> 
> Would we ever even want a VMA/mmap/process page tables for that memory? How
> could user space ever do something *not so nasty* with that memory (in the
> current context of VMs)?

In the future, the memory should still be manageable by the host MM:
migration, swapping, etc. But it's a long way there. For now, the guest
memory is effectively pinned on the host.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-07 13:16         ` Kirill A. Shutemov
@ 2021-04-07 13:31           ` Christophe de Dinechin
  2021-04-07 14:09             ` Andi Kleen
  2021-04-07 14:09           ` David Hildenbrand
  1 sibling, 1 reply; 27+ messages in thread
From: Christophe de Dinechin @ 2021-04-07 13:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Hildenbrand, Dave Hansen, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Sean Christopherson, Jim Mattson, David Rientjes,
	Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm,
	linux-mm, linux-kernel, Kirill A. Shutemov



> On 7 Apr 2021, at 15:16, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Tue, Apr 06, 2021 at 04:57:46PM +0200, David Hildenbrand wrote:
>> On 06.04.21 16:33, Dave Hansen wrote:
>>> On 4/6/21 12:44 AM, David Hildenbrand wrote:
>>>> On 02.04.21 17:26, Kirill A. Shutemov wrote:
>>>>> TDX architecture aims to provide resiliency against confidentiality and
>>>>> integrity attacks. Towards this goal, the TDX architecture helps enforce
>>>>> the enabling of memory integrity for all TD-private memory.
>>>>> 
>>>>> The CPU memory controller computes the integrity check value (MAC) for
>>>>> the data (cache line) during writes, and it stores the MAC with the
>>>>> memory as meta-data. A 28-bit MAC is stored in the ECC bits.
>>>>> 
>>>>> Checking of memory integrity is performed during memory reads. If
>>>>> integrity check fails, CPU poisones cache line.
>>>>> 
>>>>> On a subsequent consumption (read) of the poisoned data by software,
>>>>> there are two possible scenarios:
>>>>> 
>>>>>   - Core determines that the execution can continue and it treats
>>>>>     poison with exception semantics signaled as a #MCE
>>>>> 
>>>>>   - Core determines execution cannot continue,and it does an unbreakable
>>>>>     shutdown
>>>>> 
>>>>> For more details, see Chapter 14 of Intel TDX Module EAS[1]
>>>>> 
>>>>> As some of integrity check failures may lead to system shutdown host
>>>>> kernel must not allow any writes to TD-private memory. This requirment
>>>>> clashes with KVM design: KVM expects the guest memory to be mapped into
>>>>> host userspace (e.g. QEMU).
>>>> 
>>>> So what you are saying is that if QEMU would write to such memory, it
>>>> could crash the kernel? What a broken design.
>>> 
>>> IMNHO, the broken design is mapping the memory to userspace in the first
>>> place.  Why the heck would you actually expose something with the MMU to
>>> a context that can't possibly meaningfully access or safely write to it?
>> 
>> I'd say the broken design is being able to crash the machine via a simple
>> memory write, instead of only crashing a single process in case you're doing
>> something nasty. From the evaluation of the problem it feels like this was a
>> CPU design workaround: instead of properly cleaning up when it gets tricky
>> within the core, just crash the machine. And that's a CPU "feature", not a
>> kernel "feature". Now we have to fix broken HW in the kernel - once again.
>> 
>> However, you raise a valid point: it does not make too much sense to to map
>> this into user space. Not arguing against that; but crashing the machine is
>> just plain ugly.
>> 
>> I wonder: why do we even *want* a VMA/mmap describing that memory? Sounds
>> like: for hacking support for that memory type into QEMU/KVM.
>> 
>> This all feels wrong, but I cannot really tell how it could be better. That
>> memory can really only be used (right now?) with hardware virtualization
>> from some point on. From that point on (right from the start?), there should
>> be no VMA/mmap/page tables for user space anymore.
>> 
>> Or am I missing something? Is there still valid user space access?
> 
> There is. For IO (e.g. virtio) the guest mark a range of memory as shared
> (or unencrypted for AMD SEV). The range is not pre-defined.
> 
>>> This started with SEV.  QEMU creates normal memory mappings with the SEV
>>> C-bit (encryption) disabled.  The kernel plumbs those into NPT, but when
>>> those are instantiated, they have the C-bit set.  So, we have mismatched
>>> mappings.  Where does that lead?  The two mappings not only differ in
>>> the encryption bit, causing one side to read gibberish if the other
>>> writes: they're not even cache coherent.
>>> 
>>> That's the situation *TODAY*, even ignoring TDX.
>>> 
>>> BTW, I'm pretty sure I know the answer to the "why would you expose this
>>> to userspace" question: it's what QEMU/KVM did alreadhy for
>>> non-encrypted memory, so this was the quickest way to get SEV working.
>>> 
>> 
>> Yes, I guess so. It was the fastest way to "hack" it into QEMU.
>> 
>> Would we ever even want a VMA/mmap/process page tables for that memory? How
>> could user space ever do something *not so nasty* with that memory (in the
>> current context of VMs)?
> 
> In the future, the memory should be still managable by host MM: migration,
> swapping, etc. But it's long way there. For now, the guest memory
> effectively pinned on the host.

Is there even a theoretical way to restore an encrypted page e.g. from (host)
swap without breaking the integrity check? Or will that only be possible with
assistance from within the encrypted enclave?


> 
> -- 
> Kirill A. Shutemov
> 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-07 13:31           ` Christophe de Dinechin
@ 2021-04-07 14:09             ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2021-04-07 14:09 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: Kirill A. Shutemov, David Hildenbrand, Dave Hansen, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Sean Christopherson,
	Jim Mattson, David Rientjes, Edgecombe, Rick P, Kleen, Andi,
	Yamahata, Isaku, x86, kvm, linux-mm, linux-kernel,
	Kirill A. Shutemov

Christophe de Dinechin <cdupontd@redhat.com> writes:

> Is there even a theoretical way to restore an encrypted page e.g. from (host)
> swap without breaking the integrity check? Or will that only be possible with
> assistance from within the encrypted enclave?

Only the latter.

You would need ballooning. It's in principle possible, but currently
not implemented.

In general, host swap without ballooning is usually a bad idea anyway,
because it often just swaps a lot of cache data that could easily be
thrown away instead.

-andi


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-07 13:16         ` Kirill A. Shutemov
  2021-04-07 13:31           ` Christophe de Dinechin
@ 2021-04-07 14:09           ` David Hildenbrand
  2021-04-07 14:36             ` Kirill A. Shutemov
  1 sibling, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2021-04-07 14:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On 07.04.21 15:16, Kirill A. Shutemov wrote:
> On Tue, Apr 06, 2021 at 04:57:46PM +0200, David Hildenbrand wrote:
>> On 06.04.21 16:33, Dave Hansen wrote:
>>> On 4/6/21 12:44 AM, David Hildenbrand wrote:
>>>> On 02.04.21 17:26, Kirill A. Shutemov wrote:
>>>>> TDX architecture aims to provide resiliency against confidentiality and
>>>>> integrity attacks. Towards this goal, the TDX architecture helps enforce
>>>>> the enabling of memory integrity for all TD-private memory.
>>>>>
>>>>> The CPU memory controller computes the integrity check value (MAC) for
>>>>> the data (cache line) during writes, and it stores the MAC with the
>>>>> memory as meta-data. A 28-bit MAC is stored in the ECC bits.
>>>>>
>>>>> Checking of memory integrity is performed during memory reads. If
>>>>> integrity check fails, CPU poisones cache line.
>>>>>
>>>>> On a subsequent consumption (read) of the poisoned data by software,
>>>>> there are two possible scenarios:
>>>>>
>>>>>     - Core determines that the execution can continue and it treats
>>>>>       poison with exception semantics signaled as a #MCE
>>>>>
>>>>>     - Core determines execution cannot continue,and it does an unbreakable
>>>>>       shutdown
>>>>>
>>>>> For more details, see Chapter 14 of Intel TDX Module EAS[1]
>>>>>
>>>>> As some of integrity check failures may lead to system shutdown host
>>>>> kernel must not allow any writes to TD-private memory. This requirment
>>>>> clashes with KVM design: KVM expects the guest memory to be mapped into
>>>>> host userspace (e.g. QEMU).
>>>>
>>>> So what you are saying is that if QEMU would write to such memory, it
>>>> could crash the kernel? What a broken design.
>>>
>>> IMNHO, the broken design is mapping the memory to userspace in the first
>>> place.  Why the heck would you actually expose something with the MMU to
>>> a context that can't possibly meaningfully access or safely write to it?
>>
>> I'd say the broken design is being able to crash the machine via a simple
>> memory write, instead of only crashing a single process in case you're doing
>> something nasty. From the evaluation of the problem it feels like this was a
>> CPU design workaround: instead of properly cleaning up when it gets tricky
>> within the core, just crash the machine. And that's a CPU "feature", not a
>> kernel "feature". Now we have to fix broken HW in the kernel - once again.
>>
>> However, you raise a valid point: it does not make too much sense to to map
>> this into user space. Not arguing against that; but crashing the machine is
>> just plain ugly.
>>
>> I wonder: why do we even *want* a VMA/mmap describing that memory? Sounds
>> like: for hacking support for that memory type into QEMU/KVM.
>>
>> This all feels wrong, but I cannot really tell how it could be better. That
>> memory can really only be used (right now?) with hardware virtualization
>> from some point on. From that point on (right from the start?), there should
>> be no VMA/mmap/page tables for user space anymore.
>>
>> Or am I missing something? Is there still valid user space access?
> 
> There is. For IO (e.g. virtio) the guest mark a range of memory as shared
> (or unencrypted for AMD SEV). The range is not pre-defined.
> 

Ah right, rings a bell. One obvious alternative would be to let user 
space only explicitly map what is shared and can be safely accessed, 
instead of doing it the other way around. But that obviously requires 
more thought/work and clashes with future MM changes you discuss below.

>>> This started with SEV.  QEMU creates normal memory mappings with the SEV
>>> C-bit (encryption) disabled.  The kernel plumbs those into NPT, but when
>>> those are instantiated, they have the C-bit set.  So, we have mismatched
>>> mappings.  Where does that lead?  The two mappings not only differ in
>>> the encryption bit, causing one side to read gibberish if the other
>>> writes: they're not even cache coherent.
>>>
>>> That's the situation *TODAY*, even ignoring TDX.
>>>
>>> BTW, I'm pretty sure I know the answer to the "why would you expose this
>>> to userspace" question: it's what QEMU/KVM did alreadhy for
>>> non-encrypted memory, so this was the quickest way to get SEV working.
>>>
>>
>> Yes, I guess so. It was the fastest way to "hack" it into QEMU.
>>
>> Would we ever even want a VMA/mmap/process page tables for that memory? How
>> could user space ever do something *not so nasty* with that memory (in the
>> current context of VMs)?
> 
> In the future, the memory should be still managable by host MM: migration,
> swapping, etc. But it's long way there. For now, the guest memory

I was involved in the s390x implementation where this already works, 
simply because whenever encrypted memory is read/written from the 
hypervisor, you simply read/write the encrypted data; once the VM 
accesses that memory, it reads/writes unencrypted memory. For this 
reason, migration, swapping etc. works fairly naturally.

I do wonder how x86-64 wants to tackle that; In the far future, will it 
be valid to again read/write encrypted memory, especially from user space?

> effectively pinned on the host.

Right, I remember that limitation for SEV.

Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-07 14:09           ` David Hildenbrand
@ 2021-04-07 14:36             ` Kirill A. Shutemov
  0 siblings, 0 replies; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-07 14:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Wed, Apr 07, 2021 at 04:09:35PM +0200, David Hildenbrand wrote:
> On 07.04.21 15:16, Kirill A. Shutemov wrote:
> > On Tue, Apr 06, 2021 at 04:57:46PM +0200, David Hildenbrand wrote:
> > > On 06.04.21 16:33, Dave Hansen wrote:
> > > > On 4/6/21 12:44 AM, David Hildenbrand wrote:
> > > > > On 02.04.21 17:26, Kirill A. Shutemov wrote:
> > > > > > TDX architecture aims to provide resiliency against confidentiality and
> > > > > > integrity attacks. Towards this goal, the TDX architecture helps enforce
> > > > > > the enabling of memory integrity for all TD-private memory.
> > > > > > 
> > > > > > The CPU memory controller computes the integrity check value (MAC) for
> > > > > > the data (cache line) during writes, and it stores the MAC with the
> > > > > > memory as meta-data. A 28-bit MAC is stored in the ECC bits.
> > > > > > 
> > > > > > Checking of memory integrity is performed during memory reads. If
> > > > > > integrity check fails, CPU poisones cache line.
> > > > > > 
> > > > > > On a subsequent consumption (read) of the poisoned data by software,
> > > > > > there are two possible scenarios:
> > > > > > 
> > > > > >     - Core determines that the execution can continue and it treats
> > > > > >       poison with exception semantics signaled as a #MCE
> > > > > > 
> > > > > >     - Core determines execution cannot continue,and it does an unbreakable
> > > > > >       shutdown
> > > > > > 
> > > > > > For more details, see Chapter 14 of Intel TDX Module EAS[1]
> > > > > > 
> > > > > > As some of integrity check failures may lead to system shutdown host
> > > > > > kernel must not allow any writes to TD-private memory. This requirment
> > > > > > clashes with KVM design: KVM expects the guest memory to be mapped into
> > > > > > host userspace (e.g. QEMU).
> > > > > 
> > > > > So what you are saying is that if QEMU would write to such memory, it
> > > > > could crash the kernel? What a broken design.
> > > > 
> > > > IMNHO, the broken design is mapping the memory to userspace in the first
> > > > place.  Why the heck would you actually expose something with the MMU to
> > > > a context that can't possibly meaningfully access or safely write to it?
> > > 
> > > I'd say the broken design is being able to crash the machine via a simple
> > > memory write, instead of only crashing a single process in case you're doing
> > > something nasty. From the evaluation of the problem it feels like this was a
> > > CPU design workaround: instead of properly cleaning up when it gets tricky
> > > within the core, just crash the machine. And that's a CPU "feature", not a
> > > kernel "feature". Now we have to fix broken HW in the kernel - once again.
> > > 
> > > However, you raise a valid point: it does not make too much sense to to map
> > > this into user space. Not arguing against that; but crashing the machine is
> > > just plain ugly.
> > > 
> > > I wonder: why do we even *want* a VMA/mmap describing that memory? Sounds
> > > like: for hacking support for that memory type into QEMU/KVM.
> > > 
> > > This all feels wrong, but I cannot really tell how it could be better. That
> > > memory can really only be used (right now?) with hardware virtualization
> > > from some point on. From that point on (right from the start?), there should
> > > be no VMA/mmap/page tables for user space anymore.
> > > 
> > > Or am I missing something? Is there still valid user space access?
> > 
> > There is. For IO (e.g. virtio) the guest mark a range of memory as shared
> > (or unencrypted for AMD SEV). The range is not pre-defined.
> > 
> 
> Ah right, rings a bell. One obvious alternative would be to let user space
> only explicitly map what is shared and can be safely accessed, instead of
> doing it the other way around. But that obviously requires more thought/work
> and clashes with future MM changes you discuss below.

IIUC, Hyper-V's VMBus uses a pre-defined range that is communicated through
ACPI. KVM/virtio can do the same in theory, but it would require changes
in the existing driver model.

> > > > This started with SEV.  QEMU creates normal memory mappings with the SEV
> > > > C-bit (encryption) disabled.  The kernel plumbs those into NPT, but when
> > > > those are instantiated, they have the C-bit set.  So, we have mismatched
> > > > mappings.  Where does that lead?  The two mappings not only differ in
> > > > the encryption bit, causing one side to read gibberish if the other
> > > > writes: they're not even cache coherent.
> > > > 
> > > > That's the situation *TODAY*, even ignoring TDX.
> > > > 
> > > > BTW, I'm pretty sure I know the answer to the "why would you expose this
> > > > to userspace" question: it's what QEMU/KVM did alreadhy for
> > > > non-encrypted memory, so this was the quickest way to get SEV working.
> > > > 
> > > 
> > > Yes, I guess so. It was the fastest way to "hack" it into QEMU.
> > > 
> > > Would we ever even want a VMA/mmap/process page tables for that memory? How
> > > could user space ever do something *not so nasty* with that memory (in the
> > > current context of VMs)?
> > 
> > In the future, the memory should be still managable by host MM: migration,
> > swapping, etc. But it's long way there. For now, the guest memory
> 
> I was involved in the s390x implementation where this already works, simply
> because whenever encrypted memory is read/written from the hypervisor, you
> simple read/write the encrypted data; once the VM accesses that memory, it
> reads/writes unencrypted memory. For this reason, migration, swapping etc.
> works fairly naturally.

In the TDX case, the encryption is tied to the physical address of the
encrypted block. Moving the block to another place in memory would produce
garbage. It's done intentionally to protect against replay attacks.

> I do wonder how x86-64 wants to tackle that; In the far future, will it be
> valid to again read/write encrypted memory, especially from user space?
>

It would require assistance from the guest and/or TDX module.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-02 15:26 ` [RFCv1 7/7] KVM: unmap guest memory using poisoned pages Kirill A. Shutemov
  2021-04-06  7:44   ` David Hildenbrand
@ 2021-04-07 14:55   ` David Hildenbrand
  2021-04-07 15:10     ` Andi Kleen
  2021-04-09 13:33     ` Kirill A. Shutemov
  1 sibling, 2 replies; 27+ messages in thread
From: David Hildenbrand @ 2021-04-07 14:55 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson
  Cc: David Rientjes, Edgecombe, Rick P, Kleen, Andi, Yamahata, Isaku,
	x86, kvm, linux-mm, linux-kernel, Kirill A. Shutemov,
	Oscar Salvador, Naoya Horiguchi

On 02.04.21 17:26, Kirill A. Shutemov wrote:
> TDX architecture aims to provide resiliency against confidentiality and
> integrity attacks. Towards this goal, the TDX architecture helps enforce
> the enabling of memory integrity for all TD-private memory.
> 
> The CPU memory controller computes the integrity check value (MAC) for
> the data (cache line) during writes, and it stores the MAC with the
> memory as meta-data. A 28-bit MAC is stored in the ECC bits.
> 
> Checking of memory integrity is performed during memory reads. If
> integrity check fails, CPU poisones cache line.
> 
> On a subsequent consumption (read) of the poisoned data by software,
> there are two possible scenarios:
> 
>   - Core determines that the execution can continue and it treats
>     poison with exception semantics signaled as a #MCE
> 
>   - Core determines execution cannot continue,and it does an unbreakable
>     shutdown
> 
> For more details, see Chapter 14 of Intel TDX Module EAS[1]
> 
> As some of integrity check failures may lead to system shutdown host
> kernel must not allow any writes to TD-private memory. This requirment
> clashes with KVM design: KVM expects the guest memory to be mapped into
> host userspace (e.g. QEMU).
> 
> This patch aims to start discussion on how we can approach the issue.
> 
> For now I intentionally keep TDX out of picture here and try to find a
> generic way to unmap KVM guest memory from host userspace. Hopefully, it
> makes the patch more approachable. And anyone can try it out.
> 
> To the proposal:
> 
> Looking into existing codepaths I've discovered that we already have
> semantics we want. That's PG_hwpoison'ed pages and SWP_HWPOISON swap
> entries in page tables:
> 
>    - If an application touches a page mapped with the SWP_HWPOISON, it will
>      get SIGBUS.
> 
>    - GUP will fail with -EFAULT;
> 
> Access the poisoned memory via page cache doesn't match required
> semantics right now, but it shouldn't be too hard to make it work:
> access to poisoned dirty pages should give -EIO or -EHWPOISON.
> 
> My idea is that we can mark page as poisoned when we make it TD-private
> and replace all PTEs that map the page with SWP_HWPOISON.

It looks quite hacky (well, what did I expect from an RFC :) ): you can 
no longer distinguish actually poisoned pages from "temporarily 
poisoned" pages. FOLL_ALLOW_POISONED sounds especially nasty and 
dangerous -  "I want to read/write a poisoned page, trust me, I know 
what I am doing".

Storing the state for each individual page initially sounded like the 
right thing to do, but I wonder if we couldn't handle this on a per-VMA 
level. You can just remember the handful of shared ranges internally 
like you do right now AFAIU.


From what I get, you want a way to

1. Unmap pages from the user space page tables.

2. Disallow re-faulting of the protected pages into the page tables. On 
user space access, you want to deliver some signal (e.g., SIGBUS).

3. Allow selected users to still grab the pages (esp. KVM to fault them 
into the page tables).

4. Allow access to currently shared specific pages from user space.

Right now, you achieve

1. Via try_to_unmap()
2. TestSetPageHWPoison
3. TBD (e.g., FOLL_ALLOW_POISONED)
4. ClearPageHWPoison()
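
Spelled out as a rough sketch against today's page-flag/rmap APIs (the
function names are made up for illustration, this is not code from the
series, and 3. stays the TBD part):

	#include <linux/mm.h>
	#include <linux/page-flags.h>
	#include <linux/rmap.h>

	static int protect_guest_page(struct page *page)
	{
		/* 2. Any re-fault now gets hwpoison semantics (SIGBUS). */
		if (TestSetPageHWPoison(page))
			return -EBUSY;	/* poisoned already - for real or "temporarily" */

		/* 1. Drop existing user space PTEs; with PG_hwpoison set
		 *    they are replaced by SWP_HWPOISON swap entries. */
		if (page_mapped(page))
			try_to_unmap(page, TTU_IGNORE_MLOCK);

		return 0;
	}

	static void unprotect_guest_page(struct page *page)
	{
		/* 4. The page becomes accessible from user space again. */
		ClearPageHWPoison(page);
	}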


If we could bounce all writes to shared pages through the kernel, things 
could end up a little easier. Some very rough idea:

We could let user space set up VM memory as
mprotect(PROT_READ) (+ PROT_KERNEL_WRITE?), and after activating 
protected memory (I assume via a KVM ioctl), make sure the VMAs cannot 
be set to PROT_WRITE anymore. This would already properly unmap and 
deliver a SIGSEGV when trying to write from user space.
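
The user-space-visible half of that is easy to sketch; the "cannot be set
to PROT_WRITE anymore" rule and the kernel-side write path are the new,
hypothetical bits:

	#include <sys/mman.h>
	#include <string.h>
	#include <err.h>

	int main(void)
	{
		size_t len = 4096;
		char *guest_mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (guest_mem == MAP_FAILED)
			err(1, "mmap");
		memset(guest_mem, 0, len);	/* fine before protection */

		/* "Activate protected memory": downgrade to read-only;
		 * a later mprotect(PROT_WRITE) would then be refused. */
		if (mprotect(guest_mem, len, PROT_READ))
			err(1, "mprotect");

		guest_mem[0] = 1;		/* delivers SIGSEGV */
		return 0;
	}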

You could then still access the pages, e.g., via FOLL_FORCE or a new 
fancy flag that allows to write with VM_MAYWRITE|VM_DENYUSERWRITE. This 
would allow an ioctl to write page content and to map the pages into NPTs.

As an extension, we could think about (re?)mapping some shared pages 
read|write. The question is how to synchronize with user space.

I have no idea how expensive would be bouncing writes (and reads?) 
through the kernel. Did you ever experiment with that/evaluate that?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-07 14:55   ` David Hildenbrand
@ 2021-04-07 15:10     ` Andi Kleen
  2021-04-09 13:33     ` Kirill A. Shutemov
  1 sibling, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2021-04-07 15:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov, Oscar Salvador,
	Naoya Horiguchi

David Hildenbrand <david@redhat.com> writes:

> I have no idea how expensive would be bouncing writes (and reads?)
> through the kernel. Did you ever experiment with that/evaluate that?

I would expect it to be quite expensive, as in virtio IO performance
tanking.

-Andi


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 2/7] x86/kvm: Introduce KVM memory protection feature
  2021-04-02 15:26 ` [RFCv1 2/7] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
@ 2021-04-08  9:52   ` Borislav Petkov
  2021-04-09 13:36     ` Kirill A. Shutemov
  0 siblings, 1 reply; 27+ messages in thread
From: Borislav Petkov @ 2021-04-08  9:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Fri, Apr 02, 2021 at 06:26:40PM +0300, Kirill A. Shutemov wrote:
> Provide basic helpers, KVM_FEATURE, CPUID flag and a hypercall.
> 
> Host side doesn't provide the feature yet, so it is a dead code for now.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/include/asm/cpufeatures.h   |  1 +
>  arch/x86/include/asm/kvm_para.h      |  5 +++++
>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
>  arch/x86/kernel/kvm.c                | 18 ++++++++++++++++++
>  include/uapi/linux/kvm_para.h        |  3 ++-
>  5 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 84b887825f12..5b6f23e6edc4 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -238,6 +238,7 @@
>  #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
>  #define X86_FEATURE_SEV_ES		( 8*32+20) /* AMD Secure Encrypted Virtualization - Encrypted State */
>  #define X86_FEATURE_VM_PAGE_FLUSH	( 8*32+21) /* "" VM Page Flush MSR is supported */
> +#define X86_FEATURE_KVM_MEM_PROTECTED	( 8*32+22) /* KVM memory protection extenstion */
										    ^^^^^^^^^^
What's that feature bit for?

Also, use a spellchecker pls: "extenstion".

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-07 14:55   ` David Hildenbrand
  2021-04-07 15:10     ` Andi Kleen
@ 2021-04-09 13:33     ` Kirill A. Shutemov
  2021-04-09 13:50       ` David Hildenbrand
  1 sibling, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-09 13:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov, Oscar Salvador,
	Naoya Horiguchi

On Wed, Apr 07, 2021 at 04:55:54PM +0200, David Hildenbrand wrote:
> On 02.04.21 17:26, Kirill A. Shutemov wrote:
> > TDX architecture aims to provide resiliency against confidentiality and
> > integrity attacks. Towards this goal, the TDX architecture helps enforce
> > the enabling of memory integrity for all TD-private memory.
> > 
> > The CPU memory controller computes the integrity check value (MAC) for
> > the data (cache line) during writes, and it stores the MAC with the
> > memory as meta-data. A 28-bit MAC is stored in the ECC bits.
> > 
> > Checking of memory integrity is performed during memory reads. If
> > integrity check fails, CPU poisones cache line.
> > 
> > On a subsequent consumption (read) of the poisoned data by software,
> > there are two possible scenarios:
> > 
> >   - Core determines that the execution can continue and it treats
> >     poison with exception semantics signaled as a #MCE
> > 
> >   - Core determines execution cannot continue,and it does an unbreakable
> >     shutdown
> > 
> > For more details, see Chapter 14 of Intel TDX Module EAS[1]
> > 
> > As some of integrity check failures may lead to system shutdown host
> > kernel must not allow any writes to TD-private memory. This requirment
> > clashes with KVM design: KVM expects the guest memory to be mapped into
> > host userspace (e.g. QEMU).
> > 
> > This patch aims to start discussion on how we can approach the issue.
> > 
> > For now I intentionally keep TDX out of picture here and try to find a
> > generic way to unmap KVM guest memory from host userspace. Hopefully, it
> > makes the patch more approachable. And anyone can try it out.
> > 
> > To the proposal:
> > 
> > Looking into existing codepaths I've discovered that we already have
> > semantics we want. That's PG_hwpoison'ed pages and SWP_HWPOISON swap
> > entries in page tables:
> > 
> >    - If an application touches a page mapped with the SWP_HWPOISON, it will
> >      get SIGBUS.
> > 
> >    - GUP will fail with -EFAULT;
> > 
> > Access the poisoned memory via page cache doesn't match required
> > semantics right now, but it shouldn't be too hard to make it work:
> > access to poisoned dirty pages should give -EIO or -EHWPOISON.
> > 
> > My idea is that we can mark page as poisoned when we make it TD-private
> > and replace all PTEs that map the page with SWP_HWPOISON.
> 
> It looks quite hacky (well, what did I expect from an RFC :) ) you can no
> longer distinguish actually poisoned pages from "temporarily poisoned"
> pages. FOLL_ALLOW_POISONED sounds especially nasty and dangerous -  "I want
> to read/write a poisoned page, trust me, I know what I am doing".
> 
> Storing the state for each individual page initially sounded like the right
> thing to do, but I wonder if we couldn't handle this on a per-VMA level. You
> can just remember the handful of shared ranges internally like you do right
> now AFAIU.

per-VMA would not fly for file-backed (e.g. tmpfs) memory. We may need to
combine PG_hwpoison with a VMA flag. Maybe per-inode tracking would also be
required. Or per-memslot. I don't know. Need more experiments.

Note, I use PG_hwpoison now, but if we find a show-stopper issue where we
would see confusion with a real poison, we can switch to new flags and
a new swap_type(). I have not seen a reason yet.

> From what I get, you want a way to
> 
> 1. Unmap pages from the user space page tables.

A plain unmap would not work for some use-cases. Some CSPs want to
preallocate memory in a specific way; it's how they provide a fine-grained
NUMA policy.

The existing mapping has to be converted.

> 2. Disallow re-faulting of the protected pages into the page tables. On user
> space access, you want to deliver some signal (e.g., SIGBUS).

Note that the userspace mapping is the only source of pfns for the VM's
shadow mapping. The fault should be allowed, but lead to a non-present PTE
that still encodes the pfn.
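
A non-present PTE that still encodes the pfn is exactly the shape of a
SWP_HWPOISON entry, so the existing helpers can be reused. A sketch of
what installing one looks like (not a quote from the patch):

	#include <linux/mm.h>
	#include <linux/swapops.h>

	static void poison_pte(struct mm_struct *mm, unsigned long addr,
			       pte_t *ptep, struct page *page)
	{
		/* == swp_entry(SWP_HWPOISON, page_to_pfn(page)) */
		swp_entry_t entry = make_hwpoison_entry(page);

		set_pte_at(mm, addr, ptep, swp_entry_to_pte(entry));
		/* The pfn can be recovered later via swp_offset(entry). */
	}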

> 3. Allow selected users to still grab the pages (esp. KVM to fault them into
> the page tables).

As long as the fault leads to non-present PTEs we are fine. Userspace may
still want to mlock() some of the guest memory. There's no reason to
prevent this.

> 4. Allow access to currently shared specific pages from user space.
> 
> Right now, you achieve
> 
> 1. Via try_to_unmap()
> 2. TestSetPageHWPoison
> 3. TBD (e.g., FOLL_ALLOW_POISONED)
> 4. ClearPageHWPoison()
> 
> 
> If we could bounce all writes to shared pages through the kernel, things
> could end up a little easier. Some very rough idea:
> 
> We could let user space setup VM memory as
> mprotect(PROT_READ) (+ PROT_KERNEL_WRITE?), and after activating protected
> memory (I assume via a KVM ioctl), make sure the VMAs cannot be set to
> PROT_WRITE anymore. This would already properly unmap and deliver a SIGSEGV
> when trying to write from user space.
> 
> You could then still access the pages, e.g., via FOLL_FORCE or a new fancy
> flag that allows to write with VM_MAYWRITE|VM_DENYUSERWRITE. This would
> allow an ioctl to write page content and to map the pages into NPTs.
> 
> As an extension, we could think about (re?)mapping some shared pages
> read|write. The question is how to synchronize with user space.
> 
> I have no idea how expensive would be bouncing writes (and reads?) through
> the kernel. Did you ever experiment with that/evaluate that?

It's going to be a double bounce buffer: on the guest we force swiotlb to
make IO go through the shared region. I don't think it's a good idea.

There are a number of ways to share memory. It's going to be decided by
the way we get these pages unmapped in the first place.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 2/7] x86/kvm: Introduce KVM memory protection feature
  2021-04-08  9:52   ` Borislav Petkov
@ 2021-04-09 13:36     ` Kirill A. Shutemov
  2021-04-09 14:37       ` Borislav Petkov
  0 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-09 13:36 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Thu, Apr 08, 2021 at 11:52:35AM +0200, Borislav Petkov wrote:
> On Fri, Apr 02, 2021 at 06:26:40PM +0300, Kirill A. Shutemov wrote:
> > Provide basic helpers, KVM_FEATURE, CPUID flag and a hypercall.
> > 
> > Host side doesn't provide the feature yet, so it is a dead code for now.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/include/asm/cpufeatures.h   |  1 +
> >  arch/x86/include/asm/kvm_para.h      |  5 +++++
> >  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
> >  arch/x86/kernel/kvm.c                | 18 ++++++++++++++++++
> >  include/uapi/linux/kvm_para.h        |  3 ++-
> >  5 files changed, 28 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index 84b887825f12..5b6f23e6edc4 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -238,6 +238,7 @@
> >  #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
> >  #define X86_FEATURE_SEV_ES		( 8*32+20) /* AMD Secure Encrypted Virtualization - Encrypted State */
> >  #define X86_FEATURE_VM_PAGE_FLUSH	( 8*32+21) /* "" VM Page Flush MSR is supported */
> > +#define X86_FEATURE_KVM_MEM_PROTECTED	( 8*32+22) /* KVM memory protection extenstion */
> 										    ^^^^^^^^^^
> What's that feature bit for?

The patchset is still in path-finding stage. I'll be more specific once we
settle on how the feature works.
 
> Also, use a spellchecker pls: "extenstion".

Ouch. Thanks.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-09 13:33     ` Kirill A. Shutemov
@ 2021-04-09 13:50       ` David Hildenbrand
  2021-04-09 14:12         ` Kirill A. Shutemov
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2021-04-09 13:50 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov, Oscar Salvador,
	Naoya Horiguchi

>> It looks quite hacky (well, what did I expect from an RFC :) ) you can no
>> longer distinguish actually poisoned pages from "temporarily poisoned"
>> pages. FOLL_ALLOW_POISONED sounds especially nasty and dangerous -  "I want
>> to read/write a poisoned page, trust me, I know what I am doing".
>>
>> Storing the state for each individual page initially sounded like the right
>> thing to do, but I wonder if we couldn't handle this on a per-VMA level. You
>> can just remember the handful of shared ranges internally like you do right
>> now AFAIU.
> 
> per-VMA would not fly for file-backed (e.g. tmpfs) memory. We may need to
> combine PG_hwpoison with VMA flag. Maybe per-inode tracking would also be
> required. Or per-memslot. I donno. Need more experiments.

Indeed.

> 
> Note, I use PG_hwpoison now, but if we find a show-stopper issue where we
> would see confusion with a real poison, we can switch to new flags and
> a new swap_type(). I have not seen a reason yet.

I think we'll want a dedicated mechanism to cleanly mark pages as 
"protected". Finding a page flag you can use will be the problematic 
part, but should not be impossible if we have a good reason to do so 
(even if it means making the feature mutually exclusive with other 
features).

> 
>>  From what I get, you want a way to
>>
>> 1. Unmap pages from the user space page tables.
> 
> Plain unmap would not work for some use-cases. Some CSPs want to
> preallocate memory in a specific way. It's a way to provide a fine-grained
> NUMA policy.
> 
> The existing mapping has to be converted.
> 
>> 2. Disallow re-faulting of the protected pages into the page tables. On user
>> space access, you want to deliver some signal (e.g., SIGBUS).
> 
> Note that userspace mapping is the only source of pfn's for VM's shadow
> mapping. The fault should be allow, but lead to non-present PTE that still
> encodes pfn.

Makes sense, but I guess that's the part still to be implemented (see 
next comment).

> 
>> 3. Allow selected users to still grab the pages (esp. KVM to fault them into
>> the page tables).
> 
> As long as fault leads to non-present PTEs we are fine. Usespace still may
> want to mlock() some of guest memory. There's no reason to prevent this.

I'm curious, even get_user_pages() will lead to a present PTE as is, no? 
So that will need modifications I assume. (Although I think it 
fundamentally differs from the way get_user_pages() works - trigger a 
fault first, then look up the PTE in the page tables.)

>> 4. Allow access to currently shared specific pages from user space.
>>
>> Right now, you achieve
>>
>> 1. Via try_to_unmap()
>> 2. TestSetPageHWPoison
>> 3. TBD (e.g., FOLL_ALLOW_POISONED)
>> 4. ClearPageHWPoison()
>>
>>
>> If we could bounce all writes to shared pages through the kernel, things
>> could end up a little easier. Some very rough idea:
>>
>> We could let user space setup VM memory as
>> mprotect(PROT_READ) (+ PROT_KERNEL_WRITE?), and after activating protected
>> memory (I assume via a KVM ioctl), make sure the VMAs cannot be set to
>> PROT_WRITE anymore. This would already properly unmap and deliver a SIGSEGV
>> when trying to write from user space.
>>
>> You could then still access the pages, e.g., via FOLL_FORCE or a new fancy
>> flag that allows to write with VM_MAYWRITE|VM_DENYUSERWRITE. This would
>> allow an ioctl to write page content and to map the pages into NPTs.
>>
>> As an extension, we could think about (re?)mapping some shared pages
>> read|write. The question is how to synchronize with user space.
>>
>> I have no idea how expensive would be bouncing writes (and reads?) through
>> the kernel. Did you ever experiment with that/evaluate that?
> 
> It's going to be double bounce buffer: on the guest we force swiotlb to
> make it go through shared region. I don't think it's a good idea.

So if it's already slow, do we really care? ;)

> 
> There are a number of way to share a memory. It's going to be decided by
> the way we get these pages unmapped in the first place.

I agree that shared memory can be somewhat problematic and would require 
tracking it per page.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-09 13:50       ` David Hildenbrand
@ 2021-04-09 14:12         ` Kirill A. Shutemov
  2021-04-09 14:18           ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Kirill A. Shutemov @ 2021-04-09 14:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov, Oscar Salvador,
	Naoya Horiguchi

On Fri, Apr 09, 2021 at 03:50:42PM +0200, David Hildenbrand wrote:
> > > 3. Allow selected users to still grab the pages (esp. KVM to fault them into
> > > the page tables).
> > 
> > As long as fault leads to non-present PTEs we are fine. Usespace still may
> > want to mlock() some of guest memory. There's no reason to prevent this.
> 
> I'm curious, even get_user_pages() will lead to a present PTE as is, no? So
> that will need modifications I assume. (although I think it fundamentally
> differs to the way get_user_pages() works - trigger a fault first, then
> lookup the PTE in the page tables).

For now, the patch has two-step poisoning: first the page gets faulted in,
then, on adding it to the shadow PTE, it gets poisoned. By the time the VM
has a chance to use the page, it's poisoned and unmapped from host userspace.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
  2021-04-09 14:12         ` Kirill A. Shutemov
@ 2021-04-09 14:18           ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2021-04-09 14:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov, Oscar Salvador,
	Naoya Horiguchi

On 09.04.21 16:12, Kirill A. Shutemov wrote:
> On Fri, Apr 09, 2021 at 03:50:42PM +0200, David Hildenbrand wrote:
>>>> 3. Allow selected users to still grab the pages (esp. KVM to fault them into
>>>> the page tables).
>>>
>>> As long as a fault leads to non-present PTEs we are fine. Userspace still may
>>> want to mlock() some of guest memory. There's no reason to prevent this.
>>
>> I'm curious, even get_user_pages() will lead to a present PTE as is, no? So
>> that will need modifications I assume. (although I think it fundamentally
>> differs from the way get_user_pages() works - trigger a fault first, then
>> look up the PTE in the page tables).
> 
> For now, the patch does two-step poisoning: first the page is faulted in,
> then it is poisoned when it gets added to the shadow PTE. By the time the
> VM has a chance to use the page, it is poisoned and unmapped from host
> userspace.

IIRC, this then assumes that while a page is protected, it has to remain 
mapped into the NPT: there is no way to map it into the NPT later, 
because the page has already been poisoned.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFCv1 2/7] x86/kvm: Introduce KVM memory protection feature
  2021-04-09 13:36     ` Kirill A. Shutemov
@ 2021-04-09 14:37       ` Borislav Petkov
  0 siblings, 0 replies; 27+ messages in thread
From: Borislav Petkov @ 2021-04-09 14:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Sean Christopherson, Jim Mattson, David Rientjes, Edgecombe,
	Rick P, Kleen, Andi, Yamahata, Isaku, x86, kvm, linux-mm,
	linux-kernel, Kirill A. Shutemov

On Fri, Apr 09, 2021 at 04:36:01PM +0300, Kirill A. Shutemov wrote:
> The patchset is still in the path-finding stage. I'll be more specific once we
> settle on how the feature works.

This is not why I'm asking: these feature bits are visible to userspace
in /proc/cpuinfo, and if you don't have a use case for showing this to
userspace, use "" in the comment so that the flag stays hidden.

And if you don't need that feature bit at all (you're only setting it
but not querying it), you should not add it at all.
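
For reference, the convention looks like this in
arch/x86/include/asm/cpufeatures.h: a quoted string at the start of the
comment overrides the name shown in /proc/cpuinfo, and an empty "" hides
the flag entirely. The feature names and bit positions below are made up
for illustration.

/* Visible in /proc/cpuinfo as "kvm_prot_example": */
#define X86_FEATURE_KVM_PROT_EXAMPLE	(18*32 + 4) /* "kvm_prot_example" KVM memory protection */

/* Hidden from /proc/cpuinfo: */
#define X86_FEATURE_KVM_PROT_HIDDEN	(18*32 + 5) /* "" KVM memory protection */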

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2021-04-09 14:37 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-02 15:26 [RFCv1 0/7] TDX and guest memory unmapping Kirill A. Shutemov
2021-04-02 15:26 ` [RFCv1 1/7] x86/mm: Move force_dma_unencrypted() to common code Kirill A. Shutemov
2021-04-02 15:26 ` [RFCv1 2/7] x86/kvm: Introduce KVM memory protection feature Kirill A. Shutemov
2021-04-08  9:52   ` Borislav Petkov
2021-04-09 13:36     ` Kirill A. Shutemov
2021-04-09 14:37       ` Borislav Petkov
2021-04-02 15:26 ` [RFCv1 3/7] x86/kvm: Make DMA pages shared Kirill A. Shutemov
2021-04-02 15:26 ` [RFCv1 4/7] x86/kvm: Use bounce buffers for KVM memory protection Kirill A. Shutemov
2021-04-02 15:26 ` [RFCv1 5/7] x86/kvmclock: Share hvclock memory with the host Kirill A. Shutemov
2021-04-02 15:26 ` [RFCv1 6/7] x86/realmode: Share trampoline area if KVM memory protection enabled Kirill A. Shutemov
2021-04-02 15:26 ` [RFCv1 7/7] KVM: unmap guest memory using poisoned pages Kirill A. Shutemov
2021-04-06  7:44   ` David Hildenbrand
2021-04-06 10:50     ` Kirill A. Shutemov
2021-04-06 14:33     ` Dave Hansen
2021-04-06 14:57       ` David Hildenbrand
2021-04-07 13:16         ` Kirill A. Shutemov
2021-04-07 13:31           ` Christophe de Dinechin
2021-04-07 14:09             ` Andi Kleen
2021-04-07 14:09           ` David Hildenbrand
2021-04-07 14:36             ` Kirill A. Shutemov
2021-04-06 17:52       ` Tom Lendacky
2021-04-07 14:55   ` David Hildenbrand
2021-04-07 15:10     ` Andi Kleen
2021-04-09 13:33     ` Kirill A. Shutemov
2021-04-09 13:50       ` David Hildenbrand
2021-04-09 14:12         ` Kirill A. Shutemov
2021-04-09 14:18           ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).