kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/9] Add AMD SEV page encryption bitmap support.
@ 2020-12-01  0:45 Ashish Kalra
  2020-12-01  0:45 ` [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3 Ashish Kalra
                   ` (8 more replies)
  0 siblings, 9 replies; 38+ messages in thread
From: Ashish Kalra @ 2020-12-01  0:45 UTC (permalink / raw)
  To: pbonzini
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

From: Ashish Kalra <ashish.kalra@amd.com>

The series add support for AMD SEV page encryption bitmap.

SEV guest VMs have the concept of private and shared memory. Private memory
is encrypted with the guest-specific key, while shared memory may be encrypted
with hypervisor key. The patch series introduces a new hypercall.
The guest OS can use this hypercall to notify the page encryption status.

The patch adds new ioctls KVM_{SET,GET}_PAGE_ENC_BITMAP. The ioctl can be used
by qemu to get the page encryption bitmap. Qemu can consult this bitmap
during guest live migration / page migration and/or guest debugging to know
whether the page is encrypted.

The page encryption bitmap support is required for SEV guest live migration,
guest page migration and guest debugging.

The patch-set also adds support for bypassing unencrypted guest memory
regions for DBG_DECRYPT API calls, guest memory region encryption status
in sev_dbg_decrypt() is now referenced using the page encryption bitmap.

A branch containing these patches is available here:
https://github.com/AMDESE/linux/tree/sev-page-encryption-bitmap-v2

Changes since v1:
 - Fix in sev_dbg_crypt() to release RCU read lock if hva_to_gfn() fails
   when bypassing DBG_DECRYPT API calls for unencrypted guest memory.
 - Comment fix for Patch 7/9. 

Ashish Kalra (4):
  KVM: SVM: Add support for static allocation of unified Page Encryption
    Bitmap.
  KVM: x86: Mark _bss_decrypted section variables as decrypted in page
    encryption bitmap.
  KVM: x86: Add kexec support for SEV page encryption bitmap.
  KVM: SVM: Bypass DBG_DECRYPT API calls for unecrypted guest memory.

Brijesh Singh (5):
  KVM: x86: Add AMD SEV specific Hypercall3
  KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall
  KVM: x86: Introduce KVM_GET_PAGE_ENC_BITMAP ioctl
  mm: x86: Invoke hypercall when page encryption status is changed.
  KVM: x86: Introduce KVM_SET_PAGE_ENC_BITMAP ioctl

 Documentation/virt/kvm/api.rst        |  71 ++++++
 Documentation/virt/kvm/hypercalls.rst |  15 ++
 arch/x86/include/asm/kvm_host.h       |   7 +
 arch/x86/include/asm/kvm_para.h       |  12 +
 arch/x86/include/asm/mem_encrypt.h    |   4 +
 arch/x86/include/asm/paravirt.h       |  10 +
 arch/x86/include/asm/paravirt_types.h |   2 +
 arch/x86/kernel/kvm.c                 |  28 +++
 arch/x86/kernel/kvmclock.c            |  12 +
 arch/x86/kernel/paravirt.c            |   1 +
 arch/x86/kvm/svm/sev.c                | 321 ++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c                |   5 +
 arch/x86/kvm/svm/svm.h                |   7 +
 arch/x86/kvm/vmx/vmx.c                |   1 +
 arch/x86/kvm/x86.c                    |  35 +++
 arch/x86/mm/mem_encrypt.c             |  63 ++++-
 arch/x86/mm/pat/set_memory.c          |   7 +
 include/uapi/linux/kvm.h              |  13 ++
 include/uapi/linux/kvm_para.h         |   1 +
 19 files changed, 614 insertions(+), 1 deletion(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2020-12-01  0:45 [PATCH v2 0/9] Add AMD SEV page encryption bitmap support Ashish Kalra
@ 2020-12-01  0:45 ` Ashish Kalra
  2020-12-03  0:34   ` Sean Christopherson
  2020-12-01  0:46 ` [PATCH v2 2/9] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall Ashish Kalra
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Ashish Kalra @ 2020-12-01  0:45 UTC (permalink / raw)
  To: pbonzini
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

From: Brijesh Singh <brijesh.singh@amd.com>

KVM hypercall framework relies on alternative framework to patch the
VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
apply_alternative() is called then it defaults to VMCALL. The approach
works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
will be able to decode the instruction and do the right things. But
when SEV is active, guest memory is encrypted with guest key and
hypervisor will not be able to decode the instruction bytes.

Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
will be used by the SEV guest to notify encrypted pages to the hypervisor.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford <srutherford@google.com>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/kvm_para.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..bc1b11d057fc 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -85,6 +85,18 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 	return ret;
 }
 
+static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3)
+{
+	long ret;
+
+	asm volatile("vmmcall"
+		     : "=a"(ret)
+		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
+		     : "memory");
+	return ret;
+}
+
 #ifdef CONFIG_KVM_GUEST
 bool kvm_para_available(void);
 unsigned int kvm_arch_para_features(void);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 2/9] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall
  2020-12-01  0:45 [PATCH v2 0/9] Add AMD SEV page encryption bitmap support Ashish Kalra
  2020-12-01  0:45 ` [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3 Ashish Kalra
@ 2020-12-01  0:46 ` Ashish Kalra
  2020-12-02 16:54   ` Dr. David Alan Gilbert
  2020-12-01  0:47 ` [PATCH v2 3/9] KVM: x86: Introduce KVM_GET_PAGE_ENC_BITMAP ioctl Ashish Kalra
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Ashish Kalra @ 2020-12-01  0:46 UTC (permalink / raw)
  To: pbonzini
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

From: Brijesh Singh <brijesh.singh@amd.com>

This hypercall is used by the SEV guest to notify a change in the page
encryption status to the hypervisor. The hypercall should be invoked
only when the encryption attribute is changed from encrypted -> decrypted
and vice versa. By default all guest pages are considered encrypted.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 Documentation/virt/kvm/hypercalls.rst | 15 +++++
 arch/x86/include/asm/kvm_host.h       |  2 +
 arch/x86/kvm/svm/sev.c                | 90 +++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c                |  2 +
 arch/x86/kvm/svm/svm.h                |  4 ++
 arch/x86/kvm/vmx/vmx.c                |  1 +
 arch/x86/kvm/x86.c                    |  6 ++
 include/uapi/linux/kvm_para.h         |  1 +
 8 files changed, 121 insertions(+)

diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/hypercalls.rst
index ed4fddd364ea..7aff0cebab7c 100644
--- a/Documentation/virt/kvm/hypercalls.rst
+++ b/Documentation/virt/kvm/hypercalls.rst
@@ -169,3 +169,18 @@ a0: destination APIC ID
 
 :Usage example: When sending a call-function IPI-many to vCPUs, yield if
 	        any of the IPI target vCPUs was preempted.
+
+
+8. KVM_HC_PAGE_ENC_STATUS
+-------------------------
+:Architecture: x86
+:Status: active
+:Purpose: Notify the encryption status changes in guest page table (SEV guest)
+
+a0: the guest physical address of the start page
+a1: the number of pages
+a2: encryption attribute
+
+   Where:
+	* 1: Encryption attribute is set
+	* 0: Encryption attribute is cleared
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f002cdb13a0b..d035dc983a7a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1282,6 +1282,8 @@ struct kvm_x86_ops {
 
 	void (*migrate_timers)(struct kvm_vcpu *vcpu);
 	void (*msr_filter_changed)(struct kvm_vcpu *vcpu);
+	int (*page_enc_status_hc)(struct kvm *kvm, unsigned long gpa,
+				  unsigned long sz, unsigned long mode);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index c0b14106258a..6b8bc1297f9c 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -927,6 +927,93 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	return ret;
 }
 
+static int sev_resize_page_enc_bitmap(struct kvm *kvm, unsigned long new_size)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	unsigned long *map;
+	unsigned long sz;
+
+	if (sev->page_enc_bmap_size >= new_size)
+		return 0;
+
+	sz = ALIGN(new_size, BITS_PER_LONG) / 8;
+
+	map = vmalloc(sz);
+	if (!map) {
+		pr_err_once("Failed to allocate encrypted bitmap size %lx\n",
+				sz);
+		return -ENOMEM;
+	}
+
+	/* mark the page encrypted (by default) */
+	memset(map, 0xff, sz);
+
+	bitmap_copy(map, sev->page_enc_bmap, sev->page_enc_bmap_size);
+	kvfree(sev->page_enc_bmap);
+
+	sev->page_enc_bmap = map;
+	sev->page_enc_bmap_size = new_size;
+
+	return 0;
+}
+
+int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
+				  unsigned long npages, unsigned long enc)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	kvm_pfn_t pfn_start, pfn_end;
+	gfn_t gfn_start, gfn_end;
+
+	if (!sev_guest(kvm))
+		return -EINVAL;
+
+	if (!npages)
+		return 0;
+
+	gfn_start = gpa_to_gfn(gpa);
+	gfn_end = gfn_start + npages;
+
+	/* out of bound access error check */
+	if (gfn_end <= gfn_start)
+		return -EINVAL;
+
+	/* lets make sure that gpa exist in our memslot */
+	pfn_start = gfn_to_pfn(kvm, gfn_start);
+	pfn_end = gfn_to_pfn(kvm, gfn_end);
+
+	if (is_error_noslot_pfn(pfn_start) && !is_noslot_pfn(pfn_start)) {
+		/*
+		 * Allow guest MMIO range(s) to be added
+		 * to the page encryption bitmap.
+		 */
+		return -EINVAL;
+	}
+
+	if (is_error_noslot_pfn(pfn_end) && !is_noslot_pfn(pfn_end)) {
+		/*
+		 * Allow guest MMIO range(s) to be added
+		 * to the page encryption bitmap.
+		 */
+		return -EINVAL;
+	}
+
+	mutex_lock(&kvm->lock);
+
+	if (sev->page_enc_bmap_size < gfn_end)
+		goto unlock;
+
+	if (enc)
+		__bitmap_set(sev->page_enc_bmap, gfn_start,
+				gfn_end - gfn_start);
+	else
+		__bitmap_clear(sev->page_enc_bmap, gfn_start,
+				gfn_end - gfn_start);
+
+unlock:
+	mutex_unlock(&kvm->lock);
+	return 0;
+}
+
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -1123,6 +1210,9 @@ void sev_vm_destroy(struct kvm *kvm)
 
 	sev_unbind_asid(kvm, sev->handle);
 	sev_asid_free(sev->asid);
+
+	kvfree(sev->page_enc_bmap);
+	sev->page_enc_bmap = NULL;
 }
 
 int __init sev_hardware_setup(void)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 6dc337b9c231..7122ea5f7c47 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4312,6 +4312,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.apic_init_signal_blocked = svm_apic_init_signal_blocked,
 
 	.msr_filter_changed = svm_msr_filter_changed,
+
+	.page_enc_status_hc = svm_page_enc_status_hc,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index fdff76eb6ceb..0103a23ca174 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -66,6 +66,8 @@ struct kvm_sev_info {
 	int fd;			/* SEV device fd */
 	unsigned long pages_locked; /* Number of pages locked */
 	struct list_head regions_list;  /* List of registered regions */
+	unsigned long *page_enc_bmap;
+	unsigned long page_enc_bmap_size;
 };
 
 struct kvm_svm {
@@ -409,6 +411,8 @@ int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
 			       bool has_error_code, u32 error_code);
 int nested_svm_exit_special(struct vcpu_svm *svm);
 void sync_nested_vmcb_control(struct vcpu_svm *svm);
+int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
+                           unsigned long npages, unsigned long enc);
 
 extern struct kvm_x86_nested_ops svm_nested_ops;
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c3441e7e5a87..5bc37a38e6be 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7722,6 +7722,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 
 	.msr_filter_changed = vmx_msr_filter_changed,
 	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
+	.page_enc_status_hc = NULL,
 };
 
 static __init int hardware_setup(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3fdc16cfd6f..3afc78f18f69 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8125,6 +8125,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		kvm_sched_yield(vcpu->kvm, a0);
 		ret = 0;
 		break;
+	case KVM_HC_PAGE_ENC_STATUS:
+		ret = -KVM_ENOSYS;
+		if (kvm_x86_ops.page_enc_status_hc)
+			ret = kvm_x86_ops.page_enc_status_hc(vcpu->kvm,
+					a0, a1, a2);
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 8b86609849b9..847b83b75dc8 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -29,6 +29,7 @@
 #define KVM_HC_CLOCK_PAIRING		9
 #define KVM_HC_SEND_IPI		10
 #define KVM_HC_SCHED_YIELD		11
+#define KVM_HC_PAGE_ENC_STATUS		12
 
 /*
  * hypercalls use architecture specific
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 3/9] KVM: x86: Introduce KVM_GET_PAGE_ENC_BITMAP ioctl
  2020-12-01  0:45 [PATCH v2 0/9] Add AMD SEV page encryption bitmap support Ashish Kalra
  2020-12-01  0:45 ` [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3 Ashish Kalra
  2020-12-01  0:46 ` [PATCH v2 2/9] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall Ashish Kalra
@ 2020-12-01  0:47 ` Ashish Kalra
  2020-12-06 11:02   ` Dov Murik
  2020-12-01  0:47 ` [PATCH v2 4/9] mm: x86: Invoke hypercall when page encryption status is changed Ashish Kalra
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Ashish Kalra @ 2020-12-01  0:47 UTC (permalink / raw)
  To: pbonzini
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

From: Brijesh Singh <brijesh.singh@amd.com>

The ioctl can be used to retrieve page encryption bitmap for a given
gfn range.

Return the correct bitmap as per the number of pages being requested
by the user. Ensure that we only copy bmap->num_pages bytes in the
userspace buffer, if bmap->num_pages is not byte aligned we read
the trailing bits from the userspace and copy those bits as is.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 Documentation/virt/kvm/api.rst  | 27 +++++++++++++
 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/svm/sev.c          | 70 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  1 +
 arch/x86/kvm/svm/svm.h          |  1 +
 arch/x86/kvm/x86.c              | 12 ++++++
 include/uapi/linux/kvm.h        | 12 ++++++
 7 files changed, 125 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 70254eaa5229..ae410f4332ab 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4671,6 +4671,33 @@ This ioctl resets VCPU registers and control structures according to
 the clear cpu reset definition in the POP. However, the cpu is not put
 into ESA mode. This reset is a superset of the initial reset.
 
+4.125 KVM_GET_PAGE_ENC_BITMAP (vm ioctl)
+---------------------------------------
+
+:Capability: basic
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_page_enc_bitmap (in/out)
+:Returns: 0 on success, -1 on error
+
+/* for KVM_GET_PAGE_ENC_BITMAP */
+struct kvm_page_enc_bitmap {
+	__u64 start_gfn;
+	__u64 num_pages;
+	union {
+		void __user *enc_bitmap; /* one bit per page */
+		__u64 padding2;
+	};
+};
+
+The encrypted VMs have the concept of private and shared pages. The private
+pages are encrypted with the guest-specific key, while the shared pages may
+be encrypted with the hypervisor key. The KVM_GET_PAGE_ENC_BITMAP can
+be used to get the bitmap indicating whether the guest page is private
+or shared. The bitmap can be used during the guest migration. If the page
+is private then the userspace need to use SEV migration commands to transmit
+the page.
+
 
 4.125 KVM_S390_PV_COMMAND
 -------------------------
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d035dc983a7a..8c2e40199ecb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1284,6 +1284,8 @@ struct kvm_x86_ops {
 	void (*msr_filter_changed)(struct kvm_vcpu *vcpu);
 	int (*page_enc_status_hc)(struct kvm *kvm, unsigned long gpa,
 				  unsigned long sz, unsigned long mode);
+	int (*get_page_enc_bitmap)(struct kvm *kvm,
+				struct kvm_page_enc_bitmap *bmap);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 6b8bc1297f9c..a6586dd29767 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1014,6 +1014,76 @@ int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
 	return 0;
 }
 
+int svm_get_page_enc_bitmap(struct kvm *kvm,
+				   struct kvm_page_enc_bitmap *bmap)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	unsigned long gfn_start, gfn_end;
+	unsigned long sz, i, sz_bytes;
+	unsigned long *bitmap;
+	int ret, n;
+
+	if (!sev_guest(kvm))
+		return -ENOTTY;
+
+	gfn_start = bmap->start_gfn;
+	gfn_end = gfn_start + bmap->num_pages;
+
+	sz = ALIGN(bmap->num_pages, BITS_PER_LONG) / BITS_PER_BYTE;
+	bitmap = kmalloc(sz, GFP_KERNEL);
+	if (!bitmap)
+		return -ENOMEM;
+
+	/* by default all pages are marked encrypted */
+	memset(bitmap, 0xff, sz);
+
+	mutex_lock(&kvm->lock);
+	if (sev->page_enc_bmap) {
+		i = gfn_start;
+		for_each_clear_bit_from(i, sev->page_enc_bmap,
+				      min(sev->page_enc_bmap_size, gfn_end))
+			clear_bit(i - gfn_start, bitmap);
+	}
+	mutex_unlock(&kvm->lock);
+
+	ret = -EFAULT;
+
+	n = bmap->num_pages % BITS_PER_BYTE;
+	sz_bytes = ALIGN(bmap->num_pages, BITS_PER_BYTE) / BITS_PER_BYTE;
+
+	/*
+	 * Return the correct bitmap as per the number of pages being
+	 * requested by the user. Ensure that we only copy bmap->num_pages
+	 * bytes in the userspace buffer, if bmap->num_pages is not byte
+	 * aligned we read the trailing bits from the userspace and copy
+	 * those bits as is.
+	 */
+
+	if (n) {
+		unsigned char *bitmap_kernel = (unsigned char *)bitmap;
+		unsigned char bitmap_user;
+		unsigned long offset, mask;
+
+		offset = bmap->num_pages / BITS_PER_BYTE;
+		if (copy_from_user(&bitmap_user, bmap->enc_bitmap + offset,
+				sizeof(unsigned char)))
+			goto out;
+
+		mask = GENMASK(n - 1, 0);
+		bitmap_user &= ~mask;
+		bitmap_kernel[offset] &= mask;
+		bitmap_kernel[offset] |= bitmap_user;
+	}
+
+	if (copy_to_user(bmap->enc_bitmap, bitmap, sz_bytes))
+		goto out;
+
+	ret = 0;
+out:
+	kfree(bitmap);
+	return ret;
+}
+
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7122ea5f7c47..bff89cab3ed0 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4314,6 +4314,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.msr_filter_changed = svm_msr_filter_changed,
 
 	.page_enc_status_hc = svm_page_enc_status_hc,
+	.get_page_enc_bitmap = svm_get_page_enc_bitmap,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 0103a23ca174..4ce73f1034b9 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -413,6 +413,7 @@ int nested_svm_exit_special(struct vcpu_svm *svm);
 void sync_nested_vmcb_control(struct vcpu_svm *svm);
 int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
                            unsigned long npages, unsigned long enc);
+int svm_get_page_enc_bitmap(struct kvm *kvm, struct kvm_page_enc_bitmap *bmap);
 
 extern struct kvm_x86_nested_ops svm_nested_ops;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3afc78f18f69..d3cb95a4dd55 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5695,6 +5695,18 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	case KVM_X86_SET_MSR_FILTER:
 		r = kvm_vm_ioctl_set_msr_filter(kvm, argp);
 		break;
+	case KVM_GET_PAGE_ENC_BITMAP: {
+		struct kvm_page_enc_bitmap bitmap;
+
+		r = -EFAULT;
+		if (copy_from_user(&bitmap, argp, sizeof(bitmap)))
+			goto out;
+
+		r = -ENOTTY;
+		if (kvm_x86_ops.get_page_enc_bitmap)
+			r = kvm_x86_ops.get_page_enc_bitmap(kvm, &bitmap);
+		break;
+	}
 	default:
 		r = -ENOTTY;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 886802b8ffba..d0b9171bdb03 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -532,6 +532,16 @@ struct kvm_dirty_log {
 	};
 };
 
+/* for KVM_GET_PAGE_ENC_BITMAP */
+struct kvm_page_enc_bitmap {
+	__u64 start_gfn;
+	__u64 num_pages;
+	union {
+		void __user *enc_bitmap; /* one bit per page */
+		__u64 padding2;
+	};
+};
+
 /* for KVM_CLEAR_DIRTY_LOG */
 struct kvm_clear_dirty_log {
 	__u32 slot;
@@ -1563,6 +1573,8 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_DIRTY_LOG_RING */
 #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
 
+#define KVM_GET_PAGE_ENC_BITMAP	_IOW(KVMIO, 0xc6, struct kvm_page_enc_bitmap)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 4/9] mm: x86: Invoke hypercall when page encryption status is changed.
  2020-12-01  0:45 [PATCH v2 0/9] Add AMD SEV page encryption bitmap support Ashish Kalra
                   ` (2 preceding siblings ...)
  2020-12-01  0:47 ` [PATCH v2 3/9] KVM: x86: Introduce KVM_GET_PAGE_ENC_BITMAP ioctl Ashish Kalra
@ 2020-12-01  0:47 ` Ashish Kalra
  2020-12-01  0:47 ` [PATCH v2 5/9] KVM: x86: Introduce KVM_SET_PAGE_ENC_BITMAP ioctl Ashish Kalra
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Ashish Kalra @ 2020-12-01  0:47 UTC (permalink / raw)
  To: pbonzini
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

From: Brijesh Singh <brijesh.singh@amd.com>

Invoke a hypercall when a memory region is changed from encrypted ->
decrypted and vice versa. Hypervisor needs to know the page encryption
status during the guest migration.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/paravirt.h       | 10 +++++
 arch/x86/include/asm/paravirt_types.h |  2 +
 arch/x86/kernel/paravirt.c            |  1 +
 arch/x86/mm/mem_encrypt.c             | 57 ++++++++++++++++++++++++++-
 arch/x86/mm/pat/set_memory.c          |  7 ++++
 5 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d25cc6830e89..7aeb7c508c53 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,12 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 	PVOP_VCALL1(mmu.exit_mmap, mm);
 }
 
+static inline void page_encryption_changed(unsigned long vaddr, int npages,
+						bool enc)
+{
+	PVOP_VCALL3(mmu.page_encryption_changed, vaddr, npages, enc);
+}
+
 #ifdef CONFIG_PARAVIRT_XXL
 static inline void load_sp0(unsigned long sp0)
 {
@@ -840,6 +846,10 @@ static inline void paravirt_arch_dup_mmap(struct mm_struct *oldmm,
 static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 {
 }
+
+static inline void page_encryption_changed(unsigned long vaddr, int npages, bool enc)
+{
+}
 #endif
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_PARAVIRT_H */
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0fad9f61c76a..d7787ec4d19f 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -209,6 +209,8 @@ struct pv_mmu_ops {
 
 	/* Hook for intercepting the destruction of an mm_struct. */
 	void (*exit_mmap)(struct mm_struct *mm);
+	void (*page_encryption_changed)(unsigned long vaddr, int npages,
+					bool enc);
 
 #ifdef CONFIG_PARAVIRT_XXL
 	struct paravirt_callee_save read_cr2;
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 6c3407ba6ee9..52913356b6fa 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -340,6 +340,7 @@ struct paravirt_patch_template pv_ops = {
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,
 
 	.mmu.exit_mmap		= paravirt_nop,
+	.mmu.page_encryption_changed	= paravirt_nop,
 
 #ifdef CONFIG_PARAVIRT_XXL
 	.mmu.read_cr2		= __PV_IS_CALLEE_SAVE(native_read_cr2),
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index bc0833713be9..9d1ac65050d0 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -19,6 +19,7 @@
 #include <linux/kernel.h>
 #include <linux/bitops.h>
 #include <linux/dma-mapping.h>
+#include <linux/kvm_para.h>
 
 #include <asm/tlbflush.h>
 #include <asm/fixmap.h>
@@ -29,6 +30,7 @@
 #include <asm/processor-flags.h>
 #include <asm/msr.h>
 #include <asm/cmdline.h>
+#include <asm/kvm_para.h>
 
 #include "mm_internal.h"
 
@@ -198,6 +200,47 @@ void __init sme_early_init(void)
 		swiotlb_force = SWIOTLB_FORCE;
 }
 
+static void set_memory_enc_dec_hypercall(unsigned long vaddr, int npages,
+					bool enc)
+{
+	unsigned long sz = npages << PAGE_SHIFT;
+	unsigned long vaddr_end, vaddr_next;
+
+	vaddr_end = vaddr + sz;
+
+	for (; vaddr < vaddr_end; vaddr = vaddr_next) {
+		int psize, pmask, level;
+		unsigned long pfn;
+		pte_t *kpte;
+
+		kpte = lookup_address(vaddr, &level);
+		if (!kpte || pte_none(*kpte))
+			return;
+
+		switch (level) {
+		case PG_LEVEL_4K:
+			pfn = pte_pfn(*kpte);
+			break;
+		case PG_LEVEL_2M:
+			pfn = pmd_pfn(*(pmd_t *)kpte);
+			break;
+		case PG_LEVEL_1G:
+			pfn = pud_pfn(*(pud_t *)kpte);
+			break;
+		default:
+			return;
+		}
+
+		psize = page_level_size(level);
+		pmask = page_level_mask(level);
+
+		kvm_sev_hypercall3(KVM_HC_PAGE_ENC_STATUS,
+				   pfn << PAGE_SHIFT, psize >> PAGE_SHIFT, enc);
+
+		vaddr_next = (vaddr & pmask) + psize;
+	}
+}
+
 static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
 {
 	pgprot_t old_prot, new_prot;
@@ -255,12 +298,13 @@ static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
 static int __init early_set_memory_enc_dec(unsigned long vaddr,
 					   unsigned long size, bool enc)
 {
-	unsigned long vaddr_end, vaddr_next;
+	unsigned long vaddr_end, vaddr_next, start;
 	unsigned long psize, pmask;
 	int split_page_size_mask;
 	int level, ret;
 	pte_t *kpte;
 
+	start = vaddr;
 	vaddr_next = vaddr;
 	vaddr_end = vaddr + size;
 
@@ -315,6 +359,8 @@ static int __init early_set_memory_enc_dec(unsigned long vaddr,
 
 	ret = 0;
 
+	set_memory_enc_dec_hypercall(start, PAGE_ALIGN(size) >> PAGE_SHIFT,
+					enc);
 out:
 	__flush_tlb_all();
 	return ret;
@@ -448,6 +494,15 @@ void __init mem_encrypt_init(void)
 	if (sev_active())
 		static_branch_enable(&sev_enable_key);
 
+#ifdef CONFIG_PARAVIRT
+	/*
+	 * With SEV, we need to make a hypercall when page encryption state is
+	 * changed.
+	 */
+	if (sev_active())
+		pv_ops.mmu.page_encryption_changed = set_memory_enc_dec_hypercall;
+#endif
+
 	print_mem_encrypt_feature_info();
 }
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 40baa90e74f4..dcd4557bb7fa 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -27,6 +27,7 @@
 #include <asm/proto.h>
 #include <asm/memtype.h>
 #include <asm/set_memory.h>
+#include <asm/paravirt.h>
 
 #include "../mm_internal.h"
 
@@ -2012,6 +2013,12 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	 */
 	cpa_flush(&cpa, 0);
 
+	/* Notify hypervisor that a given memory range is mapped encrypted
+	 * or decrypted. The hypervisor will use this information during the
+	 * VM migration.
+	 */
+	page_encryption_changed(addr, numpages, enc);
+
 	return ret;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 5/9] KVM: x86: Introduce KVM_SET_PAGE_ENC_BITMAP ioctl
  2020-12-01  0:45 [PATCH v2 0/9] Add AMD SEV page encryption bitmap support Ashish Kalra
                   ` (3 preceding siblings ...)
  2020-12-01  0:47 ` [PATCH v2 4/9] mm: x86: Invoke hypercall when page encryption status is changed Ashish Kalra
@ 2020-12-01  0:47 ` Ashish Kalra
  2020-12-01  0:47 ` [PATCH v2 6/9] KVM: SVM: Add support for static allocation of unified Page Encryption Bitmap Ashish Kalra
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Ashish Kalra @ 2020-12-01  0:47 UTC (permalink / raw)
  To: pbonzini
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

From: Brijesh Singh <brijesh.singh@amd.com>

The ioctl can be used to set page encryption bitmap for an
incoming guest.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 Documentation/virt/kvm/api.rst  | 44 +++++++++++++++++++++++++++++
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/svm/sev.c          | 50 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  1 +
 arch/x86/kvm/svm/svm.h          |  1 +
 arch/x86/kvm/x86.c              | 12 ++++++++
 include/uapi/linux/kvm.h        |  1 +
 7 files changed, 111 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ae410f4332ab..1a3336cbbfe8 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4698,6 +4698,28 @@ or shared. The bitmap can be used during the guest migration. If the page
 is private then the userspace need to use SEV migration commands to transmit
 the page.
 
+4.126 KVM_SET_PAGE_ENC_BITMAP (vm ioctl)
+---------------------------------------
+
+:Capability: basic
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_page_enc_bitmap (in/out)
+:Returns: 0 on success, -1 on error
+
+/* for KVM_SET_PAGE_ENC_BITMAP */
+struct kvm_page_enc_bitmap {
+	__u64 start_gfn;
+	__u64 num_pages;
+	union {
+		void __user *enc_bitmap; /* one bit per page */
+		__u64 padding2;
+	};
+};
+
+During the guest live migration the outgoing guest exports its page encryption
+bitmap, the KVM_SET_PAGE_ENC_BITMAP can be used to build the page encryption
+bitmap for an incoming guest.
 
 4.125 KVM_S390_PV_COMMAND
 -------------------------
@@ -4852,6 +4874,28 @@ into user space.
 If a vCPU is in running state while this ioctl is invoked, the vCPU may
 experience inconsistent filtering behavior on MSR accesses.
 
+4.126 KVM_SET_PAGE_ENC_BITMAP (vm ioctl)
+---------------------------------------
+
+:Capability: basic
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_page_enc_bitmap (in/out)
+:Returns: 0 on success, -1 on error
+
+/* for KVM_SET_PAGE_ENC_BITMAP */
+struct kvm_page_enc_bitmap {
+	__u64 start_gfn;
+	__u64 num_pages;
+	union {
+		void __user *enc_bitmap; /* one bit per page */
+		__u64 padding2;
+	};
+};
+
+During the guest live migration the outgoing guest exports its page encryption
+bitmap, the KVM_SET_PAGE_ENC_BITMAP can be used to build the page encryption
+bitmap for an incoming guest.
 
 5. The kvm_run structure
 ========================
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8c2e40199ecb..352ebc576036 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1286,6 +1286,8 @@ struct kvm_x86_ops {
 				  unsigned long sz, unsigned long mode);
 	int (*get_page_enc_bitmap)(struct kvm *kvm,
 				struct kvm_page_enc_bitmap *bmap);
+	int (*set_page_enc_bitmap)(struct kvm *kvm,
+				struct kvm_page_enc_bitmap *bmap);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index a6586dd29767..e99ea9c711de 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1084,6 +1084,56 @@ int svm_get_page_enc_bitmap(struct kvm *kvm,
 	return ret;
 }
 
+int svm_set_page_enc_bitmap(struct kvm *kvm,
+				   struct kvm_page_enc_bitmap *bmap)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	unsigned long gfn_start, gfn_end;
+	unsigned long *bitmap;
+	unsigned long sz;
+	int ret;
+
+	if (!sev_guest(kvm))
+		return -ENOTTY;
+	/* special case of resetting the complete bitmap */
+	if (!bmap->enc_bitmap) {
+		mutex_lock(&kvm->lock);
+		/* by default all pages are marked encrypted */
+		if (sev->page_enc_bmap_size)
+			bitmap_fill(sev->page_enc_bmap,
+				    sev->page_enc_bmap_size);
+		mutex_unlock(&kvm->lock);
+		return 0;
+	}
+
+	gfn_start = bmap->start_gfn;
+	gfn_end = gfn_start + bmap->num_pages;
+
+	sz = ALIGN(bmap->num_pages, BITS_PER_LONG) / 8;
+	bitmap = kmalloc(sz, GFP_KERNEL);
+	if (!bitmap)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(bitmap, bmap->enc_bitmap, sz))
+		goto out;
+
+	mutex_lock(&kvm->lock);
+	ret = sev_resize_page_enc_bitmap(kvm, gfn_end);
+	if (ret)
+		goto unlock;
+
+	bitmap_copy(sev->page_enc_bmap + BIT_WORD(gfn_start), bitmap,
+		    (gfn_end - gfn_start));
+
+	ret = 0;
+unlock:
+	mutex_unlock(&kvm->lock);
+out:
+	kfree(bitmap);
+	return ret;
+}
+
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index bff89cab3ed0..6ebdf20773ea 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4315,6 +4315,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 
 	.page_enc_status_hc = svm_page_enc_status_hc,
 	.get_page_enc_bitmap = svm_get_page_enc_bitmap,
+	.set_page_enc_bitmap = svm_set_page_enc_bitmap,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 4ce73f1034b9..2268c0ab650b 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -414,6 +414,7 @@ void sync_nested_vmcb_control(struct vcpu_svm *svm);
 int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
                            unsigned long npages, unsigned long enc);
 int svm_get_page_enc_bitmap(struct kvm *kvm, struct kvm_page_enc_bitmap *bmap);
+int svm_set_page_enc_bitmap(struct kvm *kvm, struct kvm_page_enc_bitmap *bmap);
 
 extern struct kvm_x86_nested_ops svm_nested_ops;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d3cb95a4dd55..3cf64a94004f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5707,6 +5707,18 @@ long kvm_arch_vm_ioctl(struct file *filp,
 			r = kvm_x86_ops.get_page_enc_bitmap(kvm, &bitmap);
 		break;
 	}
+	case KVM_SET_PAGE_ENC_BITMAP: {
+		struct kvm_page_enc_bitmap bitmap;
+
+		r = -EFAULT;
+		if (copy_from_user(&bitmap, argp, sizeof(bitmap)))
+			goto out;
+
+		r = -ENOTTY;
+		if (kvm_x86_ops.set_page_enc_bitmap)
+			r = kvm_x86_ops.set_page_enc_bitmap(kvm, &bitmap);
+		break;
+	}
 	default:
 		r = -ENOTTY;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d0b9171bdb03..8e1adcd598a8 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1574,6 +1574,7 @@ struct kvm_pv_cmd {
 #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
 
 #define KVM_GET_PAGE_ENC_BITMAP	_IOW(KVMIO, 0xc6, struct kvm_page_enc_bitmap)
+#define KVM_SET_PAGE_ENC_BITMAP	_IOW(KVMIO, 0xc7, struct kvm_page_enc_bitmap)
 
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 6/9] KVM: SVM: Add support for static allocation of unified Page Encryption Bitmap.
  2020-12-01  0:45 [PATCH v2 0/9] Add AMD SEV page encryption bitmap support Ashish Kalra
                   ` (4 preceding siblings ...)
  2020-12-01  0:47 ` [PATCH v2 5/9] KVM: x86: Introduce KVM_SET_PAGE_ENC_BITMAP ioctl Ashish Kalra
@ 2020-12-01  0:47 ` Ashish Kalra
  2020-12-01  0:48 ` [PATCH v2 7/9] KVM: x86: Mark _bss_decrypted section variables as decrypted in page encryption bitmap Ashish Kalra
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Ashish Kalra @ 2020-12-01  0:47 UTC (permalink / raw)
  To: pbonzini
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

From: Ashish Kalra <ashish.kalra@amd.com>

Add support for static allocation of the unified Page encryption bitmap by
extending kvm_arch_commit_memory_region() callack to add svm specific x86_ops
which can read the userspace provided memory region/memslots and calculate
the amount of guest RAM managed by the KVM and grow the bitmap based
on that information, i.e. the highest guest PA that is mapped by a memslot.

Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/svm/sev.c          | 35 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  1 +
 arch/x86/kvm/svm/svm.h          |  1 +
 arch/x86/kvm/x86.c              |  5 +++++
 5 files changed, 43 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 352ebc576036..91fc22d793e8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1282,6 +1282,7 @@ struct kvm_x86_ops {
 
 	void (*migrate_timers)(struct kvm_vcpu *vcpu);
 	void (*msr_filter_changed)(struct kvm_vcpu *vcpu);
+	void (*commit_memory_region)(struct kvm *kvm, enum kvm_mr_change change);
 	int (*page_enc_status_hc)(struct kvm *kvm, unsigned long gpa,
 				  unsigned long sz, unsigned long mode);
 	int (*get_page_enc_bitmap)(struct kvm *kvm,
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e99ea9c711de..8b089cef1eba 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -957,6 +957,41 @@ static int sev_resize_page_enc_bitmap(struct kvm *kvm, unsigned long new_size)
 	return 0;
 }
 
+void svm_commit_memory_region(struct kvm *kvm, enum kvm_mr_change change)
+{
+	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;
+	gfn_t start, end = 0;
+
+	spin_lock(&kvm->mmu_lock);
+	if (change == KVM_MR_CREATE) {
+		slots = kvm_memslots(kvm);
+		kvm_for_each_memslot(memslot, slots) {
+			start = memslot->base_gfn;
+			end = memslot->base_gfn + memslot->npages;
+			/*
+			 * KVM memslots is a sorted list, starting with
+			 * the highest mapped guest PA, so pick the topmost
+			 * valid guest PA.
+			 */
+			if (memslot->npages)
+				break;
+		}
+	}
+	spin_unlock(&kvm->mmu_lock);
+
+	if (end) {
+		/*
+		 * NORE: This callback is invoked in vm ioctl
+		 * set_user_memory_region, hence we can use a
+		 * mutex here.
+		 */
+		mutex_lock(&kvm->lock);
+		sev_resize_page_enc_bitmap(kvm, end);
+		mutex_unlock(&kvm->lock);
+	}
+}
+
 int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
 				  unsigned long npages, unsigned long enc)
 {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 6ebdf20773ea..7aa7858c8209 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4313,6 +4313,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 
 	.msr_filter_changed = svm_msr_filter_changed,
 
+	.commit_memory_region = svm_commit_memory_region,
 	.page_enc_status_hc = svm_page_enc_status_hc,
 	.get_page_enc_bitmap = svm_get_page_enc_bitmap,
 	.set_page_enc_bitmap = svm_set_page_enc_bitmap,
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 2268c0ab650b..5a4656bad681 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -415,6 +415,7 @@ int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
                            unsigned long npages, unsigned long enc);
 int svm_get_page_enc_bitmap(struct kvm *kvm, struct kvm_page_enc_bitmap *bmap);
 int svm_set_page_enc_bitmap(struct kvm *kvm, struct kvm_page_enc_bitmap *bmap);
+void svm_commit_memory_region(struct kvm *kvm, enum kvm_mr_change change);
 
 extern struct kvm_x86_nested_ops svm_nested_ops;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3cf64a94004f..c1acbd397b50 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10717,6 +10717,11 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	/* Free the arrays associated with the old memslot. */
 	if (change == KVM_MR_MOVE)
 		kvm_arch_free_memslot(kvm, old);
+
+	if (change == KVM_MR_CREATE || change == KVM_MR_DELETE) {
+		if (kvm_x86_ops.commit_memory_region)
+			kvm_x86_ops.commit_memory_region(kvm, change);
+	}
 }
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 7/9] KVM: x86: Mark _bss_decrypted section variables as decrypted in page encryption bitmap.
  2020-12-01  0:45 [PATCH v2 0/9] Add AMD SEV page encryption bitmap support Ashish Kalra
                   ` (5 preceding siblings ...)
  2020-12-01  0:47 ` [PATCH v2 6/9] KVM: SVM: Add support for static allocation of unified Page Encryption Bitmap Ashish Kalra
@ 2020-12-01  0:48 ` Ashish Kalra
  2020-12-01  0:48 ` [PATCH v2 8/9] KVM: x86: Add kexec support for SEV " Ashish Kalra
  2020-12-01  0:48 ` [PATCH v2 9/9] KVM: SVM: Bypass DBG_DECRYPT API calls for unecrypted guest memory Ashish Kalra
  8 siblings, 0 replies; 38+ messages in thread
From: Ashish Kalra @ 2020-12-01  0:48 UTC (permalink / raw)
  To: pbonzini
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

From: Ashish Kalra <ashish.kalra@amd.com>

Ensure that _bss_decrypted section variables such as hv_clock_boot and
wall_clock are marked as decrypted in the page encryption bitmap if
sev guest is active.

Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/mem_encrypt.h |  4 ++++
 arch/x86/kernel/kvmclock.c         | 12 ++++++++++++
 arch/x86/mm/mem_encrypt.c          |  6 ++++++
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index 2f62bbdd9d12..a4fd6a4229eb 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -43,6 +43,8 @@ void __init sme_enable(struct boot_params *bp);
 
 int __init early_set_memory_decrypted(unsigned long vaddr, unsigned long size);
 int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size);
+void __init early_set_mem_enc_dec_hypercall(unsigned long vaddr, int npages,
+					    bool enc);
 
 void __init mem_encrypt_free_decrypted_mem(void);
 
@@ -82,6 +84,8 @@ static inline int __init
 early_set_memory_decrypted(unsigned long vaddr, unsigned long size) { return 0; }
 static inline int __init
 early_set_memory_encrypted(unsigned long vaddr, unsigned long size) { return 0; }
+static inline void __init
+early_set_mem_enc_dec_hypercall(unsigned long vaddr, int npages, bool enc) {}
 
 static inline void mem_encrypt_free_decrypted_mem(void) { }
 
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index aa593743acf6..94a4fbf80e44 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -333,6 +333,18 @@ void __init kvmclock_init(void)
 	pr_info("kvm-clock: Using msrs %x and %x",
 		msr_kvm_system_time, msr_kvm_wall_clock);
 
+	if (sev_active()) {
+		unsigned long nr_pages;
+		/*
+		 * sizeof(hv_clock_boot) is already PAGE_SIZE aligned
+		 */
+		early_set_mem_enc_dec_hypercall((unsigned long)hv_clock_boot,
+						1, 0);
+		nr_pages = DIV_ROUND_UP(sizeof(wall_clock), PAGE_SIZE);
+		early_set_mem_enc_dec_hypercall((unsigned long)&wall_clock,
+						nr_pages, 0);
+	}
+
 	this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
 	kvm_register_clock("primary cpu clock");
 	pvclock_set_pvti_cpu0_va(hv_clock_boot);
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 9d1ac65050d0..1bcfbcd2bfd7 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -376,6 +376,12 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size)
 	return early_set_memory_enc_dec(vaddr, size, true);
 }
 
+void __init early_set_mem_enc_dec_hypercall(unsigned long vaddr, int npages,
+					    bool enc)
+{
+	set_memory_enc_dec_hypercall(vaddr, npages, enc);
+}
+
 /*
  * SME and SEV are very similar but they are not the same, so there are
  * times that the kernel will need to distinguish between SME and SEV. The
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 8/9] KVM: x86: Add kexec support for SEV page encryption bitmap.
  2020-12-01  0:45 [PATCH v2 0/9] Add AMD SEV page encryption bitmap support Ashish Kalra
                   ` (6 preceding siblings ...)
  2020-12-01  0:48 ` [PATCH v2 7/9] KVM: x86: Mark _bss_decrypted section variables as decrypted in page encryption bitmap Ashish Kalra
@ 2020-12-01  0:48 ` Ashish Kalra
  2020-12-01  0:48 ` [PATCH v2 9/9] KVM: SVM: Bypass DBG_DECRYPT API calls for unecrypted guest memory Ashish Kalra
  8 siblings, 0 replies; 38+ messages in thread
From: Ashish Kalra @ 2020-12-01  0:48 UTC (permalink / raw)
  To: pbonzini
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

From: Ashish Kalra <ashish.kalra@amd.com>

Reset the host's page encryption bitmap related to kernel
specific page encryption status settings before we load a
new kernel by kexec. We cannot reset the complete
page encryption bitmap here as we need to retain the
UEFI/OVMF firmware specific settings.

The host's page encryption bitmap is maintained for the
guest to keep the encrypted/decrypted state of the guest pages,
therefore we need to explicitly mark all shared pages as
encrypted again before rebooting into the new guest kernel.

Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/kernel/kvm.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7f57ede3cb8e..55d845e025b2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -38,6 +38,7 @@
 #include <asm/cpuidle_haltpoll.h>
 #include <asm/ptrace.h>
 #include <asm/svm.h>
+#include <asm/e820/api.h>
 
 DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
 
@@ -383,6 +384,33 @@ static void kvm_pv_guest_cpu_reboot(void *unused)
 	 */
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
 		wrmsrl(MSR_KVM_PV_EOI_EN, 0);
+	/*
+	 * Reset the host's page encryption bitmap related to kernel
+	 * specific page encryption status settings before we load a
+	 * new kernel by kexec. NOTE: We cannot reset the complete
+	 * page encryption bitmap here as we need to retain the
+	 * UEFI/OVMF firmware specific settings.
+	 */
+	if (sev_active() & (smp_processor_id() == 0)) {
+		int i;
+		unsigned long nr_pages;
+
+		for (i = 0; i < e820_table->nr_entries; i++) {
+			struct e820_entry *entry = &e820_table->entries[i];
+			unsigned long start_pfn;
+			unsigned long end_pfn;
+
+			if (entry->type != E820_TYPE_RAM)
+				continue;
+
+			start_pfn = entry->addr >> PAGE_SHIFT;
+			end_pfn = (entry->addr + entry->size) >> PAGE_SHIFT;
+			nr_pages = DIV_ROUND_UP(entry->size, PAGE_SIZE);
+
+			kvm_sev_hypercall3(KVM_HC_PAGE_ENC_STATUS,
+					   entry->addr, nr_pages, 1);
+		}
+	}
 	kvm_pv_disable_apf();
 	kvm_disable_steal_time();
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 9/9] KVM: SVM: Bypass DBG_DECRYPT API calls for unecrypted guest memory.
  2020-12-01  0:45 [PATCH v2 0/9] Add AMD SEV page encryption bitmap support Ashish Kalra
                   ` (7 preceding siblings ...)
  2020-12-01  0:48 ` [PATCH v2 8/9] KVM: x86: Add kexec support for SEV " Ashish Kalra
@ 2020-12-01  0:48 ` Ashish Kalra
  8 siblings, 0 replies; 38+ messages in thread
From: Ashish Kalra @ 2020-12-01  0:48 UTC (permalink / raw)
  To: pbonzini
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

From: Ashish Kalra <ashish.kalra@amd.com>

For all explicitly unecrypted guest memory regions such as S/W IOTLB
bounce buffers, dma_decrypted() allocated regions and for guest regions
marked as "__bss_decrypted", ensure that DBG_DECRYPT API calls are
bypassed for such regions. The guest memory regions encryption status
is referenced using the page encryption bitmap.

Uses the two added infrastructure functions hva_to_memslot() and
hva_to_gfn().

Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/kvm/svm/sev.c | 76 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 8b089cef1eba..2524a47531ee 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -763,6 +763,37 @@ static int __sev_dbg_encrypt_user(struct kvm *kvm, unsigned long paddr,
 	return ret;
 }
 
+static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm,
+					      unsigned long hva)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *memslot;
+
+	kvm_for_each_memslot(memslot, slots) {
+		if (hva >= memslot->userspace_addr &&
+		    hva < memslot->userspace_addr +
+			      (memslot->npages << PAGE_SHIFT))
+			return memslot;
+	}
+
+	return NULL;
+}
+
+static bool hva_to_gfn(struct kvm *kvm, unsigned long hva, gfn_t *gfn)
+{
+	struct kvm_memory_slot *memslot;
+	gpa_t gpa_offset;
+
+	memslot = hva_to_memslot(kvm, hva);
+	if (!memslot)
+		return false;
+
+	gpa_offset = hva - memslot->userspace_addr;
+	*gfn = ((memslot->base_gfn << PAGE_SHIFT) + gpa_offset) >> PAGE_SHIFT;
+
+	return true;
+}
+
 static int sev_dbg_crypt(struct kvm *kvm, struct kvm_sev_cmd *argp, bool dec)
 {
 	unsigned long vaddr, vaddr_end, next_vaddr;
@@ -792,6 +823,50 @@ static int sev_dbg_crypt(struct kvm *kvm, struct kvm_sev_cmd *argp, bool dec)
 	for (; vaddr < vaddr_end; vaddr = next_vaddr) {
 		int len, s_off, d_off;
 
+		if (dec) {
+			struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+			struct page *src_tpage = NULL;
+			gfn_t gfn_start;
+			int srcu_idx;
+
+			/* ensure hva_to_gfn translations remain valid */
+			srcu_idx = srcu_read_lock(&kvm->srcu);
+			if (!hva_to_gfn(kvm, vaddr, &gfn_start)) {
+				srcu_read_unlock(&kvm->srcu, srcu_idx);
+				return -EINVAL;
+			}
+			if (sev->page_enc_bmap) {
+				if (!test_bit(gfn_start, sev->page_enc_bmap)) {
+					src_tpage = alloc_page(GFP_KERNEL);
+					if (!src_tpage) {
+						srcu_read_unlock(&kvm->srcu, srcu_idx);
+						return -ENOMEM;
+					}
+					/*
+					 * Since user buffer may not be page aligned, calculate the
+					 * offset within the page.
+					*/
+					s_off = vaddr & ~PAGE_MASK;
+					d_off = dst_vaddr & ~PAGE_MASK;
+					len = min_t(size_t, (PAGE_SIZE - s_off), size);
+
+					if (copy_from_user(page_address(src_tpage),
+							   (void __user *)(uintptr_t)vaddr, len)) {
+						__free_page(src_tpage);
+						srcu_read_unlock(&kvm->srcu, srcu_idx);
+						return -EFAULT;
+					}
+					if (copy_to_user((void __user *)(uintptr_t)dst_vaddr,
+							 page_address(src_tpage), len)) {
+						ret = -EFAULT;
+					}
+					__free_page(src_tpage);
+					srcu_read_unlock(&kvm->srcu, srcu_idx);
+					goto already_decrypted;
+				}
+			}
+		}
+
 		/* lock userspace source and destination page */
 		src_p = sev_pin_memory(kvm, vaddr & PAGE_MASK, PAGE_SIZE, &n, 0);
 		if (IS_ERR(src_p))
@@ -836,6 +911,7 @@ static int sev_dbg_crypt(struct kvm *kvm, struct kvm_sev_cmd *argp, bool dec)
 		sev_unpin_memory(kvm, src_p, n);
 		sev_unpin_memory(kvm, dst_p, n);
 
+already_decrypted:
 		if (ret)
 			goto err;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 2/9] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall
  2020-12-01  0:46 ` [PATCH v2 2/9] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall Ashish Kalra
@ 2020-12-02 16:54   ` Dr. David Alan Gilbert
  2020-12-02 21:22     ` Ashish Kalra
  0 siblings, 1 reply; 38+ messages in thread
From: Dr. David Alan Gilbert @ 2020-12-02 16:54 UTC (permalink / raw)
  To: Ashish Kalra
  Cc: pbonzini, tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh

* Ashish Kalra (Ashish.Kalra@amd.com) wrote:
> From: Brijesh Singh <brijesh.singh@amd.com>
> 
> This hypercall is used by the SEV guest to notify a change in the page
> encryption status to the hypervisor. The hypercall should be invoked
> only when the encryption attribute is changed from encrypted -> decrypted
> and vice versa. By default all guest pages are considered encrypted.

Is it defined whether these are supposed to be called before or after
the the page type has been changed; is it change the type and then
notify or the other way around?

Dave

> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: x86@kernel.org
> Cc: kvm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
> ---
>  Documentation/virt/kvm/hypercalls.rst | 15 +++++
>  arch/x86/include/asm/kvm_host.h       |  2 +
>  arch/x86/kvm/svm/sev.c                | 90 +++++++++++++++++++++++++++
>  arch/x86/kvm/svm/svm.c                |  2 +
>  arch/x86/kvm/svm/svm.h                |  4 ++
>  arch/x86/kvm/vmx/vmx.c                |  1 +
>  arch/x86/kvm/x86.c                    |  6 ++
>  include/uapi/linux/kvm_para.h         |  1 +
>  8 files changed, 121 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/hypercalls.rst
> index ed4fddd364ea..7aff0cebab7c 100644
> --- a/Documentation/virt/kvm/hypercalls.rst
> +++ b/Documentation/virt/kvm/hypercalls.rst
> @@ -169,3 +169,18 @@ a0: destination APIC ID
>  
>  :Usage example: When sending a call-function IPI-many to vCPUs, yield if
>  	        any of the IPI target vCPUs was preempted.
> +
> +
> +8. KVM_HC_PAGE_ENC_STATUS
> +-------------------------
> +:Architecture: x86
> +:Status: active
> +:Purpose: Notify the encryption status changes in guest page table (SEV guest)
> +
> +a0: the guest physical address of the start page
> +a1: the number of pages
> +a2: encryption attribute
> +
> +   Where:
> +	* 1: Encryption attribute is set
> +	* 0: Encryption attribute is cleared
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f002cdb13a0b..d035dc983a7a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1282,6 +1282,8 @@ struct kvm_x86_ops {
>  
>  	void (*migrate_timers)(struct kvm_vcpu *vcpu);
>  	void (*msr_filter_changed)(struct kvm_vcpu *vcpu);
> +	int (*page_enc_status_hc)(struct kvm *kvm, unsigned long gpa,
> +				  unsigned long sz, unsigned long mode);
>  };
>  
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index c0b14106258a..6b8bc1297f9c 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -927,6 +927,93 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  	return ret;
>  }
>  
> +static int sev_resize_page_enc_bitmap(struct kvm *kvm, unsigned long new_size)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	unsigned long *map;
> +	unsigned long sz;
> +
> +	if (sev->page_enc_bmap_size >= new_size)
> +		return 0;
> +
> +	sz = ALIGN(new_size, BITS_PER_LONG) / 8;
> +
> +	map = vmalloc(sz);
> +	if (!map) {
> +		pr_err_once("Failed to allocate encrypted bitmap size %lx\n",
> +				sz);
> +		return -ENOMEM;
> +	}
> +
> +	/* mark the page encrypted (by default) */
> +	memset(map, 0xff, sz);
> +
> +	bitmap_copy(map, sev->page_enc_bmap, sev->page_enc_bmap_size);
> +	kvfree(sev->page_enc_bmap);
> +
> +	sev->page_enc_bmap = map;
> +	sev->page_enc_bmap_size = new_size;
> +
> +	return 0;
> +}
> +
> +int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
> +				  unsigned long npages, unsigned long enc)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	kvm_pfn_t pfn_start, pfn_end;
> +	gfn_t gfn_start, gfn_end;
> +
> +	if (!sev_guest(kvm))
> +		return -EINVAL;
> +
> +	if (!npages)
> +		return 0;
> +
> +	gfn_start = gpa_to_gfn(gpa);
> +	gfn_end = gfn_start + npages;
> +
> +	/* out of bound access error check */
> +	if (gfn_end <= gfn_start)
> +		return -EINVAL;
> +
> +	/* lets make sure that gpa exist in our memslot */
> +	pfn_start = gfn_to_pfn(kvm, gfn_start);
> +	pfn_end = gfn_to_pfn(kvm, gfn_end);
> +
> +	if (is_error_noslot_pfn(pfn_start) && !is_noslot_pfn(pfn_start)) {
> +		/*
> +		 * Allow guest MMIO range(s) to be added
> +		 * to the page encryption bitmap.
> +		 */
> +		return -EINVAL;
> +	}
> +
> +	if (is_error_noslot_pfn(pfn_end) && !is_noslot_pfn(pfn_end)) {
> +		/*
> +		 * Allow guest MMIO range(s) to be added
> +		 * to the page encryption bitmap.
> +		 */
> +		return -EINVAL;
> +	}
> +
> +	mutex_lock(&kvm->lock);
> +
> +	if (sev->page_enc_bmap_size < gfn_end)
> +		goto unlock;
> +
> +	if (enc)
> +		__bitmap_set(sev->page_enc_bmap, gfn_start,
> +				gfn_end - gfn_start);
> +	else
> +		__bitmap_clear(sev->page_enc_bmap, gfn_start,
> +				gfn_end - gfn_start);
> +
> +unlock:
> +	mutex_unlock(&kvm->lock);
> +	return 0;
> +}
> +
>  int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
>  {
>  	struct kvm_sev_cmd sev_cmd;
> @@ -1123,6 +1210,9 @@ void sev_vm_destroy(struct kvm *kvm)
>  
>  	sev_unbind_asid(kvm, sev->handle);
>  	sev_asid_free(sev->asid);
> +
> +	kvfree(sev->page_enc_bmap);
> +	sev->page_enc_bmap = NULL;
>  }
>  
>  int __init sev_hardware_setup(void)
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 6dc337b9c231..7122ea5f7c47 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4312,6 +4312,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>  	.apic_init_signal_blocked = svm_apic_init_signal_blocked,
>  
>  	.msr_filter_changed = svm_msr_filter_changed,
> +
> +	.page_enc_status_hc = svm_page_enc_status_hc,
>  };
>  
>  static struct kvm_x86_init_ops svm_init_ops __initdata = {
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index fdff76eb6ceb..0103a23ca174 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -66,6 +66,8 @@ struct kvm_sev_info {
>  	int fd;			/* SEV device fd */
>  	unsigned long pages_locked; /* Number of pages locked */
>  	struct list_head regions_list;  /* List of registered regions */
> +	unsigned long *page_enc_bmap;
> +	unsigned long page_enc_bmap_size;
>  };
>  
>  struct kvm_svm {
> @@ -409,6 +411,8 @@ int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
>  			       bool has_error_code, u32 error_code);
>  int nested_svm_exit_special(struct vcpu_svm *svm);
>  void sync_nested_vmcb_control(struct vcpu_svm *svm);
> +int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
> +                           unsigned long npages, unsigned long enc);
>  
>  extern struct kvm_x86_nested_ops svm_nested_ops;
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c3441e7e5a87..5bc37a38e6be 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7722,6 +7722,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
>  
>  	.msr_filter_changed = vmx_msr_filter_changed,
>  	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
> +	.page_enc_status_hc = NULL,
>  };
>  
>  static __init int hardware_setup(void)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a3fdc16cfd6f..3afc78f18f69 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8125,6 +8125,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  		kvm_sched_yield(vcpu->kvm, a0);
>  		ret = 0;
>  		break;
> +	case KVM_HC_PAGE_ENC_STATUS:
> +		ret = -KVM_ENOSYS;
> +		if (kvm_x86_ops.page_enc_status_hc)
> +			ret = kvm_x86_ops.page_enc_status_hc(vcpu->kvm,
> +					a0, a1, a2);
> +		break;
>  	default:
>  		ret = -KVM_ENOSYS;
>  		break;
> diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
> index 8b86609849b9..847b83b75dc8 100644
> --- a/include/uapi/linux/kvm_para.h
> +++ b/include/uapi/linux/kvm_para.h
> @@ -29,6 +29,7 @@
>  #define KVM_HC_CLOCK_PAIRING		9
>  #define KVM_HC_SEND_IPI		10
>  #define KVM_HC_SCHED_YIELD		11
> +#define KVM_HC_PAGE_ENC_STATUS		12
>  
>  /*
>   * hypercalls use architecture specific
> -- 
> 2.17.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 2/9] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall
  2020-12-02 16:54   ` Dr. David Alan Gilbert
@ 2020-12-02 21:22     ` Ashish Kalra
  2020-12-06 10:25       ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Ashish Kalra @ 2020-12-02 21:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: pbonzini, tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh

Hello Dave,

On Wed, Dec 02, 2020 at 04:54:20PM +0000, Dr. David Alan Gilbert wrote:
> * Ashish Kalra (Ashish.Kalra@amd.com) wrote:
> > From: Brijesh Singh <brijesh.singh@amd.com>
> > 
> > This hypercall is used by the SEV guest to notify a change in the page
> > encryption status to the hypervisor. The hypercall should be invoked
> > only when the encryption attribute is changed from encrypted -> decrypted
> > and vice versa. By default all guest pages are considered encrypted.
> 
> Is it defined whether these are supposed to be called before or after
> the the page type has been changed; is it change the type and then
> notify or the other way around?
> 

There is nothing really specified as such, the guest makes the hypercall
immediately after modifying the page tables. There is surely going to be
some latency before the VMM knows about this and the guest page table
update.

Thanks,
Ashish

> 
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> > Cc: Joerg Roedel <joro@8bytes.org>
> > Cc: Borislav Petkov <bp@suse.de>
> > Cc: Tom Lendacky <thomas.lendacky@amd.com>
> > Cc: x86@kernel.org
> > Cc: kvm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
> > Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> > Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
> > ---
> >  Documentation/virt/kvm/hypercalls.rst | 15 +++++
> >  arch/x86/include/asm/kvm_host.h       |  2 +
> >  arch/x86/kvm/svm/sev.c                | 90 +++++++++++++++++++++++++++
> >  arch/x86/kvm/svm/svm.c                |  2 +
> >  arch/x86/kvm/svm/svm.h                |  4 ++
> >  arch/x86/kvm/vmx/vmx.c                |  1 +
> >  arch/x86/kvm/x86.c                    |  6 ++
> >  include/uapi/linux/kvm_para.h         |  1 +
> >  8 files changed, 121 insertions(+)
> > 
> > diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/hypercalls.rst
> > index ed4fddd364ea..7aff0cebab7c 100644
> > --- a/Documentation/virt/kvm/hypercalls.rst
> > +++ b/Documentation/virt/kvm/hypercalls.rst
> > @@ -169,3 +169,18 @@ a0: destination APIC ID
> >  
> >  :Usage example: When sending a call-function IPI-many to vCPUs, yield if
> >  	        any of the IPI target vCPUs was preempted.
> > +
> > +
> > +8. KVM_HC_PAGE_ENC_STATUS
> > +-------------------------
> > +:Architecture: x86
> > +:Status: active
> > +:Purpose: Notify the encryption status changes in guest page table (SEV guest)
> > +
> > +a0: the guest physical address of the start page
> > +a1: the number of pages
> > +a2: encryption attribute
> > +
> > +   Where:
> > +	* 1: Encryption attribute is set
> > +	* 0: Encryption attribute is cleared
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f002cdb13a0b..d035dc983a7a 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1282,6 +1282,8 @@ struct kvm_x86_ops {
> >  
> >  	void (*migrate_timers)(struct kvm_vcpu *vcpu);
> >  	void (*msr_filter_changed)(struct kvm_vcpu *vcpu);
> > +	int (*page_enc_status_hc)(struct kvm *kvm, unsigned long gpa,
> > +				  unsigned long sz, unsigned long mode);
> >  };
> >  
> >  struct kvm_x86_nested_ops {
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index c0b14106258a..6b8bc1297f9c 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -927,6 +927,93 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >  	return ret;
> >  }
> >  
> > +static int sev_resize_page_enc_bitmap(struct kvm *kvm, unsigned long new_size)
> > +{
> > +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> > +	unsigned long *map;
> > +	unsigned long sz;
> > +
> > +	if (sev->page_enc_bmap_size >= new_size)
> > +		return 0;
> > +
> > +	sz = ALIGN(new_size, BITS_PER_LONG) / 8;
> > +
> > +	map = vmalloc(sz);
> > +	if (!map) {
> > +		pr_err_once("Failed to allocate encrypted bitmap size %lx\n",
> > +				sz);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	/* mark the page encrypted (by default) */
> > +	memset(map, 0xff, sz);
> > +
> > +	bitmap_copy(map, sev->page_enc_bmap, sev->page_enc_bmap_size);
> > +	kvfree(sev->page_enc_bmap);
> > +
> > +	sev->page_enc_bmap = map;
> > +	sev->page_enc_bmap_size = new_size;
> > +
> > +	return 0;
> > +}
> > +
> > +int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
> > +				  unsigned long npages, unsigned long enc)
> > +{
> > +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> > +	kvm_pfn_t pfn_start, pfn_end;
> > +	gfn_t gfn_start, gfn_end;
> > +
> > +	if (!sev_guest(kvm))
> > +		return -EINVAL;
> > +
> > +	if (!npages)
> > +		return 0;
> > +
> > +	gfn_start = gpa_to_gfn(gpa);
> > +	gfn_end = gfn_start + npages;
> > +
> > +	/* out of bound access error check */
> > +	if (gfn_end <= gfn_start)
> > +		return -EINVAL;
> > +
> > +	/* lets make sure that gpa exist in our memslot */
> > +	pfn_start = gfn_to_pfn(kvm, gfn_start);
> > +	pfn_end = gfn_to_pfn(kvm, gfn_end);
> > +
> > +	if (is_error_noslot_pfn(pfn_start) && !is_noslot_pfn(pfn_start)) {
> > +		/*
> > +		 * Allow guest MMIO range(s) to be added
> > +		 * to the page encryption bitmap.
> > +		 */
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (is_error_noslot_pfn(pfn_end) && !is_noslot_pfn(pfn_end)) {
> > +		/*
> > +		 * Allow guest MMIO range(s) to be added
> > +		 * to the page encryption bitmap.
> > +		 */
> > +		return -EINVAL;
> > +	}
> > +
> > +	mutex_lock(&kvm->lock);
> > +
> > +	if (sev->page_enc_bmap_size < gfn_end)
> > +		goto unlock;
> > +
> > +	if (enc)
> > +		__bitmap_set(sev->page_enc_bmap, gfn_start,
> > +				gfn_end - gfn_start);
> > +	else
> > +		__bitmap_clear(sev->page_enc_bmap, gfn_start,
> > +				gfn_end - gfn_start);
> > +
> > +unlock:
> > +	mutex_unlock(&kvm->lock);
> > +	return 0;
> > +}
> > +
> >  int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
> >  {
> >  	struct kvm_sev_cmd sev_cmd;
> > @@ -1123,6 +1210,9 @@ void sev_vm_destroy(struct kvm *kvm)
> >  
> >  	sev_unbind_asid(kvm, sev->handle);
> >  	sev_asid_free(sev->asid);
> > +
> > +	kvfree(sev->page_enc_bmap);
> > +	sev->page_enc_bmap = NULL;
> >  }
> >  
> >  int __init sev_hardware_setup(void)
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 6dc337b9c231..7122ea5f7c47 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -4312,6 +4312,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> >  	.apic_init_signal_blocked = svm_apic_init_signal_blocked,
> >  
> >  	.msr_filter_changed = svm_msr_filter_changed,
> > +
> > +	.page_enc_status_hc = svm_page_enc_status_hc,
> >  };
> >  
> >  static struct kvm_x86_init_ops svm_init_ops __initdata = {
> > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > index fdff76eb6ceb..0103a23ca174 100644
> > --- a/arch/x86/kvm/svm/svm.h
> > +++ b/arch/x86/kvm/svm/svm.h
> > @@ -66,6 +66,8 @@ struct kvm_sev_info {
> >  	int fd;			/* SEV device fd */
> >  	unsigned long pages_locked; /* Number of pages locked */
> >  	struct list_head regions_list;  /* List of registered regions */
> > +	unsigned long *page_enc_bmap;
> > +	unsigned long page_enc_bmap_size;
> >  };
> >  
> >  struct kvm_svm {
> > @@ -409,6 +411,8 @@ int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
> >  			       bool has_error_code, u32 error_code);
> >  int nested_svm_exit_special(struct vcpu_svm *svm);
> >  void sync_nested_vmcb_control(struct vcpu_svm *svm);
> > +int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
> > +                           unsigned long npages, unsigned long enc);
> >  
> >  extern struct kvm_x86_nested_ops svm_nested_ops;
> >  
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index c3441e7e5a87..5bc37a38e6be 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7722,6 +7722,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
> >  
> >  	.msr_filter_changed = vmx_msr_filter_changed,
> >  	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
> > +	.page_enc_status_hc = NULL,
> >  };
> >  
> >  static __init int hardware_setup(void)
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index a3fdc16cfd6f..3afc78f18f69 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -8125,6 +8125,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >  		kvm_sched_yield(vcpu->kvm, a0);
> >  		ret = 0;
> >  		break;
> > +	case KVM_HC_PAGE_ENC_STATUS:
> > +		ret = -KVM_ENOSYS;
> > +		if (kvm_x86_ops.page_enc_status_hc)
> > +			ret = kvm_x86_ops.page_enc_status_hc(vcpu->kvm,
> > +					a0, a1, a2);
> > +		break;
> >  	default:
> >  		ret = -KVM_ENOSYS;
> >  		break;
> > diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
> > index 8b86609849b9..847b83b75dc8 100644
> > --- a/include/uapi/linux/kvm_para.h
> > +++ b/include/uapi/linux/kvm_para.h
> > @@ -29,6 +29,7 @@
> >  #define KVM_HC_CLOCK_PAIRING		9
> >  #define KVM_HC_SEND_IPI		10
> >  #define KVM_HC_SCHED_YIELD		11
> > +#define KVM_HC_PAGE_ENC_STATUS		12
> >  
> >  /*
> >   * hypercalls use architecture specific
> > -- 
> > 2.17.1
> > 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2020-12-01  0:45 ` [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3 Ashish Kalra
@ 2020-12-03  0:34   ` Sean Christopherson
  2020-12-04 17:16     ` Brijesh Singh
  2020-12-06 10:26     ` Paolo Bonzini
  0 siblings, 2 replies; 38+ messages in thread
From: Sean Christopherson @ 2020-12-03  0:34 UTC (permalink / raw)
  To: Ashish Kalra
  Cc: pbonzini, tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

On Tue, Dec 01, 2020, Ashish Kalra wrote:
> From: Brijesh Singh <brijesh.singh@amd.com>
> 
> KVM hypercall framework relies on alternative framework to patch the
> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> apply_alternative() is called then it defaults to VMCALL. The approach
> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> will be able to decode the instruction and do the right things. But
> when SEV is active, guest memory is encrypted with guest key and
> hypervisor will not be able to decode the instruction bytes.
> 
> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> will be used by the SEV guest to notify encrypted pages to the hypervisor.

What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
think there are any existing KVM hypercalls that happen before alternatives are
patched, i.e. it'll be a nop for sane kernel builds.

I'm also skeptical that a KVM specific hypercall is the right approach for the
encryption behavior, but I'll take that up in the patches later in the series.

> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: x86@kernel.org
> Cc: kvm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Reviewed-by: Steve Rutherford <srutherford@google.com>
> Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
> ---
>  arch/x86/include/asm/kvm_para.h | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 338119852512..bc1b11d057fc 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -85,6 +85,18 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
>  	return ret;
>  }
>  
> +static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
> +				      unsigned long p2, unsigned long p3)
> +{
> +	long ret;
> +
> +	asm volatile("vmmcall"
> +		     : "=a"(ret)
> +		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
> +		     : "memory");
> +	return ret;
> +}
> +
>  #ifdef CONFIG_KVM_GUEST
>  bool kvm_para_available(void);
>  unsigned int kvm_arch_para_features(void);
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2020-12-03  0:34   ` Sean Christopherson
@ 2020-12-04 17:16     ` Brijesh Singh
  2020-12-06 10:26     ` Paolo Bonzini
  1 sibling, 0 replies; 38+ messages in thread
From: Brijesh Singh @ 2020-12-04 17:16 UTC (permalink / raw)
  To: Sean Christopherson, Ashish Kalra
  Cc: brijesh.singh, pbonzini, tglx, mingo, hpa, joro, bp,
	thomas.lendacky, x86, kvm, linux-kernel, srutherford, dovmurik,
	tobin, jejb, frankeh, dgilbert


On 12/2/20 6:34 PM, Sean Christopherson wrote:
> On Tue, Dec 01, 2020, Ashish Kalra wrote:
>> From: Brijesh Singh <brijesh.singh@amd.com>
>>
>> KVM hypercall framework relies on alternative framework to patch the
>> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
>> apply_alternative() is called then it defaults to VMCALL. The approach
>> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
>> will be able to decode the instruction and do the right things. But
>> when SEV is active, guest memory is encrypted with guest key and
>> hypervisor will not be able to decode the instruction bytes.
>>
>> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
>> will be used by the SEV guest to notify encrypted pages to the hypervisor.
> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> think there are any existing KVM hypercalls that happen before alternatives are
> patched, i.e. it'll be a nop for sane kernel builds.


If we invert the X86_FEATURE_VMMCALL to default to VMMCALL then it
should work fine without this patch. So far there was no hypercall made
before the alternative patching took place. Since the page state change
can occur much before the alternative patching so we need to default to
VMMCALL when SEV is active.


> I'm also skeptical that a KVM specific hypercall is the right approach for the
> encryption behavior, but I'll take that up in the patches later in the series.


Great, I am open to explore other alternative approaches.


>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
>> Cc: Joerg Roedel <joro@8bytes.org>
>> Cc: Borislav Petkov <bp@suse.de>
>> Cc: Tom Lendacky <thomas.lendacky@amd.com>
>> Cc: x86@kernel.org
>> Cc: kvm@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Reviewed-by: Steve Rutherford <srutherford@google.com>
>> Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
>> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
>> ---
>>  arch/x86/include/asm/kvm_para.h | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
>> index 338119852512..bc1b11d057fc 100644
>> --- a/arch/x86/include/asm/kvm_para.h
>> +++ b/arch/x86/include/asm/kvm_para.h
>> @@ -85,6 +85,18 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
>>  	return ret;
>>  }
>>  
>> +static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
>> +				      unsigned long p2, unsigned long p3)
>> +{
>> +	long ret;
>> +
>> +	asm volatile("vmmcall"
>> +		     : "=a"(ret)
>> +		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
>> +		     : "memory");
>> +	return ret;
>> +}
>> +
>>  #ifdef CONFIG_KVM_GUEST
>>  bool kvm_para_available(void);
>>  unsigned int kvm_arch_para_features(void);
>> -- 
>> 2.17.1
>>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 2/9] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall
  2020-12-02 21:22     ` Ashish Kalra
@ 2020-12-06 10:25       ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2020-12-06 10:25 UTC (permalink / raw)
  To: Ashish Kalra, Dr. David Alan Gilbert
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh

On 02/12/20 22:22, Ashish Kalra wrote:
> Hello Dave,
> 
> On Wed, Dec 02, 2020 at 04:54:20PM +0000, Dr. David Alan Gilbert wrote:
>> * Ashish Kalra (Ashish.Kalra@amd.com) wrote:
>>> From: Brijesh Singh <brijesh.singh@amd.com>
>>>
>>> This hypercall is used by the SEV guest to notify a change in the page
>>> encryption status to the hypervisor. The hypercall should be invoked
>>> only when the encryption attribute is changed from encrypted -> decrypted
>>> and vice versa. By default all guest pages are considered encrypted.
>>
>> Is it defined whether these are supposed to be called before or after
>> the the page type has been changed; is it change the type and then
>> notify or the other way around?

It doesn't matter.  However, you have do it before writing to the page, 
and the content of the page is unspecified between the hypercall and the 
write to the page.

So you cannot for example encrypt a page in place (using the same PFN 
but different settings of the C bit), you need to do:

	hypercall(); /* mark enc_data as encrypted */
	/*
	 * The contents of enc_data is now undefined as it can change
	 * across migration.
	 */
	memset(enc_data, unenc_data, PAGE_SIZE);

Paolo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2020-12-03  0:34   ` Sean Christopherson
  2020-12-04 17:16     ` Brijesh Singh
@ 2020-12-06 10:26     ` Paolo Bonzini
  2020-12-07 20:41       ` Sean Christopherson
  1 sibling, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2020-12-06 10:26 UTC (permalink / raw)
  To: Sean Christopherson, Ashish Kalra
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

On 03/12/20 01:34, Sean Christopherson wrote:
> On Tue, Dec 01, 2020, Ashish Kalra wrote:
>> From: Brijesh Singh <brijesh.singh@amd.com>
>>
>> KVM hypercall framework relies on alternative framework to patch the
>> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
>> apply_alternative() is called then it defaults to VMCALL. The approach
>> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
>> will be able to decode the instruction and do the right things. But
>> when SEV is active, guest memory is encrypted with guest key and
>> hypervisor will not be able to decode the instruction bytes.
>>
>> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
>> will be used by the SEV guest to notify encrypted pages to the hypervisor.
> 
> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> think there are any existing KVM hypercalls that happen before alternatives are
> patched, i.e. it'll be a nop for sane kernel builds.
> 
> I'm also skeptical that a KVM specific hypercall is the right approach for the
> encryption behavior, but I'll take that up in the patches later in the series.

Do you think that it's the guest that should "donate" memory for the 
bitmap instead?

Paolo

> 
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
>> Cc: Joerg Roedel <joro@8bytes.org>
>> Cc: Borislav Petkov <bp@suse.de>
>> Cc: Tom Lendacky <thomas.lendacky@amd.com>
>> Cc: x86@kernel.org
>> Cc: kvm@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Reviewed-by: Steve Rutherford <srutherford@google.com>
>> Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
>> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
>> ---
>>   arch/x86/include/asm/kvm_para.h | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
>> index 338119852512..bc1b11d057fc 100644
>> --- a/arch/x86/include/asm/kvm_para.h
>> +++ b/arch/x86/include/asm/kvm_para.h
>> @@ -85,6 +85,18 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
>>   	return ret;
>>   }
>>   
>> +static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
>> +				      unsigned long p2, unsigned long p3)
>> +{
>> +	long ret;
>> +
>> +	asm volatile("vmmcall"
>> +		     : "=a"(ret)
>> +		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
>> +		     : "memory");
>> +	return ret;
>> +}
>> +
>>   #ifdef CONFIG_KVM_GUEST
>>   bool kvm_para_available(void);
>>   unsigned int kvm_arch_para_features(void);
>> -- 
>> 2.17.1
>>
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 3/9] KVM: x86: Introduce KVM_GET_PAGE_ENC_BITMAP ioctl
  2020-12-01  0:47 ` [PATCH v2 3/9] KVM: x86: Introduce KVM_GET_PAGE_ENC_BITMAP ioctl Ashish Kalra
@ 2020-12-06 11:02   ` Dov Murik
  2020-12-07 22:00     ` Ashish Kalra
  0 siblings, 1 reply; 38+ messages in thread
From: Dov Murik @ 2020-12-06 11:02 UTC (permalink / raw)
  To: Ashish Kalra, pbonzini
  Cc: tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, tobin, jejb, frankeh,
	dgilbert



On 01/12/2020 2:47, Ashish Kalra wrote:
> From: Brijesh Singh <brijesh.singh@amd.com>
> 
> The ioctl can be used to retrieve page encryption bitmap for a given
> gfn range.
> 
> Return the correct bitmap as per the number of pages being requested
> by the user. Ensure that we only copy bmap->num_pages bytes in the
> userspace buffer, if bmap->num_pages is not byte aligned we read
> the trailing bits from the userspace and copy those bits as is.

I think you meant to say "Ensure that we only copy bmap->num_pages 
*bits* in the userspace buffer".  But maybe I'm missed something.


> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: x86@kernel.org
> Cc: kvm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
> ---
>   Documentation/virt/kvm/api.rst  | 27 +++++++++++++
>   arch/x86/include/asm/kvm_host.h |  2 +
>   arch/x86/kvm/svm/sev.c          | 70 +++++++++++++++++++++++++++++++++
>   arch/x86/kvm/svm/svm.c          |  1 +
>   arch/x86/kvm/svm/svm.h          |  1 +
>   arch/x86/kvm/x86.c              | 12 ++++++
>   include/uapi/linux/kvm.h        | 12 ++++++
>   7 files changed, 125 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 70254eaa5229..ae410f4332ab 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4671,6 +4671,33 @@ This ioctl resets VCPU registers and control structures according to
>   the clear cpu reset definition in the POP. However, the cpu is not put
>   into ESA mode. This reset is a superset of the initial reset.
> 
> +4.125 KVM_GET_PAGE_ENC_BITMAP (vm ioctl)
> +---------------------------------------
> +
> +:Capability: basic
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_page_enc_bitmap (in/out)
> +:Returns: 0 on success, -1 on error
> +
> +/* for KVM_GET_PAGE_ENC_BITMAP */
> +struct kvm_page_enc_bitmap {
> +	__u64 start_gfn;
> +	__u64 num_pages;
> +	union {
> +		void __user *enc_bitmap; /* one bit per page */
> +		__u64 padding2;
> +	};
> +};
> +
> +The encrypted VMs have the concept of private and shared pages. The private
> +pages are encrypted with the guest-specific key, while the shared pages may
> +be encrypted with the hypervisor key. The KVM_GET_PAGE_ENC_BITMAP can
> +be used to get the bitmap indicating whether the guest page is private
> +or shared. The bitmap can be used during the guest migration. If the page
> +is private then the userspace need to use SEV migration commands to transmit
> +the page.
> +
> 
>   4.125 KVM_S390_PV_COMMAND
>   -------------------------
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d035dc983a7a..8c2e40199ecb 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1284,6 +1284,8 @@ struct kvm_x86_ops {
>   	void (*msr_filter_changed)(struct kvm_vcpu *vcpu);
>   	int (*page_enc_status_hc)(struct kvm *kvm, unsigned long gpa,
>   				  unsigned long sz, unsigned long mode);
> +	int (*get_page_enc_bitmap)(struct kvm *kvm,
> +				struct kvm_page_enc_bitmap *bmap);
>   };
> 
>   struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 6b8bc1297f9c..a6586dd29767 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -1014,6 +1014,76 @@ int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
>   	return 0;
>   }
> 
> +int svm_get_page_enc_bitmap(struct kvm *kvm,
> +				   struct kvm_page_enc_bitmap *bmap)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	unsigned long gfn_start, gfn_end;
> +	unsigned long sz, i, sz_bytes;
> +	unsigned long *bitmap;
> +	int ret, n;
> +
> +	if (!sev_guest(kvm))
> +		return -ENOTTY;
> +
> +	gfn_start = bmap->start_gfn;
> +	gfn_end = gfn_start + bmap->num_pages;
> +
> +	sz = ALIGN(bmap->num_pages, BITS_PER_LONG) / BITS_PER_BYTE;
> +	bitmap = kmalloc(sz, GFP_KERNEL);

Maybe use bitmap_alloc which accepts size in bits (and corresponding 
bitmap_free)?


> +	if (!bitmap)
> +		return -ENOMEM;
> +
> +	/* by default all pages are marked encrypted */
> +	memset(bitmap, 0xff, sz);

Maybe use bitmap_fill to clarify the intent?


> +
> +	mutex_lock(&kvm->lock);
> +	if (sev->page_enc_bmap) {
> +		i = gfn_start;
> +		for_each_clear_bit_from(i, sev->page_enc_bmap,
> +				      min(sev->page_enc_bmap_size, gfn_end))
> +			clear_bit(i - gfn_start, bitmap);
> +	}
> +	mutex_unlock(&kvm->lock);
> +
> +	ret = -EFAULT;
> +
> +	n = bmap->num_pages % BITS_PER_BYTE;
> +	sz_bytes = ALIGN(bmap->num_pages, BITS_PER_BYTE) / BITS_PER_BYTE;

Maybe clearer:

	sz_bytes = BITS_TO_BYTES(bmap->num_pages);



> +
> +	/*
> +	 * Return the correct bitmap as per the number of pages being
> +	 * requested by the user. Ensure that we only copy bmap->num_pages
> +	 * bytes in the userspace buffer, if bmap->num_pages is not byte
> +	 * aligned we read the trailing bits from the userspace and copy
> +	 * those bits as is.
> +	 */

(see my comment on the commit message above.)


> +
> +	if (n) {
> +		unsigned char *bitmap_kernel = (unsigned char *)bitmap;
> +		unsigned char bitmap_user;
> +		unsigned long offset, mask;
> +
> +		offset = bmap->num_pages / BITS_PER_BYTE;
> +		if (copy_from_user(&bitmap_user, bmap->enc_bitmap + offset,
> +				sizeof(unsigned char)))
> +			goto out;
> +
> +		mask = GENMASK(n - 1, 0);
> +		bitmap_user &= ~mask;
> +		bitmap_kernel[offset] &= mask;
> +		bitmap_kernel[offset] |= bitmap_user;
> +	}
> +
> +	if (copy_to_user(bmap->enc_bitmap, bitmap, sz_bytes))
> +		goto out;
> +
> +	ret = 0;
> +out:
> +	kfree(bitmap);
> +	return ret;
> +}
> +
>   int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
>   {
>   	struct kvm_sev_cmd sev_cmd;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7122ea5f7c47..bff89cab3ed0 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4314,6 +4314,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>   	.msr_filter_changed = svm_msr_filter_changed,
> 
>   	.page_enc_status_hc = svm_page_enc_status_hc,
> +	.get_page_enc_bitmap = svm_get_page_enc_bitmap,
>   };
> 
>   static struct kvm_x86_init_ops svm_init_ops __initdata = {
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 0103a23ca174..4ce73f1034b9 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -413,6 +413,7 @@ int nested_svm_exit_special(struct vcpu_svm *svm);
>   void sync_nested_vmcb_control(struct vcpu_svm *svm);
>   int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
>                              unsigned long npages, unsigned long enc);
> +int svm_get_page_enc_bitmap(struct kvm *kvm, struct kvm_page_enc_bitmap *bmap);
> 
>   extern struct kvm_x86_nested_ops svm_nested_ops;
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3afc78f18f69..d3cb95a4dd55 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5695,6 +5695,18 @@ long kvm_arch_vm_ioctl(struct file *filp,
>   	case KVM_X86_SET_MSR_FILTER:
>   		r = kvm_vm_ioctl_set_msr_filter(kvm, argp);
>   		break;
> +	case KVM_GET_PAGE_ENC_BITMAP: {
> +		struct kvm_page_enc_bitmap bitmap;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&bitmap, argp, sizeof(bitmap)))
> +			goto out;
> +
> +		r = -ENOTTY;
> +		if (kvm_x86_ops.get_page_enc_bitmap)
> +			r = kvm_x86_ops.get_page_enc_bitmap(kvm, &bitmap);
> +		break;
> +	}
>   	default:
>   		r = -ENOTTY;
>   	}
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 886802b8ffba..d0b9171bdb03 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -532,6 +532,16 @@ struct kvm_dirty_log {
>   	};
>   };
> 
> +/* for KVM_GET_PAGE_ENC_BITMAP */
> +struct kvm_page_enc_bitmap {
> +	__u64 start_gfn;
> +	__u64 num_pages;
> +	union {
> +		void __user *enc_bitmap; /* one bit per page */
> +		__u64 padding2;
> +	};
> +};
> +
>   /* for KVM_CLEAR_DIRTY_LOG */
>   struct kvm_clear_dirty_log {
>   	__u32 slot;
> @@ -1563,6 +1573,8 @@ struct kvm_pv_cmd {
>   /* Available with KVM_CAP_DIRTY_LOG_RING */
>   #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
> 
> +#define KVM_GET_PAGE_ENC_BITMAP	_IOW(KVMIO, 0xc6, struct kvm_page_enc_bitmap)

I see that kvm/next already defines ioctls numbered 0xc6 and 0xc7. 
Wouldn't these new ioctls (KVM_GET_PAGE_ENC_BITMAP, 
KVM_SET_PAGE_ENC_BITMAP) collide?


> +
>   /* Secure Encrypted Virtualization command */
>   enum sev_cmd_id {
>   	/* Guest initialization commands */
> 

-Dov

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2020-12-06 10:26     ` Paolo Bonzini
@ 2020-12-07 20:41       ` Sean Christopherson
  2020-12-08  3:09         ` Steve Rutherford
  0 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2020-12-07 20:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ashish Kalra, tglx, mingo, hpa, joro, bp, thomas.lendacky, x86,
	kvm, linux-kernel, srutherford, brijesh.singh, dovmurik, tobin,
	jejb, frankeh, dgilbert

On Sun, Dec 06, 2020, Paolo Bonzini wrote:
> On 03/12/20 01:34, Sean Christopherson wrote:
> > On Tue, Dec 01, 2020, Ashish Kalra wrote:
> > > From: Brijesh Singh <brijesh.singh@amd.com>
> > > 
> > > KVM hypercall framework relies on alternative framework to patch the
> > > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> > > apply_alternative() is called then it defaults to VMCALL. The approach
> > > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> > > will be able to decode the instruction and do the right things. But
> > > when SEV is active, guest memory is encrypted with guest key and
> > > hypervisor will not be able to decode the instruction bytes.
> > > 
> > > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> > > will be used by the SEV guest to notify encrypted pages to the hypervisor.
> > 
> > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> > and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> > think there are any existing KVM hypercalls that happen before alternatives are
> > patched, i.e. it'll be a nop for sane kernel builds.
> > 
> > I'm also skeptical that a KVM specific hypercall is the right approach for the
> > encryption behavior, but I'll take that up in the patches later in the series.
> 
> Do you think that it's the guest that should "donate" memory for the bitmap
> instead?

No.  Two things I'd like to explore:

  1. Making the hypercall to announce/request private vs. shared common across
     hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
     I'm concerned that we'll end up with multiple hypercalls that do more or
     less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
     pipe dream, but I'd like to at least explore options before shoving in KVM-
     only hypercalls.

  2. Tracking shared memory via a list of ranges instead of a using bitmap to
     track all of guest memory.  For most use cases, the vast majority of guest
     memory will be private, most ranges will be 2mb+, and conversions between
     private and shared will be uncommon events, i.e. the overhead to walk and
     split/merge list entries is hopefully not a big concern.  I suspect a list
     would consume far less memory, hopefully without impacting performance.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 3/9] KVM: x86: Introduce KVM_GET_PAGE_ENC_BITMAP ioctl
  2020-12-06 11:02   ` Dov Murik
@ 2020-12-07 22:00     ` Ashish Kalra
  0 siblings, 0 replies; 38+ messages in thread
From: Ashish Kalra @ 2020-12-07 22:00 UTC (permalink / raw)
  To: Dov Murik
  Cc: pbonzini, tglx, mingo, hpa, joro, bp, thomas.lendacky, x86, kvm,
	linux-kernel, srutherford, brijesh.singh, tobin, jejb, frankeh,
	dgilbert

Hello Dov,

On Sun, Dec 06, 2020 at 01:02:47PM +0200, Dov Murik wrote:
> 
> 
> On 01/12/2020 2:47, Ashish Kalra wrote:
> > From: Brijesh Singh <brijesh.singh@amd.com>
> > 
> > The ioctl can be used to retrieve page encryption bitmap for a given
> > gfn range.
> > 
> > Return the correct bitmap as per the number of pages being requested
> > by the user. Ensure that we only copy bmap->num_pages bytes in the
> > userspace buffer, if bmap->num_pages is not byte aligned we read
> > the trailing bits from the userspace and copy those bits as is.
> 
> I think you meant to say "Ensure that we only copy bmap->num_pages *bits* in
> the userspace buffer".  But maybe I'm missed something.
>

Yes, that is correct. It should read bmap->num_pages *bits* instead of
*bytes*, i will fix the comments.

> 
> > 
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> > Cc: Joerg Roedel <joro@8bytes.org>
> > Cc: Borislav Petkov <bp@suse.de>
> > Cc: Tom Lendacky <thomas.lendacky@amd.com>
> > Cc: x86@kernel.org
> > Cc: kvm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
> > Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> > Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
> > ---
> >   Documentation/virt/kvm/api.rst  | 27 +++++++++++++
> >   arch/x86/include/asm/kvm_host.h |  2 +
> >   arch/x86/kvm/svm/sev.c          | 70 +++++++++++++++++++++++++++++++++
> >   arch/x86/kvm/svm/svm.c          |  1 +
> >   arch/x86/kvm/svm/svm.h          |  1 +
> >   arch/x86/kvm/x86.c              | 12 ++++++
> >   include/uapi/linux/kvm.h        | 12 ++++++
> >   7 files changed, 125 insertions(+)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 70254eaa5229..ae410f4332ab 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -4671,6 +4671,33 @@ This ioctl resets VCPU registers and control structures according to
> >   the clear cpu reset definition in the POP. However, the cpu is not put
> >   into ESA mode. This reset is a superset of the initial reset.
> > 
> > +4.125 KVM_GET_PAGE_ENC_BITMAP (vm ioctl)
> > +---------------------------------------
> > +
> > +:Capability: basic
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_page_enc_bitmap (in/out)
> > +:Returns: 0 on success, -1 on error
> > +
> > +/* for KVM_GET_PAGE_ENC_BITMAP */
> > +struct kvm_page_enc_bitmap {
> > +	__u64 start_gfn;
> > +	__u64 num_pages;
> > +	union {
> > +		void __user *enc_bitmap; /* one bit per page */
> > +		__u64 padding2;
> > +	};
> > +};
> > +
> > +The encrypted VMs have the concept of private and shared pages. The private
> > +pages are encrypted with the guest-specific key, while the shared pages may
> > +be encrypted with the hypervisor key. The KVM_GET_PAGE_ENC_BITMAP can
> > +be used to get the bitmap indicating whether the guest page is private
> > +or shared. The bitmap can be used during the guest migration. If the page
> > +is private then the userspace need to use SEV migration commands to transmit
> > +the page.
> > +
> > 
> >   4.125 KVM_S390_PV_COMMAND
> >   -------------------------
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index d035dc983a7a..8c2e40199ecb 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1284,6 +1284,8 @@ struct kvm_x86_ops {
> >   	void (*msr_filter_changed)(struct kvm_vcpu *vcpu);
> >   	int (*page_enc_status_hc)(struct kvm *kvm, unsigned long gpa,
> >   				  unsigned long sz, unsigned long mode);
> > +	int (*get_page_enc_bitmap)(struct kvm *kvm,
> > +				struct kvm_page_enc_bitmap *bmap);
> >   };
> > 
> >   struct kvm_x86_nested_ops {
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index 6b8bc1297f9c..a6586dd29767 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -1014,6 +1014,76 @@ int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
> >   	return 0;
> >   }
> > 
> > +int svm_get_page_enc_bitmap(struct kvm *kvm,
> > +				   struct kvm_page_enc_bitmap *bmap)
> > +{
> > +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> > +	unsigned long gfn_start, gfn_end;
> > +	unsigned long sz, i, sz_bytes;
> > +	unsigned long *bitmap;
> > +	int ret, n;
> > +
> > +	if (!sev_guest(kvm))
> > +		return -ENOTTY;
> > +
> > +	gfn_start = bmap->start_gfn;
> > +	gfn_end = gfn_start + bmap->num_pages;
> > +
> > +	sz = ALIGN(bmap->num_pages, BITS_PER_LONG) / BITS_PER_BYTE;
> > +	bitmap = kmalloc(sz, GFP_KERNEL);
> 
> Maybe use bitmap_alloc which accepts size in bits (and corresponding
> bitmap_free)?
>

I will look at this. 

> 
> > +	if (!bitmap)
> > +		return -ENOMEM;
> > +
> > +	/* by default all pages are marked encrypted */
> > +	memset(bitmap, 0xff, sz);
> 
> Maybe use bitmap_fill to clarify the intent?
>

Again, i will look at this. 
> 
> > +
> > +	mutex_lock(&kvm->lock);
> > +	if (sev->page_enc_bmap) {
> > +		i = gfn_start;
> > +		for_each_clear_bit_from(i, sev->page_enc_bmap,
> > +				      min(sev->page_enc_bmap_size, gfn_end))
> > +			clear_bit(i - gfn_start, bitmap);
> > +	}
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	ret = -EFAULT;
> > +
> > +	n = bmap->num_pages % BITS_PER_BYTE;
> > +	sz_bytes = ALIGN(bmap->num_pages, BITS_PER_BYTE) / BITS_PER_BYTE;
> 
> Maybe clearer:
> 
> 	sz_bytes = BITS_TO_BYTES(bmap->num_pages);
> 
> 
> 
> > +
> > +	/*
> > +	 * Return the correct bitmap as per the number of pages being
> > +	 * requested by the user. Ensure that we only copy bmap->num_pages
> > +	 * bytes in the userspace buffer, if bmap->num_pages is not byte
> > +	 * aligned we read the trailing bits from the userspace and copy
> > +	 * those bits as is.
> > +	 */
> 
> (see my comment on the commit message above.)
> 
Yes, as i mentioned above, this need to be bmap->num pages *bits* and 
not *bytes*. 

> 
> > +
> > +	if (n) {
> > +		unsigned char *bitmap_kernel = (unsigned char *)bitmap;
> > +		unsigned char bitmap_user;
> > +		unsigned long offset, mask;
> > +
> > +		offset = bmap->num_pages / BITS_PER_BYTE;
> > +		if (copy_from_user(&bitmap_user, bmap->enc_bitmap + offset,
> > +				sizeof(unsigned char)))
> > +			goto out;
> > +
> > +		mask = GENMASK(n - 1, 0);
> > +		bitmap_user &= ~mask;
> > +		bitmap_kernel[offset] &= mask;
> > +		bitmap_kernel[offset] |= bitmap_user;
> > +	}
> > +
> > +	if (copy_to_user(bmap->enc_bitmap, bitmap, sz_bytes))
> > +		goto out;
> > +
> > +	ret = 0;
> > +out:
> > +	kfree(bitmap);
> > +	return ret;
> > +}
> > +
> >   int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
> >   {
> >   	struct kvm_sev_cmd sev_cmd;
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 7122ea5f7c47..bff89cab3ed0 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -4314,6 +4314,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> >   	.msr_filter_changed = svm_msr_filter_changed,
> > 
> >   	.page_enc_status_hc = svm_page_enc_status_hc,
> > +	.get_page_enc_bitmap = svm_get_page_enc_bitmap,
> >   };
> > 
> >   static struct kvm_x86_init_ops svm_init_ops __initdata = {
> > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > index 0103a23ca174..4ce73f1034b9 100644
> > --- a/arch/x86/kvm/svm/svm.h
> > +++ b/arch/x86/kvm/svm/svm.h
> > @@ -413,6 +413,7 @@ int nested_svm_exit_special(struct vcpu_svm *svm);
> >   void sync_nested_vmcb_control(struct vcpu_svm *svm);
> >   int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
> >                              unsigned long npages, unsigned long enc);
> > +int svm_get_page_enc_bitmap(struct kvm *kvm, struct kvm_page_enc_bitmap *bmap);
> > 
> >   extern struct kvm_x86_nested_ops svm_nested_ops;
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 3afc78f18f69..d3cb95a4dd55 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -5695,6 +5695,18 @@ long kvm_arch_vm_ioctl(struct file *filp,
> >   	case KVM_X86_SET_MSR_FILTER:
> >   		r = kvm_vm_ioctl_set_msr_filter(kvm, argp);
> >   		break;
> > +	case KVM_GET_PAGE_ENC_BITMAP: {
> > +		struct kvm_page_enc_bitmap bitmap;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&bitmap, argp, sizeof(bitmap)))
> > +			goto out;
> > +
> > +		r = -ENOTTY;
> > +		if (kvm_x86_ops.get_page_enc_bitmap)
> > +			r = kvm_x86_ops.get_page_enc_bitmap(kvm, &bitmap);
> > +		break;
> > +	}
> >   	default:
> >   		r = -ENOTTY;
> >   	}
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 886802b8ffba..d0b9171bdb03 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -532,6 +532,16 @@ struct kvm_dirty_log {
> >   	};
> >   };
> > 
> > +/* for KVM_GET_PAGE_ENC_BITMAP */
> > +struct kvm_page_enc_bitmap {
> > +	__u64 start_gfn;
> > +	__u64 num_pages;
> > +	union {
> > +		void __user *enc_bitmap; /* one bit per page */
> > +		__u64 padding2;
> > +	};
> > +};
> > +
> >   /* for KVM_CLEAR_DIRTY_LOG */
> >   struct kvm_clear_dirty_log {
> >   	__u32 slot;
> > @@ -1563,6 +1573,8 @@ struct kvm_pv_cmd {
> >   /* Available with KVM_CAP_DIRTY_LOG_RING */
> >   #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
> > 
> > +#define KVM_GET_PAGE_ENC_BITMAP	_IOW(KVMIO, 0xc6, struct kvm_page_enc_bitmap)
> 
> I see that kvm/next already defines ioctls numbered 0xc6 and 0xc7. Wouldn't
> these new ioctls (KVM_GET_PAGE_ENC_BITMAP, KVM_SET_PAGE_ENC_BITMAP) collide?
>

Yes, but they will be fixed for the next version of the patch-set i am
going to post. 

Thanks for your feedback.
Ashish

> 
> > +
> >   /* Secure Encrypted Virtualization command */
> >   enum sev_cmd_id {
> >   	/* Guest initialization commands */
> > 
> 
> -Dov

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2020-12-07 20:41       ` Sean Christopherson
@ 2020-12-08  3:09         ` Steve Rutherford
  2020-12-08  4:16           ` Kalra, Ashish
  2020-12-08 16:29           ` Brijesh Singh
  0 siblings, 2 replies; 38+ messages in thread
From: Steve Rutherford @ 2020-12-08  3:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Ashish Kalra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Joerg Roedel, Borislav Petkov, Tom Lendacky,
	X86 ML, KVM list, LKML, Brijesh Singh, dovmurik, tobin, jejb,
	frankeh, dgilbert

On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Sun, Dec 06, 2020, Paolo Bonzini wrote:
> > On 03/12/20 01:34, Sean Christopherson wrote:
> > > On Tue, Dec 01, 2020, Ashish Kalra wrote:
> > > > From: Brijesh Singh <brijesh.singh@amd.com>
> > > >
> > > > KVM hypercall framework relies on alternative framework to patch the
> > > > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> > > > apply_alternative() is called then it defaults to VMCALL. The approach
> > > > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> > > > will be able to decode the instruction and do the right things. But
> > > > when SEV is active, guest memory is encrypted with guest key and
> > > > hypervisor will not be able to decode the instruction bytes.
> > > >
> > > > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> > > > will be used by the SEV guest to notify encrypted pages to the hypervisor.
> > >
> > > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> > > and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> > > think there are any existing KVM hypercalls that happen before alternatives are
> > > patched, i.e. it'll be a nop for sane kernel builds.
> > >
> > > I'm also skeptical that a KVM specific hypercall is the right approach for the
> > > encryption behavior, but I'll take that up in the patches later in the series.
> >
> > Do you think that it's the guest that should "donate" memory for the bitmap
> > instead?
>
> No.  Two things I'd like to explore:
>
>   1. Making the hypercall to announce/request private vs. shared common across
>      hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
>      I'm concerned that we'll end up with multiple hypercalls that do more or
>      less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
>      pipe dream, but I'd like to at least explore options before shoving in KVM-
>      only hypercalls.
>
>
>   2. Tracking shared memory via a list of ranges instead of a using bitmap to
>      track all of guest memory.  For most use cases, the vast majority of guest
>      memory will be private, most ranges will be 2mb+, and conversions between
>      private and shared will be uncommon events, i.e. the overhead to walk and
>      split/merge list entries is hopefully not a big concern.  I suspect a list
>      would consume far less memory, hopefully without impacting performance.

For a fancier data structure, I'd suggest an interval tree. Linux
already has an rbtree-based interval tree implementation, which would
likely work, and would probably assuage any performance concerns.

Something like this would not be worth doing unless most of the shared
pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
60ish discontiguous shared regions. This is by no means a thorough
search, but it's suggestive. If this is typical, then the bitmap would
be far less efficient than most any interval-based data structure.

You'd have to allow userspace to upper bound the number of intervals
(similar to the maximum bitmap size), to prevent host OOMs due to
malicious guests. There's something nice about the guest donating
memory for this, since that would eliminate the OOM risk.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2020-12-08  3:09         ` Steve Rutherford
@ 2020-12-08  4:16           ` Kalra, Ashish
  2020-12-08 16:29           ` Brijesh Singh
  1 sibling, 0 replies; 38+ messages in thread
From: Kalra, Ashish @ 2020-12-08  4:16 UTC (permalink / raw)
  To: Steve Rutherford
  Cc: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Joerg Roedel, Borislav Petkov, Lendacky, Thomas,
	X86 ML, KVM list, LKML, Singh, Brijesh, dovmurik, tobin, jejb,
	frankeh, dgilbert

I don’t think that the bitmap by itself is really a performance bottleneck here.

Thanks,
Ashish

> On Dec 7, 2020, at 9:10 PM, Steve Rutherford <srutherford@google.com> wrote:
> 
> On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
>> 
>>> On Sun, Dec 06, 2020, Paolo Bonzini wrote:
>>> On 03/12/20 01:34, Sean Christopherson wrote:
>>>> On Tue, Dec 01, 2020, Ashish Kalra wrote:
>>>>> From: Brijesh Singh <brijesh.singh@amd.com>
>>>>> 
>>>>> KVM hypercall framework relies on alternative framework to patch the
>>>>> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
>>>>> apply_alternative() is called then it defaults to VMCALL. The approach
>>>>> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
>>>>> will be able to decode the instruction and do the right things. But
>>>>> when SEV is active, guest memory is encrypted with guest key and
>>>>> hypervisor will not be able to decode the instruction bytes.
>>>>> 
>>>>> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
>>>>> will be used by the SEV guest to notify encrypted pages to the hypervisor.
>>>> 
>>>> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
>>>> and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
>>>> think there are any existing KVM hypercalls that happen before alternatives are
>>>> patched, i.e. it'll be a nop for sane kernel builds.
>>>> 
>>>> I'm also skeptical that a KVM specific hypercall is the right approach for the
>>>> encryption behavior, but I'll take that up in the patches later in the series.
>>> 
>>> Do you think that it's the guest that should "donate" memory for the bitmap
>>> instead?
>> 
>> No.  Two things I'd like to explore:
>> 
>>  1. Making the hypercall to announce/request private vs. shared common across
>>     hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
>>     I'm concerned that we'll end up with multiple hypercalls that do more or
>>     less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
>>     pipe dream, but I'd like to at least explore options before shoving in KVM-
>>     only hypercalls.
>> 
>> 
>>  2. Tracking shared memory via a list of ranges instead of a using bitmap to
>>     track all of guest memory.  For most use cases, the vast majority of guest
>>     memory will be private, most ranges will be 2mb+, and conversions between
>>     private and shared will be uncommon events, i.e. the overhead to walk and
>>     split/merge list entries is hopefully not a big concern.  I suspect a list
>>     would consume far less memory, hopefully without impacting performance.
> 
> For a fancier data structure, I'd suggest an interval tree. Linux
> already has an rbtree-based interval tree implementation, which would
> likely work, and would probably assuage any performance concerns.
> 
> Something like this would not be worth doing unless most of the shared
> pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
> 60ish discontiguous shared regions. This is by no means a thorough
> search, but it's suggestive. If this is typical, then the bitmap would
> be far less efficient than most any interval-based data structure.
> 
> You'd have to allow userspace to upper bound the number of intervals
> (similar to the maximum bitmap size), to prevent host OOMs due to
> malicious guests. There's something nice about the guest donating
> memory for this, since that would eliminate the OOM risk.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2020-12-08  3:09         ` Steve Rutherford
  2020-12-08  4:16           ` Kalra, Ashish
@ 2020-12-08 16:29           ` Brijesh Singh
  2020-12-11 22:55             ` Ashish Kalra
  1 sibling, 1 reply; 38+ messages in thread
From: Brijesh Singh @ 2020-12-08 16:29 UTC (permalink / raw)
  To: Steve Rutherford, Sean Christopherson
  Cc: brijesh.singh, Paolo Bonzini, Ashish Kalra, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Joerg Roedel, Borislav Petkov,
	Tom Lendacky, X86 ML, KVM list, LKML, dovmurik, tobin, jejb,
	frankeh, dgilbert


On 12/7/20 9:09 PM, Steve Rutherford wrote:
> On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
>> On Sun, Dec 06, 2020, Paolo Bonzini wrote:
>>> On 03/12/20 01:34, Sean Christopherson wrote:
>>>> On Tue, Dec 01, 2020, Ashish Kalra wrote:
>>>>> From: Brijesh Singh <brijesh.singh@amd.com>
>>>>>
>>>>> KVM hypercall framework relies on alternative framework to patch the
>>>>> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
>>>>> apply_alternative() is called then it defaults to VMCALL. The approach
>>>>> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
>>>>> will be able to decode the instruction and do the right things. But
>>>>> when SEV is active, guest memory is encrypted with guest key and
>>>>> hypervisor will not be able to decode the instruction bytes.
>>>>>
>>>>> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
>>>>> will be used by the SEV guest to notify encrypted pages to the hypervisor.
>>>> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
>>>> and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
>>>> think there are any existing KVM hypercalls that happen before alternatives are
>>>> patched, i.e. it'll be a nop for sane kernel builds.
>>>>
>>>> I'm also skeptical that a KVM specific hypercall is the right approach for the
>>>> encryption behavior, but I'll take that up in the patches later in the series.
>>> Do you think that it's the guest that should "donate" memory for the bitmap
>>> instead?
>> No.  Two things I'd like to explore:
>>
>>   1. Making the hypercall to announce/request private vs. shared common across
>>      hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
>>      I'm concerned that we'll end up with multiple hypercalls that do more or
>>      less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
>>      pipe dream, but I'd like to at least explore options before shoving in KVM-
>>      only hypercalls.
>>
>>
>>   2. Tracking shared memory via a list of ranges instead of a using bitmap to
>>      track all of guest memory.  For most use cases, the vast majority of guest
>>      memory will be private, most ranges will be 2mb+, and conversions between
>>      private and shared will be uncommon events, i.e. the overhead to walk and
>>      split/merge list entries is hopefully not a big concern.  I suspect a list
>>      would consume far less memory, hopefully without impacting performance.
> For a fancier data structure, I'd suggest an interval tree. Linux
> already has an rbtree-based interval tree implementation, which would
> likely work, and would probably assuage any performance concerns.
>
> Something like this would not be worth doing unless most of the shared
> pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
> 60ish discontiguous shared regions. This is by no means a thorough
> search, but it's suggestive. If this is typical, then the bitmap would
> be far less efficient than most any interval-based data structure.
>
> You'd have to allow userspace to upper bound the number of intervals
> (similar to the maximum bitmap size), to prevent host OOMs due to
> malicious guests. There's something nice about the guest donating
> memory for this, since that would eliminate the OOM risk.


Tracking the list of ranges may not be bad idea, especially if we use
the some kind of rbtree-based data structure to update the ranges. It
will certainly be better than bitmap which grows based on the guest
memory size and as you guys see in the practice most of the pages will
be guest private. I am not sure if guest donating a memory will cover
all the cases, e.g what if we do a memory hotplug (increase the guest
ram from 2GB to 64GB), will donated memory range will be enough to store
the metadata.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2020-12-08 16:29           ` Brijesh Singh
@ 2020-12-11 22:55             ` Ashish Kalra
       [not found]               ` <20201212045603.GA27415@ashkalra_ubuntu_server>
  0 siblings, 1 reply; 38+ messages in thread
From: Ashish Kalra @ 2020-12-11 22:55 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: Steve Rutherford, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Joerg Roedel,
	Borislav Petkov, Tom Lendacky, X86 ML, KVM list, LKML, dovmurik,
	tobin, jejb, frankeh, dgilbert

Hello All,

On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote:
> 
> On 12/7/20 9:09 PM, Steve Rutherford wrote:
> > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
> >> On Sun, Dec 06, 2020, Paolo Bonzini wrote:
> >>> On 03/12/20 01:34, Sean Christopherson wrote:
> >>>> On Tue, Dec 01, 2020, Ashish Kalra wrote:
> >>>>> From: Brijesh Singh <brijesh.singh@amd.com>
> >>>>>
> >>>>> KVM hypercall framework relies on alternative framework to patch the
> >>>>> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> >>>>> apply_alternative() is called then it defaults to VMCALL. The approach
> >>>>> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> >>>>> will be able to decode the instruction and do the right things. But
> >>>>> when SEV is active, guest memory is encrypted with guest key and
> >>>>> hypervisor will not be able to decode the instruction bytes.
> >>>>>
> >>>>> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> >>>>> will be used by the SEV guest to notify encrypted pages to the hypervisor.
> >>>> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> >>>> and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> >>>> think there are any existing KVM hypercalls that happen before alternatives are
> >>>> patched, i.e. it'll be a nop for sane kernel builds.
> >>>>
> >>>> I'm also skeptical that a KVM specific hypercall is the right approach for the
> >>>> encryption behavior, but I'll take that up in the patches later in the series.
> >>> Do you think that it's the guest that should "donate" memory for the bitmap
> >>> instead?
> >> No.  Two things I'd like to explore:
> >>
> >>   1. Making the hypercall to announce/request private vs. shared common across
> >>      hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
> >>      I'm concerned that we'll end up with multiple hypercalls that do more or
> >>      less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
> >>      pipe dream, but I'd like to at least explore options before shoving in KVM-
> >>      only hypercalls.
> >>
> >>
> >>   2. Tracking shared memory via a list of ranges instead of a using bitmap to
> >>      track all of guest memory.  For most use cases, the vast majority of guest
> >>      memory will be private, most ranges will be 2mb+, and conversions between
> >>      private and shared will be uncommon events, i.e. the overhead to walk and
> >>      split/merge list entries is hopefully not a big concern.  I suspect a list
> >>      would consume far less memory, hopefully without impacting performance.
> > For a fancier data structure, I'd suggest an interval tree. Linux
> > already has an rbtree-based interval tree implementation, which would
> > likely work, and would probably assuage any performance concerns.
> >
> > Something like this would not be worth doing unless most of the shared
> > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
> > 60ish discontiguous shared regions. This is by no means a thorough
> > search, but it's suggestive. If this is typical, then the bitmap would
> > be far less efficient than most any interval-based data structure.
> >
> > You'd have to allow userspace to upper bound the number of intervals
> > (similar to the maximum bitmap size), to prevent host OOMs due to
> > malicious guests. There's something nice about the guest donating
> > memory for this, since that would eliminate the OOM risk.
> 
> 
> Tracking the list of ranges may not be bad idea, especially if we use
> the some kind of rbtree-based data structure to update the ranges. It
> will certainly be better than bitmap which grows based on the guest
> memory size and as you guys see in the practice most of the pages will
> be guest private. I am not sure if guest donating a memory will cover
> all the cases, e.g what if we do a memory hotplug (increase the guest
> ram from 2GB to 64GB), will donated memory range will be enough to store
> the metadata.
> 
>. 

With reference to internal discussions regarding the above, i am going
to look into specific items as listed below :

1). "hypercall" related :
a). Explore the SEV-SNP page change request structure (included in GHCB),
see if there is something common there than can be re-used for SEV/SEV-ES
page encryption status hypercalls.
b). Explore if there is any common hypercall framework i can use in 
Linux/KVM.

2). related to the "backing" data structure - explore using a range-based
list or something like rbtree-based interval tree data structure
(as mentioned by Steve above) to replace the current bitmap based
implementation.

Thanks,
Ashish


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
       [not found]               ` <20201212045603.GA27415@ashkalra_ubuntu_server>
@ 2020-12-18 19:39                 ` Dr. David Alan Gilbert
       [not found]                   ` <E79E09A2-F314-4B59-B7AE-07B1D422DF2B@amd.com>
  0 siblings, 1 reply; 38+ messages in thread
From: Dr. David Alan Gilbert @ 2020-12-18 19:39 UTC (permalink / raw)
  To: Ashish Kalra
  Cc: Brijesh Singh, Steve Rutherford, Sean Christopherson,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Tom Lendacky, X86 ML, KVM list,
	LKML, dovmurik, tobin, jejb, frankeh

* Ashish Kalra (ashish.kalra@amd.com) wrote:
> On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote:
> > Hello All,
> > 
> > On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote:
> > > 
> > > On 12/7/20 9:09 PM, Steve Rutherford wrote:
> > > > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >> On Sun, Dec 06, 2020, Paolo Bonzini wrote:
> > > >>> On 03/12/20 01:34, Sean Christopherson wrote:
> > > >>>> On Tue, Dec 01, 2020, Ashish Kalra wrote:
> > > >>>>> From: Brijesh Singh <brijesh.singh@amd.com>
> > > >>>>>
> > > >>>>> KVM hypercall framework relies on alternative framework to patch the
> > > >>>>> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> > > >>>>> apply_alternative() is called then it defaults to VMCALL. The approach
> > > >>>>> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> > > >>>>> will be able to decode the instruction and do the right things. But
> > > >>>>> when SEV is active, guest memory is encrypted with guest key and
> > > >>>>> hypervisor will not be able to decode the instruction bytes.
> > > >>>>>
> > > >>>>> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> > > >>>>> will be used by the SEV guest to notify encrypted pages to the hypervisor.
> > > >>>> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> > > >>>> and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> > > >>>> think there are any existing KVM hypercalls that happen before alternatives are
> > > >>>> patched, i.e. it'll be a nop for sane kernel builds.
> > > >>>>
> > > >>>> I'm also skeptical that a KVM specific hypercall is the right approach for the
> > > >>>> encryption behavior, but I'll take that up in the patches later in the series.
> > > >>> Do you think that it's the guest that should "donate" memory for the bitmap
> > > >>> instead?
> > > >> No.  Two things I'd like to explore:
> > > >>
> > > >>   1. Making the hypercall to announce/request private vs. shared common across
> > > >>      hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
> > > >>      I'm concerned that we'll end up with multiple hypercalls that do more or
> > > >>      less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
> > > >>      pipe dream, but I'd like to at least explore options before shoving in KVM-
> > > >>      only hypercalls.
> > > >>
> > > >>
> > > >>   2. Tracking shared memory via a list of ranges instead of a using bitmap to
> > > >>      track all of guest memory.  For most use cases, the vast majority of guest
> > > >>      memory will be private, most ranges will be 2mb+, and conversions between
> > > >>      private and shared will be uncommon events, i.e. the overhead to walk and
> > > >>      split/merge list entries is hopefully not a big concern.  I suspect a list
> > > >>      would consume far less memory, hopefully without impacting performance.
> > > > For a fancier data structure, I'd suggest an interval tree. Linux
> > > > already has an rbtree-based interval tree implementation, which would
> > > > likely work, and would probably assuage any performance concerns.
> > > >
> > > > Something like this would not be worth doing unless most of the shared
> > > > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
> > > > 60ish discontiguous shared regions. This is by no means a thorough
> > > > search, but it's suggestive. If this is typical, then the bitmap would
> > > > be far less efficient than most any interval-based data structure.
> > > >
> > > > You'd have to allow userspace to upper bound the number of intervals
> > > > (similar to the maximum bitmap size), to prevent host OOMs due to
> > > > malicious guests. There's something nice about the guest donating
> > > > memory for this, since that would eliminate the OOM risk.
> > > 
> > > 
> > > Tracking the list of ranges may not be bad idea, especially if we use
> > > the some kind of rbtree-based data structure to update the ranges. It
> > > will certainly be better than bitmap which grows based on the guest
> > > memory size and as you guys see in the practice most of the pages will
> > > be guest private. I am not sure if guest donating a memory will cover
> > > all the cases, e.g what if we do a memory hotplug (increase the guest
> > > ram from 2GB to 64GB), will donated memory range will be enough to store
> > > the metadata.
> > > 
> > >. 
> > 
> > With reference to internal discussions regarding the above, i am going
> > to look into specific items as listed below :
> > 
> > 1). "hypercall" related :
> > a). Explore the SEV-SNP page change request structure (included in GHCB),
> > see if there is something common there than can be re-used for SEV/SEV-ES
> > page encryption status hypercalls.
> > b). Explore if there is any common hypercall framework i can use in 
> > Linux/KVM.
> > 
> > 2). related to the "backing" data structure - explore using a range-based
> > list or something like rbtree-based interval tree data structure
> > (as mentioned by Steve above) to replace the current bitmap based
> > implementation.
> > 
> > 
> 
> I do agree that a range-based list or an interval tree data structure is a
> really good "logical" fit for the guest page encryption status tracking.
> 
> We can only keep track of the guest unencrypted shared pages in the
> range(s) list (which will keep the data structure quite compact) and all
> the guest private/encrypted memory does not really need any tracking in
> the list, anything not in the list will be encrypted/private.
> 
> Also looking at a more "practical" use case, here is the current log of
> page encryption status hypercalls when booting a linux guest :
> 
> ...

<snip>

> [   56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 1
> [   56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> [   56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> [   56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 0
....

> [   56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 0
> [   56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 0
> [   56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 1
> [   56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 1

....
> [   56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 0
> [   56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 0
> [   56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 1
> [   56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 1
....

> [   56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 0
> [   56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 0
> [   56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 1
> [   56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 1
....

> [   56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 0
> [   56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 0
> [   56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 1
> [   56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 1
....
> [   56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> [   56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> [   56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 1
> [   56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 0
> [   56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> [   56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
....

> As can be observed here, all guest MMIO ranges are initially setup as
> shared, and those are all contigious guest page ranges.
> 
> After that the encryption status hypercalls are invoked when DMA gets
> triggered during disk i/o while booting the guest ... here again the
> guest page ranges are contigious, though mostly single page is touched 
> and a lot of page re-use is observed. 
> 
> So a range-based list/structure will be a "good" fit for such usage
> scenarios.

It seems surprisingly common to flick the same pages back and forth between
encrypted and clear for quite a while;  why is this?

Dave


> Thanks,
> Ashish
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
       [not found]                   ` <E79E09A2-F314-4B59-B7AE-07B1D422DF2B@amd.com>
@ 2020-12-18 19:56                     ` Dr. David Alan Gilbert
  2021-01-06 23:05                       ` Ashish Kalra
  0 siblings, 1 reply; 38+ messages in thread
From: Dr. David Alan Gilbert @ 2020-12-18 19:56 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: Singh, Brijesh, Steve Rutherford, Sean Christopherson,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh

* Kalra, Ashish (Ashish.Kalra@amd.com) wrote:
> Hello Dave,
> 
> On Dec 18, 2020, at 1:40 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Ashish Kalra (ashish.kalra@amd.com) wrote:
> On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote:
> Hello All,
> 
> On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote:
> 
> On 12/7/20 9:09 PM, Steve Rutherford wrote:
> On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
> On Sun, Dec 06, 2020, Paolo Bonzini wrote:
> On 03/12/20 01:34, Sean Christopherson wrote:
> On Tue, Dec 01, 2020, Ashish Kalra wrote:
> From: Brijesh Singh <brijesh.singh@amd.com>
> 
> KVM hypercall framework relies on alternative framework to patch the
> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> apply_alternative() is called then it defaults to VMCALL. The approach
> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> will be able to decode the instruction and do the right things. But
> when SEV is active, guest memory is encrypted with guest key and
> hypervisor will not be able to decode the instruction bytes.
> 
> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> will be used by the SEV guest to notify encrypted pages to the hypervisor.
> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> think there are any existing KVM hypercalls that happen before alternatives are
> patched, i.e. it'll be a nop for sane kernel builds.
> 
> I'm also skeptical that a KVM specific hypercall is the right approach for the
> encryption behavior, but I'll take that up in the patches later in the series.
> Do you think that it's the guest that should "donate" memory for the bitmap
> instead?
> No.  Two things I'd like to explore:
> 
>  1. Making the hypercall to announce/request private vs. shared common across
>     hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
>     I'm concerned that we'll end up with multiple hypercalls that do more or
>     less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
>     pipe dream, but I'd like to at least explore options before shoving in KVM-
>     only hypercalls.
> 
> 
>  2. Tracking shared memory via a list of ranges instead of a using bitmap to
>     track all of guest memory.  For most use cases, the vast majority of guest
>     memory will be private, most ranges will be 2mb+, and conversions between
>     private and shared will be uncommon events, i.e. the overhead to walk and
>     split/merge list entries is hopefully not a big concern.  I suspect a list
>     would consume far less memory, hopefully without impacting performance.
> For a fancier data structure, I'd suggest an interval tree. Linux
> already has an rbtree-based interval tree implementation, which would
> likely work, and would probably assuage any performance concerns.
> 
> Something like this would not be worth doing unless most of the shared
> pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
> 60ish discontiguous shared regions. This is by no means a thorough
> search, but it's suggestive. If this is typical, then the bitmap would
> be far less efficient than most any interval-based data structure.
> 
> You'd have to allow userspace to upper bound the number of intervals
> (similar to the maximum bitmap size), to prevent host OOMs due to
> malicious guests. There's something nice about the guest donating
> memory for this, since that would eliminate the OOM risk.
> 
> 
> Tracking the list of ranges may not be bad idea, especially if we use
> the some kind of rbtree-based data structure to update the ranges. It
> will certainly be better than bitmap which grows based on the guest
> memory size and as you guys see in the practice most of the pages will
> be guest private. I am not sure if guest donating a memory will cover
> all the cases, e.g what if we do a memory hotplug (increase the guest
> ram from 2GB to 64GB), will donated memory range will be enough to store
> the metadata.
> 
> .
> 
> With reference to internal discussions regarding the above, i am going
> to look into specific items as listed below :
> 
> 1). "hypercall" related :
> a). Explore the SEV-SNP page change request structure (included in GHCB),
> see if there is something common there than can be re-used for SEV/SEV-ES
> page encryption status hypercalls.
> b). Explore if there is any common hypercall framework i can use in
> Linux/KVM.
> 
> 2). related to the "backing" data structure - explore using a range-based
> list or something like rbtree-based interval tree data structure
> (as mentioned by Steve above) to replace the current bitmap based
> implementation.
> 
> 
> 
> I do agree that a range-based list or an interval tree data structure is a
> really good "logical" fit for the guest page encryption status tracking.
> 
> We can only keep track of the guest unencrypted shared pages in the
> range(s) list (which will keep the data structure quite compact) and all
> the guest private/encrypted memory does not really need any tracking in
> the list, anything not in the list will be encrypted/private.
> 
> Also looking at a more "practical" use case, here is the current log of
> page encryption status hypercalls when booting a linux guest :
> 
> ...
> 
> <snip>
> 
> [   56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 1
> [   56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> [   56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> [   56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 0
> ....
> 
> [   56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 0
> [   56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 0
> [   56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 1
> [   56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 1
> 
> ....
> [   56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 0
> [   56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 0
> [   56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 1
> [   56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 1
> ....
> 
> [   56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 0
> [   56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 0
> [   56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 1
> [   56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 1
> ....
> 
> [   56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 0
> [   56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 0
> [   56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 1
> [   56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 1
> ....
> [   56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> [   56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> [   56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 1
> [   56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 0
> [   56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> [   56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> ....
> 
> As can be observed here, all guest MMIO ranges are initially setup as
> shared, and those are all contigious guest page ranges.
> 
> After that the encryption status hypercalls are invoked when DMA gets
> triggered during disk i/o while booting the guest ... here again the
> guest page ranges are contigious, though mostly single page is touched
> and a lot of page re-use is observed.
> 
> So a range-based list/structure will be a "good" fit for such usage
> scenarios.
> 
> It seems surprisingly common to flick the same pages back and forth between
> encrypted and clear for quite a while;  why is this?
> 
> 
> dma_alloc_coherent()'s will allocate pages and then call
> set_decrypted() on them and then at dma_free_coherent(), set_encrypted()
> is called on the pages to be freed. So these observations in the logs
> where a lot of single 4K pages are seeing C-bit transitions and
> corresponding hypercalls are the ones associated with
> dma_alloc_coherent().

It makes me wonder if it might be worth teaching it to hold onto those
DMA pages somewhere until it needs them for something else and avoid the
extra hypercalls; just something to think about.

Dave

> Thanks,
> Ashish
> 
> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2020-12-18 19:56                     ` Dr. David Alan Gilbert
@ 2021-01-06 23:05                       ` Ashish Kalra
  2021-01-07  1:01                         ` Steve Rutherford
  0 siblings, 1 reply; 38+ messages in thread
From: Ashish Kalra @ 2021-01-06 23:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Singh, Brijesh, Steve Rutherford, Sean Christopherson,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh

On Fri, Dec 18, 2020 at 07:56:41PM +0000, Dr. David Alan Gilbert wrote:
> * Kalra, Ashish (Ashish.Kalra@amd.com) wrote:
> > Hello Dave,
> > 
> > On Dec 18, 2020, at 1:40 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> > * Ashish Kalra (ashish.kalra@amd.com) wrote:
> > On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote:
> > Hello All,
> > 
> > On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote:
> > 
> > On 12/7/20 9:09 PM, Steve Rutherford wrote:
> > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
> > On Sun, Dec 06, 2020, Paolo Bonzini wrote:
> > On 03/12/20 01:34, Sean Christopherson wrote:
> > On Tue, Dec 01, 2020, Ashish Kalra wrote:
> > From: Brijesh Singh <brijesh.singh@amd.com>
> > 
> > KVM hypercall framework relies on alternative framework to patch the
> > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> > apply_alternative() is called then it defaults to VMCALL. The approach
> > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> > will be able to decode the instruction and do the right things. But
> > when SEV is active, guest memory is encrypted with guest key and
> > hypervisor will not be able to decode the instruction bytes.
> > 
> > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> > will be used by the SEV guest to notify encrypted pages to the hypervisor.
> > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> > and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> > think there are any existing KVM hypercalls that happen before alternatives are
> > patched, i.e. it'll be a nop for sane kernel builds.
> > 
> > I'm also skeptical that a KVM specific hypercall is the right approach for the
> > encryption behavior, but I'll take that up in the patches later in the series.
> > Do you think that it's the guest that should "donate" memory for the bitmap
> > instead?
> > No.  Two things I'd like to explore:
> > 
> >  1. Making the hypercall to announce/request private vs. shared common across
> >     hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
> >     I'm concerned that we'll end up with multiple hypercalls that do more or
> >     less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
> >     pipe dream, but I'd like to at least explore options before shoving in KVM-
> >     only hypercalls.
> > 
> > 
> >  2. Tracking shared memory via a list of ranges instead of a using bitmap to
> >     track all of guest memory.  For most use cases, the vast majority of guest
> >     memory will be private, most ranges will be 2mb+, and conversions between
> >     private and shared will be uncommon events, i.e. the overhead to walk and
> >     split/merge list entries is hopefully not a big concern.  I suspect a list
> >     would consume far less memory, hopefully without impacting performance.
> > For a fancier data structure, I'd suggest an interval tree. Linux
> > already has an rbtree-based interval tree implementation, which would
> > likely work, and would probably assuage any performance concerns.
> > 
> > Something like this would not be worth doing unless most of the shared
> > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
> > 60ish discontiguous shared regions. This is by no means a thorough
> > search, but it's suggestive. If this is typical, then the bitmap would
> > be far less efficient than most any interval-based data structure.
> > 
> > You'd have to allow userspace to upper bound the number of intervals
> > (similar to the maximum bitmap size), to prevent host OOMs due to
> > malicious guests. There's something nice about the guest donating
> > memory for this, since that would eliminate the OOM risk.
> > 
> > 
> > Tracking the list of ranges may not be bad idea, especially if we use
> > the some kind of rbtree-based data structure to update the ranges. It
> > will certainly be better than bitmap which grows based on the guest
> > memory size and as you guys see in the practice most of the pages will
> > be guest private. I am not sure if guest donating a memory will cover
> > all the cases, e.g what if we do a memory hotplug (increase the guest
> > ram from 2GB to 64GB), will donated memory range will be enough to store
> > the metadata.
> > 
> > .
> > 
> > With reference to internal discussions regarding the above, i am going
> > to look into specific items as listed below :
> > 
> > 1). "hypercall" related :
> > a). Explore the SEV-SNP page change request structure (included in GHCB),
> > see if there is something common there than can be re-used for SEV/SEV-ES
> > page encryption status hypercalls.
> > b). Explore if there is any common hypercall framework i can use in
> > Linux/KVM.
> > 
> > 2). related to the "backing" data structure - explore using a range-based
> > list or something like rbtree-based interval tree data structure
> > (as mentioned by Steve above) to replace the current bitmap based
> > implementation.
> > 
> > 
> > 
> > I do agree that a range-based list or an interval tree data structure is a
> > really good "logical" fit for the guest page encryption status tracking.
> > 
> > We can only keep track of the guest unencrypted shared pages in the
> > range(s) list (which will keep the data structure quite compact) and all
> > the guest private/encrypted memory does not really need any tracking in
> > the list, anything not in the list will be encrypted/private.
> > 
> > Also looking at a more "practical" use case, here is the current log of
> > page encryption status hypercalls when booting a linux guest :
> > 
> > ...
> > 
> > <snip>
> > 
> > [   56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 1
> > [   56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> > [   56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> > [   56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 0
> > ....
> > 
> > [   56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 0
> > [   56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 0
> > [   56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 1
> > [   56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 1
> > 
> > ....
> > [   56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 0
> > [   56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 0
> > [   56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 1
> > [   56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 1
> > ....
> > 
> > [   56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 0
> > [   56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 0
> > [   56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 1
> > [   56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 1
> > ....
> > 
> > [   56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 0
> > [   56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 0
> > [   56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 1
> > [   56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 1
> > ....
> > [   56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> > [   56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> > [   56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 1
> > [   56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 0
> > [   56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> > [   56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> > ....
> > 
> > As can be observed here, all guest MMIO ranges are initially setup as
> > shared, and those are all contigious guest page ranges.
> > 
> > After that the encryption status hypercalls are invoked when DMA gets
> > triggered during disk i/o while booting the guest ... here again the
> > guest page ranges are contigious, though mostly single page is touched
> > and a lot of page re-use is observed.
> > 
> > So a range-based list/structure will be a "good" fit for such usage
> > scenarios.
> > 
> > It seems surprisingly common to flick the same pages back and forth between
> > encrypted and clear for quite a while;  why is this?
> > 
> > 
> > dma_alloc_coherent()'s will allocate pages and then call
> > set_decrypted() on them and then at dma_free_coherent(), set_encrypted()
> > is called on the pages to be freed. So these observations in the logs
> > where a lot of single 4K pages are seeing C-bit transitions and
> > corresponding hypercalls are the ones associated with
> > dma_alloc_coherent().
> 
> It makes me wonder if it might be worth teaching it to hold onto those
> DMA pages somewhere until it needs them for something else and avoid the
> extra hypercalls; just something to think about.
> 
> Dave

Following up on this discussion and looking at the hypercall logs and DMA usage scenarios on the SEV, I have the following additional observations and comments :

It is mostly the Guest MMIO regions setup as un-encrypted by uefi/edk2 initially, which will be the "static" nodes in the backing data structure for page encryption status. 
These will be like 15-20 nodes/entries.

Drivers doing DMA allocations using GFP_ATOMIC will be fetching DMA buffers from the pre-allocated unencrypted atomic pool, hence it will be a "static" node added at kernel startup.

As we see with the logs, almost all runtime C-bit transitions and corresponding hypercalls will be from DMA I/O and dma_alloc_coherent/dma_free_coherent calls, these will be 
using 4K/single pages and mostly fragmented ranges, so if we use a "rbtree" based interval tree then there will be a lot of tree insertions and deletions 
(dma_alloc_coherent followed with a dma_free_coherent), so this will lead to a lot of expensive tree rotations and re-balancing, compared to much less complex 
and faster linked list node insertions and deletions (if we use a list based structure to represent these interval ranges).

Also as the static nodes in the structure will be quite limited (all the above DMA I/O added ranges will simply be inserted and removed), so a linked list lookup 
won't be too expensive compared to a tree lookup. In other words, this be a fixed size list.

Looking at the above, I am now more inclined to use a list based structure to represent the page encryption status.

Looking fwd. to any comments/feedback/thoughts on the above.

Thanks,
Ashish


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2021-01-06 23:05                       ` Ashish Kalra
@ 2021-01-07  1:01                         ` Steve Rutherford
  2021-01-07  1:34                           ` Ashish Kalra
  2021-01-07 17:07                           ` Ashish Kalra
  0 siblings, 2 replies; 38+ messages in thread
From: Steve Rutherford @ 2021-01-07  1:01 UTC (permalink / raw)
  To: Ashish Kalra
  Cc: Dr. David Alan Gilbert, Singh, Brijesh, Sean Christopherson,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh

Avoiding an rbtree for such a small (but unstable) list seems correct.

For the unencrypted region list strategy, the only questions that I
have are fairly secondary.
- How should the kernel upper bound the size of the list in the face
of malicious guests, but still support large guests? (Something
similar to the size provided in the bitmap API would work).
- What serialization format should be used for the ioctl API?
(Usermode could send down a pointer to a user region and a size. The
kernel could then populate that with an array of structs containing
bases and limits for unencrypted regions.)
- How will the kernel tag a guest as having exceeded its maximum list
size, in order to indicate that the list is now incomplete? (Track a
poison bit, and send it up when getting the serialized list of
regions).

In my view, there are two main competitors to this strategy:
- (Existing) Bitmap API
- A guest memory donation based model

The existing bitmap API avoids any issues with growing too large,
since it's size is predictable.

To elaborate on the memory donation based model, the guest could put
an encryption status data structure into unencrypted guest memory, and
then use a hypercall to inform the host where the base of that
structure is located. The main advantage of this is that it side steps
any issues around malicious guests causing large allocations.

The unencrypted region list seems very practical. It's biggest
advantage over the bitmap is how cheap it will be to pass the
structure up from the kernel. A memory donation based model could
achieve similar performance, but with some additional complexity.

Does anyone view the memory donation model as worth the complexity?
Does anyone think the simplicity of the bitmap is a better tradeoff
compared to an unencrypted region list?
Or have other ideas that are not mentioned here?


On Wed, Jan 6, 2021 at 3:06 PM Ashish Kalra <ashish.kalra@amd.com> wrote:
>
> On Fri, Dec 18, 2020 at 07:56:41PM +0000, Dr. David Alan Gilbert wrote:
> > * Kalra, Ashish (Ashish.Kalra@amd.com) wrote:
> > > Hello Dave,
> > >
> > > On Dec 18, 2020, at 1:40 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > >
> > > * Ashish Kalra (ashish.kalra@amd.com) wrote:
> > > On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote:
> > > Hello All,
> > >
> > > On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote:
> > >
> > > On 12/7/20 9:09 PM, Steve Rutherford wrote:
> > > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
> > > On Sun, Dec 06, 2020, Paolo Bonzini wrote:
> > > On 03/12/20 01:34, Sean Christopherson wrote:
> > > On Tue, Dec 01, 2020, Ashish Kalra wrote:
> > > From: Brijesh Singh <brijesh.singh@amd.com>
> > >
> > > KVM hypercall framework relies on alternative framework to patch the
> > > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> > > apply_alternative() is called then it defaults to VMCALL. The approach
> > > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> > > will be able to decode the instruction and do the right things. But
> > > when SEV is active, guest memory is encrypted with guest key and
> > > hypervisor will not be able to decode the instruction bytes.
> > >
> > > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> > > will be used by the SEV guest to notify encrypted pages to the hypervisor.
> > > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> > > and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> > > think there are any existing KVM hypercalls that happen before alternatives are
> > > patched, i.e. it'll be a nop for sane kernel builds.
> > >
> > > I'm also skeptical that a KVM specific hypercall is the right approach for the
> > > encryption behavior, but I'll take that up in the patches later in the series.
> > > Do you think that it's the guest that should "donate" memory for the bitmap
> > > instead?
> > > No.  Two things I'd like to explore:
> > >
> > >  1. Making the hypercall to announce/request private vs. shared common across
> > >     hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
> > >     I'm concerned that we'll end up with multiple hypercalls that do more or
> > >     less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
> > >     pipe dream, but I'd like to at least explore options before shoving in KVM-
> > >     only hypercalls.
> > >
> > >
> > >  2. Tracking shared memory via a list of ranges instead of a using bitmap to
> > >     track all of guest memory.  For most use cases, the vast majority of guest
> > >     memory will be private, most ranges will be 2mb+, and conversions between
> > >     private and shared will be uncommon events, i.e. the overhead to walk and
> > >     split/merge list entries is hopefully not a big concern.  I suspect a list
> > >     would consume far less memory, hopefully without impacting performance.
> > > For a fancier data structure, I'd suggest an interval tree. Linux
> > > already has an rbtree-based interval tree implementation, which would
> > > likely work, and would probably assuage any performance concerns.
> > >
> > > Something like this would not be worth doing unless most of the shared
> > > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
> > > 60ish discontiguous shared regions. This is by no means a thorough
> > > search, but it's suggestive. If this is typical, then the bitmap would
> > > be far less efficient than most any interval-based data structure.
> > >
> > > You'd have to allow userspace to upper bound the number of intervals
> > > (similar to the maximum bitmap size), to prevent host OOMs due to
> > > malicious guests. There's something nice about the guest donating
> > > memory for this, since that would eliminate the OOM risk.
> > >
> > >
> > > Tracking the list of ranges may not be bad idea, especially if we use
> > > the some kind of rbtree-based data structure to update the ranges. It
> > > will certainly be better than bitmap which grows based on the guest
> > > memory size and as you guys see in the practice most of the pages will
> > > be guest private. I am not sure if guest donating a memory will cover
> > > all the cases, e.g what if we do a memory hotplug (increase the guest
> > > ram from 2GB to 64GB), will donated memory range will be enough to store
> > > the metadata.
> > >
> > > .
> > >
> > > With reference to internal discussions regarding the above, i am going
> > > to look into specific items as listed below :
> > >
> > > 1). "hypercall" related :
> > > a). Explore the SEV-SNP page change request structure (included in GHCB),
> > > see if there is something common there than can be re-used for SEV/SEV-ES
> > > page encryption status hypercalls.
> > > b). Explore if there is any common hypercall framework i can use in
> > > Linux/KVM.
> > >
> > > 2). related to the "backing" data structure - explore using a range-based
> > > list or something like rbtree-based interval tree data structure
> > > (as mentioned by Steve above) to replace the current bitmap based
> > > implementation.
> > >
> > >
> > >
> > > I do agree that a range-based list or an interval tree data structure is a
> > > really good "logical" fit for the guest page encryption status tracking.
> > >
> > > We can only keep track of the guest unencrypted shared pages in the
> > > range(s) list (which will keep the data structure quite compact) and all
> > > the guest private/encrypted memory does not really need any tracking in
> > > the list, anything not in the list will be encrypted/private.
> > >
> > > Also looking at a more "practical" use case, here is the current log of
> > > page encryption status hypercalls when booting a linux guest :
> > >
> > > ...
> > >
> > > <snip>
> > >
> > > [   56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 1
> > > [   56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> > > [   56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> > > [   56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 0
> > > ....
> > >
> > > [   56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 0
> > > [   56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 0
> > > [   56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 1
> > > [   56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 1
> > >
> > > ....
> > > [   56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 0
> > > [   56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 0
> > > [   56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 1
> > > [   56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 1
> > > ....
> > >
> > > [   56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 0
> > > [   56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 0
> > > [   56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 1
> > > [   56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 1
> > > ....
> > >
> > > [   56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 0
> > > [   56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 0
> > > [   56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 1
> > > [   56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 1
> > > ....
> > > [   56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> > > [   56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> > > [   56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 1
> > > [   56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 0
> > > [   56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> > > [   56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> > > ....
> > >
> > > As can be observed here, all guest MMIO ranges are initially setup as
> > > shared, and those are all contigious guest page ranges.
> > >
> > > After that the encryption status hypercalls are invoked when DMA gets
> > > triggered during disk i/o while booting the guest ... here again the
> > > guest page ranges are contigious, though mostly single page is touched
> > > and a lot of page re-use is observed.
> > >
> > > So a range-based list/structure will be a "good" fit for such usage
> > > scenarios.
> > >
> > > It seems surprisingly common to flick the same pages back and forth between
> > > encrypted and clear for quite a while;  why is this?
> > >
> > >
> > > dma_alloc_coherent()'s will allocate pages and then call
> > > set_decrypted() on them and then at dma_free_coherent(), set_encrypted()
> > > is called on the pages to be freed. So these observations in the logs
> > > where a lot of single 4K pages are seeing C-bit transitions and
> > > corresponding hypercalls are the ones associated with
> > > dma_alloc_coherent().
> >
> > It makes me wonder if it might be worth teaching it to hold onto those
> > DMA pages somewhere until it needs them for something else and avoid the
> > extra hypercalls; just something to think about.
> >
> > Dave
>
> Following up on this discussion and looking at the hypercall logs and DMA usage scenarios on the SEV, I have the following additional observations and comments :
>
> It is mostly the Guest MMIO regions setup as un-encrypted by uefi/edk2 initially, which will be the "static" nodes in the backing data structure for page encryption status.
> These will be like 15-20 nodes/entries.
>
> Drivers doing DMA allocations using GFP_ATOMIC will be fetching DMA buffers from the pre-allocated unencrypted atomic pool, hence it will be a "static" node added at kernel startup.
>
> As we see with the logs, almost all runtime C-bit transitions and corresponding hypercalls will be from DMA I/O and dma_alloc_coherent/dma_free_coherent calls, these will be
> using 4K/single pages and mostly fragmented ranges, so if we use a "rbtree" based interval tree then there will be a lot of tree insertions and deletions
> (dma_alloc_coherent followed with a dma_free_coherent), so this will lead to a lot of expensive tree rotations and re-balancing, compared to much less complex
> and faster linked list node insertions and deletions (if we use a list based structure to represent these interval ranges).
>
> Also as the static nodes in the structure will be quite limited (all the above DMA I/O added ranges will simply be inserted and removed), so a linked list lookup
> won't be too expensive compared to a tree lookup. In other words, this be a fixed size list.
>
> Looking at the above, I am now more inclined to use a list based structure to represent the page encryption status.
>
> Looking fwd. to any comments/feedback/thoughts on the above.
>
> Thanks,
> Ashish
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2021-01-07  1:01                         ` Steve Rutherford
@ 2021-01-07  1:34                           ` Ashish Kalra
  2021-01-07  8:05                             ` Ashish Kalra
  2021-01-07 17:07                           ` Ashish Kalra
  1 sibling, 1 reply; 38+ messages in thread
From: Ashish Kalra @ 2021-01-07  1:34 UTC (permalink / raw)
  To: Steve Rutherford
  Cc: Dr. David Alan Gilbert, Singh, Brijesh, Sean Christopherson,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh

Hello Steve,

My thoughts here ...

On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> Avoiding an rbtree for such a small (but unstable) list seems correct.
> 

I agree.

> For the unencrypted region list strategy, the only questions that I
> have are fairly secondary.
> - How should the kernel upper bound the size of the list in the face
> of malicious guests, but still support large guests? (Something
> similar to the size provided in the bitmap API would work).

Please note that in our current implementation, we don't do any bitmap
resize based on guest page encryption status hypercall, the max. bitmap
size is computed dynamically by walking the KVM memslots and finding the
maximum guest physical address size.

So, malicious guests cannot do large bitmap allocations using the
hypercalls. 

> - What serialization format should be used for the ioctl API?
> (Usermode could send down a pointer to a user region and a size. The
> kernel could then populate that with an array of structs containing
> bases and limits for unencrypted regions.)
> - How will the kernel tag a guest as having exceeded its maximum list
> size, in order to indicate that the list is now incomplete? (Track a
> poison bit, and send it up when getting the serialized list of
> regions).
> 

With reference to the serialization concerns with active live
migration and simultaneous page encryption bitmap updates, please note
that in our current VMM implementation, after each memory region
migration cycle we pause live migration, re-sync the page encryption
bitmap, XOR it to the last synced bitmap, and then re-transfer any
modified pages accordingly. 

I have a prototype implementation for this in Qemu, which seems to
work fine.

> In my view, there are two main competitors to this strategy:
> - (Existing) Bitmap API
> - A guest memory donation based model
> 
> The existing bitmap API avoids any issues with growing too large,
> since it's size is predictable.
> 

Yes, as i mentioned above, it's size is predictable and cannot grow too
large. 

> To elaborate on the memory donation based model, the guest could put
> an encryption status data structure into unencrypted guest memory, and
> then use a hypercall to inform the host where the base of that
> structure is located. The main advantage of this is that it side steps
> any issues around malicious guests causing large allocations.
> 
> The unencrypted region list seems very practical. It's biggest
> advantage over the bitmap is how cheap it will be to pass the
> structure up from the kernel. A memory donation based model could
> achieve similar performance, but with some additional complexity.
> 
> Does anyone view the memory donation model as worth the complexity?
> Does anyone think the simplicity of the bitmap is a better tradeoff
> compared to an unencrypted region list?

One advantage in sticking with the bitmap is that it maps very nicely to
the dirty bitmap page tracking logic in KVM/Qemu. The way Brijesh
designed and implemented it is very similar to dirty page bitmap tracking
and syncing between KVM and Qemu. The same logic is re-used for the page
encryption bitmap which means quite mininal changes and code resuse in
Qemu. 

Any changes to the backing data structure, will require additional
mapping logic to be added to Qemu.

This is one advantage in keeping the bitmap logic.

Thanks,
Ashish

> Or have other ideas that are not mentioned here?
> 
> 
> On Wed, Jan 6, 2021 at 3:06 PM Ashish Kalra <ashish.kalra@amd.com> wrote:
> >
> > On Fri, Dec 18, 2020 at 07:56:41PM +0000, Dr. David Alan Gilbert wrote:
> > > * Kalra, Ashish (Ashish.Kalra@amd.com) wrote:
> > > > Hello Dave,
> > > >
> > > > On Dec 18, 2020, at 1:40 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > > >
> > > > * Ashish Kalra (ashish.kalra@amd.com) wrote:
> > > > On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote:
> > > > Hello All,
> > > >
> > > > On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote:
> > > >
> > > > On 12/7/20 9:09 PM, Steve Rutherford wrote:
> > > > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > On Sun, Dec 06, 2020, Paolo Bonzini wrote:
> > > > On 03/12/20 01:34, Sean Christopherson wrote:
> > > > On Tue, Dec 01, 2020, Ashish Kalra wrote:
> > > > From: Brijesh Singh <brijesh.singh@amd.com>
> > > >
> > > > KVM hypercall framework relies on alternative framework to patch the
> > > > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> > > > apply_alternative() is called then it defaults to VMCALL. The approach
> > > > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> > > > will be able to decode the instruction and do the right things. But
> > > > when SEV is active, guest memory is encrypted with guest key and
> > > > hypervisor will not be able to decode the instruction bytes.
> > > >
> > > > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> > > > will be used by the SEV guest to notify encrypted pages to the hypervisor.
> > > > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> > > > and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> > > > think there are any existing KVM hypercalls that happen before alternatives are
> > > > patched, i.e. it'll be a nop for sane kernel builds.
> > > >
> > > > I'm also skeptical that a KVM specific hypercall is the right approach for the
> > > > encryption behavior, but I'll take that up in the patches later in the series.
> > > > Do you think that it's the guest that should "donate" memory for the bitmap
> > > > instead?
> > > > No.  Two things I'd like to explore:
> > > >
> > > >  1. Making the hypercall to announce/request private vs. shared common across
> > > >     hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
> > > >     I'm concerned that we'll end up with multiple hypercalls that do more or
> > > >     less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
> > > >     pipe dream, but I'd like to at least explore options before shoving in KVM-
> > > >     only hypercalls.
> > > >
> > > >
> > > >  2. Tracking shared memory via a list of ranges instead of a using bitmap to
> > > >     track all of guest memory.  For most use cases, the vast majority of guest
> > > >     memory will be private, most ranges will be 2mb+, and conversions between
> > > >     private and shared will be uncommon events, i.e. the overhead to walk and
> > > >     split/merge list entries is hopefully not a big concern.  I suspect a list
> > > >     would consume far less memory, hopefully without impacting performance.
> > > > For a fancier data structure, I'd suggest an interval tree. Linux
> > > > already has an rbtree-based interval tree implementation, which would
> > > > likely work, and would probably assuage any performance concerns.
> > > >
> > > > Something like this would not be worth doing unless most of the shared
> > > > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
> > > > 60ish discontiguous shared regions. This is by no means a thorough
> > > > search, but it's suggestive. If this is typical, then the bitmap would
> > > > be far less efficient than most any interval-based data structure.
> > > >
> > > > You'd have to allow userspace to upper bound the number of intervals
> > > > (similar to the maximum bitmap size), to prevent host OOMs due to
> > > > malicious guests. There's something nice about the guest donating
> > > > memory for this, since that would eliminate the OOM risk.
> > > >
> > > >
> > > > Tracking the list of ranges may not be bad idea, especially if we use
> > > > the some kind of rbtree-based data structure to update the ranges. It
> > > > will certainly be better than bitmap which grows based on the guest
> > > > memory size and as you guys see in the practice most of the pages will
> > > > be guest private. I am not sure if guest donating a memory will cover
> > > > all the cases, e.g what if we do a memory hotplug (increase the guest
> > > > ram from 2GB to 64GB), will donated memory range will be enough to store
> > > > the metadata.
> > > >
> > > > .
> > > >
> > > > With reference to internal discussions regarding the above, i am going
> > > > to look into specific items as listed below :
> > > >
> > > > 1). "hypercall" related :
> > > > a). Explore the SEV-SNP page change request structure (included in GHCB),
> > > > see if there is something common there than can be re-used for SEV/SEV-ES
> > > > page encryption status hypercalls.
> > > > b). Explore if there is any common hypercall framework i can use in
> > > > Linux/KVM.
> > > >
> > > > 2). related to the "backing" data structure - explore using a range-based
> > > > list or something like rbtree-based interval tree data structure
> > > > (as mentioned by Steve above) to replace the current bitmap based
> > > > implementation.
> > > >
> > > >
> > > >
> > > > I do agree that a range-based list or an interval tree data structure is a
> > > > really good "logical" fit for the guest page encryption status tracking.
> > > >
> > > > We can only keep track of the guest unencrypted shared pages in the
> > > > range(s) list (which will keep the data structure quite compact) and all
> > > > the guest private/encrypted memory does not really need any tracking in
> > > > the list, anything not in the list will be encrypted/private.
> > > >
> > > > Also looking at a more "practical" use case, here is the current log of
> > > > page encryption status hypercalls when booting a linux guest :
> > > >
> > > > ...
> > > >
> > > > <snip>
> > > >
> > > > [   56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 1
> > > > [   56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> > > > [   56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> > > > [   56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 0
> > > > ....
> > > >
> > > > [   56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 0
> > > > [   56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 0
> > > > [   56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 1
> > > > [   56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 1
> > > >
> > > > ....
> > > > [   56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 0
> > > > [   56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 0
> > > > [   56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 1
> > > > [   56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 1
> > > > ....
> > > >
> > > > [   56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 0
> > > > [   56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 0
> > > > [   56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 1
> > > > [   56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 1
> > > > ....
> > > >
> > > > [   56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 0
> > > > [   56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 0
> > > > [   56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 1
> > > > [   56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 1
> > > > ....
> > > > [   56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> > > > [   56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> > > > [   56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 1
> > > > [   56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 0
> > > > [   56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> > > > [   56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> > > > ....
> > > >
> > > > As can be observed here, all guest MMIO ranges are initially setup as
> > > > shared, and those are all contigious guest page ranges.
> > > >
> > > > After that the encryption status hypercalls are invoked when DMA gets
> > > > triggered during disk i/o while booting the guest ... here again the
> > > > guest page ranges are contigious, though mostly single page is touched
> > > > and a lot of page re-use is observed.
> > > >
> > > > So a range-based list/structure will be a "good" fit for such usage
> > > > scenarios.
> > > >
> > > > It seems surprisingly common to flick the same pages back and forth between
> > > > encrypted and clear for quite a while;  why is this?
> > > >
> > > >
> > > > dma_alloc_coherent()'s will allocate pages and then call
> > > > set_decrypted() on them and then at dma_free_coherent(), set_encrypted()
> > > > is called on the pages to be freed. So these observations in the logs
> > > > where a lot of single 4K pages are seeing C-bit transitions and
> > > > corresponding hypercalls are the ones associated with
> > > > dma_alloc_coherent().
> > >
> > > It makes me wonder if it might be worth teaching it to hold onto those
> > > DMA pages somewhere until it needs them for something else and avoid the
> > > extra hypercalls; just something to think about.
> > >
> > > Dave
> >
> > Following up on this discussion and looking at the hypercall logs and DMA usage scenarios on the SEV, I have the following additional observations and comments :
> >
> > It is mostly the Guest MMIO regions setup as un-encrypted by uefi/edk2 initially, which will be the "static" nodes in the backing data structure for page encryption status.
> > These will be like 15-20 nodes/entries.
> >
> > Drivers doing DMA allocations using GFP_ATOMIC will be fetching DMA buffers from the pre-allocated unencrypted atomic pool, hence it will be a "static" node added at kernel startup.
> >
> > As we see with the logs, almost all runtime C-bit transitions and corresponding hypercalls will be from DMA I/O and dma_alloc_coherent/dma_free_coherent calls, these will be
> > using 4K/single pages and mostly fragmented ranges, so if we use a "rbtree" based interval tree then there will be a lot of tree insertions and deletions
> > (dma_alloc_coherent followed with a dma_free_coherent), so this will lead to a lot of expensive tree rotations and re-balancing, compared to much less complex
> > and faster linked list node insertions and deletions (if we use a list based structure to represent these interval ranges).
> >
> > Also as the static nodes in the structure will be quite limited (all the above DMA I/O added ranges will simply be inserted and removed), so a linked list lookup
> > won't be too expensive compared to a tree lookup. In other words, this be a fixed size list.
> >
> > Looking at the above, I am now more inclined to use a list based structure to represent the page encryption status.
> >
> > Looking fwd. to any comments/feedback/thoughts on the above.
> >
> > Thanks,
> > Ashish
> >

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2021-01-07  1:34                           ` Ashish Kalra
@ 2021-01-07  8:05                             ` Ashish Kalra
  2021-01-08  0:47                               ` Ashish Kalra
  0 siblings, 1 reply; 38+ messages in thread
From: Ashish Kalra @ 2021-01-07  8:05 UTC (permalink / raw)
  To: Steve Rutherford
  Cc: Dr. David Alan Gilbert, Singh, Brijesh, Sean Christopherson,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh, jon.grimm

Hello Steve,

Sorry, i realized later that i replied to this email with regard to the
current bitmap implementation and not the unencrpyted region list
strategy.

I am now looking at your thoughts/questions with regard to the
unencrypted region list strategy and will reply to them accordingly.

Thanks,
Ashish

On Thu, Jan 07, 2021 at 01:34:14AM +0000, Ashish Kalra wrote:
> Hello Steve,
> 
> My thoughts here ...
> 
> On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> > Avoiding an rbtree for such a small (but unstable) list seems correct.
> > 
> 
> I agree.
> 
> > For the unencrypted region list strategy, the only questions that I
> > have are fairly secondary.
> > - How should the kernel upper bound the size of the list in the face
> > of malicious guests, but still support large guests? (Something
> > similar to the size provided in the bitmap API would work).
> 
> Please note that in our current implementation, we don't do any bitmap
> resize based on guest page encryption status hypercall, the max. bitmap
> size is computed dynamically by walking the KVM memslots and finding the
> maximum guest physical address size.
> 
> So, malicious guests cannot do large bitmap allocations using the
> hypercalls. 
> 
> > - What serialization format should be used for the ioctl API?
> > (Usermode could send down a pointer to a user region and a size. The
> > kernel could then populate that with an array of structs containing
> > bases and limits for unencrypted regions.)
> > - How will the kernel tag a guest as having exceeded its maximum list
> > size, in order to indicate that the list is now incomplete? (Track a
> > poison bit, and send it up when getting the serialized list of
> > regions).
> > 
> 
> With reference to the serialization concerns with active live
> migration and simultaneous page encryption bitmap updates, please note
> that in our current VMM implementation, after each memory region
> migration cycle we pause live migration, re-sync the page encryption
> bitmap, XOR it to the last synced bitmap, and then re-transfer any
> modified pages accordingly. 
> 
> I have a prototype implementation for this in Qemu, which seems to
> work fine.
> 
> > In my view, there are two main competitors to this strategy:
> > - (Existing) Bitmap API
> > - A guest memory donation based model
> > 
> > The existing bitmap API avoids any issues with growing too large,
> > since it's size is predictable.
> > 
> 
> Yes, as i mentioned above, it's size is predictable and cannot grow too
> large. 
> 
> > To elaborate on the memory donation based model, the guest could put
> > an encryption status data structure into unencrypted guest memory, and
> > then use a hypercall to inform the host where the base of that
> > structure is located. The main advantage of this is that it side steps
> > any issues around malicious guests causing large allocations.
> > 
> > The unencrypted region list seems very practical. It's biggest
> > advantage over the bitmap is how cheap it will be to pass the
> > structure up from the kernel. A memory donation based model could
> > achieve similar performance, but with some additional complexity.
> > 
> > Does anyone view the memory donation model as worth the complexity?
> > Does anyone think the simplicity of the bitmap is a better tradeoff
> > compared to an unencrypted region list?
> 
> One advantage in sticking with the bitmap is that it maps very nicely to
> the dirty bitmap page tracking logic in KVM/Qemu. The way Brijesh
> designed and implemented it is very similar to dirty page bitmap tracking
> and syncing between KVM and Qemu. The same logic is re-used for the page
> encryption bitmap which means quite mininal changes and code resuse in
> Qemu. 
> 
> Any changes to the backing data structure, will require additional
> mapping logic to be added to Qemu.
> 
> This is one advantage in keeping the bitmap logic.
> 
> Thanks,
> Ashish
> 
> > Or have other ideas that are not mentioned here?
> > 
> > 
> > On Wed, Jan 6, 2021 at 3:06 PM Ashish Kalra <ashish.kalra@amd.com> wrote:
> > >
> > > On Fri, Dec 18, 2020 at 07:56:41PM +0000, Dr. David Alan Gilbert wrote:
> > > > * Kalra, Ashish (Ashish.Kalra@amd.com) wrote:
> > > > > Hello Dave,
> > > > >
> > > > > On Dec 18, 2020, at 1:40 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > > > >
> > > > > * Ashish Kalra (ashish.kalra@amd.com) wrote:
> > > > > On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote:
> > > > > Hello All,
> > > > >
> > > > > On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote:
> > > > >
> > > > > On 12/7/20 9:09 PM, Steve Rutherford wrote:
> > > > > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > On Sun, Dec 06, 2020, Paolo Bonzini wrote:
> > > > > On 03/12/20 01:34, Sean Christopherson wrote:
> > > > > On Tue, Dec 01, 2020, Ashish Kalra wrote:
> > > > > From: Brijesh Singh <brijesh.singh@amd.com>
> > > > >
> > > > > KVM hypercall framework relies on alternative framework to patch the
> > > > > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> > > > > apply_alternative() is called then it defaults to VMCALL. The approach
> > > > > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> > > > > will be able to decode the instruction and do the right things. But
> > > > > when SEV is active, guest memory is encrypted with guest key and
> > > > > hypervisor will not be able to decode the instruction bytes.
> > > > >
> > > > > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> > > > > will be used by the SEV guest to notify encrypted pages to the hypervisor.
> > > > > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> > > > > and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> > > > > think there are any existing KVM hypercalls that happen before alternatives are
> > > > > patched, i.e. it'll be a nop for sane kernel builds.
> > > > >
> > > > > I'm also skeptical that a KVM specific hypercall is the right approach for the
> > > > > encryption behavior, but I'll take that up in the patches later in the series.
> > > > > Do you think that it's the guest that should "donate" memory for the bitmap
> > > > > instead?
> > > > > No.  Two things I'd like to explore:
> > > > >
> > > > >  1. Making the hypercall to announce/request private vs. shared common across
> > > > >     hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
> > > > >     I'm concerned that we'll end up with multiple hypercalls that do more or
> > > > >     less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
> > > > >     pipe dream, but I'd like to at least explore options before shoving in KVM-
> > > > >     only hypercalls.
> > > > >
> > > > >
> > > > >  2. Tracking shared memory via a list of ranges instead of a using bitmap to
> > > > >     track all of guest memory.  For most use cases, the vast majority of guest
> > > > >     memory will be private, most ranges will be 2mb+, and conversions between
> > > > >     private and shared will be uncommon events, i.e. the overhead to walk and
> > > > >     split/merge list entries is hopefully not a big concern.  I suspect a list
> > > > >     would consume far less memory, hopefully without impacting performance.
> > > > > For a fancier data structure, I'd suggest an interval tree. Linux
> > > > > already has an rbtree-based interval tree implementation, which would
> > > > > likely work, and would probably assuage any performance concerns.
> > > > >
> > > > > Something like this would not be worth doing unless most of the shared
> > > > > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
> > > > > 60ish discontiguous shared regions. This is by no means a thorough
> > > > > search, but it's suggestive. If this is typical, then the bitmap would
> > > > > be far less efficient than most any interval-based data structure.
> > > > >
> > > > > You'd have to allow userspace to upper bound the number of intervals
> > > > > (similar to the maximum bitmap size), to prevent host OOMs due to
> > > > > malicious guests. There's something nice about the guest donating
> > > > > memory for this, since that would eliminate the OOM risk.
> > > > >
> > > > >
> > > > > Tracking the list of ranges may not be bad idea, especially if we use
> > > > > the some kind of rbtree-based data structure to update the ranges. It
> > > > > will certainly be better than bitmap which grows based on the guest
> > > > > memory size and as you guys see in the practice most of the pages will
> > > > > be guest private. I am not sure if guest donating a memory will cover
> > > > > all the cases, e.g what if we do a memory hotplug (increase the guest
> > > > > ram from 2GB to 64GB), will donated memory range will be enough to store
> > > > > the metadata.
> > > > >
> > > > > .
> > > > >
> > > > > With reference to internal discussions regarding the above, i am going
> > > > > to look into specific items as listed below :
> > > > >
> > > > > 1). "hypercall" related :
> > > > > a). Explore the SEV-SNP page change request structure (included in GHCB),
> > > > > see if there is something common there than can be re-used for SEV/SEV-ES
> > > > > page encryption status hypercalls.
> > > > > b). Explore if there is any common hypercall framework i can use in
> > > > > Linux/KVM.
> > > > >
> > > > > 2). related to the "backing" data structure - explore using a range-based
> > > > > list or something like rbtree-based interval tree data structure
> > > > > (as mentioned by Steve above) to replace the current bitmap based
> > > > > implementation.
> > > > >
> > > > >
> > > > >
> > > > > I do agree that a range-based list or an interval tree data structure is a
> > > > > really good "logical" fit for the guest page encryption status tracking.
> > > > >
> > > > > We can only keep track of the guest unencrypted shared pages in the
> > > > > range(s) list (which will keep the data structure quite compact) and all
> > > > > the guest private/encrypted memory does not really need any tracking in
> > > > > the list, anything not in the list will be encrypted/private.
> > > > >
> > > > > Also looking at a more "practical" use case, here is the current log of
> > > > > page encryption status hypercalls when booting a linux guest :
> > > > >
> > > > > ...
> > > > >
> > > > > <snip>
> > > > >
> > > > > [   56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 1
> > > > > [   56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> > > > > [   56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> > > > > [   56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 0
> > > > > ....
> > > > >
> > > > > [   56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 0
> > > > > [   56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 0
> > > > > [   56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 1
> > > > > [   56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 1
> > > > >
> > > > > ....
> > > > > [   56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 0
> > > > > [   56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 0
> > > > > [   56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 1
> > > > > [   56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 1
> > > > > ....
> > > > >
> > > > > [   56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 0
> > > > > [   56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 0
> > > > > [   56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 1
> > > > > [   56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 1
> > > > > ....
> > > > >
> > > > > [   56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 0
> > > > > [   56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 0
> > > > > [   56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 1
> > > > > [   56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 1
> > > > > ....
> > > > > [   56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> > > > > [   56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> > > > > [   56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 1
> > > > > [   56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 0
> > > > > [   56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> > > > > [   56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> > > > > ....
> > > > >
> > > > > As can be observed here, all guest MMIO ranges are initially setup as
> > > > > shared, and those are all contigious guest page ranges.
> > > > >
> > > > > After that the encryption status hypercalls are invoked when DMA gets
> > > > > triggered during disk i/o while booting the guest ... here again the
> > > > > guest page ranges are contigious, though mostly single page is touched
> > > > > and a lot of page re-use is observed.
> > > > >
> > > > > So a range-based list/structure will be a "good" fit for such usage
> > > > > scenarios.
> > > > >
> > > > > It seems surprisingly common to flick the same pages back and forth between
> > > > > encrypted and clear for quite a while;  why is this?
> > > > >
> > > > >
> > > > > dma_alloc_coherent()'s will allocate pages and then call
> > > > > set_decrypted() on them and then at dma_free_coherent(), set_encrypted()
> > > > > is called on the pages to be freed. So these observations in the logs
> > > > > where a lot of single 4K pages are seeing C-bit transitions and
> > > > > corresponding hypercalls are the ones associated with
> > > > > dma_alloc_coherent().
> > > >
> > > > It makes me wonder if it might be worth teaching it to hold onto those
> > > > DMA pages somewhere until it needs them for something else and avoid the
> > > > extra hypercalls; just something to think about.
> > > >
> > > > Dave
> > >
> > > Following up on this discussion and looking at the hypercall logs and DMA usage scenarios on the SEV, I have the following additional observations and comments :
> > >
> > > It is mostly the Guest MMIO regions setup as un-encrypted by uefi/edk2 initially, which will be the "static" nodes in the backing data structure for page encryption status.
> > > These will be like 15-20 nodes/entries.
> > >
> > > Drivers doing DMA allocations using GFP_ATOMIC will be fetching DMA buffers from the pre-allocated unencrypted atomic pool, hence it will be a "static" node added at kernel startup.
> > >
> > > As we see with the logs, almost all runtime C-bit transitions and corresponding hypercalls will be from DMA I/O and dma_alloc_coherent/dma_free_coherent calls, these will be
> > > using 4K/single pages and mostly fragmented ranges, so if we use a "rbtree" based interval tree then there will be a lot of tree insertions and deletions
> > > (dma_alloc_coherent followed with a dma_free_coherent), so this will lead to a lot of expensive tree rotations and re-balancing, compared to much less complex
> > > and faster linked list node insertions and deletions (if we use a list based structure to represent these interval ranges).
> > >
> > > Also as the static nodes in the structure will be quite limited (all the above DMA I/O added ranges will simply be inserted and removed), so a linked list lookup
> > > won't be too expensive compared to a tree lookup. In other words, this be a fixed size list.
> > >
> > > Looking at the above, I am now more inclined to use a list based structure to represent the page encryption status.
> > >
> > > Looking fwd. to any comments/feedback/thoughts on the above.
> > >
> > > Thanks,
> > > Ashish
> > >

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2021-01-07  1:01                         ` Steve Rutherford
  2021-01-07  1:34                           ` Ashish Kalra
@ 2021-01-07 17:07                           ` Ashish Kalra
  2021-01-07 17:26                             ` Sean Christopherson
  1 sibling, 1 reply; 38+ messages in thread
From: Ashish Kalra @ 2021-01-07 17:07 UTC (permalink / raw)
  To: Steve Rutherford
  Cc: Dr. David Alan Gilbert, Singh, Brijesh, Sean Christopherson,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh, jon.grimm

Hello Steve,

On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> Avoiding an rbtree for such a small (but unstable) list seems correct.
> 
> For the unencrypted region list strategy, the only questions that I
> have are fairly secondary.
> - How should the kernel upper bound the size of the list in the face
> of malicious guests, but still support large guests? (Something
> similar to the size provided in the bitmap API would work).

I am thinking of another scenario, where a malicious guest can make
infinite/repetetive hypercalls and DOS attack the host. 

But probably this is a more generic issue, this can be done by any guest
and under any hypervisor, i don't know what kind of mitigations exist
for such a scenario ?

Potentially, the guest memory donation model can handle such an attack,
because in this model, the hypervisor will expect only one hypercall,
any repetetive hypercalls can make the hypervisor disable the guest ?

Thanks,
Ashish

> - What serialization format should be used for the ioctl API?
> (Usermode could send down a pointer to a user region and a size. The
> kernel could then populate that with an array of structs containing
> bases and limits for unencrypted regions.)
> - How will the kernel tag a guest as having exceeded its maximum list
> size, in order to indicate that the list is now incomplete? (Track a
> poison bit, and send it up when getting the serialized list of
> regions).
> 
> In my view, there are two main competitors to this strategy:
> - (Existing) Bitmap API
> - A guest memory donation based model
> 
> The existing bitmap API avoids any issues with growing too large,
> since it's size is predictable.
> 
> To elaborate on the memory donation based model, the guest could put
> an encryption status data structure into unencrypted guest memory, and
> then use a hypercall to inform the host where the base of that
> structure is located. The main advantage of this is that it side steps
> any issues around malicious guests causing large allocations.
> 
> The unencrypted region list seems very practical. It's biggest
> advantage over the bitmap is how cheap it will be to pass the
> structure up from the kernel. A memory donation based model could
> achieve similar performance, but with some additional complexity.
> 
> Does anyone view the memory donation model as worth the complexity?
> Does anyone think the simplicity of the bitmap is a better tradeoff
> compared to an unencrypted region list?
> Or have other ideas that are not mentioned here?
> 
> 
> On Wed, Jan 6, 2021 at 3:06 PM Ashish Kalra <ashish.kalra@amd.com> wrote:
> >
> > On Fri, Dec 18, 2020 at 07:56:41PM +0000, Dr. David Alan Gilbert wrote:
> > > * Kalra, Ashish (Ashish.Kalra@amd.com) wrote:
> > > > Hello Dave,
> > > >
> > > > On Dec 18, 2020, at 1:40 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > > >
> > > > * Ashish Kalra (ashish.kalra@amd.com) wrote:
> > > > On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote:
> > > > Hello All,
> > > >
> > > > On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote:
> > > >
> > > > On 12/7/20 9:09 PM, Steve Rutherford wrote:
> > > > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > On Sun, Dec 06, 2020, Paolo Bonzini wrote:
> > > > On 03/12/20 01:34, Sean Christopherson wrote:
> > > > On Tue, Dec 01, 2020, Ashish Kalra wrote:
> > > > From: Brijesh Singh <brijesh.singh@amd.com>
> > > >
> > > > KVM hypercall framework relies on alternative framework to patch the
> > > > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
> > > > apply_alternative() is called then it defaults to VMCALL. The approach
> > > > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
> > > > will be able to decode the instruction and do the right things. But
> > > > when SEV is active, guest memory is encrypted with guest key and
> > > > hypervisor will not be able to decode the instruction bytes.
> > > >
> > > > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
> > > > will be used by the SEV guest to notify encrypted pages to the hypervisor.
> > > > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
> > > > and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
> > > > think there are any existing KVM hypercalls that happen before alternatives are
> > > > patched, i.e. it'll be a nop for sane kernel builds.
> > > >
> > > > I'm also skeptical that a KVM specific hypercall is the right approach for the
> > > > encryption behavior, but I'll take that up in the patches later in the series.
> > > > Do you think that it's the guest that should "donate" memory for the bitmap
> > > > instead?
> > > > No.  Two things I'd like to explore:
> > > >
> > > >  1. Making the hypercall to announce/request private vs. shared common across
> > > >     hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
> > > >     I'm concerned that we'll end up with multiple hypercalls that do more or
> > > >     less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
> > > >     pipe dream, but I'd like to at least explore options before shoving in KVM-
> > > >     only hypercalls.
> > > >
> > > >
> > > >  2. Tracking shared memory via a list of ranges instead of a using bitmap to
> > > >     track all of guest memory.  For most use cases, the vast majority of guest
> > > >     memory will be private, most ranges will be 2mb+, and conversions between
> > > >     private and shared will be uncommon events, i.e. the overhead to walk and
> > > >     split/merge list entries is hopefully not a big concern.  I suspect a list
> > > >     would consume far less memory, hopefully without impacting performance.
> > > > For a fancier data structure, I'd suggest an interval tree. Linux
> > > > already has an rbtree-based interval tree implementation, which would
> > > > likely work, and would probably assuage any performance concerns.
> > > >
> > > > Something like this would not be worth doing unless most of the shared
> > > > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
> > > > 60ish discontiguous shared regions. This is by no means a thorough
> > > > search, but it's suggestive. If this is typical, then the bitmap would
> > > > be far less efficient than most any interval-based data structure.
> > > >
> > > > You'd have to allow userspace to upper bound the number of intervals
> > > > (similar to the maximum bitmap size), to prevent host OOMs due to
> > > > malicious guests. There's something nice about the guest donating
> > > > memory for this, since that would eliminate the OOM risk.
> > > >
> > > >
> > > > Tracking the list of ranges may not be bad idea, especially if we use
> > > > the some kind of rbtree-based data structure to update the ranges. It
> > > > will certainly be better than bitmap which grows based on the guest
> > > > memory size and as you guys see in the practice most of the pages will
> > > > be guest private. I am not sure if guest donating a memory will cover
> > > > all the cases, e.g what if we do a memory hotplug (increase the guest
> > > > ram from 2GB to 64GB), will donated memory range will be enough to store
> > > > the metadata.
> > > >
> > > > .
> > > >
> > > > With reference to internal discussions regarding the above, i am going
> > > > to look into specific items as listed below :
> > > >
> > > > 1). "hypercall" related :
> > > > a). Explore the SEV-SNP page change request structure (included in GHCB),
> > > > see if there is something common there than can be re-used for SEV/SEV-ES
> > > > page encryption status hypercalls.
> > > > b). Explore if there is any common hypercall framework i can use in
> > > > Linux/KVM.
> > > >
> > > > 2). related to the "backing" data structure - explore using a range-based
> > > > list or something like rbtree-based interval tree data structure
> > > > (as mentioned by Steve above) to replace the current bitmap based
> > > > implementation.
> > > >
> > > >
> > > >
> > > > I do agree that a range-based list or an interval tree data structure is a
> > > > really good "logical" fit for the guest page encryption status tracking.
> > > >
> > > > We can only keep track of the guest unencrypted shared pages in the
> > > > range(s) list (which will keep the data structure quite compact) and all
> > > > the guest private/encrypted memory does not really need any tracking in
> > > > the list, anything not in the list will be encrypted/private.
> > > >
> > > > Also looking at a more "practical" use case, here is the current log of
> > > > page encryption status hypercalls when booting a linux guest :
> > > >
> > > > ...
> > > >
> > > > <snip>
> > > >
> > > > [   56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 1
> > > > [   56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> > > > [   56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages  = 1, enc = 0
> > > > [   56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages  = 1, enc = 0
> > > > ....
> > > >
> > > > [   56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 0
> > > > [   56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 0
> > > > [   56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages  = 1, enc = 1
> > > > [   56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages  = 1, enc = 1
> > > >
> > > > ....
> > > > [   56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 0
> > > > [   56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 0
> > > > [   56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages  = 1, enc = 1
> > > > [   56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages  = 1, enc = 1
> > > > ....
> > > >
> > > > [   56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 0
> > > > [   56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 0
> > > > [   56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages  = 1, enc = 1
> > > > [   56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages  = 1, enc = 1
> > > > ....
> > > >
> > > > [   56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 0
> > > > [   56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 0
> > > > [   56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages  = 1, enc = 1
> > > > [   56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages  = 1, enc = 1
> > > > ....
> > > > [   56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> > > > [   56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> > > > [   56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 1
> > > > [   56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 0
> > > > [   56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages  = 100, enc = 0
> > > > [   56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages  = 1, enc = 1
> > > > ....
> > > >
> > > > As can be observed here, all guest MMIO ranges are initially setup as
> > > > shared, and those are all contigious guest page ranges.
> > > >
> > > > After that the encryption status hypercalls are invoked when DMA gets
> > > > triggered during disk i/o while booting the guest ... here again the
> > > > guest page ranges are contigious, though mostly single page is touched
> > > > and a lot of page re-use is observed.
> > > >
> > > > So a range-based list/structure will be a "good" fit for such usage
> > > > scenarios.
> > > >
> > > > It seems surprisingly common to flick the same pages back and forth between
> > > > encrypted and clear for quite a while;  why is this?
> > > >
> > > >
> > > > dma_alloc_coherent()'s will allocate pages and then call
> > > > set_decrypted() on them and then at dma_free_coherent(), set_encrypted()
> > > > is called on the pages to be freed. So these observations in the logs
> > > > where a lot of single 4K pages are seeing C-bit transitions and
> > > > corresponding hypercalls are the ones associated with
> > > > dma_alloc_coherent().
> > >
> > > It makes me wonder if it might be worth teaching it to hold onto those
> > > DMA pages somewhere until it needs them for something else and avoid the
> > > extra hypercalls; just something to think about.
> > >
> > > Dave
> >
> > Following up on this discussion and looking at the hypercall logs and DMA usage scenarios on the SEV, I have the following additional observations and comments :
> >
> > It is mostly the Guest MMIO regions setup as un-encrypted by uefi/edk2 initially, which will be the "static" nodes in the backing data structure for page encryption status.
> > These will be like 15-20 nodes/entries.
> >
> > Drivers doing DMA allocations using GFP_ATOMIC will be fetching DMA buffers from the pre-allocated unencrypted atomic pool, hence it will be a "static" node added at kernel startup.
> >
> > As we see with the logs, almost all runtime C-bit transitions and corresponding hypercalls will be from DMA I/O and dma_alloc_coherent/dma_free_coherent calls, these will be
> > using 4K/single pages and mostly fragmented ranges, so if we use a "rbtree" based interval tree then there will be a lot of tree insertions and deletions
> > (dma_alloc_coherent followed with a dma_free_coherent), so this will lead to a lot of expensive tree rotations and re-balancing, compared to much less complex
> > and faster linked list node insertions and deletions (if we use a list based structure to represent these interval ranges).
> >
> > Also as the static nodes in the structure will be quite limited (all the above DMA I/O added ranges will simply be inserted and removed), so a linked list lookup
> > won't be too expensive compared to a tree lookup. In other words, this be a fixed size list.
> >
> > Looking at the above, I am now more inclined to use a list based structure to represent the page encryption status.
> >
> > Looking fwd. to any comments/feedback/thoughts on the above.
> >
> > Thanks,
> > Ashish
> >

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2021-01-07 17:07                           ` Ashish Kalra
@ 2021-01-07 17:26                             ` Sean Christopherson
  2021-01-07 18:41                               ` Ashish Kalra
  0 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2021-01-07 17:26 UTC (permalink / raw)
  To: Ashish Kalra
  Cc: Steve Rutherford, Dr. David Alan Gilbert, Singh, Brijesh,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh, jon.grimm

On Thu, Jan 07, 2021, Ashish Kalra wrote:
> Hello Steve,
> 
> On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> > Avoiding an rbtree for such a small (but unstable) list seems correct.
> > 
> > For the unencrypted region list strategy, the only questions that I
> > have are fairly secondary.
> > - How should the kernel upper bound the size of the list in the face
> > of malicious guests, but still support large guests? (Something
> > similar to the size provided in the bitmap API would work).
> 
> I am thinking of another scenario, where a malicious guest can make
> infinite/repetetive hypercalls and DOS attack the host. 
> 
> But probably this is a more generic issue, this can be done by any guest
> and under any hypervisor, i don't know what kind of mitigations exist
> for such a scenario ?
> 
> Potentially, the guest memory donation model can handle such an attack,
> because in this model, the hypervisor will expect only one hypercall,
> any repetetive hypercalls can make the hypervisor disable the guest ?

KVM doesn't need to explicitly bound its tracking structures, it just needs to
use GFP_KERNEL_ACCOUNT when allocating kernel memory for the structures so that
the memory will be accounted to the task/process/VM.  Shadow MMU pages are the
only exception that comes to mind; they're still accounted properly, but KVM
also explicitly limits them for a variety of reasons.

The size of the list will naturally be bounded by the size of the guest; and
assuming KVM proactively merges adjancent regions, that upper bound is probably
reasonably low compared to other allocations, e.g. the aforementioned MMU pages.

And, using a list means a malicious guest will get automatically throttled as
the latency of walking the list (to merge/delete existing entries) will increase
with the size of the list.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2021-01-07 17:26                             ` Sean Christopherson
@ 2021-01-07 18:41                               ` Ashish Kalra
  2021-01-07 19:22                                 ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: Ashish Kalra @ 2021-01-07 18:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Steve Rutherford, Dr. David Alan Gilbert, Singh, Brijesh,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh, jon.grimm

On Thu, Jan 07, 2021 at 09:26:25AM -0800, Sean Christopherson wrote:
> On Thu, Jan 07, 2021, Ashish Kalra wrote:
> > Hello Steve,
> > 
> > On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> > > Avoiding an rbtree for such a small (but unstable) list seems correct.
> > > 
> > > For the unencrypted region list strategy, the only questions that I
> > > have are fairly secondary.
> > > - How should the kernel upper bound the size of the list in the face
> > > of malicious guests, but still support large guests? (Something
> > > similar to the size provided in the bitmap API would work).
> > 
> > I am thinking of another scenario, where a malicious guest can make
> > infinite/repetetive hypercalls and DOS attack the host. 
> > 
> > But probably this is a more generic issue, this can be done by any guest
> > and under any hypervisor, i don't know what kind of mitigations exist
> > for such a scenario ?
> > 
> > Potentially, the guest memory donation model can handle such an attack,
> > because in this model, the hypervisor will expect only one hypercall,
> > any repetetive hypercalls can make the hypervisor disable the guest ?
> 
> KVM doesn't need to explicitly bound its tracking structures, it just needs to
> use GFP_KERNEL_ACCOUNT when allocating kernel memory for the structures so that
> the memory will be accounted to the task/process/VM.  Shadow MMU pages are the
> only exception that comes to mind; they're still accounted properly, but KVM
> also explicitly limits them for a variety of reasons.
> 
> The size of the list will naturally be bounded by the size of the guest; and
> assuming KVM proactively merges adjancent regions, that upper bound is probably
> reasonably low compared to other allocations, e.g. the aforementioned MMU pages.
> 
> And, using a list means a malicious guest will get automatically throttled as
> the latency of walking the list (to merge/delete existing entries) will increase
> with the size of the list.

Just to add here, potentially there won't be any proactive
merging/deletion of existing entries, as the only static entries will be
initial guest MMIO regions, which are contigious guest PA ranges but not
necessarily adjacent. 

After that, as discussed before, almost all entries will be due to 
DMA I/O with respect to dma_alloc_coherent/dma_free_coherent, and all
these entries will be temporary as these DMA buffers are marked
un-encrypted and immediately marked encrypted as soon as DMA I/O is
completed, so it makes no sense to do merging of temporary entries
that will be deleted from the list immediately after being added to it.

Thanks,
Ashish

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2021-01-07 18:41                               ` Ashish Kalra
@ 2021-01-07 19:22                                 ` Sean Christopherson
  2021-01-08  0:54                                   ` Steve Rutherford
  0 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2021-01-07 19:22 UTC (permalink / raw)
  To: Ashish Kalra
  Cc: Steve Rutherford, Dr. David Alan Gilbert, Singh, Brijesh,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh, jon.grimm

On Thu, Jan 07, 2021, Ashish Kalra wrote:
> On Thu, Jan 07, 2021 at 09:26:25AM -0800, Sean Christopherson wrote:
> > On Thu, Jan 07, 2021, Ashish Kalra wrote:
> > > Hello Steve,
> > > 
> > > On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> > > > Avoiding an rbtree for such a small (but unstable) list seems correct.
> > > > 
> > > > For the unencrypted region list strategy, the only questions that I
> > > > have are fairly secondary.
> > > > - How should the kernel upper bound the size of the list in the face
> > > > of malicious guests, but still support large guests? (Something
> > > > similar to the size provided in the bitmap API would work).
> > > 
> > > I am thinking of another scenario, where a malicious guest can make
> > > infinite/repetetive hypercalls and DOS attack the host. 
> > > 
> > > But probably this is a more generic issue, this can be done by any guest
> > > and under any hypervisor, i don't know what kind of mitigations exist
> > > for such a scenario ?
> > > 
> > > Potentially, the guest memory donation model can handle such an attack,
> > > because in this model, the hypervisor will expect only one hypercall,
> > > any repetetive hypercalls can make the hypervisor disable the guest ?
> > 
> > KVM doesn't need to explicitly bound its tracking structures, it just needs to
> > use GFP_KERNEL_ACCOUNT when allocating kernel memory for the structures so that
> > the memory will be accounted to the task/process/VM.  Shadow MMU pages are the
> > only exception that comes to mind; they're still accounted properly, but KVM
> > also explicitly limits them for a variety of reasons.
> > 
> > The size of the list will naturally be bounded by the size of the guest; and
> > assuming KVM proactively merges adjancent regions, that upper bound is probably
> > reasonably low compared to other allocations, e.g. the aforementioned MMU pages.
> > 
> > And, using a list means a malicious guest will get automatically throttled as
> > the latency of walking the list (to merge/delete existing entries) will increase
> > with the size of the list.
> 
> Just to add here, potentially there won't be any proactive
> merging/deletion of existing entries, as the only static entries will be
> initial guest MMIO regions, which are contigious guest PA ranges but not
> necessarily adjacent. 

My point was that, if the guest is malicious, eventually there will be adjacent
entries, e.g. the worst case scenario is that the encrypted status changes on
every 4k page.  Anyways, not really all that important, I mostly thinking out
loud :-)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2021-01-07  8:05                             ` Ashish Kalra
@ 2021-01-08  0:47                               ` Ashish Kalra
  2021-01-08  0:55                                 ` Steve Rutherford
  0 siblings, 1 reply; 38+ messages in thread
From: Ashish Kalra @ 2021-01-08  0:47 UTC (permalink / raw)
  To: Steve Rutherford
  Cc: Dr. David Alan Gilbert, Singh, Brijesh, Sean Christopherson,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh, jon.grimm

> On Thu, Jan 07, 2021 at 01:34:14AM +0000, Ashish Kalra wrote:
> > Hello Steve,
> > 
> > My thoughts here ...
> > 
> > On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> > > Avoiding an rbtree for such a small (but unstable) list seems correct.
> > > 
> > 
> > I agree.
> > 
> > > For the unencrypted region list strategy, the only questions that I
> > > have are fairly secondary.
> > > - How should the kernel upper bound the size of the list in the face
> > > of malicious guests, but still support large guests? (Something
> > > similar to the size provided in the bitmap API would work).
> > > - What serialization format should be used for the ioctl API?
> > > (Usermode could send down a pointer to a user region and a size. The
> > > kernel could then populate that with an array of structs containing
> > > bases and limits for unencrypted regions.)
> > > - How will the kernel tag a guest as having exceeded its maximum list
> > > size, in order to indicate that the list is now incomplete? (Track a
> > > poison bit, and send it up when getting the serialized list of
> > > regions).
> > > 
> > > In my view, there are two main competitors to this strategy:
> > > - (Existing) Bitmap API
> > > - A guest memory donation based model
> > > 
> > > The existing bitmap API avoids any issues with growing too large,
> > > since it's size is predictable.
> > > 
> > > To elaborate on the memory donation based model, the guest could put
> > > an encryption status data structure into unencrypted guest memory, and
> > > then use a hypercall to inform the host where the base of that
> > > structure is located. The main advantage of this is that it side steps
> > > any issues around malicious guests causing large allocations.
> > > 
> > > The unencrypted region list seems very practical. It's biggest
> > > advantage over the bitmap is how cheap it will be to pass the
> > > structure up from the kernel. A memory donation based model could
> > > achieve similar performance, but with some additional complexity.
> > > 
> > > Does anyone view the memory donation model as worth the complexity?
> > > Does anyone think the simplicity of the bitmap is a better tradeoff
> > > compared to an unencrypted region list?
> > 
> > One advantage in sticking with the bitmap is that it maps very nicely to
> > the dirty bitmap page tracking logic in KVM/Qemu. The way Brijesh
> > designed and implemented it is very similar to dirty page bitmap tracking
> > and syncing between KVM and Qemu. The same logic is re-used for the page
> > encryption bitmap which means quite mininal changes and code resuse in
> > Qemu. 
> > 
> > Any changes to the backing data structure, will require additional
> > mapping logic to be added to Qemu.
> > 
> > This is one advantage in keeping the bitmap logic.
> > 

So if nobody is in favor of keeping the (current) bitmap logic, we will
move to the unencrypted region list approach.

Thanks,
Ashish

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2021-01-07 19:22                                 ` Sean Christopherson
@ 2021-01-08  0:54                                   ` Steve Rutherford
  2021-01-08 16:56                                     ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: Steve Rutherford @ 2021-01-08  0:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ashish Kalra, Dr. David Alan Gilbert, Singh, Brijesh,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh, Grimm, Jon

Supporting merging of consecutive entries (or not) is less important
to get right since it doesn't change any of the APIs. If someone runs
into performance issues, they can loop back and fix this then. I'm
slightly concerned with the behavior for overlapping regions. I also
have slight concerns with how we handle re-encrypting small chunks of
larger unencrypted regions. I don't think we've seen these in
practice, but nothing precludes them afaik.

On Thu, Jan 7, 2021 at 11:23 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 07, 2021, Ashish Kalra wrote:
> > On Thu, Jan 07, 2021 at 09:26:25AM -0800, Sean Christopherson wrote:
> > > On Thu, Jan 07, 2021, Ashish Kalra wrote:
> > > > Hello Steve,
> > > >
> > > > On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> > > > > Avoiding an rbtree for such a small (but unstable) list seems correct.
> > > > >
> > > > > For the unencrypted region list strategy, the only questions that I
> > > > > have are fairly secondary.
> > > > > - How should the kernel upper bound the size of the list in the face
> > > > > of malicious guests, but still support large guests? (Something
> > > > > similar to the size provided in the bitmap API would work).
> > > >
> > > > I am thinking of another scenario, where a malicious guest can make
> > > > infinite/repetetive hypercalls and DOS attack the host.
> > > >
> > > > But probably this is a more generic issue, this can be done by any guest
> > > > and under any hypervisor, i don't know what kind of mitigations exist
> > > > for such a scenario ?
> > > >
> > > > Potentially, the guest memory donation model can handle such an attack,
> > > > because in this model, the hypervisor will expect only one hypercall,
> > > > any repetetive hypercalls can make the hypervisor disable the guest ?
> > >
> > > KVM doesn't need to explicitly bound its tracking structures, it just needs to
> > > use GFP_KERNEL_ACCOUNT when allocating kernel memory for the structures so that
> > > the memory will be accounted to the task/process/VM.  Shadow MMU pages are the
> > > only exception that comes to mind; they're still accounted properly, but KVM
> > > also explicitly limits them for a variety of reasons.
> > >
> > > The size of the list will naturally be bounded by the size of the guest; and
> > > assuming KVM proactively merges adjancent regions, that upper bound is probably
> > > reasonably low compared to other allocations, e.g. the aforementioned MMU pages.
> > >
> > > And, using a list means a malicious guest will get automatically throttled as
> > > the latency of walking the list (to merge/delete existing entries) will increase
> > > with the size of the list.
> >
> > Just to add here, potentially there won't be any proactive
> > merging/deletion of existing entries, as the only static entries will be
> > initial guest MMIO regions, which are contigious guest PA ranges but not
> > necessarily adjacent.
>
> My point was that, if the guest is malicious, eventually there will be adjacent
> entries, e.g. the worst case scenario is that the encrypted status changes on
> every 4k page.  Anyways, not really all that important, I mostly thinking out
> loud :-)

Agreed. Tagging this with GFP_KERNEL_ACCOUNT means we don't need to
upper bound the number of pages. I now don't think there is any
unusual DoS potential here. Perhaps, if the guest tries really hard to
make a massive list, they could get a softlockup on the host. Not sure
how important that is to fix.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2021-01-08  0:47                               ` Ashish Kalra
@ 2021-01-08  0:55                                 ` Steve Rutherford
  0 siblings, 0 replies; 38+ messages in thread
From: Steve Rutherford @ 2021-01-08  0:55 UTC (permalink / raw)
  To: Ashish Kalra
  Cc: Dr. David Alan Gilbert, Singh, Brijesh, Sean Christopherson,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh, Grimm, Jon

On Thu, Jan 7, 2021 at 4:48 PM Ashish Kalra <ashish.kalra@amd.com> wrote:
>
> > On Thu, Jan 07, 2021 at 01:34:14AM +0000, Ashish Kalra wrote:
> > > Hello Steve,
> > >
> > > My thoughts here ...
> > >
> > > On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> > > > Avoiding an rbtree for such a small (but unstable) list seems correct.
> > > >
> > >
> > > I agree.
> > >
> > > > For the unencrypted region list strategy, the only questions that I
> > > > have are fairly secondary.
> > > > - How should the kernel upper bound the size of the list in the face
> > > > of malicious guests, but still support large guests? (Something
> > > > similar to the size provided in the bitmap API would work).
> > > > - What serialization format should be used for the ioctl API?
> > > > (Usermode could send down a pointer to a user region and a size. The
> > > > kernel could then populate that with an array of structs containing
> > > > bases and limits for unencrypted regions.)
> > > > - How will the kernel tag a guest as having exceeded its maximum list
> > > > size, in order to indicate that the list is now incomplete? (Track a
> > > > poison bit, and send it up when getting the serialized list of
> > > > regions).
> > > >
> > > > In my view, there are two main competitors to this strategy:
> > > > - (Existing) Bitmap API
> > > > - A guest memory donation based model
> > > >
> > > > The existing bitmap API avoids any issues with growing too large,
> > > > since it's size is predictable.
> > > >
> > > > To elaborate on the memory donation based model, the guest could put
> > > > an encryption status data structure into unencrypted guest memory, and
> > > > then use a hypercall to inform the host where the base of that
> > > > structure is located. The main advantage of this is that it side steps
> > > > any issues around malicious guests causing large allocations.
> > > >
> > > > The unencrypted region list seems very practical. It's biggest
> > > > advantage over the bitmap is how cheap it will be to pass the
> > > > structure up from the kernel. A memory donation based model could
> > > > achieve similar performance, but with some additional complexity.
> > > >
> > > > Does anyone view the memory donation model as worth the complexity?
> > > > Does anyone think the simplicity of the bitmap is a better tradeoff
> > > > compared to an unencrypted region list?
> > >
> > > One advantage in sticking with the bitmap is that it maps very nicely to
> > > the dirty bitmap page tracking logic in KVM/Qemu. The way Brijesh
> > > designed and implemented it is very similar to dirty page bitmap tracking
> > > and syncing between KVM and Qemu. The same logic is re-used for the page
> > > encryption bitmap which means quite mininal changes and code resuse in
> > > Qemu.
> > >
> > > Any changes to the backing data structure, will require additional
> > > mapping logic to be added to Qemu.
> > >
> > > This is one advantage in keeping the bitmap logic.
> > >
>
> So if nobody is in favor of keeping the (current) bitmap logic, we will
> move to the unencrypted region list approach.
Sounds good to me.

>
> Thanks,
> Ashish

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
  2021-01-08  0:54                                   ` Steve Rutherford
@ 2021-01-08 16:56                                     ` Sean Christopherson
  0 siblings, 0 replies; 38+ messages in thread
From: Sean Christopherson @ 2021-01-08 16:56 UTC (permalink / raw)
  To: Steve Rutherford
  Cc: Ashish Kalra, Dr. David Alan Gilbert, Singh, Brijesh,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Joerg Roedel, Borislav Petkov, Lendacky, Thomas, X86 ML,
	KVM list, LKML, dovmurik, tobin, jejb, frankeh, Grimm, Jon

On Thu, Jan 07, 2021, Steve Rutherford wrote:
> Supporting merging of consecutive entries (or not) is less important
> to get right since it doesn't change any of the APIs. If someone runs
> into performance issues, they can loop back and fix this then. I'm
> slightly concerned with the behavior for overlapping regions. 

I'm assuming merging entries will be a near-trivial effort once everything else
is implemented, e.g. KVM will need to traverse the list to remove entries when
address are converted back to private.  Piling on merging functionality at that
point should not be all that hard, especially if the list is sorted and entries
are merged on insertion.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3
@ 2020-12-08  5:18 Kalra, Ashish
  0 siblings, 0 replies; 38+ messages in thread
From: Kalra, Ashish @ 2020-12-08  5:18 UTC (permalink / raw)
  To: Steve Rutherford
  Cc: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Joerg Roedel, Borislav Petkov, Lendacky, Thomas,
	X86 ML, KVM list, LKML, Singh, Brijesh, dovmurik, tobin, jejb,
	frankeh, dgilbert


> 
>> I suspect a list
>> would consume far less memory, hopefully without impacting performance.

And how much host memory are we talking about for here, say for a 4gb guest, the bitmap will be using just using something like 128k+.

Thanks,
Ashish

> On Dec 7, 2020, at 10:16 PM, Kalra, Ashish <Ashish.Kalra@amd.com> wrote:
> 
> I don’t think that the bitmap by itself is really a performance bottleneck here.
> 
> Thanks,
> Ashish
> 
>>> On Dec 7, 2020, at 9:10 PM, Steve Rutherford <srutherford@google.com> wrote:
>>> On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote:
>>>> On Sun, Dec 06, 2020, Paolo Bonzini wrote:
>>>> On 03/12/20 01:34, Sean Christopherson wrote:
>>>>> On Tue, Dec 01, 2020, Ashish Kalra wrote:
>>>>>> From: Brijesh Singh <brijesh.singh@amd.com>
>>>>>> KVM hypercall framework relies on alternative framework to patch the
>>>>>> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
>>>>>> apply_alternative() is called then it defaults to VMCALL. The approach
>>>>>> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
>>>>>> will be able to decode the instruction and do the right things. But
>>>>>> when SEV is active, guest memory is encrypted with guest key and
>>>>>> hypervisor will not be able to decode the instruction bytes.
>>>>>> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
>>>>>> will be used by the SEV guest to notify encrypted pages to the hypervisor.
>>>>> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL
>>>>> and opt into VMCALL?  It's a synthetic feature flag either way, and I don't
>>>>> think there are any existing KVM hypercalls that happen before alternatives are
>>>>> patched, i.e. it'll be a nop for sane kernel builds.
>>>>> I'm also skeptical that a KVM specific hypercall is the right approach for the
>>>>> encryption behavior, but I'll take that up in the patches later in the series.
>>>> Do you think that it's the guest that should "donate" memory for the bitmap
>>>> instead?
>>> No.  Two things I'd like to explore:
>>> 1. Making the hypercall to announce/request private vs. shared common across
>>>   hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX).
>>>   I'm concerned that we'll end up with multiple hypercalls that do more or
>>>   less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc...  Maybe it's a
>>>   pipe dream, but I'd like to at least explore options before shoving in KVM-
>>>   only hypercalls.
>>> 2. Tracking shared memory via a list of ranges instead of a using bitmap to
>>>   track all of guest memory.  For most use cases, the vast majority of guest
>>>   memory will be private, most ranges will be 2mb+, and conversions between
>>>   private and shared will be uncommon events, i.e. the overhead to walk and
>>>   split/merge list entries is hopefully not a big concern.  I suspect a list
>>>   would consume far less memory, hopefully without impacting performance.
>> For a fancier data structure, I'd suggest an interval tree. Linux
>> already has an rbtree-based interval tree implementation, which would
>> likely work, and would probably assuage any performance concerns.
>> Something like this would not be worth doing unless most of the shared
>> pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had
>> 60ish discontiguous shared regions. This is by no means a thorough
>> search, but it's suggestive. If this is typical, then the bitmap would
>> be far less efficient than most any interval-based data structure.
>> You'd have to allow userspace to upper bound the number of intervals
>> (similar to the maximum bitmap size), to prevent host OOMs due to
>> malicious guests. There's something nice about the guest donating
>> memory for this, since that would eliminate the OOM risk.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2021-01-08 16:57 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-01  0:45 [PATCH v2 0/9] Add AMD SEV page encryption bitmap support Ashish Kalra
2020-12-01  0:45 ` [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3 Ashish Kalra
2020-12-03  0:34   ` Sean Christopherson
2020-12-04 17:16     ` Brijesh Singh
2020-12-06 10:26     ` Paolo Bonzini
2020-12-07 20:41       ` Sean Christopherson
2020-12-08  3:09         ` Steve Rutherford
2020-12-08  4:16           ` Kalra, Ashish
2020-12-08 16:29           ` Brijesh Singh
2020-12-11 22:55             ` Ashish Kalra
     [not found]               ` <20201212045603.GA27415@ashkalra_ubuntu_server>
2020-12-18 19:39                 ` Dr. David Alan Gilbert
     [not found]                   ` <E79E09A2-F314-4B59-B7AE-07B1D422DF2B@amd.com>
2020-12-18 19:56                     ` Dr. David Alan Gilbert
2021-01-06 23:05                       ` Ashish Kalra
2021-01-07  1:01                         ` Steve Rutherford
2021-01-07  1:34                           ` Ashish Kalra
2021-01-07  8:05                             ` Ashish Kalra
2021-01-08  0:47                               ` Ashish Kalra
2021-01-08  0:55                                 ` Steve Rutherford
2021-01-07 17:07                           ` Ashish Kalra
2021-01-07 17:26                             ` Sean Christopherson
2021-01-07 18:41                               ` Ashish Kalra
2021-01-07 19:22                                 ` Sean Christopherson
2021-01-08  0:54                                   ` Steve Rutherford
2021-01-08 16:56                                     ` Sean Christopherson
2020-12-01  0:46 ` [PATCH v2 2/9] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall Ashish Kalra
2020-12-02 16:54   ` Dr. David Alan Gilbert
2020-12-02 21:22     ` Ashish Kalra
2020-12-06 10:25       ` Paolo Bonzini
2020-12-01  0:47 ` [PATCH v2 3/9] KVM: x86: Introduce KVM_GET_PAGE_ENC_BITMAP ioctl Ashish Kalra
2020-12-06 11:02   ` Dov Murik
2020-12-07 22:00     ` Ashish Kalra
2020-12-01  0:47 ` [PATCH v2 4/9] mm: x86: Invoke hypercall when page encryption status is changed Ashish Kalra
2020-12-01  0:47 ` [PATCH v2 5/9] KVM: x86: Introduce KVM_SET_PAGE_ENC_BITMAP ioctl Ashish Kalra
2020-12-01  0:47 ` [PATCH v2 6/9] KVM: SVM: Add support for static allocation of unified Page Encryption Bitmap Ashish Kalra
2020-12-01  0:48 ` [PATCH v2 7/9] KVM: x86: Mark _bss_decrypted section variables as decrypted in page encryption bitmap Ashish Kalra
2020-12-01  0:48 ` [PATCH v2 8/9] KVM: x86: Add kexec support for SEV " Ashish Kalra
2020-12-01  0:48 ` [PATCH v2 9/9] KVM: SVM: Bypass DBG_DECRYPT API calls for unecrypted guest memory Ashish Kalra
2020-12-08  5:18 [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3 Kalra, Ashish

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).