* [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
@ 2022-03-08  4:38 Nikunj A Dadhania
  2022-03-08  4:38 ` [PATCH RFC v1 1/9] KVM: Introduce pinning flag to hva_to_pfn* Nikunj A Dadhania
                   ` (9 more replies)
  0 siblings, 10 replies; 29+ messages in thread
From: Nikunj A Dadhania @ 2022-03-08  4:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel, Nikunj A Dadhania

This is a follow-up to the RFC implementation [1] that incorporates
review feedback and bug fixes. See the "RFC v1" section of the
changelog below for the list of changes.

An SEV guest requires its pages to be pinned in host physical memory,
as migration of encrypted pages is not supported. The memory
encryption scheme uses the physical address of the memory being
encrypted. If the host moves guest pages, the content decrypted in
the guest would be incorrect, corrupting the guest's memory.

For SEV/SEV-ES guests, the hypervisor doesn't know which pages are
encrypted or when the guest is done using those pages. The hypervisor
should therefore treat all guest pages as encrypted until they are
deallocated or the guest is destroyed.

While provisioning a pfn, make KVM aware that guest pages need to be
pinned long-term and use the appropriate pin_user_pages API for these
special encrypted memory regions. KVM takes the first reference and
holds it until a mapping is done. An extra reference is taken before
KVM releases the pfn.
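
In rough terms, the slow path of hva_to_pfn() then becomes (sketch
only; the real change is in patch 1, and use_pin is derived from the
memslot):

    if (use_pin)
            npages = pin_user_pages_fast(addr, 1, FOLL_WRITE | FOLL_LONGTERM, &page);
    else
            npages = get_user_pages_unlocked(addr, 1, &page, FOLL_WRITE | FOLL_HWPOISON);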

Actual pinning management is handled by vendor code via new
kvm_x86_ops hooks. The MMU calls into vendor code to pin a page on
demand. Pinning metadata is stored in the architecture-specific
memslot area, and guest pages are unpinned during the memslot freeing
and deallocation paths.
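
In pseudo-code, the common fault path roughly does the following
(simplified sketch; see patches 3 and 5 for the actual hooks):

    /* direct_page_fault(), simplified */
    kvm_mmu_hugepage_adjust(vcpu, fault);
    if (memslot_is_encrypted(fault->slot) && !kvm_pin_pfn(vcpu, fault))
            goto out_release;       /* vendor hook could not pin the pfn */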

Guest boot time comparison:
+---------------+----------------+-------------------+
| Guest Memory  |   baseline     |  Demand Pinning + |
| Size (GB)     | v5.17-rc6(secs)| v5.17-rc6(secs)   |
+---------------+----------------+-------------------+
|      4        |     6.16       |      5.71         |
+---------------+----------------+-------------------+
|     16        |     7.38       |      5.91         |
+---------------+----------------+-------------------+
|     64        |    12.17       |      6.16         |
+---------------+----------------+-------------------+
|    128        |    18.20       |      6.50         |
+---------------+----------------+-------------------+
|    192        |    24.56       |      6.80         |
+---------------+----------------+-------------------+


Changelog:
RFC v1:
* Use the pin_user_pages API with the FOLL_LONGTERM flag for pinning
  encrypted guest pages. [David Hildenbrand]
* Use the new kvm_for_each_memslot_in_hva_range() API to walk memslots.
  [Maciej S. Szmigiero]
* Maintain non-MMU-pinned memory and free it on destruction.
  [Peter Gonda]
* Handle non-MMU-pinned memory for intra-host migration. [Peter Gonda]
* Add the missing RLIMIT_MEMLOCK check. [David Hildenbrand]
* Use the pin_user_pages API for long-term pinning of pages.
  [David Hildenbrand]
* Flush the page before releasing it to the host system.
  [Mingwei Zhang]

[1] https://lore.kernel.org/kvm/20220118110621.62462-1-nikunj@amd.com/

Nikunj A Dadhania (7):
  KVM: Introduce pinning flag to hva_to_pfn*
  KVM: x86/mmu: Move hugepage adjust to direct_page_fault
  KVM: x86/mmu: Add hook to pin PFNs on demand in MMU
  KVM: SVM: Add pinning metadata in the arch memslot
  KVM: SVM: Implement demand page pinning
  KVM: SEV: Carve out routine for allocation of pages
  KVM: Move kvm_for_each_memslot_in_hva_range() to be used in SVM

Sean Christopherson (2):
  KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV/TDX
  KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()

 arch/x86/include/asm/kvm-x86-ops.h |   3 +
 arch/x86/include/asm/kvm_host.h    |  10 +
 arch/x86/kvm/mmu.h                 |   3 +
 arch/x86/kvm/mmu/mmu.c             |  57 +++-
 arch/x86/kvm/mmu/tdp_mmu.c         |   2 -
 arch/x86/kvm/svm/sev.c             | 531 ++++++++++++++++++++++-------
 arch/x86/kvm/svm/svm.c             |   4 +
 arch/x86/kvm/svm/svm.h             |  12 +-
 arch/x86/kvm/x86.c                 |  11 +-
 include/linux/kvm_host.h           |  12 +
 virt/kvm/kvm_main.c                |  69 ++--
 virt/kvm/kvm_mm.h                  |   2 +-
 virt/kvm/pfncache.c                |   2 +-
 13 files changed, 556 insertions(+), 162 deletions(-)

-- 
2.32.0



* [PATCH RFC v1 1/9] KVM: Introduce pinning flag to hva_to_pfn*
  2022-03-08  4:38 [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Nikunj A Dadhania
@ 2022-03-08  4:38 ` Nikunj A Dadhania
  2022-03-08  4:38 ` [PATCH RFC v1 2/9] KVM: x86/mmu: Move hugepage adjust to direct_page_fault Nikunj A Dadhania
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Nikunj A Dadhania @ 2022-03-08  4:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel, Nikunj A Dadhania

KVM currently gets references to guest pages with get_user_pages_*()
(which use FOLL_GET). For long-term pinning of guest pages,
pin_user_pages_*() (which use FOLL_PIN) need to be used instead. Add
a flag to hva_to_pfn*() to pin the pages when the memslot represents
encrypted memory.
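
Pages taken with FOLL_PIN must be released with unpin_user_page()
instead of put_page(); a sketch of the release side, as done in
kvm_release_pfn_clean() below:

    struct page *page = pfn_to_page(pfn);

    if (page_maybe_dma_pinned(page))
            unpin_user_page(page);
    else
            put_page(page);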

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 include/linux/kvm_host.h |  6 ++++
 virt/kvm/kvm_main.c      | 63 ++++++++++++++++++++++++++++++----------
 virt/kvm/kvm_mm.h        |  2 +-
 virt/kvm/pfncache.c      |  2 +-
 4 files changed, 56 insertions(+), 17 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f11039944c08..c23022960d51 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -55,6 +55,7 @@
  * include/linux/kvm_h.
  */
 #define KVM_MEMSLOT_INVALID	(1UL << 16)
+#define KVM_MEMSLOT_ENCRYPTED   (1UL << 17)
 
 /*
  * Bit 63 of the memslot generation number is an "update in-progress flag",
@@ -583,6 +584,11 @@ static inline unsigned long *kvm_second_dirty_bitmap(struct kvm_memory_slot *mem
 	return memslot->dirty_bitmap + len / sizeof(*memslot->dirty_bitmap);
 }
 
+static inline bool memslot_is_encrypted(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEMSLOT_ENCRYPTED);
+}
+
 #ifndef KVM_DIRTY_LOG_MANUAL_CAPS
 #define KVM_DIRTY_LOG_MANUAL_CAPS KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0afc016cc54d..c035fe6b39ec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2381,9 +2381,10 @@ static inline int check_user_page_hwpoison(unsigned long addr)
  * only part that runs if we can in atomic context.
  */
 static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
-			    bool *writable, kvm_pfn_t *pfn)
+			    bool *writable, kvm_pfn_t *pfn, bool use_pin)
 {
 	struct page *page[1];
+	bool ret;
 
 	/*
 	 * Fast pin a writable pfn only if it is a write fault request
@@ -2393,7 +2394,12 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 	if (!(write_fault || writable))
 		return false;
 
-	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+	if (!use_pin)
+		ret = get_user_page_fast_only(addr, FOLL_WRITE, page);
+	else
+		ret = pin_user_pages_fast_only(addr, 1, FOLL_WRITE | FOLL_LONGTERM, page);
+
+	if (ret) {
 		*pfn = page_to_pfn(page[0]);
 
 		if (writable)
@@ -2409,9 +2415,9 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
  * 1 indicates success, -errno is returned if error is detected.
  */
 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
-			   bool *writable, kvm_pfn_t *pfn)
+			   bool *writable, kvm_pfn_t *pfn, bool use_pin)
 {
-	unsigned int flags = FOLL_HWPOISON;
+	unsigned int flags = 0;
 	struct page *page;
 	int npages = 0;
 
@@ -2422,20 +2428,41 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 
 	if (write_fault)
 		flags |= FOLL_WRITE;
-	if (async)
-		flags |= FOLL_NOWAIT;
 
-	npages = get_user_pages_unlocked(addr, 1, &page, flags);
+	if (!use_pin) {
+		flags |= FOLL_HWPOISON;
+		if (async)
+			flags |= FOLL_NOWAIT;
+
+		npages = get_user_pages_unlocked(addr, 1, &page, flags);
+	} else {
+		/*
+		 * FOLL_LONGTERM is not supported in pin_user_pages_unlocked,
+		 * use *_fast instead.
+		 */
+		flags |= FOLL_LONGTERM;
+		npages = pin_user_pages_fast(addr, 1, flags, &page);
+	}
+
 	if (npages != 1)
 		return npages;
 
 	/* map read fault as writable if possible */
 	if (unlikely(!write_fault) && writable) {
 		struct page *wpage;
+		bool ret;
+
+		if (!use_pin)
+			ret = get_user_page_fast_only(addr, FOLL_WRITE, &wpage);
+		else
+			ret = pin_user_pages_fast_only(addr, 1, FOLL_WRITE | FOLL_LONGTERM, &wpage);
 
-		if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
+		if (ret) {
 			*writable = true;
-			put_page(page);
+			if (!use_pin)
+				put_page(page);
+			else
+				unpin_user_page(page);
 			page = wpage;
 		}
 	}
@@ -2541,7 +2568,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
  *     whether the mapping is writable.
  */
 kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
-		     bool write_fault, bool *writable)
+		     bool write_fault, bool *writable, bool use_pin)
 {
 	struct vm_area_struct *vma;
 	kvm_pfn_t pfn = 0;
@@ -2550,13 +2577,13 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 	/* we can do it either atomically or asynchronously, not both */
 	BUG_ON(atomic && async);
 
-	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
+	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn, use_pin))
 		return pfn;
 
 	if (atomic)
 		return KVM_PFN_ERR_FAULT;
 
-	npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfn);
+	npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfn, use_pin);
 	if (npages == 1)
 		return pfn;
 
@@ -2616,7 +2643,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 	}
 
 	return hva_to_pfn(addr, atomic, async, write_fault,
-			  writable);
+			  writable, memslot_is_encrypted(slot));
 }
 EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 
@@ -2788,8 +2815,14 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
 
 void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
-	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
-		put_page(pfn_to_page(pfn));
+	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (page_maybe_dma_pinned(page))
+			unpin_user_page(page);
+		else
+			put_page(page);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
 
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 34ca40823260..b1a5e379949b 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -25,7 +25,7 @@
 #endif /* KVM_HAVE_MMU_RWLOCK */
 
 kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
-		     bool write_fault, bool *writable);
+		     bool write_fault, bool *writable, bool use_pin);
 
 #ifdef CONFIG_HAVE_KVM_PFNCACHE
 void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index ce878f4be4da..44384f06c81b 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -135,7 +135,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, unsigned long uhva)
 		smp_rmb();
 
 		/* We always request a writeable mapping */
-		new_pfn = hva_to_pfn(uhva, false, NULL, true, NULL);
+		new_pfn = hva_to_pfn(uhva, false, NULL, true, NULL, false);
 		if (is_error_noslot_pfn(new_pfn))
 			break;
 
-- 
2.32.0



* [PATCH RFC v1 2/9] KVM: x86/mmu: Move hugepage adjust to direct_page_fault
  2022-03-08  4:38 [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Nikunj A Dadhania
  2022-03-08  4:38 ` [PATCH RFC v1 1/9] KVM: Introduce pinning flag to hva_to_pfn* Nikunj A Dadhania
@ 2022-03-08  4:38 ` Nikunj A Dadhania
  2022-03-28 21:04   ` Sean Christopherson
  2022-03-08  4:38 ` [PATCH RFC v1 3/9] KVM: x86/mmu: Add hook to pin PFNs on demand in MMU Nikunj A Dadhania
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Nikunj A Dadhania @ 2022-03-08  4:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel, Nikunj A Dadhania

Both the TDP MMU and the legacy MMU do the hugepage adjustment in
their mapping routines. Do the adjustment earlier, in the common
direct_page_fault() code, so that the following patches can use the
adjusted level when pinning pages.

No functional change intended.

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/kvm/mmu/mmu.c     | 4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8e24f73bf60b..db1feecd6fed 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2940,8 +2940,6 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	int ret;
 	gfn_t base_gfn = fault->gfn;
 
-	kvm_mmu_hugepage_adjust(vcpu, fault);
-
 	trace_kvm_mmu_spte_requested(fault);
 	for_each_shadow_entry(vcpu, fault->addr, it) {
 		/*
@@ -4035,6 +4033,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 	r = RET_PF_RETRY;
 
+	kvm_mmu_hugepage_adjust(vcpu, fault);
+
 	if (is_tdp_mmu_fault)
 		read_lock(&vcpu->kvm->mmu_lock);
 	else
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bc9e3553fba2..e03bf59b2f81 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -959,8 +959,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	u64 new_spte;
 	int ret;
 
-	kvm_mmu_hugepage_adjust(vcpu, fault);
-
 	trace_kvm_mmu_spte_requested(fault);
 
 	rcu_read_lock();
-- 
2.32.0



* [PATCH RFC v1 3/9] KVM: x86/mmu: Add hook to pin PFNs on demand in MMU
  2022-03-08  4:38 [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Nikunj A Dadhania
  2022-03-08  4:38 ` [PATCH RFC v1 1/9] KVM: Introduce pinning flag to hva_to_pfn* Nikunj A Dadhania
  2022-03-08  4:38 ` [PATCH RFC v1 2/9] KVM: x86/mmu: Move hugepage adjust to direct_page_fault Nikunj A Dadhania
@ 2022-03-08  4:38 ` Nikunj A Dadhania
  2022-03-08  4:38 ` [PATCH RFC v1 4/9] KVM: SVM: Add pinning metadata in the arch memslot Nikunj A Dadhania
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Nikunj A Dadhania @ 2022-03-08  4:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel, Nikunj A Dadhania

Let vendor code pin PFNs on demand: add a pin_pfn() kvm_x86_ops hook
and call it from the common MMU fault path for encrypted memslots.

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  3 +++
 arch/x86/kvm/mmu/mmu.c             | 15 +++++++++++++++
 3 files changed, 19 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index d39e0de06be2..8efb43d92eef 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -88,6 +88,7 @@ KVM_X86_OP(set_tss_addr)
 KVM_X86_OP(set_identity_map_addr)
 KVM_X86_OP(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP(pin_pfn)
 KVM_X86_OP_NULL(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ec9830d2aabf..df11f1fb76de 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1418,6 +1418,9 @@ struct kvm_x86_ops {
 
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
+	bool (*pin_pfn)(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+			kvm_pfn_t pfn, hva_t hva, bool write,
+			enum pg_level level);
 
 	bool (*has_wbinvd_exit)(void);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index db1feecd6fed..b94e5e71653e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4001,6 +4001,16 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 	       mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
 }
 
+static bool kvm_pin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+	if (is_error_noslot_pfn(fault->pfn) || kvm_is_reserved_pfn(fault->pfn) ||
+	    !kvm_x86_ops.pin_pfn)
+		return true;
+
+	return kvm_x86_ops.pin_pfn(vcpu, fault->slot, fault->pfn, fault->hva,
+				   fault->write, fault->goal_level);
+}
+
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
@@ -4035,6 +4045,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
 
+	if (memslot_is_encrypted(fault->slot) && !kvm_pin_pfn(vcpu, fault))
+		goto out_release;
+
 	if (is_tdp_mmu_fault)
 		read_lock(&vcpu->kvm->mmu_lock);
 	else
@@ -4057,6 +4070,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		read_unlock(&vcpu->kvm->mmu_lock);
 	else
 		write_unlock(&vcpu->kvm->mmu_lock);
+
+out_release:
 	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
-- 
2.32.0



* [PATCH RFC v1 4/9] KVM: SVM: Add pinning metadata in the arch memslot
  2022-03-08  4:38 [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Nikunj A Dadhania
                   ` (2 preceding siblings ...)
  2022-03-08  4:38 ` [PATCH RFC v1 3/9] KVM: x86/mmu: Add hook to pin PFNs on demand in MMU Nikunj A Dadhania
@ 2022-03-08  4:38 ` Nikunj A Dadhania
  2022-03-08  4:38 ` [PATCH RFC v1 5/9] KVM: SVM: Implement demand page pinning Nikunj A Dadhania
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Nikunj A Dadhania @ 2022-03-08  4:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel, Nikunj A Dadhania

An AMD SEV guest requires its pages to be pinned in host physical
memory. The memory encryption scheme uses the physical address of the
memory being encrypted. If guest pages are moved, the content
decrypted in the guest would be incorrect, corrupting the guest's
memory.

For SEV/SEV-ES guests, the hypervisor doesn't know which pages are
encrypted or when the guest is done using those pages. The hypervisor
should therefore treat all guest pages as encrypted until they are
deallocated or the guest is destroyed.

The KVM MMU needs to track which pages are pinned and the
corresponding pfns so that they can be unpinned during the guest
destroy and memslot deallocation paths.
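
The tracking is a bitmap plus a pfn array, both indexed by the gfn's
offset within the slot; illustrative sketch only (the accessors are
added in a later patch):

    rel_gfn = gfn - slot->base_gfn;
    set_bit(rel_gfn, slot->arch.pinned_bitmap);
    slot->arch.pfns[rel_gfn] = pfn;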

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  2 ++
 arch/x86/include/asm/kvm_host.h    |  7 +++++
 arch/x86/kvm/svm/sev.c             | 49 ++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c             |  3 ++
 arch/x86/kvm/svm/svm.h             |  6 ++++
 arch/x86/kvm/x86.c                 | 11 ++++++-
 6 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8efb43d92eef..61ff8a636db6 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -89,6 +89,8 @@ KVM_X86_OP(set_identity_map_addr)
 KVM_X86_OP(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
 KVM_X86_OP(pin_pfn)
+KVM_X86_OP(alloc_memslot_metadata)
+KVM_X86_OP(free_memslot)
 KVM_X86_OP_NULL(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index df11f1fb76de..eeb2c799b59f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -926,6 +926,8 @@ struct kvm_arch_memory_slot {
 	struct kvm_rmap_head *rmap[KVM_NR_PAGE_SIZES];
 	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 	unsigned short *gfn_track[KVM_PAGE_TRACK_MAX];
+	unsigned long *pinned_bitmap;
+	kvm_pfn_t *pfns;
 };
 
 /*
@@ -1421,6 +1423,11 @@ struct kvm_x86_ops {
 	bool (*pin_pfn)(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 			kvm_pfn_t pfn, hva_t hva, bool write,
 			enum pg_level level);
+	int (*alloc_memslot_metadata)(struct kvm *kvm,
+				      const struct kvm_memory_slot *old,
+				      struct kvm_memory_slot *new);
+	void (*free_memslot)(struct kvm *kvm,
+			     struct kvm_memory_slot *slot);
 
 	bool (*has_wbinvd_exit)(void);
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 17b53457d866..bd7572517c99 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2950,3 +2950,52 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
 
 	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
 }
+
+void sev_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	struct kvm_arch_memory_slot *aslot = &slot->arch;
+
+	if (!sev_guest(kvm))
+		return;
+
+	if (aslot->pinned_bitmap) {
+		kvfree(aslot->pinned_bitmap);
+		aslot->pinned_bitmap = NULL;
+	}
+
+	if (aslot->pfns) {
+		kvfree(aslot->pfns);
+		aslot->pfns = NULL;
+	}
+}
+
+int sev_alloc_memslot_metadata(struct kvm *kvm,
+			       const struct kvm_memory_slot *old,
+			       struct kvm_memory_slot *new)
+{
+	struct kvm_arch_memory_slot *aslot = &new->arch;
+	unsigned long pinned_bytes = new->npages * sizeof(kvm_pfn_t);
+
+	if (!sev_guest(kvm))
+		return 0;
+
+	if (old && old->arch.pinned_bitmap && old->arch.pfns) {
+		WARN_ON(old->npages != new->npages);
+		aslot->pinned_bitmap = old->arch.pinned_bitmap;
+		aslot->pfns = old->arch.pfns;
+		return 0;
+	}
+
+	aslot->pfns = kvcalloc(new->npages, sizeof(*aslot->pfns),
+			      GFP_KERNEL_ACCOUNT);
+	if (!aslot->pfns)
+		return -ENOMEM;
+
+	aslot->pinned_bitmap = kvzalloc(pinned_bytes, GFP_KERNEL_ACCOUNT);
+	if (!aslot->pinned_bitmap) {
+		kvfree(aslot->pfns);
+		aslot->pfns = NULL;
+		return -ENOMEM;
+	}
+	return 0;
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index fd3a00c892c7..ec06421cb532 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4658,6 +4658,9 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.complete_emulated_msr = svm_complete_emulated_msr,
 
 	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
+
+	.alloc_memslot_metadata = sev_alloc_memslot_metadata,
+	.free_memslot = sev_free_memslot,
 };
 
 /*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index fa98d6844728..f00364020d7e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -616,4 +616,10 @@ void sev_es_unmap_ghcb(struct vcpu_svm *svm);
 void __svm_sev_es_vcpu_run(unsigned long vmcb_pa);
 void __svm_vcpu_run(unsigned long vmcb_pa, unsigned long *regs);
 
+int sev_alloc_memslot_metadata(struct kvm *kvm,
+			       const struct kvm_memory_slot *old,
+			       struct kvm_memory_slot *new);
+void sev_free_memslot(struct kvm *kvm,
+		      struct kvm_memory_slot *slot);
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 82a9dcd8c67f..95070aaa1636 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11796,6 +11796,7 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 	}
 
 	kvm_page_track_free_memslot(slot);
+	static_call_cond(kvm_x86_free_memslot)(kvm, slot);
 }
 
 int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
@@ -11821,6 +11822,7 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
 }
 
 static int kvm_alloc_memslot_metadata(struct kvm *kvm,
+				      const struct kvm_memory_slot *old,
 				      struct kvm_memory_slot *slot)
 {
 	unsigned long npages = slot->npages;
@@ -11873,8 +11875,15 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 	if (kvm_page_track_create_memslot(kvm, slot, npages))
 		goto out_free;
 
+	if (kvm_x86_ops.alloc_memslot_metadata &&
+	    static_call(kvm_x86_alloc_memslot_metadata)(kvm, old, slot))
+		goto out_free_page_track;
+
 	return 0;
 
+out_free_page_track:
+	kvm_page_track_free_memslot(slot);
+
 out_free:
 	memslot_rmap_free(slot);
 
@@ -11907,7 +11916,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				   enum kvm_mr_change change)
 {
 	if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
-		return kvm_alloc_memslot_metadata(kvm, new);
+		return kvm_alloc_memslot_metadata(kvm, old, new);
 
 	if (change == KVM_MR_FLAGS_ONLY)
 		memcpy(&new->arch, &old->arch, sizeof(old->arch));
-- 
2.32.0



* [PATCH RFC v1 5/9] KVM: SVM: Implement demand page pinning
  2022-03-08  4:38 [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Nikunj A Dadhania
                   ` (3 preceding siblings ...)
  2022-03-08  4:38 ` [PATCH RFC v1 4/9] KVM: SVM: Add pinning metadata in the arch memslot Nikunj A Dadhania
@ 2022-03-08  4:38 ` Nikunj A Dadhania
  2022-03-08 21:53   ` Mingwei Zhang
  2022-03-08  4:38 ` [PATCH RFC v1 6/9] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV/TDX Nikunj A Dadhania
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Nikunj A Dadhania @ 2022-03-08  4:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel, Nikunj A Dadhania

Use the memslot metadata to store the pinned state along with the
pfns. This improves SEV guest startup time from O(n) to a constant by
deferring guest page pinning until the pages are used to satisfy
nested page faults. The page references are dropped in the memslot
free path or the deallocation path.
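
Roughly, the teardown walks the pinned bitmap and drops each recorded
pfn (sketch; see sev_free_memslot() in the diff):

    for (i = 0; i < slot->npages; i++) {
            if (test_and_clear_bit(i, aslot->pinned_bitmap))
                    unpin_user_page(pfn_to_page(aslot->pfns[i]));
    }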

Reuse the enc_region structure definition as pinned_region to track
pages that are pinned outside of MMU demand pinning. Remove the rest
of the upfront-pinning code, as it is no longer needed with demand
pinning in place.

Retain svm_register_enc_region() and svm_unregister_enc_region() with
the required resource limit checks.

Guest boot time comparison
  +---------------+----------------+-------------------+
  | Guest Memory  |   baseline     |  Demand Pinning   |
  | Size (GB)     |    (secs)      |     (secs)        |
  +---------------+----------------+-------------------+
  |      4        |     6.16       |      5.71         |
  +---------------+----------------+-------------------+
  |     16        |     7.38       |      5.91         |
  +---------------+----------------+-------------------+
  |     64        |    12.17       |      6.16         |
  +---------------+----------------+-------------------+
  |    128        |    18.20       |      6.50         |
  +---------------+----------------+-------------------+
  |    192        |    24.56       |      6.80         |
  +---------------+----------------+-------------------+

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/kvm/svm/sev.c | 304 ++++++++++++++++++++++++++---------------
 arch/x86/kvm/svm/svm.c |   1 +
 arch/x86/kvm/svm/svm.h |   6 +-
 3 files changed, 200 insertions(+), 111 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index bd7572517c99..d0514975555d 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -66,7 +66,7 @@ static unsigned int nr_asids;
 static unsigned long *sev_asid_bitmap;
 static unsigned long *sev_reclaim_asid_bitmap;
 
-struct enc_region {
+struct pinned_region {
 	struct list_head list;
 	unsigned long npages;
 	struct page **pages;
@@ -257,7 +257,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	if (ret)
 		goto e_free;
 
-	INIT_LIST_HEAD(&sev->regions_list);
+	INIT_LIST_HEAD(&sev->pinned_regions_list);
 
 	return 0;
 
@@ -378,16 +378,34 @@ static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	return ret;
 }
 
+static bool rlimit_memlock_exceeds(unsigned long locked, unsigned long npages)
+{
+	unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	unsigned long lock_req;
+
+	lock_req = locked + npages;
+	return (lock_req > lock_limit) && !capable(CAP_IPC_LOCK);
+}
+
+static unsigned long get_npages(unsigned long uaddr, unsigned long ulen)
+{
+	unsigned long first, last;
+
+	/* Calculate number of pages. */
+	first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
+	last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
+	return last - first + 1;
+}
+
 static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
 				    unsigned long ulen, unsigned long *n,
 				    int write)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct pinned_region *region;
 	unsigned long npages, size;
 	int npinned;
-	unsigned long locked, lock_limit;
 	struct page **pages;
-	unsigned long first, last;
 	int ret;
 
 	lockdep_assert_held(&kvm->lock);
@@ -395,15 +413,12 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
 	if (ulen == 0 || uaddr + ulen < uaddr)
 		return ERR_PTR(-EINVAL);
 
-	/* Calculate number of pages. */
-	first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
-	last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
-	npages = (last - first + 1);
+	npages = get_npages(uaddr, ulen);
 
-	locked = sev->pages_locked + npages;
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
-		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit);
+	if (rlimit_memlock_exceeds(sev->pages_to_lock, npages)) {
+		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
+			sev->pages_to_lock + npages,
+			(rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT));
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -429,7 +444,19 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
 	}
 
 	*n = npages;
-	sev->pages_locked = locked;
+	sev->pages_to_lock += npages;
+
+	/* Maintain region list that is pinned to be unpinned in vm destroy path */
+	region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
+	if (!region) {
+		ret = -ENOMEM;
+		goto err;
+	}
+	region->uaddr = uaddr;
+	region->size = ulen;
+	region->pages = pages;
+	region->npages = npages;
+	list_add_tail(&region->list, &sev->pinned_regions_list);
 
 	return pages;
 
@@ -441,14 +468,43 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
 	return ERR_PTR(ret);
 }
 
-static void sev_unpin_memory(struct kvm *kvm, struct page **pages,
-			     unsigned long npages)
+static void __sev_unpin_memory(struct kvm *kvm, struct page **pages,
+			       unsigned long npages)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
 
 	unpin_user_pages(pages, npages);
 	kvfree(pages);
-	sev->pages_locked -= npages;
+	sev->pages_to_lock -= npages;
+}
+
+static struct pinned_region *find_pinned_region(struct kvm *kvm,
+					     struct page **pages,
+					     unsigned long n)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct list_head *head = &sev->pinned_regions_list;
+	struct pinned_region *i;
+
+	list_for_each_entry(i, head, list) {
+		if (i->pages == pages && i->npages == n)
+			return i;
+	}
+
+	return NULL;
+}
+
+static void sev_unpin_memory(struct kvm *kvm, struct page **pages,
+			     unsigned long npages)
+{
+	struct pinned_region *region;
+
+	region = find_pinned_region(kvm, pages, npages);
+	__sev_unpin_memory(kvm, pages, npages);
+	if (region) {
+		list_del(&region->list);
+		kfree(region);
+	}
 }
 
 static void sev_clflush_pages(struct page *pages[], unsigned long npages)
@@ -551,8 +607,9 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 		set_page_dirty_lock(inpages[i]);
 		mark_page_accessed(inpages[i]);
 	}
-	/* unlock the user pages */
-	sev_unpin_memory(kvm, inpages, npages);
+	/* unlock the user pages on error */
+	if (ret)
+		sev_unpin_memory(kvm, inpages, npages);
 	return ret;
 }
 
@@ -1059,7 +1116,8 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
 		set_page_dirty_lock(pages[i]);
 		mark_page_accessed(pages[i]);
 	}
-	sev_unpin_memory(kvm, pages, n);
+	if (ret)
+		sev_unpin_memory(kvm, pages, n);
 	return ret;
 }
 
@@ -1338,7 +1396,8 @@ static int sev_send_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 e_free_hdr:
 	kfree(hdr);
 e_unpin:
-	sev_unpin_memory(kvm, guest_page, n);
+	if (ret)
+		sev_unpin_memory(kvm, guest_page, n);
 
 	return ret;
 }
@@ -1508,7 +1567,8 @@ static int sev_receive_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	ret = sev_issue_cmd(kvm, SEV_CMD_RECEIVE_UPDATE_DATA, &data,
 				&argp->error);
 
-	sev_unpin_memory(kvm, guest_page, n);
+	if (ret)
+		sev_unpin_memory(kvm, guest_page, n);
 
 e_free_trans:
 	kfree(trans);
@@ -1629,16 +1689,17 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
 	dst->active = true;
 	dst->asid = src->asid;
 	dst->handle = src->handle;
-	dst->pages_locked = src->pages_locked;
+	dst->pages_to_lock = src->pages_to_lock;
 	dst->enc_context_owner = src->enc_context_owner;
 
 	src->asid = 0;
 	src->active = false;
 	src->handle = 0;
-	src->pages_locked = 0;
+	src->pages_to_lock = 0;
 	src->enc_context_owner = NULL;
 
-	list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
+	list_cut_before(&dst->pinned_regions_list, &src->pinned_regions_list,
+			&src->pinned_regions_list);
 }
 
 static int sev_es_migrate_from(struct kvm *dst, struct kvm *src)
@@ -1862,8 +1923,7 @@ int svm_register_enc_region(struct kvm *kvm,
 			    struct kvm_enc_region *range)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
-	struct enc_region *region;
-	int ret = 0;
+	unsigned long npages;
 
 	if (!sev_guest(kvm))
 		return -ENOTTY;
@@ -1875,101 +1935,35 @@ int svm_register_enc_region(struct kvm *kvm,
 	if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
 		return -EINVAL;
 
-	region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
-	if (!region)
+	npages = get_npages(range->addr, range->size);
+	if (rlimit_memlock_exceeds(sev->pages_to_lock, npages)) {
+		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
+		       sev->pages_to_lock + npages,
+		       (rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT));
 		return -ENOMEM;
-
-	mutex_lock(&kvm->lock);
-	region->pages = sev_pin_memory(kvm, range->addr, range->size, &region->npages, 1);
-	if (IS_ERR(region->pages)) {
-		ret = PTR_ERR(region->pages);
-		mutex_unlock(&kvm->lock);
-		goto e_free;
 	}
+	sev->pages_to_lock += npages;
 
-	region->uaddr = range->addr;
-	region->size = range->size;
-
-	list_add_tail(&region->list, &sev->regions_list);
-	mutex_unlock(&kvm->lock);
-
-	/*
-	 * The guest may change the memory encryption attribute from C=0 -> C=1
-	 * or vice versa for this memory range. Lets make sure caches are
-	 * flushed to ensure that guest data gets written into memory with
-	 * correct C-bit.
-	 */
-	sev_clflush_pages(region->pages, region->npages);
-
-	return ret;
-
-e_free:
-	kfree(region);
-	return ret;
-}
-
-static struct enc_region *
-find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
-{
-	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
-	struct list_head *head = &sev->regions_list;
-	struct enc_region *i;
-
-	list_for_each_entry(i, head, list) {
-		if (i->uaddr == range->addr &&
-		    i->size == range->size)
-			return i;
-	}
-
-	return NULL;
-}
-
-static void __unregister_enc_region_locked(struct kvm *kvm,
-					   struct enc_region *region)
-{
-	sev_unpin_memory(kvm, region->pages, region->npages);
-	list_del(&region->list);
-	kfree(region);
+	return 0;
 }
 
 int svm_unregister_enc_region(struct kvm *kvm,
 			      struct kvm_enc_region *range)
 {
-	struct enc_region *region;
-	int ret;
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	unsigned long npages;
 
 	/* If kvm is mirroring encryption context it isn't responsible for it */
 	if (is_mirroring_enc_context(kvm))
 		return -EINVAL;
 
-	mutex_lock(&kvm->lock);
-
-	if (!sev_guest(kvm)) {
-		ret = -ENOTTY;
-		goto failed;
-	}
-
-	region = find_enc_region(kvm, range);
-	if (!region) {
-		ret = -EINVAL;
-		goto failed;
-	}
-
-	/*
-	 * Ensure that all guest tagged cache entries are flushed before
-	 * releasing the pages back to the system for use. CLFLUSH will
-	 * not do this, so issue a WBINVD.
-	 */
-	wbinvd_on_all_cpus();
+	if (!sev_guest(kvm))
+		return -ENOTTY;
 
-	__unregister_enc_region_locked(kvm, region);
+	npages = get_npages(range->addr, range->size);
+	sev->pages_to_lock -= npages;
 
-	mutex_unlock(&kvm->lock);
 	return 0;
-
-failed:
-	mutex_unlock(&kvm->lock);
-	return ret;
 }
 
 int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
@@ -2018,7 +2012,7 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
 	mirror_sev->fd = source_sev->fd;
 	mirror_sev->es_active = source_sev->es_active;
 	mirror_sev->handle = source_sev->handle;
-	INIT_LIST_HEAD(&mirror_sev->regions_list);
+	INIT_LIST_HEAD(&mirror_sev->pinned_regions_list);
 	ret = 0;
 
 	/*
@@ -2038,8 +2032,9 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
 void sev_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
-	struct list_head *head = &sev->regions_list;
+	struct list_head *head = &sev->pinned_regions_list;
 	struct list_head *pos, *q;
+	struct pinned_region *region;
 
 	WARN_ON(sev->num_mirrored_vms);
 
@@ -2072,8 +2067,14 @@ void sev_vm_destroy(struct kvm *kvm)
 	 */
 	if (!list_empty(head)) {
 		list_for_each_safe(pos, q, head) {
-			__unregister_enc_region_locked(kvm,
-				list_entry(pos, struct enc_region, list));
+			/*
+			 * Unpin the memory that were pinned outside of MMU
+			 * demand pinning
+			 */
+			region = list_entry(pos, struct pinned_region, list);
+			__sev_unpin_memory(kvm, region->pages, region->npages);
+			list_del(&region->list);
+			kfree(region);
 			cond_resched();
 		}
 	}
@@ -2951,13 +2952,96 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
 	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
 }
 
+bool sev_pin_pfn(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
+		 kvm_pfn_t pfn, hva_t hva, bool write, enum pg_level level)
+{
+	unsigned int npages = KVM_PAGES_PER_HPAGE(level);
+	unsigned int flags = FOLL_LONGTERM, npinned;
+	struct kvm_arch_memory_slot *aslot;
+	struct kvm *kvm = vcpu->kvm;
+	gfn_t gfn_start, rel_gfn;
+	struct page *page[1];
+	kvm_pfn_t old_pfn;
+
+	if (!sev_guest(kvm))
+		return true;
+
+	if (WARN_ON_ONCE(!memslot->arch.pfns))
+		return false;
+
+	if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
+		return false;
+
+	hva = ALIGN_DOWN(hva, npages << PAGE_SHIFT);
+	flags |= write ? FOLL_WRITE : 0;
+
+	mutex_lock(&kvm->slots_arch_lock);
+	gfn_start = hva_to_gfn_memslot(hva, memslot);
+	rel_gfn = gfn_start - memslot->base_gfn;
+	aslot = &memslot->arch;
+	if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
+		old_pfn = aslot->pfns[rel_gfn];
+		if (old_pfn == pfn)
+			goto out;
+
+		/* Flush the cache before releasing the page to the system */
+		sev_flush_guest_memory(to_svm(vcpu), __va(old_pfn),
+				       npages * PAGE_SIZE);
+		unpin_user_page(pfn_to_page(old_pfn));
+	}
+	/* Pin the page, KVM doesn't yet support page migration. */
+	npinned = pin_user_pages_fast(hva, 1, flags, page);
+	KVM_BUG(npinned != 1, kvm, "SEV: Pinning failed\n");
+	KVM_BUG(pfn != page_to_pfn(page[0]), kvm, "SEV: pfn mismatch\n");
+
+	if (!this_cpu_has(X86_FEATURE_SME_COHERENT))
+		clflush_cache_range(__va(pfn << PAGE_SHIFT), npages * PAGE_SIZE);
+
+	WARN_ON(rel_gfn >= memslot->npages);
+	aslot->pfns[rel_gfn] = pfn;
+	set_bit(rel_gfn, aslot->pinned_bitmap);
+
+out:
+	mutex_unlock(&kvm->slots_arch_lock);
+	return true;
+}
+
 void sev_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 	struct kvm_arch_memory_slot *aslot = &slot->arch;
+	kvm_pfn_t *pfns;
+	gfn_t gfn;
+	int i;
 
 	if (!sev_guest(kvm))
 		return;
 
+	if (!aslot->pinned_bitmap || !slot->arch.pfns)
+		goto out;
+
+	pfns = aslot->pfns;
+
+	/*
+	 * Ensure that all guest tagged cache entries are flushed before
+	 * releasing the pages back to the system for use. CLFLUSH will
+	 * not do this, so issue a WBINVD.
+	 */
+	wbinvd_on_all_cpus();
+
+	/*
+	 * Iterate the memslot to find the pinned pfn using the bitmap and drop
+	 * the pfn stored.
+	 */
+	for (i = 0, gfn = slot->base_gfn; i < slot->npages; i++, gfn++) {
+		if (test_and_clear_bit(i, aslot->pinned_bitmap)) {
+			if (WARN_ON(!pfns[i]))
+				continue;
+
+			unpin_user_page(pfn_to_page(pfns[i]));
+		}
+	}
+
+out:
 	if (aslot->pinned_bitmap) {
 		kvfree(aslot->pinned_bitmap);
 		aslot->pinned_bitmap = NULL;
@@ -2992,6 +3076,8 @@ int sev_alloc_memslot_metadata(struct kvm *kvm,
 		return -ENOMEM;
 
 	aslot->pinned_bitmap = kvzalloc(pinned_bytes, GFP_KERNEL_ACCOUNT);
+	new->flags |= KVM_MEMSLOT_ENCRYPTED;
+
 	if (!aslot->pinned_bitmap) {
 		kvfree(aslot->pfns);
 		aslot->pfns = NULL;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ec06421cb532..463a90ed6f83 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4661,6 +4661,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 
 	.alloc_memslot_metadata = sev_alloc_memslot_metadata,
 	.free_memslot = sev_free_memslot,
+	.pin_pfn = sev_pin_pfn,
 };
 
 /*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index f00364020d7e..2f38e793ead0 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -75,8 +75,8 @@ struct kvm_sev_info {
 	unsigned int asid;	/* ASID used for this guest */
 	unsigned int handle;	/* SEV firmware handle */
 	int fd;			/* SEV device fd */
-	unsigned long pages_locked; /* Number of pages locked */
-	struct list_head regions_list;  /* List of registered regions */
+	unsigned long pages_to_lock; /* Number of page that can be locked */
+	struct list_head pinned_regions_list;  /* List of pinned regions */
 	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
 	struct kvm *enc_context_owner; /* Owner of copied encryption context */
 	unsigned long num_mirrored_vms; /* Number of VMs sharing this ASID */
@@ -621,5 +621,7 @@ int sev_alloc_memslot_metadata(struct kvm *kvm,
 			       struct kvm_memory_slot *new);
 void sev_free_memslot(struct kvm *kvm,
 		      struct kvm_memory_slot *slot);
+bool sev_pin_pfn(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
+		 kvm_pfn_t pfn, hva_t hva, bool write, enum pg_level level);
 
 #endif
-- 
2.32.0



* [PATCH RFC v1 6/9] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV/TDX
  2022-03-08  4:38 [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Nikunj A Dadhania
                   ` (4 preceding siblings ...)
  2022-03-08  4:38 ` [PATCH RFC v1 5/9] KVM: SVM: Implement demand page pinning Nikunj A Dadhania
@ 2022-03-08  4:38 ` Nikunj A Dadhania
  2022-03-08  4:38 ` [PATCH RFC v1 7/9] KVM: SEV: Carve out routine for allocation of pages Nikunj A Dadhania
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Nikunj A Dadhania @ 2022-03-08  4:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel, Nikunj A Dadhania

From: Sean Christopherson <sean.j.christopherson@intel.com>

Introduce a helper to directly (pun intended) fault in a TDP page
without having to go through the full page fault path. This allows
SEV/TDX to pin pages before booting the guest, provides the resulting
pfn to vendor code should it be needed in the future, and allows the
RET_PF_* enums to stay in mmu.c where they belong.
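
Intended usage (sketch, mirroring sev_pin_memory_in_mmu() added in
patch 9):

    pfn = kvm_mmu_map_tdp_page(vcpu, gfn_to_gpa(gfn),
                               PFERR_WRITE_MASK | PFERR_USER_MASK, PG_LEVEL_4K);
    if (is_error_noslot_pfn(pfn))
            return -EFAULT;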

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/kvm/mmu.h     |  3 +++
 arch/x86/kvm/mmu/mmu.c | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e9fbb2c8bbe2..0595891dd834 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -202,6 +202,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	return vcpu->arch.mmu->page_fault(vcpu, &fault);
 }
 
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       u32 error_code, int max_level);
+
 /*
  * Currently, we have two sorts of write-protection, a) the first one
  * write-protects guest page to sync the guest modification, b) another one is
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b94e5e71653e..5f5da1a4e6be 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4134,6 +4134,44 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       u32 error_code, int max_level)
+{
+	int r;
+	struct kvm_page_fault fault = (struct kvm_page_fault) {
+		.addr = gpa,
+		.error_code = error_code,
+		.exec = error_code & PFERR_FETCH_MASK,
+		.write = error_code & PFERR_WRITE_MASK,
+		.present = error_code & PFERR_PRESENT_MASK,
+		.rsvd = error_code & PFERR_RSVD_MASK,
+		.user = error_code & PFERR_USER_MASK,
+		.prefetch = false,
+		.is_tdp = true,
+		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
+	};
+
+	if (mmu_topup_memory_caches(vcpu, false))
+		return KVM_PFN_ERR_FAULT;
+
+	/*
+	 * Loop on the page fault path to handle the case where an mmu_notifier
+	 * invalidation triggers RET_PF_RETRY.  In the normal page fault path,
+	 * KVM needs to resume the guest in case the invalidation changed any
+	 * of the page fault properties, i.e. the gpa or error code.  For this
+	 * path, the gpa and error code are fixed by the caller, and the caller
+	 * expects failure if and only if the page fault can't be fixed.
+	 */
+	do {
+		fault.max_level = max_level;
+		fault.req_level = PG_LEVEL_4K;
+		fault.goal_level = PG_LEVEL_4K;
+		r = direct_page_fault(vcpu, &fault);
+	} while (r == RET_PF_RETRY && !is_error_noslot_pfn(fault.pfn));
+	return fault.pfn;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
+
 static void nonpaging_init_context(struct kvm_mmu *context)
 {
 	context->page_fault = nonpaging_page_fault;
-- 
2.32.0



* [PATCH RFC v1 7/9] KVM: SEV: Carve out routine for allocation of pages
  2022-03-08  4:38 [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Nikunj A Dadhania
                   ` (5 preceding siblings ...)
  2022-03-08  4:38 ` [PATCH RFC v1 6/9] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV/TDX Nikunj A Dadhania
@ 2022-03-08  4:38 ` Nikunj A Dadhania
  2022-03-08  4:38 ` [PATCH RFC v1 8/9] KVM: Move kvm_for_each_memslot_in_hva_range() to be used in SVM Nikunj A Dadhania
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Nikunj A Dadhania @ 2022-03-08  4:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel, Nikunj A Dadhania

Carve out a separate routine, sev_alloc_pages(), for allocating the
array of SEV page pointers. It will be used by the upcoming MMU-based
pinning.

While at it, validate the number of pages before the RLIMIT check and
use kzalloc() instead of kmalloc().

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/kvm/svm/sev.c | 44 ++++++++++++++++++++++++++----------------
 1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index d0514975555d..7e39320fc65d 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -397,43 +397,53 @@ static unsigned long get_npages(unsigned long uaddr, unsigned long ulen)
 	return last - first + 1;
 }
 
-static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
-				    unsigned long ulen, unsigned long *n,
-				    int write)
+static void *sev_alloc_pages(struct kvm_sev_info *sev, unsigned long uaddr,
+			     unsigned long ulen, unsigned long *n)
 {
-	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
-	struct pinned_region *region;
 	unsigned long npages, size;
-	int npinned;
 	struct page **pages;
-	int ret;
-
-	lockdep_assert_held(&kvm->lock);
 
 	if (ulen == 0 || uaddr + ulen < uaddr)
 		return ERR_PTR(-EINVAL);
 
 	npages = get_npages(uaddr, ulen);
+	if (WARN_ON_ONCE(npages > INT_MAX))
+		return ERR_PTR(-EINVAL);
 
 	if (rlimit_memlock_exceeds(sev->pages_to_lock, npages)) {
 		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
-			sev->pages_to_lock + npages,
-			(rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT));
+		       sev->pages_to_lock + npages,
+		       (rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT));
 		return ERR_PTR(-ENOMEM);
 	}
 
-	if (WARN_ON_ONCE(npages > INT_MAX))
-		return ERR_PTR(-EINVAL);
-
 	/* Avoid using vmalloc for smaller buffers. */
 	size = npages * sizeof(struct page *);
 	if (size > PAGE_SIZE)
 		pages = __vmalloc(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	else
-		pages = kmalloc(size, GFP_KERNEL_ACCOUNT);
+		pages = kzalloc(size, GFP_KERNEL_ACCOUNT);
 
-	if (!pages)
-		return ERR_PTR(-ENOMEM);
+	*n = pages ? npages : 0;
+	return pages;
+}
+
+static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
+				    unsigned long ulen, unsigned long *n,
+				    int write)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct pinned_region *region;
+	unsigned long npages;
+	struct page **pages;
+	int npinned;
+	int ret;
+
+	lockdep_assert_held(&kvm->lock);
+
+	pages = sev_alloc_pages(sev, uaddr, ulen, &npages);
+	if (IS_ERR(pages))
+		return pages;
 
 	/* Pin the user virtual address. */
 	npinned = pin_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages);
-- 
2.32.0



* [PATCH RFC v1 8/9] KVM: Move kvm_for_each_memslot_in_hva_range() to be used in SVM
  2022-03-08  4:38 [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Nikunj A Dadhania
                   ` (6 preceding siblings ...)
  2022-03-08  4:38 ` [PATCH RFC v1 7/9] KVM: SEV: Carve out routine for allocation of pages Nikunj A Dadhania
@ 2022-03-08  4:38 ` Nikunj A Dadhania
  2022-03-08  4:38 ` [PATCH RFC v1 9/9] KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data() Nikunj A Dadhania
  2022-03-28 21:00 ` [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Sean Christopherson
  9 siblings, 0 replies; 29+ messages in thread
From: Nikunj A Dadhania @ 2022-03-08  4:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel, Nikunj A Dadhania

Move the macro to kvm_host.h and make it visible for SVM to use.

No functional change intended.

Suggested-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 include/linux/kvm_host.h | 6 ++++++
 virt/kvm/kvm_main.c      | 6 ------
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c23022960d51..d72f692725d2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1063,6 +1063,12 @@ static inline bool kvm_memslot_iter_is_valid(struct kvm_memslot_iter *iter, gfn_
 	     kvm_memslot_iter_is_valid(iter, end);			\
 	     kvm_memslot_iter_next(iter))
 
+/* Iterate over each memslot intersecting [start, last] (inclusive) range */
+#define kvm_for_each_memslot_in_hva_range(node, slots, start, last)	     \
+	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
+	     node;							     \
+	     node = interval_tree_iter_next(node, start, last))
+
 /*
  * KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
  * - create a new memory slot
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c035fe6b39ec..ff890f41c7ce 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -511,12 +511,6 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
-/* Iterate over each memslot intersecting [start, last] (inclusive) range */
-#define kvm_for_each_memslot_in_hva_range(node, slots, start, last)	     \
-	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
-	     node;							     \
-	     node = interval_tree_iter_next(node, start, last))	     \
-
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 						  const struct kvm_hva_range *range)
 {
-- 
2.32.0



* [PATCH RFC v1 9/9] KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()
  2022-03-08  4:38 [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Nikunj A Dadhania
                   ` (7 preceding siblings ...)
  2022-03-08  4:38 ` [PATCH RFC v1 8/9] KVM: Move kvm_for_each_memslot_in_hva_range() to be used in SVM Nikunj A Dadhania
@ 2022-03-08  4:38 ` Nikunj A Dadhania
  2022-03-09 16:57   ` Maciej S. Szmigiero
  2022-03-28 21:00 ` [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Sean Christopherson
  9 siblings, 1 reply; 29+ messages in thread
From: Nikunj A Dadhania @ 2022-03-08  4:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel, Nikunj A Dadhania

From: Sean Christopherson <sean.j.christopherson@intel.com>

Pin the memory for the data passed to launch_update_data(), because
it gets encrypted before the guest is first run and must not be
moved afterwards; moving it would corrupt it.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[ * Use kvm_for_each_memslot_in_hva_range() to find slot and iterate
  * Updated sev_pin_memory_in_mmu() error handling.
  * As pinning/unpining pages is handled within MMU, removed
    {get,put}_user(). ]
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/kvm/svm/sev.c | 146 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 134 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 7e39320fc65d..1c371268934b 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -22,6 +22,7 @@
 #include <asm/trapnr.h>
 #include <asm/fpu/xcr.h>
 
+#include "mmu.h"
 #include "x86.h"
 #include "svm.h"
 #include "svm_ops.h"
@@ -428,9 +429,93 @@ static void *sev_alloc_pages(struct kvm_sev_info *sev, unsigned long uaddr,
 	return pages;
 }
 
+#define SEV_PFERR_RO (PFERR_USER_MASK)
+#define SEV_PFERR_RW (PFERR_WRITE_MASK | PFERR_USER_MASK)
+
+static struct page **sev_pin_memory_in_mmu(struct kvm *kvm, unsigned long addr,
+					   unsigned long size,
+					   unsigned long *npages)
+{
+	unsigned long hva_start, hva_end, uaddr, end, slot_start, slot_end;
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct interval_tree_node *node;
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	int idx, ret = 0, i = 0;
+	struct kvm_vcpu *vcpu;
+	struct page **pages;
+	kvm_pfn_t pfn;
+	u32 err_code;
+	gfn_t gfn;
+
+	pages = sev_alloc_pages(sev, addr, size, npages);
+	if (IS_ERR(pages))
+		return pages;
+
+	vcpu = kvm_get_vcpu(kvm, 0);
+	if (mutex_lock_killable(&vcpu->mutex)) {
+		kvfree(pages);
+		return ERR_PTR(-EINTR);
+	}
+
+	vcpu_load(vcpu);
+	idx = srcu_read_lock(&kvm->srcu);
+
+	kvm_mmu_load(vcpu);
+
+	end = addr + (*npages << PAGE_SHIFT);
+	slots = kvm_memslots(kvm);
+
+	kvm_for_each_memslot_in_hva_range(node, slots, addr, end) {
+		slot = container_of(node, struct kvm_memory_slot,
+				    hva_node[slots->node_idx]);
+		slot_start = slot->userspace_addr;
+		slot_end = slot_start + (slot->npages << PAGE_SHIFT);
+		hva_start = max(addr, slot_start);
+		hva_end = min(end, slot_end);
+
+		err_code = (slot->flags & KVM_MEM_READONLY) ?
+			SEV_PFERR_RO : SEV_PFERR_RW;
+
+		for (uaddr = hva_start; uaddr < hva_end; uaddr += PAGE_SIZE) {
+			if (signal_pending(current)) {
+				ret = -ERESTARTSYS;
+				break;
+			}
+
+			if (need_resched())
+				cond_resched();
+
+			/*
+			 * Fault in the page and sev_pin_page() will handle the
+			 * pinning
+			 */
+			gfn = hva_to_gfn_memslot(uaddr, slot);
+			pfn = kvm_mmu_map_tdp_page(vcpu, gfn_to_gpa(gfn),
+						   err_code, PG_LEVEL_4K);
+			if (is_error_noslot_pfn(pfn)) {
+				ret = -EFAULT;
+				break;
+			}
+			pages[i++] = pfn_to_page(pfn);
+		}
+	}
+
+	kvm_mmu_unload(vcpu);
+	srcu_read_unlock(&kvm->srcu, idx);
+	vcpu_put(vcpu);
+	mutex_unlock(&vcpu->mutex);
+
+	if (!ret)
+		return pages;
+
+	kvfree(pages);
+	return ERR_PTR(ret);
+}
+
 static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
 				    unsigned long ulen, unsigned long *n,
-				    int write)
+				    int write, bool mmu_usable)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
 	struct pinned_region *region;
@@ -441,6 +526,10 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
 
 	lockdep_assert_held(&kvm->lock);
 
+	/* Use MMU based pinning if possible. */
+	if (mmu_usable)
+		return sev_pin_memory_in_mmu(kvm, uaddr, ulen, n);
+
 	pages = sev_alloc_pages(sev, uaddr, ulen, &npages);
 	if (IS_ERR(pages))
 		return pages;
@@ -558,6 +647,7 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 {
 	unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i;
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	bool mmu_usable = atomic_read(&kvm->online_vcpus) > 0;
 	struct kvm_sev_launch_update_data params;
 	struct sev_data_launch_update_data data;
 	struct page **inpages;
@@ -574,15 +664,18 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	vaddr_end = vaddr + size;
 
 	/* Lock the user memory. */
-	inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
+	inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1, mmu_usable);
 	if (IS_ERR(inpages))
 		return PTR_ERR(inpages);
 
 	/*
 	 * Flush (on non-coherent CPUs) before LAUNCH_UPDATE encrypts pages in
 	 * place; the cache may contain the data that was written unencrypted.
+	 * Flushing is automatically handled if the pages can be pinned in the
+	 * MMU.
 	 */
-	sev_clflush_pages(inpages, npages);
+	if (!mmu_usable)
+		sev_clflush_pages(inpages, npages);
 
 	data.reserved = 0;
 	data.handle = sev->handle;
@@ -617,9 +710,14 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 		set_page_dirty_lock(inpages[i]);
 		mark_page_accessed(inpages[i]);
 	}
-	/* unlock the user pages on error */
+	/*
+	 * unlock the user pages on error, else pages will be unpinned either
+	 * during memslot free path or vm destroy path
+	 */
 	if (ret)
 		sev_unpin_memory(kvm, inpages, npages);
+	else if (mmu_usable)
+		kvfree(inpages);
 	return ret;
 }
 
@@ -1001,11 +1099,11 @@ static int sev_dbg_crypt(struct kvm *kvm, struct kvm_sev_cmd *argp, bool dec)
 		int len, s_off, d_off;
 
 		/* lock userspace source and destination page */
-		src_p = sev_pin_memory(kvm, vaddr & PAGE_MASK, PAGE_SIZE, &n, 0);
+		src_p = sev_pin_memory(kvm, vaddr & PAGE_MASK, PAGE_SIZE, &n, 0, false);
 		if (IS_ERR(src_p))
 			return PTR_ERR(src_p);
 
-		dst_p = sev_pin_memory(kvm, dst_vaddr & PAGE_MASK, PAGE_SIZE, &n, 1);
+		dst_p = sev_pin_memory(kvm, dst_vaddr & PAGE_MASK, PAGE_SIZE, &n, 1, false);
 		if (IS_ERR(dst_p)) {
 			sev_unpin_memory(kvm, src_p, n);
 			return PTR_ERR(dst_p);
@@ -1057,6 +1155,7 @@ static int sev_dbg_crypt(struct kvm *kvm, struct kvm_sev_cmd *argp, bool dec)
 
 static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
 {
+	bool mmu_usable = atomic_read(&kvm->online_vcpus) > 0;
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
 	struct sev_data_launch_secret data;
 	struct kvm_sev_launch_secret params;
@@ -1071,15 +1170,18 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
 		return -EFAULT;
 
-	pages = sev_pin_memory(kvm, params.guest_uaddr, params.guest_len, &n, 1);
+	pages = sev_pin_memory(kvm, params.guest_uaddr, params.guest_len, &n, 1, mmu_usable);
 	if (IS_ERR(pages))
 		return PTR_ERR(pages);
 
 	/*
 	 * Flush (on non-coherent CPUs) before LAUNCH_SECRET encrypts pages in
 	 * place; the cache may contain the data that was written unencrypted.
+	 * Flushing is automatically handled if the pages can be pinned in the
+	 * MMU.
 	 */
-	sev_clflush_pages(pages, n);
+	if (!mmu_usable)
+		sev_clflush_pages(pages, n);
 
 	/*
 	 * The secret must be copied into contiguous memory region, lets verify
@@ -1126,8 +1228,15 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
 		set_page_dirty_lock(pages[i]);
 		mark_page_accessed(pages[i]);
 	}
+	/*
+	 * unlock the user pages on error, else pages will be unpinned either
+	 * during memslot free path or vm destroy path
+	 */
 	if (ret)
 		sev_unpin_memory(kvm, pages, n);
+	else if (mmu_usable)
+		kvfree(pages);
+
 	return ret;
 }
 
@@ -1358,7 +1467,7 @@ static int sev_send_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 
 	/* Pin guest memory */
 	guest_page = sev_pin_memory(kvm, params.guest_uaddr & PAGE_MASK,
-				    PAGE_SIZE, &n, 0);
+				    PAGE_SIZE, &n, 0, false);
 	if (IS_ERR(guest_page))
 		return PTR_ERR(guest_page);
 
@@ -1406,6 +1515,10 @@ static int sev_send_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 e_free_hdr:
 	kfree(hdr);
 e_unpin:
+	/*
+	 * unlock the user pages on error, else pages will be unpinned either
+	 * during memslot free path or vm destroy path
+	 */
 	if (ret)
 		sev_unpin_memory(kvm, guest_page, n);
 
@@ -1512,6 +1625,7 @@ static int sev_receive_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
 
 static int sev_receive_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 {
+	bool mmu_usable = atomic_read(&kvm->online_vcpus) > 0;
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
 	struct kvm_sev_receive_update_data params;
 	struct sev_data_receive_update_data data;
@@ -1555,7 +1669,7 @@ static int sev_receive_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 
 	/* Pin guest memory */
 	guest_page = sev_pin_memory(kvm, params.guest_uaddr & PAGE_MASK,
-				    PAGE_SIZE, &n, 1);
+				    PAGE_SIZE, &n, 1, mmu_usable);
 	if (IS_ERR(guest_page)) {
 		ret = PTR_ERR(guest_page);
 		goto e_free_trans;
@@ -1564,9 +1678,11 @@ static int sev_receive_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	/*
 	 * Flush (on non-coherent CPUs) before RECEIVE_UPDATE_DATA, the PSP
 	 * encrypts the written data with the guest's key, and the cache may
-	 * contain dirty, unencrypted data.
+	 * contain dirty, unencrypted data. Flushing is automatically handled if
+	 * the pages can be pinned in the MMU.
 	 */
-	sev_clflush_pages(guest_page, n);
+	if (!mmu_usable)
+		sev_clflush_pages(guest_page, n);
 
 	/* The RECEIVE_UPDATE_DATA command requires C-bit to be always set. */
 	data.guest_address = (page_to_pfn(guest_page[0]) << PAGE_SHIFT) + offset;
@@ -1577,8 +1693,14 @@ static int sev_receive_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	ret = sev_issue_cmd(kvm, SEV_CMD_RECEIVE_UPDATE_DATA, &data,
 				&argp->error);
 
+	/*
+	 * unlock the user pages on error, else pages will be unpinned either
+	 * during memslot free path or vm destroy path
+	 */
 	if (ret)
 		sev_unpin_memory(kvm, guest_page, n);
+	else if (mmu_usable)
+		kvfree(guest_page);
 
 e_free_trans:
 	kfree(trans);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 5/9] KVM: SVM: Implement demand page pinning
  2022-03-08  4:38 ` [PATCH RFC v1 5/9] KVM: SVM: Implement demand page pinning Nikunj A Dadhania
@ 2022-03-08 21:53   ` Mingwei Zhang
  2022-03-09  5:10     ` Nikunj A. Dadhania
  0 siblings, 1 reply; 29+ messages in thread
From: Mingwei Zhang @ 2022-03-08 21:53 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Brijesh Singh, Tom Lendacky,
	Peter Gonda, Bharata B Rao, Maciej S . Szmigiero,
	David Hildenbrand, kvm, linux-kernel

On Tue, Mar 08, 2022, Nikunj A Dadhania wrote:
> Use the memslot metadata to store the pinned data along with the pfns.
> This improves the SEV guest startup time from O(n) to a constant by
> deferring guest page pinning until the pages are used to satisfy
> nested page faults. The page reference will be dropped in the memslot
> free path or deallocation path.
> 
> Reuse enc_region structure definition as pinned_region to maintain
> pages that are pinned outside of MMU demand pinning. Remove the rest
> of the code that did upfront pinning, as it is no longer needed in
> view of the demand pinning support.

I don't quite understand why we still need the enc_region. I have
several concerns. Details below.
>
> Retain svm_register_enc_region() and svm_unregister_enc_region() with
> required checks for resource limit.
> 
> Guest boot time comparison
>   +---------------+----------------+-------------------+
>   | Guest Memory  |   baseline     |  Demand Pinning   |
>   | Size (GB)     |    (secs)      |     (secs)        |
>   +---------------+----------------+-------------------+
>   |      4        |     6.16       |      5.71         |
>   +---------------+----------------+-------------------+
>   |     16        |     7.38       |      5.91         |
>   +---------------+----------------+-------------------+
>   |     64        |    12.17       |      6.16         |
>   +---------------+----------------+-------------------+
>   |    128        |    18.20       |      6.50         |
>   +---------------+----------------+-------------------+
>   |    192        |    24.56       |      6.80         |
>   +---------------+----------------+-------------------+
> 
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> ---
>  arch/x86/kvm/svm/sev.c | 304 ++++++++++++++++++++++++++---------------
>  arch/x86/kvm/svm/svm.c |   1 +
>  arch/x86/kvm/svm/svm.h |   6 +-
>  3 files changed, 200 insertions(+), 111 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index bd7572517c99..d0514975555d 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -66,7 +66,7 @@ static unsigned int nr_asids;
>  static unsigned long *sev_asid_bitmap;
>  static unsigned long *sev_reclaim_asid_bitmap;
>  
> -struct enc_region {
> +struct pinned_region {
>  	struct list_head list;
>  	unsigned long npages;
>  	struct page **pages;
> @@ -257,7 +257,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  	if (ret)
>  		goto e_free;
>  
> -	INIT_LIST_HEAD(&sev->regions_list);
> +	INIT_LIST_HEAD(&sev->pinned_regions_list);
>  
>  	return 0;
>  
> @@ -378,16 +378,34 @@ static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  	return ret;
>  }
>  
> +static bool rlimit_memlock_exceeds(unsigned long locked, unsigned long npages)
> +{
> +	unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	unsigned long lock_req;
> +
> +	lock_req = locked + npages;
> +	return (lock_req > lock_limit) && !capable(CAP_IPC_LOCK);
> +}
> +
> +static unsigned long get_npages(unsigned long uaddr, unsigned long ulen)
> +{
> +	unsigned long first, last;
> +
> +	/* Calculate number of pages. */
> +	first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
> +	last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
> +	return last - first + 1;
> +}
> +
>  static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>  				    unsigned long ulen, unsigned long *n,
>  				    int write)
>  {
>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	struct pinned_region *region;
>  	unsigned long npages, size;
>  	int npinned;
> -	unsigned long locked, lock_limit;
>  	struct page **pages;
> -	unsigned long first, last;
>  	int ret;
>  
>  	lockdep_assert_held(&kvm->lock);
> @@ -395,15 +413,12 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>  	if (ulen == 0 || uaddr + ulen < uaddr)
>  		return ERR_PTR(-EINVAL);
>  
> -	/* Calculate number of pages. */
> -	first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
> -	last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
> -	npages = (last - first + 1);
> +	npages = get_npages(uaddr, ulen);
>  
> -	locked = sev->pages_locked + npages;
> -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> -	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> -		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit);
> +	if (rlimit_memlock_exceeds(sev->pages_to_lock, npages)) {
> +		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
> +			sev->pages_to_lock + npages,
> +			(rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT));
>  		return ERR_PTR(-ENOMEM);
>  	}
>  
> @@ -429,7 +444,19 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>  	}
>  
>  	*n = npages;
> -	sev->pages_locked = locked;
> +	sev->pages_to_lock += npages;
> +
> +	/* Maintain region list that is pinned to be unpinned in vm destroy path */
> +	region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
> +	if (!region) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +	region->uaddr = uaddr;
> +	region->size = ulen;
> +	region->pages = pages;
> +	region->npages = npages;
> +	list_add_tail(&region->list, &sev->pinned_regions_list);

Hmm. I see a duplication of the metadata. We already store the pfns in
memslot. But now we also do it in regions. Is this one used for
migration purposes?

I might be missing some of the context here. But we may still have to support
the user-level memory pinning API as a legacy case. Instead of duplicating
the storage, can we change the region list to the list of memslot->pfns
instead (Or using the struct **pages in memslot instead of pfns)? This
way APIs could still work but we don't need extra storage burden.

Anyway, I think it might be essential to unify them; otherwise, not only is
the metadata size quite large, but it is also confusing to have parallel
data structures doing the same thing.
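
For illustration, one possible shape of that unification (purely a sketch;
sev_record_pin() is an invented name, not code from this series): record the
pre-MMU pins directly in the per-memslot arrays the series already adds, so
the separate region list goes away:

	/*
	 * Hypothetical sketch only: reuse the per-memslot metadata instead of
	 * a separate pinned_region list.
	 */
	static void sev_record_pin(struct kvm_memory_slot *slot, gfn_t rel_gfn,
				   kvm_pfn_t pfn)
	{
		slot->arch.pfns[rel_gfn] = pfn;
		set_bit(rel_gfn, slot->arch.pinned_bitmap);
	}
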
>
>  	return pages;
>  
> @@ -441,14 +468,43 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>  	return ERR_PTR(ret);
>  }
>  
> -static void sev_unpin_memory(struct kvm *kvm, struct page **pages,
> -			     unsigned long npages)
> +static void __sev_unpin_memory(struct kvm *kvm, struct page **pages,
> +			       unsigned long npages)
>  {
>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>  
>  	unpin_user_pages(pages, npages);
>  	kvfree(pages);
> -	sev->pages_locked -= npages;
> +	sev->pages_to_lock -= npages;
> +}
> +
> +static struct pinned_region *find_pinned_region(struct kvm *kvm,
> +					     struct page **pages,
> +					     unsigned long n)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	struct list_head *head = &sev->pinned_regions_list;
> +	struct pinned_region *i;
> +
> +	list_for_each_entry(i, head, list) {
> +		if (i->pages == pages && i->npages == n)
> +			return i;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void sev_unpin_memory(struct kvm *kvm, struct page **pages,
> +			     unsigned long npages)
> +{
> +	struct pinned_region *region;
> +
> +	region = find_pinned_region(kvm, pages, npages);
> +	__sev_unpin_memory(kvm, pages, npages);
> +	if (region) {
> +		list_del(&region->list);
> +		kfree(region);
> +	}
>  }
>  
>  static void sev_clflush_pages(struct page *pages[], unsigned long npages)
> @@ -551,8 +607,9 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  		set_page_dirty_lock(inpages[i]);
>  		mark_page_accessed(inpages[i]);
>  	}
> -	/* unlock the user pages */
> -	sev_unpin_memory(kvm, inpages, npages);
> +	/* unlock the user pages on error */
> +	if (ret)
> +		sev_unpin_memory(kvm, inpages, npages);
>  	return ret;
>  }
>  
> @@ -1059,7 +1116,8 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  		set_page_dirty_lock(pages[i]);
>  		mark_page_accessed(pages[i]);
>  	}
> -	sev_unpin_memory(kvm, pages, n);
> +	if (ret)
> +		sev_unpin_memory(kvm, pages, n);
>  	return ret;
>  }
>  
> @@ -1338,7 +1396,8 @@ static int sev_send_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  e_free_hdr:
>  	kfree(hdr);
>  e_unpin:
> -	sev_unpin_memory(kvm, guest_page, n);
> +	if (ret)
> +		sev_unpin_memory(kvm, guest_page, n);
>  
>  	return ret;
>  }
> @@ -1508,7 +1567,8 @@ static int sev_receive_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  	ret = sev_issue_cmd(kvm, SEV_CMD_RECEIVE_UPDATE_DATA, &data,
>  				&argp->error);
>  
> -	sev_unpin_memory(kvm, guest_page, n);
> +	if (ret)
> +		sev_unpin_memory(kvm, guest_page, n);
>  
>  e_free_trans:
>  	kfree(trans);
> @@ -1629,16 +1689,17 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
>  	dst->active = true;
>  	dst->asid = src->asid;
>  	dst->handle = src->handle;
> -	dst->pages_locked = src->pages_locked;
> +	dst->pages_to_lock = src->pages_to_lock;
>  	dst->enc_context_owner = src->enc_context_owner;
>  
>  	src->asid = 0;
>  	src->active = false;
>  	src->handle = 0;
> -	src->pages_locked = 0;
> +	src->pages_to_lock = 0;
>  	src->enc_context_owner = NULL;
>  
> -	list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
> +	list_cut_before(&dst->pinned_regions_list, &src->pinned_regions_list,
> +			&src->pinned_regions_list);
>  }
>  
>  static int sev_es_migrate_from(struct kvm *dst, struct kvm *src)
> @@ -1862,8 +1923,7 @@ int svm_register_enc_region(struct kvm *kvm,
>  			    struct kvm_enc_region *range)
>  {
>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> -	struct enc_region *region;
> -	int ret = 0;
> +	unsigned long npages;
>  
>  	if (!sev_guest(kvm))
>  		return -ENOTTY;
> @@ -1875,101 +1935,35 @@ int svm_register_enc_region(struct kvm *kvm,
>  	if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
>  		return -EINVAL;
>  
> -	region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
> -	if (!region)
> +	npages = get_npages(range->addr, range->size);
> +	if (rlimit_memlock_exceeds(sev->pages_to_lock, npages)) {
> +		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
> +		       sev->pages_to_lock + npages,
> +		       (rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT));
>  		return -ENOMEM;
> -
> -	mutex_lock(&kvm->lock);
> -	region->pages = sev_pin_memory(kvm, range->addr, range->size, &region->npages, 1);
> -	if (IS_ERR(region->pages)) {
> -		ret = PTR_ERR(region->pages);
> -		mutex_unlock(&kvm->lock);
> -		goto e_free;
>  	}
> +	sev->pages_to_lock += npages;
>  
> -	region->uaddr = range->addr;
> -	region->size = range->size;
> -
> -	list_add_tail(&region->list, &sev->regions_list);
> -	mutex_unlock(&kvm->lock);
> -
> -	/*
> -	 * The guest may change the memory encryption attribute from C=0 -> C=1
> -	 * or vice versa for this memory range. Lets make sure caches are
> -	 * flushed to ensure that guest data gets written into memory with
> -	 * correct C-bit.
> -	 */
> -	sev_clflush_pages(region->pages, region->npages);
> -
> -	return ret;
> -
> -e_free:
> -	kfree(region);
> -	return ret;
> -}
> -
> -static struct enc_region *
> -find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
> -{
> -	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> -	struct list_head *head = &sev->regions_list;
> -	struct enc_region *i;
> -
> -	list_for_each_entry(i, head, list) {
> -		if (i->uaddr == range->addr &&
> -		    i->size == range->size)
> -			return i;
> -	}
> -
> -	return NULL;
> -}
> -
> -static void __unregister_enc_region_locked(struct kvm *kvm,
> -					   struct enc_region *region)
> -{
> -	sev_unpin_memory(kvm, region->pages, region->npages);
> -	list_del(&region->list);
> -	kfree(region);
> +	return 0;
>  }
>  
>  int svm_unregister_enc_region(struct kvm *kvm,
>  			      struct kvm_enc_region *range)
>  {
> -	struct enc_region *region;
> -	int ret;
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	unsigned long npages;
>  
>  	/* If kvm is mirroring encryption context it isn't responsible for it */
>  	if (is_mirroring_enc_context(kvm))
>  		return -EINVAL;
>  
> -	mutex_lock(&kvm->lock);
> -
> -	if (!sev_guest(kvm)) {
> -		ret = -ENOTTY;
> -		goto failed;
> -	}
> -
> -	region = find_enc_region(kvm, range);
> -	if (!region) {
> -		ret = -EINVAL;
> -		goto failed;
> -	}
> -
> -	/*
> -	 * Ensure that all guest tagged cache entries are flushed before
> -	 * releasing the pages back to the system for use. CLFLUSH will
> -	 * not do this, so issue a WBINVD.
> -	 */
> -	wbinvd_on_all_cpus();
> +	if (!sev_guest(kvm))
> +		return -ENOTTY;
>  
> -	__unregister_enc_region_locked(kvm, region);
> +	npages = get_npages(range->addr, range->size);
> +	sev->pages_to_lock -= npages;
>  
> -	mutex_unlock(&kvm->lock);
>  	return 0;
> -
> -failed:
> -	mutex_unlock(&kvm->lock);
> -	return ret;
>  }
>  
>  int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> @@ -2018,7 +2012,7 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>  	mirror_sev->fd = source_sev->fd;
>  	mirror_sev->es_active = source_sev->es_active;
>  	mirror_sev->handle = source_sev->handle;
> -	INIT_LIST_HEAD(&mirror_sev->regions_list);
> +	INIT_LIST_HEAD(&mirror_sev->pinned_regions_list);
>  	ret = 0;
>  
>  	/*
> @@ -2038,8 +2032,9 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>  void sev_vm_destroy(struct kvm *kvm)
>  {
>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> -	struct list_head *head = &sev->regions_list;
> +	struct list_head *head = &sev->pinned_regions_list;
>  	struct list_head *pos, *q;
> +	struct pinned_region *region;
>  
>  	WARN_ON(sev->num_mirrored_vms);
>  
> @@ -2072,8 +2067,14 @@ void sev_vm_destroy(struct kvm *kvm)
>  	 */
>  	if (!list_empty(head)) {
>  		list_for_each_safe(pos, q, head) {
> -			__unregister_enc_region_locked(kvm,
> -				list_entry(pos, struct enc_region, list));
> +			/*
> +			 * Unpin the memory that were pinned outside of MMU
> +			 * demand pinning
> +			 */
> +			region = list_entry(pos, struct pinned_region, list);
> +			__sev_unpin_memory(kvm, region->pages, region->npages);
> +			list_del(&region->list);
> +			kfree(region);
>  			cond_resched();
>  		}
>  	}
 So the guest memory has already been unpinned in sev_free_memslot().
 Why do it again here?

> @@ -2951,13 +2952,96 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>  	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
>  }
>  
> +bool sev_pin_pfn(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
> +		 kvm_pfn_t pfn, hva_t hva, bool write, enum pg_level level)
> +{
> +	unsigned int npages = KVM_PAGES_PER_HPAGE(level);
> +	unsigned int flags = FOLL_LONGTERM, npinned;
> +	struct kvm_arch_memory_slot *aslot;
> +	struct kvm *kvm = vcpu->kvm;
> +	gfn_t gfn_start, rel_gfn;
> +	struct page *page[1];
> +	kvm_pfn_t old_pfn;
> +
> +	if (!sev_guest(kvm))
> +		return true;
> +
> +	if (WARN_ON_ONCE(!memslot->arch.pfns))
> +		return false;
> +
> +	if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
> +		return false;
> +
> +	hva = ALIGN_DOWN(hva, npages << PAGE_SHIFT);
> +	flags |= write ? FOLL_WRITE : 0;
> +
> +	mutex_lock(&kvm->slots_arch_lock);
> +	gfn_start = hva_to_gfn_memslot(hva, memslot);
> +	rel_gfn = gfn_start - memslot->base_gfn;
> +	aslot = &memslot->arch;
> +	if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
> +		old_pfn = aslot->pfns[rel_gfn];
> +		if (old_pfn == pfn)
> +			goto out;
> +
> +		/* Flush the cache before releasing the page to the system */
> +		sev_flush_guest_memory(to_svm(vcpu), __va(old_pfn),
> +				       npages * PAGE_SIZE);
> +		unpin_user_page(pfn_to_page(old_pfn));
> +	}
> +	/* Pin the page, KVM doesn't yet support page migration. */
> +	npinned = pin_user_pages_fast(hva, 1, flags, page);
> +	KVM_BUG(npinned != 1, kvm, "SEV: Pinning failed\n");
> +	KVM_BUG(pfn != page_to_pfn(page[0]), kvm, "SEV: pfn mismatch\n");
> +
> +	if (!this_cpu_has(X86_FEATURE_SME_COHERENT))
> +		clflush_cache_range(__va(pfn << PAGE_SHIFT), npages * PAGE_SIZE);
> +
> +	WARN_ON(rel_gfn >= memslot->npages);
> +	aslot->pfns[rel_gfn] = pfn;
> +	set_bit(rel_gfn, aslot->pinned_bitmap);
> +
> +out:
> +	mutex_unlock(&kvm->slots_arch_lock);
> +	return true;
> +}
> +
>  void sev_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
>  	struct kvm_arch_memory_slot *aslot = &slot->arch;
> +	kvm_pfn_t *pfns;
> +	gfn_t gfn;
> +	int i;
>  
>  	if (!sev_guest(kvm))
>  		return;
>  
> +	if (!aslot->pinned_bitmap || !slot->arch.pfns)
> +		goto out;
> +
> +	pfns = aslot->pfns;
> +
> +	/*
> +	 * Ensure that all guest tagged cache entries are flushed before
> +	 * releasing the pages back to the system for use. CLFLUSH will
> +	 * not do this, so issue a WBINVD.
> +	 */
> +	wbinvd_on_all_cpus();

This is a heavy-weight operation and is essentially needed at most once
per VM shutdown. So it would be better to use a smaller hammer in the
following code. Or, alternatively, maybe pass a boolean from the caller
to skip the flush when it is set.
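
For illustration, a minimal sketch of the second alternative (the extra
parameter is hypothetical, not part of the posted series):

	/*
	 * Hypothetical sketch only: let the caller say whether a VM-wide flush
	 * has already been issued, so the per-memslot WBINVD can be skipped.
	 */
	void sev_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
			      bool need_flush)
	{
		if (need_flush)
			wbinvd_on_all_cpus();
		/* ... unpin and free the memslot metadata as before ... */
	}
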
> +
> +	/*
> +	 * Iterate the memslot to find the pinned pfn using the bitmap and drop
> +	 * the pfn stored.
> +	 */
> +	for (i = 0, gfn = slot->base_gfn; i < slot->npages; i++, gfn++) {
> +		if (test_and_clear_bit(i, aslot->pinned_bitmap)) {
> +			if (WARN_ON(!pfns[i]))
> +				continue;
> +
> +			unpin_user_page(pfn_to_page(pfns[i]));
> +		}
> +	}
> +
> +out:
>  	if (aslot->pinned_bitmap) {
>  		kvfree(aslot->pinned_bitmap);
>  		aslot->pinned_bitmap = NULL;
> @@ -2992,6 +3076,8 @@ int sev_alloc_memslot_metadata(struct kvm *kvm,
>  		return -ENOMEM;
>  
>  	aslot->pinned_bitmap = kvzalloc(pinned_bytes, GFP_KERNEL_ACCOUNT);
> +	new->flags |= KVM_MEMSLOT_ENCRYPTED;
> +
>  	if (!aslot->pinned_bitmap) {
>  		kvfree(aslot->pfns);
>  		aslot->pfns = NULL;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index ec06421cb532..463a90ed6f83 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4661,6 +4661,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>  
>  	.alloc_memslot_metadata = sev_alloc_memslot_metadata,
>  	.free_memslot = sev_free_memslot,
> +	.pin_pfn = sev_pin_pfn,
>  };
>  
>  /*
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index f00364020d7e..2f38e793ead0 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -75,8 +75,8 @@ struct kvm_sev_info {
>  	unsigned int asid;	/* ASID used for this guest */
>  	unsigned int handle;	/* SEV firmware handle */
>  	int fd;			/* SEV device fd */
> -	unsigned long pages_locked; /* Number of pages locked */
> -	struct list_head regions_list;  /* List of registered regions */
> +	unsigned long pages_to_lock; /* Number of page that can be locked */
> +	struct list_head pinned_regions_list;  /* List of pinned regions */
>  	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
>  	struct kvm *enc_context_owner; /* Owner of copied encryption context */
>  	unsigned long num_mirrored_vms; /* Number of VMs sharing this ASID */
> @@ -621,5 +621,7 @@ int sev_alloc_memslot_metadata(struct kvm *kvm,
>  			       struct kvm_memory_slot *new);
>  void sev_free_memslot(struct kvm *kvm,
>  		      struct kvm_memory_slot *slot);
> +bool sev_pin_pfn(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
> +		 kvm_pfn_t pfn, hva_t hva, bool write, enum pg_level level);
>  
>  #endif
> -- 
> 2.32.0
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 5/9] KVM: SVM: Implement demand page pinning
  2022-03-08 21:53   ` Mingwei Zhang
@ 2022-03-09  5:10     ` Nikunj A. Dadhania
  2022-03-21  6:11       ` Mingwei Zhang
  0 siblings, 1 reply; 29+ messages in thread
From: Nikunj A. Dadhania @ 2022-03-09  5:10 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Brijesh Singh, Tom Lendacky,
	Peter Gonda, Bharata B Rao, Maciej S . Szmigiero,
	David Hildenbrand, kvm, linux-kernel

On 3/9/2022 3:23 AM, Mingwei Zhang wrote:
> On Tue, Mar 08, 2022, Nikunj A Dadhania wrote:
>> Use the memslot metadata to store the pinned data along with the pfns.
>> This improves the SEV guest startup time from O(n) to a constant by
>> deferring guest page pinning until the pages are used to satisfy
>> nested page faults. The page reference will be dropped in the memslot
>> free path or deallocation path.
>>
>> Reuse enc_region structure definition as pinned_region to maintain
>> pages that are pinned outside of MMU demand pinning. Remove the rest
>> of the code that did upfront pinning, as it is no longer needed in
>> view of the demand pinning support.
> 
> I don't quite understand why we still need the enc_region. I have
> several concerns. Details below.

With patch 9, the enc_region is used only for memory that was pinned before
any vCPU is online (i.e. the MMU is not yet usable).
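
Condensed from the patch 9 hunks quoted earlier in the thread, the dispatch
looks like this:

	/* sev_launch_update_data(): MMU pinning is only possible once a vCPU exists */
	bool mmu_usable = atomic_read(&kvm->online_vcpus) > 0;

	inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1, mmu_usable);

	/* ...and inside sev_pin_memory(): */
	if (mmu_usable)
		return sev_pin_memory_in_mmu(kvm, uaddr, ulen, n);
	/* otherwise fall back to pin_user_pages() plus the pinned_regions list */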

>>
>> Retain svm_register_enc_region() and svm_unregister_enc_region() with
>> required checks for resource limit.
>>
>> Guest boot time comparison
>>   +---------------+----------------+-------------------+
>>   | Guest Memory  |   baseline     |  Demand Pinning   |
>>   | Size (GB)     |    (secs)      |     (secs)        |
>>   +---------------+----------------+-------------------+
>>   |      4        |     6.16       |      5.71         |
>>   +---------------+----------------+-------------------+
>>   |     16        |     7.38       |      5.91         |
>>   +---------------+----------------+-------------------+
>>   |     64        |    12.17       |      6.16         |
>>   +---------------+----------------+-------------------+
>>   |    128        |    18.20       |      6.50         |
>>   +---------------+----------------+-------------------+
>>   |    192        |    24.56       |      6.80         |
>>   +---------------+----------------+-------------------+
>>
>> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
>> ---
>>  arch/x86/kvm/svm/sev.c | 304 ++++++++++++++++++++++++++---------------
>>  arch/x86/kvm/svm/svm.c |   1 +
>>  arch/x86/kvm/svm/svm.h |   6 +-
>>  3 files changed, 200 insertions(+), 111 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index bd7572517c99..d0514975555d 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -66,7 +66,7 @@ static unsigned int nr_asids;
>>  static unsigned long *sev_asid_bitmap;
>>  static unsigned long *sev_reclaim_asid_bitmap;
>>  
>> -struct enc_region {
>> +struct pinned_region {
>>  	struct list_head list;
>>  	unsigned long npages;
>>  	struct page **pages;
>> @@ -257,7 +257,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>  	if (ret)
>>  		goto e_free;
>>  
>> -	INIT_LIST_HEAD(&sev->regions_list);
>> +	INIT_LIST_HEAD(&sev->pinned_regions_list);
>>  
>>  	return 0;
>>  
>> @@ -378,16 +378,34 @@ static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>  	return ret;
>>  }
>>  
>> +static bool rlimit_memlock_exceeds(unsigned long locked, unsigned long npages)
>> +{
>> +	unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> +	unsigned long lock_req;
>> +
>> +	lock_req = locked + npages;
>> +	return (lock_req > lock_limit) && !capable(CAP_IPC_LOCK);
>> +}
>> +
>> +static unsigned long get_npages(unsigned long uaddr, unsigned long ulen)
>> +{
>> +	unsigned long first, last;
>> +
>> +	/* Calculate number of pages. */
>> +	first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
>> +	last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
>> +	return last - first + 1;
>> +}
>> +
>>  static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>>  				    unsigned long ulen, unsigned long *n,
>>  				    int write)
>>  {
>>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> +	struct pinned_region *region;
>>  	unsigned long npages, size;
>>  	int npinned;
>> -	unsigned long locked, lock_limit;
>>  	struct page **pages;
>> -	unsigned long first, last;
>>  	int ret;
>>  
>>  	lockdep_assert_held(&kvm->lock);
>> @@ -395,15 +413,12 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>>  	if (ulen == 0 || uaddr + ulen < uaddr)
>>  		return ERR_PTR(-EINVAL);
>>  
>> -	/* Calculate number of pages. */
>> -	first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
>> -	last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
>> -	npages = (last - first + 1);
>> +	npages = get_npages(uaddr, ulen);
>>  
>> -	locked = sev->pages_locked + npages;
>> -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> -	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
>> -		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit);
>> +	if (rlimit_memlock_exceeds(sev->pages_to_lock, npages)) {
>> +		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
>> +			sev->pages_to_lock + npages,
>> +			(rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT));
>>  		return ERR_PTR(-ENOMEM);
>>  	}
>>  
>> @@ -429,7 +444,19 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>>  	}
>>  
>>  	*n = npages;
>> -	sev->pages_locked = locked;
>> +	sev->pages_to_lock += npages;
>> +
>> +	/* Maintain region list that is pinned to be unpinned in vm destroy path */
>> +	region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
>> +	if (!region) {
>> +		ret = -ENOMEM;
>> +		goto err;
>> +	}
>> +	region->uaddr = uaddr;
>> +	region->size = ulen;
>> +	region->pages = pages;
>> +	region->npages = npages;
>> +	list_add_tail(&region->list, &sev->pinned_regions_list);
> 
> Hmm. I see a duplication of the metadata. We already store the pfns in
> memslot. But now we also do it in regions. Is this one used for
> migration purposes?

We are not duplicating; the enc_region list holds regions that are pinned through
paths other than svm_register_enc_region(). Later patches add infrastructure to
directly fault in those pages, which will use memslot->pfns.

> 
> I might be missing some of the context here. 

More context here:
https://lore.kernel.org/kvm/CAMkAt6p1-82LTRNB3pkPRwYh=wGpreUN=jcUeBj_dZt8ss9w0Q@mail.gmail.com/

> But we may still have to support
> the user-level memory pinning API as a legacy case. Instead of duplicating
> the storage, can we change the region list to the list of memslot->pfns
> instead (Or using the struct **pages in memslot instead of pfns)? This
> way APIs could still work but we don't need extra storage burden.

Right, patches 6 and 9 achieve this using memslot->pfns when the MMU is
active. We still need to maintain this enc_region if the user tries to pin
memory before the MMU is active (i.e. before any vCPU is online). In my testing,
I haven't seen the enc_region being used, but it is kept to make sure we do not break any userspace.
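
For reference, the per-memslot metadata that the demand-pinning path uses
(field names are from the series; the exact struct layout and types are
assumed from their usage in the hunks above):

	struct kvm_arch_memory_slot {
		/* ... existing x86 fields omitted ... */
		kvm_pfn_t	*pfns;		/* pinned pfn, indexed by gfn offset */
		unsigned long	*pinned_bitmap;	/* which gfn offsets hold a pin      */
	};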

> Anyway, I think it might be essential to unify them; otherwise, not only is
> the metadata size quite large, but it is also confusing to have parallel
> data structures doing the same thing.
>>
>>  	return pages;
>>  
>> @@ -441,14 +468,43 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>>  	return ERR_PTR(ret);
>>  }
>>  
>> -static void sev_unpin_memory(struct kvm *kvm, struct page **pages,
>> -			     unsigned long npages)
>> +static void __sev_unpin_memory(struct kvm *kvm, struct page **pages,
>> +			       unsigned long npages)
>>  {
>>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>  
>>  	unpin_user_pages(pages, npages);
>>  	kvfree(pages);
>> -	sev->pages_locked -= npages;
>> +	sev->pages_to_lock -= npages;
>> +}
>> +
>> +static struct pinned_region *find_pinned_region(struct kvm *kvm,
>> +					     struct page **pages,
>> +					     unsigned long n)
>> +{
>> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> +	struct list_head *head = &sev->pinned_regions_list;
>> +	struct pinned_region *i;
>> +
>> +	list_for_each_entry(i, head, list) {
>> +		if (i->pages == pages && i->npages == n)
>> +			return i;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +static void sev_unpin_memory(struct kvm *kvm, struct page **pages,
>> +			     unsigned long npages)
>> +{
>> +	struct pinned_region *region;
>> +
>> +	region = find_pinned_region(kvm, pages, npages);
>> +	__sev_unpin_memory(kvm, pages, npages);
>> +	if (region) {
>> +		list_del(&region->list);
>> +		kfree(region);
>> +	}
>>  }
>>  
>>  static void sev_clflush_pages(struct page *pages[], unsigned long npages)
>> @@ -551,8 +607,9 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>  		set_page_dirty_lock(inpages[i]);
>>  		mark_page_accessed(inpages[i]);
>>  	}
>> -	/* unlock the user pages */
>> -	sev_unpin_memory(kvm, inpages, npages);
>> +	/* unlock the user pages on error */
>> +	if (ret)
>> +		sev_unpin_memory(kvm, inpages, npages);
>>  	return ret;
>>  }
>>  
>> @@ -1059,7 +1116,8 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>  		set_page_dirty_lock(pages[i]);
>>  		mark_page_accessed(pages[i]);
>>  	}
>> -	sev_unpin_memory(kvm, pages, n);
>> +	if (ret)
>> +		sev_unpin_memory(kvm, pages, n);
>>  	return ret;
>>  }
>>  
>> @@ -1338,7 +1396,8 @@ static int sev_send_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>  e_free_hdr:
>>  	kfree(hdr);
>>  e_unpin:
>> -	sev_unpin_memory(kvm, guest_page, n);
>> +	if (ret)
>> +		sev_unpin_memory(kvm, guest_page, n);
>>  
>>  	return ret;
>>  }
>> @@ -1508,7 +1567,8 @@ static int sev_receive_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>  	ret = sev_issue_cmd(kvm, SEV_CMD_RECEIVE_UPDATE_DATA, &data,
>>  				&argp->error);
>>  
>> -	sev_unpin_memory(kvm, guest_page, n);
>> +	if (ret)
>> +		sev_unpin_memory(kvm, guest_page, n);
>>  
>>  e_free_trans:
>>  	kfree(trans);
>> @@ -1629,16 +1689,17 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
>>  	dst->active = true;
>>  	dst->asid = src->asid;
>>  	dst->handle = src->handle;
>> -	dst->pages_locked = src->pages_locked;
>> +	dst->pages_to_lock = src->pages_to_lock;
>>  	dst->enc_context_owner = src->enc_context_owner;
>>  
>>  	src->asid = 0;
>>  	src->active = false;
>>  	src->handle = 0;
>> -	src->pages_locked = 0;
>> +	src->pages_to_lock = 0;
>>  	src->enc_context_owner = NULL;
>>  
>> -	list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
>> +	list_cut_before(&dst->pinned_regions_list, &src->pinned_regions_list,
>> +			&src->pinned_regions_list);
>>  }
>>  
>>  static int sev_es_migrate_from(struct kvm *dst, struct kvm *src)
>> @@ -1862,8 +1923,7 @@ int svm_register_enc_region(struct kvm *kvm,
>>  			    struct kvm_enc_region *range)
>>  {
>>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> -	struct enc_region *region;
>> -	int ret = 0;
>> +	unsigned long npages;
>>  
>>  	if (!sev_guest(kvm))
>>  		return -ENOTTY;
>> @@ -1875,101 +1935,35 @@ int svm_register_enc_region(struct kvm *kvm,
>>  	if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
>>  		return -EINVAL;
>>  
>> -	region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
>> -	if (!region)
>> +	npages = get_npages(range->addr, range->size);
>> +	if (rlimit_memlock_exceeds(sev->pages_to_lock, npages)) {
>> +		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
>> +		       sev->pages_to_lock + npages,
>> +		       (rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT));
>>  		return -ENOMEM;
>> -
>> -	mutex_lock(&kvm->lock);
>> -	region->pages = sev_pin_memory(kvm, range->addr, range->size, &region->npages, 1);
>> -	if (IS_ERR(region->pages)) {
>> -		ret = PTR_ERR(region->pages);
>> -		mutex_unlock(&kvm->lock);
>> -		goto e_free;
>>  	}
>> +	sev->pages_to_lock += npages;
>>  
>> -	region->uaddr = range->addr;
>> -	region->size = range->size;
>> -
>> -	list_add_tail(&region->list, &sev->regions_list);
>> -	mutex_unlock(&kvm->lock);
>> -
>> -	/*
>> -	 * The guest may change the memory encryption attribute from C=0 -> C=1
>> -	 * or vice versa for this memory range. Lets make sure caches are
>> -	 * flushed to ensure that guest data gets written into memory with
>> -	 * correct C-bit.
>> -	 */
>> -	sev_clflush_pages(region->pages, region->npages);
>> -
>> -	return ret;
>> -
>> -e_free:
>> -	kfree(region);
>> -	return ret;
>> -}
>> -
>> -static struct enc_region *
>> -find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
>> -{
>> -	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> -	struct list_head *head = &sev->regions_list;
>> -	struct enc_region *i;
>> -
>> -	list_for_each_entry(i, head, list) {
>> -		if (i->uaddr == range->addr &&
>> -		    i->size == range->size)
>> -			return i;
>> -	}
>> -
>> -	return NULL;
>> -}
>> -
>> -static void __unregister_enc_region_locked(struct kvm *kvm,
>> -					   struct enc_region *region)
>> -{
>> -	sev_unpin_memory(kvm, region->pages, region->npages);
>> -	list_del(&region->list);
>> -	kfree(region);
>> +	return 0;
>>  }
>>  
>>  int svm_unregister_enc_region(struct kvm *kvm,
>>  			      struct kvm_enc_region *range)
>>  {
>> -	struct enc_region *region;
>> -	int ret;
>> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> +	unsigned long npages;
>>  
>>  	/* If kvm is mirroring encryption context it isn't responsible for it */
>>  	if (is_mirroring_enc_context(kvm))
>>  		return -EINVAL;
>>  
>> -	mutex_lock(&kvm->lock);
>> -
>> -	if (!sev_guest(kvm)) {
>> -		ret = -ENOTTY;
>> -		goto failed;
>> -	}
>> -
>> -	region = find_enc_region(kvm, range);
>> -	if (!region) {
>> -		ret = -EINVAL;
>> -		goto failed;
>> -	}
>> -
>> -	/*
>> -	 * Ensure that all guest tagged cache entries are flushed before
>> -	 * releasing the pages back to the system for use. CLFLUSH will
>> -	 * not do this, so issue a WBINVD.
>> -	 */
>> -	wbinvd_on_all_cpus();
>> +	if (!sev_guest(kvm))
>> +		return -ENOTTY;
>>  
>> -	__unregister_enc_region_locked(kvm, region);
>> +	npages = get_npages(range->addr, range->size);
>> +	sev->pages_to_lock -= npages;
>>  
>> -	mutex_unlock(&kvm->lock);
>>  	return 0;
>> -
>> -failed:
>> -	mutex_unlock(&kvm->lock);
>> -	return ret;
>>  }
>>  
>>  int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>> @@ -2018,7 +2012,7 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>>  	mirror_sev->fd = source_sev->fd;
>>  	mirror_sev->es_active = source_sev->es_active;
>>  	mirror_sev->handle = source_sev->handle;
>> -	INIT_LIST_HEAD(&mirror_sev->regions_list);
>> +	INIT_LIST_HEAD(&mirror_sev->pinned_regions_list);
>>  	ret = 0;
>>  
>>  	/*
>> @@ -2038,8 +2032,9 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
>>  void sev_vm_destroy(struct kvm *kvm)
>>  {
>>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> -	struct list_head *head = &sev->regions_list;
>> +	struct list_head *head = &sev->pinned_regions_list;
>>  	struct list_head *pos, *q;
>> +	struct pinned_region *region;
>>  
>>  	WARN_ON(sev->num_mirrored_vms);
>>  
>> @@ -2072,8 +2067,14 @@ void sev_vm_destroy(struct kvm *kvm)
>>  	 */
>>  	if (!list_empty(head)) {
>>  		list_for_each_safe(pos, q, head) {
>> -			__unregister_enc_region_locked(kvm,
>> -				list_entry(pos, struct enc_region, list));
>> +			/*
>> +			 * Unpin the memory that were pinned outside of MMU
>> +			 * demand pinning
>> +			 */
>> +			region = list_entry(pos, struct pinned_region, list);
>> +			__sev_unpin_memory(kvm, region->pages, region->npages);
>> +			list_del(&region->list);
>> +			kfree(region);
>>  			cond_resched();
>>  		}
>>  	}
>  So the guest memory has already been unpinned in sev_free_memslot().
>  Why do it again here?

Guest memory that was demand pinned gets freed via sev_free_memslot(). Regions that were
statically pinned, e.g. for SEV_LAUNCH_UPDATE_DATA, need to be freed here. These too will be
demand pinned in patch 9.

> 
>> @@ -2951,13 +2952,96 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>>  	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
>>  }
>>  
>> +bool sev_pin_pfn(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
>> +		 kvm_pfn_t pfn, hva_t hva, bool write, enum pg_level level)
>> +{
>> +	unsigned int npages = KVM_PAGES_PER_HPAGE(level);
>> +	unsigned int flags = FOLL_LONGTERM, npinned;
>> +	struct kvm_arch_memory_slot *aslot;
>> +	struct kvm *kvm = vcpu->kvm;
>> +	gfn_t gfn_start, rel_gfn;
>> +	struct page *page[1];
>> +	kvm_pfn_t old_pfn;
>> +
>> +	if (!sev_guest(kvm))
>> +		return true;
>> +
>> +	if (WARN_ON_ONCE(!memslot->arch.pfns))
>> +		return false;
>> +
>> +	if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
>> +		return false;
>> +
>> +	hva = ALIGN_DOWN(hva, npages << PAGE_SHIFT);
>> +	flags |= write ? FOLL_WRITE : 0;
>> +
>> +	mutex_lock(&kvm->slots_arch_lock);
>> +	gfn_start = hva_to_gfn_memslot(hva, memslot);
>> +	rel_gfn = gfn_start - memslot->base_gfn;
>> +	aslot = &memslot->arch;
>> +	if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
>> +		old_pfn = aslot->pfns[rel_gfn];
>> +		if (old_pfn == pfn)
>> +			goto out;
>> +
>> +		/* Flush the cache before releasing the page to the system */
>> +		sev_flush_guest_memory(to_svm(vcpu), __va(old_pfn),
>> +				       npages * PAGE_SIZE);
>> +		unpin_user_page(pfn_to_page(old_pfn));
>> +	}
>> +	/* Pin the page, KVM doesn't yet support page migration. */
>> +	npinned = pin_user_pages_fast(hva, 1, flags, page);
>> +	KVM_BUG(npinned != 1, kvm, "SEV: Pinning failed\n");
>> +	KVM_BUG(pfn != page_to_pfn(page[0]), kvm, "SEV: pfn mismatch\n");
>> +
>> +	if (!this_cpu_has(X86_FEATURE_SME_COHERENT))
>> +		clflush_cache_range(__va(pfn << PAGE_SHIFT), npages * PAGE_SIZE);
>> +
>> +	WARN_ON(rel_gfn >= memslot->npages);
>> +	aslot->pfns[rel_gfn] = pfn;
>> +	set_bit(rel_gfn, aslot->pinned_bitmap);
>> +
>> +out:
>> +	mutex_unlock(&kvm->slots_arch_lock);
>> +	return true;
>> +}
>> +
>>  void sev_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>>  {
>>  	struct kvm_arch_memory_slot *aslot = &slot->arch;
>> +	kvm_pfn_t *pfns;
>> +	gfn_t gfn;
>> +	int i;
>>  
>>  	if (!sev_guest(kvm))
>>  		return;
>>  
>> +	if (!aslot->pinned_bitmap || !slot->arch.pfns)
>> +		goto out;
>> +
>> +	pfns = aslot->pfns;
>> +
>> +	/*
>> +	 * Ensure that all guest tagged cache entries are flushed before
>> +	 * releasing the pages back to the system for use. CLFLUSH will
>> +	 * not do this, so issue a WBINVD.
>> +	 */
>> +	wbinvd_on_all_cpus();
> 
> This is a heavy-weight operation and is essentially needed at most once
> per VM shutdown. So it would be better to use a smaller hammer in the
> following code. Or, alternatively, maybe pass a boolean from the caller
> to skip the flush when it is set.

Or maybe I can do this in sev_vm_destroy() patch?

>> +
>> +	/*
>> +	 * Iterate the memslot to find the pinned pfn using the bitmap and drop
>> +	 * the pfn stored.
>> +	 */
>> +	for (i = 0, gfn = slot->base_gfn; i < slot->npages; i++, gfn++) {
>> +		if (test_and_clear_bit(i, aslot->pinned_bitmap)) {
>> +			if (WARN_ON(!pfns[i]))
>> +				continue;
>> +
>> +			unpin_user_page(pfn_to_page(pfns[i]));
>> +		}
>> +	}
>> +
>> +out:
>>  	if (aslot->pinned_bitmap) {
>>  		kvfree(aslot->pinned_bitmap);
>>  		aslot->pinned_bitmap = NULL;
>> @@ -2992,6 +3076,8 @@ int sev_alloc_memslot_metadata(struct kvm *kvm,
>>  		return -ENOMEM;
>>  
>>  	aslot->pinned_bitmap = kvzalloc(pinned_bytes, GFP_KERNEL_ACCOUNT);
>> +	new->flags |= KVM_MEMSLOT_ENCRYPTED;
>> +
>>  	if (!aslot->pinned_bitmap) {
>>  		kvfree(aslot->pfns);
>>  		aslot->pfns = NULL;
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index ec06421cb532..463a90ed6f83 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -4661,6 +4661,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>>  
>>  	.alloc_memslot_metadata = sev_alloc_memslot_metadata,
>>  	.free_memslot = sev_free_memslot,
>> +	.pin_pfn = sev_pin_pfn,
>>  };
>>  
>>  /*
>> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
>> index f00364020d7e..2f38e793ead0 100644
>> --- a/arch/x86/kvm/svm/svm.h
>> +++ b/arch/x86/kvm/svm/svm.h
>> @@ -75,8 +75,8 @@ struct kvm_sev_info {
>>  	unsigned int asid;	/* ASID used for this guest */
>>  	unsigned int handle;	/* SEV firmware handle */
>>  	int fd;			/* SEV device fd */
>> -	unsigned long pages_locked; /* Number of pages locked */
>> -	struct list_head regions_list;  /* List of registered regions */
>> +	unsigned long pages_to_lock; /* Number of page that can be locked */
>> +	struct list_head pinned_regions_list;  /* List of pinned regions */
>>  	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
>>  	struct kvm *enc_context_owner; /* Owner of copied encryption context */
>>  	unsigned long num_mirrored_vms; /* Number of VMs sharing this ASID */
>> @@ -621,5 +621,7 @@ int sev_alloc_memslot_metadata(struct kvm *kvm,
>>  			       struct kvm_memory_slot *new);
>>  void sev_free_memslot(struct kvm *kvm,
>>  		      struct kvm_memory_slot *slot);
>> +bool sev_pin_pfn(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
>> +		 kvm_pfn_t pfn, hva_t hva, bool write, enum pg_level level);
>>  
>>  #endif
>> -- 
>> 2.32.0
>>

Thanks for the review.

Regards
Nikunj

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 9/9] KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()
  2022-03-08  4:38 ` [PATCH RFC v1 9/9] KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data() Nikunj A Dadhania
@ 2022-03-09 16:57   ` Maciej S. Szmigiero
  2022-03-09 17:47     ` Nikunj A. Dadhania
  0 siblings, 1 reply; 29+ messages in thread
From: Maciej S. Szmigiero @ 2022-03-09 16:57 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Mingwei Zhang, David Hildenbrand, kvm,
	linux-kernel, Paolo Bonzini

On 8.03.2022 05:38, Nikunj A Dadhania wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Pin the memory for the data being passed to launch_update_data()
> because it gets encrypted before the guest is first run and must
> not be moved, as moving it would corrupt the data.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> [ * Use kvm_for_each_memslot_in_hva_range() to find slot and iterate
>    * Updated sev_pin_memory_in_mmu() error handling.
>    * As pinning/unpinning pages is handled within MMU, removed
>      {get,put}_user(). ]
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> ---
>   arch/x86/kvm/svm/sev.c | 146 +++++++++++++++++++++++++++++++++++++----
>   1 file changed, 134 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 7e39320fc65d..1c371268934b 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -22,6 +22,7 @@
>   #include <asm/trapnr.h>
>   #include <asm/fpu/xcr.h>
>   
> +#include "mmu.h"
>   #include "x86.h"
>   #include "svm.h"
>   #include "svm_ops.h"
> @@ -428,9 +429,93 @@ static void *sev_alloc_pages(struct kvm_sev_info *sev, unsigned long uaddr,
>   	return pages;
>   }
>   
> +#define SEV_PFERR_RO (PFERR_USER_MASK)
> +#define SEV_PFERR_RW (PFERR_WRITE_MASK | PFERR_USER_MASK)
> +
> +static struct page **sev_pin_memory_in_mmu(struct kvm *kvm, unsigned long addr,
> +					   unsigned long size,
> +					   unsigned long *npages)
> +{
> +	unsigned long hva_start, hva_end, uaddr, end, slot_start, slot_end;
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	struct interval_tree_node *node;
> +	struct kvm_memory_slot *slot;
> +	struct kvm_memslots *slots;
> +	int idx, ret = 0, i = 0;
> +	struct kvm_vcpu *vcpu;
> +	struct page **pages;
> +	kvm_pfn_t pfn;
> +	u32 err_code;
> +	gfn_t gfn;
> +
> +	pages = sev_alloc_pages(sev, addr, size, npages);
> +	if (IS_ERR(pages))
> +		return pages;
> +
> +	vcpu = kvm_get_vcpu(kvm, 0);
> +	if (mutex_lock_killable(&vcpu->mutex)) {
> +		kvfree(pages);
> +		return ERR_PTR(-EINTR);
> +	}
> +
> +	vcpu_load(vcpu);
> +	idx = srcu_read_lock(&kvm->srcu);
> +
> +	kvm_mmu_load(vcpu);
> +
> +	end = addr + (*npages << PAGE_SHIFT);
> +	slots = kvm_memslots(kvm);
> +
> +	kvm_for_each_memslot_in_hva_range(node, slots, addr, end) {
> +		slot = container_of(node, struct kvm_memory_slot,
> +				    hva_node[slots->node_idx]);
> +		slot_start = slot->userspace_addr;
> +		slot_end = slot_start + (slot->npages << PAGE_SHIFT);
> +		hva_start = max(addr, slot_start);
> +		hva_end = min(end, slot_end);
> +
> +		err_code = (slot->flags & KVM_MEM_READONLY) ?
> +			SEV_PFERR_RO : SEV_PFERR_RW;
> +
> +		for (uaddr = hva_start; uaddr < hva_end; uaddr += PAGE_SIZE) {
> +			if (signal_pending(current)) {
> +				ret = -ERESTARTSYS;
> +				break;
> +			}
> +
> +			if (need_resched())
> +				cond_resched();
> +
> +			/*
> +			 * Fault in the page and sev_pin_page() will handle the
> +			 * pinning
> +			 */
> +			gfn = hva_to_gfn_memslot(uaddr, slot);
> +			pfn = kvm_mmu_map_tdp_page(vcpu, gfn_to_gpa(gfn),
> +						   err_code, PG_LEVEL_4K);
> +			if (is_error_noslot_pfn(pfn)) {
> +				ret = -EFAULT;
> +				break;
> +			}
> +			pages[i++] = pfn_to_page(pfn);
> +		}
> +	}

This algorithm looks much better than the previous one - thanks!

By the way, as far as I know, there could be duplicates in the "page" array
above since the same hva can be mapped to multiple gfns (in different memslots).
Is the code prepared to deal with this possibility?

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 9/9] KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()
  2022-03-09 16:57   ` Maciej S. Szmigiero
@ 2022-03-09 17:47     ` Nikunj A. Dadhania
  0 siblings, 0 replies; 29+ messages in thread
From: Nikunj A. Dadhania @ 2022-03-09 17:47 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Mingwei Zhang, David Hildenbrand, kvm,
	linux-kernel, Paolo Bonzini

On 3/9/2022 10:27 PM, Maciej S. Szmigiero wrote:
> On 8.03.2022 05:38, Nikunj A Dadhania wrote:
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> Pin the memory for the data being passed to launch_update_data()
>> because it gets encrypted before the guest is first run and must
>> not be moved, as moving it would corrupt the data.
>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> [ * Use kvm_for_each_memslot_in_hva_range() to find slot and iterate
>>    * Updated sev_pin_memory_in_mmu() error handling.
>>    * As pinning/unpinning pages is handled within MMU, removed
>>      {get,put}_user(). ]
>> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
>> ---
>>   arch/x86/kvm/svm/sev.c | 146 +++++++++++++++++++++++++++++++++++++----
>>   1 file changed, 134 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 7e39320fc65d..1c371268934b 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -22,6 +22,7 @@
>>   #include <asm/trapnr.h>
>>   #include <asm/fpu/xcr.h>
>>   +#include "mmu.h"
>>   #include "x86.h"
>>   #include "svm.h"
>>   #include "svm_ops.h"
>> @@ -428,9 +429,93 @@ static void *sev_alloc_pages(struct kvm_sev_info *sev, unsigned long uaddr,
>>       return pages;
>>   }
>>   +#define SEV_PFERR_RO (PFERR_USER_MASK)
>> +#define SEV_PFERR_RW (PFERR_WRITE_MASK | PFERR_USER_MASK)
>> +
>> +static struct page **sev_pin_memory_in_mmu(struct kvm *kvm, unsigned long addr,
>> +                       unsigned long size,
>> +                       unsigned long *npages)
>> +{
>> +    unsigned long hva_start, hva_end, uaddr, end, slot_start, slot_end;
>> +    struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> +    struct interval_tree_node *node;
>> +    struct kvm_memory_slot *slot;
>> +    struct kvm_memslots *slots;
>> +    int idx, ret = 0, i = 0;
>> +    struct kvm_vcpu *vcpu;
>> +    struct page **pages;
>> +    kvm_pfn_t pfn;
>> +    u32 err_code;
>> +    gfn_t gfn;
>> +
>> +    pages = sev_alloc_pages(sev, addr, size, npages);
>> +    if (IS_ERR(pages))
>> +        return pages;
>> +
>> +    vcpu = kvm_get_vcpu(kvm, 0);
>> +    if (mutex_lock_killable(&vcpu->mutex)) {
>> +        kvfree(pages);
>> +        return ERR_PTR(-EINTR);
>> +    }
>> +
>> +    vcpu_load(vcpu);
>> +    idx = srcu_read_lock(&kvm->srcu);
>> +
>> +    kvm_mmu_load(vcpu);
>> +
>> +    end = addr + (*npages << PAGE_SHIFT);
>> +    slots = kvm_memslots(kvm);
>> +
>> +    kvm_for_each_memslot_in_hva_range(node, slots, addr, end) {
>> +        slot = container_of(node, struct kvm_memory_slot,
>> +                    hva_node[slots->node_idx]);
>> +        slot_start = slot->userspace_addr;
>> +        slot_end = slot_start + (slot->npages << PAGE_SHIFT);
>> +        hva_start = max(addr, slot_start);
>> +        hva_end = min(end, slot_end);
>> +
>> +        err_code = (slot->flags & KVM_MEM_READONLY) ?
>> +            SEV_PFERR_RO : SEV_PFERR_RW;
>> +
>> +        for (uaddr = hva_start; uaddr < hva_end; uaddr += PAGE_SIZE) {
>> +            if (signal_pending(current)) {
>> +                ret = -ERESTARTSYS;
>> +                break;
>> +            }
>> +
>> +            if (need_resched())
>> +                cond_resched();
>> +
>> +            /*
>> +             * Fault in the page and sev_pin_page() will handle the
>> +             * pinning
>> +             */
>> +            gfn = hva_to_gfn_memslot(uaddr, slot);
>> +            pfn = kvm_mmu_map_tdp_page(vcpu, gfn_to_gpa(gfn),
>> +                           err_code, PG_LEVEL_4K);
>> +            if (is_error_noslot_pfn(pfn)) {
>> +                ret = -EFAULT;
>> +                break;
>> +            }
>> +            pages[i++] = pfn_to_page(pfn);
>> +        }
>> +    }
> 
> This algorithm looks much better than the previous one - thanks!

Thanks for your feedback earlier. 

> By the way, as far as I know, there could be duplicates in the "page" array
> above since the same hva can be mapped to multiple gfns (in different memslots).
> Is the code prepared to deal with this possibility?

Yes, as the pinning is tracked per gfn with the pfn stored in the memslot metadata, the
same page can get pinned multiple times. Those references are dropped during the memslot
destroy path, so they all get unpinned.

Regards
Nikunj 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 5/9] KVM: SVM: Implement demand page pinning
  2022-03-09  5:10     ` Nikunj A. Dadhania
@ 2022-03-21  6:11       ` Mingwei Zhang
  2022-03-21  9:19         ` Nikunj A. Dadhania
  0 siblings, 1 reply; 29+ messages in thread
From: Mingwei Zhang @ 2022-03-21  6:11 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Brijesh Singh, Tom Lendacky,
	Peter Gonda, Bharata B Rao, Maciej S . Szmigiero,
	David Hildenbrand, kvm, linux-kernel

On Wed, Mar 09, 2022, Nikunj A. Dadhania wrote:
> On 3/9/2022 3:23 AM, Mingwei Zhang wrote:
> > On Tue, Mar 08, 2022, Nikunj A Dadhania wrote:
> >> Use the memslot metadata to store the pinned data along with the pfns.
> >> This improves the SEV guest startup time from O(n) to a constant by
> >> deferring guest page pinning until the pages are used to satisfy
> >> nested page faults. The page reference will be dropped in the memslot
> >> free path or deallocation path.
> >>
> >> Reuse enc_region structure definition as pinned_region to maintain
> >> pages that are pinned outside of MMU demand pinning. Remove rest of
> >> the code which did upfront pinning, as they are no longer needed in
> >> view of the demand pinning support.
> > 
> > I don't quite understand why we still need the enc_region. I have
> > several concerns. Details below.
> 
> With patch 9 the enc_region is used only for memory that was pinned before 
> the vcpu is online (i.e. mmu is not yet usable)
> 
> >>
> >> Retain svm_register_enc_region() and svm_unregister_enc_region() with
> >> required checks for resource limit.
> >>
> >> Guest boot time comparison
> >>   +---------------+----------------+-------------------+
> >>   | Guest Memory  |   baseline     |  Demand Pinning   |
> >>   | Size (GB)     |    (secs)      |     (secs)        |
> >>   +---------------+----------------+-------------------+
> >>   |      4        |     6.16       |      5.71         |
> >>   +---------------+----------------+-------------------+
> >>   |     16        |     7.38       |      5.91         |
> >>   +---------------+----------------+-------------------+
> >>   |     64        |    12.17       |      6.16         |
> >>   +---------------+----------------+-------------------+
> >>   |    128        |    18.20       |      6.50         |
> >>   +---------------+----------------+-------------------+
> >>   |    192        |    24.56       |      6.80         |
> >>   +---------------+----------------+-------------------+
> >>
> >> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> >> ---
> >>  arch/x86/kvm/svm/sev.c | 304 ++++++++++++++++++++++++++---------------
> >>  arch/x86/kvm/svm/svm.c |   1 +
> >>  arch/x86/kvm/svm/svm.h |   6 +-
> >>  3 files changed, 200 insertions(+), 111 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> >> index bd7572517c99..d0514975555d 100644
> >> --- a/arch/x86/kvm/svm/sev.c
> >> +++ b/arch/x86/kvm/svm/sev.c
> >> @@ -66,7 +66,7 @@ static unsigned int nr_asids;
> >>  static unsigned long *sev_asid_bitmap;
> >>  static unsigned long *sev_reclaim_asid_bitmap;
> >>  
> >> -struct enc_region {
> >> +struct pinned_region {
> >>  	struct list_head list;
> >>  	unsigned long npages;
> >>  	struct page **pages;
> >> @@ -257,7 +257,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >>  	if (ret)
> >>  		goto e_free;
> >>  
> >> -	INIT_LIST_HEAD(&sev->regions_list);
> >> +	INIT_LIST_HEAD(&sev->pinned_regions_list);
> >>  
> >>  	return 0;
> >>  
> >> @@ -378,16 +378,34 @@ static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >>  	return ret;
> >>  }
> >>  
> >> +static bool rlimit_memlock_exceeds(unsigned long locked, unsigned long npages)
> >> +{
> >> +	unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >> +	unsigned long lock_req;
> >> +
> >> +	lock_req = locked + npages;
> >> +	return (lock_req > lock_limit) && !capable(CAP_IPC_LOCK);
> >> +}
> >> +
> >> +static unsigned long get_npages(unsigned long uaddr, unsigned long ulen)
> >> +{
> >> +	unsigned long first, last;
> >> +
> >> +	/* Calculate number of pages. */
> >> +	first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
> >> +	last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
> >> +	return last - first + 1;
> >> +}
> >> +
> >>  static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
> >>  				    unsigned long ulen, unsigned long *n,
> >>  				    int write)
> >>  {
> >>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> >> +	struct pinned_region *region;
> >>  	unsigned long npages, size;
> >>  	int npinned;
> >> -	unsigned long locked, lock_limit;
> >>  	struct page **pages;
> >> -	unsigned long first, last;
> >>  	int ret;
> >>  
> >>  	lockdep_assert_held(&kvm->lock);
> >> @@ -395,15 +413,12 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
> >>  	if (ulen == 0 || uaddr + ulen < uaddr)
> >>  		return ERR_PTR(-EINVAL);
> >>  
> >> -	/* Calculate number of pages. */
> >> -	first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
> >> -	last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
> >> -	npages = (last - first + 1);
> >> +	npages = get_npages(uaddr, ulen);
> >>  
> >> -	locked = sev->pages_locked + npages;
> >> -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >> -	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> >> -		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit);
> >> +	if (rlimit_memlock_exceeds(sev->pages_to_lock, npages)) {
> >> +		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
> >> +			sev->pages_to_lock + npages,
> >> +			(rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT));
> >>  		return ERR_PTR(-ENOMEM);
> >>  	}
> >>  
> >> @@ -429,7 +444,19 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
> >>  	}
> >>  
> >>  	*n = npages;
> >> -	sev->pages_locked = locked;
> >> +	sev->pages_to_lock += npages;
> >> +
> >> +	/* Maintain region list that is pinned to be unpinned in vm destroy path */
> >> +	region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
> >> +	if (!region) {
> >> +		ret = -ENOMEM;
> >> +		goto err;
> >> +	}
> >> +	region->uaddr = uaddr;
> >> +	region->size = ulen;
> >> +	region->pages = pages;
> >> +	region->npages = npages;
> >> +	list_add_tail(&region->list, &sev->pinned_regions_list);
> > 
> > Hmm. I see a duplication of the metadata. We already store the pfns in
> > memslot. But now we also do it in regions. Is this one used for
> > migration purpose?
> 
> We are not duplicating, the enc_region holds regions that are pinned other 
> than svm_register_enc_region(). Later patches add infrastructure to directly 
> fault-in those pages which will use memslot->pfns. 
> 
> > 
> > I might miss some of the context here. 
> 
> More context here:
> https://lore.kernel.org/kvm/CAMkAt6p1-82LTRNB3pkPRwYh=wGpreUN=jcUeBj_dZt8ss9w0Q@mail.gmail.com/

Hmm. I think I get the point now. However, logically, I still think we
might not need double data structures for pinning. When the vcpu is not
online, we could use the array in the memslot to contain the pinned
pages, right?

Since user-level code is not allowed to pin arbitrary regions of HVA, we
could check that and bail out early if the region goes out of a memslot.

From that point, the only requirement is that we need a valid memslot
before doing memory encryption and pinning. So enc_region is still not
needed from this point.

This should save some time by avoiding double pinning and make the pinning
information clearer.

> 
> > But we may still have to support
> > the user-level memory pinning API as legacy case. Instead of duplicating
> > the storage, can we change the region list to the list of memslot->pfns
> > instead (Or using the struct **pages in memslot instead of pfns)? This
> > way APIs could still work but we don't need extra storage burden.
> 
> Right, patch 6 and 9 will achieve this using the memslot->pfns when the MMU 
> is active. We still need to maintain this enc_region if the user tries to pin
> memory before MMU is active(i.e. vcpu is online). In my testing, I havent 
> seen enc_region being used, but added to make sure we do not break any userspace.
> 
> > Anyway, I think it might be essentially needed to unify them together,
> > Otherwise, not only the metadata size is quite large, but also it is
> > confusing to have parallel data structures doing the same thing.
> >>
> >>  	return pages;
> >>  
> >> @@ -441,14 +468,43 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
> >>  	return ERR_PTR(ret);
> >>  }
> >>  
> >> -static void sev_unpin_memory(struct kvm *kvm, struct page **pages,
> >> -			     unsigned long npages)
> >> +static void __sev_unpin_memory(struct kvm *kvm, struct page **pages,
> >> +			       unsigned long npages)
> >>  {
> >>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> >>  
> >>  	unpin_user_pages(pages, npages);
> >>  	kvfree(pages);
> >> -	sev->pages_locked -= npages;
> >> +	sev->pages_to_lock -= npages;
> >> +}
> >> +
> >> +static struct pinned_region *find_pinned_region(struct kvm *kvm,
> >> +					     struct page **pages,
> >> +					     unsigned long n)
> >> +{
> >> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> >> +	struct list_head *head = &sev->pinned_regions_list;
> >> +	struct pinned_region *i;
> >> +
> >> +	list_for_each_entry(i, head, list) {
> >> +		if (i->pages == pages && i->npages == n)
> >> +			return i;
> >> +	}
> >> +
> >> +	return NULL;
> >> +}
> >> +
> >> +static void sev_unpin_memory(struct kvm *kvm, struct page **pages,
> >> +			     unsigned long npages)
> >> +{
> >> +	struct pinned_region *region;
> >> +
> >> +	region = find_pinned_region(kvm, pages, npages);
> >> +	__sev_unpin_memory(kvm, pages, npages);
> >> +	if (region) {
> >> +		list_del(&region->list);
> >> +		kfree(region);
> >> +	}
> >>  }
> >>  
> >>  static void sev_clflush_pages(struct page *pages[], unsigned long npages)
> >> @@ -551,8 +607,9 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >>  		set_page_dirty_lock(inpages[i]);
> >>  		mark_page_accessed(inpages[i]);
> >>  	}
> >> -	/* unlock the user pages */
> >> -	sev_unpin_memory(kvm, inpages, npages);
> >> +	/* unlock the user pages on error */
> >> +	if (ret)
> >> +		sev_unpin_memory(kvm, inpages, npages);
> >>  	return ret;
> >>  }
> >>  
> >> @@ -1059,7 +1116,8 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >>  		set_page_dirty_lock(pages[i]);
> >>  		mark_page_accessed(pages[i]);
> >>  	}
> >> -	sev_unpin_memory(kvm, pages, n);
> >> +	if (ret)
> >> +		sev_unpin_memory(kvm, pages, n);
> >>  	return ret;
> >>  }
> >>  
> >> @@ -1338,7 +1396,8 @@ static int sev_send_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >>  e_free_hdr:
> >>  	kfree(hdr);
> >>  e_unpin:
> >> -	sev_unpin_memory(kvm, guest_page, n);
> >> +	if (ret)
> >> +		sev_unpin_memory(kvm, guest_page, n);
> >>  
> >>  	return ret;
> >>  }
> >> @@ -1508,7 +1567,8 @@ static int sev_receive_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
> >>  	ret = sev_issue_cmd(kvm, SEV_CMD_RECEIVE_UPDATE_DATA, &data,
> >>  				&argp->error);
> >>  
> >> -	sev_unpin_memory(kvm, guest_page, n);
> >> +	if (ret)
> >> +		sev_unpin_memory(kvm, guest_page, n);
> >>  
> >>  e_free_trans:
> >>  	kfree(trans);
> >> @@ -1629,16 +1689,17 @@ static void sev_migrate_from(struct kvm_sev_info *dst,
> >>  	dst->active = true;
> >>  	dst->asid = src->asid;
> >>  	dst->handle = src->handle;
> >> -	dst->pages_locked = src->pages_locked;
> >> +	dst->pages_to_lock = src->pages_to_lock;
> >>  	dst->enc_context_owner = src->enc_context_owner;
> >>  
> >>  	src->asid = 0;
> >>  	src->active = false;
> >>  	src->handle = 0;
> >> -	src->pages_locked = 0;
> >> +	src->pages_to_lock = 0;
> >>  	src->enc_context_owner = NULL;
> >>  
> >> -	list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
> >> +	list_cut_before(&dst->pinned_regions_list, &src->pinned_regions_list,
> >> +			&src->pinned_regions_list);
> >>  }
> >>  
> >>  static int sev_es_migrate_from(struct kvm *dst, struct kvm *src)
> >> @@ -1862,8 +1923,7 @@ int svm_register_enc_region(struct kvm *kvm,
> >>  			    struct kvm_enc_region *range)
> >>  {
> >>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> >> -	struct enc_region *region;
> >> -	int ret = 0;
> >> +	unsigned long npages;
> >>  
> >>  	if (!sev_guest(kvm))
> >>  		return -ENOTTY;
> >> @@ -1875,101 +1935,35 @@ int svm_register_enc_region(struct kvm *kvm,
> >>  	if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
> >>  		return -EINVAL;
> >>  
> >> -	region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
> >> -	if (!region)
> >> +	npages = get_npages(range->addr, range->size);
> >> +	if (rlimit_memlock_exceeds(sev->pages_to_lock, npages)) {
> >> +		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
> >> +		       sev->pages_to_lock + npages,
> >> +		       (rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT));
> >>  		return -ENOMEM;
> >> -
> >> -	mutex_lock(&kvm->lock);
> >> -	region->pages = sev_pin_memory(kvm, range->addr, range->size, &region->npages, 1);
> >> -	if (IS_ERR(region->pages)) {
> >> -		ret = PTR_ERR(region->pages);
> >> -		mutex_unlock(&kvm->lock);
> >> -		goto e_free;
> >>  	}
> >> +	sev->pages_to_lock += npages;
> >>  
> >> -	region->uaddr = range->addr;
> >> -	region->size = range->size;
> >> -
> >> -	list_add_tail(&region->list, &sev->regions_list);
> >> -	mutex_unlock(&kvm->lock);
> >> -
> >> -	/*
> >> -	 * The guest may change the memory encryption attribute from C=0 -> C=1
> >> -	 * or vice versa for this memory range. Lets make sure caches are
> >> -	 * flushed to ensure that guest data gets written into memory with
> >> -	 * correct C-bit.
> >> -	 */
> >> -	sev_clflush_pages(region->pages, region->npages);
> >> -
> >> -	return ret;
> >> -
> >> -e_free:
> >> -	kfree(region);
> >> -	return ret;
> >> -}
> >> -
> >> -static struct enc_region *
> >> -find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
> >> -{
> >> -	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> >> -	struct list_head *head = &sev->regions_list;
> >> -	struct enc_region *i;
> >> -
> >> -	list_for_each_entry(i, head, list) {
> >> -		if (i->uaddr == range->addr &&
> >> -		    i->size == range->size)
> >> -			return i;
> >> -	}
> >> -
> >> -	return NULL;
> >> -}
> >> -
> >> -static void __unregister_enc_region_locked(struct kvm *kvm,
> >> -					   struct enc_region *region)
> >> -{
> >> -	sev_unpin_memory(kvm, region->pages, region->npages);
> >> -	list_del(&region->list);
> >> -	kfree(region);
> >> +	return 0;
> >>  }
> >>  
> >>  int svm_unregister_enc_region(struct kvm *kvm,
> >>  			      struct kvm_enc_region *range)
> >>  {
> >> -	struct enc_region *region;
> >> -	int ret;
> >> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> >> +	unsigned long npages;
> >>  
> >>  	/* If kvm is mirroring encryption context it isn't responsible for it */
> >>  	if (is_mirroring_enc_context(kvm))
> >>  		return -EINVAL;
> >>  
> >> -	mutex_lock(&kvm->lock);
> >> -
> >> -	if (!sev_guest(kvm)) {
> >> -		ret = -ENOTTY;
> >> -		goto failed;
> >> -	}
> >> -
> >> -	region = find_enc_region(kvm, range);
> >> -	if (!region) {
> >> -		ret = -EINVAL;
> >> -		goto failed;
> >> -	}
> >> -
> >> -	/*
> >> -	 * Ensure that all guest tagged cache entries are flushed before
> >> -	 * releasing the pages back to the system for use. CLFLUSH will
> >> -	 * not do this, so issue a WBINVD.
> >> -	 */
> >> -	wbinvd_on_all_cpus();
> >> +	if (!sev_guest(kvm))
> >> +		return -ENOTTY;
> >>  
> >> -	__unregister_enc_region_locked(kvm, region);
> >> +	npages = get_npages(range->addr, range->size);
> >> +	sev->pages_to_lock -= npages;
> >>  
> >> -	mutex_unlock(&kvm->lock);
> >>  	return 0;
> >> -
> >> -failed:
> >> -	mutex_unlock(&kvm->lock);
> >> -	return ret;
> >>  }
> >>  
> >>  int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> >> @@ -2018,7 +2012,7 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> >>  	mirror_sev->fd = source_sev->fd;
> >>  	mirror_sev->es_active = source_sev->es_active;
> >>  	mirror_sev->handle = source_sev->handle;
> >> -	INIT_LIST_HEAD(&mirror_sev->regions_list);
> >> +	INIT_LIST_HEAD(&mirror_sev->pinned_regions_list);
> >>  	ret = 0;
> >>  
> >>  	/*
> >> @@ -2038,8 +2032,9 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
> >>  void sev_vm_destroy(struct kvm *kvm)
> >>  {
> >>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> >> -	struct list_head *head = &sev->regions_list;
> >> +	struct list_head *head = &sev->pinned_regions_list;
> >>  	struct list_head *pos, *q;
> >> +	struct pinned_region *region;
> >>  
> >>  	WARN_ON(sev->num_mirrored_vms);
> >>  
> >> @@ -2072,8 +2067,14 @@ void sev_vm_destroy(struct kvm *kvm)
> >>  	 */
> >>  	if (!list_empty(head)) {
> >>  		list_for_each_safe(pos, q, head) {
> >> -			__unregister_enc_region_locked(kvm,
> >> -				list_entry(pos, struct enc_region, list));
> >> +			/*
> >> +			 * Unpin the memory that were pinned outside of MMU
> >> +			 * demand pinning
> >> +			 */
> >> +			region = list_entry(pos, struct pinned_region, list);
> >> +			__sev_unpin_memory(kvm, region->pages, region->npages);
> >> +			list_del(&region->list);
> >> +			kfree(region);
> >>  			cond_resched();
> >>  		}
> >>  	}
> >  So the guest memory has already been unpinned in sev_free_memslot().
> >  Why do it again here?
> 
> Guest memory that was demand pinned got freed using sev_free_memslot(). Regions that were 
> statically pinned for e.g. SEV_LAUNCH_UPDATE_DATA need to be freed here. This too will be 
> demand pinned in patch 9.
> 
> > 
> >> @@ -2951,13 +2952,96 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
> >>  	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
> >>  }
> >>  
> >> +bool sev_pin_pfn(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
> >> +		 kvm_pfn_t pfn, hva_t hva, bool write, enum pg_level level)
> >> +{
> >> +	unsigned int npages = KVM_PAGES_PER_HPAGE(level);
> >> +	unsigned int flags = FOLL_LONGTERM, npinned;
> >> +	struct kvm_arch_memory_slot *aslot;
> >> +	struct kvm *kvm = vcpu->kvm;
> >> +	gfn_t gfn_start, rel_gfn;
> >> +	struct page *page[1];
> >> +	kvm_pfn_t old_pfn;
> >> +
> >> +	if (!sev_guest(kvm))
> >> +		return true;
> >> +
> >> +	if (WARN_ON_ONCE(!memslot->arch.pfns))
> >> +		return false;
> >> +
> >> +	if (KVM_BUG_ON(level > PG_LEVEL_1G, kvm))
> >> +		return false;
> >> +
> >> +	hva = ALIGN_DOWN(hva, npages << PAGE_SHIFT);
> >> +	flags |= write ? FOLL_WRITE : 0;
> >> +
> >> +	mutex_lock(&kvm->slots_arch_lock);
> >> +	gfn_start = hva_to_gfn_memslot(hva, memslot);
> >> +	rel_gfn = gfn_start - memslot->base_gfn;
> >> +	aslot = &memslot->arch;
> >> +	if (test_bit(rel_gfn, aslot->pinned_bitmap)) {
> >> +		old_pfn = aslot->pfns[rel_gfn];
> >> +		if (old_pfn == pfn)
> >> +			goto out;
> >> +
> >> +		/* Flush the cache before releasing the page to the system */
> >> +		sev_flush_guest_memory(to_svm(vcpu), __va(old_pfn),
> >> +				       npages * PAGE_SIZE);
> >> +		unpin_user_page(pfn_to_page(old_pfn));
> >> +	}
> >> +	/* Pin the page, KVM doesn't yet support page migration. */
> >> +	npinned = pin_user_pages_fast(hva, 1, flags, page);
> >> +	KVM_BUG(npinned != 1, kvm, "SEV: Pinning failed\n");
> >> +	KVM_BUG(pfn != page_to_pfn(page[0]), kvm, "SEV: pfn mismatch\n");
> >> +
> >> +	if (!this_cpu_has(X86_FEATURE_SME_COHERENT))
> >> +		clflush_cache_range(__va(pfn << PAGE_SHIFT), npages * PAGE_SIZE);
> >> +
> >> +	WARN_ON(rel_gfn >= memslot->npages);
> >> +	aslot->pfns[rel_gfn] = pfn;
> >> +	set_bit(rel_gfn, aslot->pinned_bitmap);
> >> +
> >> +out:
> >> +	mutex_unlock(&kvm->slots_arch_lock);
> >> +	return true;
> >> +}
> >> +
> >>  void sev_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> >>  {
> >>  	struct kvm_arch_memory_slot *aslot = &slot->arch;
> >> +	kvm_pfn_t *pfns;
> >> +	gfn_t gfn;
> >> +	int i;
> >>  
> >>  	if (!sev_guest(kvm))
> >>  		return;
> >>  
> >> +	if (!aslot->pinned_bitmap || !slot->arch.pfns)
> >> +		goto out;
> >> +
> >> +	pfns = aslot->pfns;
> >> +
> >> +	/*
> >> +	 * Ensure that all guest tagged cache entries are flushed before
> >> +	 * releasing the pages back to the system for use. CLFLUSH will
> >> +	 * not do this, so issue a WBINVD.
> >> +	 */
> >> +	wbinvd_on_all_cpus();
> > 
> > This is a heavy-weight operation and is essentially only needed at most
> > once per VM shutdown. So, better using smaller hammer in the following
> > code. Or, alternatively, maybe passing a boolean from caller to avoid
> > flushing if true.
> 
> Or maybe I can do this in sev_vm_destroy() patch?
> 
> >> +
> >> +	/*
> >> +	 * Iterate the memslot to find the pinned pfn using the bitmap and drop
> >> +	 * the pfn stored.
> >> +	 */
> >> +	for (i = 0, gfn = slot->base_gfn; i < slot->npages; i++, gfn++) {
> >> +		if (test_and_clear_bit(i, aslot->pinned_bitmap)) {
> >> +			if (WARN_ON(!pfns[i]))
> >> +				continue;
> >> +
> >> +			unpin_user_page(pfn_to_page(pfns[i]));
> >> +		}
> >> +	}
> >> +
> >> +out:
> >>  	if (aslot->pinned_bitmap) {
> >>  		kvfree(aslot->pinned_bitmap);
> >>  		aslot->pinned_bitmap = NULL;
> >> @@ -2992,6 +3076,8 @@ int sev_alloc_memslot_metadata(struct kvm *kvm,
> >>  		return -ENOMEM;
> >>  
> >>  	aslot->pinned_bitmap = kvzalloc(pinned_bytes, GFP_KERNEL_ACCOUNT);
> >> +	new->flags |= KVM_MEMSLOT_ENCRYPTED;
> >> +
> >>  	if (!aslot->pinned_bitmap) {
> >>  		kvfree(aslot->pfns);
> >>  		aslot->pfns = NULL;
> >> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> >> index ec06421cb532..463a90ed6f83 100644
> >> --- a/arch/x86/kvm/svm/svm.c
> >> +++ b/arch/x86/kvm/svm/svm.c
> >> @@ -4661,6 +4661,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> >>  
> >>  	.alloc_memslot_metadata = sev_alloc_memslot_metadata,
> >>  	.free_memslot = sev_free_memslot,
> >> +	.pin_pfn = sev_pin_pfn,
> >>  };
> >>  
> >>  /*
> >> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> >> index f00364020d7e..2f38e793ead0 100644
> >> --- a/arch/x86/kvm/svm/svm.h
> >> +++ b/arch/x86/kvm/svm/svm.h
> >> @@ -75,8 +75,8 @@ struct kvm_sev_info {
> >>  	unsigned int asid;	/* ASID used for this guest */
> >>  	unsigned int handle;	/* SEV firmware handle */
> >>  	int fd;			/* SEV device fd */
> >> -	unsigned long pages_locked; /* Number of pages locked */
> >> -	struct list_head regions_list;  /* List of registered regions */
> >> +	unsigned long pages_to_lock; /* Number of page that can be locked */
> >> +	struct list_head pinned_regions_list;  /* List of pinned regions */
> >>  	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
> >>  	struct kvm *enc_context_owner; /* Owner of copied encryption context */
> >>  	unsigned long num_mirrored_vms; /* Number of VMs sharing this ASID */
> >> @@ -621,5 +621,7 @@ int sev_alloc_memslot_metadata(struct kvm *kvm,
> >>  			       struct kvm_memory_slot *new);
> >>  void sev_free_memslot(struct kvm *kvm,
> >>  		      struct kvm_memory_slot *slot);
> >> +bool sev_pin_pfn(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
> >> +		 kvm_pfn_t pfn, hva_t hva, bool write, enum pg_level level);
> >>  
> >>  #endif
> >> -- 
> >> 2.32.0
> >>
> 
> Thanks for the review.
> 
> Regards
> Nikunj

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 5/9] KVM: SVM: Implement demand page pinning
  2022-03-21  6:11       ` Mingwei Zhang
@ 2022-03-21  9:19         ` Nikunj A. Dadhania
  0 siblings, 0 replies; 29+ messages in thread
From: Nikunj A. Dadhania @ 2022-03-21  9:19 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Brijesh Singh, Tom Lendacky,
	Peter Gonda, Bharata B Rao, Maciej S . Szmigiero,
	David Hildenbrand, kvm, linux-kernel

On 3/21/2022 11:41 AM, Mingwei Zhang wrote:
> On Wed, Mar 09, 2022, Nikunj A. Dadhania wrote:
>> On 3/9/2022 3:23 AM, Mingwei Zhang wrote:
>>> On Tue, Mar 08, 2022, Nikunj A Dadhania wrote:
>>>> Use the memslot metadata to store the pinned data along with the pfns.
>>>> This improves the SEV guest startup time from O(n) to a constant by
>>>> deferring guest page pinning until the pages are used to satisfy
>>>> nested page faults. The page reference will be dropped in the memslot
>>>> free path or deallocation path.
>>>>
>>>> Reuse enc_region structure definition as pinned_region to maintain
>>>> pages that are pinned outside of MMU demand pinning. Remove rest of
>>>> the code which did upfront pinning, as they are no longer needed in
>>>> view of the demand pinning support.
>>>
>>> I don't quite understand why we still need the enc_region. I have
>>> several concerns. Details below.
>>
>> With patch 9 the enc_region is used only for memory that was pinned before 
>> the vcpu is online (i.e. mmu is not yet usable)
>>
>>>>
>>>> Retain svm_register_enc_region() and svm_unregister_enc_region() with
>>>> required checks for resource limit.
>>>>
>>>> Guest boot time comparison
>>>>   +---------------+----------------+-------------------+
>>>>   | Guest Memory  |   baseline     |  Demand Pinning   |
>>>>   | Size (GB)     |    (secs)      |     (secs)        |
>>>>   +---------------+----------------+-------------------+
>>>>   |      4        |     6.16       |      5.71         |
>>>>   +---------------+----------------+-------------------+
>>>>   |     16        |     7.38       |      5.91         |
>>>>   +---------------+----------------+-------------------+
>>>>   |     64        |    12.17       |      6.16         |
>>>>   +---------------+----------------+-------------------+
>>>>   |    128        |    18.20       |      6.50         |
>>>>   +---------------+----------------+-------------------+
>>>>   |    192        |    24.56       |      6.80         |
>>>>   +---------------+----------------+-------------------+
>>>>
>>>> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
>>>> ---
>>>>  arch/x86/kvm/svm/sev.c | 304 ++++++++++++++++++++++++++---------------
>>>>  arch/x86/kvm/svm/svm.c |   1 +
>>>>  arch/x86/kvm/svm/svm.h |   6 +-
>>>>  3 files changed, 200 insertions(+), 111 deletions(-)
>>>>

<SNIP>

>>>>  static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>>>>  				    unsigned long ulen, unsigned long *n,
>>>>  				    int write)
>>>>  {
>>>>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>>> +	struct pinned_region *region;
>>>>  	unsigned long npages, size;
>>>>  	int npinned;
>>>> -	unsigned long locked, lock_limit;
>>>>  	struct page **pages;
>>>> -	unsigned long first, last;
>>>>  	int ret;
>>>>  
>>>>  	lockdep_assert_held(&kvm->lock);
>>>> @@ -395,15 +413,12 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>>>>  	if (ulen == 0 || uaddr + ulen < uaddr)
>>>>  		return ERR_PTR(-EINVAL);
>>>>  
>>>> -	/* Calculate number of pages. */
>>>> -	first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
>>>> -	last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
>>>> -	npages = (last - first + 1);
>>>> +	npages = get_npages(uaddr, ulen);
>>>>  
>>>> -	locked = sev->pages_locked + npages;
>>>> -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>>>> -	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
>>>> -		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit);
>>>> +	if (rlimit_memlock_exceeds(sev->pages_to_lock, npages)) {
>>>> +		pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
>>>> +			sev->pages_to_lock + npages,
>>>> +			(rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT));
>>>>  		return ERR_PTR(-ENOMEM);
>>>>  	}
>>>>  
>>>> @@ -429,7 +444,19 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
>>>>  	}
>>>>  
>>>>  	*n = npages;
>>>> -	sev->pages_locked = locked;
>>>> +	sev->pages_to_lock += npages;
>>>> +
>>>> +	/* Maintain region list that is pinned to be unpinned in vm destroy path */
>>>> +	region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
>>>> +	if (!region) {
>>>> +		ret = -ENOMEM;
>>>> +		goto err;
>>>> +	}
>>>> +	region->uaddr = uaddr;
>>>> +	region->size = ulen;
>>>> +	region->pages = pages;
>>>> +	region->npages = npages;
>>>> +	list_add_tail(&region->list, &sev->pinned_regions_list);
>>>
>>> Hmm. I see a duplication of the metadata. We already store the pfns in
>>> memslot. But now we also do it in regions. Is this one used for
>>> migration purpose?
>>
>> We are not duplicating, the enc_region holds regions that are pinned other 
>> than svm_register_enc_region(). Later patches add infrastructure to directly 
>> fault-in those pages which will use memslot->pfns. 
>>
>>>
>>> I might miss some of the context here. 
>>
>> More context here:
>> https://lore.kernel.org/kvm/CAMkAt6p1-82LTRNB3pkPRwYh=wGpreUN=jcUeBj_dZt8ss9w0Q@mail.gmail.com/
> 
> Hmm. I think I get the point now. However, logically, I still think we
> might not need double data structures for pinning. When the vcpu is not
> online, we could use the array in the memslot to contain the pinned
> pages, right?

Yes.

> Since user-level code is not allowed to pin arbitrary regions of HVA, we
> could check that and bail out early if the region goes out of a memslot.
> 
> From that point, the only requirement is that we need a valid memslot
> before doing memory encryption and pinning. So enc_region is still not
> needed from this point.
> 
> This should save some time by avoiding double pinning and make the pinning
> information clearer.

Agreed, I think that should be possible:

* Check for addr/end being part of a memslot
* Error out in case it is not part of any memslot
* Add __sev_pin_pfn() which is not dependent on vcpu arg.
* Iterate over the pages and use __sev_pin_pfn() routine to pin.
	slots = kvm_memslots(kvm);
	kvm_for_each_memslot_in_hva_range(node, slots, addr, end) {
		slot = container_of(node, struct kvm_memory_slot,
			    hva_node[slots->node_idx]);
		slot_start = slot->userspace_addr;
		slot_end = slot_start + (slot->npages << PAGE_SHIFT);
		hva_start = max(addr, slot_start);
		hva_end = min(end, slot_end);
		for (uaddr = hva_start; uaddr < hva_end; uaddr += PAGE_SIZE) {
			__sev_pin_pfn(slot, uaddr, PG_LEVEL_4K);
		}
	}

This will make sure the memslot-based data structure is used and the enc_region can be removed.
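
A minimal sketch of such a vcpu-independent helper could look like the below
(hypothetical code, not part of the posted series: it reuses the pfns/pinned_bitmap
memslot metadata from patch 5, only handles PG_LEVEL_4K, and assumes the caller
serializes against memslot updates, e.g. via slots_arch_lock):

	/* Pin one 4K page at @uaddr and record it in the memslot metadata. */
	static int __sev_pin_pfn(struct kvm_memory_slot *slot, unsigned long uaddr,
				 enum pg_level level)
	{
		struct kvm_arch_memory_slot *aslot = &slot->arch;
		unsigned int flags = FOLL_WRITE | FOLL_LONGTERM;
		struct page *page;
		gfn_t rel_gfn;

		if (WARN_ON_ONCE(level != PG_LEVEL_4K))
			return -EINVAL;

		/* Long-term pin, the page backs encrypted guest memory. */
		if (pin_user_pages_fast(uaddr, 1, flags, &page) != 1)
			return -EFAULT;

		rel_gfn = hva_to_gfn_memslot(uaddr, slot) - slot->base_gfn;
		aslot->pfns[rel_gfn] = page_to_pfn(page);
		set_bit(rel_gfn, aslot->pinned_bitmap);

		return 0;
	}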

Regards
Nikunj
  



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
  2022-03-08  4:38 [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Nikunj A Dadhania
                   ` (8 preceding siblings ...)
  2022-03-08  4:38 ` [PATCH RFC v1 9/9] KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data() Nikunj A Dadhania
@ 2022-03-28 21:00 ` Sean Christopherson
  2022-03-30  4:42   ` Nikunj A. Dadhania
  9 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2022-03-28 21:00 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel

On Tue, Mar 08, 2022, Nikunj A Dadhania wrote:
> This is a follow-up to the RFC implementation [1] that incorporates
> review feedback and bug fixes. See the "RFC v1" section below for a 
> list of changes.

Heh, for future reference, the initial posting of a series/patch/RFC is implicitly
v1, i.e. this should be RFC v2.

> SEV guest requires the guest's pages to be pinned in host physical
> memory as migration of encrypted pages is not supported. The memory
> encryption scheme uses the physical address of the memory being
> encrypted. If guest pages are moved by the host, content decrypted in
> the guest would be incorrect thereby corrupting guest's memory.
> 
> For SEV/SEV-ES guests, the hypervisor doesn't know which pages are
> encrypted and when the guest is done using those pages. Hypervisor
> should treat all the guest pages as encrypted until they are 
> deallocated or the guest is destroyed.
> 
> While provision a pfn, make KVM aware that guest pages need to be 
> pinned for long-term and use appropriate pin_user_pages API for these
> special encrypted memory regions. KVM takes the first reference and
> holds it until a mapping is done. Take an extra reference before KVM
> releases the pfn. 
> 
> Actual pinning management is handled by vendor code via new
> kvm_x86_ops hooks. MMU calls in to vendor code to pin the page on
> demand. Metadata of the pinning is stored in architecture specific
> memslot area. During the memslot freeing path and deallocation path
> guest pages are unpinned.
> 
> Guest boot time comparison:
> +---------------+----------------+-------------------+
> | Guest Memory  |   baseline     |  Demand Pinning + |
> | Size (GB)     | v5.17-rc6(secs)| v5.17-rc6(secs)   |
> +---------------+----------------+-------------------+
> |      4        |     6.16       |      5.71         |
> +---------------+----------------+-------------------+
> |     16        |     7.38       |      5.91         |
> +---------------+----------------+-------------------+
> |     64        |    12.17       |      6.16         |
> +---------------+----------------+-------------------+
> |    128        |    18.20       |      6.50         |
> +---------------+----------------+-------------------+
> |    192        |    24.56       |      6.80         |
> +---------------+----------------+-------------------+

Let me preface this by saying I generally like the idea and especially the
performance, but...

I think we should abandon this approach in favor of committing all our resources
to fd-based private memory[*], which (if done right) will provide on-demand pinning
for "free".  I would much rather get that support merged sooner than later, and use
it as a carrot for legacy SEV to get users to move over to its new APIs, with a long
term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private
memory.  That would require guest kernel support to communicate private vs. shared,
but SEV guests already "need" to do that to play nice with live migration, so it's
not a big ask, just another carrot to entice guests/customers to update their kernel
(and possibly users to update their guest firmware).

This series isn't awful by any means, but it requires poking into core flows and
further complicates paths that are already anything but simple.  And things like
conditionally grabbing vCPU0 to pin pages in its MMU make me flinch.  And I think
the situation would only get worse by the time all the bugs and corner cases are
ironed out.  E.g. this code is wrong:

  void kvm_release_pfn_clean(kvm_pfn_t pfn)
  {
	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) {
		struct page *page = pfn_to_page(pfn);

		if (page_maybe_dma_pinned(page))
			unpin_user_page(page);
		else
			put_page(page);
	}
  }
  EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

Because (a) page_maybe_dma_pinned() is susceptible to false positives (clearly
documented), and (b) even if it didn't get false positives, there's no guarantee
that _KVM_ owns a pin of the page.

It's not an impossible problem to solve, but I suspect any solution will require
either touching a lot of code or will be fragile and difficult to maintain, e.g.
by auditing all users to understand which need to pin and which don't.  Even if
we _always_ pin memory for SEV guests, we'd still need to plumb the "is SEV guest"
info around.

And FWIW, my years-old idea of using a software-available SPTE bit to track pinned
pages is plagued by the same underlying issue: KVM's current management (or lack
thereof) of SEV guest memory just isn't viable long term.  In all honesty, it
probably should never have been merged.  We can't change the past, but we can, 
and IMO should, avoid piling on more code to an approach that is fundamentally flawed.
	
[*] https://lore.kernel.org/all/20220310140911.50924-1-chao.p.peng@linux.intel.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 2/9] KVM: x86/mmu: Move hugepage adjust to direct_page_fault
  2022-03-08  4:38 ` [PATCH RFC v1 2/9] KVM: x86/mmu: Move hugepage adjust to direct_page_fault Nikunj A Dadhania
@ 2022-03-28 21:04   ` Sean Christopherson
  0 siblings, 0 replies; 29+ messages in thread
From: Sean Christopherson @ 2022-03-28 21:04 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel

On Tue, Mar 08, 2022, Nikunj A Dadhania wrote:
> Both TDP MMU and legacy MMU do hugepage adjust in the mapping routine.
> Adjust the pfn early in the common code. This will be used by the
> following patches for pinning the pages.
> 
> No functional change intended.

There is a functional change here, as kvm_mmu_hugepage_adjust() is now called
without mmu_lock being held.  That really shouldn't be problematic, but sadly KVM
very, very subtly relies on calling lookup_address_in_mm() while holding mmu_lock
_and_ after checking mmu_notifier_retry_hva().

https://lore.kernel.org/all/CAL715WL7ejOBjzXy9vbS_M2LmvXcC-CxmNr+oQtCZW0kciozHA@mail.gmail.com

> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 4 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c | 2 --
>  2 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8e24f73bf60b..db1feecd6fed 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2940,8 +2940,6 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	int ret;
>  	gfn_t base_gfn = fault->gfn;
>  
> -	kvm_mmu_hugepage_adjust(vcpu, fault);
> -
>  	trace_kvm_mmu_spte_requested(fault);
>  	for_each_shadow_entry(vcpu, fault->addr, it) {
>  		/*
> @@ -4035,6 +4033,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  
>  	r = RET_PF_RETRY;
>  
> +	kvm_mmu_hugepage_adjust(vcpu, fault);
> +
>  	if (is_tdp_mmu_fault)
>  		read_lock(&vcpu->kvm->mmu_lock);
>  	else
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index bc9e3553fba2..e03bf59b2f81 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -959,8 +959,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	u64 new_spte;
>  	int ret;
>  
> -	kvm_mmu_hugepage_adjust(vcpu, fault);
> -
>  	trace_kvm_mmu_spte_requested(fault);
>  
>  	rcu_read_lock();
> -- 
> 2.32.0
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
  2022-03-28 21:00 ` [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Sean Christopherson
@ 2022-03-30  4:42   ` Nikunj A. Dadhania
  2022-03-30 19:47     ` Sean Christopherson
  0 siblings, 1 reply; 29+ messages in thread
From: Nikunj A. Dadhania @ 2022-03-30  4:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel

On 3/29/2022 2:30 AM, Sean Christopherson wrote:
> On Tue, Mar 08, 2022, Nikunj A Dadhania wrote:
>> This is a follow-up to the RFC implementation [1] that incorporates
>> review feedback and bug fixes. See the "RFC v1" section below for a 
>> list of changes.
> 
> Heh, for future reference, the initial posting of a series/patch/RFC is implicitly
> v1, i.e. this should be RFC v2.

Sure.

> 
>> SEV guest requires the guest's pages to be pinned in host physical
>> memory as migration of encrypted pages is not supported. The memory
>> encryption scheme uses the physical address of the memory being
>> encrypted. If guest pages are moved by the host, content decrypted in
>> the guest would be incorrect thereby corrupting guest's memory.
>>
>> For SEV/SEV-ES guests, the hypervisor doesn't know which pages are
>> encrypted and when the guest is done using those pages. Hypervisor
>> should treat all the guest pages as encrypted until they are 
>> deallocated or the guest is destroyed.
>>
>> While provision a pfn, make KVM aware that guest pages need to be 
>> pinned for long-term and use appropriate pin_user_pages API for these
>> special encrypted memory regions. KVM takes the first reference and
>> holds it until a mapping is done. Take an extra reference before KVM
>> releases the pfn. 
>>
>> Actual pinning management is handled by vendor code via new
>> kvm_x86_ops hooks. MMU calls in to vendor code to pin the page on
>> demand. Metadata of the pinning is stored in architecture specific
>> memslot area. During the memslot freeing path and deallocation path
>> guest pages are unpinned.
>>
>> Guest boot time comparison:
>> +---------------+----------------+-------------------+
>> | Guest Memory  |   baseline     |  Demand Pinning + |
>> | Size (GB)     | v5.17-rc6(secs)| v5.17-rc6(secs)   |
>> +---------------+----------------+-------------------+
>> |      4        |     6.16       |      5.71         |
>> +---------------+----------------+-------------------+
>> |     16        |     7.38       |      5.91         |
>> +---------------+----------------+-------------------+
>> |     64        |    12.17       |      6.16         |
>> +---------------+----------------+-------------------+
>> |    128        |    18.20       |      6.50         |
>> +---------------+----------------+-------------------+
>> |    192        |    24.56       |      6.80         |
>> +---------------+----------------+-------------------+
> 
> Let me preface this by saying I generally like the idea and especially the
> performance, but...
> 
> I think we should abandon this approach in favor of committing all our resources
> to fd-based private memory[*], which (if done right) will provide on-demand pinning
> for "free".  

I will give this a try for SEV, was on my todo list.

> I would much rather get that support merged sooner than later, and use
> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long
> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private
> memory.  

> That would require guest kernel support to communicate private vs. shared,

Could you explain this in more detail? Is this required for punching holes for shared pages?

> but SEV guests already "need" to do that to play nice with live migration, so it's
> not a big ask, just another carrot to entice guests/customers to update their kernel
> (and possibly users to update their guest firmware).
> 
> This series isn't awful by any means, but it requires poking into core flows and
> further complicates paths that are already anything but simple.  And things like
> conditionally grabbing vCPU0 to pin pages in its MMU make me flinch.  And I think
> the situation would only get worse by the time all the bugs and corner cases are
> ironed out.  E.g. this code is wrong:
> 
>   void kvm_release_pfn_clean(kvm_pfn_t pfn)
>   {
> 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) {
> 		struct page *page = pfn_to_page(pfn);
> 
> 		if (page_maybe_dma_pinned(page))
> 			unpin_user_page(page);
> 		else
> 			put_page(page);
> 	}
>   }
>   EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
> 
> Because (a) page_maybe_dma_pinned() is susceptible to false positives (clearly
> documented), and (b) even if it didn't get false positives, there's no guarantee
> that _KVM_ owns a pin of the page.

Right, the pinning could have been done by some other subsystem.

> 
> It's not an impossible problem to solve, but I suspect any solution will require
> either touching a lot of code or will be fragile and difficult to maintain, e.g.
> by auditing all users to understand which need to pin and which don't.  Even if
> we _always_ pin memory for SEV guests, we'd still need to plumb the "is SEV guest"
> info around.
> 
> And FWIW, my years-old idea of using a software-available SPTE bit to track pinned
> pages is plagued by the same underlying issue: KVM's current management (or lack
> thereof) of SEV guest memory just isn't viable long term.  In all honesty, it
> probably should never have been merged.  We can't change the past, but we can, 
> and IMO should, avoid piling on more code to an approach that is fundamentally flawed.
> 	
> [*] https://lore.kernel.org/all/20220310140911.50924-1-chao.p.peng@linux.intel.com
>

Thanks for the valuable feedback. 

Regards
Nikunj


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
  2022-03-30  4:42   ` Nikunj A. Dadhania
@ 2022-03-30 19:47     ` Sean Christopherson
  2022-03-31  4:48       ` Nikunj A. Dadhania
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2022-03-30 19:47 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel

On Wed, Mar 30, 2022, Nikunj A. Dadhania wrote:
> On 3/29/2022 2:30 AM, Sean Christopherson wrote:
> > Let me preface this by saying I generally like the idea and especially the
> > performance, but...
> > 
> > I think we should abandon this approach in favor of committing all our resources
> > to fd-based private memory[*], which (if done right) will provide on-demand pinning
> > for "free".  
> 
> I will give this a try for SEV, was on my todo list.
> 
> > I would much rather get that support merged sooner than later, and use
> > it as a carrot for legacy SEV to get users to move over to its new APIs, with a long
> > term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private
> > memory.  
> 
> > That would require guest kernel support to communicate private vs. shared,
> 
> Could you explain this in more detail? This is required for punching hole for shared pages?

Unlike SEV-SNP, which enumerates private vs. shared in the error code, SEV and SEV-ES
don't provide private vs. shared information to the host (KVM) on page fault.  And
it's even more fundamental than that, as SEV/SEV-ES won't even fault if the guest
accesses the "wrong" GPA variant, they'll silent consume/corrupt data.

That means KVM can't support implicit conversions for SEV/SEV-ES, and so an explicit
hypercall is mandatory.  SEV doesn't even have a vendor-agnostic guest/host paravirt
ABI, and IIRC SEV-ES doesn't provide a conversion/map hypercall in the GHCB spec, so
running a SEV/SEV-ES guest under UPM would require the guest firmware+kernel to be
properly enlightened beyond what is required architecturally.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
  2022-03-30 19:47     ` Sean Christopherson
@ 2022-03-31  4:48       ` Nikunj A. Dadhania
  2022-03-31 18:32         ` Peter Gonda
  0 siblings, 1 reply; 29+ messages in thread
From: Nikunj A. Dadhania @ 2022-03-31  4:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Peter Gonda,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm, linux-kernel



On 3/31/2022 1:17 AM, Sean Christopherson wrote:
> On Wed, Mar 30, 2022, Nikunj A. Dadhania wrote:
>> On 3/29/2022 2:30 AM, Sean Christopherson wrote:
>>> Let me preface this by saying I generally like the idea and especially the
>>> performance, but...
>>>
>>> I think we should abandon this approach in favor of committing all our resources
>>> to fd-based private memory[*], which (if done right) will provide on-demand pinning
>>> for "free".  
>>
>> I will give this a try for SEV, was on my todo list.
>>
>>> I would much rather get that support merged sooner than later, and use
>>> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long
>>> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private
>>> memory.  
>>
>>> That would require guest kernel support to communicate private vs. shared,
>>
>> Could you explain this in more detail? This is required for punching hole for shared pages?
> 
> Unlike SEV-SNP, which enumerates private vs. shared in the error code, SEV and SEV-ES
> don't provide private vs. shared information to the host (KVM) on page fault.  And
> it's even more fundamental then that, as SEV/SEV-ES won't even fault if the guest
> accesses the "wrong" GPA variant, they'll silent consume/corrupt data.
> 
> That means KVM can't support implicit conversions for SEV/SEV-ES, and so an explicit
> hypercall is mandatory.  SEV doesn't even have a vendor-agnostic guest/host paravirt
> ABI, and IIRC SEV-ES doesn't provide a conversion/map hypercall in the GHCB spec, so
> running a SEV/SEV-ES guest under UPM would require the guest firmware+kernel to be
> properly enlightened beyond what is required architecturally.
> 

So with the guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and the host (KVM) supporting
the KVM_HC_MAP_GPA_RANGE hypercall, a SEV/SEV-ES guest can communicate private/shared
pages to the hypervisor; this information can be used to mark pages shared/private.
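
On the guest side that boils down to something like the sketch below (hedged
example: notify_enc_status() is a made-up wrapper name; the hypercall ABI and
the KVM_MAP_GPA_RANGE_* flags are taken from the KVM_HC_MAP_GPA_RANGE
documentation and include/uapi/linux/kvm_para.h):

	/* Tell the host that @npages at @gpa changed their encryption status. */
	static void notify_enc_status(unsigned long gpa, int npages, bool enc)
	{
		unsigned long attrs = KVM_MAP_GPA_RANGE_PAGE_SZ_4K |
				      (enc ? KVM_MAP_GPA_RANGE_ENCRYPTED :
					     KVM_MAP_GPA_RANGE_DECRYPTED);

		kvm_hypercall3(KVM_HC_MAP_GPA_RANGE, gpa, npages, attrs);
	}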

Regards,
Nikunj

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
  2022-03-31  4:48       ` Nikunj A. Dadhania
@ 2022-03-31 18:32         ` Peter Gonda
  2022-03-31 19:00           ` Sean Christopherson
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Gonda @ 2022-03-31 18:32 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Brijesh Singh, Tom Lendacky,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm list, LKML

On Wed, Mar 30, 2022 at 10:48 PM Nikunj A. Dadhania <nikunj@amd.com> wrote:
>
>
>
> On 3/31/2022 1:17 AM, Sean Christopherson wrote:
> > On Wed, Mar 30, 2022, Nikunj A. Dadhania wrote:
> >> On 3/29/2022 2:30 AM, Sean Christopherson wrote:
> >>> Let me preface this by saying I generally like the idea and especially the
> >>> performance, but...
> >>>
> >>> I think we should abandon this approach in favor of committing all our resources
> >>> to fd-based private memory[*], which (if done right) will provide on-demand pinning
> >>> for "free".
> >>
> >> I will give this a try for SEV, was on my todo list.
> >>
> >>> I would much rather get that support merged sooner than later, and use
> >>> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long
> >>> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private
> >>> memory.
> >>
> >>> That would require guest kernel support to communicate private vs. shared,
> >>
> >> Could you explain this in more detail? This is required for punching hole for shared pages?
> >
> > Unlike SEV-SNP, which enumerates private vs. shared in the error code, SEV and SEV-ES
> > don't provide private vs. shared information to the host (KVM) on page fault.  And
> > it's even more fundamental then that, as SEV/SEV-ES won't even fault if the guest
> > accesses the "wrong" GPA variant, they'll silent consume/corrupt data.
> >
> > That means KVM can't support implicit conversions for SEV/SEV-ES, and so an explicit
> > hypercall is mandatory.  SEV doesn't even have a vendor-agnostic guest/host paravirt
> > ABI, and IIRC SEV-ES doesn't provide a conversion/map hypercall in the GHCB spec, so
> > running a SEV/SEV-ES guest under UPM would require the guest firmware+kernel to be
> > properly enlightened beyond what is required architecturally.
> >
>
> So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting
> KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared
> pages to the hypervisor, this information can be used to mark page shared/private.

One concern here may be that the VMM doesn't know which guests have
KVM_FEATURE_HC_MAP_GPA_RANGE support and which don't. Only once the
guest boots does the guest tell KVM that it supports
KVM_FEATURE_HC_MAP_GPA_RANGE. If the guest doesn't, we need to pin all
the memory before we run the guest to be safe.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
  2022-03-31 18:32         ` Peter Gonda
@ 2022-03-31 19:00           ` Sean Christopherson
  2022-04-01  3:22             ` Nikunj A. Dadhania
  2022-04-01 17:28             ` Marc Orr
  0 siblings, 2 replies; 29+ messages in thread
From: Sean Christopherson @ 2022-03-31 19:00 UTC (permalink / raw)
  To: Peter Gonda
  Cc: Nikunj A. Dadhania, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Brijesh Singh, Tom Lendacky,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm list, LKML

On Thu, Mar 31, 2022, Peter Gonda wrote:
> On Wed, Mar 30, 2022 at 10:48 PM Nikunj A. Dadhania <nikunj@amd.com> wrote:
> > On 3/31/2022 1:17 AM, Sean Christopherson wrote:
> > > On Wed, Mar 30, 2022, Nikunj A. Dadhania wrote:
> > >> On 3/29/2022 2:30 AM, Sean Christopherson wrote:
> > >>> Let me preface this by saying I generally like the idea and especially the
> > >>> performance, but...
> > >>>
> > >>> I think we should abandon this approach in favor of committing all our resources
> > >>> to fd-based private memory[*], which (if done right) will provide on-demand pinning
> > >>> for "free".
> > >>
> > >> I will give this a try for SEV, was on my todo list.
> > >>
> > >>> I would much rather get that support merged sooner than later, and use
> > >>> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long
> > >>> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private
> > >>> memory.
> > >>
> > >>> That would require guest kernel support to communicate private vs. shared,
> > >>
> > >> Could you explain this in more detail? This is required for punching hole for shared pages?
> > >
> > > Unlike SEV-SNP, which enumerates private vs. shared in the error code, SEV and SEV-ES
> > > don't provide private vs. shared information to the host (KVM) on page fault.  And
> > > it's even more fundamental then that, as SEV/SEV-ES won't even fault if the guest
> > > accesses the "wrong" GPA variant, they'll silent consume/corrupt data.
> > >
> > > That means KVM can't support implicit conversions for SEV/SEV-ES, and so an explicit
> > > hypercall is mandatory.  SEV doesn't even have a vendor-agnostic guest/host paravirt
> > > ABI, and IIRC SEV-ES doesn't provide a conversion/map hypercall in the GHCB spec, so
> > > running a SEV/SEV-ES guest under UPM would require the guest firmware+kernel to be
> > > properly enlightened beyond what is required architecturally.
> > >
> >
> > So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting
> > KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared
> > pages to the hypervisor, this information can be used to mark page shared/private.
> 
> One concern here may be that the VMM doesn't know which guests have
> KVM_FEATURE_HC_MAP_GPA_RANGE support and which don't. Only once the
> guest boots does the guest tell KVM that it supports
> KVM_FEATURE_HC_MAP_GPA_RANGE. If the guest doesn't, we need to pin all
> the memory before we run the guest to be safe.

Yep, that's a big reason why I view purging the existing SEV memory management as
a long term goal.  The other being that userspace obviously needs to be updated to
support UPM[*].   I suspect the only feasible way to enable this for SEV/SEV-ES
would be to restrict it to new VM types that have a disclaimer regarding additional
requirements.

[*] I believe Peter coined the UPM acronym for "Unmapping guest Private Memory".  We've
    been using it internally for discussion and it rolls off the tongue a lot easier than
    the full phrase, and is much more precise/descriptive than just "private fd".

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
  2022-03-31 19:00           ` Sean Christopherson
@ 2022-04-01  3:22             ` Nikunj A. Dadhania
  2022-04-01 14:54               ` Sean Christopherson
  2022-04-01 17:28             ` Marc Orr
  1 sibling, 1 reply; 29+ messages in thread
From: Nikunj A. Dadhania @ 2022-04-01  3:22 UTC (permalink / raw)
  To: Sean Christopherson, Peter Gonda
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Brijesh Singh, Tom Lendacky, Bharata B Rao,
	Maciej S . Szmigiero, Mingwei Zhang, David Hildenbrand, kvm list,
	LKML



On 4/1/2022 12:30 AM, Sean Christopherson wrote:
> On Thu, Mar 31, 2022, Peter Gonda wrote:
>> On Wed, Mar 30, 2022 at 10:48 PM Nikunj A. Dadhania <nikunj@amd.com> wrote:
>>> On 3/31/2022 1:17 AM, Sean Christopherson wrote:
>>>> On Wed, Mar 30, 2022, Nikunj A. Dadhania wrote:
>>>>> On 3/29/2022 2:30 AM, Sean Christopherson wrote:
>>>>>> Let me preface this by saying I generally like the idea and especially the
>>>>>> performance, but...
>>>>>>
>>>>>> I think we should abandon this approach in favor of committing all our resources
>>>>>> to fd-based private memory[*], which (if done right) will provide on-demand pinning
>>>>>> for "free".
>>>>>
>>>>> I will give this a try for SEV, was on my todo list.
>>>>>
>>>>>> I would much rather get that support merged sooner than later, and use
>>>>>> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long
>>>>>> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private
>>>>>> memory.
>>>>>
>>>>>> That would require guest kernel support to communicate private vs. shared,
>>>>>
>>>>> Could you explain this in more detail? Is this required for punching holes for shared pages?
>>>>
>>>> Unlike SEV-SNP, which enumerates private vs. shared in the error code, SEV and SEV-ES
>>>> don't provide private vs. shared information to the host (KVM) on page fault.  And
>>>> it's even more fundamental than that, as SEV/SEV-ES won't even fault if the guest
>>>> accesses the "wrong" GPA variant, they'll silently consume/corrupt data.
>>>>
>>>> That means KVM can't support implicit conversions for SEV/SEV-ES, and so an explicit
>>>> hypercall is mandatory.  SEV doesn't even have a vendor-agnostic guest/host paravirt
>>>> ABI, and IIRC SEV-ES doesn't provide a conversion/map hypercall in the GHCB spec, so
>>>> running a SEV/SEV-ES guest under UPM would require the guest firmware+kernel to be
>>>> properly enlightened beyond what is required architecturally.
>>>>
>>>
>>> So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting
>>> KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared
>>> pages to the hypervisor, this information can be used to mark page shared/private.
>>
>> One concern here may be that the VMM doesn't know which guests have
>> KVM_FEATURE_HC_MAP_GPA_RANGE support and which don't. Only once the
>> guest boots does the guest tell KVM that it supports
>> KVM_FEATURE_HC_MAP_GPA_RANGE. If the guest doesn't, we need to pin all
>> the memory before we run the guest to be safe.
> 
> Yep, that's a big reason why I view purging the existing SEV memory management as
> a long term goal.  The other being that userspace obviously needs to be updated to
> support UPM[*].   I suspect the only feasible way to enable this for SEV/SEV-ES
> would be to restrict it to new VM types that have a disclaimer regarding additional
> requirements.

For SEV/SEV-ES, could we base demand pinning on my first RFC[*]? Those patches do not touch
the core KVM flow. Moreover, they do not require any guest/firmware changes.

[*] https://lore.kernel.org/kvm/20220118110621.62462-1-nikunj@amd.com/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
  2022-04-01  3:22             ` Nikunj A. Dadhania
@ 2022-04-01 14:54               ` Sean Christopherson
  2022-04-01 15:39                 ` Nikunj A. Dadhania
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2022-04-01 14:54 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: Peter Gonda, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Brijesh Singh, Tom Lendacky,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm list, LKML

On Fri, Apr 01, 2022, Nikunj A. Dadhania wrote:
> 
> On 4/1/2022 12:30 AM, Sean Christopherson wrote:
> > On Thu, Mar 31, 2022, Peter Gonda wrote:
> >> On Wed, Mar 30, 2022 at 10:48 PM Nikunj A. Dadhania <nikunj@amd.com> wrote:
> >>> So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting
> >>> KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared
> >>> pages to the hypervisor, this information can be used to mark page shared/private.
> >>
> >> One concern here may be that the VMM doesn't know which guests have
> >> KVM_FEATURE_HC_MAP_GPA_RANGE support and which don't. Only once the
> >> guest boots does the guest tell KVM that it supports
> >> KVM_FEATURE_HC_MAP_GPA_RANGE. If the guest doesn't, we need to pin all
> >> the memory before we run the guest to be safe.
> > 
> > Yep, that's a big reason why I view purging the existing SEV memory management as
> > a long term goal.  The other being that userspace obviously needs to be updated to
> > support UPM[*].   I suspect the only feasible way to enable this for SEV/SEV-ES
> > would be to restrict it to new VM types that have a disclaimer regarding additional
> > requirements.
> 
> For SEV/SEV-ES, could we base demand pinning on my first RFC[*]?

No, because as David pointed out, elevating the refcount is not the same as actually
pinning the page.  Things like NUMA balancing will still try to migrate the page,
and even go so far as to zap the PTE, before bailing due to the outstanding reference.
In other words, not actually pinning makes the mm subsystem less efficient.  Would it
functionally work?  Yes.  Is it acceptable KVM behavior?  No.
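
To make the distinction concrete, a minimal sketch against the v5.17-era GUP
API (signatures from memory, so illustrative only). An elevated refcount only
makes migration fail late, whereas FOLL_PIN | FOLL_LONGTERM migrates the page
out of ZONE_MOVABLE/CMA at pin time and marks it DMA-pinned, so the mm can see
up front that it is unmovable:

#include <linux/mm.h>

/* Long-term pin the single guest page backing the given host virtual address. */
static int pin_enc_guest_page(unsigned long hva, struct page **page)
{
	int ret = pin_user_pages_fast(hva, 1, FOLL_WRITE | FOLL_LONGTERM, page);

	return ret == 1 ? 0 : (ret < 0 ? ret : -EFAULT);
}

/*
 * Contrast with the refcount-only approach: get_page()/put_page() keeps the
 * page allocated, but it still looks migratable to the rest of the mm, which
 * is exactly the wasted work described above.
 */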

> Those patches do not touch the core KVM flow.

I don't mind touching core KVM code.  If this goes forward, I actually strongly
prefer having the x86 MMU code handle the pinning as opposed to burying it in SEV
via kvm_x86_ops.  The reason I don't think it's worth pursuing this approach is
because (a) we know that the current SEV/SEV-ES memory management scheme is flawed
and is a dead end, and (b) this is not so trivial as we (or at least I) originally
thought/hoped it would be.  In other words, it's not that I think demand pinning
is a bad idea, nor do I think the issues are unsolvable, it's that I think the
cost of getting a workable solution, e.g. code churn, ongoing maintenance, reviewer
time, etc..., far outweighs the benefits.
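
As a rough illustration of the two placements being contrasted here (the hook
name and signature are invented for this sketch, not taken from the series):

/* (a) Vendor hook: common MMU code calls back into SEV to pin on demand. */
struct kvm_x86_ops {
	/* ... existing ops ... */
	bool (*pin_pfn)(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
			kvm_pfn_t pfn, gfn_t gfn, bool write);
};

/*
 * (b) MMU-owned pinning: no vendor callback; the common x86 MMU pins in its
 *     own fault path, keyed off a per-VM "memory must stay pinned" property
 *     that SEV merely sets at VM creation time.
 */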

> Moreover, they do not require any guest/firmware changes.
> 
> [*] https://lore.kernel.org/kvm/20220118110621.62462-1-nikunj@amd.com/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
  2022-04-01 14:54               ` Sean Christopherson
@ 2022-04-01 15:39                 ` Nikunj A. Dadhania
  0 siblings, 0 replies; 29+ messages in thread
From: Nikunj A. Dadhania @ 2022-04-01 15:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Gonda, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Brijesh Singh, Tom Lendacky,
	Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm list, LKML



On 4/1/2022 8:24 PM, Sean Christopherson wrote:
> On Fri, Apr 01, 2022, Nikunj A. Dadhania wrote:
>>
>> On 4/1/2022 12:30 AM, Sean Christopherson wrote:
>>> On Thu, Mar 31, 2022, Peter Gonda wrote:
>>>> On Wed, Mar 30, 2022 at 10:48 PM Nikunj A. Dadhania <nikunj@amd.com> wrote:
>>>>> So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting
>>>>> KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared
>>>>> pages to the hypervisor, this information can be used to mark page shared/private.
>>>>
>>>> One concern here may be that the VMM doesn't know which guests have
>>>> KVM_FEATURE_HC_MAP_GPA_RANGE support and which don't. Only once the
>>>> guest boots does the guest tell KVM that it supports
>>>> KVM_FEATURE_HC_MAP_GPA_RANGE. If the guest doesn't, we need to pin all
>>>> the memory before we run the guest to be safe.
>>>
>>> Yep, that's a big reason why I view purging the existing SEV memory management as
>>> a long term goal.  The other being that userspace obviously needs to be updated to
>>> support UPM[*].   I suspect the only feasible way to enable this for SEV/SEV-ES
>>> would be to restrict it to new VM types that have a disclaimer regarding additional
>>> requirements.
>>
>> For SEV/SEV-ES, could we base demand pinning on my first RFC[*]?
> 
> No, because as David pointed out, elevating the refcount is not the same as actually
> pinning the page.  Things like NUMA balancing will still try to migrate the page,
> and even go so far as to zap the PTE, before bailing due to the outstanding reference.
> In other words, not actually pinning makes the mm subsystem less efficient.  Would it
> functionally work?  Yes.  Is it acceptable KVM behavior?  No.
> 
>> Those patches do not touch the core KVM flow.
> 
> I don't mind touching core KVM code.  If this goes forward, I actually strongly
> prefer having the x86 MMU code handle the pinning as opposed to burying it in SEV
> via kvm_x86_ops.  The reason I don't think it's worth pursuing this approach is
> because (a) we know that the current SEV/SEV-ES memory management scheme is flawed
> and is a dead end, and (b) this is not so trivial as we (or at least I) originally
> thought/hoped it would be.  In other words, it's not that I think demand pinning
> is a bad idea, nor do I think the issues are unsolvable, it's that I think the
> cost of getting a workable solution, e.g. code churn, ongoing maintenance, reviewer
> time, etc..., far outweighs the benefits.

Point noted, Sean; will focus on the UPM effort.

Regards
Nikunj

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
  2022-03-31 19:00           ` Sean Christopherson
  2022-04-01  3:22             ` Nikunj A. Dadhania
@ 2022-04-01 17:28             ` Marc Orr
  2022-04-01 18:02               ` Sean Christopherson
  1 sibling, 1 reply; 29+ messages in thread
From: Marc Orr @ 2022-04-01 17:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Gonda, Nikunj A. Dadhania, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Brijesh Singh,
	Tom Lendacky, Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm list, LKML

On Thu, Mar 31, 2022 at 12:01 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Mar 31, 2022, Peter Gonda wrote:
> > On Wed, Mar 30, 2022 at 10:48 PM Nikunj A. Dadhania <nikunj@amd.com> wrote:
> > > On 3/31/2022 1:17 AM, Sean Christopherson wrote:
> > > > On Wed, Mar 30, 2022, Nikunj A. Dadhania wrote:
> > > >> On 3/29/2022 2:30 AM, Sean Christopherson wrote:
> > > >>> Let me preface this by saying I generally like the idea and especially the
> > > >>> performance, but...
> > > >>>
> > > >>> I think we should abandon this approach in favor of committing all our resources
> > > >>> to fd-based private memory[*], which (if done right) will provide on-demand pinning
> > > >>> for "free".
> > > >>
> > > >> I will give this a try for SEV, was on my todo list.
> > > >>
> > > >>> I would much rather get that support merged sooner than later, and use
> > > >>> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long
> > > >>> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private
> > > >>> memory.
> > > >>
> > > >>> That would require guest kernel support to communicate private vs. shared,
> > > >>
> > > >> Could you explain this in more detail? Is this required for punching holes for shared pages?
> > > >
> > > > Unlike SEV-SNP, which enumerates private vs. shared in the error code, SEV and SEV-ES
> > > > don't provide private vs. shared information to the host (KVM) on page fault.  And
> > > > it's even more fundamental than that, as SEV/SEV-ES won't even fault if the guest
> > > > accesses the "wrong" GPA variant, they'll silently consume/corrupt data.
> > > >
> > > > That means KVM can't support implicit conversions for SEV/SEV-ES, and so an explicit
> > > > hypercall is mandatory.  SEV doesn't even have a vendor-agnostic guest/host paravirt
> > > > ABI, and IIRC SEV-ES doesn't provide a conversion/map hypercall in the GHCB spec, so
> > > > running a SEV/SEV-ES guest under UPM would require the guest firmware+kernel to be
> > > > properly enlightened beyond what is required architecturally.
> > > >
> > >
> > > So with guest supporting KVM_FEATURE_HC_MAP_GPA_RANGE and host (KVM) supporting
> > > KVM_HC_MAP_GPA_RANGE hypercall, SEV/SEV-ES guest should communicate private/shared
> > > pages to the hypervisor, this information can be used to mark page shared/private.
> >
> > One concern here may be that the VMM doesn't know which guests have
> > KVM_FEATURE_HC_MAP_GPA_RANGE support and which don't. Only once the
> > guest boots does the guest tell KVM that it supports
> > KVM_FEATURE_HC_MAP_GPA_RANGE. If the guest doesn't, we need to pin all
> > the memory before we run the guest to be safe.
>
> Yep, that's a big reason why I view purging the existing SEV memory management as
> a long term goal.  The other being that userspace obviously needs to be updated to
> support UPM[*].   I suspect the only feasible way to enable this for SEV/SEV-ES
> would be to restrict it to new VM types that have a disclaimer regarding additional
> requirements.
>
> [*] I believe Peter coined the UPM acronym for "Unmapping guest Private Memory".  We've
>     been using it internally for discussion and it rolls off the tongue a lot easier than
>     the full phrase, and is much more precise/descriptive than just "private fd".

Can we really "purge the existing SEV memory management"? This seems
like a non-starter because it violates userspace API (i.e., the
ability for the userspace VMM to run a guest without
KVM_FEATURE_HC_MAP_GPA_RANGE). Or maybe I'm not quite following what
you mean by purge.

Assuming that UPM-based lazy pinning comes together via a new VM type
that only supports new images based on a minimum kernel version with
KVM_FEATURE_HC_MAP_GPA_RANGE, then I think this would look as follows:

1. Userspace VMM: Check the SEV VM type. If the type is a legacy SEV
type, do upfront pinning (a rough sketch follows this list). Else, skip
upfront pinning.
2. KVM: I'm not sure anything special needs to happen here. For the
legacy VM types, it can be configured to use legacy memslots,
presumably the same as non-CVMs will be configured. For the new VM
type, it should be configured to use UPM.
3. Control plane (thing creating VMs): Responsible for not allowing
legacy SEV images (i.e., images without KVM_FEATURE_HC_MAP_GPA_RANGE)
with the new SEV VM types that use UPM and have support for demand
pinning.
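
For reference, a rough sketch of what the "legacy path" in step 1 amounts to
with the existing SEV uAPI. KVM_MEMORY_ENCRYPT_REG_REGION and struct
kvm_enc_region are the existing interface; the helper around them is made up:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Userspace VMM: legacy SEV/SEV-ES guest, so pin everything before boot. */
static int register_and_pin_guest_ram(int vm_fd, void *hva, __u64 size)
{
	struct kvm_enc_region region = {
		.addr = (__u64)(unsigned long)hva,
		.size = size,
	};

	/* Registering the region is what causes KVM to pin the backing pages. */
	return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
}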

Sean: Did I get this right?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
  2022-04-01 17:28             ` Marc Orr
@ 2022-04-01 18:02               ` Sean Christopherson
  2022-04-01 18:19                 ` Marc Orr
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2022-04-01 18:02 UTC (permalink / raw)
  To: Marc Orr
  Cc: Peter Gonda, Nikunj A. Dadhania, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Brijesh Singh,
	Tom Lendacky, Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm list, LKML

On Fri, Apr 01, 2022, Marc Orr wrote:
> On Thu, Mar 31, 2022 at 12:01 PM Sean Christopherson <seanjc@google.com> wrote:
> > Yep, that's a big reason why I view purging the existing SEV memory management as
> > a long term goal.  The other being that userspace obviously needs to be updated to
> > support UPM[*].   I suspect the only feasible way to enable this for SEV/SEV-ES
> > would be to restrict it to new VM types that have a disclaimer regarding additional
> > requirements.
> >
> > [*] I believe Peter coined the UPM acronym for "Unmapping guest Private Memory".  We've
> >     been using it internally for discussion and it rolls off the tongue a lot easier than
> >     the full phrase, and is much more precise/descriptive than just "private fd".
> 
> Can we really "purge the existing SEV memory management"? This seems
> like a non-starter because it violates userspace API (i.e., the
> ability for the userspace VMM to run a guest without
> KVM_FEATURE_HC_MAP_GPA_RANGE). Or maybe I'm not quite following what
> you mean by purge.

I really do mean purge, but I also really do mean "long term", as in 5+ years
(probably 10+ if I'm being realistic).

Removing support is completely ok, as is changing the uABI, the rule is that we
can't break userspace.  If all users are migrated to private-fd, e.g. by carrots
and/or sticks such as putting the code into maintenance-only mode, then at some
point in the future there will be no users left to break and we can drop the
current code and make use of private-fd mandatory for SEV/SEV-ES guests.

> Assuming that UPM-based lazy pinning comes together via a new VM type
> that only supports new images based on a minimum kernel version with
> KVM_FEATURE_HC_MAP_GPA_RANGE, then I think this would look as follows:
> 
> 1. Userspace VMM: Check SEV VM type. If type is legacy SEV type then
> do upfront pinning. Else, skip up front pinning.

Yep, if by legacy "SEV type" you mean "SEV/SEV-ES guest that isn't required to
use MAP_GPA_RANGE", which I'm pretty sure you do based on #3.

> 2. KVM: I'm not sure anything special needs to happen here. For the
> legacy VM types, it can be configured to use legacy memslots,
> presumably the same as non-CVMs will be configured. For the new VM
> type, it should be configured to use UPM.

Correct, for now, KVM does nothing different for SEV/SEV-ES guests.

> 3. Control plane (thing creating VMs): Responsible for not allowing
> legacy SEV images (i.e., images without KVM_FEATURE_HC_MAP_GPA_RANGE)
> with the new SEV VM types that use UPM and have support for demand
> pinning.
> 
> Sean: Did I get this right?

Yep.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests
  2022-04-01 18:02               ` Sean Christopherson
@ 2022-04-01 18:19                 ` Marc Orr
  0 siblings, 0 replies; 29+ messages in thread
From: Marc Orr @ 2022-04-01 18:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Gonda, Nikunj A. Dadhania, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Brijesh Singh,
	Tom Lendacky, Bharata B Rao, Maciej S . Szmigiero, Mingwei Zhang,
	David Hildenbrand, kvm list, LKML

On Fri, Apr 1, 2022 at 11:02 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 01, 2022, Marc Orr wrote:
> > On Thu, Mar 31, 2022 at 12:01 PM Sean Christopherson <seanjc@google.com> wrote:
> > > Yep, that's a big reason why I view purging the existing SEV memory management as
> > > a long term goal.  The other being that userspace obviously needs to be updated to
> > > support UPM[*].   I suspect the only feasible way to enable this for SEV/SEV-ES
> > > would be to restrict it to new VM types that have a disclaimer regarding additional
> > > requirements.
> > >
> > > [*] I believe Peter coined the UPM acronym for "Unmapping guest Private Memory".  We've
> > >     been using it internally for discussion and it rolls off the tongue a lot easier than
> > >     the full phrase, and is much more precise/descriptive than just "private fd".
> >
> > Can we really "purge the existing SEV memory management"? This seems
> > like a non-starter because it violates userspace API (i.e., the
> > ability for the userspace VMM to run a guest without
> > KVM_FEATURE_HC_MAP_GPA_RANGE). Or maybe I'm not quite following what
> > you mean by purge.
>
> I really do mean purge, but I also really do mean "long term", as in 5+ years
> (probably 10+ if I'm being realistic).
>
> Removing support is completely ok, as is changing the uABI, the rule is that we
> can't break userspace.  If all users are migrated to private-fd, e.g. by carrots
> and/or sticks such as putting the code into maintenance-only mode, then at some
> point in the future there will be no users left to break and we can drop the
> current code and make use of private-fd mandatory for SEV/SEV-ES guests.

Ah, it makes sense now. Thanks!

> > Assuming that UPM-based lazy pinning comes together via a new VM type
> > that only supports new images based on a minimum kernel version with
> > KVM_FEATURE_HC_MAP_GPA_RANGE, then I think this would look as follows:
> >
> > 1. Userspace VMM: Check SEV VM type. If type is legacy SEV type then
> > do upfront pinning. Else, skip up front pinning.
>
> Yep, if by legacy "SEV type" you mean "SEV/SEV-ES guest that isn't required to
> use MAP_GPA_RANGE", which I'm pretty sure you do based on #3.

Yeah, that's exactly what I meant.

> > 2. KVM: I'm not sure anything special needs to happen here. For the
> > legacy VM types, it can be configured to use legacy memslots,
> > presumably the same as non-CVMs will be configured. For the new VM
> > type, it should be configured to use UPM.
>
> Correct, for now, KVM does nothing different for SEV/SEV-ES guests.
>
> > 3. Control plane (thing creating VMs): Responsible for not allowing
> > legacy SEV images (i.e., images without KVM_FEATURE_HC_MAP_GPA_RANGE)
> > with the new SEV VM types that use UPM and have support for demand
> > pinning.
> >
> > Sean: Did I get this right?
>
> Yep.

Thank you for verifying.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-04-01 18:20 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-08  4:38 [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Nikunj A Dadhania
2022-03-08  4:38 ` [PATCH RFC v1 1/9] KVM: Introduce pinning flag to hva_to_pfn* Nikunj A Dadhania
2022-03-08  4:38 ` [PATCH RFC v1 2/9] KVM: x86/mmu: Move hugepage adjust to direct_page_fault Nikunj A Dadhania
2022-03-28 21:04   ` Sean Christopherson
2022-03-08  4:38 ` [PATCH RFC v1 3/9] KVM: x86/mmu: Add hook to pin PFNs on demand in MMU Nikunj A Dadhania
2022-03-08  4:38 ` [PATCH RFC v1 4/9] KVM: SVM: Add pinning metadata in the arch memslot Nikunj A Dadhania
2022-03-08  4:38 ` [PATCH RFC v1 5/9] KVM: SVM: Implement demand page pinning Nikunj A Dadhania
2022-03-08 21:53   ` Mingwei Zhang
2022-03-09  5:10     ` Nikunj A. Dadhania
2022-03-21  6:11       ` Mingwei Zhang
2022-03-21  9:19         ` Nikunj A. Dadhania
2022-03-08  4:38 ` [PATCH RFC v1 6/9] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV/TDX Nikunj A Dadhania
2022-03-08  4:38 ` [PATCH RFC v1 7/9] KVM: SEV: Carve out routine for allocation of pages Nikunj A Dadhania
2022-03-08  4:38 ` [PATCH RFC v1 8/9] KVM: Move kvm_for_each_memslot_in_hva_range() to be used in SVM Nikunj A Dadhania
2022-03-08  4:38 ` [PATCH RFC v1 9/9] KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data() Nikunj A Dadhania
2022-03-09 16:57   ` Maciej S. Szmigiero
2022-03-09 17:47     ` Nikunj A. Dadhania
2022-03-28 21:00 ` [PATCH RFC v1 0/9] KVM: SVM: Defer page pinning for SEV guests Sean Christopherson
2022-03-30  4:42   ` Nikunj A. Dadhania
2022-03-30 19:47     ` Sean Christopherson
2022-03-31  4:48       ` Nikunj A. Dadhania
2022-03-31 18:32         ` Peter Gonda
2022-03-31 19:00           ` Sean Christopherson
2022-04-01  3:22             ` Nikunj A. Dadhania
2022-04-01 14:54               ` Sean Christopherson
2022-04-01 15:39                 ` Nikunj A. Dadhania
2022-04-01 17:28             ` Marc Orr
2022-04-01 18:02               ` Sean Christopherson
2022-04-01 18:19                 ` Marc Orr
