KVM Archive on lore.kernel.org
* [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework
@ 2020-07-31 21:23 Sean Christopherson
  2020-07-31 21:23 ` [RFC PATCH 1/8] KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits() Sean Christopherson
                   ` (9 more replies)
  0 siblings, 10 replies; 14+ messages in thread
From: Sean Christopherson @ 2020-07-31 21:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, eric van tassell, Tom Lendacky

SEV currently needs to pin guest memory as it doesn't support migrating
encrypted pages.  Introduce a framework in KVM's MMU to support pinning
pages on demand without requiring additional memory allocations, and with
a (somewhat hazy) line of sight toward supporting more advanced features for
encrypted guest memory, e.g. host page migration.

The idea is to use a software available bit in the SPTE to track that a
page has been pinned.  The decision to pin a page and the actual pinning
management is handled by vendor code via kvm_x86_ops hooks.  Two hooks
(zap and unzap) are intentionally introduced even though they are not
needed for SEV; I included them to show how the flag (probably renamed?)
could be used for more than just pin/unpin.

Bugs in the core implementation are pretty much guaranteed.  The basic
concept has been tested, but in a fairly different incarnation.  Most
notably, tagging PRESENT SPTEs as PINNED has not been tested, although
using the PINNED flag to track zapped (and known to be pinned) SPTEs has
been tested.  I cobbled this variation together fairly quickly to get the
code out there for discussion.

The last patch to pin SEV pages during sev_launch_update_data() is
incomplete; it's there to show how we might leverage MMU-based pinning to
support pinning pages before the guest is live.

Sean Christopherson (8):
  KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()
  KVM: x86/mmu: Use bits 2:0 to check for present SPTEs
  KVM: x86/mmu: Refactor handling of not-present SPTEs in mmu_set_spte()
  KVM: x86/mmu: Add infrastructure for pinning PFNs on demand
  KVM: SVM: Use the KVM MMU SPTE pinning hooks to pin pages on demand
  KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
  KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV
  KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()

 arch/x86/include/asm/kvm_host.h |   7 ++
 arch/x86/kvm/mmu.h              |   3 +
 arch/x86/kvm/mmu/mmu.c          | 186 +++++++++++++++++++++++++-------
 arch/x86/kvm/mmu/paging_tmpl.h  |   3 +-
 arch/x86/kvm/svm/sev.c          | 141 +++++++++++++++++++++++-
 arch/x86/kvm/svm/svm.c          |   3 +
 arch/x86/kvm/svm/svm.h          |   3 +
 7 files changed, 302 insertions(+), 44 deletions(-)

-- 
2.28.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH 1/8] KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()
  2020-07-31 21:23 [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Sean Christopherson
@ 2020-07-31 21:23 ` Sean Christopherson
  2020-07-31 21:23 ` [RFC PATCH 2/8] KVM: x86/mmu: Use bits 2:0 to check for present SPTEs Sean Christopherson
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2020-07-31 21:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, eric van tassell, Tom Lendacky

Return the old SPTE when clearing a SPTE and push the "old SPTE present"
check to the caller.  Tracking pinned SPTEs will use the old SPTE
in rmap_remove() to determine whether or not the SPTE is pinned.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 289dddff2615f..d737042fea55e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -930,9 +930,9 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
  * Rules for using mmu_spte_clear_track_bits:
  * It sets the sptep from present to nonpresent, and track the
  * state bits, it is used to clear the last level sptep.
- * Returns non-zero if the PTE was previously valid.
+ * Returns the old PTE.
  */
-static int mmu_spte_clear_track_bits(u64 *sptep)
+static u64 mmu_spte_clear_track_bits(u64 *sptep)
 {
 	kvm_pfn_t pfn;
 	u64 old_spte = *sptep;
@@ -943,7 +943,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 		old_spte = __update_clear_spte_slow(sptep, 0ull);
 
 	if (!is_shadow_present_pte(old_spte))
-		return 0;
+		return old_spte;
 
 	pfn = spte_to_pfn(old_spte);
 
@@ -960,7 +960,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 	if (is_dirty_spte(old_spte))
 		kvm_set_pfn_dirty(pfn);
 
-	return 1;
+	return old_spte;
 }
 
 /*
@@ -1484,7 +1484,9 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
 
 static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
-	if (mmu_spte_clear_track_bits(sptep))
+	u64 old_spte = mmu_spte_clear_track_bits(sptep);
+
+	if (is_shadow_present_pte(old_spte))
 		rmap_remove(kvm, sptep);
 }
 
-- 
2.28.0



* [RFC PATCH 2/8] KVM: x86/mmu: Use bits 2:0 to check for present SPTEs
  2020-07-31 21:23 [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Sean Christopherson
  2020-07-31 21:23 ` [RFC PATCH 1/8] KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits() Sean Christopherson
@ 2020-07-31 21:23 ` Sean Christopherson
  2020-07-31 21:23 ` [RFC PATCH 3/8] KVM: x86/mmu: Refactor handling of not-present SPTEs in mmu_set_spte() Sean Christopherson
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2020-07-31 21:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, eric van tassell, Tom Lendacky

Use what are effectively EPT's RWX bits to detect present SPTEs instead
of simply looking for a non-zero value.  This will allow using a
non-zero initial value for SPTEs as well as using not-present SPTEs to
track metadata for zapped private SPTEs.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c         | 9 +++++++--
 arch/x86/kvm/mmu/paging_tmpl.h | 3 ++-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d737042fea55e..82f69a7456004 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -625,9 +625,14 @@ static int is_nx(struct kvm_vcpu *vcpu)
 	return vcpu->arch.efer & EFER_NX;
 }
 
-static int is_shadow_present_pte(u64 pte)
+static inline bool __is_shadow_present_pte(u64 pte)
 {
-	return (pte != 0) && !is_mmio_spte(pte);
+	return !!(pte & 0x7);
+}
+
+static bool is_shadow_present_pte(u64 pte)
+{
+	return __is_shadow_present_pte(pte) && !is_mmio_spte(pte);
 }
 
 static int is_large_pte(u64 pte)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0172a949f6a75..57813e92ea8e0 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1024,7 +1024,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		gpa_t pte_gpa;
 		gfn_t gfn;
 
-		if (!sp->spt[i])
+		if (!__is_shadow_present_pte(sp->spt[i]) &&
+		    !is_mmio_spte(sp->spt[i]))
 			continue;
 
 		pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
-- 
2.28.0



* [RFC PATCH 3/8] KVM: x86/mmu: Refactor handling of not-present SPTEs in mmu_set_spte()
  2020-07-31 21:23 [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Sean Christopherson
  2020-07-31 21:23 ` [RFC PATCH 1/8] KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits() Sean Christopherson
  2020-07-31 21:23 ` [RFC PATCH 2/8] KVM: x86/mmu: Use bits 2:0 to check for present SPTEs Sean Christopherson
@ 2020-07-31 21:23 ` Sean Christopherson
  2020-07-31 21:23 ` [RFC PATCH 4/8] KVM: x86/mmu: Add infrastructure for pinning PFNs on demand Sean Christopherson
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2020-07-31 21:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, eric van tassell, Tom Lendacky

Return early from mmu_set_spte() if the new SPTE is not-present so as to
reduce the indentation of the code that performs metadata updates, e.g.
rmap manipulation.  Additional metadata updates will soon follow...

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 82f69a7456004..182f398036248 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3126,12 +3126,14 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	if (!was_rmapped && is_large_pte(*sptep))
 		++vcpu->kvm->stat.lpages;
 
-	if (is_shadow_present_pte(*sptep)) {
-		if (!was_rmapped) {
-			rmap_count = rmap_add(vcpu, sptep, gfn);
-			if (rmap_count > RMAP_RECYCLE_THRESHOLD)
-				rmap_recycle(vcpu, sptep, gfn);
-		}
+	/* No additional tracking necessary for not-present SPTEs. */
+	if (!is_shadow_present_pte(*sptep))
+		return ret;
+
+	if (!was_rmapped) {
+		rmap_count = rmap_add(vcpu, sptep, gfn);
+		if (rmap_count > RMAP_RECYCLE_THRESHOLD)
+			rmap_recycle(vcpu, sptep, gfn);
 	}
 
 	return ret;
-- 
2.28.0



* [RFC PATCH 4/8] KVM: x86/mmu: Add infrastructure for pinning PFNs on demand
  2020-07-31 21:23 [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Sean Christopherson
                   ` (2 preceding siblings ...)
  2020-07-31 21:23 ` [RFC PATCH 3/8] KVM: x86/mmu: Refactor handling of not-present SPTEs in mmu_set_spte() Sean Christopherson
@ 2020-07-31 21:23 ` Sean Christopherson
  2020-07-31 21:23 ` [RFC PATCH 5/8] KVM: SVM: Use the KVM MMU SPTE pinning hooks to pin pages " Sean Christopherson
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2020-07-31 21:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, eric van tassell, Tom Lendacky

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/include/asm/kvm_host.h |   7 ++
 arch/x86/kvm/mmu/mmu.c          | 111 ++++++++++++++++++++++++++------
 2 files changed, 99 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1bab87a444d78..b14864f3e8e74 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1138,6 +1138,13 @@ struct kvm_x86_ops {
 
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, unsigned long cr3);
 
+	bool (*pin_spte)(struct kvm_vcpu *vcpu, gfn_t gfn, int level,
+			 kvm_pfn_t pfn);
+	void (*drop_pinned_spte)(struct kvm *kvm, gfn_t gfn, int level,
+				 kvm_pfn_t pfn);
+	void (*zap_pinned_spte)(struct kvm *kvm, gfn_t gfn, int level);
+	void (*unzap_pinned_spte)(struct kvm *kvm, gfn_t gfn, int level);
+
 	bool (*has_wbinvd_exit)(void);
 
 	/* Returns actual tsc_offset set in active VMCS */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 182f398036248..cab3b2f2f49c3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -133,6 +133,9 @@ module_param(dbg, bool, 0644);
 #define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
 #define SPTE_MMIO_MASK (3ULL << 52)
 
+/* Special SPTEs flags that can only be used for non-MMIO SPTEs. */
+#define SPTE_PINNED_MASK	BIT_ULL(62)
+
 #define PT64_LEVEL_BITS 9
 
 #define PT64_LEVEL_SHIFT(level) \
@@ -211,6 +214,7 @@ enum {
 	RET_PF_EMULATE = 1,
 	RET_PF_INVALID = 2,
 	RET_PF_FIXED = 3,
+	RET_PF_UNZAPPED = 4,
 };
 
 struct pte_list_desc {
@@ -635,6 +639,11 @@ static bool is_shadow_present_pte(u64 pte)
 	return __is_shadow_present_pte(pte) && !is_mmio_spte(pte);
 }
 
+static bool is_pinned_pte(u64 pte)
+{
+	return !!(pte & SPTE_PINNED_MASK);
+}
+
 static int is_large_pte(u64 pte)
 {
 	return pte & PT_PAGE_SIZE_MASK;
@@ -937,15 +946,15 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
  * state bits, it is used to clear the last level sptep.
  * Returns the old PTE.
  */
-static u64 mmu_spte_clear_track_bits(u64 *sptep)
+static u64 __mmu_spte_clear_track_bits(u64 *sptep, u64 clear_value)
 {
 	kvm_pfn_t pfn;
 	u64 old_spte = *sptep;
 
 	if (!spte_has_volatile_bits(old_spte))
-		__update_clear_spte_fast(sptep, 0ull);
+		__update_clear_spte_fast(sptep, clear_value);
 	else
-		old_spte = __update_clear_spte_slow(sptep, 0ull);
+		old_spte = __update_clear_spte_slow(sptep, clear_value);
 
 	if (!is_shadow_present_pte(old_spte))
 		return old_spte;
@@ -968,6 +977,11 @@ static u64 mmu_spte_clear_track_bits(u64 *sptep)
 	return old_spte;
 }
 
+static inline u64 mmu_spte_clear_track_bits(u64 *sptep)
+{
+	return __mmu_spte_clear_track_bits(sptep, 0ull);
+}
+
 /*
  * Rules for using mmu_spte_clear_no_track:
  * Directly clear spte without caring the state bits of sptep,
@@ -1399,7 +1413,7 @@ static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 	return pte_list_add(vcpu, spte, rmap_head);
 }
 
-static void rmap_remove(struct kvm *kvm, u64 *spte)
+static void rmap_remove(struct kvm *kvm, u64 *spte, u64 old_spte)
 {
 	struct kvm_mmu_page *sp;
 	gfn_t gfn;
@@ -1409,6 +1423,10 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 	gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
 	rmap_head = gfn_to_rmap(kvm, gfn, sp);
 	__pte_list_remove(spte, rmap_head);
+
+	if (is_pinned_pte(old_spte))
+		kvm_x86_ops.drop_pinned_spte(kvm, gfn, sp->role.level - 1,
+					     spte_to_pfn(old_spte));
 }
 
 /*
@@ -1446,7 +1464,7 @@ static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 	iter->pos = 0;
 	sptep = iter->desc->sptes[iter->pos];
 out:
-	BUG_ON(!is_shadow_present_pte(*sptep));
+	BUG_ON(!is_shadow_present_pte(*sptep) && !is_pinned_pte(*sptep));
 	return sptep;
 }
 
@@ -1491,8 +1509,8 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
 	u64 old_spte = mmu_spte_clear_track_bits(sptep);
 
-	if (is_shadow_present_pte(old_spte))
-		rmap_remove(kvm, sptep);
+	if (is_shadow_present_pte(old_spte) || is_pinned_pte(old_spte))
+		rmap_remove(kvm, sptep, old_spte);
 }
 
 
@@ -1730,17 +1748,49 @@ static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
 	return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn);
 }
 
+static bool kvm_mmu_zap_pinned_spte(struct kvm *kvm, u64 *sptep)
+{
+	struct kvm_mmu_page *sp;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+
+	if (!(*sptep & SPTE_PINNED_MASK))
+		return false;
+
+	sp = sptep_to_sp(sptep);
+	gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+	pfn = spte_to_pfn(*sptep);
+
+	if (kvm_x86_ops.zap_pinned_spte)
+		kvm_x86_ops.zap_pinned_spte(kvm, gfn, sp->role.level - 1);
+
+	__mmu_spte_clear_track_bits(sptep, SPTE_PINNED_MASK | pfn << PAGE_SHIFT);
+	return true;
+}
+
 static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
 	bool flush = false;
 
-	while ((sptep = rmap_get_first(rmap_head, &iter))) {
+restart:
+	for_each_rmap_spte(rmap_head, &iter, sptep) {
 		rmap_printk("%s: spte %p %llx.\n", __func__, sptep, *sptep);
 
+		if (!is_shadow_present_pte(*sptep)) {
+			WARN_ON_ONCE(!is_pinned_pte(*sptep));
+			continue;
+		}
+
+		flush = true;
+
+		/* Keep the rmap if the SPTE is pinned. */
+		if (kvm_mmu_zap_pinned_spte(kvm, sptep))
+			continue;
+
 		pte_list_remove(rmap_head, sptep);
-		flush = true;
+		goto restart;
 	}
 
 	return flush;
@@ -1774,6 +1824,10 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 		need_flush = 1;
 
+		/* Pinned pages should not be relocated (obviously). */
+		if (WARN_ON_ONCE(is_pinned_pte(*sptep)))
+			continue;
+
 		if (pte_write(*ptep)) {
 			pte_list_remove(rmap_head, sptep);
 			goto restart;
@@ -2630,7 +2684,7 @@ static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 	struct kvm_mmu_page *child;
 
 	pte = *spte;
-	if (is_shadow_present_pte(pte)) {
+	if (is_shadow_present_pte(pte) || is_pinned_pte(pte)) {
 		if (is_last_spte(pte, sp->role.level)) {
 			drop_spte(kvm, spte);
 			if (is_large_pte(pte))
@@ -2639,7 +2693,7 @@ static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 			child = to_shadow_page(pte & PT64_BASE_ADDR_MASK);
 			drop_parent_pte(child, spte);
 		}
-		return true;
+		return is_shadow_present_pte(pte);
 	}
 
 	if (is_mmio_spte(pte))
@@ -2987,10 +3041,13 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	u64 spte = 0;
 	int ret = 0;
 	struct kvm_mmu_page *sp;
+	bool is_mmio_pfn;
 
 	if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access))
 		return 0;
 
+	is_mmio_pfn = kvm_is_mmio_pfn(pfn);
+
 	sp = sptep_to_sp(sptep);
 	if (sp_ad_disabled(sp))
 		spte |= SPTE_AD_DISABLED_MASK;
@@ -3023,15 +3080,14 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	if (level > PG_LEVEL_4K)
 		spte |= PT_PAGE_SIZE_MASK;
 	if (tdp_enabled)
-		spte |= kvm_x86_ops.get_mt_mask(vcpu, gfn,
-			kvm_is_mmio_pfn(pfn));
+		spte |= kvm_x86_ops.get_mt_mask(vcpu, gfn, is_mmio_pfn);
 
 	if (host_writable)
 		spte |= SPTE_HOST_WRITEABLE;
 	else
 		pte_access &= ~ACC_WRITE_MASK;
 
-	if (!kvm_is_mmio_pfn(pfn))
+	if (!is_mmio_pfn)
 		spte |= shadow_me_mask;
 
 	spte |= (u64)pfn << PAGE_SHIFT;
@@ -3065,6 +3121,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	if (speculative)
 		spte = mark_spte_for_access_track(spte);
 
+	if (is_pinned_pte(*sptep) ||
+	    (vcpu->arch.mmu->direct_map && !is_mmio_pfn &&
+	     kvm_x86_ops.pin_spte &&
+	     kvm_x86_ops.pin_spte(vcpu, gfn, level, pfn)))
+		spte |= SPTE_PINNED_MASK;
+
 set_pte:
 	if (mmu_spte_update(sptep, spte))
 		ret |= SET_SPTE_NEED_REMOTE_TLB_FLUSH;
@@ -3081,29 +3143,33 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	int set_spte_ret;
 	int ret = RET_PF_FIXED;
 	bool flush = false;
+	u64 pte = *sptep;
 
 	pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
 		 *sptep, write_fault, gfn);
 
-	if (is_shadow_present_pte(*sptep)) {
+	if (is_shadow_present_pte(pte)) {
 		/*
 		 * If we overwrite a PTE page pointer with a 2MB PMD, unlink
 		 * the parent of the now unreachable PTE.
 		 */
-		if (level > PG_LEVEL_4K && !is_large_pte(*sptep)) {
+		if (level > PG_LEVEL_4K && !is_large_pte(pte)) {
 			struct kvm_mmu_page *child;
-			u64 pte = *sptep;
 
 			child = to_shadow_page(pte & PT64_BASE_ADDR_MASK);
 			drop_parent_pte(child, sptep);
 			flush = true;
-		} else if (pfn != spte_to_pfn(*sptep)) {
+		} else if (pfn != spte_to_pfn(pte)) {
 			pgprintk("hfn old %llx new %llx\n",
-				 spte_to_pfn(*sptep), pfn);
+				 spte_to_pfn(pte), pfn);
 			drop_spte(vcpu->kvm, sptep);
 			flush = true;
 		} else
 			was_rmapped = 1;
+	} else if (is_pinned_pte(pte)) {
+		WARN_ON_ONCE(pfn != spte_to_pfn(pte));
+		ret = RET_PF_UNZAPPED;
+		was_rmapped = 1;
 	}
 
 	set_spte_ret = set_spte(vcpu, sptep, pte_access, level, gfn, pfn,
@@ -3136,6 +3202,9 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			rmap_recycle(vcpu, sptep, gfn);
 	}
 
+	if (ret == RET_PF_UNZAPPED && kvm_x86_ops.unzap_pinned_spte)
+		kvm_x86_ops.unzap_pinned_spte(vcpu->kvm, gfn, level - 1);
+
 	return ret;
 }
 
@@ -5921,6 +5990,10 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		sp = sptep_to_sp(sptep);
 		pfn = spte_to_pfn(*sptep);
 
+		/* Pinned page dirty logging is not supported. */
+		if (WARN_ON_ONCE(is_pinned_pte(*sptep)))
+			continue;
+
 		/*
 		 * We cannot do huge page mapping for indirect shadow pages,
 		 * which are found on the last rmap (level = 1) when not using
-- 
2.28.0



* [RFC PATCH 5/8] KVM: SVM: Use the KVM MMU SPTE pinning hooks to pin pages on demand
  2020-07-31 21:23 [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Sean Christopherson
                   ` (3 preceding siblings ...)
  2020-07-31 21:23 ` [RFC PATCH 4/8] KVM: x86/mmu: Add infrastructure for pinning PFNs on demand Sean Christopherson
@ 2020-07-31 21:23 ` Sean Christopherson
  2020-07-31 21:23 ` [RFC PATCH 6/8] KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault() Sean Christopherson
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2020-07-31 21:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, eric van tassell, Tom Lendacky

Cc: eric van tassell <Eric.VanTassell@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/svm/sev.c | 24 ++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c |  3 +++
 arch/x86/kvm/svm/svm.h |  3 +++
 3 files changed, 30 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index f7f1f4ecf08e3..f640b8beb443e 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1193,3 +1193,27 @@ void pre_sev_run(struct vcpu_svm *svm, int cpu)
 	svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
 	vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
 }
+
+bool sev_pin_spte(struct kvm_vcpu *vcpu, gfn_t gfn, int level, kvm_pfn_t pfn)
+{
+	if (!sev_guest(vcpu->kvm))
+		return false;
+
+	get_page(pfn_to_page(pfn));
+
+	/*
+	 * Flush any cached lines of the page being added since "ownership" of
+	 * it will be transferred from the host to an encrypted guest.
+	 */
+	clflush_cache_range(__va(pfn << PAGE_SHIFT), page_level_size(level));
+
+	return true;
+}
+
+void sev_drop_pinned_spte(struct kvm *kvm, gfn_t gfn, int level, kvm_pfn_t pfn)
+{
+	if (WARN_ON_ONCE(!sev_guest(kvm)))
+		return;
+
+	put_page(pfn_to_page(pfn));
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0ae20af3a1677..a9f7515b4eff3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4150,6 +4150,9 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.need_emulation_on_page_fault = svm_need_emulation_on_page_fault,
 
 	.apic_init_signal_blocked = svm_apic_init_signal_blocked,
+
+	.pin_spte = sev_pin_spte,
+	.drop_pinned_spte = sev_drop_pinned_spte,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index a798e17317094..3060e3e529cbc 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -489,4 +489,7 @@ void pre_sev_run(struct vcpu_svm *svm, int cpu);
 int __init sev_hardware_setup(void);
 void sev_hardware_teardown(void);
 
+bool sev_pin_spte(struct kvm_vcpu *vcpu, gfn_t gfn, int level, kvm_pfn_t pfn);
+void sev_drop_pinned_spte(struct kvm *kvm, gfn_t gfn, int level, kvm_pfn_t pfn);
+
 #endif
-- 
2.28.0



* [RFC PATCH 6/8] KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
  2020-07-31 21:23 [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Sean Christopherson
                   ` (4 preceding siblings ...)
  2020-07-31 21:23 ` [RFC PATCH 5/8] KVM: SVM: Use the KVM MMU SPTE pinning hooks to pin pages " Sean Christopherson
@ 2020-07-31 21:23 ` Sean Christopherson
  2020-07-31 21:23 ` [RFC PATCH 7/8] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV Sean Christopherson
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2020-07-31 21:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, eric van tassell, Tom Lendacky

When adding pages prior to boot, SEV needs to pin the resulting host pfn
so that the pages that are consumed by sev_launch_update_data() are not
moved after the memory is encrypted, which would corrupt the guest data.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cab3b2f2f49c3..92b133d7b1713 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4156,7 +4156,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
-			     bool prefault, int max_level, bool is_tdp)
+			     bool prefault, int max_level, bool is_tdp,
+			     kvm_pfn_t *pfn)
 {
 	bool write = error_code & PFERR_WRITE_MASK;
 	bool exec = error_code & PFERR_FETCH_MASK;
@@ -4165,7 +4166,6 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	unsigned long mmu_seq;
-	kvm_pfn_t pfn;
 	int r;
 
 	if (page_fault_handle_page_track(vcpu, error_code, gfn))
@@ -4184,10 +4184,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
+	if (try_async_pf(vcpu, prefault, gfn, gpa, pfn, write, &map_writable))
 		return RET_PF_RETRY;
 
-	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
+	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, *pfn, ACC_ALL, &r))
 		return r;
 
 	r = RET_PF_RETRY;
@@ -4197,23 +4197,25 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	r = make_mmu_pages_available(vcpu);
 	if (r)
 		goto out_unlock;
-	r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
+	r = __direct_map(vcpu, gpa, write, map_writable, max_level, *pfn,
 			 prefault, is_tdp && lpage_disallowed);
 
 out_unlock:
 	spin_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(pfn);
+	kvm_release_pfn_clean(*pfn);
 	return r;
 }
 
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
 				u32 error_code, bool prefault)
 {
+	kvm_pfn_t pfn;
+
 	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
 
 	/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
 	return direct_page_fault(vcpu, gpa & PAGE_MASK, error_code, prefault,
-				 PG_LEVEL_2M, false);
+				 PG_LEVEL_2M, false, &pfn);
 }
 
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
@@ -4252,6 +4254,7 @@ EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		       bool prefault)
 {
+	kvm_pfn_t pfn;
 	int max_level;
 
 	for (max_level = KVM_MAX_HUGEPAGE_LEVEL;
@@ -4265,7 +4268,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	}
 
 	return direct_page_fault(vcpu, gpa, error_code, prefault,
-				 max_level, true);
+				 max_level, true, &pfn);
 }
 
 static void nonpaging_init_context(struct kvm_vcpu *vcpu,
-- 
2.28.0



* [RFC PATCH 7/8] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV
  2020-07-31 21:23 [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Sean Christopherson
                   ` (5 preceding siblings ...)
  2020-07-31 21:23 ` [RFC PATCH 6/8] KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault() Sean Christopherson
@ 2020-07-31 21:23 ` Sean Christopherson
  2020-07-31 21:23 ` [RFC PATCH 8/8] KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data() Sean Christopherson
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2020-07-31 21:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, eric van tassell, Tom Lendacky

Introduce a helper to directly (pun intended) fault-in a TDP page
without having to go through the full page fault path.  This allows
SEV to pin pages before booting the guest, provides the resulting pfn to
vendor code should it be needed in the future, and allows the RET_PF_*
enums to stay in mmu.c where they belong.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu.h     |  3 +++
 arch/x86/kvm/mmu/mmu.c | 25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 9f6554613babc..06f4475b8aad8 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -108,6 +108,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	return vcpu->arch.mmu->page_fault(vcpu, cr2_or_gpa, err, prefault);
 }
 
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       u32 error_code, int max_level);
+
 /*
  * Currently, we have two sorts of write-protection, a) the first one
  * write-protects guest page to sync the guest modification, b) another one is
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 92b133d7b1713..06dbc1bb79a6a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4271,6 +4271,31 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 				 max_level, true, &pfn);
 }
 
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       u32 error_code, int max_level)
+{
+	kvm_pfn_t pfn;
+	int r;
+
+	if (mmu_topup_memory_caches(vcpu, false))
+		return KVM_PFN_ERR_FAULT;
+
+	/*
+	 * Loop on the page fault path to handle the case where an mmu_notifier
+	 * invalidation triggers RET_PF_RETRY.  In the normal page fault path,
+	 * KVM needs to resume the guest in case the invalidation changed any
+	 * of the page fault properties, i.e. the gpa or error code.  For this
+	 * path, the gpa and error code are fixed by the caller, and the caller
+	 * expects failure if and only if the page fault can't be fixed.
+	 */
+	do {
+		r = direct_page_fault(vcpu, gpa, error_code, false, max_level,
+				      true, &pfn);
+	} while (r == RET_PF_RETRY && !is_error_noslot_pfn(pfn));
+	return pfn;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
+
 static void nonpaging_init_context(struct kvm_vcpu *vcpu,
 				   struct kvm_mmu *context)
 {
-- 
2.28.0



* [RFC PATCH 8/8] KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()
  2020-07-31 21:23 [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Sean Christopherson
                   ` (6 preceding siblings ...)
  2020-07-31 21:23 ` [RFC PATCH 7/8] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV Sean Christopherson
@ 2020-07-31 21:23 ` Sean Christopherson
  2020-08-03  3:00 ` [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Eric van Tassell
  2020-08-03 15:52 ` Brijesh Singh
  9 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2020-07-31 21:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, eric van tassell, Tom Lendacky

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/svm/sev.c | 117 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 112 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index f640b8beb443e..eb95914578497 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -15,6 +15,7 @@
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 
+#include "mmu.h"
 #include "x86.h"
 #include "svm.h"
 
@@ -415,6 +416,107 @@ static unsigned long get_num_contig_pages(unsigned long idx,
 	return pages;
 }
 
+#define SEV_PFERR (PFERR_WRITE_MASK | PFERR_USER_MASK)
+
+static void *sev_alloc_pages(unsigned long size, unsigned long *npages)
+{
+	/* TODO */
+	*npages = 0;
+	return NULL;
+}
+
+static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm,
+					      unsigned long hva)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *memslot;
+
+	kvm_for_each_memslot(memslot, slots) {
+		if (hva >= memslot->userspace_addr &&
+			hva < memslot->userspace_addr +
+				(memslot->npages << PAGE_SHIFT))
+			return memslot;
+	}
+
+	return NULL;
+}
+
+static gpa_t hva_to_gpa(struct kvm *kvm, unsigned long hva)
+{
+	struct kvm_memory_slot *memslot;
+	gpa_t gpa_offset;
+
+	memslot = hva_to_memslot(kvm, hva);
+	if (!memslot)
+		return UNMAPPED_GVA;
+
+	gpa_offset = hva - memslot->userspace_addr;
+	return ((memslot->base_gfn << PAGE_SHIFT) + gpa_offset);
+}
+
+static struct page **sev_pin_memory_in_mmu(struct kvm *kvm, unsigned long addr,
+					   unsigned long size,
+					   unsigned long *npages)
+{
+	struct kvm_vcpu *vcpu;
+	struct page **pages;
+	unsigned long i;
+	kvm_pfn_t pfn;
+	int idx, ret;
+	gpa_t gpa;
+
+	pages = sev_alloc_pages(size, npages);
+	if (!pages)
+		return ERR_PTR(-ENOMEM);
+
+	vcpu = kvm_get_vcpu(kvm, 0);
+	if (mutex_lock_killable(&vcpu->mutex)) {
+		kvfree(pages);
+		return ERR_PTR(-EINTR);
+	}
+
+	vcpu_load(vcpu);
+	idx = srcu_read_lock(&kvm->srcu);
+
+	kvm_mmu_load(vcpu);
+
+	for (i = 0; i < *npages; i++, addr += PAGE_SIZE) {
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			goto err;
+		}
+
+		if (need_resched())
+			cond_resched();
+
+		gpa = hva_to_gpa(kvm, addr);
+		if (gpa == UNMAPPED_GVA) {
+			ret = -EFAULT;
+			goto err;
+		}
+		pfn = kvm_mmu_map_tdp_page(vcpu, gpa, SEV_PFERR, PG_LEVEL_4K);
+		if (is_error_noslot_pfn(pfn)) {
+			ret = -EFAULT;
+			goto err;
+		}
+		pages[i] = pfn_to_page(pfn);
+		get_page(pages[i]);
+	}
+
+	srcu_read_unlock(&kvm->srcu, idx);
+	vcpu_put(vcpu);
+
+	mutex_unlock(&vcpu->mutex);
+	return pages;
+
+err:
+	for ( ; i; --i)
+		put_page(pages[i-1]);
+
+	kvfree(pages);
+	return ERR_PTR(ret);
+}
+
 static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 {
 	unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i;
@@ -439,9 +541,12 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	vaddr_end = vaddr + size;
 
 	/* Lock the user memory. */
-	inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
-	if (!inpages) {
-		ret = -ENOMEM;
+	if (atomic_read(&kvm->online_vcpus))
+		inpages = sev_pin_memory_in_mmu(kvm, vaddr, size, &npages);
+	else
+		inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
+	if (IS_ERR(inpages)) {
+		ret = PTR_ERR(inpages);
 		goto e_free;
 	}
 
@@ -449,9 +554,11 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	 * The LAUNCH_UPDATE command will perform in-place encryption of the
 	 * memory content (i.e it will write the same memory region with C=1).
 	 * It's possible that the cache may contain the data with C=0, i.e.,
-	 * unencrypted so invalidate it first.
+	 * unencrypted so invalidate it first.  Flushing is automatically
+	 * handled if the pages can be pinned in the MMU.
 	 */
-	sev_clflush_pages(inpages, npages);
+	if (!atomic_read(&kvm->online_vcpus))
+		sev_clflush_pages(inpages, npages);
 
 	for (i = 0; vaddr < vaddr_end; vaddr = next_vaddr, i += pages) {
 		int offset, len;
-- 
2.28.0



* Re: [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework
  2020-07-31 21:23 [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Sean Christopherson
                   ` (7 preceding siblings ...)
  2020-07-31 21:23 ` [RFC PATCH 8/8] KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data() Sean Christopherson
@ 2020-08-03  3:00 ` Eric van Tassell
  2020-08-03 15:00   ` Sean Christopherson
  2020-08-03 15:52 ` Brijesh Singh
  9 siblings, 1 reply; 14+ messages in thread
From: Eric van Tassell @ 2020-08-03  3:00 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, eric van tassell, Tom Lendacky

Sean,
	What commit did you base your series on?
	Thanks.

-evt(Eric van Tassell)

On 7/31/20 4:23 PM, Sean Christopherson wrote:
> SEV currently needs to pin guest memory as it doesn't support migrating
> encrypted pages.  Introduce a framework in KVM's MMU to support pinning
> pages on demand without requiring additional memory allocations, and with
> (somewhat hazy) line of sight toward supporting more advanced features for
> encrypted guest memory, e.g. host page migration.
> 
> The idea is to use a software available bit in the SPTE to track that a
> page has been pinned.  The decision to pin a page and the actual pinning
> management is handled by vendor code via kvm_x86_ops hooks.  There are
> intentionally two hooks (zap and unzap) introduced that are not needed for
> SEV.  I included them to again show how the flag (probably renamed?) could
> be used for more than just pin/unpin.
> 
> Bugs in the core implementation are pretty much guaranteed.  The basic
> concept has been tested, but in a fairly different incarnation.  Most
> notably, tagging PRESENT SPTEs as PINNED has not been tested, although
> using the PINNED flag to track zapped (and known to be pinned) SPTEs has
> been tested.  I cobbled this variation together fairly quickly to get the
> code out there for discussion.
> 
> The last patch to pin SEV pages during sev_launch_update_data() is
> incomplete; it's there to show how we might leverage MMU-based pinning to
> support pinning pages before the guest is live.
> 
> Sean Christopherson (8):
>    KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()
>    KVM: x86/mmu: Use bits 2:0 to check for present SPTEs
>    KVM: x86/mmu: Refactor handling of not-present SPTEs in mmu_set_spte()
>    KVM: x86/mmu: Add infrastructure for pinning PFNs on demand
>    KVM: SVM: Use the KVM MMU SPTE pinning hooks to pin pages on demand
>    KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
>    KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV
>    KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()
> 
>   arch/x86/include/asm/kvm_host.h |   7 ++
>   arch/x86/kvm/mmu.h              |   3 +
>   arch/x86/kvm/mmu/mmu.c          | 186 +++++++++++++++++++++++++-------
>   arch/x86/kvm/mmu/paging_tmpl.h  |   3 +-
>   arch/x86/kvm/svm/sev.c          | 141 +++++++++++++++++++++++-
>   arch/x86/kvm/svm/svm.c          |   3 +
>   arch/x86/kvm/svm/svm.h          |   3 +
>   7 files changed, 302 insertions(+), 44 deletions(-)
> 


* Re: [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework
  2020-08-03  3:00 ` [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Eric van Tassell
@ 2020-08-03 15:00   ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2020-08-03 15:00 UTC (permalink / raw)
  To: Eric van Tassell
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, eric van tassell, Tom Lendacky

On Sun, Aug 02, 2020 at 10:00:13PM -0500, Eric van Tassell wrote:
> Sean,
> 	What commit did you base your series  on?

Gah, sorry.

kvm/queue, commit 9c2475f3e46a1 ("KVM: Using macros instead of magic values").


* Re: [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework
  2020-07-31 21:23 [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Sean Christopherson
                   ` (8 preceding siblings ...)
  2020-08-03  3:00 ` [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework Eric van Tassell
@ 2020-08-03 15:52 ` Brijesh Singh
  2020-08-03 17:16   ` Sean Christopherson
  9 siblings, 1 reply; 14+ messages in thread
From: Brijesh Singh @ 2020-08-03 15:52 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: brijesh.singh, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, eric van tassell, Tom Lendacky

Thanks for the series, Sean. Some thoughts:


On 7/31/20 4:23 PM, Sean Christopherson wrote:
> SEV currently needs to pin guest memory as it doesn't support migrating
> encrypted pages.  Introduce a framework in KVM's MMU to support pinning
> pages on demand without requiring additional memory allocations, and with
> (somewhat hazy) line of sight toward supporting more advanced features for
> encrypted guest memory, e.g. host page migration.


Eric's attempt at lazy pinning suffers from the memory allocation
problem, and your series seems to address it. As you have noticed, the
current SEV enablement in KVM does not support migrating encrypted
pages. But recent SEV firmware (>= 0.17) provides support for migrating
encrypted pages (i.e. host page migration).

> The idea is to use a software available bit in the SPTE to track that a
> page has been pinned.  The decision to pin a page and the actual pinning
> management is handled by vendor code via kvm_x86_ops hooks.  There are
> intentionally two hooks (zap and unzap) introduced that are not needed for
> SEV.  I included them to again show how the flag (probably renamed?) could
> be used for more than just pin/unpin.

If using the available software bits to track pinning is acceptable,
then it can also be used for non-SEV guests (if needed). I will look
through your patch more carefully, but one immediate question: when do
we unpin the pages? In the case of SEV, once a page is pinned it should
not be unpinned until the guest terminates. If we unpin a page before
the VM terminates, there is a chance host page migration will kick in
and move it. The KVM MMU code may drop SPTEs during zap/unzap, which
happens a lot during guest execution; that would lead us down a path
where vendor-specific code unpins pages while the guest is running and
causes data corruption for the SEV guest.

> Bugs in the core implementation are pretty much guaranteed.  The basic
> concept has been tested, but in a fairly different incarnation.  Most
> notably, tagging PRESENT SPTEs as PINNED has not been tested, although
> using the PINNED flag to track zapped (and known to be pinned) SPTEs has
> been tested.  I cobbled this variation together fairly quickly to get the
> code out there for discussion.
>
> The last patch to pin SEV pages during sev_launch_update_data() is
> incomplete; it's there to show how we might leverage MMU-based pinning to
> support pinning pages before the guest is live.


I will add the SEV-specific bits and give this a try.

>
> Sean Christopherson (8):
>   KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()
>   KVM: x86/mmu: Use bits 2:0 to check for present SPTEs
>   KVM: x86/mmu: Refactor handling of not-present SPTEs in mmu_set_spte()
>   KVM: x86/mmu: Add infrastructure for pinning PFNs on demand
>   KVM: SVM: Use the KVM MMU SPTE pinning hooks to pin pages on demand
>   KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
>   KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV
>   KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()
>
>  arch/x86/include/asm/kvm_host.h |   7 ++
>  arch/x86/kvm/mmu.h              |   3 +
>  arch/x86/kvm/mmu/mmu.c          | 186 +++++++++++++++++++++++++-------
>  arch/x86/kvm/mmu/paging_tmpl.h  |   3 +-
>  arch/x86/kvm/svm/sev.c          | 141 +++++++++++++++++++++++-
>  arch/x86/kvm/svm/svm.c          |   3 +
>  arch/x86/kvm/svm/svm.h          |   3 +
>  7 files changed, 302 insertions(+), 44 deletions(-)
>


* Re: [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework
  2020-08-03 15:52 ` Brijesh Singh
@ 2020-08-03 17:16   ` Sean Christopherson
  2020-08-04 19:40     ` Brijesh Singh
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2020-08-03 17:16 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, eric van tassell, Tom Lendacky

On Mon, Aug 03, 2020 at 10:52:05AM -0500, Brijesh Singh wrote:
> Thanks for series Sean. Some thoughts
> 
> 
> On 7/31/20 4:23 PM, Sean Christopherson wrote:
> > SEV currently needs to pin guest memory as it doesn't support migrating
> > encrypted pages.  Introduce a framework in KVM's MMU to support pinning
> > pages on demand without requiring additional memory allocations, and with
> > (somewhat hazy) line of sight toward supporting more advanced features for
> > encrypted guest memory, e.g. host page migration.
> 
> 
> Eric's attempt to do a lazy pinning suffers with the memory allocation
> problem and your series seems to address it. As you have noticed,
> currently the SEV enablement  in the KVM does not support migrating the
> encrypted pages. But the recent SEV firmware provides a support to
> migrate the encrypted pages (e.g host page migration). The support is
> available in SEV FW >= 0.17.

I assume SEV also doesn't support ballooning?  Ballooning would be a good
first step toward page migration as I think it'd be easier for KVM to
support, e.g. only needs to deal with the "zap" and not the "move".

> > The idea is to use a software available bit in the SPTE to track that a
> > page has been pinned.  The decision to pin a page and the actual pinning
> > management is handled by vendor code via kvm_x86_ops hooks.  There are
> > intentionally two hooks (zap and unzap) introduced that are not needed for
> > SEV.  I included them to again show how the flag (probably renamed?) could
> > be used for more than just pin/unpin.
> 
> If using the available software bits for the tracking the pinning is
> acceptable then it can be used for the non-SEV guests (if needed). I
> will look through your patch more carefully but one immediate question,
> when do we unpin the pages? In the case of the SEV, once a page is
> pinned then it should not be unpinned until the guest terminates. If we
> unpin the page before the VM terminates then there is a  chance the host
> page migration will kick-in and move the pages. The KVM MMU code may
> call to drop the spte's during the zap/unzap and this happens a lot
> during a guest execution and it will lead us to the path where a vendor
> specific code will unpin the pages during the guest execution and cause
> a data corruption for the SEV guest.

The pages are unpinned by:

  drop_spte()
  |
  -> rmap_remove()
     |
     -> sev_drop_pinned_spte()


The intent is to allow unpinning pages when the mm_struct dies, i.e. when
the memory is no longer reachable (as opposed to when the last reference to
KVM is put), but typing that out, I realize there are dependencies and
assumptions that don't hold true for SEV as implemented.

  - Parent shadow pages won't be zapped.  Recycling MMU pages and zapping
    all SPs due to memslot updates are the two concerns.

    The easy way out for recycling is to not recycle SPs with pinned
    children, though that may or may not fly with VMM admins.

    I'm trying to resolve the memslot issue[*], but confirming that there's
    no longer an issue with not zapping everything is proving difficult as
    we haven't yet reproduced the original bug.

  - drop_large_spte() won't be invoked.  I believe the only semi-legitimate
    scenario is if the NX huge page workaround is toggled on while a VM is
    running.  Disallowing that if there is an SEV guest seems reasonable?

    There might be an issue with the host page size changing, but I don't
    think that can happen if the page is pinned.  That needs more
    investigation.


[*] https://lkml.kernel.org/r/20200703025047.13987-1-sean.j.christopherson@intel.com

> > Bugs in the core implementation are pretty much guaranteed.  The basic
> > concept has been tested, but in a fairly different incarnation.  Most
> > notably, tagging PRESENT SPTEs as PINNED has not been tested, although
> > using the PINNED flag to track zapped (and known to be pinned) SPTEs has
> > been tested.  I cobbled this variation together fairly quickly to get the
> > code out there for discussion.
> >
> > The last patch to pin SEV pages during sev_launch_update_data() is
> > incomplete; it's there to show how we might leverage MMU-based pinning to
> > support pinning pages before the guest is live.
> 
> 
> I will add the SEV specific bits and  give this a try.
> 
> >
> > Sean Christopherson (8):
> >   KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()
> >   KVM: x86/mmu: Use bits 2:0 to check for present SPTEs
> >   KVM: x86/mmu: Refactor handling of not-present SPTEs in mmu_set_spte()
> >   KVM: x86/mmu: Add infrastructure for pinning PFNs on demand
> >   KVM: SVM: Use the KVM MMU SPTE pinning hooks to pin pages on demand
> >   KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
> >   KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV
> >   KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()
> >
> >  arch/x86/include/asm/kvm_host.h |   7 ++
> >  arch/x86/kvm/mmu.h              |   3 +
> >  arch/x86/kvm/mmu/mmu.c          | 186 +++++++++++++++++++++++++-------
> >  arch/x86/kvm/mmu/paging_tmpl.h  |   3 +-
> >  arch/x86/kvm/svm/sev.c          | 141 +++++++++++++++++++++++-
> >  arch/x86/kvm/svm/svm.c          |   3 +
> >  arch/x86/kvm/svm/svm.h          |   3 +
> >  7 files changed, 302 insertions(+), 44 deletions(-)
> >


* Re: [RFC PATCH 0/8] KVM: x86/mmu: Introduce pinned SPTEs framework
  2020-08-03 17:16   ` Sean Christopherson
@ 2020-08-04 19:40     ` Brijesh Singh
  0 siblings, 0 replies; 14+ messages in thread
From: Brijesh Singh @ 2020-08-04 19:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, eric van tassell,
	Tom Lendacky


On 8/3/20 12:16 PM, Sean Christopherson wrote:
> On Mon, Aug 03, 2020 at 10:52:05AM -0500, Brijesh Singh wrote:
>> Thanks for series Sean. Some thoughts
>>
>>
>> On 7/31/20 4:23 PM, Sean Christopherson wrote:
>>> SEV currently needs to pin guest memory as it doesn't support migrating
>>> encrypted pages.  Introduce a framework in KVM's MMU to support pinning
>>> pages on demand without requiring additional memory allocations, and with
>>> (somewhat hazy) line of sight toward supporting more advanced features for
>>> encrypted guest memory, e.g. host page migration.
>>
>> Eric's attempt to do a lazy pinning suffers with the memory allocation
>> problem and your series seems to address it. As you have noticed,
>> currently the SEV enablement  in the KVM does not support migrating the
>> encrypted pages. But the recent SEV firmware provides a support to
>> migrate the encrypted pages (e.g host page migration). The support is
>> available in SEV FW >= 0.17.
> I assume SEV also doesn't support ballooning?  Ballooning would be a good
> first step toward page migration as I think it'd be easier for KVM to
> support, e.g. only needs to deal with the "zap" and not the "move".


Yes, ballooning does not work with SEV.


>
>>> The idea is to use a software available bit in the SPTE to track that a
>>> page has been pinned.  The decision to pin a page and the actual pinning
>>> management is handled by vendor code via kvm_x86_ops hooks.  There are
>>> intentionally two hooks (zap and unzap) introduced that are not needed for
>>> SEV.  I included them to again show how the flag (probably renamed?) could
>>> be used for more than just pin/unpin.
>> If using the available software bits for the tracking the pinning is
>> acceptable then it can be used for the non-SEV guests (if needed). I
>> will look through your patch more carefully but one immediate question,
>> when do we unpin the pages? In the case of the SEV, once a page is
>> pinned then it should not be unpinned until the guest terminates. If we
>> unpin the page before the VM terminates then there is a  chance the host
>> page migration will kick-in and move the pages. The KVM MMU code may
>> call to drop the spte's during the zap/unzap and this happens a lot
>> during a guest execution and it will lead us to the path where a vendor
>> specific code will unpin the pages during the guest execution and cause
>> a data corruption for the SEV guest.
> The pages are unpinned by:
>
>   drop_spte()
>   |
>   -> rmap_remove()
>      |
>      -> sev_drop_pinned_spte()
>
>
> The intent is to allow unpinning pages when the mm_struct dies, i.e. when
> the memory is no longer reachable (as opposed to when the last reference to
> KVM is put), but typing that out, I realize there are dependencies and
> assumptions that don't hold true for SEV as implemented.


So, I tried this RFC with an SEV guest (after adding some of the stuff
you highlighted below), and the guest fails randomly. I have seen two or
three types of failures: 1) boot, 2) kernbench execution, and 3) device
addition/removal; the failure signature is not consistent. I believe
that after addressing some of the dependencies we may be able to make
progress, but it will add new restrictions which did not exist before.

>
>   - Parent shadow pages won't be zapped.  Recycling MMU pages and zapping
>     all SPs due to memslot updates are the two concerns.
>
>     The easy way out for recycling is to not recycle SPs with pinned
>     children, though that may or may not fly with VMM admins.
>
>     I'm trying to resolve the memslot issue[*], but confirming that there's
>     no longer an issue with not zapping everything is proving difficult as
>     we haven't yet reproduced the original bug.
>
>   - drop_large_spte() won't be invoked.  I believe the only semi-legitimate
>     scenario is if the NX huge page workaround is toggled on while a VM is
>     running.  Disallowing that if there is an SEV guest seems reasonable?
>
>     There might be an issue with the host page size changing, but I don't
>     think that can happen if the page is pinned.  That needs more
>     investigation.
>
>
> [*] https://lkml.kernel.org/r/20200703025047.13987-1-sean.j.christopherson@intel.com
>
>>> Bugs in the core implementation are pretty much guaranteed.  The basic
>>> concept has been tested, but in a fairly different incarnation.  Most
>>> notably, tagging PRESENT SPTEs as PINNED has not been tested, although
>>> using the PINNED flag to track zapped (and known to be pinned) SPTEs has
>>> been tested.  I cobbled this variation together fairly quickly to get the
>>> code out there for discussion.
>>>
>>> The last patch to pin SEV pages during sev_launch_update_data() is
>>> incomplete; it's there to show how we might leverage MMU-based pinning to
>>> support pinning pages before the guest is live.
>>
>> I will add the SEV specific bits and  give this a try.
>>
>>> Sean Christopherson (8):
>>>   KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()
>>>   KVM: x86/mmu: Use bits 2:0 to check for present SPTEs
>>>   KVM: x86/mmu: Refactor handling of not-present SPTEs in mmu_set_spte()
>>>   KVM: x86/mmu: Add infrastructure for pinning PFNs on demand
>>>   KVM: SVM: Use the KVM MMU SPTE pinning hooks to pin pages on demand
>>>   KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
>>>   KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV
>>>   KVM: SVM: Pin SEV pages in MMU during sev_launch_update_data()
>>>
>>>  arch/x86/include/asm/kvm_host.h |   7 ++
>>>  arch/x86/kvm/mmu.h              |   3 +
>>>  arch/x86/kvm/mmu/mmu.c          | 186 +++++++++++++++++++++++++-------
>>>  arch/x86/kvm/mmu/paging_tmpl.h  |   3 +-
>>>  arch/x86/kvm/svm/sev.c          | 141 +++++++++++++++++++++++-
>>>  arch/x86/kvm/svm/svm.c          |   3 +
>>>  arch/x86/kvm/svm/svm.h          |   3 +
>>>  7 files changed, 302 insertions(+), 44 deletions(-)
>>>

