* [PATCH v5 0/4] KVM: allow mapping non-refcounted pages
@ 2021-11-29  3:43 David Stevens
  2021-11-29  3:43 ` [PATCH v5 1/4] KVM: mmu: introduce new gfn_to_pfn_page functions David Stevens
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: David Stevens @ 2021-11-29  3:43 UTC (permalink / raw)
  To: Marc Zyngier, Paolo Bonzini
  Cc: James Morse, Alexandru Elisei, Suzuki K Poulose, Will Deacon,
	Sean Christopherson, Wanpeng Li, Jim Mattson, Joerg Roedel,
	linux-arm-kernel, kvmarm, linux-kernel, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

This patch series adds support for mapping non-refcounted VM_IO and
VM_PFNMAP memory into the guest.

Currently, the gfn_to_pfn functions require being able to pin the target
pfn, so they will fail if the pfn returned by follow_pte isn't a
ref-counted page.  However, the KVM secondary MMUs do not require that
the pfn be pinned, since they are integrated with the mmu notifier API.
This series adds a new set of gfn_to_pfn_page functions which parallel
the gfn_to_pfn functions but do not pin the pfn. The new functions
return the page from gup if it was present, so callers can use it and
call put_page when done.

The gfn_to_pfn functions should be deprecated, since they are unsafe:
they rely on trying to obtain a struct page from a pfn returned by
follow_pte. I added new functions instead of simply adding another
optional parameter to the existing functions to make it easier to track
down users of the deprecated functions.

This series updates x86 and arm64 secondary MMUs to the new API.
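
The new API boils down to the following caller pattern (a minimal
sketch for illustration only, not code lifted from the series; error
handling is elided):

	struct page *page;
	kvm_pfn_t pfn;

	pfn = gfn_to_pfn_page(kvm, gfn, &page);
	if (is_error_noslot_pfn(pfn))
		return -EFAULT;

	/* ... install the pfn into the secondary MMU ... */

	/*
	 * Only the gup case holds a reference; pfns resolved via
	 * follow_pte must not have their refcount touched.
	 */
	if (page)
		put_page(page);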

v4 -> v5:
 - rebase on kvm next branch again
v3 -> v4:
 - rebase on kvm next branch again
 - Add some more context to a comment in ensure_pfn_ref
v2 -> v3:
 - rebase on kvm next branch
v1 -> v2:
 - Introduce new gfn_to_pfn_page functions instead of modifying the
   behavior of existing gfn_to_pfn functions, to make the change less
   invasive.
 - Drop changes to mmu_audit.c
 - Include Nicholas Piggin's patch to avoid corrupting the refcount in the
   follow_pte case, and use it in the deprecated gfn_to_pfn functions.
 - Rebase on kvm/next

David Stevens (4):
  KVM: mmu: introduce new gfn_to_pfn_page functions
  KVM: x86/mmu: use gfn_to_pfn_page
  KVM: arm64/mmu: use gfn_to_pfn_page
  KVM: mmu: remove over-aggressive warnings

 arch/arm64/kvm/mmu.c           |  27 +++--
 arch/x86/kvm/mmu.h             |   1 +
 arch/x86/kvm/mmu/mmu.c         |  25 ++---
 arch/x86/kvm/mmu/paging_tmpl.h |   9 +-
 arch/x86/kvm/x86.c             |   6 +-
 include/linux/kvm_host.h       |  17 +++
 virt/kvm/kvm_main.c            | 198 ++++++++++++++++++++++++---------
 7 files changed, 202 insertions(+), 81 deletions(-)

-- 
2.34.0.rc2.393.gf8c9666880-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v5 1/4] KVM: mmu: introduce new gfn_to_pfn_page functions
  2021-11-29  3:43 [PATCH v5 0/4] KVM: allow mapping non-refcounted pages David Stevens
@ 2021-11-29  3:43 ` David Stevens
  2021-12-30 19:26   ` Sean Christopherson
  2021-11-29  3:43 ` [PATCH v5 2/4] KVM: x86/mmu: use gfn_to_pfn_page David Stevens
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: David Stevens @ 2021-11-29  3:43 UTC (permalink / raw)
  To: Marc Zyngier, Paolo Bonzini
  Cc: James Morse, Alexandru Elisei, Suzuki K Poulose, Will Deacon,
	Sean Christopherson, Wanpeng Li, Jim Mattson, Joerg Roedel,
	linux-arm-kernel, kvmarm, linux-kernel, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

Introduce new gfn_to_pfn_page functions that parallel existing
gfn_to_pfn functions. The new functions are identical except they take
an additional out parameter that is used to return the struct page if
the hva was resolved by gup. This allows callers to differentiate the
gup and follow_pte cases, which in turn allows callers to only touch the
page refcount when necessitated by gup.

The old gfn_to_pfn functions are deprecated, and all callers should be
migrated to the new gfn_to_pfn_page functions. In the interim, the
gfn_to_pfn functions are reimplemented as wrappers of the corresponding
gfn_to_pfn_page functions. The wrappers take the reference to the pfn's
page that was previously taken in hva_to_pfn_remapped.
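
For callers being migrated (as the following patches do for x86 and
arm64), the conversion is mechanical; roughly, and with details elided:

	/* before */
	pfn = gfn_to_pfn_memslot(slot, gfn);
	...
	kvm_release_pfn_clean(pfn);

	/* after */
	pfn = gfn_to_pfn_page_memslot(slot, gfn, &page);
	...
	if (page)
		put_page(page);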

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 include/linux/kvm_host.h |  17 ++++
 virt/kvm/kvm_main.c      | 196 +++++++++++++++++++++++++++++----------
 2 files changed, 162 insertions(+), 51 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 60a35d9fe259..24b49c7aacf9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -861,6 +861,19 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool *async, bool write_fault,
 			       bool *writable, hva_t *hva);
 
+kvm_pfn_t gfn_to_pfn_page(struct kvm *kvm, gfn_t gfn, struct page **page);
+kvm_pfn_t gfn_to_pfn_page_prot(struct kvm *kvm, gfn_t gfn,
+			       bool write_fault, bool *writable,
+			       struct page **page);
+kvm_pfn_t gfn_to_pfn_page_memslot(struct kvm_memory_slot *slot,
+				  gfn_t gfn, struct page **page);
+kvm_pfn_t gfn_to_pfn_page_memslot_atomic(struct kvm_memory_slot *slot,
+					 gfn_t gfn, struct page **page);
+kvm_pfn_t __gfn_to_pfn_page_memslot(struct kvm_memory_slot *slot,
+				    gfn_t gfn, bool atomic, bool *async,
+				    bool write_fault, bool *writable,
+				    hva_t *hva, struct page **page);
+
 void kvm_release_pfn_clean(kvm_pfn_t pfn);
 void kvm_release_pfn_dirty(kvm_pfn_t pfn);
 void kvm_set_pfn_dirty(kvm_pfn_t pfn);
@@ -941,6 +954,10 @@ struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
 struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
 kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
 kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_page_atomic(struct kvm_vcpu *vcpu, gfn_t gfn,
+					  struct page **page);
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+				   struct page **page);
 int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map);
 int kvm_map_gfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map,
 		struct gfn_to_pfn_cache *cache, bool atomic);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3f6d450355f0..16a8a71f20bf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2230,9 +2230,9 @@ static inline int check_user_page_hwpoison(unsigned long addr)
  * only part that runs if we can in atomic context.
  */
 static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
-			    bool *writable, kvm_pfn_t *pfn)
+			    bool *writable, kvm_pfn_t *pfn,
+			    struct page **page)
 {
-	struct page *page[1];
 
 	/*
 	 * Fast pin a writable pfn only if it is a write fault request
@@ -2243,7 +2243,7 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 		return false;
 
 	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
-		*pfn = page_to_pfn(page[0]);
+		*pfn = page_to_pfn(*page);
 
 		if (writable)
 			*writable = true;
@@ -2258,10 +2258,9 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
  * 1 indicates success, -errno is returned if error is detected.
  */
 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
-			   bool *writable, kvm_pfn_t *pfn)
+			   bool *writable, kvm_pfn_t *pfn, struct page **page)
 {
 	unsigned int flags = FOLL_HWPOISON;
-	struct page *page;
 	int npages = 0;
 
 	might_sleep();
@@ -2274,7 +2273,7 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 	if (async)
 		flags |= FOLL_NOWAIT;
 
-	npages = get_user_pages_unlocked(addr, 1, &page, flags);
+	npages = get_user_pages_unlocked(addr, 1, page, flags);
 	if (npages != 1)
 		return npages;
 
@@ -2284,11 +2283,11 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 
 		if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
 			*writable = true;
-			put_page(page);
-			page = wpage;
+			put_page(*page);
+			*page = wpage;
 		}
 	}
-	*pfn = page_to_pfn(page);
+	*pfn = page_to_pfn(*page);
 	return npages;
 }
 
@@ -2303,13 +2302,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
 	return true;
 }
 
-static int kvm_try_get_pfn(kvm_pfn_t pfn)
-{
-	if (kvm_is_reserved_pfn(pfn))
-		return 1;
-	return get_page_unless_zero(pfn_to_page(pfn));
-}
-
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       unsigned long addr, bool *async,
 			       bool write_fault, bool *writable,
@@ -2349,26 +2341,6 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		*writable = pte_write(*ptep);
 	pfn = pte_pfn(*ptep);
 
-	/*
-	 * Get a reference here because callers of *hva_to_pfn* and
-	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
-	 * returned pfn.  This is only needed if the VMA has VM_MIXEDMAP
-	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
-	 * simply do nothing for reserved pfns.
-	 *
-	 * Whoever called remap_pfn_range is also going to call e.g.
-	 * unmap_mapping_range before the underlying pages are freed,
-	 * causing a call to our MMU notifier.
-	 *
-	 * Certain IO or PFNMAP mappings can be backed with valid
-	 * struct pages, but be allocated without refcounting e.g.,
-	 * tail pages of non-compound higher order allocations, which
-	 * would then underflow the refcount when the caller does the
-	 * required put_page. Don't allow those pages here.
-	 */ 
-	if (!kvm_try_get_pfn(pfn))
-		r = -EFAULT;
-
 out:
 	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
@@ -2390,8 +2362,9 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
  * 2): @write_fault = false && @writable, @writable will tell the caller
  *     whether the mapping is writable.
  */
-static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
-			bool write_fault, bool *writable)
+static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic,
+			    bool *async, bool write_fault, bool *writable,
+			    struct page **page)
 {
 	struct vm_area_struct *vma;
 	kvm_pfn_t pfn = 0;
@@ -2400,13 +2373,14 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 	/* we can do it either atomically or asynchronously, not both */
 	BUG_ON(atomic && async);
 
-	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
+	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn, page))
 		return pfn;
 
 	if (atomic)
 		return KVM_PFN_ERR_FAULT;
 
-	npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfn);
+	npages = hva_to_pfn_slow(addr, async, write_fault, writable,
+				 &pfn, page);
 	if (npages == 1)
 		return pfn;
 
@@ -2438,12 +2412,14 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 	return pfn;
 }
 
-kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
-			       bool atomic, bool *async, bool write_fault,
-			       bool *writable, hva_t *hva)
+kvm_pfn_t __gfn_to_pfn_page_memslot(struct kvm_memory_slot *slot,
+				    gfn_t gfn, bool atomic, bool *async,
+				    bool write_fault, bool *writable,
+				    hva_t *hva, struct page **page)
 {
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
+	*page = NULL;
 	if (hva)
 		*hva = addr;
 
@@ -2466,45 +2442,163 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 	}
 
 	return hva_to_pfn(addr, atomic, async, write_fault,
-			  writable);
+			  writable, page);
+}
+EXPORT_SYMBOL_GPL(__gfn_to_pfn_page_memslot);
+
+kvm_pfn_t gfn_to_pfn_page_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
+			       bool *writable, struct page **page)
+{
+	return __gfn_to_pfn_page_memslot(gfn_to_memslot(kvm, gfn), gfn, false,
+					 NULL, write_fault, writable, NULL,
+					 page);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_page_prot);
+
+kvm_pfn_t gfn_to_pfn_page_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+				  struct page **page)
+{
+	return __gfn_to_pfn_page_memslot(slot, gfn, false, NULL, true,
+					 NULL, NULL, page);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_page_memslot);
+
+kvm_pfn_t gfn_to_pfn_page_memslot_atomic(struct kvm_memory_slot *slot,
+					 gfn_t gfn, struct page **page)
+{
+	return __gfn_to_pfn_page_memslot(slot, gfn, true, NULL, true, NULL,
+					 NULL, page);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_page_memslot_atomic);
+
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_page_atomic(struct kvm_vcpu *vcpu, gfn_t gfn,
+					  struct page **page)
+{
+	return gfn_to_pfn_page_memslot_atomic(
+			kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn, page);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_page_atomic);
+
+kvm_pfn_t gfn_to_pfn_page(struct kvm *kvm, gfn_t gfn, struct page **page)
+{
+	return gfn_to_pfn_page_memslot(gfn_to_memslot(kvm, gfn), gfn, page);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_page);
+
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+				   struct page **page)
+{
+	return gfn_to_pfn_page_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn),
+				       gfn, page);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_page);
+
+static kvm_pfn_t ensure_pfn_ref(struct page *page, kvm_pfn_t pfn)
+{
+	if (page || is_error_pfn(pfn))
+		return pfn;
+
+	/*
+	 * If we're here, a pfn resolved by hva_to_pfn_remapped is
+	 * going to be returned to something that ultimately calls
+	 * kvm_release_pfn_clean, so the refcount needs to be bumped if
+	 * the pfn isn't a reserved pfn.
+	 *
+	 * Whoever called remap_pfn_range is also going to call e.g.
+	 * unmap_mapping_range before the underlying pages are freed,
+	 * causing a call to our MMU notifier.
+	 *
+	 * Certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages, but be allocated without refcounting e.g.,
+	 * tail pages of non-compound higher order allocations, which
+	 * would then underflow the refcount when the caller does the
+	 * required put_page. Don't allow those pages here.
+	 */
+	if (kvm_is_reserved_pfn(pfn) ||
+	    get_page_unless_zero(pfn_to_page(pfn)))
+		return pfn;
+
+	return KVM_PFN_ERR_FAULT;
+}
+
+kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+			       bool atomic, bool *async, bool write_fault,
+			       bool *writable, hva_t *hva)
+{
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = __gfn_to_pfn_page_memslot(slot, gfn, atomic, async,
+					write_fault, writable, hva, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
-	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
-				    write_fault, writable, NULL);
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = gfn_to_pfn_page_prot(kvm, gfn, write_fault, writable, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
 kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = gfn_to_pfn_page_memslot(slot, gfn, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
 kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL, NULL);
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = gfn_to_pfn_page_memslot_atomic(slot, gfn, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
 kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	return gfn_to_pfn_memslot_atomic(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = kvm_vcpu_gfn_to_pfn_page_atomic(vcpu, gfn, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_atomic);
 
 kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 {
-	return gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn);
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = gfn_to_pfn_page(kvm, gfn, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
 kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	return gfn_to_pfn_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = kvm_vcpu_gfn_to_pfn_page(vcpu, gfn, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn);
 
-- 
2.34.0.rc2.393.gf8c9666880-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v5 2/4] KVM: x86/mmu: use gfn_to_pfn_page
  2021-11-29  3:43 [PATCH v5 0/4] KVM: allow mapping non-refcounted pages David Stevens
  2021-11-29  3:43 ` [PATCH v5 1/4] KVM: mmu: introduce new gfn_to_pfn_page functions David Stevens
@ 2021-11-29  3:43 ` David Stevens
  2021-12-30 19:30   ` Sean Christopherson
  2021-11-29  3:43 ` [PATCH v5 3/4] KVM: arm64/mmu: " David Stevens
  2021-11-29  3:43 ` [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings David Stevens
  3 siblings, 1 reply; 18+ messages in thread
From: David Stevens @ 2021-11-29  3:43 UTC (permalink / raw)
  To: Marc Zyngier, Paolo Bonzini
  Cc: James Morse, Alexandru Elisei, Suzuki K Poulose, Will Deacon,
	Sean Christopherson, Wanpeng Li, Jim Mattson, Joerg Roedel,
	linux-arm-kernel, kvmarm, linux-kernel, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

Covert usages of the deprecated gfn_to_pfn functions to the new
gfn_to_pfn_page functions.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/x86/kvm/mmu.h             |  1 +
 arch/x86/kvm/mmu/mmu.c         | 18 +++++++++++-------
 arch/x86/kvm/mmu/paging_tmpl.h |  9 ++++++---
 arch/x86/kvm/x86.c             |  6 ++++--
 4 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 9ae6168d381e..97d94a9612b6 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -164,6 +164,7 @@ struct kvm_page_fault {
 	/* Outputs of kvm_faultin_pfn.  */
 	kvm_pfn_t pfn;
 	hva_t hva;
+	struct page *page;
 	bool map_writable;
 };
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 04c00c34517e..0626395ff1d9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2891,6 +2891,9 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	if (unlikely(fault->max_level == PG_LEVEL_4K))
 		return;
 
+	if (!fault->page)
+		return;
+
 	if (is_error_noslot_pfn(fault->pfn) || kvm_is_reserved_pfn(fault->pfn))
 		return;
 
@@ -3950,9 +3953,9 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	}
 
 	async = false;
-	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
-					  fault->write, &fault->map_writable,
-					  &fault->hva);
+	fault->pfn = __gfn_to_pfn_page_memslot(slot, fault->gfn, false, &async,
+					       fault->write, &fault->map_writable,
+					       &fault->hva, &fault->page);
 	if (!async)
 		return false; /* *pfn has correct page already */
 
@@ -3966,9 +3969,9 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 			goto out_retry;
 	}
 
-	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, NULL,
-					  fault->write, &fault->map_writable,
-					  &fault->hva);
+	fault->pfn = __gfn_to_pfn_page_memslot(slot, fault->gfn, false, NULL,
+					       fault->write, &fault->map_writable,
+					       &fault->hva, &fault->page);
 	return false;
 
 out_retry:
@@ -4029,7 +4032,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		read_unlock(&vcpu->kvm->mmu_lock);
 	else
 		write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->page)
+		put_page(fault->page);
 	return r;
 }
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index f87d36898c44..370d52f252a8 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -565,6 +565,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	unsigned pte_access;
 	gfn_t gfn;
 	kvm_pfn_t pfn;
+	struct page *page;
 
 	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
@@ -580,12 +581,13 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (!slot)
 		return false;
 
-	pfn = gfn_to_pfn_memslot_atomic(slot, gfn);
+	pfn = gfn_to_pfn_page_memslot_atomic(slot, gfn, &page);
 	if (is_error_pfn(pfn))
 		return false;
 
 	mmu_set_spte(vcpu, slot, spte, pte_access, gfn, pfn, NULL);
-	kvm_release_pfn_clean(pfn);
+	if (page)
+		put_page(page);
 	return true;
 }
 
@@ -923,7 +925,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->page)
+		put_page(fault->page);
 	return r;
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5c479ae57693..95f56ec43e0b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7820,6 +7820,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 {
 	gpa_t gpa = cr2_or_gpa;
 	kvm_pfn_t pfn;
+	struct page *page;
 
 	if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
 		return false;
@@ -7849,7 +7850,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	 * retry instruction -> write #PF -> emulation fail -> retry
 	 * instruction -> ...
 	 */
-	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
+	pfn = gfn_to_pfn_page(vcpu->kvm, gpa_to_gfn(gpa), &page);
 
 	/*
 	 * If the instruction failed on the error pfn, it can not be fixed,
@@ -7858,7 +7859,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	if (is_error_noslot_pfn(pfn))
 		return false;
 
-	kvm_release_pfn_clean(pfn);
+	if (page)
+		put_page(page);
 
 	/* The instructions are well-emulated on direct mmu. */
 	if (vcpu->arch.mmu->direct_map) {
-- 
2.34.0.rc2.393.gf8c9666880-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v5 3/4] KVM: arm64/mmu: use gfn_to_pfn_page
  2021-11-29  3:43 [PATCH v5 0/4] KVM: allow mapping non-refcounted pages David Stevens
  2021-11-29  3:43 ` [PATCH v5 1/4] KVM: mmu: introduce new gfn_to_pfn_page functions David Stevens
  2021-11-29  3:43 ` [PATCH v5 2/4] KVM: x86/mmu: use gfn_to_pfn_page David Stevens
@ 2021-11-29  3:43 ` David Stevens
  2021-12-30 19:45   ` Sean Christopherson
  2021-11-29  3:43 ` [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings David Stevens
  3 siblings, 1 reply; 18+ messages in thread
From: David Stevens @ 2021-11-29  3:43 UTC (permalink / raw)
  To: Marc Zyngier, Paolo Bonzini
  Cc: James Morse, Alexandru Elisei, Suzuki K Poulose, Will Deacon,
	Sean Christopherson, Wanpeng Li, Jim Mattson, Joerg Roedel,
	linux-arm-kernel, kvmarm, linux-kernel, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

Covert usages of the deprecated gfn_to_pfn functions to the new
gfn_to_pfn_page functions.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/arm64/kvm/mmu.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 326cdfec74a1..197fb8afbb94 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -829,7 +829,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
 static unsigned long
 transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
 			    unsigned long hva, kvm_pfn_t *pfnp,
-			    phys_addr_t *ipap)
+			    struct page **page, phys_addr_t *ipap)
 {
 	kvm_pfn_t pfn = *pfnp;
 
@@ -838,7 +838,8 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	 * sure that the HVA and IPA are sufficiently aligned and that the
 	 * block map is contained within the memslot.
 	 */
-	if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE) &&
+	if (*page &&
+	    fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE) &&
 	    get_user_mapping_size(kvm, hva) >= PMD_SIZE) {
 		/*
 		 * The address we faulted on is backed by a transparent huge
@@ -859,10 +860,11 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		 * page accordingly.
 		 */
 		*ipap &= PMD_MASK;
-		kvm_release_pfn_clean(pfn);
+		put_page(*page);
 		pfn &= ~(PTRS_PER_PMD - 1);
-		get_page(pfn_to_page(pfn));
 		*pfnp = pfn;
+		*page = pfn_to_page(pfn);
+		get_page(*page);
 
 		return PMD_SIZE;
 	}
@@ -955,6 +957,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	short vma_shift;
 	gfn_t gfn;
 	kvm_pfn_t pfn;
+	struct page *page;
 	bool logging_active = memslot_is_logging(memslot);
 	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
 	unsigned long vma_pagesize, fault_granule;
@@ -1056,8 +1059,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 */
 	smp_rmb();
 
-	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
-				   write_fault, &writable, NULL);
+	pfn = __gfn_to_pfn_page_memslot(memslot, gfn, false, NULL,
+					write_fault, &writable, NULL, &page);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
@@ -1102,7 +1105,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			vma_pagesize = fault_granule;
 		else
 			vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
-								   hva, &pfn,
+								   hva,
+								   &pfn, &page,
 								   &fault_ipa);
 	}
 
@@ -1142,14 +1146,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	/* Mark the page dirty only if the fault is handled successfully */
 	if (writable && !ret) {
-		kvm_set_pfn_dirty(pfn);
+		if (page)
+			kvm_set_pfn_dirty(pfn);
 		mark_page_dirty_in_slot(kvm, memslot, gfn);
 	}
 
 out_unlock:
 	spin_unlock(&kvm->mmu_lock);
-	kvm_set_pfn_accessed(pfn);
-	kvm_release_pfn_clean(pfn);
+	if (page) {
+		kvm_set_pfn_accessed(pfn);
+		put_page(page);
+	}
 	return ret != -EAGAIN ? ret : 0;
 }
 
-- 
2.34.0.rc2.393.gf8c9666880-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings
  2021-11-29  3:43 [PATCH v5 0/4] KVM: allow mapping non-refcounted pages David Stevens
                   ` (2 preceding siblings ...)
  2021-11-29  3:43 ` [PATCH v5 3/4] KVM: arm64/mmu: " David Stevens
@ 2021-11-29  3:43 ` David Stevens
  2021-12-30 19:22   ` Sean Christopherson
  3 siblings, 1 reply; 18+ messages in thread
From: David Stevens @ 2021-11-29  3:43 UTC (permalink / raw)
  To: Marc Zyngier, Paolo Bonzini
  Cc: James Morse, Alexandru Elisei, Suzuki K Poulose, Will Deacon,
	Sean Christopherson, Wanpeng Li, Jim Mattson, Joerg Roedel,
	linux-arm-kernel, kvmarm, linux-kernel, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

Remove two warnings that require ref counts for pages to be non-zero, as
mapped pfns from follow_pfn may not have an initialized ref count.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/x86/kvm/mmu/mmu.c | 7 -------
 virt/kvm/kvm_main.c    | 2 +-
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0626395ff1d9..7c4c7fededf0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -621,13 +621,6 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
 
 	pfn = spte_to_pfn(old_spte);
 
-	/*
-	 * KVM does not hold the refcount of the page used by
-	 * kvm mmu, before reclaiming the page, we should
-	 * unmap it from mmu first.
-	 */
-	WARN_ON(!kvm_is_reserved_pfn(pfn) && !page_count(pfn_to_page(pfn)));
-
 	if (is_accessed_spte(old_spte))
 		kvm_set_pfn_accessed(pfn);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 16a8a71f20bf..d81edcb3e107 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -170,7 +170,7 @@ bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
 	 * the device has been pinned, e.g. by get_user_pages().  WARN if the
 	 * page_count() is zero to help detect bad usage of this helper.
 	 */
-	if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
+	if (!pfn_valid(pfn) || !page_count(pfn_to_page(pfn)))
 		return false;
 
 	return is_zone_device_page(pfn_to_page(pfn));
-- 
2.34.0.rc2.393.gf8c9666880-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings
  2021-11-29  3:43 ` [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings David Stevens
@ 2021-12-30 19:22   ` Sean Christopherson
  2022-01-05  7:14     ` David Stevens
  0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2021-12-30 19:22 UTC (permalink / raw)
  To: David Stevens
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm

On Mon, Nov 29, 2021, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Remove two warnings that require ref counts for pages to be non-zero, as
> mapped pfns from follow_pfn may not have an initialized ref count.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  arch/x86/kvm/mmu/mmu.c | 7 -------
>  virt/kvm/kvm_main.c    | 2 +-
>  2 files changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0626395ff1d9..7c4c7fededf0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -621,13 +621,6 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
>  
>  	pfn = spte_to_pfn(old_spte);
>  
> -	/*
> -	 * KVM does not hold the refcount of the page used by
> -	 * kvm mmu, before reclaiming the page, we should
> -	 * unmap it from mmu first.
> -	 */
> -	WARN_ON(!kvm_is_reserved_pfn(pfn) && !page_count(pfn_to_page(pfn)));
> -
>  	if (is_accessed_spte(old_spte))
>  		kvm_set_pfn_accessed(pfn);
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 16a8a71f20bf..d81edcb3e107 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -170,7 +170,7 @@ bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
>  	 * the device has been pinned, e.g. by get_user_pages().  WARN if the
>  	 * page_count() is zero to help detect bad usage of this helper.

Stale comment.

>  	 */
> -	if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
> +	if (!pfn_valid(pfn) || !page_count(pfn_to_page(pfn)))

Hrm, I know the whole point of this series is to support pages without an elevated
refcount, but this WARN was extremely helpful in catching several use-after-free
bugs in the TDP MMU.  We talked about burying a slow check behind MMU_WARN_ON, but
that isn't very helpful because no one runs with MMU_WARN_ON, and this is also a
type of check that's most useful if it runs in production.

IIUC, this series explicitly disallows using pfns that have a struct page without
refcounting, and the issue with the WARN here is that kvm_is_zone_device_pfn() is
called by kvm_is_reserved_pfn() before ensure_pfn_ref() rejects problematic pages,
i.e. triggers false positive.

So, can't we preserve the use-after-free benefits of the check by moving it to
where KVM releases the PFN?  I.e.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fbca2e232e94..675b835525fa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2904,15 +2904,19 @@ EXPORT_SYMBOL_GPL(kvm_release_pfn_dirty);

 void kvm_set_pfn_dirty(kvm_pfn_t pfn)
 {
-       if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn))
+       if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn)) {
+               WARN_ON_ONCE(!page_count(pfn_to_page(pfn)));
                SetPageDirty(pfn_to_page(pfn));
+       }
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);

 void kvm_set_pfn_accessed(kvm_pfn_t pfn)
 {
-       if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn))
+       if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn)) {
+               WARN_ON_ONCE(!page_count(pfn_to_page(pfn)));
                mark_page_accessed(pfn_to_page(pfn));
+       }
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);

In a way, that's even better than the current check as it makes it more obvious
that the WARN is due to a use-after-free.

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 1/4] KVM: mmu: introduce new gfn_to_pfn_page functions
  2021-11-29  3:43 ` [PATCH v5 1/4] KVM: mmu: introduce new gfn_to_pfn_page functions David Stevens
@ 2021-12-30 19:26   ` Sean Christopherson
  0 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2021-12-30 19:26 UTC (permalink / raw)
  To: David Stevens
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm

On Mon, Nov 29, 2021, David Stevens wrote:
> +static kvm_pfn_t ensure_pfn_ref(struct page *page, kvm_pfn_t pfn)

"ensure" is rather misleading as that implies this is _just_ an assertion, but
that's not true since it elevates the refcount.  Maybe kvm_try_get_page_ref()?

> +{
> +	if (page || is_error_pfn(pfn))

A comment above here would be very helpful.  It's easy to overlook the "page"
check and think that KVM is double-counting pages.  E.g.

	/* If @page is valid, KVM already has a reference to the pfn/page. */

That would tie in nicely with the kvm_try_get_page_ref() name too.

> +		return pfn;
> +
> +	/*
> +	 * If we're here, a pfn resolved by hva_to_pfn_remapped is
> +	 * going to be returned to something that ultimately calls
> +	 * kvm_release_pfn_clean, so the refcount needs to be bumped if
> +	 * the pfn isn't a reserved pfn.
> +	 *
> +	 * Whoever called remap_pfn_range is also going to call e.g.
> +	 * unmap_mapping_range before the underlying pages are freed,
> +	 * causing a call to our MMU notifier.
> +	 *
> +	 * Certain IO or PFNMAP mappings can be backed with valid
> +	 * struct pages, but be allocated without refcounting e.g.,
> +	 * tail pages of non-compound higher order allocations, which
> +	 * would then underflow the refcount when the caller does the
> +	 * required put_page. Don't allow those pages here.
> +	 */
> +	if (kvm_is_reserved_pfn(pfn) ||
> +	    get_page_unless_zero(pfn_to_page(pfn)))
> +		return pfn;
> +
> +	return KVM_PFN_ERR_FAULT;
> +}
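
Folding the rename and the suggested comment together, the helper could
end up looking something like this (untested sketch):

static kvm_pfn_t kvm_try_get_page_ref(struct page *page, kvm_pfn_t pfn)
{
	/* If @page is valid, KVM already has a reference to the pfn/page. */
	if (page || is_error_pfn(pfn))
		return pfn;

	/* Reserved pfns aren't refcounted; see the existing comment above. */
	if (kvm_is_reserved_pfn(pfn) ||
	    get_page_unless_zero(pfn_to_page(pfn)))
		return pfn;

	return KVM_PFN_ERR_FAULT;
}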

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 2/4] KVM: x86/mmu: use gfn_to_pfn_page
  2021-11-29  3:43 ` [PATCH v5 2/4] KVM: x86/mmu: use gfn_to_pfn_page David Stevens
@ 2021-12-30 19:30   ` Sean Christopherson
  0 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2021-12-30 19:30 UTC (permalink / raw)
  To: David Stevens
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm

On Mon, Nov 29, 2021, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Covert usages of the deprecated gfn_to_pfn functions to the new
> gfn_to_pfn_page functions.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  arch/x86/kvm/mmu.h             |  1 +
>  arch/x86/kvm/mmu/mmu.c         | 18 +++++++++++-------
>  arch/x86/kvm/mmu/paging_tmpl.h |  9 ++++++---
>  arch/x86/kvm/x86.c             |  6 ++++--
>  4 files changed, 22 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 9ae6168d381e..97d94a9612b6 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -164,6 +164,7 @@ struct kvm_page_fault {
>  	/* Outputs of kvm_faultin_pfn.  */
>  	kvm_pfn_t pfn;
>  	hva_t hva;
> +	struct page *page;
>  	bool map_writable;
>  };
>  
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 04c00c34517e..0626395ff1d9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2891,6 +2891,9 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	if (unlikely(fault->max_level == PG_LEVEL_4K))
>  		return;
>  
> +	if (!fault->page)
> +		return;
> +
>  	if (is_error_noslot_pfn(fault->pfn) || kvm_is_reserved_pfn(fault->pfn))

These two checks can go away as they're made obsolete by the new !fault->page check.

>  		return;
>  

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/4] KVM: arm64/mmu: use gfn_to_pfn_page
  2021-11-29  3:43 ` [PATCH v5 3/4] KVM: arm64/mmu: " David Stevens
@ 2021-12-30 19:45   ` Sean Christopherson
  0 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2021-12-30 19:45 UTC (permalink / raw)
  To: David Stevens
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm

On Mon, Nov 29, 2021, David Stevens wrote:
> @@ -1142,14 +1146,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  	/* Mark the page dirty only if the fault is handled successfully */
>  	if (writable && !ret) {
> -		kvm_set_pfn_dirty(pfn);
> +		if (page)
> +			kvm_set_pfn_dirty(pfn);

If kvm_set_page_dirty() is changed to be less dumb:

		if (page)
			kvm_set_page_dirty(page);

>  		mark_page_dirty_in_slot(kvm, memslot, gfn);
>  	}
>  
>  out_unlock:
>  	spin_unlock(&kvm->mmu_lock);
> -	kvm_set_pfn_accessed(pfn);
> -	kvm_release_pfn_clean(pfn);
> +	if (page) {
> +		kvm_set_pfn_accessed(pfn);
> +		put_page(page);

Oof, KVM's helpers are stupid.  Take a page, convert it to a pfn, then convert it
back to a page, just to mark it dirty or put a ref.  Can you fold the below 
(completely untested) patch in before the x86/arm64 patches?  That way this code
can be:

	if (page)
		kvm_release_page_accessed(page);

and x86 can do:

	if (fault->page)
		kvm_release_page_clean(page);

instead of open-coding put_page().


From a8af0c60d7f6e77bbc7310d898211c43ae075cf8 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Thu, 30 Dec 2021 11:40:58 -0800
Subject: [PATCH] KVM: Clean up and enhance helpers for releasing pages/pfns

Tweak kvm_release_page_clean() and kvm_release_page_dirty() to avoid
pointlessly converting to a pfn and back to a page, and add an "accessed"
variant that will be used in a future arm64 patch.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8eb0f762a82c..f75129f641e9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2876,29 +2876,37 @@ void kvm_release_page_clean(struct page *page)
 {
 	WARN_ON(is_error_page(page));

-	kvm_release_pfn_clean(page_to_pfn(page));
+	put_page(page);
 }
 EXPORT_SYMBOL_GPL(kvm_release_page_clean);

 void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
-		put_page(pfn_to_page(pfn));
+		kvm_release_page_clean(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

+void kvm_release_page_accessed(struct page *page)
+{
+	mark_page_accessed(page);
+
+	kvm_release_page_clean(page);
+}
+EXPORT_SYMBOL_GPL(kvm_release_page_accessed);
+
 void kvm_release_page_dirty(struct page *page)
 {
-	WARN_ON(is_error_page(page));
+	SetPageDirty(page);

-	kvm_release_pfn_dirty(page_to_pfn(page));
+	kvm_release_page_clean(page);
 }
 EXPORT_SYMBOL_GPL(kvm_release_page_dirty);

 void kvm_release_pfn_dirty(kvm_pfn_t pfn)
 {
-	kvm_set_pfn_dirty(pfn);
-	kvm_release_pfn_clean(pfn);
+	if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn))
+		kvm_release_page_dirty(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_dirty);

--
2.34.1.448.ga2b2bfdf31-goog

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings
  2021-12-30 19:22   ` Sean Christopherson
@ 2022-01-05  7:14     ` David Stevens
  2022-01-05 19:02       ` Sean Christopherson
  0 siblings, 1 reply; 18+ messages in thread
From: David Stevens @ 2022-01-05  7:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm

On Fri, Dec 31, 2021 at 4:22 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Nov 29, 2021, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > Remove two warnings that require ref counts for pages to be non-zero, as
> > mapped pfns from follow_pfn may not have an initialized ref count.
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 7 -------
> >  virt/kvm/kvm_main.c    | 2 +-
> >  2 files changed, 1 insertion(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 0626395ff1d9..7c4c7fededf0 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -621,13 +621,6 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
> >
> >       pfn = spte_to_pfn(old_spte);
> >
> > -     /*
> > -      * KVM does not hold the refcount of the page used by
> > -      * kvm mmu, before reclaiming the page, we should
> > -      * unmap it from mmu first.
> > -      */
> > -     WARN_ON(!kvm_is_reserved_pfn(pfn) && !page_count(pfn_to_page(pfn)));
> > -
> >       if (is_accessed_spte(old_spte))
> >               kvm_set_pfn_accessed(pfn);
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 16a8a71f20bf..d81edcb3e107 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -170,7 +170,7 @@ bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
> >        * the device has been pinned, e.g. by get_user_pages().  WARN if the
> >        * page_count() is zero to help detect bad usage of this helper.
>
> Stale comment.
>
> >        */
> > -     if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
> > +     if (!pfn_valid(pfn) || !page_count(pfn_to_page(pfn)))
>
> Hrm, I know the whole point of this series is to support pages without an elevated
> refcount, but this WARN was extremely helpful in catching several use-after-free
> bugs in the TDP MMU.  We talked about burying a slow check behind MMU_WARN_ON, but
> that isn't very helpful because no one runs with MMU_WARN_ON, and this is also a
> type of check that's most useful if it runs in production.
>
> IIUC, this series explicitly disallows using pfns that have a struct page without
> refcounting, and the issue with the WARN here is that kvm_is_zone_device_pfn() is
> called by kvm_is_reserved_pfn() before ensure_pfn_ref() rejects problematic pages,
> i.e. triggers false positive.
>
> So, can't we preserve the use-after-free benefits of the check by moving it to
> where KVM releases the PFN?  I.e.
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fbca2e232e94..675b835525fa 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2904,15 +2904,19 @@ EXPORT_SYMBOL_GPL(kvm_release_pfn_dirty);
>
>  void kvm_set_pfn_dirty(kvm_pfn_t pfn)
>  {
> -       if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn))
> +       if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn)) {
> +               WARN_ON_ONCE(!page_count(pfn_to_page(pfn)));
>                 SetPageDirty(pfn_to_page(pfn));
> +       }
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);

I'm still seeing this warning show up via __handle_changed_spte
calling kvm_set_pfn_dirty:

[  113.350473]  kvm_set_pfn_dirty+0x26/0x3e
[  113.354861]  __handle_changed_spte+0x452/0x4f6
[  113.359841]  __handle_changed_spte+0x452/0x4f6
[  113.364819]  __handle_changed_spte+0x452/0x4f6
[  113.369790]  zap_gfn_range+0x1de/0x27a
[  113.373992]  kvm_tdp_mmu_zap_invalidated_roots+0x64/0xb8
[  113.379945]  kvm_mmu_zap_all_fast+0x18c/0x1c1
[  113.384827]  kvm_page_track_flush_slot+0x55/0x87
[  113.390000]  kvm_set_memslot+0x137/0x455
[  113.394394]  kvm_delete_memslot+0x5c/0x91
[  113.398888]  __kvm_set_memory_region+0x3c0/0x5e6
[  113.404061]  kvm_set_memory_region+0x45/0x74
[  113.408844]  kvm_vm_ioctl+0x563/0x60c

I wasn't seeing it for my particular test case, but the gfn aging code
might trigger the warning as well.

I don't know if setting the dirty/accessed bits in non-refcounted
struct pages is problematic. The only way I can see to avoid it would
be to try to map from the spte to the vma and then check its flags. If
setting the flags is benign, then we'd need to do that lookup to
differentiate the safe case from the use-after-free case. Do you have
any advice on how to handle this?

-David

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings
  2022-01-05  7:14     ` David Stevens
@ 2022-01-05 19:02       ` Sean Christopherson
  2022-01-05 19:19         ` Sean Christopherson
  0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2022-01-05 19:02 UTC (permalink / raw)
  To: David Stevens
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm

On Wed, Jan 05, 2022, David Stevens wrote:
> On Fri, Dec 31, 2021 at 4:22 AM Sean Christopherson <seanjc@google.com> wrote:
> > >        */
> > > -     if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
> > > +     if (!pfn_valid(pfn) || !page_count(pfn_to_page(pfn)))
> >
> > Hrm, I know the whole point of this series is to support pages without an elevated
> > refcount, but this WARN was extremely helpful in catching several use-after-free
> > bugs in the TDP MMU.  We talked about burying a slow check behind MMU_WARN_ON, but
> > that isn't very helpful because no one runs with MMU_WARN_ON, and this is also a
> > type of check that's most useful if it runs in production.
> >
> > IIUC, this series explicitly disallows using pfns that have a struct page without
> > refcounting, and the issue with the WARN here is that kvm_is_zone_device_pfn() is
> > called by kvm_is_reserved_pfn() before ensure_pfn_ref() rejects problematic pages,
> > i.e. triggers false positive.
> >
> > So, can't we preserve the use-after-free benefits of the check by moving it to
> > where KVM releases the PFN?  I.e.
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index fbca2e232e94..675b835525fa 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2904,15 +2904,19 @@ EXPORT_SYMBOL_GPL(kvm_release_pfn_dirty);
> >
> >  void kvm_set_pfn_dirty(kvm_pfn_t pfn)
> >  {
> > -       if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn))
> > +       if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn)) {
> > +               WARN_ON_ONCE(!page_count(pfn_to_page(pfn)));
> >                 SetPageDirty(pfn_to_page(pfn));
> > +       }
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
> 
> I'm still seeing this warning show up via __handle_changed_spte
> calling kvm_set_pfn_dirty:
> 
> [  113.350473]  kvm_set_pfn_dirty+0x26/0x3e
> [  113.354861]  __handle_changed_spte+0x452/0x4f6
> [  113.359841]  __handle_changed_spte+0x452/0x4f6
> [  113.364819]  __handle_changed_spte+0x452/0x4f6
> [  113.369790]  zap_gfn_range+0x1de/0x27a
> [  113.373992]  kvm_tdp_mmu_zap_invalidated_roots+0x64/0xb8
> [  113.379945]  kvm_mmu_zap_all_fast+0x18c/0x1c1
> [  113.384827]  kvm_page_track_flush_slot+0x55/0x87
> [  113.390000]  kvm_set_memslot+0x137/0x455
> [  113.394394]  kvm_delete_memslot+0x5c/0x91
> [  113.398888]  __kvm_set_memory_region+0x3c0/0x5e6
> [  113.404061]  kvm_set_memory_region+0x45/0x74
> [  113.408844]  kvm_vm_ioctl+0x563/0x60c
> 
> I wasn't seeing it for my particular test case, but the gfn aging code
> might trigger the warning as well.

Ah, I got royally confused by ensure_pfn_ref()'s comment

  * Certain IO or PFNMAP mappings can be backed with valid
  * struct pages, but be allocated without refcounting e.g.,
  * tail pages of non-compound higher order allocations, which
  * would then underflow the refcount when the caller does the
  * required put_page. Don't allow those pages here.
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
that doesn't apply here because kvm_faultin_pfn() uses the low level
__gfn_to_pfn_page_memslot().

and my understanding is that @page will be non-NULL in ensure_pfn_ref() iff the
page has an elevated refcount.

Can you update the changelogs for the x86+arm64 "use gfn_to_pfn_page" patches to
explicitly call out the various ramifications of moving to gfn_to_pfn_page()?

Side topic, s/covert/convert in both changelogs :-)

> I don't know if setting the dirty/accessed bits in non-refcounted
> struct pages is problematic.

Without knowing exactly what lies behind such pages, KVM needs to set dirty bits,
otherwise there's a potential for data loss.

> The only way I can see to avoid it would be to try to map from the spte to
> the vma and then check its flags. If setting the flags is benign, then we'd
> need to do that lookup to differentiate the safe case from the use-after-free
> case. Do you have any advice on how to handle this?

Hrm.  I can't think of a clever generic solution.  But for x86-64, we can use a
software-available bit to mark SPTEs as being refcounted and use that flag to assert
the refcount is elevated when marking the backing pfn dirty/accessed.  It'd be
64-bit only because we're out of software-available bits for PAE paging, but (a)
practically no one cares about 32-bit and (b) odds are slim that a use-after-free
would be unique to 32-bit KVM.
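
Very roughly, and 64-bit only (the bit number and helper name below are
purely illustrative, untested):

#define SPTE_HOST_REFCOUNTED	BIT_ULL(58)

static void mmu_spte_drop_pfn(u64 old_spte)
{
	kvm_pfn_t pfn = spte_to_pfn(old_spte);

	/* A refcounted SPTE must still hold its page reference here. */
	WARN_ON_ONCE((old_spte & SPTE_HOST_REFCOUNTED) &&
		     !page_count(pfn_to_page(pfn)));

	if (is_accessed_spte(old_spte))
		kvm_set_pfn_accessed(pfn);
	if (is_dirty_spte(old_spte))
		kvm_set_pfn_dirty(pfn);
}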

But that can all go in after your series is merged, e.g. I'd prefer to clean up
make_spte()'s prototype to use @fault before adding yet another parameter, and that'll
take a few patches to make happen since FNAME(sync_page) also uses make_spte().

TL;DR: continue as you were, I'll stop whining about this :-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings
  2022-01-05 19:02       ` Sean Christopherson
@ 2022-01-05 19:19         ` Sean Christopherson
  2022-01-06  2:42           ` David Stevens
  0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2022-01-05 19:19 UTC (permalink / raw)
  To: David Stevens
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm

On Wed, Jan 05, 2022, Sean Christopherson wrote:
> Ah, I got royally confused by ensure_pfn_ref()'s comment
> 
>   * Certain IO or PFNMAP mappings can be backed with valid
>   * struct pages, but be allocated without refcounting e.g.,
>   * tail pages of non-compound higher order allocations, which
>   * would then underflow the refcount when the caller does the
>   * required put_page. Don't allow those pages here.
>                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> that doesn't apply here because kvm_faultin_pfn() uses the low level
> __gfn_to_pfn_page_memslot().

On fifth thought, I think this is wrong and doomed to fail.  By mapping these pages
into the guest, KVM is effectively saying it supports these pages.  But if the guest
uses the corresponding gfns for an action that requires KVM to access the page,
e.g. via kvm_vcpu_map(), ensure_pfn_ref() will reject the access and all sorts of
bad things will happen to the guest.

So, why not fully reject these types of pages?  If someone is relying on KVM to
support these types of pages, then we'll fail fast and get a bug report letting us
know we need to properly support these types of pages.  And if not, then we reduce
KVM's complexity and I get to keep my precious WARN :-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings
  2022-01-05 19:19         ` Sean Christopherson
@ 2022-01-06  2:42           ` David Stevens
  2022-01-06 17:38             ` Sean Christopherson
  0 siblings, 1 reply; 18+ messages in thread
From: David Stevens @ 2022-01-06  2:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm,
	Chia-I Wu

On Thu, Jan 6, 2022 at 4:19 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Jan 05, 2022, Sean Christopherson wrote:
> > Ah, I got royally confused by ensure_pfn_ref()'s comment
> >
> >   * Certain IO or PFNMAP mappings can be backed with valid
> >   * struct pages, but be allocated without refcounting e.g.,
> >   * tail pages of non-compound higher order allocations, which
> >   * would then underflow the refcount when the caller does the
> >   * required put_page. Don't allow those pages here.
> >                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > that doesn't apply here because kvm_faultin_pfn() uses the low level
> > __gfn_to_pfn_page_memslot().
>
> On fifth thought, I think this is wrong and doomed to fail.  By mapping these pages
> into the guest, KVM is effectively saying it supports these pages.  But if the guest
> uses the corresponding gfns for an action that requires KVM to access the page,
> e.g. via kvm_vcpu_map(), ensure_pfn_ref() will reject the access and all sorts of
> bad things will happen to the guest.
>
> So, why not fully reject these types of pages?  If someone is relying on KVM to
> support these types of pages, then we'll fail fast and get a bug report letting us
> know we need to properly support these types of pages.  And if not, then we reduce
> KVM's complexity and I get to keep my precious WARN :-)

Our current use case here is virtio-gpu blob resources [1]. Blob
resources are useful because they avoid a guest shadow buffer and the
associated memcpys, and as I understand it they are also required for
virtualized vulkan.

One type of blob resource requires mapping dma-bufs allocated by the
host directly into the guest. This works on Intel platforms and the
ARM platforms I've tested. However, the amdgpu driver sometimes
allocates higher order, non-compound pages via ttm_pool_alloc_page.
These are the type of pages which KVM is currently rejecting. Is this
something that KVM can support?

+olv, who has done some of the blob resource work.

[1] https://patchwork.kernel.org/project/dri-devel/cover/20200814024000.2485-1-gurchetansingh@chromium.org/

-David

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings
  2022-01-06  2:42           ` David Stevens
@ 2022-01-06 17:38             ` Sean Christopherson
  2022-01-07  2:21               ` David Stevens
  0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2022-01-06 17:38 UTC (permalink / raw)
  To: David Stevens
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm,
	Chia-I Wu

On Thu, Jan 06, 2022, David Stevens wrote:
> On Thu, Jan 6, 2022 at 4:19 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Jan 05, 2022, Sean Christopherson wrote:
> > > Ah, I got royally confused by ensure_pfn_ref()'s comment
> > >
> > >   * Certain IO or PFNMAP mappings can be backed with valid
> > >   * struct pages, but be allocated without refcounting e.g.,
> > >   * tail pages of non-compound higher order allocations, which
> > >   * would then underflow the refcount when the caller does the
> > >   * required put_page. Don't allow those pages here.
> > >                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > that doesn't apply here because kvm_faultin_pfn() uses the low level
> > > __gfn_to_pfn_page_memslot().
> >
> > On fifth thought, I think this is wrong and doomed to fail.  By mapping these pages
> > into the guest, KVM is effectively saying it supports these pages.  But if the guest
> > uses the corresponding gfns for an action that requires KVM to access the page,
> > e.g. via kvm_vcpu_map(), ensure_pfn_ref() will reject the access and all sorts of
> > bad things will happen to the guest.
> >
> > So, why not fully reject these types of pages?  If someone is relying on KVM to
> > support these types of pages, then we'll fail fast and get a bug report letting us
> > know we need to properly support these types of pages.  And if not, then we reduce
> > KVM's complexity and I get to keep my precious WARN :-)
> 
> Our current use case here is virtio-gpu blob resources [1]. Blob
> resources are useful because they avoid a guest shadow buffer and the
> associated memcpys, and as I understand it they are also required for
> virtualized vulkan.
> 
> One type of blob resources requires mapping dma-bufs allocated by the
> host directly into the guest. This works on Intel platforms and the
> ARM platforms I've tested. However, the amdgpu driver sometimes
> allocates higher order, non-compound pages via ttm_pool_alloc_page.

Ah.  In the future, please provide this type of information in the cover letter,
and in this case, a paragraph in patch 01 is also warranted.  The context of _why_
is critical information, e.g. having something in the changelog explaining the use
case is very helpful for future developers wondering why on earth KVM supports
this type of odd behavior.

> These are the type of pages which KVM is currently rejecting. Is this
> something that KVM can support?

I'm not opposed to it.  My complaint is that this series is incomplete in that it
allows mapping the memory into the guest, but doesn't support accessing the memory
from KVM itself.  That means for things to work properly, KVM is relying on the
guest to use the memory in a limited capacity, e.g. isn't using the memory as
general purpose RAM.  That's not problematic for your use case, because presumably
the memory is used only by the vGPU, but as is KVM can't enforce that behavior in
any way.

The really gross part is that failures are not strictly punted to userspace;
the resulting error varies significantly depending on how the guest "illegally"
uses the memory.

My first choice would be to get the amdgpu driver "fixed", but that's likely an
unreasonable request since it sounds like the non-KVM behavior is working as intended.

One thought would be to require userspace to opt-in to mapping this type of memory
by introducing a new memslot flag that explicitly states that the memslot cannot
be accessed directly by KVM, i.e. can only be mapped into the guest.  That way,
KVM has an explicit ABI with respect to how it handles this type of memory, even
though the semantics of exactly what will happen if userspace/guest violates the
ABI are not well-defined.  And internally, KVM would also have a clear touchpoint
where it deliberately allows mapping such memslots, as opposed to the more implicit
behavior of bypassing ensure_pfn_ref().

If we're clever, we might even be able to share the flag with the "guest private
memory"[*] concept being pursued for confidential VMs.

[*] https://lore.kernel.org/all/20211223123011.41044-1-chao.p.peng@linux.intel.com
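
For illustration only, the opt-in could look roughly like the below from
userspace's side (fragment; the flag name, bit, and semantics are
placeholders rather than a concrete uAPI proposal, and vm_fd plus the
slot parameters are assumed to already be set up):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Hypothetical flag -- name and bit value are placeholders. */
  #define KVM_MEM_GUEST_MAP_ONLY  (1UL << 2)

  /*
   * Userspace opts in when registering the slot that backs the
   * host-allocated dma-buf: KVM maps the slot into the guest but
   * refuses to access it itself (kvm_vcpu_map() and friends),
   * instead of refusing the mapping outright.
   */
  struct kvm_userspace_memory_region region = {
          .slot            = slot_id,
          .flags           = KVM_MEM_GUEST_MAP_ONLY,
          .guest_phys_addr = gpa,
          .memory_size     = size,
          .userspace_addr  = (__u64)dmabuf_hva,
  };

  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);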

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings
  2022-01-06 17:38             ` Sean Christopherson
@ 2022-01-07  2:21               ` David Stevens
  2022-01-07 16:31                 ` Sean Christopherson
  0 siblings, 1 reply; 18+ messages in thread
From: David Stevens @ 2022-01-07  2:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm,
	Chia-I Wu

> > These are the type of pages which KVM is currently rejecting. Is this
> > something that KVM can support?
>
> I'm not opposed to it.  My complaint is that this series is incomplete in that it
> allows mapping the memory into the guest, but doesn't support accessing the memory
> from KVM itself.  That means for things to work properly, KVM is relying on the
> guest to use the memory in a limited capacity, e.g. isn't using the memory as
> general purpose RAM.  That's not problematic for your use case, because presumably
> the memory is used only by the vGPU, but as is KVM can't enforce that behavior in
> any way.
>
> The really gross part is that failures are not strictly punted to userspace;
> the resulting error varies significantly depending on how the guest "illegally"
> uses the memory.
>
> My first choice would be to get the amdgpu driver "fixed", but that's likely an
> unreasonable request since it sounds like the non-KVM behavior is working as intended.
>
> One thought would be to require userspace to opt-in to mapping this type of memory
> by introducing a new memslot flag that explicitly states that the memslot cannot
> be accessed directly by KVM, i.e. can only be mapped into the guest.  That way,
> KVM has an explicit ABI with respect to how it handles this type of memory, even
> though the semantics of exactly what will happen if userspace/guest violates the
> ABI are not well-defined.  And internally, KVM would also have a clear touchpoint
> where it deliberately allows mapping such memslots, as opposed to the more implicit
> behavior of bypassing ensure_pfn_ref().

Is it well defined when KVM needs to directly access a memslot? At
least for x86, it looks like most of the use cases are related to
nested virtualization, except for the call in
emulator_cmpxchg_emulated. Without being able to specifically state
what should be avoided, a flag like that would be difficult for
userspace to use.

> If we're clever, we might even be able to share the flag with the "guest private
> memory"[*] concept being pursued for confidential VMs.
>
> [*] https://lore.kernel.org/all/20211223123011.41044-1-chao.p.peng@linux.intel.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings
  2022-01-07  2:21               ` David Stevens
@ 2022-01-07 16:31                 ` Sean Christopherson
  2022-01-07 16:46                   ` Sean Christopherson
  2022-01-10 23:47                   ` David Stevens
  0 siblings, 2 replies; 18+ messages in thread
From: Sean Christopherson @ 2022-01-07 16:31 UTC (permalink / raw)
  To: David Stevens
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm,
	Chia-I Wu

On Fri, Jan 07, 2022, David Stevens wrote:
> > > These are the type of pages which KVM is currently rejecting. Is this
> > > something that KVM can support?
> >
> > I'm not opposed to it.  My complaint is that this series is incomplete in that it
> > allows mapping the memory into the guest, but doesn't support accessing the memory
> > from KVM itself.  That means for things to work properly, KVM is relying on the
> > guest to use the memory in a limited capacity, e.g. isn't using the memory as
> > general purpose RAM.  That's not problematic for your use case, because presumably
> > the memory is used only by the vGPU, but as is KVM can't enforce that behavior in
> > any way.
> >
> > The really gross part is that failures are not strictly punted to userspace;
> > the resulting error varies significantly depending on how the guest "illegally"
> > uses the memory.
> >
> > My first choice would be to get the amdgpu driver "fixed", but that's likely an
> > unreasonable request since it sounds like the non-KVM behavior is working as intended.
> >
> > One thought would be to require userspace to opt-in to mapping this type of memory
> > by introducing a new memslot flag that explicitly states that the memslot cannot
> > be accessed directly by KVM, i.e. can only be mapped into the guest.  That way,
> > KVM has an explicit ABI with respect to how it handles this type of memory, even
> > though the semantics of exactly what will happen if userspace/guest violates the
> > ABI are not well-defined.  And internally, KVM would also have a clear touchpoint
> > where it deliberately allows mapping such memslots, as opposed to the more implicit
> > behavior of bypassing ensure_pfn_ref().
> 
> Is it well defined when KVM needs to directly access a memslot?

Not really, there's certainly no established rule.

> At least for x86, it looks like most of the use cases are related to nested
> virtualization, except for the call in emulator_cmpxchg_emulated.

The emulator_cmpxchg_emulated() case will hopefully go away in the nearish future[*].
Paravirt features that communicate between guest and host via memory are the other
case that often maps a pfn into KVM.

> Without being able to specifically state what should be avoided, a flag like
> that would be difficult for userspace to use.

Yeah :-(  I was thinking KVM could state the flag would be safe to use if and only
if userspace could guarantee that the guest would use the memory for some "special"
use case, but hadn't actually thought about how to word things.

The best thing to do is probably to wait for kvm_vcpu_map() to be eliminated,
as described in the changelogs for commits:

  357a18ad230f ("KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache")
  7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")

Once that is done, everything in KVM will either access guest memory through the
userspace hva, or via a mechanism that is tied into the mmu_notifier, at which
point accessing non-refcounted struct pages is safe and just needs to worry about
not corrupting _refcount.
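
I.e. the only discipline left for callers would be roughly this (sketch
only, with the exact gfn_to_pfn_page signature elided):

  struct page *page = NULL;
  kvm_pfn_t pfn;

  pfn = __gfn_to_pfn_page_memslot(slot, gfn, ..., &page);

  /* ... install pfn in the secondary MMU, protected by mmu_notifier ... */

  /*
   * Only a page that actually came from gup holds a reference.  A pfn
   * that came from follow_pte() must not be handed to put_page(): its
   * _refcount may be zero, or it may not have a struct page at all.
   */
  if (page)
          put_page(page);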

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings
  2022-01-07 16:31                 ` Sean Christopherson
@ 2022-01-07 16:46                   ` Sean Christopherson
  2022-01-10 23:47                   ` David Stevens
  1 sibling, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2022-01-07 16:46 UTC (permalink / raw)
  To: David Stevens
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm,
	Chia-I Wu

On Fri, Jan 07, 2022, Sean Christopherson wrote:
> On Fri, Jan 07, 2022, David Stevens wrote:
> > > > These are the type of pages which KVM is currently rejecting. Is this
> > > > something that KVM can support?
> > >
> > > I'm not opposed to it.  My complaint is that this series is incomplete in that it
> > > allows mapping the memory into the guest, but doesn't support accessing the memory
> > > from KVM itself.  That means for things to work properly, KVM is relying on the
> > > guest to use the memory in a limited capacity, e.g. isn't using the memory as
> > > general purpose RAM.  That's not problematic for your use case, because presumably
> > > the memory is used only by the vGPU, but as is KVM can't enforce that behavior in
> > > any way.
> > >
> > > The really gross part is that failures are not strictly punted to userspace;
> > > the resulting error varies significantly depending on how the guest "illegally"
> > > uses the memory.
> > >
> > > My first choice would be to get the amdgpu driver "fixed", but that's likely an
> > > unreasonable request since it sounds like the non-KVM behavior is working as intended.
> > >
> > > One thought would be to require userspace to opt-in to mapping this type of memory
> > > by introducing a new memslot flag that explicitly states that the memslot cannot
> > > be accessed directly by KVM, i.e. can only be mapped into the guest.  That way,
> > > KVM has an explicit ABI with respect to how it handles this type of memory, even
> > > though the semantics of exactly what will happen if userspace/guest violates the
> > > ABI are not well-defined.  And internally, KVM would also have a clear touchpoint
> > > where it deliberately allows mapping such memslots, as opposed to the more implicit
> > > behavior of bypassing ensure_pfn_ref().
> > 
> > Is it well defined when KVM needs to directly access a memslot?
> 
> Not really, there's certainly no established rule.
> 
> > At least for x86, it looks like most of the use cases are related to nested
> > virtualization, except for the call in emulator_cmpxchg_emulated.
> 
> The emulator_cmpxchg_emulated() case will hopefully go away in the nearish future[*].

Forgot the link...

https://lore.kernel.org/all/YcG32Ytj0zUAW%2FB2@hirez.programming.kicks-ass.net/

> Paravirt features that communicate between guest and host via memory are the other
> case that often maps a pfn into KVM.
> 
> > Without being able to specifically state what should be avoided, a flag like
> > that would be difficult for userspace to use.
> 
> Yeah :-(  I was thinking KVM could state the flag would be safe to use if and only
> if userspace could guarantee that the guest would use the memory for some "special"
> use case, but hadn't actually thought about how to word things.
> 
> The best thing to do is probably to wait for kvm_vcpu_map() to be eliminated,
> as described in the changelogs for commits:
> 
>   357a18ad230f ("KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache")
>   7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")
> 
> Once that is done, everything in KVM will either access guest memory through the
> userspace hva, or via a mechanism that is tied into the mmu_notifier, at which
> point accessing non-refcounted struct pages is safe and just needs to worry about
> not corrupting _refcount.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings
  2022-01-07 16:31                 ` Sean Christopherson
  2022-01-07 16:46                   ` Sean Christopherson
@ 2022-01-10 23:47                   ` David Stevens
  1 sibling, 0 replies; 18+ messages in thread
From: David Stevens @ 2022-01-10 23:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Paolo Bonzini, James Morse, Alexandru Elisei,
	Suzuki K Poulose, Will Deacon, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-kernel, kvm,
	Chia-I Wu

> The best thing to do is probably to wait for kvm_vcpu_map() to be eliminated,
> as described in the changelogs for commits:
>
>   357a18ad230f ("KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache")
>   7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")
>
> Once that is done, everything in KVM will either access guest memory through the
> userspace hva, or via a mechanism that is tied into the mmu_notifier, at which
> point accessing non-refcounted struct pages is safe and just needs to worry about
> not corrupting _refcount.

That does sound like the best approach. I'll put this patch series on
hold until that work is done.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-01-10 23:48 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-29  3:43 [PATCH v5 0/4] KVM: allow mapping non-refcounted pages David Stevens
2021-11-29  3:43 ` [PATCH v5 1/4] KVM: mmu: introduce new gfn_to_pfn_page functions David Stevens
2021-12-30 19:26   ` Sean Christopherson
2021-11-29  3:43 ` [PATCH v5 2/4] KVM: x86/mmu: use gfn_to_pfn_page David Stevens
2021-12-30 19:30   ` Sean Christopherson
2021-11-29  3:43 ` [PATCH v5 3/4] KVM: arm64/mmu: " David Stevens
2021-12-30 19:45   ` Sean Christopherson
2021-11-29  3:43 ` [PATCH v5 4/4] KVM: mmu: remove over-aggressive warnings David Stevens
2021-12-30 19:22   ` Sean Christopherson
2022-01-05  7:14     ` David Stevens
2022-01-05 19:02       ` Sean Christopherson
2022-01-05 19:19         ` Sean Christopherson
2022-01-06  2:42           ` David Stevens
2022-01-06 17:38             ` Sean Christopherson
2022-01-07  2:21               ` David Stevens
2022-01-07 16:31                 ` Sean Christopherson
2022-01-07 16:46                   ` Sean Christopherson
2022-01-10 23:47                   ` David Stevens
