linux-kernel.vger.kernel.org archive mirror
* [PATCH v9 0/6] KVM: allow mapping non-refcounted pages
@ 2023-09-11  2:16 David Stevens
  2023-09-11  2:16 ` [PATCH v9 1/6] KVM: Assert that a page's refcount is elevated when marking accessed/dirty David Stevens
                   ` (7 more replies)
  0 siblings, 8 replies; 42+ messages in thread
From: David Stevens @ 2023-09-11  2:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm,
	David Stevens

From: David Stevens <stevensd@chromium.org>

This patch series adds support for mapping into the guest VM_IO and
VM_PFNMAP memory that is backed by struct pages which aren't currently
being refcounted (e.g. tail pages of non-compound higher order
allocations).

Our use case is virtio-gpu blob resources [1], which directly map host
graphics buffers into the guest as "vram" for the virtio-gpu device.
This feature currently does not work on systems using the amdgpu driver,
as that driver allocates non-compound higher order pages via
ttm_pool_alloc_page.

First, this series replaces the __gfn_to_pfn_memslot API with a more
extensible __kvm_follow_pfn API. The updated API rearranges
__gfn_to_pfn_memslot's arguments into a struct and, where possible,
packs the boolean arguments into a FOLL_ flags argument. The
refactoring does not change any behavior.

From there, this series extends the __kvm_follow_pfn API so that
non-refcounted pages can be safely handled. This involves adding an
input parameter to indicate whether the caller can safely use
non-refcounted pfns and an output parameter to tell the caller whether
or not the returned page is refcounted. This is a breaking change:
non-refcounted pfn mappings are now disallowed by default, as such
mappings are unsafe. To allow affected systems to continue to function,
an opt-in module parameter is added to re-enable the unsafe behavior.

This series only adds support for non-refcounted pages to x86. Other
MMUs can likely be updated without too much difficulty, but it is not
needed at this point. Updating other parts of KVM (e.g. pfncache) is not
straightforward [2].

[1]
https://patchwork.kernel.org/project/dri-devel/cover/20200814024000.2485-1-gurchetansingh@chromium.org/
[2] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/
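
For illustration, a fully migrated caller ends up looking roughly like
the following sketch (the struct fields and helpers are those added in
patches 2 and 3; this is not verbatim kernel code):

	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = write_fault ? FOLL_WRITE : 0,
		.try_map_writable = true,
		/* the mapping is torn down via mmu_notifier events */
		.guarded_by_mmu_notifier = true,
		/* the caller checks is_refcounted_page before put_page */
		.allow_non_refcounted_struct_page = true,
	};
	kvm_pfn_t pfn = __kvm_follow_pfn(&foll);

	if (is_error_noslot_pfn(pfn))
		return -EFAULT;

	/* ... install the pfn in the secondary MMU ... */

	/* Only refcounted pages hold a reference that must be dropped. */
	if (foll.is_refcounted_page)
		kvm_release_page_clean(pfn_to_page(pfn));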

v8 -> v9:
 - Make paying attention to is_refcounted_page mandatory. This means
   that FOLL_GET is no longer necessary. For compatibility with
   un-migrated callers, add a temporary parameter to sidestep
   ref-counting issues.
 - Add allow_unsafe_mappings, which is a breaking change.
 - Migrate kvm_vcpu_map and other callsites used by x86 to the new API.
 - Drop arm and ppc changes.
v7 -> v8:
 - Set access bits before releasing mmu_lock.
 - Pass FOLL_GET on 32-bit x86 or !tdp_enabled.
 - Refactor FOLL_GET handling, add kvm_follow_refcounted_pfn helper.
 - Set refcounted bit on >4k pages.
 - Add comments and apply formatting suggestions.
 - rebase on kvm next branch.
v6 -> v7:
 - Replace __gfn_to_pfn_memslot with a more flexible __kvm_faultin_pfn,
   and extend that API to support non-refcounted pages (complete
   rewrite).

David Stevens (5):
  KVM: mmu: Introduce __kvm_follow_pfn function
  KVM: mmu: Improve handling of non-refcounted pfns
  KVM: Migrate kvm_vcpu_map to __kvm_follow_pfn
  KVM: x86: Migrate to __kvm_follow_pfn
  KVM: x86/mmu: Handle non-refcounted pages

Sean Christopherson (1):
  KVM: Assert that a page's refcount is elevated when marking
    accessed/dirty

 arch/x86/kvm/mmu/mmu.c          |  93 +++++++---
 arch/x86/kvm/mmu/mmu_internal.h |   1 +
 arch/x86/kvm/mmu/paging_tmpl.h  |   8 +-
 arch/x86/kvm/mmu/spte.c         |   4 +-
 arch/x86/kvm/mmu/spte.h         |  12 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  22 ++-
 arch/x86/kvm/x86.c              |  12 +-
 include/linux/kvm_host.h        |  42 ++++-
 virt/kvm/kvm_main.c             | 294 +++++++++++++++++++-------------
 virt/kvm/kvm_mm.h               |   3 +-
 virt/kvm/pfncache.c             |  11 +-
 11 files changed, 339 insertions(+), 163 deletions(-)

-- 
2.42.0.283.g2d96d420d3-goog



* [PATCH v9 1/6] KVM: Assert that a page's refcount is elevated when marking accessed/dirty
  2023-09-11  2:16 [PATCH v9 0/6] KVM: allow mapping non-refcounted pages David Stevens
@ 2023-09-11  2:16 ` David Stevens
  2023-09-11  2:16 ` [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function David Stevens
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 42+ messages in thread
From: David Stevens @ 2023-09-11  2:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

From: Sean Christopherson <seanjc@google.com>

Assert that a page's refcount is elevated, i.e. that _something_ holds a
reference to the page, when KVM marks a page as accessed and/or dirty.
KVM typically doesn't hold a reference to pages that are mapped into the
guest, e.g. to allow page migration, compaction, swap, etc., and instead
relies on mmu_notifiers to react to changes in the primary MMU.

Incorrect handling of mmu_notifier events (or similar mechanisms) can
result in KVM keeping a mapping beyond the lifetime of the backing page,
i.e. can (and often does) result in use-after-free.  Yelling if KVM marks
a freed page as accessed/dirty doesn't prevent badness as KVM usually
only does A/D updates when unmapping memory from the guest, i.e. the
assertion fires well after an underlying bug has occurred, but yelling
does help detect, triage, and debug use-after-free bugs.

Note, the assertion must use page_count(), NOT page_ref_count()!  For
hugepages, the returned struct page may be a tail page and thus not have
its own refcount.
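
As a reminder of the distinction (a simplified sketch based on the
page_ref helpers, not the code added below): page_ref_count() reads the
page's own _refcount, which is typically zero for a hugepage tail page,
while page_count() operates on the compound head, which is where the
actual reference is held.

	/* Sketch only, illustrating why page_count() is the right check. */
	static bool page_has_elevated_refcount(struct page *page)
	{
		/* page_count(page) == page_ref_count(compound_head(page)) */
		return page_count(page) > 0;
	}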

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d63cf1c4f5a7..ee6090ecb1fe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2914,6 +2914,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
 
 static bool kvm_is_ad_tracked_page(struct page *page)
 {
+	/*
+	 * Assert that KVM isn't attempting to mark a freed page as Accessed or
+	 * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug.  KVM
+	 * (typically) doesn't pin pages that are mapped in KVM's MMU, and
+	 * instead relies on mmu_notifiers to know when a mapping needs to be
+	 * zapped/invalidated.  Unmapping from KVM's MMU must happen _before_
+	 * KVM returns from its mmu_notifier, i.e. the page should have an
+	 * elevated refcount at this point even though KVM doesn't hold a
+	 * reference of its own.
+	 */
+	if (WARN_ON_ONCE(!page_count(page)))
+		return false;
+
 	/*
 	 * Per page-flags.h, pages tagged PG_reserved "should in general not be
 	 * touched (e.g. set dirty) except by its owner".
-- 
2.42.0.283.g2d96d420d3-goog



* [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function
  2023-09-11  2:16 [PATCH v9 0/6] KVM: allow mapping non-refcounted pages David Stevens
  2023-09-11  2:16 ` [PATCH v9 1/6] KVM: Assert that a page's refcount is elevated when marking accessed/dirty David Stevens
@ 2023-09-11  2:16 ` David Stevens
  2023-10-03 16:54   ` Maxim Levitsky
                     ` (3 more replies)
  2023-09-11  2:16 ` [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns David Stevens
                   ` (5 subsequent siblings)
  7 siblings, 4 replies; 42+ messages in thread
From: David Stevens @ 2023-09-11  2:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm,
	David Stevens

From: David Stevens <stevensd@chromium.org>

Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
__kvm_follow_pfn refactors the old API's arguments into a struct and,
where possible, combines the boolean arguments into a single flags
argument.
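
For example, a call that previously looked like

	pfn = __gfn_to_pfn_memslot(slot, gfn, false, false, NULL,
				   write_fault, &writable, &hva);

translates to the new API roughly as follows (sketch; the compatibility
wrapper kept in this patch performs exactly this translation):

	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = write_fault ? FOLL_WRITE : 0,
		/* a non-NULL 'writable' out-param becomes try_map_writable */
		.try_map_writable = true,
	};

	pfn = __kvm_follow_pfn(&foll);
	hva = foll.hva;
	writable = foll.writable;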

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 include/linux/kvm_host.h |  16 ++++
 virt/kvm/kvm_main.c      | 171 ++++++++++++++++++++++-----------------
 virt/kvm/kvm_mm.h        |   3 +-
 virt/kvm/pfncache.c      |  10 ++-
 4 files changed, 123 insertions(+), 77 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb6c6109fdca..c2e0ddf14dba 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -97,6 +97,7 @@
 #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
 #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
 #define KVM_PFN_ERR_SIGPENDING	(KVM_PFN_ERR_MASK + 3)
+#define KVM_PFN_ERR_NEEDS_IO	(KVM_PFN_ERR_MASK + 4)
 
 /*
  * error pfns indicate that the gfn is in slot but faild to
@@ -1177,6 +1178,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 
+struct kvm_follow_pfn {
+	const struct kvm_memory_slot *slot;
+	gfn_t gfn;
+	unsigned int flags;
+	bool atomic;
+	/* Try to create a writable mapping even for a read fault */
+	bool try_map_writable;
+
+	/* Outputs of __kvm_follow_pfn */
+	hva_t hva;
+	bool writable;
+};
+
+kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
+
 kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ee6090ecb1fe..9b33a59c6d65 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2512,8 +2512,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
  * true indicates success, otherwise false is returned.  It's also the
  * only part that runs if we can in atomic context.
  */
-static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
-			    bool *writable, kvm_pfn_t *pfn)
+static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 {
 	struct page *page[1];
 
@@ -2522,14 +2521,12 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 	 * or the caller allows to map a writable pfn for a read fault
 	 * request.
 	 */
-	if (!(write_fault || writable))
+	if (!((foll->flags & FOLL_WRITE) || foll->try_map_writable))
 		return false;
 
-	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
 		*pfn = page_to_pfn(page[0]);
-
-		if (writable)
-			*writable = true;
+		foll->writable = true;
 		return true;
 	}
 
@@ -2540,35 +2537,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
  * The slow path to get the pfn of the specified host virtual address,
  * 1 indicates success, -errno is returned if error is detected.
  */
-static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
-			   bool interruptible, bool *writable, kvm_pfn_t *pfn)
+static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 {
-	unsigned int flags = FOLL_HWPOISON;
+	unsigned int flags = FOLL_HWPOISON | foll->flags;
 	struct page *page;
 	int npages;
 
 	might_sleep();
 
-	if (writable)
-		*writable = write_fault;
-
-	if (write_fault)
-		flags |= FOLL_WRITE;
-	if (async)
-		flags |= FOLL_NOWAIT;
-	if (interruptible)
-		flags |= FOLL_INTERRUPTIBLE;
-
-	npages = get_user_pages_unlocked(addr, 1, &page, flags);
+	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
 	if (npages != 1)
 		return npages;
 
-	/* map read fault as writable if possible */
-	if (unlikely(!write_fault) && writable) {
+	if (foll->flags & FOLL_WRITE) {
+		foll->writable = true;
+	} else if (foll->try_map_writable) {
 		struct page *wpage;
 
-		if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
-			*writable = true;
+		/* map read fault as writable if possible */
+		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
+			foll->writable = true;
 			put_page(page);
 			page = wpage;
 		}
@@ -2599,23 +2587,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
 }
 
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
-			       unsigned long addr, bool write_fault,
-			       bool *writable, kvm_pfn_t *p_pfn)
+			       struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn)
 {
 	kvm_pfn_t pfn;
 	pte_t *ptep;
 	pte_t pte;
 	spinlock_t *ptl;
+	bool write_fault = foll->flags & FOLL_WRITE;
 	int r;
 
-	r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
+	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
 	if (r) {
 		/*
 		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
 		 * not call the fault handler, so do it here.
 		 */
 		bool unlocked = false;
-		r = fixup_user_fault(current->mm, addr,
+		r = fixup_user_fault(current->mm, foll->hva,
 				     (write_fault ? FAULT_FLAG_WRITE : 0),
 				     &unlocked);
 		if (unlocked)
@@ -2623,7 +2611,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		if (r)
 			return r;
 
-		r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
+		r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
 		if (r)
 			return r;
 	}
@@ -2635,8 +2623,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (writable)
-		*writable = pte_write(pte);
+	foll->writable = pte_write(pte);
 	pfn = pte_pfn(pte);
 
 	/*
@@ -2681,24 +2668,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
  * 2): @write_fault = false && @writable, @writable will tell the caller
  *     whether the mapping is writable.
  */
-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
-		     bool *async, bool write_fault, bool *writable)
+kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
 {
 	struct vm_area_struct *vma;
 	kvm_pfn_t pfn;
 	int npages, r;
 
 	/* we can do it either atomically or asynchronously, not both */
-	BUG_ON(atomic && async);
+	BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
 
-	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
+	if (hva_to_pfn_fast(foll, &pfn))
 		return pfn;
 
-	if (atomic)
+	if (foll->atomic)
 		return KVM_PFN_ERR_FAULT;
 
-	npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
-				 writable, &pfn);
+	npages = hva_to_pfn_slow(foll, &pfn);
 	if (npages == 1)
 		return pfn;
 	if (npages == -EINTR)
@@ -2706,83 +2691,123 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
 
 	mmap_read_lock(current->mm);
 	if (npages == -EHWPOISON ||
-	      (!async && check_user_page_hwpoison(addr))) {
+	    (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
 		pfn = KVM_PFN_ERR_HWPOISON;
 		goto exit;
 	}
 
 retry:
-	vma = vma_lookup(current->mm, addr);
+	vma = vma_lookup(current->mm, foll->hva);
 
 	if (vma == NULL)
 		pfn = KVM_PFN_ERR_FAULT;
 	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
-		r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
+		r = hva_to_pfn_remapped(vma, foll, &pfn);
 		if (r == -EAGAIN)
 			goto retry;
 		if (r < 0)
 			pfn = KVM_PFN_ERR_FAULT;
 	} else {
-		if (async && vma_is_valid(vma, write_fault))
-			*async = true;
-		pfn = KVM_PFN_ERR_FAULT;
+		if ((foll->flags & FOLL_NOWAIT) &&
+		    vma_is_valid(vma, foll->flags & FOLL_WRITE))
+			pfn = KVM_PFN_ERR_NEEDS_IO;
+		else
+			pfn = KVM_PFN_ERR_FAULT;
 	}
 exit:
 	mmap_read_unlock(current->mm);
 	return pfn;
 }
 
-kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
-			       bool atomic, bool interruptible, bool *async,
-			       bool write_fault, bool *writable, hva_t *hva)
+kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
 {
-	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
+	foll->writable = false;
+	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
+				      foll->flags & FOLL_WRITE);
 
-	if (hva)
-		*hva = addr;
-
-	if (addr == KVM_HVA_ERR_RO_BAD) {
-		if (writable)
-			*writable = false;
+	if (foll->hva == KVM_HVA_ERR_RO_BAD)
 		return KVM_PFN_ERR_RO_FAULT;
-	}
 
-	if (kvm_is_error_hva(addr)) {
-		if (writable)
-			*writable = false;
+	if (kvm_is_error_hva(foll->hva))
 		return KVM_PFN_NOSLOT;
-	}
 
-	/* Do not map writable pfn in the readonly memslot. */
-	if (writable && memslot_is_readonly(slot)) {
-		*writable = false;
-		writable = NULL;
-	}
+	if (memslot_is_readonly(foll->slot))
+		foll->try_map_writable = false;
 
-	return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
-			  writable);
+	return hva_to_pfn(foll);
+}
+EXPORT_SYMBOL_GPL(__kvm_follow_pfn);
+
+kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
+			       bool atomic, bool interruptible, bool *async,
+			       bool write_fault, bool *writable, hva_t *hva)
+{
+	kvm_pfn_t pfn;
+	struct kvm_follow_pfn foll = {
+		.slot = slot,
+		.gfn = gfn,
+		.flags = 0,
+		.atomic = atomic,
+		.try_map_writable = !!writable,
+	};
+
+	if (write_fault)
+		foll.flags |= FOLL_WRITE;
+	if (async)
+		foll.flags |= FOLL_NOWAIT;
+	if (interruptible)
+		foll.flags |= FOLL_INTERRUPTIBLE;
+
+	pfn = __kvm_follow_pfn(&foll);
+	if (pfn == KVM_PFN_ERR_NEEDS_IO) {
+		*async = true;
+		pfn = KVM_PFN_ERR_FAULT;
+	}
+	if (hva)
+		*hva = foll.hva;
+	if (writable)
+		*writable = foll.writable;
+	return pfn;
 }
 EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
-	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
-				    NULL, write_fault, writable, NULL);
+	kvm_pfn_t pfn;
+	struct kvm_follow_pfn foll = {
+		.slot = gfn_to_memslot(kvm, gfn),
+		.gfn = gfn,
+		.flags = write_fault ? FOLL_WRITE : 0,
+		.try_map_writable = !!writable,
+	};
+	pfn = __kvm_follow_pfn(&foll);
+	if (writable)
+		*writable = foll.writable;
+	return pfn;
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
 kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
-				    NULL, NULL);
+	struct kvm_follow_pfn foll = {
+		.slot = slot,
+		.gfn = gfn,
+		.flags = FOLL_WRITE,
+	};
+	return __kvm_follow_pfn(&foll);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
 kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
-				    NULL, NULL);
+	struct kvm_follow_pfn foll = {
+		.slot = slot,
+		.gfn = gfn,
+		.flags = FOLL_WRITE,
+		.atomic = true,
+	};
+	return __kvm_follow_pfn(&foll);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 180f1a09e6ba..ed896aee5396 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -20,8 +20,7 @@
 #define KVM_MMU_UNLOCK(kvm)		spin_unlock(&(kvm)->mmu_lock)
 #endif /* KVM_HAVE_MMU_RWLOCK */
 
-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
-		     bool *async, bool write_fault, bool *writable);
+kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);
 
 #ifdef CONFIG_HAVE_KVM_PFNCACHE
 void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 2d6aba677830..86cd40acad11 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 	kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
 	void *new_khva = NULL;
 	unsigned long mmu_seq;
+	struct kvm_follow_pfn foll = {
+		.slot = gpc->memslot,
+		.gfn = gpa_to_gfn(gpc->gpa),
+		.flags = FOLL_WRITE,
+		.hva = gpc->uhva,
+	};
 
 	lockdep_assert_held(&gpc->refresh_lock);
 
@@ -182,8 +188,8 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 			cond_resched();
 		}
 
-		/* We always request a writeable mapping */
-		new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
+		/* We always request a writable mapping */
+		new_pfn = hva_to_pfn(&foll);
 		if (is_error_noslot_pfn(new_pfn))
 			goto out_error;
 
-- 
2.42.0.283.g2d96d420d3-goog



* [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns
  2023-09-11  2:16 [PATCH v9 0/6] KVM: allow mapping non-refcounted pages David Stevens
  2023-09-11  2:16 ` [PATCH v9 1/6] KVM: Assert that a page's refcount is elevated when marking accessed/dirty David Stevens
  2023-09-11  2:16 ` [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function David Stevens
@ 2023-09-11  2:16 ` David Stevens
  2023-10-03 16:54   ` Maxim Levitsky
  2024-02-13  3:44   ` Sean Christopherson
  2023-09-11  2:16 ` [PATCH v9 4/6] KVM: Migrate kvm_vcpu_map to __kvm_follow_pfn David Stevens
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 42+ messages in thread
From: David Stevens @ 2023-09-11  2:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm,
	David Stevens

From: David Stevens <stevensd@chromium.org>

KVM's handling of non-refcounted pfns has two problems:

 - struct pages without refcounting (e.g. tail pages of non-compound
   higher order pages) cannot be used at all, as gfn_to_pfn does not
   provide enough information for callers to handle the refcount.
 - pfns without struct pages can be accessed without the protection of a
   mmu notifier. This is unsafe because KVM cannot monitor or control
   the lifespan of such pfns, so it may continue to access the pfns
   after they are freed.

This patch extends the __kvm_follow_pfn API to properly handle these
cases. First, it adds an is_refcounted_page output parameter so that
callers can tell whether or not a pfn has a struct page that needs to be
passed to put_page. Second, it adds a guarded_by_mmu_notifier parameter
that is used to avoid returning non-refcounted pages when the caller
cannot safely use them.

Since callers need to be updated on a case-by-case basis to pay
attention to is_refcounted_page, the new behavior of returning
non-refcounted pages is opt-in via the allow_non_refcounted_struct_page
parameter. Once all callers have been updated, this parameter should be
removed.

The fact that non-refcounted pfns can no longer be accessed without mmu
notifier protection is a breaking change. Since there is no timeline for
updating everything in KVM to use mmu notifiers, this change adds an
opt-in module parameter called allow_unsafe_mappings to allow such
mappings. Systems which trust userspace not to tear down such unsafe
mappings while KVM is using them can set this parameter to re-enable the
legacy behavior.
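
A caller that has been migrated to the new outputs ends up with logic
along these lines (sketch only; the real callers are converted in
later patches):

	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_WRITE,
		/* this caller's mappings are zapped via mmu_notifiers */
		.guarded_by_mmu_notifier = true,
		/* and it only releases pages when is_refcounted_page is set */
		.allow_non_refcounted_struct_page = true,
	};
	kvm_pfn_t pfn = __kvm_follow_pfn(&foll);

	/* ... use the pfn ... */

	/* Only refcounted pages hold a reference that must be released. */
	if (foll.is_refcounted_page)
		kvm_release_page_dirty(pfn_to_page(pfn));

Callers that cannot set guarded_by_mmu_notifier will only be handed
non-refcounted pfns if the administrator opts in via the
allow_unsafe_mappings module parameter.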

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 include/linux/kvm_host.h | 21 ++++++++++
 virt/kvm/kvm_main.c      | 84 ++++++++++++++++++++++++----------------
 virt/kvm/pfncache.c      |  1 +
 3 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c2e0ddf14dba..2ed08ae1a9be 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1185,10 +1185,31 @@ struct kvm_follow_pfn {
 	bool atomic;
 	/* Try to create a writable mapping even for a read fault */
 	bool try_map_writable;
+	/* Usage of the returned pfn will be guarded by a mmu notifier. */
+	bool guarded_by_mmu_notifier;
+	/*
+	 * When false, do not return pfns for non-refcounted struct pages.
+	 *
+	 * TODO: This allows callers to use kvm_release_pfn on the pfns
+	 * returned by gfn_to_pfn without worrying about corrupting the
+	 * refcount of non-refcounted pages. Once all callers respect
+	 * is_refcounted_page, this flag should be removed.
+	 */
+	bool allow_non_refcounted_struct_page;
 
 	/* Outputs of __kvm_follow_pfn */
 	hva_t hva;
 	bool writable;
+	/*
+	 * True if the returned pfn is for a page with a valid refcount. False
+	 * if the returned pfn has no struct page or if the struct page is not
+	 * being refcounted (e.g. tail pages of non-compound higher order
+	 * allocations from IO/PFNMAP mappings).
+	 *
+	 * When this output flag is false, callers should not try to convert
+	 * the pfn to a struct page.
+	 */
+	bool is_refcounted_page;
 };
 
 kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9b33a59c6d65..235c5cb3fdac 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -96,6 +96,10 @@ unsigned int halt_poll_ns_shrink;
 module_param(halt_poll_ns_shrink, uint, 0644);
 EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
 
+/* Allow non-struct page memory to be mapped without MMU notifier protection. */
+static bool allow_unsafe_mappings;
+module_param(allow_unsafe_mappings, bool, 0444);
+
 /*
  * Ordering of locks:
  *
@@ -2507,6 +2511,15 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 	return rc == -EHWPOISON;
 }
 
+static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
+					   struct page *page)
+{
+	kvm_pfn_t pfn = page_to_pfn(page);
+
+	foll->is_refcounted_page = true;
+	return pfn;
+}
+
 /*
  * The fast path to get the writable pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned.  It's also the
@@ -2525,7 +2538,7 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 		return false;
 
 	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
-		*pfn = page_to_pfn(page[0]);
+		*pfn = kvm_follow_refcounted_pfn(foll, page[0]);
 		foll->writable = true;
 		return true;
 	}
@@ -2561,7 +2574,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 			page = wpage;
 		}
 	}
-	*pfn = page_to_pfn(page);
+	*pfn = kvm_follow_refcounted_pfn(foll, page);
 	return npages;
 }
 
@@ -2576,16 +2589,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
 	return true;
 }
 
-static int kvm_try_get_pfn(kvm_pfn_t pfn)
-{
-	struct page *page = kvm_pfn_to_refcounted_page(pfn);
-
-	if (!page)
-		return 1;
-
-	return get_page_unless_zero(page);
-}
-
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn)
 {
@@ -2594,6 +2597,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	pte_t pte;
 	spinlock_t *ptl;
 	bool write_fault = foll->flags & FOLL_WRITE;
+	struct page *page;
 	int r;
 
 	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
@@ -2618,37 +2622,39 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 
 	pte = ptep_get(ptep);
 
+	foll->writable = pte_write(pte);
+	pfn = pte_pfn(pte);
+
+	page = kvm_pfn_to_refcounted_page(pfn);
+
 	if (write_fault && !pte_write(pte)) {
 		pfn = KVM_PFN_ERR_RO_FAULT;
 		goto out;
 	}
 
-	foll->writable = pte_write(pte);
-	pfn = pte_pfn(pte);
+	if (!page)
+		goto out;
 
-	/*
-	 * Get a reference here because callers of *hva_to_pfn* and
-	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
-	 * returned pfn.  This is only needed if the VMA has VM_MIXEDMAP
-	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
-	 * simply do nothing for reserved pfns.
-	 *
-	 * Whoever called remap_pfn_range is also going to call e.g.
-	 * unmap_mapping_range before the underlying pages are freed,
-	 * causing a call to our MMU notifier.
-	 *
-	 * Certain IO or PFNMAP mappings can be backed with valid
-	 * struct pages, but be allocated without refcounting e.g.,
-	 * tail pages of non-compound higher order allocations, which
-	 * would then underflow the refcount when the caller does the
-	 * required put_page. Don't allow those pages here.
-	 */
-	if (!kvm_try_get_pfn(pfn))
-		r = -EFAULT;
+	if (get_page_unless_zero(page))
+		WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
 
 out:
 	pte_unmap_unlock(ptep, ptl);
-	*p_pfn = pfn;
+
+	/*
+	 * TODO: Remove the first branch once all callers have been
+	 * taught to play nice with non-refcounted struct pages.
+	 */
+	if (page && !foll->is_refcounted_page &&
+	    !foll->allow_non_refcounted_struct_page) {
+		r = -EFAULT;
+	} else if (!foll->is_refcounted_page &&
+		   !foll->guarded_by_mmu_notifier &&
+		   !allow_unsafe_mappings) {
+		r = -EFAULT;
+	} else {
+		*p_pfn = pfn;
+	}
 
 	return r;
 }
@@ -2722,6 +2728,8 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
 kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
 {
 	foll->writable = false;
+	foll->is_refcounted_page = false;
+
 	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
 				      foll->flags & FOLL_WRITE);
 
@@ -2749,6 +2757,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 		.flags = 0,
 		.atomic = atomic,
 		.try_map_writable = !!writable,
+		.allow_non_refcounted_struct_page = false,
 	};
 
 	if (write_fault)
@@ -2780,6 +2789,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		.gfn = gfn,
 		.flags = write_fault ? FOLL_WRITE : 0,
 		.try_map_writable = !!writable,
+		.allow_non_refcounted_struct_page = false,
 	};
 	pfn = __kvm_follow_pfn(&foll);
 	if (writable)
@@ -2794,6 +2804,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
 		.slot = slot,
 		.gfn = gfn,
 		.flags = FOLL_WRITE,
+		.allow_non_refcounted_struct_page = false,
 	};
 	return __kvm_follow_pfn(&foll);
 }
@@ -2806,6 +2817,11 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
 		.gfn = gfn,
 		.flags = FOLL_WRITE,
 		.atomic = true,
+		/*
+		 * Setting atomic means __kvm_follow_pfn will never make it
+		 * to hva_to_pfn_remapped, so this is vacuously true.
+		 */
+		.allow_non_refcounted_struct_page = true,
 	};
 	return __kvm_follow_pfn(&foll);
 }
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 86cd40acad11..6bbf972c11f8 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -149,6 +149,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 		.gfn = gpa_to_gfn(gpc->gpa),
 		.flags = FOLL_WRITE,
 		.hva = gpc->uhva,
+		.allow_non_refcounted_struct_page = false,
 	};
 
 	lockdep_assert_held(&gpc->refresh_lock);
-- 
2.42.0.283.g2d96d420d3-goog



* [PATCH v9 4/6] KVM: Migrate kvm_vcpu_map to __kvm_follow_pfn
  2023-09-11  2:16 [PATCH v9 0/6] KVM: allow mapping non-refcounted pages David Stevens
                   ` (2 preceding siblings ...)
  2023-09-11  2:16 ` [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns David Stevens
@ 2023-09-11  2:16 ` David Stevens
  2023-10-03 16:54   ` Maxim Levitsky
  2023-09-11  2:16 ` [PATCH v9 5/6] KVM: x86: Migrate " David Stevens
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 42+ messages in thread
From: David Stevens @ 2023-09-11  2:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm,
	David Stevens

From: David Stevens <stevensd@chromium.org>

Migrate kvm_vcpu_map to __kvm_follow_pfn. Track is_refcounted_page so
that kvm_vcpu_unmap knows whether or not it needs to release the page.
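
From the perspective of kvm_vcpu_map's callers nothing changes; the
refcount bookkeeping is contained within the map/unmap pair (sketch):

	struct kvm_host_map map;

	if (kvm_vcpu_map(vcpu, gfn, &map))
		return -EFAULT;

	/* ... access guest memory through map.hva ... */

	/* Releases the page only if map.is_refcounted_page is set. */
	kvm_vcpu_unmap(vcpu, &map, true);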

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 include/linux/kvm_host.h |  2 +-
 virt/kvm/kvm_main.c      | 24 ++++++++++++++----------
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2ed08ae1a9be..b95c79b7833b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -294,6 +294,7 @@ struct kvm_host_map {
 	void *hva;
 	kvm_pfn_t pfn;
 	kvm_pfn_t gfn;
+	bool is_refcounted_page;
 };
 
 /*
@@ -1228,7 +1229,6 @@ void kvm_release_pfn_dirty(kvm_pfn_t pfn);
 void kvm_set_pfn_dirty(kvm_pfn_t pfn);
 void kvm_set_pfn_accessed(kvm_pfn_t pfn);
 
-void kvm_release_pfn(kvm_pfn_t pfn, bool dirty);
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 			int len);
 int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 235c5cb3fdac..913de4e86d9d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2886,24 +2886,22 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(gfn_to_page);
 
-void kvm_release_pfn(kvm_pfn_t pfn, bool dirty)
-{
-	if (dirty)
-		kvm_release_pfn_dirty(pfn);
-	else
-		kvm_release_pfn_clean(pfn);
-}
-
 int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
 {
 	kvm_pfn_t pfn;
 	void *hva = NULL;
 	struct page *page = KVM_UNMAPPED_PAGE;
+	struct kvm_follow_pfn foll = {
+		.slot = gfn_to_memslot(vcpu->kvm, gfn),
+		.gfn = gfn,
+		.flags = FOLL_WRITE,
+		.allow_non_refcounted_struct_page = true,
+	};
 
 	if (!map)
 		return -EINVAL;
 
-	pfn = gfn_to_pfn(vcpu->kvm, gfn);
+	pfn = __kvm_follow_pfn(&foll);
 	if (is_error_noslot_pfn(pfn))
 		return -EINVAL;
 
@@ -2923,6 +2921,7 @@ int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
 	map->hva = hva;
 	map->pfn = pfn;
 	map->gfn = gfn;
+	map->is_refcounted_page = foll.is_refcounted_page;
 
 	return 0;
 }
@@ -2946,7 +2945,12 @@ void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty)
 	if (dirty)
 		kvm_vcpu_mark_page_dirty(vcpu, map->gfn);
 
-	kvm_release_pfn(map->pfn, dirty);
+	if (map->is_refcounted_page) {
+		if (dirty)
+			kvm_release_page_dirty(map->page);
+		else
+			kvm_release_page_clean(map->page);
+	}
 
 	map->hva = NULL;
 	map->page = NULL;
-- 
2.42.0.283.g2d96d420d3-goog



* [PATCH v9 5/6] KVM: x86: Migrate to __kvm_follow_pfn
  2023-09-11  2:16 [PATCH v9 0/6] KVM: allow mapping non-refcounted pages David Stevens
                   ` (3 preceding siblings ...)
  2023-09-11  2:16 ` [PATCH v9 4/6] KVM: Migrate kvm_vcpu_map to __kvm_follow_pfn David Stevens
@ 2023-09-11  2:16 ` David Stevens
  2023-10-03 16:54   ` Maxim Levitsky
  2023-09-11  2:16 ` [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages David Stevens
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 42+ messages in thread
From: David Stevens @ 2023-09-11  2:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm,
	David Stevens

From: David Stevens <stevensd@chromium.org>

Migrate functions which need access to is_refcounted_page to
__kvm_follow_pfn. The functions which need this are __kvm_faultin_pfn
and reexecute_instruction. The former requires replacing the async
in/out parameter with the FOLL_NOWAIT flag and the KVM_PFN_ERR_NEEDS_IO
return value. Handling non-refcounted pages is complicated, so it will
be done in a follow-up. The latter is a straightforward refactor.

APIC-related callers do not need to be migrated because KVM controls
the memslot, so it will always be regular memory. Prefetch-related
callers do not need to be migrated because atomic gfn_to_pfn calls can
never make it to hva_to_pfn_remapped.
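
The async handshake maps onto the new API roughly as follows (a sketch
with details elided; see the __kvm_faultin_pfn hunk below for the exact
code):

	/* Old: request async behavior via an in/out bool. */
	async = false;
	pfn = __gfn_to_pfn_memslot(slot, gfn, false, false, &async,
				   write_fault, &map_writable, &hva);
	if (async)
		/* set up an async #PF, or retry synchronously */;

	/* New: request it via FOLL_NOWAIT, detect it via the return value. */
	foll.flags |= FOLL_NOWAIT;
	pfn = __kvm_follow_pfn(&foll);
	if (pfn == KVM_PFN_ERR_NEEDS_IO)
		/* set up an async #PF, or retry without FOLL_NOWAIT */;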

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/x86/kvm/mmu/mmu.c | 43 ++++++++++++++++++++++++++++++++----------
 arch/x86/kvm/x86.c     | 12 ++++++++++--
 2 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e1d011c67cc6..e1eca26215e2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4254,7 +4254,14 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
-	bool async;
+	struct kvm_follow_pfn foll = {
+		.slot = slot,
+		.gfn = fault->gfn,
+		.flags = fault->write ? FOLL_WRITE : 0,
+		.try_map_writable = true,
+		.guarded_by_mmu_notifier = true,
+		.allow_non_refcounted_struct_page = false,
+	};
 
 	/*
 	 * Retry the page fault if the gfn hit a memslot that is being deleted
@@ -4283,12 +4290,20 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			return RET_PF_EMULATE;
 	}
 
-	async = false;
-	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
-					  fault->write, &fault->map_writable,
-					  &fault->hva);
-	if (!async)
-		return RET_PF_CONTINUE; /* *pfn has correct page already */
+	foll.flags |= FOLL_NOWAIT;
+	fault->pfn = __kvm_follow_pfn(&foll);
+
+	if (!is_error_noslot_pfn(fault->pfn))
+		goto success;
+
+	/*
+	 * If __kvm_follow_pfn() failed because I/O is needed to fault in the
+	 * page, then either set up an asynchronous #PF to do the I/O, or if
+	 * doing an async #PF isn't possible, retry __kvm_follow_pfn() with
+	 * I/O allowed. All other failures are fatal, i.e. retrying won't help.
+	 */
+	if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
+		return RET_PF_CONTINUE;
 
 	if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
 		trace_kvm_try_async_get_page(fault->addr, fault->gfn);
@@ -4306,9 +4321,17 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 * to wait for IO.  Note, gup always bails if it is unable to quickly
 	 * get a page and a fatal signal, i.e. SIGKILL, is pending.
 	 */
-	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
-					  fault->write, &fault->map_writable,
-					  &fault->hva);
+	foll.flags |= FOLL_INTERRUPTIBLE;
+	foll.flags &= ~FOLL_NOWAIT;
+	fault->pfn = __kvm_follow_pfn(&foll);
+
+	if (!is_error_noslot_pfn(fault->pfn))
+		goto success;
+
+	return RET_PF_CONTINUE;
+success:
+	fault->hva = foll.hva;
+	fault->map_writable = foll.writable;
 	return RET_PF_CONTINUE;
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..2011a7e47296 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8556,6 +8556,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 {
 	gpa_t gpa = cr2_or_gpa;
 	kvm_pfn_t pfn;
+	struct kvm_follow_pfn foll;
 
 	if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
 		return false;
@@ -8585,7 +8586,13 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	 * retry instruction -> write #PF -> emulation fail -> retry
 	 * instruction -> ...
 	 */
-	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
+	foll = (struct kvm_follow_pfn) {
+		.slot = gfn_to_memslot(vcpu->kvm, gpa_to_gfn(gpa)),
+		.gfn = gpa_to_gfn(gpa),
+		.flags = FOLL_WRITE,
+		.allow_non_refcounted_struct_page = true,
+	};
+	pfn = __kvm_follow_pfn(&foll);
 
 	/*
 	 * If the instruction failed on the error pfn, it can not be fixed,
@@ -8594,7 +8601,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	if (is_error_noslot_pfn(pfn))
 		return false;
 
-	kvm_release_pfn_clean(pfn);
+	if (foll.is_refcounted_page)
+		kvm_release_page_clean(pfn_to_page(pfn));
 
 	/* The instructions are well-emulated on direct mmu. */
 	if (vcpu->arch.mmu->root_role.direct) {
-- 
2.42.0.283.g2d96d420d3-goog



* [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-11  2:16 [PATCH v9 0/6] KVM: allow mapping non-refcounted pages David Stevens
                   ` (4 preceding siblings ...)
  2023-09-11  2:16 ` [PATCH v9 5/6] KVM: x86: Migrate " David Stevens
@ 2023-09-11  2:16 ` David Stevens
  2023-09-18  9:53   ` Dmitry Osipenko
                     ` (3 more replies)
  2023-09-29  5:19 ` [PATCH v9 0/6] KVM: allow mapping " Christoph Hellwig
  2023-10-31  4:30 ` David Stevens
  7 siblings, 4 replies; 42+ messages in thread
From: David Stevens @ 2023-09-11  2:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm,
	David Stevens

From: David Stevens <stevensd@chromium.org>

Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host
to map memory into the guest that is backed by non-refcounted struct
pages - for example, the tail pages of higher order non-compound pages
allocated by the amdgpu driver via ttm_pool_alloc_page.

The bulk of this change is tracking the is_refcounted_page flag so that
non-refcounted pages don't trigger page_count() == 0 warnings. This is
done by storing the flag in an unused bit in the sptes. There are no
bits available in PAE SPTEs, so non-refcounted pages can only be handled
on TDP and x86-64.
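
Concretely, the flag is stashed in an unused SPTE bit and consulted
before touching the struct page (sketch, mirroring the definitions
added below):

	/* Indicates that the SPTE refers to a page with a valid refcount. */
	#define SPTE_MMU_PAGE_REFCOUNTED	BIT_ULL(59)

	static inline bool is_refcounted_page_pte(u64 spte)
	{
		return spte & SPTE_MMU_PAGE_REFCOUNTED;
	}

	/* e.g. when zapping a dirty SPTE: */
	if (is_dirty_spte(old_spte) && is_refcounted_page_pte(old_spte))
		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));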

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/x86/kvm/mmu/mmu.c          | 52 +++++++++++++++++++++++----------
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++--
 arch/x86/kvm/mmu/spte.c         |  4 ++-
 arch/x86/kvm/mmu/spte.h         | 12 +++++++-
 arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++------
 include/linux/kvm_host.h        |  3 ++
 virt/kvm/kvm_main.c             |  6 ++--
 8 files changed, 76 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e1eca26215e2..b8168cc4cc96 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -545,12 +545,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 
 	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
 		flush = true;
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+		if (is_refcounted_page_pte(old_spte))
+			kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
 	}
 
 	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
 		flush = true;
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		if (is_refcounted_page_pte(old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
 	}
 
 	return flush;
@@ -588,14 +590,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
 	 * before they are reclaimed.  Sanity check that, if the pfn is backed
 	 * by a refcounted page, the refcount is elevated.
 	 */
-	page = kvm_pfn_to_refcounted_page(pfn);
-	WARN_ON_ONCE(page && !page_count(page));
+	if (is_refcounted_page_pte(old_spte)) {
+		page = kvm_pfn_to_refcounted_page(pfn);
+		WARN_ON_ONCE(!page || !page_count(page));
+	}
 
-	if (is_accessed_spte(old_spte))
-		kvm_set_pfn_accessed(pfn);
+	if (is_refcounted_page_pte(old_spte)) {
+		if (is_accessed_spte(old_spte))
+			kvm_set_page_accessed(pfn_to_page(pfn));
 
-	if (is_dirty_spte(old_spte))
-		kvm_set_pfn_dirty(pfn);
+		if (is_dirty_spte(old_spte))
+			kvm_set_page_dirty(pfn_to_page(pfn));
+	}
 
 	return old_spte;
 }
@@ -631,8 +637,8 @@ static bool mmu_spte_age(u64 *sptep)
 		 * Capture the dirty status of the page, so that it doesn't get
 		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (is_writable_pte(spte))
-			kvm_set_pfn_dirty(spte_to_pfn(spte));
+		if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));
 
 		spte = mark_spte_for_access_track(spte);
 		mmu_spte_update_no_track(sptep, spte);
@@ -1261,8 +1267,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
 {
 	bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
 					       (unsigned long *)sptep);
-	if (was_writable && !spte_ad_enabled(*sptep))
-		kvm_set_pfn_dirty(spte_to_pfn(*sptep));
+	if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));
 
 	return was_writable;
 }
@@ -2913,6 +2919,11 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	bool host_writable = !fault || fault->map_writable;
 	bool prefetch = !fault || fault->prefetch;
 	bool write_fault = fault && fault->write;
+	/*
+	 * Prefetching uses gfn_to_page_many_atomic, which never gets
+	 * non-refcounted pages.
+	 */
+	bool is_refcounted = !fault || fault->is_refcounted_page;
 
 	if (unlikely(is_noslot_pfn(pfn))) {
 		vcpu->stat.pf_mmio_spte_created++;
@@ -2940,7 +2951,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	}
 
 	wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
-			   true, host_writable, &spte);
+			   true, host_writable, is_refcounted, &spte);
 
 	if (*sptep == spte) {
 		ret = RET_PF_SPURIOUS;
@@ -4254,13 +4265,18 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
+	/*
+	 * There are no extra bits for tracking non-refcounted pages in
+	 * PAE SPTEs, so reject non-refcounted struct pages in that case.
+	 */
+	bool has_spte_refcount_bit = tdp_enabled && IS_ENABLED(CONFIG_X86_64);
 	struct kvm_follow_pfn foll = {
 		.slot = slot,
 		.gfn = fault->gfn,
 		.flags = fault->write ? FOLL_WRITE : 0,
 		.try_map_writable = true,
 		.guarded_by_mmu_notifier = true,
-		.allow_non_refcounted_struct_page = false,
+		.allow_non_refcounted_struct_page = has_spte_refcount_bit,
 	};
 
 	/*
@@ -4277,6 +4293,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			fault->slot = NULL;
 			fault->pfn = KVM_PFN_NOSLOT;
 			fault->map_writable = false;
+			fault->is_refcounted_page = false;
 			return RET_PF_CONTINUE;
 		}
 		/*
@@ -4332,6 +4349,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 success:
 	fault->hva = foll.hva;
 	fault->map_writable = foll.writable;
+	fault->is_refcounted_page = foll.is_refcounted_page;
 	return RET_PF_CONTINUE;
 }
 
@@ -4420,8 +4438,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	r = direct_map(vcpu, fault);
 
 out_unlock:
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 
@@ -4496,8 +4515,9 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 	r = kvm_tdp_mmu_map(vcpu, fault);
 
 out_unlock:
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	read_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 #endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b102014e2c60..7f73bc2a552e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -239,6 +239,7 @@ struct kvm_page_fault {
 	kvm_pfn_t pfn;
 	hva_t hva;
 	bool map_writable;
+	bool is_refcounted_page;
 
 	/*
 	 * Indicates the guest is trying to write a gfn that contains one or
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c85255073f67..0ac4a4e5870c 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -848,7 +848,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	return r;
 }
 
@@ -902,7 +903,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
  */
 static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
 {
-	bool host_writable;
+	bool host_writable, is_refcounted;
 	gpa_t first_pte_gpa;
 	u64 *sptep, spte;
 	struct kvm_memory_slot *slot;
@@ -959,10 +960,11 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
 	sptep = &sp->spt[i];
 	spte = *sptep;
 	host_writable = spte & shadow_host_writable_mask;
+	is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 	make_spte(vcpu, sp, slot, pte_access, gfn,
 		  spte_to_pfn(spte), spte, true, false,
-		  host_writable, &spte);
+		  host_writable, is_refcounted, &spte);
 
 	return mmu_spte_update(sptep, spte);
 }
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..ce495819061f 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte)
+	       bool host_writable, bool is_refcounted, u64 *new_spte)
 {
 	int level = sp->role.level;
 	u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 
 	if (level > PG_LEVEL_4K)
 		spte |= PT_PAGE_SIZE_MASK;
+	if (is_refcounted)
+		spte |= SPTE_MMU_PAGE_REFCOUNTED;
 
 	if (shadow_memtype_mask)
 		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a129951c9a88..4bf4a535c23d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -96,6 +96,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
 /* Defined only to keep the above static asserts readable. */
 #undef SHADOW_ACC_TRACK_SAVED_MASK
 
+/*
+ * Indicates that the SPTE refers to a page with a valid refcount.
+ */
+#define SPTE_MMU_PAGE_REFCOUNTED        BIT_ULL(59)
+
 /*
  * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
  * the memslots generation and is derived as follows:
@@ -345,6 +350,11 @@ static inline bool is_dirty_spte(u64 spte)
 	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static inline bool is_refcounted_page_pte(u64 spte)
+{
+	return spte & SPTE_MMU_PAGE_REFCOUNTED;
+}
+
 static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
 				int level)
 {
@@ -475,7 +485,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte);
+	       bool host_writable, bool is_refcounted, u64 *new_spte);
 u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
 		      	      union kvm_mmu_page_role role, int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6c63f2d1675f..185f3c666c2b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -474,6 +474,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	bool is_refcounted = is_refcounted_page_pte(old_spte);
 
 	WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON_ONCE(level < PG_LEVEL_4K);
@@ -538,9 +539,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	if (is_leaf != was_leaf)
 		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
 
-	if (was_leaf && is_dirty_spte(old_spte) &&
+	if (was_leaf && is_dirty_spte(old_spte) && is_refcounted &&
 	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
 
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
@@ -552,9 +553,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
 
-	if (was_leaf && is_accessed_spte(old_spte) &&
+	if (was_leaf && is_accessed_spte(old_spte) && is_refcounted &&
 	    (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+		kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
 }
 
 /*
@@ -988,8 +989,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
 	else
 		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
-					 fault->pfn, iter->old_spte, fault->prefetch, true,
-					 fault->map_writable, &new_spte);
+				   fault->pfn, iter->old_spte, fault->prefetch, true,
+				   fault->map_writable, fault->is_refcounted_page,
+				   &new_spte);
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
@@ -1205,8 +1207,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 		 * Capture the dirty status of the page, so that it doesn't get
 		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (is_writable_pte(iter->old_spte))
-			kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
+		if (is_writable_pte(iter->old_spte) &&
+		    is_refcounted_page_pte(iter->old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter->old_spte)));
 
 		new_spte = mark_spte_for_access_track(iter->old_spte);
 		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
@@ -1628,7 +1631,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 		trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
 					       iter.old_spte,
 					       iter.old_spte & ~dbit);
-		kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
+		if (is_refcounted_page_pte(iter.old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter.old_spte)));
 	}
 
 	rcu_read_unlock();
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b95c79b7833b..6696925f01f1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1179,6 +1179,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 
+void kvm_set_page_accessed(struct page *page);
+void kvm_set_page_dirty(struct page *page);
+
 struct kvm_follow_pfn {
 	const struct kvm_memory_slot *slot;
 	gfn_t gfn;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 913de4e86d9d..4d8538cdb690 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2979,17 +2979,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
 	return !PageReserved(page);
 }
 
-static void kvm_set_page_dirty(struct page *page)
+void kvm_set_page_dirty(struct page *page)
 {
 	if (kvm_is_ad_tracked_page(page))
 		SetPageDirty(page);
 }
+EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
 
-static void kvm_set_page_accessed(struct page *page)
+void kvm_set_page_accessed(struct page *page)
 {
 	if (kvm_is_ad_tracked_page(page))
 		mark_page_accessed(page);
 }
+EXPORT_SYMBOL_GPL(kvm_set_page_accessed);
 
 void kvm_release_page_clean(struct page *page)
 {
-- 
2.42.0.283.g2d96d420d3-goog



* Re: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-11  2:16 ` [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages David Stevens
@ 2023-09-18  9:53   ` Dmitry Osipenko
  2023-09-19  2:25     ` David Stevens
  2023-09-18  9:58   ` Dmitry Osipenko
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Dmitry Osipenko @ 2023-09-18  9:53 UTC (permalink / raw)
  To: David Stevens, Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On 9/11/23 05:16, David Stevens wrote:
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -848,7 +848,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  
>  out_unlock:
>  	write_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);
> +	if (fault->is_refcounted_page)
> +		kvm_set_page_accessed(pfn_to_page(fault->pfn));

The other similar occurrences in the code that replace
kvm_release_pfn_clean() with kvm_set_page_accessed() do so while
holding mmu_lock.

Does kvm_set_page_accessed() need to be invoked under the lock?

-- 
Best regards,
Dmitry



* Re: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-11  2:16 ` [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages David Stevens
  2023-09-18  9:53   ` Dmitry Osipenko
@ 2023-09-18  9:58   ` Dmitry Osipenko
  2023-09-18 11:19     ` Dmitry Osipenko
  2023-09-19  2:31     ` David Stevens
  2023-10-03 16:54   ` Maxim Levitsky
  2024-02-06  3:23   ` Sean Christopherson
  3 siblings, 2 replies; 42+ messages in thread
From: Dmitry Osipenko @ 2023-09-18  9:58 UTC (permalink / raw)
  To: David Stevens, Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On 9/11/23 05:16, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host
> to map memory into the guest that is backed by non-refcounted struct
> pages - for example, the tail pages of higher order non-compound pages
> allocated by the amdgpu driver via ttm_pool_alloc_page.
> 
> The bulk of this change is tracking the is_refcounted_page flag so that
> non-refcounted pages don't trigger page_count() == 0 warnings. This is
> done by storing the flag in an unused bit in the sptes. There are no
> bits available in PAE SPTEs, so non-refcounted pages can only be handled
> on TDP and x86-64.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 52 +++++++++++++++++++++++----------
>  arch/x86/kvm/mmu/mmu_internal.h |  1 +
>  arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++--
>  arch/x86/kvm/mmu/spte.c         |  4 ++-
>  arch/x86/kvm/mmu/spte.h         | 12 +++++++-
>  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++------
>  include/linux/kvm_host.h        |  3 ++
>  virt/kvm/kvm_main.c             |  6 ++--
>  8 files changed, 76 insertions(+), 32 deletions(-)

Could you please tell which kernel tree you used for the base of this
series? This patch #6 doesn't apply cleanly to stable/mainline/next/kvm

error: sha1 information is lacking or useless (arch/x86/kvm/mmu/mmu.c).
error: could not build fake ancestor

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-18  9:58   ` Dmitry Osipenko
@ 2023-09-18 11:19     ` Dmitry Osipenko
  2023-09-19  2:59       ` David Stevens
  2023-09-19  2:31     ` David Stevens
  1 sibling, 1 reply; 42+ messages in thread
From: Dmitry Osipenko @ 2023-09-18 11:19 UTC (permalink / raw)
  To: David Stevens, Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On 9/18/23 12:58, Dmitry Osipenko wrote:
> On 9/11/23 05:16, David Stevens wrote:
>> From: David Stevens <stevensd@chromium.org>
>>
>> Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host
>> to map memory into the guest that is backed by non-refcounted struct
>> pages - for example, the tail pages of higher order non-compound pages
>> allocated by the amdgpu driver via ttm_pool_alloc_page.
>>
>> The bulk of this change is tracking the is_refcounted_page flag so that
>> non-refcounted pages don't trigger page_count() == 0 warnings. This is
>> done by storing the flag in an unused bit in the sptes. There are no
>> bits available in PAE SPTEs, so non-refcounted pages can only be handled
>> on TDP and x86-64.
>>
>> Signed-off-by: David Stevens <stevensd@chromium.org>
>> ---
>>  arch/x86/kvm/mmu/mmu.c          | 52 +++++++++++++++++++++++----------
>>  arch/x86/kvm/mmu/mmu_internal.h |  1 +
>>  arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++--
>>  arch/x86/kvm/mmu/spte.c         |  4 ++-
>>  arch/x86/kvm/mmu/spte.h         | 12 +++++++-
>>  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++------
>>  include/linux/kvm_host.h        |  3 ++
>>  virt/kvm/kvm_main.c             |  6 ++--
>>  8 files changed, 76 insertions(+), 32 deletions(-)
> 
> Could you please tell which kernel tree you used for the base of this
> series? This patch #6 doesn't apply cleanly to stable/mainline/next/kvm
> 
> error: sha1 information is lacking or useless (arch/x86/kvm/mmu/mmu.c).
> error: could not build fake ancestor

I applied the patch manually to v6.5.2 and tested Venus using an Intel TGL iGPU; the intel driver is crashing:

   BUG: kernel NULL pointer dereference, address: 0000000000000058
   #PF: supervisor read access in kernel mode
   #PF: error_code(0x0000) - not-present page
   PGD 0 P4D 0 
   Oops: 0000 [#1] PREEMPT SMP
   CPU: 1 PID: 5926 Comm: qemu-system-x86 Not tainted 6.5.2+ #114
   Hardware name: LENOVO 20VE/LNVNB161216, BIOS F8CN43WW(V2.06) 08/12/2021
   RIP: 0010:gen8_ppgtt_insert+0x50b/0x8f0
   Code: 00 00 f7 c2 00 00 20 00 74 15 f7 c3 ff ff 1f 00 75 0d 41 81 fc ff ff 1f 00 0f 87 0e 02 00 00 48 8b 74 24 08 44 89 c0 45 85 ed <48> 8b 4e 58 48 8b 04 c1 0f 85 0b 02 00 00 81 e2 00 00 01 00 0f 84
   RSP: 0018:ffffafc085afb820 EFLAGS: 00010246
   RAX: 0000000000000000 RBX: 00000000e9604000 RCX: 000000000000001b
   RDX: 0000000000211000 RSI: 0000000000000000 RDI: ffff9513d44c1000
   RBP: ffff951106f8dfc0 R08: 0000000000000000 R09: 0000000000000003
   R10: 0000000000000fff R11: 00000000e9800000 R12: 00000000001fc000
   R13: 0000000000000000 R14: 0000000000001000 R15: 0000ffff00000000
   FS:  00007f2a5bcced80(0000) GS:ffff951a87a40000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000000000000058 CR3: 0000000116f16006 CR4: 0000000000772ee0
   PKRU: 55555554
   Call Trace:
    <TASK>
    ? __die+0x1f/0x60
    ? page_fault_oops+0x14d/0x420
    ? exc_page_fault+0x3d7/0x880
    ? lock_acquire+0xc9/0x290
    ? asm_exc_page_fault+0x22/0x30
    ? gen8_ppgtt_insert+0x50b/0x8f0
    ppgtt_bind_vma+0x4f/0x60
    fence_work+0x1b/0x70
    fence_notify+0x8f/0x130
    __i915_sw_fence_complete+0x58/0x230
    i915_vma_pin_ww+0x513/0xa80
    eb_validate_vmas+0x17e/0x9e0
    ? eb_pin_engine+0x2bb/0x340
    i915_gem_do_execbuffer+0xc85/0x2bf0
    ? __lock_acquire+0x3b6/0x21c0
    i915_gem_execbuffer2_ioctl+0xee/0x240
    ? i915_gem_do_execbuffer+0x2bf0/0x2bf0
    drm_ioctl_kernel+0x9d/0x140
    drm_ioctl+0x1dd/0x410
    ? i915_gem_do_execbuffer+0x2bf0/0x2bf0
    ? __fget_files+0xc5/0x170
    __x64_sys_ioctl+0x8c/0xc0
    do_syscall_64+0x34/0x80
    entry_SYSCALL_64_after_hwframe+0x46/0xb0
   RIP: 0033:0x7f2a60b0c9df


$ ./scripts/faddr2line ./vmlinux gen8_ppgtt_insert+0x50b/0x8f0
gen8_ppgtt_insert+0x50b/0x8f0:
i915_pt_entry at drivers/gpu/drm/i915/gt/intel_gtt.h:557
(inlined by) gen8_ppgtt_insert_huge at drivers/gpu/drm/i915/gt/gen8_ppgtt.c:641
(inlined by) gen8_ppgtt_insert at drivers/gpu/drm/i915/gt/gen8_ppgtt.c:743

The NULL deref itself is likely an i915 driver issue, but the origin of the bug is probably in the kvm page fault handling.

David, could you please tell what tests you've run and post a link to your kernel tree? Maybe I made an obscure mistake while applying the patch manually.

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-18  9:53   ` Dmitry Osipenko
@ 2023-09-19  2:25     ` David Stevens
  2023-09-30 13:34       ` Dmitry Osipenko
  0 siblings, 1 reply; 42+ messages in thread
From: David Stevens @ 2023-09-19  2:25 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Sean Christopherson, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm, David Stevens

On Mon, Sep 18, 2023 at 6:53 PM Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
>
> On 9/11/23 05:16, David Stevens wrote:
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -848,7 +848,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  
> >  out_unlock:
> >       write_unlock(&vcpu->kvm->mmu_lock);
> > -     kvm_release_pfn_clean(fault->pfn);
> > +     if (fault->is_refcounted_page)
> > +             kvm_set_page_accessed(pfn_to_page(fault->pfn));
> 
> The other similar occurrences in the code that replaced
> kvm_release_pfn_clean() with kvm_set_page_accessed() did it under the
> held mmu_lock.
> 
> Does kvm_set_page_accessed() needs to be invoked under the lock?

It looks like I made a mistake when folding the v8->v9 delta into the stack of
patches to get a clean v9 series. v8 of the series returned pfns without
reference counts from __kvm_follow_pfn, so the x86 MMU needed to mark the pages
as accessed under the lock. v9 instead returns pfns with a refcount (i.e. does
the same thing as __gfn_to_pfn_memslot), so the x86 MMU should instead call
kvm_release_page_clean outside of the lock. I've included the corrected version
of this patch in this email.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e1eca26215e2..5e7124f63fbc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -545,12 +545,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 
 	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
 		flush = true;
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+		if (is_refcounted_page_pte(old_spte))
+			kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
 	}
 
 	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
 		flush = true;
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		if (is_refcounted_page_pte(old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
 	}
 
 	return flush;
@@ -588,14 +590,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
 	 * before they are reclaimed.  Sanity check that, if the pfn is backed
 	 * by a refcounted page, the refcount is elevated.
 	 */
-	page = kvm_pfn_to_refcounted_page(pfn);
-	WARN_ON_ONCE(page && !page_count(page));
+	if (is_refcounted_page_pte(old_spte)) {
+		page = kvm_pfn_to_refcounted_page(pfn);
+		WARN_ON_ONCE(!page || !page_count(page));
+	}
 
-	if (is_accessed_spte(old_spte))
-		kvm_set_pfn_accessed(pfn);
+	if (is_refcounted_page_pte(old_spte)) {
+		if (is_accessed_spte(old_spte))
+			kvm_set_page_accessed(pfn_to_page(pfn));
 
-	if (is_dirty_spte(old_spte))
-		kvm_set_pfn_dirty(pfn);
+		if (is_dirty_spte(old_spte))
+			kvm_set_page_dirty(pfn_to_page(pfn));
+	}
 
 	return old_spte;
 }
@@ -631,8 +637,8 @@ static bool mmu_spte_age(u64 *sptep)
 		 * Capture the dirty status of the page, so that it doesn't get
 		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (is_writable_pte(spte))
-			kvm_set_pfn_dirty(spte_to_pfn(spte));
+		if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));
 
 		spte = mark_spte_for_access_track(spte);
 		mmu_spte_update_no_track(sptep, spte);
@@ -1261,8 +1267,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
 {
 	bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
 					       (unsigned long *)sptep);
-	if (was_writable && !spte_ad_enabled(*sptep))
-		kvm_set_pfn_dirty(spte_to_pfn(*sptep));
+	if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));
 
 	return was_writable;
 }
@@ -2913,6 +2919,11 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	bool host_writable = !fault || fault->map_writable;
 	bool prefetch = !fault || fault->prefetch;
 	bool write_fault = fault && fault->write;
+	/*
+	 * Prefetching uses gfn_to_page_many_atomic, which never gets
+	 * non-refcounted pages.
+	 */
+	bool is_refcounted = !fault || fault->is_refcounted_page;
 
 	if (unlikely(is_noslot_pfn(pfn))) {
 		vcpu->stat.pf_mmio_spte_created++;
@@ -2940,7 +2951,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	}
 
 	wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
-			   true, host_writable, &spte);
+			   true, host_writable, is_refcounted, &spte);
 
 	if (*sptep == spte) {
 		ret = RET_PF_SPURIOUS;
@@ -4254,13 +4265,18 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
+	/*
+	 * There are no extra bits for tracking non-refcounted pages in
+	 * PAE SPTEs, so reject non-refcounted struct pages in that case.
+	 */
+	bool has_spte_refcount_bit = tdp_enabled && IS_ENABLED(CONFIG_X86_64);
 	struct kvm_follow_pfn foll = {
 		.slot = slot,
 		.gfn = fault->gfn,
 		.flags = fault->write ? FOLL_WRITE : 0,
 		.try_map_writable = true,
 		.guarded_by_mmu_notifier = true,
-		.allow_non_refcounted_struct_page = false,
+		.allow_non_refcounted_struct_page = has_spte_refcount_bit,
 	};
 
 	/*
@@ -4277,6 +4293,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			fault->slot = NULL;
 			fault->pfn = KVM_PFN_NOSLOT;
 			fault->map_writable = false;
+			fault->is_refcounted_page = false;
 			return RET_PF_CONTINUE;
 		}
 		/*
@@ -4332,6 +4349,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 success:
 	fault->hva = foll.hva;
 	fault->map_writable = foll.writable;
+	fault->is_refcounted_page = foll.is_refcounted_page;
 	return RET_PF_CONTINUE;
 }
 
@@ -4421,7 +4439,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_refcounted_page)
+		kvm_release_page_clean(pfn_to_page(fault->pfn));
 	return r;
 }
 
@@ -4497,7 +4516,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 
 out_unlock:
 	read_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_refcounted_page)
+		kvm_release_page_clean(pfn_to_page(fault->pfn));
 	return r;
 }
 #endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b102014e2c60..7f73bc2a552e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -239,6 +239,7 @@ struct kvm_page_fault {
 	kvm_pfn_t pfn;
 	hva_t hva;
 	bool map_writable;
+	bool is_refcounted_page;
 
 	/*
 	 * Indicates the guest is trying to write a gfn that contains one or
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c85255073f67..b2d62fd9634c 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -848,7 +848,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_refcounted_page)
+		kvm_release_page_clean(pfn_to_page(fault->pfn));
 	return r;
 }
 
@@ -902,7 +903,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
  */
 static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
 {
-	bool host_writable;
+	bool host_writable, is_refcounted;
 	gpa_t first_pte_gpa;
 	u64 *sptep, spte;
 	struct kvm_memory_slot *slot;
@@ -959,10 +960,11 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
 	sptep = &sp->spt[i];
 	spte = *sptep;
 	host_writable = spte & shadow_host_writable_mask;
+	is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 	make_spte(vcpu, sp, slot, pte_access, gfn,
 		  spte_to_pfn(spte), spte, true, false,
-		  host_writable, &spte);
+		  host_writable, is_refcounted, &spte);
 
 	return mmu_spte_update(sptep, spte);
 }
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..ce495819061f 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte)
+	       bool host_writable, bool is_refcounted, u64 *new_spte)
 {
 	int level = sp->role.level;
 	u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 
 	if (level > PG_LEVEL_4K)
 		spte |= PT_PAGE_SIZE_MASK;
+	if (is_refcounted)
+		spte |= SPTE_MMU_PAGE_REFCOUNTED;
 
 	if (shadow_memtype_mask)
 		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a129951c9a88..4bf4a535c23d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -96,6 +96,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
 /* Defined only to keep the above static asserts readable. */
 #undef SHADOW_ACC_TRACK_SAVED_MASK
 
+/*
+ * Indicates that the SPTE refers to a page with a valid refcount.
+ */
+#define SPTE_MMU_PAGE_REFCOUNTED        BIT_ULL(59)
+
 /*
  * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
  * the memslots generation and is derived as follows:
@@ -345,6 +350,11 @@ static inline bool is_dirty_spte(u64 spte)
 	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static inline bool is_refcounted_page_pte(u64 spte)
+{
+	return spte & SPTE_MMU_PAGE_REFCOUNTED;
+}
+
 static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
 				int level)
 {
@@ -475,7 +485,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte);
+	       bool host_writable, bool is_refcounted, u64 *new_spte);
 u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
 		      	      union kvm_mmu_page_role role, int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6c63f2d1675f..185f3c666c2b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -474,6 +474,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	bool is_refcounted = is_refcounted_page_pte(old_spte);
 
 	WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON_ONCE(level < PG_LEVEL_4K);
@@ -538,9 +539,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	if (is_leaf != was_leaf)
 		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
 
-	if (was_leaf && is_dirty_spte(old_spte) &&
+	if (was_leaf && is_dirty_spte(old_spte) && is_refcounted &&
 	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
 
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
@@ -552,9 +553,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
 
-	if (was_leaf && is_accessed_spte(old_spte) &&
+	if (was_leaf && is_accessed_spte(old_spte) && is_refcounted &&
 	    (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+		kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
 }
 
 /*
@@ -988,8 +989,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
 	else
 		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
-					 fault->pfn, iter->old_spte, fault->prefetch, true,
-					 fault->map_writable, &new_spte);
+				   fault->pfn, iter->old_spte, fault->prefetch, true,
+				   fault->map_writable, fault->is_refcounted_page,
+				   &new_spte);
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
@@ -1205,8 +1207,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 		 * Capture the dirty status of the page, so that it doesn't get
 		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (is_writable_pte(iter->old_spte))
-			kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
+		if (is_writable_pte(iter->old_spte) &&
+		    is_refcounted_page_pte(iter->old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter->old_spte)));
 
 		new_spte = mark_spte_for_access_track(iter->old_spte);
 		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
@@ -1628,7 +1631,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 		trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
 					       iter.old_spte,
 					       iter.old_spte & ~dbit);
-		kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
+		if (is_refcounted_page_pte(iter.old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter.old_spte)));
 	}
 
 	rcu_read_unlock();
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b95c79b7833b..6696925f01f1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1179,6 +1179,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 
+void kvm_set_page_accessed(struct page *page);
+void kvm_set_page_dirty(struct page *page);
+
 struct kvm_follow_pfn {
 	const struct kvm_memory_slot *slot;
 	gfn_t gfn;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 913de4e86d9d..4d8538cdb690 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2979,17 +2979,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
 	return !PageReserved(page);
 }
 
-static void kvm_set_page_dirty(struct page *page)
+void kvm_set_page_dirty(struct page *page)
 {
 	if (kvm_is_ad_tracked_page(page))
 		SetPageDirty(page);
 }
+EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
 
-static void kvm_set_page_accessed(struct page *page)
+void kvm_set_page_accessed(struct page *page)
 {
 	if (kvm_is_ad_tracked_page(page))
 		mark_page_accessed(page);
 }
+EXPORT_SYMBOL_GPL(kvm_set_page_accessed);
 
 void kvm_release_page_clean(struct page *page)
 {
-- 
2.42.0.459.ge4e396fd5e-goog


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-18  9:58   ` Dmitry Osipenko
  2023-09-18 11:19     ` Dmitry Osipenko
@ 2023-09-19  2:31     ` David Stevens
  2023-09-21 20:04       ` Dmitry Osipenko
  2024-02-06  3:02       ` Sean Christopherson
  1 sibling, 2 replies; 42+ messages in thread
From: David Stevens @ 2023-09-19  2:31 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Sean Christopherson, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On Mon, Sep 18, 2023 at 6:58 PM Dmitry Osipenko
<dmitry.osipenko@collabora.com> wrote:
>
> On 9/11/23 05:16, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host
> > to map memory into the guest that is backed by non-refcounted struct
> > pages - for example, the tail pages of higher order non-compound pages
> > allocated by the amdgpu driver via ttm_pool_alloc_page.
> >
> > The bulk of this change is tracking the is_refcounted_page flag so that
> > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > done by storing the flag in an unused bit in the sptes. There are no
> > bits available in PAE SPTEs, so non-refcounted pages can only be handled
> > on TDP and x86-64.
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          | 52 +++++++++++++++++++++++----------
> >  arch/x86/kvm/mmu/mmu_internal.h |  1 +
> >  arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++--
> >  arch/x86/kvm/mmu/spte.c         |  4 ++-
> >  arch/x86/kvm/mmu/spte.h         | 12 +++++++-
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++------
> >  include/linux/kvm_host.h        |  3 ++
> >  virt/kvm/kvm_main.c             |  6 ++--
> >  8 files changed, 76 insertions(+), 32 deletions(-)
>
> Could you please tell which kernel tree you used for the base of this
> series? This patch #6 doesn't apply cleanly to stable/mainline/next/kvm
>
> error: sha1 information is lacking or useless (arch/x86/kvm/mmu/mmu.c).
> error: could not build fake ancestor

This series is based on the kvm next branch (i.e.
https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next). The
specific hash is d011151616e73de20c139580b73fa4c7042bd861.

-David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-18 11:19     ` Dmitry Osipenko
@ 2023-09-19  2:59       ` David Stevens
  2023-09-21 20:06         ` Dmitry Osipenko
  0 siblings, 1 reply; 42+ messages in thread
From: David Stevens @ 2023-09-19  2:59 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Sean Christopherson, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On Mon, Sep 18, 2023 at 8:19 PM Dmitry Osipenko
<dmitry.osipenko@collabora.com> wrote:
>
> On 9/18/23 12:58, Dmitry Osipenko wrote:
> > On 9/11/23 05:16, David Stevens wrote:
> >> From: David Stevens <stevensd@chromium.org>
> >>
> >> Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host
> >> to map memory into the guest that is backed by non-refcounted struct
> >> pages - for example, the tail pages of higher order non-compound pages
> >> allocated by the amdgpu driver via ttm_pool_alloc_page.
> >>
> >> The bulk of this change is tracking the is_refcounted_page flag so that
> >> non-refcounted pages don't trigger page_count() == 0 warnings. This is
> >> done by storing the flag in an unused bit in the sptes. There are no
> >> bits available in PAE SPTEs, so non-refcounted pages can only be handled
> >> on TDP and x86-64.
> >>
> >> Signed-off-by: David Stevens <stevensd@chromium.org>
> >> ---
> >>  arch/x86/kvm/mmu/mmu.c          | 52 +++++++++++++++++++++++----------
> >>  arch/x86/kvm/mmu/mmu_internal.h |  1 +
> >>  arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++--
> >>  arch/x86/kvm/mmu/spte.c         |  4 ++-
> >>  arch/x86/kvm/mmu/spte.h         | 12 +++++++-
> >>  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++------
> >>  include/linux/kvm_host.h        |  3 ++
> >>  virt/kvm/kvm_main.c             |  6 ++--
> >>  8 files changed, 76 insertions(+), 32 deletions(-)
> >
> > Could you please tell which kernel tree you used for the base of this
> > series? This patch #6 doesn't apply cleanly to stable/mainline/next/kvm
> >
> > error: sha1 information is lacking or useless (arch/x86/kvm/mmu/mmu.c).
> > error: could not build fake ancestor
>
> I applied the patch manually to v6.5.2 and tested Venus using Intel TGL iGPU, the intel driver is crashing:
>
>    BUG: kernel NULL pointer dereference, address: 0000000000000058
>    #PF: supervisor read access in kernel mode
>    #PF: error_code(0x0000) - not-present page
>    PGD 0 P4D 0
>    Oops: 0000 [#1] PREEMPT SMP
>    CPU: 1 PID: 5926 Comm: qemu-system-x86 Not tainted 6.5.2+ #114
>    Hardware name: LENOVO 20VE/LNVNB161216, BIOS F8CN43WW(V2.06) 08/12/2021
>    RIP: 0010:gen8_ppgtt_insert+0x50b/0x8f0
>    Code: 00 00 f7 c2 00 00 20 00 74 15 f7 c3 ff ff 1f 00 75 0d 41 81 fc ff ff 1f 00 0f 87 0e 02 00 00 48 8b 74 24 08 44 89 c0 45 85 ed <48> 8b 4e 58 48 8b 04 c1 0f 85 0b 02 00 00 81 e2 00 00 01 00 0f 84
>    RSP: 0018:ffffafc085afb820 EFLAGS: 00010246
>    RAX: 0000000000000000 RBX: 00000000e9604000 RCX: 000000000000001b
>    RDX: 0000000000211000 RSI: 0000000000000000 RDI: ffff9513d44c1000
>    RBP: ffff951106f8dfc0 R08: 0000000000000000 R09: 0000000000000003
>    R10: 0000000000000fff R11: 00000000e9800000 R12: 00000000001fc000
>    R13: 0000000000000000 R14: 0000000000001000 R15: 0000ffff00000000
>    FS:  00007f2a5bcced80(0000) GS:ffff951a87a40000(0000) knlGS:0000000000000000
>    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    CR2: 0000000000000058 CR3: 0000000116f16006 CR4: 0000000000772ee0
>    PKRU: 55555554
>    Call Trace:
>     <TASK>
>     ? __die+0x1f/0x60
>     ? page_fault_oops+0x14d/0x420
>     ? exc_page_fault+0x3d7/0x880
>     ? lock_acquire+0xc9/0x290
>     ? asm_exc_page_fault+0x22/0x30
>     ? gen8_ppgtt_insert+0x50b/0x8f0
>     ppgtt_bind_vma+0x4f/0x60
>     fence_work+0x1b/0x70
>     fence_notify+0x8f/0x130
>     __i915_sw_fence_complete+0x58/0x230
>     i915_vma_pin_ww+0x513/0xa80
>     eb_validate_vmas+0x17e/0x9e0
>     ? eb_pin_engine+0x2bb/0x340
>     i915_gem_do_execbuffer+0xc85/0x2bf0
>     ? __lock_acquire+0x3b6/0x21c0
>     i915_gem_execbuffer2_ioctl+0xee/0x240
>     ? i915_gem_do_execbuffer+0x2bf0/0x2bf0
>     drm_ioctl_kernel+0x9d/0x140
>     drm_ioctl+0x1dd/0x410
>     ? i915_gem_do_execbuffer+0x2bf0/0x2bf0
>     ? __fget_files+0xc5/0x170
>     __x64_sys_ioctl+0x8c/0xc0
>     do_syscall_64+0x34/0x80
>     entry_SYSCALL_64_after_hwframe+0x46/0xb0
>    RIP: 0033:0x7f2a60b0c9df
>
>
> $ ./scripts/faddr2line ./vmlinux gen8_ppgtt_insert+0x50b/0x8f0
> gen8_ppgtt_insert+0x50b/0x8f0:
> i915_pt_entry at drivers/gpu/drm/i915/gt/intel_gtt.h:557
> (inlined by) gen8_ppgtt_insert_huge at drivers/gpu/drm/i915/gt/gen8_ppgtt.c:641
> (inlined by) gen8_ppgtt_insert at drivers/gpu/drm/i915/gt/gen8_ppgtt.c:743
>
> It's likely should be the i915 driver issue that is crashes with the NULL deref, but the origin of the bug should be the kvm page fault handling.
>
> David, could you please tell what tests you've run and post a link to yours kernel tree? Maybe I made obscure mistake while applied the patch manually.

For tests, I ran the kvm selftests and then various ChromeOS
virtualization tests. Two things to note about the ChromeOS
virtualization tests are that they use crosvm instead of qemu, and
they use virtio-gpu+virgl for graphics in the guest. I tested on an
AMD device (since the goal of this series is to fix a compatibility
issue with the amdgpu driver), and on a TGL device.

I don't have an easy way to share my kernel tree, but it's based on
v6.5-r3. The series I sent out is rebased onto the kvm next branch,
but there were only minor conflicts.

-David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-19  2:31     ` David Stevens
@ 2023-09-21 20:04       ` Dmitry Osipenko
  2024-02-06  3:02       ` Sean Christopherson
  1 sibling, 0 replies; 42+ messages in thread
From: Dmitry Osipenko @ 2023-09-21 20:04 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On 9/19/23 05:31, David Stevens wrote:
> On Mon, Sep 18, 2023 at 6:58 PM Dmitry Osipenko
> <dmitry.osipenko@collabora.com> wrote:
>>
>> On 9/11/23 05:16, David Stevens wrote:
>>> From: David Stevens <stevensd@chromium.org>
>>>
>>> Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host
>>> to map memory into the guest that is backed by non-refcounted struct
>>> pages - for example, the tail pages of higher order non-compound pages
>>> allocated by the amdgpu driver via ttm_pool_alloc_page.
>>>
>>> The bulk of this change is tracking the is_refcounted_page flag so that
>>> non-refcounted pages don't trigger page_count() == 0 warnings. This is
>>> done by storing the flag in an unused bit in the sptes. There are no
>>> bits available in PAE SPTEs, so non-refcounted pages can only be handled
>>> on TDP and x86-64.
>>>
>>> Signed-off-by: David Stevens <stevensd@chromium.org>
>>> ---
>>>  arch/x86/kvm/mmu/mmu.c          | 52 +++++++++++++++++++++++----------
>>>  arch/x86/kvm/mmu/mmu_internal.h |  1 +
>>>  arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++--
>>>  arch/x86/kvm/mmu/spte.c         |  4 ++-
>>>  arch/x86/kvm/mmu/spte.h         | 12 +++++++-
>>>  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++------
>>>  include/linux/kvm_host.h        |  3 ++
>>>  virt/kvm/kvm_main.c             |  6 ++--
>>>  8 files changed, 76 insertions(+), 32 deletions(-)
>>
>> Could you please tell which kernel tree you used for the base of this
>> series? This patch #6 doesn't apply cleanly to stable/mainline/next/kvm
>>
>> error: sha1 information is lacking or useless (arch/x86/kvm/mmu/mmu.c).
>> error: could not build fake ancestor
> 
> This series is based on the kvm next branch (i.e.
> https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next). The
> specific hash is d011151616e73de20c139580b73fa4c7042bd861.

Thanks, this tag works

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-19  2:59       ` David Stevens
@ 2023-09-21 20:06         ` Dmitry Osipenko
  2023-09-30 13:34           ` Dmitry Osipenko
  0 siblings, 1 reply; 42+ messages in thread
From: Dmitry Osipenko @ 2023-09-21 20:06 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On 9/19/23 05:59, David Stevens wrote:
> On Mon, Sep 18, 2023 at 8:19 PM Dmitry Osipenko
> <dmitry.osipenko@collabora.com> wrote:
>>
>> On 9/18/23 12:58, Dmitry Osipenko wrote:
>>> On 9/11/23 05:16, David Stevens wrote:
>>>> From: David Stevens <stevensd@chromium.org>
>>>>
>>>> Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host
>>>> to map memory into the guest that is backed by non-refcounted struct
>>>> pages - for example, the tail pages of higher order non-compound pages
>>>> allocated by the amdgpu driver via ttm_pool_alloc_page.
>>>>
>>>> The bulk of this change is tracking the is_refcounted_page flag so that
>>>> non-refcounted pages don't trigger page_count() == 0 warnings. This is
>>>> done by storing the flag in an unused bit in the sptes. There are no
>>>> bits available in PAE SPTEs, so non-refcounted pages can only be handled
>>>> on TDP and x86-64.
>>>>
>>>> Signed-off-by: David Stevens <stevensd@chromium.org>
>>>> ---
>>>>  arch/x86/kvm/mmu/mmu.c          | 52 +++++++++++++++++++++++----------
>>>>  arch/x86/kvm/mmu/mmu_internal.h |  1 +
>>>>  arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++--
>>>>  arch/x86/kvm/mmu/spte.c         |  4 ++-
>>>>  arch/x86/kvm/mmu/spte.h         | 12 +++++++-
>>>>  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++------
>>>>  include/linux/kvm_host.h        |  3 ++
>>>>  virt/kvm/kvm_main.c             |  6 ++--
>>>>  8 files changed, 76 insertions(+), 32 deletions(-)
>>>
>>> Could you please tell which kernel tree you used for the base of this
>>> series? This patch #6 doesn't apply cleanly to stable/mainline/next/kvm
>>>
>>> error: sha1 information is lacking or useless (arch/x86/kvm/mmu/mmu.c).
>>> error: could not build fake ancestor
>>
>> I applied the patch manually to v6.5.2 and tested Venus using Intel TGL iGPU, the intel driver is crashing:
>>
>>    BUG: kernel NULL pointer dereference, address: 0000000000000058
>>    #PF: supervisor read access in kernel mode
>>    #PF: error_code(0x0000) - not-present page
>>    PGD 0 P4D 0
>>    Oops: 0000 [#1] PREEMPT SMP
>>    CPU: 1 PID: 5926 Comm: qemu-system-x86 Not tainted 6.5.2+ #114
>>    Hardware name: LENOVO 20VE/LNVNB161216, BIOS F8CN43WW(V2.06) 08/12/2021
>>    RIP: 0010:gen8_ppgtt_insert+0x50b/0x8f0
>>    Code: 00 00 f7 c2 00 00 20 00 74 15 f7 c3 ff ff 1f 00 75 0d 41 81 fc ff ff 1f 00 0f 87 0e 02 00 00 48 8b 74 24 08 44 89 c0 45 85 ed <48> 8b 4e 58 48 8b 04 c1 0f 85 0b 02 00 00 81 e2 00 00 01 00 0f 84
>>    RSP: 0018:ffffafc085afb820 EFLAGS: 00010246
>>    RAX: 0000000000000000 RBX: 00000000e9604000 RCX: 000000000000001b
>>    RDX: 0000000000211000 RSI: 0000000000000000 RDI: ffff9513d44c1000
>>    RBP: ffff951106f8dfc0 R08: 0000000000000000 R09: 0000000000000003
>>    R10: 0000000000000fff R11: 00000000e9800000 R12: 00000000001fc000
>>    R13: 0000000000000000 R14: 0000000000001000 R15: 0000ffff00000000
>>    FS:  00007f2a5bcced80(0000) GS:ffff951a87a40000(0000) knlGS:0000000000000000
>>    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>    CR2: 0000000000000058 CR3: 0000000116f16006 CR4: 0000000000772ee0
>>    PKRU: 55555554
>>    Call Trace:
>>     <TASK>
>>     ? __die+0x1f/0x60
>>     ? page_fault_oops+0x14d/0x420
>>     ? exc_page_fault+0x3d7/0x880
>>     ? lock_acquire+0xc9/0x290
>>     ? asm_exc_page_fault+0x22/0x30
>>     ? gen8_ppgtt_insert+0x50b/0x8f0
>>     ppgtt_bind_vma+0x4f/0x60
>>     fence_work+0x1b/0x70
>>     fence_notify+0x8f/0x130
>>     __i915_sw_fence_complete+0x58/0x230
>>     i915_vma_pin_ww+0x513/0xa80
>>     eb_validate_vmas+0x17e/0x9e0
>>     ? eb_pin_engine+0x2bb/0x340
>>     i915_gem_do_execbuffer+0xc85/0x2bf0
>>     ? __lock_acquire+0x3b6/0x21c0
>>     i915_gem_execbuffer2_ioctl+0xee/0x240
>>     ? i915_gem_do_execbuffer+0x2bf0/0x2bf0
>>     drm_ioctl_kernel+0x9d/0x140
>>     drm_ioctl+0x1dd/0x410
>>     ? i915_gem_do_execbuffer+0x2bf0/0x2bf0
>>     ? __fget_files+0xc5/0x170
>>     __x64_sys_ioctl+0x8c/0xc0
>>     do_syscall_64+0x34/0x80
>>     entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>    RIP: 0033:0x7f2a60b0c9df
>>
>>
>> $ ./scripts/faddr2line ./vmlinux gen8_ppgtt_insert+0x50b/0x8f0
>> gen8_ppgtt_insert+0x50b/0x8f0:
>> i915_pt_entry at drivers/gpu/drm/i915/gt/intel_gtt.h:557
>> (inlined by) gen8_ppgtt_insert_huge at drivers/gpu/drm/i915/gt/gen8_ppgtt.c:641
>> (inlined by) gen8_ppgtt_insert at drivers/gpu/drm/i915/gt/gen8_ppgtt.c:743
>>
>> It's likely should be the i915 driver issue that is crashes with the NULL deref, but the origin of the bug should be the kvm page fault handling.
>>
>> David, could you please tell what tests you've run and post a link to yours kernel tree? Maybe I made obscure mistake while applied the patch manually.
> 
> For tests, I ran the kvm selftests and then various ChromeOS
> virtualization tests. Two things to note about the ChromeOS
> virtualization tests are that they use crosvm instead of qemu, and
> they use virtio-gpu+virgl for graphics in the guest. I tested on an
> AMD device (since the goal of this series is to fix a compatibility
> issue with the amdgpu driver), and on a TGL device.
> 
> I don't have an easy way to share my kernel tree, but it's based on
> v6.5-r3. The series I sent out is rebased onto the kvm next branch,
> but there were only minor conflicts.

It's good that you mentioned crosvm, because I only tested qemu.
Interestingly, the problem doesn't show up with crosvm.

I ran a few more tests using these options:

  1. your latest version of the patch based on v6.5.4
  2. your latest version of the patch based on the kvm tree
  3. a forward-port of your older version of the kvm patchset (the one
that works well on v6.4) to v6.5

None of these options work with qemu on a v6.5 kernel, but crosvm works.

At first I was confident that it must be an i915 bug in the v6.5 kernel.
But given that crosvm works, now I'm not sure.

I will try to bisect the 6.5 kernel, at minimum the i915 driver changes.

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 0/6] KVM: allow mapping non-refcounted pages
  2023-09-11  2:16 [PATCH v9 0/6] KVM: allow mapping non-refcounted pages David Stevens
                   ` (5 preceding siblings ...)
  2023-09-11  2:16 ` [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages David Stevens
@ 2023-09-29  5:19 ` Christoph Hellwig
  2023-09-29 16:06   ` Sean Christopherson
  2023-10-31  4:30 ` David Stevens
  7 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2023-09-29  5:19 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On Mon, Sep 11, 2023 at 11:16:30AM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> that is backed by struct pages that aren't currently being refcounted
> (e.g. tail pages of non-compound higher order allocations) into the
> guest.
> 
> Our use case is virtio-gpu blob resources [1], which directly map host
> graphics buffers into the guest as "vram" for the virtio-gpu device.
> This feature currently does not work on systems using the amdgpu driver,
> as that driver allocates non-compound higher order pages via
> ttm_pool_alloc_page.
> 
> First, this series replaces the __gfn_to_pfn_memslot API with a more
> extensible __kvm_faultin_pfn API. The updated API rearranges
> __gfn_to_pfn_memslot's args into a struct and where possible packs the
> bool arguments into a FOLL_ flags argument. The refactoring changes do
> not change any behavior.

Instead of adding hacks to kvm you really should fix the driver / TTM
to not do weird memory allocations.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 0/6] KVM: allow mapping non-refcounted pages
  2023-09-29  5:19 ` [PATCH v9 0/6] KVM: allow mapping " Christoph Hellwig
@ 2023-09-29 16:06   ` Sean Christopherson
  2023-10-02  6:25     ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2023-09-29 16:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Stevens, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On Thu, Sep 28, 2023, Christoph Hellwig wrote:
> On Mon, Sep 11, 2023 at 11:16:30AM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> > 
> > This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> > that is backed by struct pages that aren't currently being refcounted
> > (e.g. tail pages of non-compound higher order allocations) into the
> > guest.
> > 
> > Our use case is virtio-gpu blob resources [1], which directly map host
> > graphics buffers into the guest as "vram" for the virtio-gpu device.
> > This feature currently does not work on systems using the amdgpu driver,
> > as that driver allocates non-compound higher order pages via
> > ttm_pool_alloc_page.
> > 
> > First, this series replaces the __gfn_to_pfn_memslot API with a more
> > extensible __kvm_faultin_pfn API. The updated API rearranges
> > __gfn_to_pfn_memslot's args into a struct and where possible packs the
> > bool arguments into a FOLL_ flags argument. The refactoring changes do
> > not change any behavior.
> 
> Instead of adding hacks to kvm you really should fix the driver / TTM
> to not do weird memory allocations.

I agree that the driver/TTM behavior is nasty, but from a KVM perspective the vast
majority of this series is long-overdue cleanups (at least, IMO).  All of those
cleanups were my requirement for adding support for the behavior David and friends
actually care about.

KVM needs to be aware of non-refcounted struct page memory no matter what; see
CVE-2021-22543 and commit f8be156be163 ("KVM: do not allow mapping valid but
non-reference-counted pages").  I don't think it makes any sense whatsoever to
remove that code and assume every driver in existence will do the right thing.
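
For reference, the check in question boils down to something like this in
hva_to_pfn_remapped() (simplified sketch, not the exact upstream code):

	pfn = pte_pfn(pte);

	/*
	 * Refuse the mapping if KVM can't take a reference on the backing
	 * page; otherwise KVM would later drop a reference it never
	 * acquired.
	 */
	if (!kvm_try_get_pfn(pfn))
		r = -EFAULT;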

With the cleanups done, playing nice with non-refcounted pages instead of outright
rejecting them is a wash in terms of lines of code, complexity, and ongoing
maintenance cost.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-21 20:06         ` Dmitry Osipenko
@ 2023-09-30 13:34           ` Dmitry Osipenko
  0 siblings, 0 replies; 42+ messages in thread
From: Dmitry Osipenko @ 2023-09-30 13:34 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On 9/21/23 23:06, Dmitry Osipenko wrote:
> On 9/19/23 05:59, David Stevens wrote:
>> On Mon, Sep 18, 2023 at 8:19 PM Dmitry Osipenko
>> <dmitry.osipenko@collabora.com> wrote:
>>>
>>> On 9/18/23 12:58, Dmitry Osipenko wrote:
>>>> On 9/11/23 05:16, David Stevens wrote:
>>>>> From: David Stevens <stevensd@chromium.org>
>>>>>
>>>>> Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host
>>>>> to map memory into the guest that is backed by non-refcounted struct
>>>>> pages - for example, the tail pages of higher order non-compound pages
>>>>> allocated by the amdgpu driver via ttm_pool_alloc_page.
>>>>>
>>>>> The bulk of this change is tracking the is_refcounted_page flag so that
>>>>> non-refcounted pages don't trigger page_count() == 0 warnings. This is
>>>>> done by storing the flag in an unused bit in the sptes. There are no
>>>>> bits available in PAE SPTEs, so non-refcounted pages can only be handled
>>>>> on TDP and x86-64.
>>>>>
>>>>> Signed-off-by: David Stevens <stevensd@chromium.org>
>>>>> ---
>>>>>  arch/x86/kvm/mmu/mmu.c          | 52 +++++++++++++++++++++++----------
>>>>>  arch/x86/kvm/mmu/mmu_internal.h |  1 +
>>>>>  arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++--
>>>>>  arch/x86/kvm/mmu/spte.c         |  4 ++-
>>>>>  arch/x86/kvm/mmu/spte.h         | 12 +++++++-
>>>>>  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++------
>>>>>  include/linux/kvm_host.h        |  3 ++
>>>>>  virt/kvm/kvm_main.c             |  6 ++--
>>>>>  8 files changed, 76 insertions(+), 32 deletions(-)
>>>>
>>>> Could you please tell which kernel tree you used for the base of this
>>>> series? This patch #6 doesn't apply cleanly to stable/mainline/next/kvm
>>>>
>>>> error: sha1 information is lacking or useless (arch/x86/kvm/mmu/mmu.c).
>>>> error: could not build fake ancestor
>>>
>>> I applied the patch manually to v6.5.2 and tested Venus using Intel TGL iGPU, the intel driver is crashing:
>>>
>>>    BUG: kernel NULL pointer dereference, address: 0000000000000058
>>>    #PF: supervisor read access in kernel mode
>>>    #PF: error_code(0x0000) - not-present page
>>>    PGD 0 P4D 0
>>>    Oops: 0000 [#1] PREEMPT SMP
>>>    CPU: 1 PID: 5926 Comm: qemu-system-x86 Not tainted 6.5.2+ #114
>>>    Hardware name: LENOVO 20VE/LNVNB161216, BIOS F8CN43WW(V2.06) 08/12/2021
>>>    RIP: 0010:gen8_ppgtt_insert+0x50b/0x8f0
>>>    Code: 00 00 f7 c2 00 00 20 00 74 15 f7 c3 ff ff 1f 00 75 0d 41 81 fc ff ff 1f 00 0f 87 0e 02 00 00 48 8b 74 24 08 44 89 c0 45 85 ed <48> 8b 4e 58 48 8b 04 c1 0f 85 0b 02 00 00 81 e2 00 00 01 00 0f 84
>>>    RSP: 0018:ffffafc085afb820 EFLAGS: 00010246
>>>    RAX: 0000000000000000 RBX: 00000000e9604000 RCX: 000000000000001b
>>>    RDX: 0000000000211000 RSI: 0000000000000000 RDI: ffff9513d44c1000
>>>    RBP: ffff951106f8dfc0 R08: 0000000000000000 R09: 0000000000000003
>>>    R10: 0000000000000fff R11: 00000000e9800000 R12: 00000000001fc000
>>>    R13: 0000000000000000 R14: 0000000000001000 R15: 0000ffff00000000
>>>    FS:  00007f2a5bcced80(0000) GS:ffff951a87a40000(0000) knlGS:0000000000000000
>>>    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>    CR2: 0000000000000058 CR3: 0000000116f16006 CR4: 0000000000772ee0
>>>    PKRU: 55555554
>>>    Call Trace:
>>>     <TASK>
>>>     ? __die+0x1f/0x60
>>>     ? page_fault_oops+0x14d/0x420
>>>     ? exc_page_fault+0x3d7/0x880
>>>     ? lock_acquire+0xc9/0x290
>>>     ? asm_exc_page_fault+0x22/0x30
>>>     ? gen8_ppgtt_insert+0x50b/0x8f0
>>>     ppgtt_bind_vma+0x4f/0x60
>>>     fence_work+0x1b/0x70
>>>     fence_notify+0x8f/0x130
>>>     __i915_sw_fence_complete+0x58/0x230
>>>     i915_vma_pin_ww+0x513/0xa80
>>>     eb_validate_vmas+0x17e/0x9e0
>>>     ? eb_pin_engine+0x2bb/0x340
>>>     i915_gem_do_execbuffer+0xc85/0x2bf0
>>>     ? __lock_acquire+0x3b6/0x21c0
>>>     i915_gem_execbuffer2_ioctl+0xee/0x240
>>>     ? i915_gem_do_execbuffer+0x2bf0/0x2bf0
>>>     drm_ioctl_kernel+0x9d/0x140
>>>     drm_ioctl+0x1dd/0x410
>>>     ? i915_gem_do_execbuffer+0x2bf0/0x2bf0
>>>     ? __fget_files+0xc5/0x170
>>>     __x64_sys_ioctl+0x8c/0xc0
>>>     do_syscall_64+0x34/0x80
>>>     entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>    RIP: 0033:0x7f2a60b0c9df
>>>
>>>
>>> $ ./scripts/faddr2line ./vmlinux gen8_ppgtt_insert+0x50b/0x8f0
>>> gen8_ppgtt_insert+0x50b/0x8f0:
>>> i915_pt_entry at drivers/gpu/drm/i915/gt/intel_gtt.h:557
>>> (inlined by) gen8_ppgtt_insert_huge at drivers/gpu/drm/i915/gt/gen8_ppgtt.c:641
>>> (inlined by) gen8_ppgtt_insert at drivers/gpu/drm/i915/gt/gen8_ppgtt.c:743
>>>
>>> It's likely should be the i915 driver issue that is crashes with the NULL deref, but the origin of the bug should be the kvm page fault handling.
>>>
>>> David, could you please tell what tests you've run and post a link to yours kernel tree? Maybe I made obscure mistake while applied the patch manually.
>>
>> For tests, I ran the kvm selftests and then various ChromeOS
>> virtualization tests. Two things to note about the ChromeOS
>> virtualization tests are that they use crosvm instead of qemu, and
>> they use virtio-gpu+virgl for graphics in the guest. I tested on an
>> AMD device (since the goal of this series is to fix a compatibility
>> issue with the amdgpu driver), and on a TGL device.
>>
>> I don't have an easy way to share my kernel tree, but it's based on
>> v6.5-r3. The series I sent out is rebased onto the kvm next branch,
>> but there were only minor conflicts.
> 
> It's good that you mentioned crosvm because I only tested qemu.
> Interestingly, the problem doesn't present using crosvm.
> 
> I made few more tests using these options:
> 
>   1. yours latest version of the patch based on v6.5.4
>   2. yours latest version of the patch based on the kvm tree
>   3. by forward-porting yours older version of the kvm patchset to v6.5
> that works well based on v6.4 kernel
> 
> Neither of the options work with qemu and v6.5 kernel, but crosvm works.
> 
> At first I felt confident that it must be i915 bug of the v6.5 kernel.
> But given that crosvm works, now I'm not sure.
> 
> Will try to bisect 6.5 kernel, at minimum the i915 driver changes.

The bisection shows that it's not an i915 bug.

Good news: the bug is already fixed in recent linux-next. I don't know what
the exact fix is; I suppose it's something in mm/.

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-19  2:25     ` David Stevens
@ 2023-09-30 13:34       ` Dmitry Osipenko
  0 siblings, 0 replies; 42+ messages in thread
From: Dmitry Osipenko @ 2023-09-30 13:34 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On 9/19/23 05:25, David Stevens wrote:
> On Mon, Sep 18, 2023 at 6:53 PM Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
>>
>> On 9/11/23 05:16, David Stevens wrote:
>>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>>> @@ -848,7 +848,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>>>  
>>>  out_unlock:
>>>       write_unlock(&vcpu->kvm->mmu_lock);
>>> -     kvm_release_pfn_clean(fault->pfn);
>>> +     if (fault->is_refcounted_page)
>>> +             kvm_set_page_accessed(pfn_to_page(fault->pfn));
>>
>> The other similar occurrences in the code that replaced
>> kvm_release_pfn_clean() with kvm_set_page_accessed() did it under the
>> held mmu_lock.
>>
>> Does kvm_set_page_accessed() needs to be invoked under the lock?
> 
> It looks like I made a mistake when folding the v8->v9 delta into the stack of
> patches to get a clean v9 series. v8 of the series returned pfns without
> reference counts from __kvm_follow_pfn, so the x86 MMU needed to mark the pages
> as accessed under the lock. v9 instead returns pfns with a refcount (i.e. does
> the same thing as __gfn_to_pfn_memslot), so the x86 MMU should instead call
> kvm_release_page_clean outside of the lock. I've included the corrected version
> of this patch in this email.
[snip]

I tested this series + the corrected version of the patch on Intel TGL using
virgl/venus/virtio-intel on both qemu and crosvm, on top of recent linux-next.
Everything works well. Feel free to add my t-b to the v10:

Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> # virgl+venus+virtio-intel+i915

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 0/6] KVM: allow mapping non-refcounted pages
  2023-09-29 16:06   ` Sean Christopherson
@ 2023-10-02  6:25     ` Christoph Hellwig
  2024-02-06  3:29       ` Sean Christopherson
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2023-10-02  6:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Christoph Hellwig, David Stevens, Yu Zhang, Isaku Yamahata,
	Zhi Wang, kvmarm, linux-kernel, kvm

On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> KVM needs to be aware of non-refcounted struct page memory no matter what; see
> CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
> non-reference-counted pages").  I don't think it makes any sense whatsoever to
> remove that code and assume every driver in existence will do the right thing.

Agreed.

> 
> With the cleanups done, playing nice with non-refcounted paged instead of outright
> rejecting them is a wash in terms of lines of code, complexity, and ongoing
> maintenance cost.

I tend to strongly disagree with that, though.  We can't just let these
non-refcounted pages spread everywhere and instead need to fix their
usage.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function
  2023-09-11  2:16 ` [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function David Stevens
@ 2023-10-03 16:54   ` Maxim Levitsky
  2024-02-06  1:25     ` Sean Christopherson
  2024-02-06  3:16   ` Sean Christopherson
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Maxim Levitsky @ 2023-10-03 16:54 UTC (permalink / raw)
  To: David Stevens, Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On Mon, 2023-09-11 at 11:16 +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> __kvm_follow_pfn refactors the old API's arguments into a struct and,
> where possible, combines the boolean arguments into a single flags
> argument.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  include/linux/kvm_host.h |  16 ++++
>  virt/kvm/kvm_main.c      | 171 ++++++++++++++++++++++-----------------
>  virt/kvm/kvm_mm.h        |   3 +-
>  virt/kvm/pfncache.c      |  10 ++-
>  4 files changed, 123 insertions(+), 77 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fb6c6109fdca..c2e0ddf14dba 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -97,6 +97,7 @@
>  #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
>  #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
>  #define KVM_PFN_ERR_SIGPENDING	(KVM_PFN_ERR_MASK + 3)
> +#define KVM_PFN_ERR_NEEDS_IO	(KVM_PFN_ERR_MASK + 4)
>  
>  /*
>   * error pfns indicate that the gfn is in slot but faild to
> @@ -1177,6 +1178,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
>  void kvm_release_page_clean(struct page *page);
>  void kvm_release_page_dirty(struct page *page);
>  
> +struct kvm_follow_pfn {
> +	const struct kvm_memory_slot *slot;
> +	gfn_t gfn;
> +	unsigned int flags;
It might be useful for future readers to have a note about which values the flags can take
(e.g. one of the FOLL_* flags).
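
Something along these lines would do (example wording only):

	/*
	 * FOLL_* flags that are forwarded to get_user_pages_*(), e.g.
	 * FOLL_WRITE for write access or FOLL_NOWAIT for async faults.
	 */
	unsigned int flags;
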
> +	bool atomic;

I wish we had FOLL_ATOMIC, because there is a slight usability regression in the fact that
some of the flags now live here while others live in the 'atomic' variable.


> +	/* Try to create a writable mapping even for a read fault */
> +	bool try_map_writable;
> +
> +	/* Outputs of __kvm_follow_pfn */
> +	hva_t hva;
> +	bool writable;
> +};


Another small usability note. I feel like the name 'follow_pfn' is not the best name for this.

I think it ultimately comes from 'follow_pte()', and even that name is IMHO incorrect:
'follow_pte' should be called 'lookup_kernel_pte', because that is what it does - it finds a
pointer to the PTE of an hva in the process's page tables.

IMHO, the kvm_follow_pfn struct should be called something like gfn_to_pfn_params, because it
specifies how to convert a gfn to a pfn (or create the pfn if the page was swapped out).

The same note applies to '__kvm_follow_pfn()'.

If you really want to keep the original name though, I won't argue over this.

> +
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> +
>  kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
>  kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		      bool *writable);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ee6090ecb1fe..9b33a59c6d65 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2512,8 +2512,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>   * true indicates success, otherwise false is returned.  It's also the
>   * only part that runs if we can in atomic context.
>   */
> -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> -			    bool *writable, kvm_pfn_t *pfn)
> +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  {
>  	struct page *page[1];
>  
> @@ -2522,14 +2521,12 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
>  	 * or the caller allows to map a writable pfn for a read fault
>  	 * request.
>  	 */
> -	if (!(write_fault || writable))
> +	if (!((foll->flags & FOLL_WRITE) || foll->try_map_writable))
>  		return false;

A small note: the 'foll' variable and the FOLL_* flags have different meanings:
'foll' is a pointer to the new 'struct kvm_follow_pfn', while FOLL_* comes from the folio API,
I think.

Maybe we should rename 'foll' to something like 'args'?

>  
> -	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> +	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
>  		*pfn = page_to_pfn(page[0]);
> -
> -		if (writable)
> -			*writable = true;
> +		foll->writable = true;
>  		return true;
>  	}
>  
> @@ -2540,35 +2537,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
>   * The slow path to get the pfn of the specified host virtual address,
>   * 1 indicates success, -errno is returned if error is detected.
>   */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> -			   bool interruptible, bool *writable, kvm_pfn_t *pfn)
> +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  {
> -	unsigned int flags = FOLL_HWPOISON;
> +	unsigned int flags = FOLL_HWPOISON | foll->flags;
>  	struct page *page;
>  	int npages;
>  
>  	might_sleep();
>  
> -	if (writable)
> -		*writable = write_fault;
> -
> -	if (write_fault)
> -		flags |= FOLL_WRITE;
> -	if (async)
> -		flags |= FOLL_NOWAIT;
> -	if (interruptible)
> -		flags |= FOLL_INTERRUPTIBLE;
> -
> -	npages = get_user_pages_unlocked(addr, 1, &page, flags);
> +	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
>  	if (npages != 1)
>  		return npages;
>  
> -	/* map read fault as writable if possible */
> -	if (unlikely(!write_fault) && writable) {
> +	if (foll->flags & FOLL_WRITE) {
> +		foll->writable = true;
> +	} else if (foll->try_map_writable) {
>  		struct page *wpage;
>  
> -		if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> -			*writable = true;
> +		/* map read fault as writable if possible */
> +		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> +			foll->writable = true;
>  			put_page(page);
>  			page = wpage;

Regardless of this patch, I am wondering what the reason was for first mapping the
page the way it was requested and only then trying to map it as writable.

Since the vast majority of guest pages are likely to be writable, isn't it better to
first opportunistically map them as writable and, if that fails, fall back to mapping
them read-only?
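
Something like this completely untested sketch of hva_to_pfn_slow() (reusing the names
from this patch, and glossing over error handling of the opportunistic attempt) is what
I have in mind:

	if (!(foll->flags & FOLL_WRITE) && foll->try_map_writable) {
		/* Opportunistically try a writable mapping first. */
		npages = get_user_pages_unlocked(foll->hva, 1, &page,
						 flags | FOLL_WRITE);
		if (npages == 1) {
			foll->writable = true;
			*pfn = page_to_pfn(page);
			return 1;
		}
	}

	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
	if (npages != 1)
		return npages;

	if (foll->flags & FOLL_WRITE)
		foll->writable = true;

	*pfn = page_to_pfn(page);
	return 1;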

>  		}
> @@ -2599,23 +2587,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
>  }
>  
>  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> -			       unsigned long addr, bool write_fault,
> -			       bool *writable, kvm_pfn_t *p_pfn)
> +			       struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn)
>  {
>  	kvm_pfn_t pfn;
>  	pte_t *ptep;
>  	pte_t pte;
>  	spinlock_t *ptl;
> +	bool write_fault = foll->flags & FOLL_WRITE;
>  	int r;
>  
> -	r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> +	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
>  	if (r) {
>  		/*
>  		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
>  		 * not call the fault handler, so do it here.
>  		 */
>  		bool unlocked = false;
> -		r = fixup_user_fault(current->mm, addr,
> +		r = fixup_user_fault(current->mm, foll->hva,
>  				     (write_fault ? FAULT_FLAG_WRITE : 0),
>  				     &unlocked);
>  		if (unlocked)
> @@ -2623,7 +2611,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  		if (r)
>  			return r;
>  
> -		r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> +		r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
>  		if (r)
>  			return r;
>  	}
> @@ -2635,8 +2623,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (writable)
> -		*writable = pte_write(pte);
> +	foll->writable = pte_write(pte);
>  	pfn = pte_pfn(pte);
>  
>  	/*
> @@ -2681,24 +2668,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   * 2): @write_fault = false && @writable, @writable will tell the caller
>   *     whether the mapping is writable.
>   */
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> -		     bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
>  {
>  	struct vm_area_struct *vma;
>  	kvm_pfn_t pfn;
>  	int npages, r;
>  
>  	/* we can do it either atomically or asynchronously, not both */
> -	BUG_ON(atomic && async);
> +	BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
>  
> -	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> +	if (hva_to_pfn_fast(foll, &pfn))
>  		return pfn;
>  
> -	if (atomic)
> +	if (foll->atomic)
>  		return KVM_PFN_ERR_FAULT;
>  
> -	npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> -				 writable, &pfn);
> +	npages = hva_to_pfn_slow(foll, &pfn);
>  	if (npages == 1)
>  		return pfn;
>  	if (npages == -EINTR)
> @@ -2706,83 +2691,123 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>  
>  	mmap_read_lock(current->mm);
>  	if (npages == -EHWPOISON ||
> -	      (!async && check_user_page_hwpoison(addr))) {
> +	    (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
>  		pfn = KVM_PFN_ERR_HWPOISON;
>  		goto exit;
>  	}
>  
>  retry:
> -	vma = vma_lookup(current->mm, addr);
> +	vma = vma_lookup(current->mm, foll->hva);
>  
>  	if (vma == NULL)
>  		pfn = KVM_PFN_ERR_FAULT;
>  	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> -		r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> +		r = hva_to_pfn_remapped(vma, foll, &pfn);
>  		if (r == -EAGAIN)
>  			goto retry;
>  		if (r < 0)
>  			pfn = KVM_PFN_ERR_FAULT;
>  	} else {
> -		if (async && vma_is_valid(vma, write_fault))
> -			*async = true;
> -		pfn = KVM_PFN_ERR_FAULT;
> +		if ((foll->flags & FOLL_NOWAIT) &&
> +		    vma_is_valid(vma, foll->flags & FOLL_WRITE))
> +			pfn = KVM_PFN_ERR_NEEDS_IO;
> +		else
> +			pfn = KVM_PFN_ERR_FAULT;
>  	}
>  exit:
>  	mmap_read_unlock(current->mm);
>  	return pfn;
>  }
>  
> -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> -			       bool atomic, bool interruptible, bool *async,
> -			       bool write_fault, bool *writable, hva_t *hva)
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
>  {
> -	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> +	foll->writable = false;
> +	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> +				      foll->flags & FOLL_WRITE);
>  
> -	if (hva)
> -		*hva = addr;
> -
> -	if (addr == KVM_HVA_ERR_RO_BAD) {
> -		if (writable)
> -			*writable = false;
> +	if (foll->hva == KVM_HVA_ERR_RO_BAD)
>  		return KVM_PFN_ERR_RO_FAULT;
> -	}
>  
> -	if (kvm_is_error_hva(addr)) {
> -		if (writable)
> -			*writable = false;
> +	if (kvm_is_error_hva(foll->hva))
>  		return KVM_PFN_NOSLOT;
> -	}
>  
> -	/* Do not map writable pfn in the readonly memslot. */
> -	if (writable && memslot_is_readonly(slot)) {
> -		*writable = false;
> -		writable = NULL;
> -	}
> +	if (memslot_is_readonly(foll->slot))
> +		foll->try_map_writable = false;
>  
> -	return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> -			  writable);
> +	return hva_to_pfn(foll);
> +}
> +EXPORT_SYMBOL_GPL(__kvm_follow_pfn);
> +
> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> +			       bool atomic, bool interruptible, bool *async,
> +			       bool write_fault, bool *writable, hva_t *hva)
> +{
> +	kvm_pfn_t pfn;
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = 0,
> +		.atomic = atomic,
> +		.try_map_writable = !!writable,
> +	};
> +
> +	if (write_fault)
> +		foll.flags |= FOLL_WRITE;
> +	if (async)
> +		foll.flags |= FOLL_NOWAIT;
> +	if (interruptible)
> +		foll.flags |= FOLL_INTERRUPTIBLE;
> +
> +	pfn = __kvm_follow_pfn(&foll);
> +	if (pfn == KVM_PFN_ERR_NEEDS_IO) {
> +		*async = true;
> +		pfn = KVM_PFN_ERR_FAULT;
> +	}
> +	if (hva)
> +		*hva = foll.hva;
> +	if (writable)
> +		*writable = foll.writable;
> +	return pfn;
>  }
>  EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>  
>  kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		      bool *writable)
>  {
> -	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
> -				    NULL, write_fault, writable, NULL);
> +	kvm_pfn_t pfn;
> +	struct kvm_follow_pfn foll = {
> +		.slot = gfn_to_memslot(kvm, gfn),
> +		.gfn = gfn,
> +		.flags = write_fault ? FOLL_WRITE : 0,
> +		.try_map_writable = !!writable,
> +	};
> +	pfn = __kvm_follow_pfn(&foll);
> +	if (writable)
> +		*writable = foll.writable;
> +	return pfn;
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);

Unrelated, but it would be a good idea to document the contract behind these functions.

>  
>  kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> -	return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
> -				    NULL, NULL);
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = FOLL_WRITE,
> +	};
> +	return __kvm_follow_pfn(&foll);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
Same here. At least the new code makes it much easier to understand what each function is doing.

>  
>  kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> -	return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
> -				    NULL, NULL);
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = FOLL_WRITE,
> +		.atomic = true,
> +	};
> +	return __kvm_follow_pfn(&foll);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
>  
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index 180f1a09e6ba..ed896aee5396 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -20,8 +20,7 @@
>  #define KVM_MMU_UNLOCK(kvm)		spin_unlock(&(kvm)->mmu_lock)
>  #endif /* KVM_HAVE_MMU_RWLOCK */
>  
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> -		     bool *async, bool write_fault, bool *writable);
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);
>  
>  #ifdef CONFIG_HAVE_KVM_PFNCACHE
>  void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 2d6aba677830..86cd40acad11 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  	kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
>  	void *new_khva = NULL;
>  	unsigned long mmu_seq;
> +	struct kvm_follow_pfn foll = {
> +		.slot = gpc->memslot,
> +		.gfn = gpa_to_gfn(gpc->gpa),
> +		.flags = FOLL_WRITE,
> +		.hva = gpc->uhva,
> +	};
>  
>  	lockdep_assert_held(&gpc->refresh_lock);
>  
> @@ -182,8 +188,8 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  			cond_resched();
>  		}
>  
> -		/* We always request a writeable mapping */
> -		new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
> +		/* We always request a writable mapping */
> +		new_pfn = hva_to_pfn(&foll);
>  		if (is_error_noslot_pfn(new_pfn))
>  			goto out_error;
>  


Overall the patch looks like a very good refactoring.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>


Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns
  2023-09-11  2:16 ` [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns David Stevens
@ 2023-10-03 16:54   ` Maxim Levitsky
  2024-02-06  2:54     ` Sean Christopherson
  2024-02-13  3:44   ` Sean Christopherson
  1 sibling, 1 reply; 42+ messages in thread
From: Maxim Levitsky @ 2023-10-03 16:54 UTC (permalink / raw)
  To: David Stevens, Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On Mon, 2023-09-11 at 11:16 +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> KVM's handling of non-refcounted pfns has two problems:
> 
>  - struct pages without refcounting (e.g. tail pages of non-compound
>    higher order pages) cannot be used at all, as gfn_to_pfn does not
>    provide enough information for callers to handle the refcount.
>  - pfns without struct pages can be accessed without the protection of a
>    mmu notifier. This is unsafe because KVM cannot monitor or control
>    the lifespan of such pfns, so it may continue to access the pfns
>    after they are freed.
> 
> This patch extends the __kvm_follow_pfn API to properly handle these
> cases. 


> First, it adds a is_refcounted_page output parameter so that
> callers can tell whether or not a pfn has a struct page that needs to be
> passed to put_page. 


> Second, it adds a guarded_by_mmu_notifier parameter
> that is used to avoid returning non-refcounted pages when the caller
> cannot safely use them.
> 
> Since callers need to be updated on a case-by-case basis to pay
> attention to is_refcounted_page, the new behavior of returning
> non-refcounted pages is opt-in via the allow_non_refcounted_struct_page
> parameter. Once all callers have been updated, this parameter should be
> removed.

Small note: since these new parameters are critical for understanding the patch, maybe
it makes sense to re-order their descriptions to match the order in the struct (or at
least put the output parameter at the end of the description), and to give each one a
separate paragraph, as I did above.

> 
> The fact that non-refcounted pfns can no longer be accessed without mmu
> notifier protection is a breaking change. Since there is no timeline for
> updating everything in KVM to use mmu notifiers, this change adds an
> opt-in module parameter called allow_unsafe_mappings to allow such
> mappings. Systems which trust userspace not to tear down such unsafe
> mappings while KVM is using them can set this parameter to re-enable the
> legacy behavior.

Do you have a practical example of a VM that can break with this change?
E.g. will a normal VM break? Will a VM with VFIO devices break? Will a VM with
hugepages mapped into it break?

Will the trick of limiting kernel memory with 'mem=X' and then using the extra
'upper' memory for VMs still work?

> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  include/linux/kvm_host.h | 21 ++++++++++
>  virt/kvm/kvm_main.c      | 84 ++++++++++++++++++++++++----------------
>  virt/kvm/pfncache.c      |  1 +
>  3 files changed, 72 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c2e0ddf14dba..2ed08ae1a9be 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1185,10 +1185,31 @@ struct kvm_follow_pfn {
>  	bool atomic;
>  	/* Try to create a writable mapping even for a read fault */
>  	bool try_map_writable;
> +	/* Usage of the returned pfn will be guared by a mmu notifier. */
> +	bool guarded_by_mmu_notifier;
> +	/*
> +	 * When false, do not return pfns for non-refcounted struct pages.
> +	 *
> +	 * TODO: This allows callers to use kvm_release_pfn on the pfns
> +	 * returned by gfn_to_pfn without worrying about corrupting the
> +	 * refcounted of non-refcounted pages. Once all callers respect
Typo: refcount.
> +	 * is_refcounted_page, this flag should be removed.
> +	 */
> +	bool allow_non_refcounted_struct_page;
>  
>  	/* Outputs of __kvm_follow_pfn */
>  	hva_t hva;
>  	bool writable;
> +	/*
> +	 * True if the returned pfn is for a page with a valid refcount. False
> +	 * if the returned pfn has no struct page or if the struct page is not
> +	 * being refcounted (e.g. tail pages of non-compound higher order
> +	 * allocations from IO/PFNMAP mappings).
> 
Aren't all tail pages non-refcounted (e.g. the tail pages of a hugepage)?
I haven't researched this topic yet.

> +	 *
> +	 * When this output flag is false, callers should not try to convert
> +	 * the pfn to a struct page.
> +	 */
> +	bool is_refcounted_page;
>  };
>  
>  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9b33a59c6d65..235c5cb3fdac 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -96,6 +96,10 @@ unsigned int halt_poll_ns_shrink;
>  module_param(halt_poll_ns_shrink, uint, 0644);
>  EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
>  
> +/* Allow non-struct page memory to be mapped without MMU notifier protection. */
> +static bool allow_unsafe_mappings;
> +module_param(allow_unsafe_mappings, bool, 0444);
> +
>  /*
>   * Ordering of locks:
>   *
> @@ -2507,6 +2511,15 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>  	return rc == -EHWPOISON;
>  }
>  
> +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> +					   struct page *page)
> +{
> +	kvm_pfn_t pfn = page_to_pfn(page);
> +
> +	foll->is_refcounted_page = true;
> +	return pfn;
> +}

Just a matter of taste, but to me this function looks confusing.
IMHO, just duplicating these two lines of code is better.
However, if you prefer it this way, I won't argue over it.

> +
>  /*
>   * The fast path to get the writable pfn which will be stored in @pfn,
>   * true indicates success, otherwise false is returned.  It's also the
> @@ -2525,7 +2538,7 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  		return false;
>  
>  	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> -		*pfn = page_to_pfn(page[0]);
> +		*pfn = kvm_follow_refcounted_pfn(foll, page[0]);

Yep, here just 'foll->is_refcounted_page = true;' looks more readable to me.

>  		foll->writable = true;
>  		return true;
>  	}
> @@ -2561,7 +2574,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  			page = wpage;
>  		}
>  	}
> -	*pfn = page_to_pfn(page);
> +	*pfn = kvm_follow_refcounted_pfn(foll, page);
Same here and probably in other places.
>  	return npages;
>  }
>  
> @@ -2576,16 +2589,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
>  	return true;
>  }
>  
> -static int kvm_try_get_pfn(kvm_pfn_t pfn)
> -{
> -	struct page *page = kvm_pfn_to_refcounted_page(pfn);
> -
> -	if (!page)
> -		return 1;
> -
> -	return get_page_unless_zero(page);
> -}
> -
>  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  			       struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn)
>  {
> @@ -2594,6 +2597,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  	pte_t pte;
>  	spinlock_t *ptl;
>  	bool write_fault = foll->flags & FOLL_WRITE;
> +	struct page *page;
>  	int r;
>  
>  	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> @@ -2618,37 +2622,39 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  
>  	pte = ptep_get(ptep);
>  
> +	foll->writable = pte_write(pte);
> +	pfn = pte_pfn(pte);
> +
> +	page = kvm_pfn_to_refcounted_page(pfn);
> +
>  	if (write_fault && !pte_write(pte)) {
>  		pfn = KVM_PFN_ERR_RO_FAULT;
>  		goto out;
>  	}
>  
> -	foll->writable = pte_write(pte);
> -	pfn = pte_pfn(pte);
> +	if (!page)
> +		goto out;
>  
> -	/*
> -	 * Get a reference here because callers of *hva_to_pfn* and
> -	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
> -	 * returned pfn.  This is only needed if the VMA has VM_MIXEDMAP
> -	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
> -	 * simply do nothing for reserved pfns.
> -	 *
> -	 * Whoever called remap_pfn_range is also going to call e.g.
> -	 * unmap_mapping_range before the underlying pages are freed,
> -	 * causing a call to our MMU notifier.
> -	 *
> -	 * Certain IO or PFNMAP mappings can be backed with valid
> -	 * struct pages, but be allocated without refcounting e.g.,
> -	 * tail pages of non-compound higher order allocations, which
> -	 * would then underflow the refcount when the caller does the
> -	 * required put_page. Don't allow those pages here.
> -	 */

Why is the comment removed? As far as I can see, the code still grabs a reference to the page.

> -	if (!kvm_try_get_pfn(pfn))
> -		r = -EFAULT;
> +	if (get_page_unless_zero(page))
> +		WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);

Once again, the kvm_follow_refcounted_pfn() usage is confusing IMHO.
It sets 'foll->is_refcounted_page', yet a reader can easily think it's only there for
the WARN_ON_ONCE.

This would read better IMHO:

if (get_page_unless_zero(page))
	foll->is_refcounted_page = true;

WARN_ON_ONCE(page_to_pfn(page) != pfn);

Note that I moved the warning out of the 'get_page_unless_zero' condition,
because I think the page_to_pfn(page) == pfn check should hold for non-refcounted pages as well.

Also, I don't understand why the result of 'get_page_unless_zero(page)' is ignored. As I
understand it, it increases the refcount of a page unless it is zero.

If the refcount of a refcounted page is 0, isn't that a bug?

The page was returned from kvm_pfn_to_refcounted_page(), which is supposed to only return
pages that are refcounted.

I might be missing something about the 'get_page_unless_zero(page)' usage in both the old
and the new code.

>  
>  out:
>  	pte_unmap_unlock(ptep, ptl);
> -	*p_pfn = pfn;
> +
> +	/*
> +	 * TODO: Remove the first branch once all callers have been
> +	 * taught to play nice with non-refcounted struct pages.
> +	 */
> +	if (page && !foll->is_refcounted_page &&
> +	    !foll->allow_non_refcounted_struct_page) {
> +		r = -EFAULT;
> +	} else if (!foll->is_refcounted_page &&
> +		   !foll->guarded_by_mmu_notifier &&
> +		   !allow_unsafe_mappings) {
> +		r = -EFAULT;
> +	} else {
> +		*p_pfn = pfn;
> +	}
>  
>  	return r;
>  }
> @@ -2722,6 +2728,8 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
>  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
>  {
>  	foll->writable = false;
> +	foll->is_refcounted_page = false;
> +
>  	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
>  				      foll->flags & FOLL_WRITE);
>  
> @@ -2749,6 +2757,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>  		.flags = 0,
>  		.atomic = atomic,
>  		.try_map_writable = !!writable,
> +		.allow_non_refcounted_struct_page = false,
>  	};
>  
>  	if (write_fault)
> @@ -2780,6 +2789,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		.gfn = gfn,
>  		.flags = write_fault ? FOLL_WRITE : 0,
>  		.try_map_writable = !!writable,
> +		.allow_non_refcounted_struct_page = false,
>  	};
>  	pfn = __kvm_follow_pfn(&foll);
>  	if (writable)
> @@ -2794,6 +2804,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
>  		.slot = slot,
>  		.gfn = gfn,
>  		.flags = FOLL_WRITE,
> +		.allow_non_refcounted_struct_page = false,
>  	};
>  	return __kvm_follow_pfn(&foll);
>  }
> @@ -2806,6 +2817,11 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
>  		.gfn = gfn,
>  		.flags = FOLL_WRITE,
>  		.atomic = true,
> +		/*
> +		 * Setting atomic means __kvm_follow_pfn will never make it
> +		 * to hva_to_pfn_remapped, so this is vacuously true.
> +		 */
> +		.allow_non_refcounted_struct_page = true,
>  	};
>  	return __kvm_follow_pfn(&foll);
>  }
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 86cd40acad11..6bbf972c11f8 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -149,6 +149,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  		.gfn = gpa_to_gfn(gpc->gpa),
>  		.flags = FOLL_WRITE,
>  		.hva = gpc->uhva,
> +		.allow_non_refcounted_struct_page = false,
>  	};
>  
>  	lockdep_assert_held(&gpc->refresh_lock);


Best regards,
	Maxim Levitsky







^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 4/6] KVM: Migrate kvm_vcpu_map to __kvm_follow_pfn
  2023-09-11  2:16 ` [PATCH v9 4/6] KVM: Migrate kvm_vcpu_map to __kvm_follow_pfn David Stevens
@ 2023-10-03 16:54   ` Maxim Levitsky
  0 siblings, 0 replies; 42+ messages in thread
From: Maxim Levitsky @ 2023-10-03 16:54 UTC (permalink / raw)
  To: David Stevens, Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On Mon, 2023-09-11 at 11:16 +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Migrate kvm_vcpu_map to __kvm_follow_pfn. Track is_refcounted_page so
> that kvm_vcpu_unmap know whether or not it needs to release the page.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  include/linux/kvm_host.h |  2 +-
>  virt/kvm/kvm_main.c      | 24 ++++++++++++++----------
>  2 files changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2ed08ae1a9be..b95c79b7833b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -294,6 +294,7 @@ struct kvm_host_map {
>  	void *hva;
>  	kvm_pfn_t pfn;
>  	kvm_pfn_t gfn;
> +	bool is_refcounted_page;
>  };
>  
>  /*
> @@ -1228,7 +1229,6 @@ void kvm_release_pfn_dirty(kvm_pfn_t pfn);
>  void kvm_set_pfn_dirty(kvm_pfn_t pfn);
>  void kvm_set_pfn_accessed(kvm_pfn_t pfn);
>  
> -void kvm_release_pfn(kvm_pfn_t pfn, bool dirty);
>  int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>  			int len);
>  int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 235c5cb3fdac..913de4e86d9d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2886,24 +2886,22 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_page);
>  
> -void kvm_release_pfn(kvm_pfn_t pfn, bool dirty)
> -{
> -	if (dirty)
> -		kvm_release_pfn_dirty(pfn);
> -	else
> -		kvm_release_pfn_clean(pfn);
> -}
> -
>  int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
>  {
>  	kvm_pfn_t pfn;
>  	void *hva = NULL;
>  	struct page *page = KVM_UNMAPPED_PAGE;
> +	struct kvm_follow_pfn foll = {
> +		.slot = gfn_to_memslot(vcpu->kvm, gfn),
> +		.gfn = gfn,
> +		.flags = FOLL_WRITE,
> +		.allow_non_refcounted_struct_page = true,
> +	};
>  
>  	if (!map)
>  		return -EINVAL;
>  
> -	pfn = gfn_to_pfn(vcpu->kvm, gfn);
> +	pfn = __kvm_follow_pfn(&foll);
>  	if (is_error_noslot_pfn(pfn))
>  		return -EINVAL;
>  
> @@ -2923,6 +2921,7 @@ int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
>  	map->hva = hva;
>  	map->pfn = pfn;
>  	map->gfn = gfn;
> +	map->is_refcounted_page = foll.is_refcounted_page;
>  
>  	return 0;
>  }
> @@ -2946,7 +2945,12 @@ void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty)
>  	if (dirty)
>  		kvm_vcpu_mark_page_dirty(vcpu, map->gfn);
>  
> -	kvm_release_pfn(map->pfn, dirty);
> +	if (map->is_refcounted_page) {
> +		if (dirty)
> +			kvm_release_page_dirty(map->page);
> +		else
> +			kvm_release_page_clean(map->page);
> +	}
>  
>  	map->hva = NULL;
>  	map->page = NULL;

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 5/6] KVM: x86: Migrate to __kvm_follow_pfn
  2023-09-11  2:16 ` [PATCH v9 5/6] KVM: x86: Migrate " David Stevens
@ 2023-10-03 16:54   ` Maxim Levitsky
  2023-10-03 20:58     ` Sean Christopherson
  0 siblings, 1 reply; 42+ messages in thread
From: Maxim Levitsky @ 2023-10-03 16:54 UTC (permalink / raw)
  To: David Stevens, Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On Mon, 2023-09-11 at 11:16 +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Migrate functions which need access to is_refcounted_page to
> __kvm_follow_pfn. The functions which need this are __kvm_faultin_pfn
> and reexecute_instruction. The former requires replacing the async
> in/out parameter with FOLL_NOWAIT parameter and the KVM_PFN_ERR_NEEDS_IO
> return value. Handling non-refcounted pages is complicated, so it will
> be done in a followup. The latter is a straightforward refactor.
> 
> APIC related callers do not need to migrate because KVM controls the
> memslot, so it will always be regular memory. Prefetch related callers
> do not need to be migrated because atomic gfn_to_pfn calls can never
> make it to hva_to_pfn_remapped.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  arch/x86/kvm/mmu/mmu.c | 43 ++++++++++++++++++++++++++++++++----------
>  arch/x86/kvm/x86.c     | 12 ++++++++++--
>  2 files changed, 43 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e1d011c67cc6..e1eca26215e2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4254,7 +4254,14 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_memory_slot *slot = fault->slot;
> -	bool async;
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = fault->gfn,
> +		.flags = fault->write ? FOLL_WRITE : 0,
> +		.try_map_writable = true,
> +		.guarded_by_mmu_notifier = true,
> +		.allow_non_refcounted_struct_page = false,
> +	};
>  
>  	/*
>  	 * Retry the page fault if the gfn hit a memslot that is being deleted
> @@ -4283,12 +4290,20 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  			return RET_PF_EMULATE;
>  	}
>  
> -	async = false;
> -	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> -					  fault->write, &fault->map_writable,
> -					  &fault->hva);
> -	if (!async)
> -		return RET_PF_CONTINUE; /* *pfn has correct page already */
> +	foll.flags |= FOLL_NOWAIT;
> +	fault->pfn = __kvm_follow_pfn(&foll);
> +
> +	if (!is_error_noslot_pfn(fault->pfn))
> +		goto success;
Unrelated, but I can't say I like the 'is_error_noslot_pfn()' name;
I wish it were called something like is_valid_pfn().

> +
> +	/*
> +	 * If __kvm_follow_pfn() failed because I/O is needed to fault in the
> +	 * page, then either set up an asynchronous #PF to do the I/O, or if
> +	 * doing an async #PF isn't possible, retry __kvm_follow_pfn() with
> +	 * I/O allowed. All other failures are fatal, i.e. retrying won't help.
> +	 */
> +	if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
> +		return RET_PF_CONTINUE;
>  
>  	if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
>  		trace_kvm_try_async_get_page(fault->addr, fault->gfn);
> @@ -4306,9 +4321,17 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	 * to wait for IO.  Note, gup always bails if it is unable to quickly
>  	 * get a page and a fatal signal, i.e. SIGKILL, is pending.
>  	 */
> -	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
> -					  fault->write, &fault->map_writable,
> -					  &fault->hva);
> +	foll.flags |= FOLL_INTERRUPTIBLE;
> +	foll.flags &= ~FOLL_NOWAIT;
> +	fault->pfn = __kvm_follow_pfn(&foll);
> +
> +	if (!is_error_noslot_pfn(fault->pfn))
> +		goto success;
> +
> +	return RET_PF_CONTINUE;
> +success:
> +	fault->hva = foll.hva;
> +	fault->map_writable = foll.writable;
>  	return RET_PF_CONTINUE;
>  }
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6c9c81e82e65..2011a7e47296 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8556,6 +8556,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  {
>  	gpa_t gpa = cr2_or_gpa;
>  	kvm_pfn_t pfn;
> +	struct kvm_follow_pfn foll;
>  
>  	if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
>  		return false;
> @@ -8585,7 +8586,13 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	 * retry instruction -> write #PF -> emulation fail -> retry
>  	 * instruction -> ...
>  	 */
> -	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> +	foll = (struct kvm_follow_pfn) {
> +		.slot = gfn_to_memslot(vcpu->kvm, gpa_to_gfn(gpa)),
> +		.gfn = gpa_to_gfn(gpa),
> +		.flags = FOLL_WRITE,
> +		.allow_non_refcounted_struct_page = true,
> +	};
> +	pfn = __kvm_follow_pfn(&foll);
>  
>  	/*
>  	 * If the instruction failed on the error pfn, it can not be fixed,
> @@ -8594,7 +8601,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	if (is_error_noslot_pfn(pfn))
>  		return false;
>  
> -	kvm_release_pfn_clean(pfn);
> +	if (foll.is_refcounted_page)
> +		kvm_release_page_clean(pfn_to_page(pfn));
>  
>  	/* The instructions are well-emulated on direct mmu. */
>  	if (vcpu->arch.mmu->root_role.direct) {


Overall looks good, I might have missed something.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>


Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-11  2:16 ` [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages David Stevens
  2023-09-18  9:53   ` Dmitry Osipenko
  2023-09-18  9:58   ` Dmitry Osipenko
@ 2023-10-03 16:54   ` Maxim Levitsky
  2024-02-06  3:23   ` Sean Christopherson
  3 siblings, 0 replies; 42+ messages in thread
From: Maxim Levitsky @ 2023-10-03 16:54 UTC (permalink / raw)
  To: David Stevens, Sean Christopherson
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On Mon, 2023-09-11 at 11:16 +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host
> to map memory into the guest that is backed by non-refcounted struct
> pages - for example, the tail pages of higher order non-compound pages
> allocated by the amdgpu driver via ttm_pool_alloc_page.
> 
> The bulk of this change is tracking the is_refcounted_page flag so that
> non-refcounted pages don't trigger page_count() == 0 warnings. This is
> done by storing the flag in an unused bit in the sptes. There are no
> bits available in PAE SPTEs, so non-refcounted pages can only be handled
> on TDP and x86-64.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 52 +++++++++++++++++++++++----------
>  arch/x86/kvm/mmu/mmu_internal.h |  1 +
>  arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++--
>  arch/x86/kvm/mmu/spte.c         |  4 ++-
>  arch/x86/kvm/mmu/spte.h         | 12 +++++++-
>  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++------
>  include/linux/kvm_host.h        |  3 ++
>  virt/kvm/kvm_main.c             |  6 ++--
>  8 files changed, 76 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e1eca26215e2..b8168cc4cc96 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -545,12 +545,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>  
>  	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
>  		flush = true;
> -		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> +		if (is_refcounted_page_pte(old_spte))
> +			kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
>  	}
>  
>  	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
>  		flush = true;
> -		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> +		if (is_refcounted_page_pte(old_spte))
> +			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
>  	}
>  
>  	return flush;
> @@ -588,14 +590,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
>  	 * before they are reclaimed.  Sanity check that, if the pfn is backed
>  	 * by a refcounted page, the refcount is elevated.
>  	 */
> -	page = kvm_pfn_to_refcounted_page(pfn);
> -	WARN_ON_ONCE(page && !page_count(page));
> +	if (is_refcounted_page_pte(old_spte)) {
> +		page = kvm_pfn_to_refcounted_page(pfn);
> +		WARN_ON_ONCE(!page || !page_count(page));
> +	}
>  
> -	if (is_accessed_spte(old_spte))
> -		kvm_set_pfn_accessed(pfn);
> +	if (is_refcounted_page_pte(old_spte)) {
> +		if (is_accessed_spte(old_spte))
> +			kvm_set_page_accessed(pfn_to_page(pfn));
>  
> -	if (is_dirty_spte(old_spte))
> -		kvm_set_pfn_dirty(pfn);
> +		if (is_dirty_spte(old_spte))
> +			kvm_set_page_dirty(pfn_to_page(pfn));
> +	}
>  
>  	return old_spte;
>  }
> @@ -631,8 +637,8 @@ static bool mmu_spte_age(u64 *sptep)
>  		 * Capture the dirty status of the page, so that it doesn't get
>  		 * lost when the SPTE is marked for access tracking.
>  		 */
> -		if (is_writable_pte(spte))
> -			kvm_set_pfn_dirty(spte_to_pfn(spte));
> +		if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
> +			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));
>  
>  		spte = mark_spte_for_access_track(spte);
>  		mmu_spte_update_no_track(sptep, spte);
> @@ -1261,8 +1267,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
>  {
>  	bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
>  					       (unsigned long *)sptep);
> -	if (was_writable && !spte_ad_enabled(*sptep))
> -		kvm_set_pfn_dirty(spte_to_pfn(*sptep));
> +	if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
> +		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));
>  
>  	return was_writable;
>  }
> @@ -2913,6 +2919,11 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
>  	bool host_writable = !fault || fault->map_writable;
>  	bool prefetch = !fault || fault->prefetch;
>  	bool write_fault = fault && fault->write;
> +	/*
> +	 * Prefetching uses gfn_to_page_many_atomic, which never gets
> +	 * non-refcounted pages.
> +	 */
> +	bool is_refcounted = !fault || fault->is_refcounted_page;
A WARN_ON_ONCE that catches a future bug making this assumption false might be a good idea.

>  
>  	if (unlikely(is_noslot_pfn(pfn))) {
>  		vcpu->stat.pf_mmio_spte_created++;
> @@ -2940,7 +2951,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
>  	}
>  
>  	wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
> -			   true, host_writable, &spte);
> +			   true, host_writable, is_refcounted, &spte);
>  
>  	if (*sptep == spte) {
>  		ret = RET_PF_SPURIOUS;
> @@ -4254,13 +4265,18 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_memory_slot *slot = fault->slot;
> +	/*
> +	 * There are no extra bits for tracking non-refcounted pages in
> +	 * PAE SPTEs, so reject non-refcounted struct pages in that case.
> +	 */
> +	bool has_spte_refcount_bit = tdp_enabled && IS_ENABLED(CONFIG_X86_64);

I think the tdp_enabled condition is needed because shadow paging uses the same paging
mode as the guest, which can be PAE, and PAE SPTEs have no spare bits for this.

Is this true? If so, can you write a comment about it? I haven't worked with shadow paging
for a long time, and I no longer remember some of the details.
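
If that is indeed the reason, something along these lines (just a sketch of the wording,
assuming my understanding above is correct) would be enough:

	/*
	 * Shadow paging (!tdp_enabled) reuses the guest's paging mode, which
	 * can be 32-bit PAE, and PAE SPTEs have no spare software bits to
	 * track the refcounted state, hence the TDP && 64-bit restriction.
	 */
	bool has_spte_refcount_bit = tdp_enabled && IS_ENABLED(CONFIG_X86_64);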


>  	struct kvm_follow_pfn foll = {
>  		.slot = slot,
>  		.gfn = fault->gfn,
>  		.flags = fault->write ? FOLL_WRITE : 0,
>  		.try_map_writable = true,
>  		.guarded_by_mmu_notifier = true,
> -		.allow_non_refcounted_struct_page = false,
> +		.allow_non_refcounted_struct_page = has_spte_refcount_bit,
>  	};
>  
>  	/*
> @@ -4277,6 +4293,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  			fault->slot = NULL;
>  			fault->pfn = KVM_PFN_NOSLOT;
>  			fault->map_writable = false;
> +			fault->is_refcounted_page = false;
>  			return RET_PF_CONTINUE;
>  		}
>  		/*
> @@ -4332,6 +4349,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  success:
>  	fault->hva = foll.hva;
>  	fault->map_writable = foll.writable;
> +	fault->is_refcounted_page = foll.is_refcounted_page;
>  	return RET_PF_CONTINUE;
>  }
>  
> @@ -4420,8 +4438,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	r = direct_map(vcpu, fault);
>  
>  out_unlock:
> +	if (fault->is_refcounted_page)
> +		kvm_set_page_accessed(pfn_to_page(fault->pfn));
>  	write_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);
>  	return r;
>  }
>  
> @@ -4496,8 +4515,9 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>  	r = kvm_tdp_mmu_map(vcpu, fault);
>  
>  out_unlock:
> +	if (fault->is_refcounted_page)
> +		kvm_set_page_accessed(pfn_to_page(fault->pfn));
>  	read_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);
>  	return r;
>  }
>  #endif
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index b102014e2c60..7f73bc2a552e 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -239,6 +239,7 @@ struct kvm_page_fault {
>  	kvm_pfn_t pfn;
>  	hva_t hva;
>  	bool map_writable;
> +	bool is_refcounted_page;
>  
>  	/*
>  	 * Indicates the guest is trying to write a gfn that contains one or
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index c85255073f67..0ac4a4e5870c 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -848,7 +848,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  
>  out_unlock:
>  	write_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);
> +	if (fault->is_refcounted_page)
> +		kvm_set_page_accessed(pfn_to_page(fault->pfn));
>  	return r;
>  }
>  
> @@ -902,7 +903,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>   */
>  static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
>  {
> -	bool host_writable;
> +	bool host_writable, is_refcounted;
>  	gpa_t first_pte_gpa;
>  	u64 *sptep, spte;
>  	struct kvm_memory_slot *slot;
> @@ -959,10 +960,11 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
>  	sptep = &sp->spt[i];
>  	spte = *sptep;
>  	host_writable = spte & shadow_host_writable_mask;
> +	is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;

What will happen if this function runs on a 32-bit kernel and/or without TDP enabled
(that is, when the SPTE_MMU_PAGE_REFCOUNTED bit can't be used)?

>  	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>  	make_spte(vcpu, sp, slot, pte_access, gfn,
>  		  spte_to_pfn(spte), spte, true, false,
> -		  host_writable, &spte);
> +		  host_writable, is_refcounted, &spte);
>  
>  	return mmu_spte_update(sptep, spte);
>  }
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 4a599130e9c9..ce495819061f 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	       const struct kvm_memory_slot *slot,
>  	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
>  	       u64 old_spte, bool prefetch, bool can_unsync,
> -	       bool host_writable, u64 *new_spte)
> +	       bool host_writable, bool is_refcounted, u64 *new_spte)
>  {
>  	int level = sp->role.level;
>  	u64 spte = SPTE_MMU_PRESENT_MASK;
> @@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  
>  	if (level > PG_LEVEL_4K)
>  		spte |= PT_PAGE_SIZE_MASK;
> +	if (is_refcounted)
> +		spte |= SPTE_MMU_PAGE_REFCOUNTED;

Same here: if make_spte() is used in these modes, won't it set a bit it shouldn't?

>  
>  	if (shadow_memtype_mask)
>  		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index a129951c9a88..4bf4a535c23d 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -96,6 +96,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
>  /* Defined only to keep the above static asserts readable. */
>  #undef SHADOW_ACC_TRACK_SAVED_MASK
>  
> +/*
> + * Indicates that the SPTE refers to a page with a valid refcount.
> + */
> +#define SPTE_MMU_PAGE_REFCOUNTED        BIT_ULL(59)
> +
>  /*
>   * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
>   * the memslots generation and is derived as follows:
> @@ -345,6 +350,11 @@ static inline bool is_dirty_spte(u64 spte)
>  	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
>  }
>  
> +static inline bool is_refcounted_page_pte(u64 spte)
> +{
> +	return spte & SPTE_MMU_PAGE_REFCOUNTED;

Here also: if the bit is not supported, I think we need to assume that all SPTEs are refcounted.
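
I.e. something along these lines, where spte_refcount_bit_supported() is a made-up name
for whatever condition ends up gating the bit:

	static inline bool is_refcounted_page_pte(u64 spte)
	{
		if (!spte_refcount_bit_supported())
			return true;

		return spte & SPTE_MMU_PAGE_REFCOUNTED;
	}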

> +}
> +
>  static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
>  				int level)
>  {
> @@ -475,7 +485,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	       const struct kvm_memory_slot *slot,
>  	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
>  	       u64 old_spte, bool prefetch, bool can_unsync,
> -	       bool host_writable, u64 *new_spte);
> +	       bool host_writable, bool is_refcounted, u64 *new_spte);
>  u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
>  		      	      union kvm_mmu_page_role role, int index);
>  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 6c63f2d1675f..185f3c666c2b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -474,6 +474,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  	bool was_leaf = was_present && is_last_spte(old_spte, level);
>  	bool is_leaf = is_present && is_last_spte(new_spte, level);
>  	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> +	bool is_refcounted = is_refcounted_page_pte(old_spte);
>  
>  	WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
>  	WARN_ON_ONCE(level < PG_LEVEL_4K);
> @@ -538,9 +539,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  	if (is_leaf != was_leaf)
>  		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
>  
> -	if (was_leaf && is_dirty_spte(old_spte) &&
> +	if (was_leaf && is_dirty_spte(old_spte) && is_refcounted &&
>  	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
> -		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> +		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
>  
>  	/*
>  	 * Recursively handle child PTs if the change removed a subtree from
> @@ -552,9 +553,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
>  		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
>  
> -	if (was_leaf && is_accessed_spte(old_spte) &&
> +	if (was_leaf && is_accessed_spte(old_spte) && is_refcounted &&
>  	    (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
> -		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> +		kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
>  }
>  
>  /*
> @@ -988,8 +989,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
>  	else
>  		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
> -					 fault->pfn, iter->old_spte, fault->prefetch, true,
> -					 fault->map_writable, &new_spte);
> +				   fault->pfn, iter->old_spte, fault->prefetch, true,
> +				   fault->map_writable, fault->is_refcounted_page,
> +				   &new_spte);
>  
>  	if (new_spte == iter->old_spte)
>  		ret = RET_PF_SPURIOUS;
> @@ -1205,8 +1207,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
>  		 * Capture the dirty status of the page, so that it doesn't get
>  		 * lost when the SPTE is marked for access tracking.
>  		 */
> -		if (is_writable_pte(iter->old_spte))
> -			kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
> +		if (is_writable_pte(iter->old_spte) &&
> +		    is_refcounted_page_pte(iter->old_spte))
> +			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter->old_spte)));
>  
>  		new_spte = mark_spte_for_access_track(iter->old_spte);
>  		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
> @@ -1628,7 +1631,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
>  		trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
>  					       iter.old_spte,
>  					       iter.old_spte & ~dbit);
> -		kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
> +		if (is_refcounted_page_pte(iter.old_spte))
> +			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter.old_spte)));
>  	}
>  
>  	rcu_read_unlock();
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index b95c79b7833b..6696925f01f1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1179,6 +1179,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
>  void kvm_release_page_clean(struct page *page);
>  void kvm_release_page_dirty(struct page *page);
>  
> +void kvm_set_page_accessed(struct page *page);
> +void kvm_set_page_dirty(struct page *page);
> +
>  struct kvm_follow_pfn {
>  	const struct kvm_memory_slot *slot;
>  	gfn_t gfn;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 913de4e86d9d..4d8538cdb690 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2979,17 +2979,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
>  	return !PageReserved(page);
>  }
>  
> -static void kvm_set_page_dirty(struct page *page)
> +void kvm_set_page_dirty(struct page *page)
>  {
>  	if (kvm_is_ad_tracked_page(page))
>  		SetPageDirty(page);
>  }
> +EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
>  
> -static void kvm_set_page_accessed(struct page *page)
> +void kvm_set_page_accessed(struct page *page)
>  {
>  	if (kvm_is_ad_tracked_page(page))
>  		mark_page_accessed(page);
>  }
> +EXPORT_SYMBOL_GPL(kvm_set_page_accessed);
>  
>  void kvm_release_page_clean(struct page *page)
>  {


Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 5/6] KVM: x86: Migrate to __kvm_follow_pfn
  2023-10-03 16:54   ` Maxim Levitsky
@ 2023-10-03 20:58     ` Sean Christopherson
  0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2023-10-03 20:58 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: David Stevens, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On Tue, Oct 03, 2023, Maxim Levitsky wrote:
> On Mon, 2023-09-11 at 11:16 +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> > @@ -4283,12 +4290,20 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  			return RET_PF_EMULATE;
> >  	}
> >  
> > -	async = false;
> > -	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> > -					  fault->write, &fault->map_writable,
> > -					  &fault->hva);
> > -	if (!async)
> > -		return RET_PF_CONTINUE; /* *pfn has correct page already */
> > +	foll.flags |= FOLL_NOWAIT;
> > +	fault->pfn = __kvm_follow_pfn(&foll);
> > +
> > +	if (!is_error_noslot_pfn(fault->pfn))
> > +		goto success;
> Unrelated but I can't say I like the 'is_error_noslot_pfn()' name, 
> I wish it was called something like is_valid_pfn().

I don't love the name either, but is_valid_pfn() would be extremely confusing
because the kernel already provides pfn_valid() to identify pfns/pages that are
managed by the kernel.  Trying to shove "guest" somewhere in the name also gets
confusing because it's a host pfn, not a guest pfn.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 0/6] KVM: allow mapping non-refcounted pages
  2023-09-11  2:16 [PATCH v9 0/6] KVM: allow mapping non-refcounted pages David Stevens
                   ` (6 preceding siblings ...)
  2023-09-29  5:19 ` [PATCH v9 0/6] KVM: allow mapping " Christoph Hellwig
@ 2023-10-31  4:30 ` David Stevens
  2023-10-31 14:30   ` Sean Christopherson
  7 siblings, 1 reply; 42+ messages in thread
From: David Stevens @ 2023-10-31  4:30 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvmarm, linux-kernel, kvm

Sean, have you been waiting for a new patch series with responses to
Maxim's comments? I'm not really familiar with kernel contribution
etiquette, but I was hoping to get your feedback before spending the
time to put together another patch series.

-David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 0/6] KVM: allow mapping non-refcounted pages
  2023-10-31  4:30 ` David Stevens
@ 2023-10-31 14:30   ` Sean Christopherson
  2023-12-12  1:59     ` David Stevens
  0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2023-10-31 14:30 UTC (permalink / raw)
  To: David Stevens; +Cc: kvmarm, linux-kernel, kvm

On Tue, Oct 31, 2023, David Stevens wrote:
> Sean, have you been waiting for a new patch series with responses to
> Maxim's comments? I'm not really familiar with kernel contribution
> etiquette, but I was hoping to get your feedback before spending the
> time to put together another patch series.

No, I'm working my way back toward it.  The guest_memfd series took precedence
over everything that I wasn't confident would land in 6.7, i.e. larger series
effectively got put on the back burner.  Sorry :-(

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 0/6] KVM: allow mapping non-refcounted pages
  2023-10-31 14:30   ` Sean Christopherson
@ 2023-12-12  1:59     ` David Stevens
  2023-12-20  1:37       ` Sean Christopherson
  0 siblings, 1 reply; 42+ messages in thread
From: David Stevens @ 2023-12-12  1:59 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvmarm, linux-kernel, kvm

On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Oct 31, 2023, David Stevens wrote:
> > Sean, have you been waiting for a new patch series with responses to
> > Maxim's comments? I'm not really familiar with kernel contribution
> > etiquette, but I was hoping to get your feedback before spending the
> > time to put together another patch series.
>
> No, I'm working my way back toward it.  The guest_memfd series took precedence
> over everything that I wasn't confident would land in 6.7, i.e. larger series
> effectively got put on the back burner.  Sorry :-(

Is this series something that may be able to make it into 6.8 or 6.9?

-David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 0/6] KVM: allow mapping non-refcounted pages
  2023-12-12  1:59     ` David Stevens
@ 2023-12-20  1:37       ` Sean Christopherson
  2024-02-06  3:30         ` Sean Christopherson
  0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2023-12-20  1:37 UTC (permalink / raw)
  To: David Stevens; +Cc: kvmarm, linux-kernel, kvm

On Tue, Dec 12, 2023, David Stevens wrote:
> On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Oct 31, 2023, David Stevens wrote:
> > > Sean, have you been waiting for a new patch series with responses to
> > > Maxim's comments? I'm not really familiar with kernel contribution
> > > etiquette, but I was hoping to get your feedback before spending the
> > > time to put together another patch series.
> >
> > No, I'm working my way back toward it.  The guest_memfd series took precedence
> > over everything that I wasn't confident would land in 6.7, i.e. larger series
> > effectively got put on the back burner.  Sorry :-(
> 
> Is this series something that may be able to make it into 6.8 or 6.9?

6.8 isn't realistic.  Between LPC, vacation, and non-upstream stuff, I've done
frustratingly little code review since early November.  Sorry :-(

I haven't paged this series back into memory, so take this with a grain of salt,
but IIRC there was nothing that would block this from landing in 6.9.  Timing will
likely be tight though, especially for getting testing on all architectures.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function
  2023-10-03 16:54   ` Maxim Levitsky
@ 2024-02-06  1:25     ` Sean Christopherson
  0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2024-02-06  1:25 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: David Stevens, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On Tue, Oct 03, 2023, Maxim Levitsky wrote:
> On Mon, 2023-09-11 at 11:16 +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> > 
> > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> > __kvm_follow_pfn refactors the old API's arguments into a struct and,
> > where possible, combines the boolean arguments into a single flags
> > argument.
> > 
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > ---
> >  include/linux/kvm_host.h |  16 ++++
> >  virt/kvm/kvm_main.c      | 171 ++++++++++++++++++++++-----------------
> >  virt/kvm/kvm_mm.h        |   3 +-
> >  virt/kvm/pfncache.c      |  10 ++-
> >  4 files changed, 123 insertions(+), 77 deletions(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index fb6c6109fdca..c2e0ddf14dba 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -97,6 +97,7 @@
> >  #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
> >  #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
> >  #define KVM_PFN_ERR_SIGPENDING	(KVM_PFN_ERR_MASK + 3)
> > +#define KVM_PFN_ERR_NEEDS_IO	(KVM_PFN_ERR_MASK + 4)
> >  
> >  /*
> >   * error pfns indicate that the gfn is in slot but faild to
> > @@ -1177,6 +1178,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> >  void kvm_release_page_clean(struct page *page);
> >  void kvm_release_page_dirty(struct page *page);
> >  
> > +struct kvm_follow_pfn {
> > +	const struct kvm_memory_slot *slot;
> > +	gfn_t gfn;
> > +	unsigned int flags;
> It might be useful for the future readers to have a note about which values
> the flags can take.  (e.g one of the 'FOLL_* flags).

+1.  It doesn't have to (probably shouldn't?) say _which_ FOLL_* flags are supported
(I forget if there was going to be a restriction).  Just a comment explaining that
it's used to pass various FOLL_* flags.

> > +	bool atomic;
> 
> I wish we had FOLL_ATOMIC, because there is a slight usability regression in
> regard to the fact, that now some of the flags are here and in the 'atomic'
> variable.
> 
> 
> > +	/* Try to create a writable mapping even for a read fault */
> > +	bool try_map_writable;
> > +
> > +	/* Outputs of __kvm_follow_pfn */
> > +	hva_t hva;
> > +	bool writable;
> > +};
> 
> 
> Another small usability note. I feel like the name 'follow_pfn' is not the
> best name for this.
> 
> I think ultimately it comes from 'follow_pte()' and even that name IMHO is
> incorrect.  The 'follow_pte' should be called 'lookup_kernel_pte', because that
> is what it does - it finds a pointer to the pte of an hva in the process's
> kernel page tables.

Yeah, I 100% agree follow_pte() is a bad name (I suggested kvm_follow_pfn()), but
for me, this falls into the "if you can't beat 'em, join 'em" scenario.  It's kinda
like the XKCD comic about 14 standards; coming up with a new name because the
existing one sucks doesn't make the world any better; it's just one more
less-than-perfect name for developers to remember :-)

> IMHO, the kvm_follow_pfn struct should be called something like
> gfn_to_pfn_params, because it specifies how to convert a gfn to a pfn (or
> create the pfn if the page was swapped out).

Again, I don't disagree in a vacuum, but I want the name of the struct to be
tightly coupled to the API, e.g. so that it's super obvious where in KVM's flows
the struct is used, at the expense of making it less obvious how exactly said
flow uses the struct.

> Same note applies to '__kvm_follow_pfn()'
> 
> If you really want to keep the original name though, I won't argue over this.
> 
> > +
> > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> > +
> >  kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> >  kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> >  		      bool *writable);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ee6090ecb1fe..9b33a59c6d65 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2512,8 +2512,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> >   * true indicates success, otherwise false is returned.  It's also the
> >   * only part that runs if we can in atomic context.
> >   */
> > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > -			    bool *writable, kvm_pfn_t *pfn)
> > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> >  {
> >  	struct page *page[1];
> >  
> > @@ -2522,14 +2521,12 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> >  	 * or the caller allows to map a writable pfn for a read fault
> >  	 * request.
> >  	 */
> > -	if (!(write_fault || writable))
> > +	if (!((foll->flags & FOLL_WRITE) || foll->try_map_writable))
> >  		return false;
> 
> A small note: the 'foll' variable and the FOLL_* flags have different
> meanings: foll is a pointer to the new 'struct kvm_follow_pfn', while FOLL_*
> is from the folio API, I think.
> 
> Maybe we should rename 'foll' to something like 'args'?

Hmm, I was going for something similar to "struct kvm_page_fault *fault" (this
was another suggestion of mine).  I don't love args, mainly because the usage
isn't tied back to the struct name, e.g. deep in hva_to_pfn_remapped() and friends,
"args" starts to lose context/meaning.

Looking at this with fresh eyes, I still like "foll", though I agree it's far
from ideal.  Maybe an acronym?  "kfp" isn't used in the kernel, AFAICT.  I'd vote
for "foll" over "kfp", but I'm ok with either (or whatever, so long as the name
is tied back to the struct in some way, i.e. not super generic).

> > -	/* map read fault as writable if possible */
> > -	if (unlikely(!write_fault) && writable) {
> > +	if (foll->flags & FOLL_WRITE) {
> > +		foll->writable = true;
> > +	} else if (foll->try_map_writable) {
> >  		struct page *wpage;
> >  
> > -		if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> > -			*writable = true;
> > +		/* map read fault as writable if possible */
> > +		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> > +			foll->writable = true;
> >  			put_page(page);
> >  			page = wpage;
> 
> Regardless of this patch, I am wondering: what was the reason to first map the
> page in the same way as requested and only then try to map it as writable?
> 
> Since the vast majority of guest pages are likely to be writable, isn't
> it better to first opportunistically map them as writable and, if that fails,
> then try to map them read-only?

KVM actually does do that.  hva_to_pfn_fast() tries to map WRITABLE, and then
only falls back to the slow path if that fails.

As for why KVM doesn't "try" to fault in the hva as writable: that would break
CoW and probably KSM as well if KVM _asked_ for a writable mapping instead of
opportunistically seeing whether the primary MMU created a writable mapping.
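
Roughly (a simplified sketch of the hva_to_pfn() flow after this series, eliding
the slow path details and error handling):

	/* The fast path, when it runs at all, always gups with FOLL_WRITE. */
	if (hva_to_pfn_fast(foll, &pfn))
		return pfn;

	/*
	 * The slow path honors foll->flags, and then opportunistically
	 * retries as writable when foll->try_map_writable is set.
	 */
	npages = hva_to_pfn_slow(foll, &pfn);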

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns
  2023-10-03 16:54   ` Maxim Levitsky
@ 2024-02-06  2:54     ` Sean Christopherson
  0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2024-02-06  2:54 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: David Stevens, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On Tue, Oct 03, 2023, Maxim Levitsky wrote:
> On Mon, 2023-09-11 at 11:16 +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> > The fact that non-refcounted pfns can no longer be accessed without mmu
> > notifier protection is a breaking change. Since there is no timeline for
> > updating everything in KVM to use mmu notifiers, this change adds an
> > opt-in module parameter called allow_unsafe_mappings to allow such
> > mappings. Systems which trust userspace not to tear down such unsafe
> > mappings while KVM is using them can set this parameter to re-enable the
> > legacy behavior.
> 
> Do you have a practical example of a VM that can break with this change?
> E.g. will a normal VM break? Will a VM with VFIO devices break? Will a VM with
> hugepages mapped into it break?
> 
> Will the trick of limiting the kernel memory with 'mem=X' and then using the
> extra 'upper memory' for VMs still work?

This is the trick that will require an opt-in from the admin.  Anything where KVM
is effectively relying on userspace to pinky swear that the memory won't be
migrated, freed, etc.

It's unlikely, but theoretically possible that it might break existing setups for
"normal" VMs.  E.g. if a VM is using VM_PFNMAP'd memory for a nested VMCS.  But
such setups are already wildly broken; their users just don't know it.  The proposal
here is to require admins for such setups to opt-in to the "unsafe" behavior,
i.e. give backwards compatibility, but make the admin explicitly acknowledge that
what they are doing may have unwanted consequences.
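
Concretely, the gate ends up looking something like this (illustrative only; the
exact condition and where it lives in the series may differ):

	/*
	 * A pfn without a refcounted struct page is only usable if the
	 * caller is protected by mmu notifiers, or if the admin explicitly
	 * opted in to the legacy, unsafe behavior.
	 */
	if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
	    !allow_unsafe_mappings)
		return KVM_PFN_ERR_FAULT;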

> > +	/*
> > +	 * True if the returned pfn is for a page with a valid refcount. False
> > +	 * if the returned pfn has no struct page or if the struct page is not
> > +	 * being refcounted (e.g. tail pages of non-compound higher order
> > +	 * allocations from IO/PFNMAP mappings).
> > 
> Aren't all tail pages non-refcounted (e.g. the tail pages of a hugepage)?
> I haven't researched this topic yet.

Nope.  As Christoph stated, they are most definitely "weird" allocations though.
In this case, IIRC, it's the DRM's Translation Table Manager (TTM) code that
kmalloc's a large chunk of memory, and then stuffs the pfn into the page tables
courtesy of the vmf_insert_pfn_prot() in ttm_bo_vm_fault_reserved().

The head page has a non-zero refcount, but it's not really refcounted.  And the
tail pages have nothing, which, IIRC, results in KVM inadvertently causing pages
to be freed by putting the last references.

[*] https://lore.kernel.org/all/ZRZeaP7W5SuereMX@infradead.org


> > +	 *
> > +	 * When this output flag is false, callers should not try to convert
> > +	 * the pfn to a struct page.

This should explain what the flag tracks, not what callers should or shouldn't
do with the flag.  E.g. strictly speaking, there's no danger in grabbing the
corresponding "struct page" if the caller verifies it's a valid PFN.  Trying to
use the page outside of mmu_notifier protection is where things would get dicey.

> > +	 */
> > +	bool is_refcounted_page;
> >  };
> >  
> >  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 9b33a59c6d65..235c5cb3fdac 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -96,6 +96,10 @@ unsigned int halt_poll_ns_shrink;
> >  module_param(halt_poll_ns_shrink, uint, 0644);
> >  EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
> >  
> > +/* Allow non-struct page memory to be mapped without MMU notifier protection. */
> > +static bool allow_unsafe_mappings;
> > +module_param(allow_unsafe_mappings, bool, 0444);
> > +
> >  /*
> >   * Ordering of locks:
> >   *
> > @@ -2507,6 +2511,15 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> >  	return rc == -EHWPOISON;
> >  }
> >  
> > +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> > +					   struct page *page)
> > +{
> > +	kvm_pfn_t pfn = page_to_pfn(page);
> > +
> > +	foll->is_refcounted_page = true;
> > +	return pfn;
> > +}
> 
> Just a matter of taste, but to me this function looks confusing.
> IMHO, just duplicating these two lines of code is better.
> However, if you prefer it this way, I won't argue over this.

Hrm, when I suggested this, there was also a put_page() and a comment about hacky
code in there:

static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
                                          struct page *page)
{
       kvm_pfn_t pfn = page_to_pfn(page);

       foll->is_refcounted_page = true;

       /*
        * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
        * doesn't want to grab a reference, but gup() doesn't support getting
        * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
        * changes, drop this and simply don't pass FOLL_GET to gup().
        */
       if (!(foll->flags & FOLL_GET))
               put_page(page);

       return pfn;
}


More below.

> > -	/*
> > -	 * Get a reference here because callers of *hva_to_pfn* and
> > -	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
> > -	 * returned pfn.  This is only needed if the VMA has VM_MIXEDMAP
> > -	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
> > -	 * simply do nothing for reserved pfns.
> > -	 *
> > -	 * Whoever called remap_pfn_range is also going to call e.g.
> > -	 * unmap_mapping_range before the underlying pages are freed,
> > -	 * causing a call to our MMU notifier.
> > -	 *
> > -	 * Certain IO or PFNMAP mappings can be backed with valid
> > -	 * struct pages, but be allocated without refcounting e.g.,
> > -	 * tail pages of non-compound higher order allocations, which
> > -	 * would then underflow the refcount when the caller does the
> > -	 * required put_page. Don't allow those pages here.
> > -	 */
> 
> Why the comment is removed? as far as I see the code still grabs a reference
> to the page.

Because what the above comment says is effectively obsoleted by is_refcounted_page.
The old comment is essentially saying "grab a reference because KVM expects to
put a reference", whereas the new code grabs a reference because it's necessary
to honor allow_unsafe_mappings.

So I agree with David that the existing comment should go away, but I agree with
you in that kvm_follow_refcounted_pfn() really needs a comment, e.g. to explain
how KVM manages struct page refcounts.
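
E.g. a comment along these lines (a sketch, wording TBD):

	/*
	 * Every page returned with is_refcounted_page=true has had a
	 * reference acquired on the caller's behalf, either by gup() or by
	 * get_page_unless_zero(); the caller is responsible for dropping
	 * that reference (via kvm_release_page_*()) once it's done with the
	 * pfn and has updated accessed/dirty state.
	 */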

> > -	if (!kvm_try_get_pfn(pfn))
> > -		r = -EFAULT;
> > +	if (get_page_unless_zero(page))
> > +		WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
> 
> Once again, the kvm_follow_refcounted_pfn usage is confusing IMHO.  It sets
> the 'foll->is_refcounted_page', and yet someone can think that it's only
> there for the WARN_ON_ONCE.
> 
> That IMHO would read better:
> 
> if (get_page_unless_zero(page))
> 	foll->is_refcounted_page = true;
> 
> WARN_ON_ONCE(page_to_pfn(page) != pfn);
> 
> Note that I moved the warn out of the 'get_page_unless_zero' condition
> because I think that this condition should be true for non-refcounted pages
> as well.

Yeah, let me see if I can piece together what happened to the put_page() call.

> Also, I don't understand why the 'get_page_unless_zero(page)' result is ignored.
> As I understand it, it will increase the refcount of a page unless it is zero.
> 
> If a refcount of a refcounted page is 0 isn't that a bug?

Yes, the problem is that KVM doesn't know if a page is refcounted or not, without
actually trying to acquire a reference.  See the TTM mess mentioned above.

Note, Christoph is suggesting that KVM instead refuse to play nice and force the
TTM code to properly refcount things.  I very much like that idea in theory, but
I have no idea how feasible it is (that code is all kinds of twisty).

> The page was returned from kvm_pfn_to_refcounted_page(), which is supposed to
> only return pages that are refcounted.
> 
> I might not understand something in regard to the 'get_page_unless_zero(page)'
> usage in both the old and the new code.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-19  2:31     ` David Stevens
  2023-09-21 20:04       ` Dmitry Osipenko
@ 2024-02-06  3:02       ` Sean Christopherson
  1 sibling, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2024-02-06  3:02 UTC (permalink / raw)
  To: David Stevens
  Cc: Dmitry Osipenko, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On Tue, Sep 19, 2023, David Stevens wrote:
> On Mon, Sep 18, 2023 at 6:58 PM Dmitry Osipenko
> <dmitry.osipenko@collabora.com> wrote:
> >
> > On 9/11/23 05:16, David Stevens wrote:
> > > From: David Stevens <stevensd@chromium.org>
> > >
> > > Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host
> > > to map memory into the guest that is backed by non-refcounted struct
> > > pages - for example, the tail pages of higher order non-compound pages
> > > allocated by the amdgpu driver via ttm_pool_alloc_page.
> > >
> > > The bulk of this change is tracking the is_refcounted_page flag so that
> > > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > > done by storing the flag in an unused bit in the sptes. There are no
> > > bits available in PAE SPTEs, so non-refcounted pages can only be handled
> > > on TDP and x86-64.
> > >
> > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c          | 52 +++++++++++++++++++++++----------
> > >  arch/x86/kvm/mmu/mmu_internal.h |  1 +
> > >  arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++--
> > >  arch/x86/kvm/mmu/spte.c         |  4 ++-
> > >  arch/x86/kvm/mmu/spte.h         | 12 +++++++-
> > >  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++------
> > >  include/linux/kvm_host.h        |  3 ++
> > >  virt/kvm/kvm_main.c             |  6 ++--
> > >  8 files changed, 76 insertions(+), 32 deletions(-)
> >
> > Could you please tell which kernel tree you used for the base of this
> > series? This patch #6 doesn't apply cleanly to stable/mainline/next/kvm
> >
> > error: sha1 information is lacking or useless (arch/x86/kvm/mmu/mmu.c).
> > error: could not build fake ancestor
> 
> This series is based on the kvm next branch (i.e.
> https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next). The
> specific hash is d011151616e73de20c139580b73fa4c7042bd861.

Drat, found this too late.  Please use --base so that git appends the exact base
commit.   From Documentation/process/maintainer-kvm-x86.rst:

Git Base
~~~~~~~~
If you are using git version 2.9.0 or later (Googlers, this is all of you!),
use ``git format-patch`` with the ``--base`` flag to automatically include the
base tree information in the generated patches.

Note, ``--base=auto`` works as expected if and only if a branch's upstream is
set to the base topic branch, e.g. it will do the wrong thing if your upstream
is set to your personal repository for backup purposes.  An alternative "auto"
solution is to derive the names of your development branches based on their
KVM x86 topic, and feed that into ``--base``.  E.g. ``x86/pmu/my_branch_name``,
and then write a small wrapper to extract ``pmu`` from the current branch name
to yield ``--base=x/pmu``, where ``x`` is whatever name your repository uses to
track the KVM x86 remote.
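
For example (the output directory is made up, the patch count matches this
6-patch series, and this assumes the branch's upstream is set appropriately, per
the above):

  git format-patch --base=auto --cover-letter -v10 -o outgoing/ HEAD~6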

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function
  2023-09-11  2:16 ` [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function David Stevens
  2023-10-03 16:54   ` Maxim Levitsky
@ 2024-02-06  3:16   ` Sean Christopherson
  2024-02-13  3:27   ` Sean Christopherson
  2024-02-13  3:44   ` Sean Christopherson
  3 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2024-02-06  3:16 UTC (permalink / raw)
  To: David Stevens
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On Mon, Sep 11, 2023, David Stevens wrote:
> @@ -2681,24 +2668,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   * 2): @write_fault = false && @writable, @writable will tell the caller
>   *     whether the mapping is writable.
>   */
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> -		     bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
>  {
>  	struct vm_area_struct *vma;
>  	kvm_pfn_t pfn;
>  	int npages, r;
>  
>  	/* we can do it either atomically or asynchronously, not both */

Can you change this comment?  Not your fault, as it's already weird, but it ends
up being really confusing after the conversion to FOLL_NOWAIT, because not waiting
in atomic context seems completely sane.

> -	BUG_ON(atomic && async);
> +	BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));

Side topic, a BUG_ON() here is ridiculous overkill.  Can you add a patch somewhere
in the series to convert this to a WARN_ON_ONCE()?  The check is there purely to
guard against incorrect usage in KVM; the absolute worst-case scenario is that
KVM simply doesn't go down the slow path when it should and effectively DoS's the
guest.
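
E.g. (exact error handling is your call), something like:

	/* A WARN and a graceful-ish failure is more than enough here. */
	if (WARN_ON_ONCE(foll->atomic && (foll->flags & FOLL_NOWAIT)))
		return KVM_PFN_ERR_FAULT;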

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages
  2023-09-11  2:16 ` [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages David Stevens
                     ` (2 preceding siblings ...)
  2023-10-03 16:54   ` Maxim Levitsky
@ 2024-02-06  3:23   ` Sean Christopherson
  3 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2024-02-06  3:23 UTC (permalink / raw)
  To: David Stevens
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On Mon, Sep 11, 2023, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host
> to map memory into the guest that is backed by non-refcounted struct
> pages - for example, the tail pages of higher order non-compound pages
> allocated by the amdgpu driver via ttm_pool_alloc_page.
> 
> The bulk of this change is tracking the is_refcounted_page flag so that
> non-refcounted pages don't trigger page_count() == 0 warnings. This is
> done by storing the flag in an unused bit in the sptes. There are no
> bits available in PAE SPTEs, so non-refcounted pages can only be handled
> on TDP and x86-64.

Can you split this into two patches?  One to add all of the SPTE tracking, and
then one final patch to allow faulting in non-refcounted pages.  I want to isolate
the latter as much as possible, both for review purposes and in case something
goes awry and needs to be reverted.

> @@ -4254,13 +4265,18 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_memory_slot *slot = fault->slot;
> +	/*
> +	 * There are no extra bits for tracking non-refcounted pages in
> +	 * PAE SPTEs, so reject non-refcounted struct pages in that case.
> +	 */
> +	bool has_spte_refcount_bit = tdp_enabled && IS_ENABLED(CONFIG_X86_64);

Eh, just drop the local variable and do

		.allow_non_refcounted_struct_page = tdp_enabled &&
						    IS_ENABLED(CONFIG_X86_64);
(but keep the comment)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 0/6] KVM: allow mapping non-refcounted pages
  2023-10-02  6:25     ` Christoph Hellwig
@ 2024-02-06  3:29       ` Sean Christopherson
  0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2024-02-06  3:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Stevens, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm,
	linux-kernel, kvm

On Sun, Oct 01, 2023, Christoph Hellwig wrote:
> On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> > With the cleanups done, playing nice with non-refcounted pages instead of outright
> > rejecting them is a wash in terms of lines of code, complexity, and ongoing
> > maintenance cost.
> 
> I tend to strongly disagree with that, though.  We can't just let these
> non-refcounted pages spread everywhere and instead need to fix their
> usage.

Sorry for the horrifically slow reply.

Is there a middle ground somewhere between allowing this willy-nilly and tainting
the kernel?  I too would love to get the TTM stuff fixed up, but every time I look
at that code I am less and less confident that it will happen anytime soon.  It's
not even clear to me what all code needs to be touched.

In other words, is there a way we can unblock David and friends, while still
providing a forcing function of some kind to motivate/heckle the TTM (or whatever
is making the allocations) to change?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 0/6] KVM: allow mapping non-refcounted pages
  2023-12-20  1:37       ` Sean Christopherson
@ 2024-02-06  3:30         ` Sean Christopherson
  2024-02-13  3:39           ` Sean Christopherson
  0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2024-02-06  3:30 UTC (permalink / raw)
  To: David Stevens; +Cc: kvmarm, linux-kernel, kvm

On Tue, Dec 19, 2023, Sean Christopherson wrote:
> On Tue, Dec 12, 2023, David Stevens wrote:
> > On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Tue, Oct 31, 2023, David Stevens wrote:
> > > > Sean, have you been waiting for a new patch series with responses to
> > > > Maxim's comments? I'm not really familiar with kernel contribution
> > > > etiquette, but I was hoping to get your feedback before spending the
> > > > time to put together another patch series.
> > >
> > > No, I'm working my way back toward it.  The guest_memfd series took precedence
> > > over everything that I wasn't confident would land in 6.7, i.e. larger series
> > > effectively got put on the back burner.  Sorry :-(
> > 
> > Is this series something that may be able to make it into 6.8 or 6.9?
> 
> 6.8 isn't realistic.  Between LPC, vacation, and non-upstream stuff, I've done
> frustratingly little code review since early November.  Sorry :-(
> 
> I haven't paged this series back into memory, so take this with a grain of salt,
> but IIRC there was nothing that would block this from landing in 6.9.  Timing will
> likely be tight though, especially for getting testing on all architectures.

I did a quick-ish pass today.  If you can hold off on v10 until later this week,
I'll try to take a more in-depth look by EOD Thursday.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function
  2023-09-11  2:16 ` [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function David Stevens
  2023-10-03 16:54   ` Maxim Levitsky
  2024-02-06  3:16   ` Sean Christopherson
@ 2024-02-13  3:27   ` Sean Christopherson
  2024-02-13  3:44   ` Sean Christopherson
  3 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2024-02-13  3:27 UTC (permalink / raw)
  To: David Stevens
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On Mon, Sep 11, 2023, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.

Please use parentheses when referencing functions, i.e. __kvm_follow_pfn(),
__gfn_to_pfn_memslot(), etc.

> __kvm_follow_pfn refactors the old API's arguments into a struct and,
> where possible, combines the boolean arguments into a single flags
> argument.

This needs a more verbose changelog, but I honestly wouldn't mind doing that myself
as a way of refreshing all of this, and to lay out what the end state will
(hopefully) look like when all architectures are converted.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 0/6] KVM: allow mapping non-refcounted pages
  2024-02-06  3:30         ` Sean Christopherson
@ 2024-02-13  3:39           ` Sean Christopherson
  2024-02-21  6:05             ` David Stevens
  0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2024-02-13  3:39 UTC (permalink / raw)
  To: David Stevens; +Cc: kvmarm, linux-kernel, kvm

On Mon, Feb 05, 2024, Sean Christopherson wrote:
> On Tue, Dec 19, 2023, Sean Christopherson wrote:
> > On Tue, Dec 12, 2023, David Stevens wrote:
> > > On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Tue, Oct 31, 2023, David Stevens wrote:
> > > > > Sean, have you been waiting for a new patch series with responses to
> > > > > Maxim's comments? I'm not really familiar with kernel contribution
> > > > > etiquette, but I was hoping to get your feedback before spending the
> > > > > time to put together another patch series.
> > > >
> > > > No, I'm working my way back toward it.  The guest_memfd series took precedence
> > > > over everything that I wasn't confident would land in 6.7, i.e. larger series
> > > > effectively got put on the back burner.  Sorry :-(
> > > 
> > > Is this series something that may be able to make it into 6.8 or 6.9?
> > 
> > 6.8 isn't realistic.  Between LPC, vacation, and non-upstream stuff, I've done
> > frustratingly little code review since early November.  Sorry :-(
> > 
> > I haven't paged this series back into memory, so take this with a grain of salt,
> > but IIRC there was nothing that would block this from landing in 6.9.  Timing will
> > likely be tight though, especially for getting testing on all architectures.
> 
> I did a quick-ish pass today.  If you can hold off on v10 until later this week,
> I'll try to take a more in-depth look by EOD Thursday.

So I took a "deeper" look, but honestly it wasn't really any more in-depth than
the previous look.  I think I was somewhat surprised at the relatively small amount
of churn this ended up requiring.

Anywho, no major complaints.  This might be fodder for 6.9?  Maybe.  It'll be
tight.  At the very least though, I expect to shove v10 in a branch and start
beating on it in anticipation of landing it no later than 6.10.

One question though: what happened to the !FOLL_GET logic in kvm_follow_refcounted_pfn()?

In a previous version, I suggested:

  static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
                                             struct page *page)
  {
       kvm_pfn_t pfn = page_to_pfn(page);

       foll->is_refcounted_page = true;

       /*
        * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
        * doesn't want to grab a reference, but gup() doesn't support getting
        * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
        * changes, drop this and simply don't pass FOLL_GET to gup().
        */
       if (!(foll->flags & FOLL_GET))
               put_page(page);

       return pfn;
  }

but in v9 it's simply:

  static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
					     struct page *page)
  {
	kvm_pfn_t pfn = page_to_pfn(page);

	foll->is_refcounted_page = true;
	return pfn;
  }

And instead the x86 page fault handlers manually drop the reference.  Why is that?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns
  2023-09-11  2:16 ` [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns David Stevens
  2023-10-03 16:54   ` Maxim Levitsky
@ 2024-02-13  3:44   ` Sean Christopherson
  1 sibling, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2024-02-13  3:44 UTC (permalink / raw)
  To: David Stevens
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On Mon, Sep 11, 2023, David Stevens wrote:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c2e0ddf14dba..2ed08ae1a9be 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1185,10 +1185,31 @@ struct kvm_follow_pfn {
>  	bool atomic;
>  	/* Try to create a writable mapping even for a read fault */
>  	bool try_map_writable;
> +	/* Usage of the returned pfn will be guared by a mmu notifier. */
> +	bool guarded_by_mmu_notifier;
> +	/*
> +	 * When false, do not return pfns for non-refcounted struct pages.
> +	 *
> +	 * TODO: This allows callers to use kvm_release_pfn on the pfns
> +	 * returned by gfn_to_pfn without worrying about corrupting the
> +	 * refcounted of non-refcounted pages. Once all callers respect
> +	 * is_refcounted_page, this flag should be removed.
> +	 */
> +	bool allow_non_refcounted_struct_page;
>  
>  	/* Outputs of __kvm_follow_pfn */
>  	hva_t hva;
>  	bool writable;
> +	/*
> +	 * True if the returned pfn is for a page with a valid refcount. False
> +	 * if the returned pfn has no struct page or if the struct page is not
> +	 * being refcounted (e.g. tail pages of non-compound higher order
> +	 * allocations from IO/PFNMAP mappings).
> +	 *
> +	 * When this output flag is false, callers should not try to convert
> +	 * the pfn to a struct page.
> +	 */
> +	bool is_refcounted_page;

Idea.  Hopefully a good one.  Rather than tracking a bool, what if we track:

	struct page *refcounted_page;

and then make kvm_xxx_page_clean() wrappers around inner helpers that play nice
with NULL pages, e.g.

  static inline void kvm_release_page_clean(struct page *page)
  {
  	if (!page)
 		return;

  	__kvm_release_page_clean(page);
  }

Then callers of __kvm_follow_pfn() can do:

	kvm_release_page_clean(fault->refcounted_page);

instead of

 	if (fault->is_refcounted_page)
		kvm_release_page_clean(pfn_to_page(fault->pfn));

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function
  2023-09-11  2:16 ` [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function David Stevens
                     ` (2 preceding siblings ...)
  2024-02-13  3:27   ` Sean Christopherson
@ 2024-02-13  3:44   ` Sean Christopherson
  3 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2024-02-13  3:44 UTC (permalink / raw)
  To: David Stevens
  Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm, linux-kernel, kvm

On Mon, Sep 11, 2023, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.

Belated question: why is there no kvm_follow_pfn()?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v9 0/6] KVM: allow mapping non-refcounted pages
  2024-02-13  3:39           ` Sean Christopherson
@ 2024-02-21  6:05             ` David Stevens
  0 siblings, 0 replies; 42+ messages in thread
From: David Stevens @ 2024-02-21  6:05 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvmarm, linux-kernel, kvm

On Tue, Feb 13, 2024 at 12:39 PM Sean Christopherson <seanjc@google.com> wrote:
> On Mon, Feb 05, 2024, Sean Christopherson wrote:
> > On Tue, Dec 19, 2023, Sean Christopherson wrote:
> > > On Tue, Dec 12, 2023, David Stevens wrote:
> > > > On Tue, Oct 31, 2023 at 11:30 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > >
> > > > > On Tue, Oct 31, 2023, David Stevens wrote:
> > > > > > Sean, have you been waiting for a new patch series with responses to
> > > > > > Maxim's comments? I'm not really familiar with kernel contribution
> > > > > > etiquette, but I was hoping to get your feedback before spending the
> > > > > > time to put together another patch series.
> > > > >
> > > > > No, I'm working my way back toward it.  The guest_memfd series took precedence
> > > > > over everything that I wasn't confident would land in 6.7, i.e. larger series
> > > > > effectively got put on the back burner.  Sorry :-(
> > > >
> > > > Is this series something that may be able to make it into 6.8 or 6.9?
> > >
> > > 6.8 isn't realistic.  Between LPC, vacation, and non-upstream stuff, I've done
> > > frustratingly little code review since early November.  Sorry :-(
> > >
> > > I haven't paged this series back into memory, so take this with a grain of salt,
> > > but IIRC there was nothing that would block this from landing in 6.9.  Timing will
> > > likely be tight though, especially for getting testing on all architectures.
> >
> > I did a quick-ish pass today.  If you can hold off on v10 until later this week,
> > I'll try to take a more in-depth look by EOD Thursday.
>
> So I took a "deeper" look, but honestly it wasn't really any more in-depth than
> the previous look.  I think I was somewhat surprised at the relatively small amount
> of churn this ended up requiring.
>
> Anywho, no major complaints.  This might be fodder for 6.9?  Maybe.  It'll be
> tight.  At the very least though, I expect to shove v10 in a branch and start
> beating on it in anticipation of landing it no later than 6.10.
>
> One question though: what happened to the !FOLL_GET logic in kvm_follow_refcounted_pfn()?
>
> In a previous version, I suggested:
>
>   static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
>                                              struct page *page)
>   {
>        kvm_pfn_t pfn = page_to_pfn(page);
>
>        foll->is_refcounted_page = true;
>
>        /*
>         * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
>         * doesn't want to grab a reference, but gup() doesn't support getting
>         * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
>         * changes, drop this and simply don't pass FOLL_GET to gup().
>         */
>        if (!(foll->flags & FOLL_GET))
>                put_page(page);
>
>        return pfn;
>   }
>
> but in v9 it's simply:
>
>   static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
>                                              struct page *page)
>   {
>         kvm_pfn_t pfn = page_to_pfn(page);
>
>         foll->is_refcounted_page = true;
>         return pfn;
>   }
>
> And instead the x86 page fault handlers manually drop the reference.  Why is that?

I don't think FOLL_GET adds much to the API if is_refcounted_page is
present. At least right now, all of the callers need to pay attention
to is_refcounted_page so that they can update the access/dirty flags
properly. If they already have to do that anyway, then it's not any
harder for them to call put_page(). Not taking a reference also adds
one more footgun to the API, since the caller needs to make sure it
only accesses the page under an appropriate lock (v7 of this series
had that bug).
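
I.e. a migrated caller ends up doing something like this (illustrative, not a
verbatim snippet from the series; 'dirtied' is a stand-in for whatever state
the caller tracks):

	/* Only refcounted pages have a reference to update and drop. */
	if (foll->is_refcounted_page) {
		if (dirtied)
			kvm_release_page_dirty(pfn_to_page(pfn));
		else
			kvm_release_page_clean(pfn_to_page(pfn));
	}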

That said, I don't have particularly strong feelings one way or the
other, so I've added it back to v10 of the series.

-David

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2024-02-21  6:05 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-11  2:16 [PATCH v9 0/6] KVM: allow mapping non-refcounted pages David Stevens
2023-09-11  2:16 ` [PATCH v9 1/6] KVM: Assert that a page's refcount is elevated when marking accessed/dirty David Stevens
2023-09-11  2:16 ` [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function David Stevens
2023-10-03 16:54   ` Maxim Levitsky
2024-02-06  1:25     ` Sean Christopherson
2024-02-06  3:16   ` Sean Christopherson
2024-02-13  3:27   ` Sean Christopherson
2024-02-13  3:44   ` Sean Christopherson
2023-09-11  2:16 ` [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns David Stevens
2023-10-03 16:54   ` Maxim Levitsky
2024-02-06  2:54     ` Sean Christopherson
2024-02-13  3:44   ` Sean Christopherson
2023-09-11  2:16 ` [PATCH v9 4/6] KVM: Migrate kvm_vcpu_map to __kvm_follow_pfn David Stevens
2023-10-03 16:54   ` Maxim Levitsky
2023-09-11  2:16 ` [PATCH v9 5/6] KVM: x86: Migrate " David Stevens
2023-10-03 16:54   ` Maxim Levitsky
2023-10-03 20:58     ` Sean Christopherson
2023-09-11  2:16 ` [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages David Stevens
2023-09-18  9:53   ` Dmitry Osipenko
2023-09-19  2:25     ` David Stevens
2023-09-30 13:34       ` Dmitry Osipenko
2023-09-18  9:58   ` Dmitry Osipenko
2023-09-18 11:19     ` Dmitry Osipenko
2023-09-19  2:59       ` David Stevens
2023-09-21 20:06         ` Dmitry Osipenko
2023-09-30 13:34           ` Dmitry Osipenko
2023-09-19  2:31     ` David Stevens
2023-09-21 20:04       ` Dmitry Osipenko
2024-02-06  3:02       ` Sean Christopherson
2023-10-03 16:54   ` Maxim Levitsky
2024-02-06  3:23   ` Sean Christopherson
2023-09-29  5:19 ` [PATCH v9 0/6] KVM: allow mapping " Christoph Hellwig
2023-09-29 16:06   ` Sean Christopherson
2023-10-02  6:25     ` Christoph Hellwig
2024-02-06  3:29       ` Sean Christopherson
2023-10-31  4:30 ` David Stevens
2023-10-31 14:30   ` Sean Christopherson
2023-12-12  1:59     ` David Stevens
2023-12-20  1:37       ` Sean Christopherson
2024-02-06  3:30         ` Sean Christopherson
2024-02-13  3:39           ` Sean Christopherson
2024-02-21  6:05             ` David Stevens

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).