* [PATCH v7 0/8] KVM: allow mapping non-refcounted pages
From: David Stevens @ 2023-07-04  7:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Michael Ellerman, Peter Xu, linux-arm-kernel,
	kvmarm, linux-kernel, linuxppc-dev, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

This patch series adds support for mapping VM_IO and VM_PFNMAP memory
that is backed by struct pages that aren't currently being refcounted
(e.g. tail pages of non-compound higher order allocations) into the
guest.

Our use case is virtio-gpu blob resources [1], which directly map host
graphics buffers into the guest as "vram" for the virtio-gpu device.
This feature currently does not work on systems using the amdgpu driver,
as that driver allocates non-compound higher order pages via
ttm_pool_alloc_page.

First, this series replaces the __gfn_to_pfn_memslot API with a more
extensible __kvm_follow_pfn API. The updated API rearranges
__gfn_to_pfn_memslot's arguments into a struct and, where possible, packs
the bool arguments into a FOLL_ flags argument. The refactoring does not
change any behavior, except as noted in the PPC patch.

When first introduced by the refactoring, __kvm_follow_pfn implies
FOLL_GET to preserve existing behavior. From there, the API is extended to
support mapping non-refcounted pages by only taking a page reference when
the caller actually passes FOLL_GET.
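
For illustration only (not a verbatim call site from the series; "slot"
and "gfn" are placeholders), the intended end state looks roughly like
the sketch below, using the struct and flag handling introduced by the
later patches:

	struct kvm_follow_pfn foll = {
		.slot  = slot,
		.gfn   = gfn,
		/* Legacy callers keep taking a reference... */
		.flags = FOLL_WRITE | FOLL_GET,
	};
	kvm_pfn_t pfn = __kvm_follow_pfn(&foll);

	/*
	 * ...while callers whose mappings are protected by mmu_notifiers
	 * can drop FOLL_GET, which is what makes non-refcounted pages
	 * mappable.
	 */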

This series only adds support for non-refcounted pages to the x86 MMU.
Other MMUs can likely be updated without too much difficulty, but it is
not needed at this point. Updating other parts of KVM (e.g. pfncache) is
not straightforward [2].

[1] https://patchwork.kernel.org/project/dri-devel/cover/20200814024000.2485-1-gurchetansingh@chromium.org/
[2] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/

v6 -> v7:
 - Replace __gfn_to_pfn_memslot with a more flexible __kvm_follow_pfn,
   and extend that API to support non-refcounted pages.
v5 -> v6:
 - rebase on kvm next branch
 - rename gfn_to_pfn_page to gfn_to_pfn_noref
 - fix uninitialized outparam in error case of __kvm_faultin_pfn
 - add kvm_release_pfn_noref_clean for releasing pfn/page pair
v4 -> v5:
 - rebase on kvm next branch again
v3 -> v4:
 - rebase on kvm next branch again
 - Add some more context to a comment in ensure_pfn_ref
v2 -> v3:
 - rebase on kvm next branch
v1 -> v2:
 - Introduce new gfn_to_pfn_page functions instead of modifying the
   behavior of existing gfn_to_pfn functions, to make the change less
   invasive.
 - Drop changes to mmu_audit.c
 - Include Nicholas Piggin's patch to avoid corrupting the refcount in the
   follow_pte case, and use it in the deprecated gfn_to_pfn functions.
 - Rebase on kvm/next
David Stevens (7):
  KVM: Introduce __kvm_follow_pfn function
  KVM: Make __kvm_follow_pfn not imply FOLL_GET
  KVM: x86/mmu: Migrate to __kvm_follow_pfn
  KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  KVM: arm64: Migrate to __kvm_follow_pfn
  KVM: PPC: Migrate to __kvm_follow_pfn
  KVM: remove __gfn_to_pfn_memslot

Sean Christopherson (1):
  KVM: Assert that a page's refcount is elevated when marking
    accessed/dirty

 arch/arm64/kvm/mmu.c                   |  25 +--
 arch/powerpc/include/asm/kvm_book3s.h  |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  38 ++---
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  50 +++---
 arch/powerpc/kvm/book3s_hv_nested.c    |   4 +-
 arch/x86/kvm/mmu/mmu.c                 |  77 ++++++---
 arch/x86/kvm/mmu/mmu_internal.h        |   1 +
 arch/x86/kvm/mmu/paging_tmpl.h         |   9 +-
 arch/x86/kvm/mmu/spte.c                |   4 +-
 arch/x86/kvm/mmu/spte.h                |  12 +-
 arch/x86/kvm/mmu/tdp_mmu.c             |  22 +--
 include/linux/kvm_host.h               |  26 +++
 virt/kvm/kvm_main.c                    | 210 +++++++++++++------------
 virt/kvm/kvm_mm.h                      |   3 +-
 virt/kvm/pfncache.c                    |   8 +-
 15 files changed, 282 insertions(+), 209 deletions(-)

-- 
2.41.0.255.g8b1d071c50-goog


* [PATCH v7 1/8] KVM: Assert that a page's refcount is elevated when marking accessed/dirty
From: David Stevens @ 2023-07-04  7:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Michael Ellerman, Peter Xu, linux-arm-kernel,
	kvmarm, linux-kernel, linuxppc-dev, kvm

From: Sean Christopherson <seanjc@google.com>

Assert that a page's refcount is elevated, i.e. that _something_ holds a
reference to the page, when KVM marks a page as accessed and/or dirty.
KVM typically doesn't hold a reference to pages that are mapped into the
guest, e.g. to allow page migration, compaction, swap, etc., and instead
relies on mmu_notifiers to react to changes in the primary MMU.

Incorrect handling of mmu_notifier events (or similar mechanisms) can
result in KVM keeping a mapping beyond the lifetime of the backing page,
i.e. can (and often does) result in use-after-free.  Yelling if KVM marks
a freed page as accessed/dirty doesn't prevent badness as KVM usually
only does A/D updates when unmapping memory from the guest, i.e. the
assertion fires well after an underlying bug has occurred, but yelling
does help detect, triage, and debug use-after-free bugs.

Note, the assertion must use page_count(), NOT page_ref_count()!  For
hugepages, the returned struct page may be a tail page and thus not have
its own refcount.
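
Illustrative only, not part of the patch: the distinction matters because
page_count() resolves to the compound head's refcount, while
page_ref_count() reads the refcount of the specific (possibly tail) page.
A hypothetical helper expressing the check would be:

	/* Hypothetical helper, for illustration only. */
	static bool page_has_elevated_refcount(struct page *page)
	{
		/* page_count() == folio_ref_count(page_folio(page)) */
		return page_count(page) > 0;	/* NOT page_ref_count() */
	}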

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b838c8f71349..371bd783ff2b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2885,6 +2885,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
 
 static bool kvm_is_ad_tracked_page(struct page *page)
 {
+	/*
+	 * Assert that KVM isn't attempting to mark a freed page as Accessed or
+	 * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug.  KVM
+	 * (typically) doesn't pin pages that are mapped in KVM's MMU, and
+	 * instead relies on mmu_notifiers to know when a mapping needs to be
+	 * zapped/invalidated.  Unmapping from KVM's MMU must happen _before_
+	 * KVM returns from its mmu_notifier, i.e. the page should have an
+	 * elevated refcount at this point even though KVM doesn't hold a
+	 * reference of its own.
+	 */
+	if (WARN_ON_ONCE(!page_count(page)))
+		return false;
+
 	/*
 	 * Per page-flags.h, pages tagged PG_reserved "should in general not be
 	 * touched (e.g. set dirty) except by its owner".
-- 
2.41.0.255.g8b1d071c50-goog


* [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
From: David Stevens @ 2023-07-04  7:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Michael Ellerman, Peter Xu, linux-arm-kernel,
	kvmarm, linux-kernel, linuxppc-dev, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
__kvm_follow_pfn refactors the old API's arguments into a struct and,
where possible, combines the boolean arguments into a single flags
argument.
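
As a quick orientation (this simply restates the compatibility wrapper
that appears in the __gfn_to_pfn_memslot() hunk of the diff below), the
old boolean parameters map onto the new struct and FOLL_ flags roughly
as follows:

	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.atomic = atomic,			/* was "bool atomic" */
		.allow_write_mapping = !!writable,	/* was "writable != NULL" */
	};

	if (write_fault)
		foll.flags |= FOLL_WRITE;		/* was "bool write_fault" */
	if (async)
		foll.flags |= FOLL_NOWAIT;		/* was "async != NULL" */
	if (interruptible)
		foll.flags |= FOLL_INTERRUPTIBLE;	/* was "bool interruptible" */

	pfn = __kvm_follow_pfn(&foll);
	/* foll.hva and foll.writable replace the hva/writable out pointers. */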

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 include/linux/kvm_host.h |  16 ++++
 virt/kvm/kvm_main.c      | 171 ++++++++++++++++++++++-----------------
 virt/kvm/kvm_mm.h        |   3 +-
 virt/kvm/pfncache.c      |   8 +-
 4 files changed, 122 insertions(+), 76 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d3ac7720da9..ef2763c2b12e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -97,6 +97,7 @@
 #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
 #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
 #define KVM_PFN_ERR_SIGPENDING	(KVM_PFN_ERR_MASK + 3)
+#define KVM_PFN_ERR_NEEDS_IO	(KVM_PFN_ERR_MASK + 4)
 
 /*
  * error pfns indicate that the gfn is in slot but faild to
@@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 
+struct kvm_follow_pfn {
+	const struct kvm_memory_slot *slot;
+	gfn_t gfn;
+	unsigned int flags;
+	bool atomic;
+	/* Allow a read fault to create a writeable mapping. */
+	bool allow_write_mapping;
+
+	/* Outputs of __kvm_follow_pfn */
+	hva_t hva;
+	bool writable;
+};
+
+kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
+
 kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 371bd783ff2b..b13f22861d2f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
  * true indicates success, otherwise false is returned.  It's also the
  * only part that runs if we can in atomic context.
  */
-static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
-			    bool *writable, kvm_pfn_t *pfn)
+static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 {
 	struct page *page[1];
+	bool write_fault = foll->flags & FOLL_WRITE;
 
 	/*
 	 * Fast pin a writable pfn only if it is a write fault request
 	 * or the caller allows to map a writable pfn for a read fault
 	 * request.
 	 */
-	if (!(write_fault || writable))
+	if (!(write_fault || foll->allow_write_mapping))
 		return false;
 
-	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
 		*pfn = page_to_pfn(page[0]);
-
-		if (writable)
-			*writable = true;
+		foll->writable = foll->allow_write_mapping;
 		return true;
 	}
 
@@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
  * The slow path to get the pfn of the specified host virtual address,
  * 1 indicates success, -errno is returned if error is detected.
  */
-static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
-			   bool interruptible, bool *writable, kvm_pfn_t *pfn)
+static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 {
-	unsigned int flags = FOLL_HWPOISON;
+	unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
 	struct page *page;
 	int npages;
 
 	might_sleep();
 
-	if (writable)
-		*writable = write_fault;
-
-	if (write_fault)
-		flags |= FOLL_WRITE;
-	if (async)
-		flags |= FOLL_NOWAIT;
-	if (interruptible)
-		flags |= FOLL_INTERRUPTIBLE;
-
-	npages = get_user_pages_unlocked(addr, 1, &page, flags);
+	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
 	if (npages != 1)
 		return npages;
 
+	foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
+
 	/* map read fault as writable if possible */
-	if (unlikely(!write_fault) && writable) {
+	if (unlikely(!foll->writable) && foll->allow_write_mapping) {
 		struct page *wpage;
 
-		if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
-			*writable = true;
+		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
+			foll->writable = true;
 			put_page(page);
 			page = wpage;
 		}
@@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
 	return get_page_unless_zero(page);
 }
 
-static int hva_to_pfn_remapped(struct vm_area_struct *vma,
-			       unsigned long addr, bool write_fault,
-			       bool *writable, kvm_pfn_t *p_pfn)
+static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
+			       kvm_pfn_t *p_pfn)
 {
 	kvm_pfn_t pfn;
 	pte_t *ptep;
 	spinlock_t *ptl;
+	bool write_fault = foll->flags & FOLL_WRITE;
 	int r;
 
-	r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
+	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
 	if (r) {
 		/*
 		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
 		 * not call the fault handler, so do it here.
 		 */
 		bool unlocked = false;
-		r = fixup_user_fault(current->mm, addr,
+		r = fixup_user_fault(current->mm, foll->hva,
 				     (write_fault ? FAULT_FLAG_WRITE : 0),
 				     &unlocked);
 		if (unlocked)
@@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		if (r)
 			return r;
 
-		r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
+		r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
 		if (r)
 			return r;
 	}
@@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (writable)
-		*writable = pte_write(*ptep);
+	foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
 	pfn = pte_pfn(*ptep);
 
 	/*
@@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
  * 2): @write_fault = false && @writable, @writable will tell the caller
  *     whether the mapping is writable.
  */
-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
-		     bool *async, bool write_fault, bool *writable)
+kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
 {
 	struct vm_area_struct *vma;
 	kvm_pfn_t pfn;
 	int npages, r;
 
 	/* we can do it either atomically or asynchronously, not both */
-	BUG_ON(atomic && async);
+	BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
 
-	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
+	if (hva_to_pfn_fast(foll, &pfn))
 		return pfn;
 
-	if (atomic)
+	if (foll->atomic)
 		return KVM_PFN_ERR_FAULT;
 
-	npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
-				 writable, &pfn);
+	npages = hva_to_pfn_slow(foll, &pfn);
 	if (npages == 1)
 		return pfn;
 	if (npages == -EINTR)
@@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
 
 	mmap_read_lock(current->mm);
 	if (npages == -EHWPOISON ||
-	      (!async && check_user_page_hwpoison(addr))) {
+	      (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
 		pfn = KVM_PFN_ERR_HWPOISON;
 		goto exit;
 	}
 
 retry:
-	vma = vma_lookup(current->mm, addr);
+	vma = vma_lookup(current->mm, foll->hva);
 
 	if (vma == NULL)
 		pfn = KVM_PFN_ERR_FAULT;
 	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
-		r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
+		r = hva_to_pfn_remapped(vma, foll, &pfn);
 		if (r == -EAGAIN)
 			goto retry;
 		if (r < 0)
 			pfn = KVM_PFN_ERR_FAULT;
 	} else {
-		if (async && vma_is_valid(vma, write_fault))
-			*async = true;
-		pfn = KVM_PFN_ERR_FAULT;
+		if ((foll->flags & FOLL_NOWAIT) &&
+		    vma_is_valid(vma, foll->flags & FOLL_WRITE))
+			pfn = KVM_PFN_ERR_NEEDS_IO;
+		else
+			pfn = KVM_PFN_ERR_FAULT;
 	}
 exit:
 	mmap_read_unlock(current->mm);
 	return pfn;
 }
 
-kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
-			       bool atomic, bool interruptible, bool *async,
-			       bool write_fault, bool *writable, hva_t *hva)
+kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
 {
-	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
-
-	if (hva)
-		*hva = addr;
+	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
+				      foll->flags & FOLL_WRITE);
 
-	if (addr == KVM_HVA_ERR_RO_BAD) {
-		if (writable)
-			*writable = false;
+	if (foll->hva == KVM_HVA_ERR_RO_BAD)
 		return KVM_PFN_ERR_RO_FAULT;
-	}
 
-	if (kvm_is_error_hva(addr)) {
-		if (writable)
-			*writable = false;
+	if (kvm_is_error_hva(foll->hva))
 		return KVM_PFN_NOSLOT;
-	}
 
-	/* Do not map writable pfn in the readonly memslot. */
-	if (writable && memslot_is_readonly(slot)) {
-		*writable = false;
-		writable = NULL;
-	}
+	if (memslot_is_readonly(foll->slot))
+		foll->allow_write_mapping = false;
+
+	return hva_to_pfn(foll);
+}
+EXPORT_SYMBOL_GPL(__kvm_follow_pfn);
 
-	return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
-			  writable);
+kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
+			       bool atomic, bool interruptible, bool *async,
+			       bool write_fault, bool *writable, hva_t *hva)
+{
+	kvm_pfn_t pfn;
+	struct kvm_follow_pfn foll = {
+		.slot = slot,
+		.gfn = gfn,
+		.flags = 0,
+		.atomic = atomic,
+		.allow_write_mapping = !!writable,
+	};
+
+	if (write_fault)
+		foll.flags |= FOLL_WRITE;
+	if (async)
+		foll.flags |= FOLL_NOWAIT;
+	if (interruptible)
+		foll.flags |= FOLL_INTERRUPTIBLE;
+
+	pfn = __kvm_follow_pfn(&foll);
+	if (pfn == KVM_PFN_ERR_NEEDS_IO) {
+		*async = true;
+		pfn = KVM_PFN_ERR_FAULT;
+	}
+	if (hva)
+		*hva = foll.hva;
+	if (writable)
+		*writable = foll.writable;
+	return pfn;
 }
 EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
-	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
-				    NULL, write_fault, writable, NULL);
+	kvm_pfn_t pfn;
+	struct kvm_follow_pfn foll = {
+		.slot = gfn_to_memslot(kvm, gfn),
+		.gfn = gfn,
+		.flags = write_fault ? FOLL_WRITE : 0,
+		.allow_write_mapping = !!writable,
+	};
+	pfn = __kvm_follow_pfn(&foll);
+	if (writable)
+		*writable = foll.writable;
+	return pfn;
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
 kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
-				    NULL, NULL);
+	struct kvm_follow_pfn foll = {
+		.slot = slot,
+		.gfn = gfn,
+		.flags = FOLL_WRITE,
+	};
+	return __kvm_follow_pfn(&foll);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
 kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
-				    NULL, NULL);
+	struct kvm_follow_pfn foll = {
+		.slot = slot,
+		.gfn = gfn,
+		.flags = FOLL_WRITE,
+		.atomic = true,
+	};
+	return __kvm_follow_pfn(&foll);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 180f1a09e6ba..ed896aee5396 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -20,8 +20,7 @@
 #define KVM_MMU_UNLOCK(kvm)		spin_unlock(&(kvm)->mmu_lock)
 #endif /* KVM_HAVE_MMU_RWLOCK */
 
-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
-		     bool *async, bool write_fault, bool *writable);
+kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);
 
 #ifdef CONFIG_HAVE_KVM_PFNCACHE
 void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 2d6aba677830..e3fefa753a51 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 	kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
 	void *new_khva = NULL;
 	unsigned long mmu_seq;
+	struct kvm_follow_pfn foll = {
+		.slot = gpc->memslot,
+		.gfn = gpa_to_gfn(gpc->gpa),
+		.flags = FOLL_WRITE,
+		.hva = gpc->uhva,
+	};
 
 	lockdep_assert_held(&gpc->refresh_lock);
 
@@ -183,7 +189,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 		}
 
 		/* We always request a writeable mapping */
-		new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
+		new_pfn = hva_to_pfn(&foll);
 		if (is_error_noslot_pfn(new_pfn))
 			goto out_error;
 
-- 
2.41.0.255.g8b1d071c50-goog


* [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
  2023-07-04  7:50 ` David Stevens
  (?)
@ 2023-07-04  7:50   ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-04  7:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Michael Ellerman, Peter Xu, linux-arm-kernel,
	kvmarm, linux-kernel, linuxppc-dev, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
callers to resolve a gfn when the associated pfn has a valid struct page
that isn't being actively refcounted (e.g. tail pages of non-compound
higher order pages). For a caller to safely omit FOLL_GET, all usages of
the returned pfn must be guarded by a mmu notifier.

This also adds an is_refcounted_page out parameter to kvm_follow_pfn that
is set when the returned pfn has an associated struct page with a valid
refcount. Callers that don't pass FOLL_GET should remember this value
and use it to avoid calling functions like kvm_is_ad_tracked_page that
assume a non-zero refcount.
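
For illustration only, a caller that omits FOLL_GET might look roughly like
the sketch below. This is not part of the patch: the function name
example_follow_gfn is made up, and the memslot, the gfn, and the mmu notifier
protection of the resulting mapping are all assumed to be provided by the
caller's fault path.

static kvm_pfn_t example_follow_gfn(const struct kvm_memory_slot *slot, gfn_t gfn)
{
	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_WRITE,			/* note: no FOLL_GET */
		.guarded_by_mmu_notifier = true,	/* required when FOLL_GET is omitted */
	};
	kvm_pfn_t pfn = __kvm_follow_pfn(&foll);

	if (is_error_noslot_pfn(pfn))
		return pfn;

	/* Only touch the struct page if a live refcount actually backs the pfn. */
	if (foll.is_refcounted_page)
		kvm_set_page_accessed(pfn_to_page(pfn));

	return pfn;
}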

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 include/linux/kvm_host.h | 10 ++++++
 virt/kvm/kvm_main.c      | 67 +++++++++++++++++++++-------------------
 virt/kvm/pfncache.c      |  2 +-
 3 files changed, 47 insertions(+), 32 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ef2763c2b12e..a45308c7d2d9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 
+void kvm_set_page_accessed(struct page *page);
+void kvm_set_page_dirty(struct page *page);
+
 struct kvm_follow_pfn {
 	const struct kvm_memory_slot *slot;
 	gfn_t gfn;
@@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
 	bool atomic;
 	/* Allow a read fault to create a writeable mapping. */
 	bool allow_write_mapping;
+	/*
+	 * Usage of the returned pfn will be guarded by a mmu notifier. Must
+	 * be true if FOLL_GET is not set.
+	 */
+	bool guarded_by_mmu_notifier;
 
 	/* Outputs of __kvm_follow_pfn */
 	hva_t hva;
 	bool writable;
+	/* True if the returned pfn is for a page with a valid refcount. */
+	bool is_refcounted_page;
 };
 
 kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b13f22861d2f..0f7b41f220b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2502,6 +2502,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
 		*pfn = page_to_pfn(page[0]);
 		foll->writable = foll->allow_write_mapping;
+		foll->is_refcounted_page = true;
+		if (!(foll->flags & FOLL_GET))
+			put_page(page[0]);
 		return true;
 	}
 
@@ -2525,6 +2528,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 		return npages;
 
 	foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
+	foll->is_refcounted_page = true;
 
 	/* map read fault as writable if possible */
 	if (unlikely(!foll->writable) && foll->allow_write_mapping) {
@@ -2537,6 +2541,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 		}
 	}
 	*pfn = page_to_pfn(page);
+	if (!(foll->flags & FOLL_GET))
+		put_page(page);
 	return npages;
 }
 
@@ -2551,16 +2557,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
 	return true;
 }
 
-static int kvm_try_get_pfn(kvm_pfn_t pfn)
-{
-	struct page *page = kvm_pfn_to_refcounted_page(pfn);
-
-	if (!page)
-		return 1;
-
-	return get_page_unless_zero(page);
-}
-
 static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
 			       kvm_pfn_t *p_pfn)
 {
@@ -2568,6 +2564,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
 	pte_t *ptep;
 	spinlock_t *ptl;
 	bool write_fault = foll->flags & FOLL_WRITE;
+	struct page *page;
 	int r;
 
 	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
@@ -2599,24 +2596,27 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
 	pfn = pte_pfn(*ptep);
 
 	/*
-	 * Get a reference here because callers of *hva_to_pfn* and
-	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
-	 * returned pfn.  This is only needed if the VMA has VM_MIXEDMAP
-	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
-	 * simply do nothing for reserved pfns.
-	 *
-	 * Whoever called remap_pfn_range is also going to call e.g.
-	 * unmap_mapping_range before the underlying pages are freed,
-	 * causing a call to our MMU notifier.
+	 * Now deal with reference counting. If kvm_pfn_to_refcounted_page
+	 * returns NULL, then there's no refcount to worry about.
 	 *
-	 * Certain IO or PFNMAP mappings can be backed with valid
-	 * struct pages, but be allocated without refcounting e.g.,
-	 * tail pages of non-compound higher order allocations, which
-	 * would then underflow the refcount when the caller does the
-	 * required put_page. Don't allow those pages here.
+	 * Otherwise, certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages but be allocated without refcounting e.g., tail pages of
+	 * non-compound higher order allocations. If FOLL_GET is set and we
+	 * increment such a refcount, then when that pfn is eventually passed to
+	 * kvm_release_pfn_clean, its refcount would hit zero and be incorrectly
+	 * freed. Therefore don't allow those pages here when FOLL_GET is set.
 	 */ 
-	if (!kvm_try_get_pfn(pfn))
+	page = kvm_pfn_to_refcounted_page(pfn);
+	if (!page)
+		goto out;
+
+	if (get_page_unless_zero(page)) {
+		foll->is_refcounted_page = true;
+		if (!(foll->flags & FOLL_GET))
+			put_page(page);
+	} else if (foll->flags & FOLL_GET) {
 		r = -EFAULT;
+	}
 
 out:
 	pte_unmap_unlock(ptep, ptl);
@@ -2693,6 +2693,9 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
 
 kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
 {
+	if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll->guarded_by_mmu_notifier))
+		return KVM_PFN_ERR_FAULT;
+
 	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
 				      foll->flags & FOLL_WRITE);
 
@@ -2717,7 +2720,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 	struct kvm_follow_pfn foll = {
 		.slot = slot,
 		.gfn = gfn,
-		.flags = 0,
+		.flags = FOLL_GET,
 		.atomic = atomic,
 		.allow_write_mapping = !!writable,
 	};
@@ -2749,7 +2752,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 	struct kvm_follow_pfn foll = {
 		.slot = gfn_to_memslot(kvm, gfn),
 		.gfn = gfn,
-		.flags = write_fault ? FOLL_WRITE : 0,
+		.flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
 		.allow_write_mapping = !!writable,
 	};
 	pfn = __kvm_follow_pfn(&foll);
@@ -2764,7 +2767,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
 	struct kvm_follow_pfn foll = {
 		.slot = slot,
 		.gfn = gfn,
-		.flags = FOLL_WRITE,
+		.flags = FOLL_GET | FOLL_WRITE,
 	};
 	return __kvm_follow_pfn(&foll);
 }
@@ -2775,7 +2778,7 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
 	struct kvm_follow_pfn foll = {
 		.slot = slot,
 		.gfn = gfn,
-		.flags = FOLL_WRITE,
+		.flags = FOLL_GET | FOLL_WRITE,
 		.atomic = true,
 	};
 	return __kvm_follow_pfn(&foll);
@@ -2930,17 +2933,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
 	return !PageReserved(page);
 }
 
-static void kvm_set_page_dirty(struct page *page)
+void kvm_set_page_dirty(struct page *page)
 {
 	if (kvm_is_ad_tracked_page(page))
 		SetPageDirty(page);
 }
+EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
 
-static void kvm_set_page_accessed(struct page *page)
+void kvm_set_page_accessed(struct page *page)
 {
 	if (kvm_is_ad_tracked_page(page))
 		mark_page_accessed(page);
 }
+EXPORT_SYMBOL_GPL(kvm_set_page_accessed);
 
 void kvm_release_page_clean(struct page *page)
 {
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index e3fefa753a51..87caafce3dd0 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -147,7 +147,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 	struct kvm_follow_pfn foll = {
 		.slot = gpc->memslot,
 		.gfn = gpa_to_gfn(gpc->gpa),
-		.flags = FOLL_WRITE,
+		.flags = FOLL_WRITE | FOLL_GET,
 		.hva = gpc->uhva,
 	};
 
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH v7 4/8] KVM: x86/mmu: Migrate to __kvm_follow_pfn
  2023-07-04  7:50 ` David Stevens
  (?)
@ 2023-07-04  7:50   ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-04  7:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Michael Ellerman, Peter Xu, linux-arm-kernel,
	kvmarm, linux-kernel, linuxppc-dev, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn.
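
The conversion is mechanical: the boolean parameters of __gfn_to_pfn_memslot
become FOLL_* flags and fields of struct kvm_follow_pfn. As a rough sketch of
the mapping (fault-path variable names as in the hunk below):

	/* Before: */
	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
					  fault->write, &fault->map_writable,
					  &fault->hva);

	/* After: */
	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = fault->gfn,
		.flags = FOLL_GET | FOLL_NOWAIT | (fault->write ? FOLL_WRITE : 0),
		.allow_write_mapping = true,
	};
	fault->pfn = __kvm_follow_pfn(&foll);
	fault->hva = foll.hva;
	fault->map_writable = foll.writable;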

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/x86/kvm/mmu/mmu.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ec169f5c7dce..e44ab512c3a1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4296,7 +4296,12 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
-	bool async;
+	struct kvm_follow_pfn foll = {
+		.slot = slot,
+		.gfn = fault->gfn,
+		.flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
+		.allow_write_mapping = true,
+	};
 
 	/*
 	 * Retry the page fault if the gfn hit a memslot that is being deleted
@@ -4325,12 +4330,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			return RET_PF_EMULATE;
 	}
 
-	async = false;
-	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
-					  fault->write, &fault->map_writable,
-					  &fault->hva);
-	if (!async)
-		return RET_PF_CONTINUE; /* *pfn has correct page already */
+	foll.flags |= FOLL_NOWAIT;
+	fault->pfn = __kvm_follow_pfn(&foll);
+
+	if (!is_error_noslot_pfn(fault->pfn))
+		goto success;
+
+	if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
+		return RET_PF_CONTINUE;
 
 	if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
 		trace_kvm_try_async_get_page(fault->addr, fault->gfn);
@@ -4348,9 +4355,17 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 * to wait for IO.  Note, gup always bails if it is unable to quickly
 	 * get a page and a fatal signal, i.e. SIGKILL, is pending.
 	 */
-	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
-					  fault->write, &fault->map_writable,
-					  &fault->hva);
+	foll.flags |= FOLL_INTERRUPTIBLE;
+	foll.flags &= ~FOLL_NOWAIT;
+	fault->pfn = __kvm_follow_pfn(&foll);
+
+	if (!is_error_noslot_pfn(fault->pfn))
+		goto success;
+
+	return RET_PF_CONTINUE;
+success:
+	fault->hva = foll.hva;
+	fault->map_writable = foll.writable;
 	return RET_PF_CONTINUE;
 }
 
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-04  7:50 ` David Stevens
  (?)
@ 2023-07-04  7:50   ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-04  7:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Michael Ellerman, Peter Xu, linux-arm-kernel,
	kvmarm, linux-kernel, linuxppc-dev, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
memory into the guest that is backed by un-refcounted struct pages - for
example, higher order non-compound pages allocated by the amdgpu driver
via ttm_pool_alloc_page.

The bulk of this change is tracking the is_refcounted_page flag so that
non-refcounted pages don't trigger page_count() == 0 warnings. This is
done by storing the flag in an unused bit in the sptes.
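
In outline, the flag travels from the fault path into a software-available
SPTE bit and is consulted before any struct page operation. A condensed
sketch using the bit and helper names added in the hunks below (the
surrounding make_spte() and SPTE teardown context is assumed):

	/* make_spte(): record the flag in a 4K leaf SPTE. */
	if (level > PG_LEVEL_4K)
		spte |= PT_PAGE_SIZE_MASK;
	else if (is_refcounted)
		spte |= SPTE_MMU_PAGE_REFCOUNTED;	/* BIT_ULL(59) */

	/* SPTE teardown: only mark the page dirty if it is actually refcounted. */
	if (is_dirty_spte(old_spte) && is_refcounted_page_pte(old_spte))
		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));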

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/x86/kvm/mmu/mmu.c          | 44 +++++++++++++++++++++------------
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 arch/x86/kvm/mmu/paging_tmpl.h  |  9 ++++---
 arch/x86/kvm/mmu/spte.c         |  4 ++-
 arch/x86/kvm/mmu/spte.h         | 12 ++++++++-
 arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++++-------
 6 files changed, 62 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e44ab512c3a1..b1607e314497 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -553,12 +553,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 
 	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
 		flush = true;
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+		if (is_refcounted_page_pte(old_spte))
+			kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
 	}
 
 	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
 		flush = true;
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		if (is_refcounted_page_pte(old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
 	}
 
 	return flush;
@@ -596,14 +598,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
 	 * before they are reclaimed.  Sanity check that, if the pfn is backed
 	 * by a refcounted page, the refcount is elevated.
 	 */
-	page = kvm_pfn_to_refcounted_page(pfn);
-	WARN_ON(page && !page_count(page));
+	if (is_refcounted_page_pte(old_spte)) {
+		page = kvm_pfn_to_refcounted_page(pfn);
+		WARN_ON(!page || !page_count(page));
+	}
 
-	if (is_accessed_spte(old_spte))
-		kvm_set_pfn_accessed(pfn);
+	if (is_refcounted_page_pte(old_spte)) {
+		if (is_accessed_spte(old_spte))
+			kvm_set_page_accessed(pfn_to_page(pfn));
 
-	if (is_dirty_spte(old_spte))
-		kvm_set_pfn_dirty(pfn);
+		if (is_dirty_spte(old_spte))
+			kvm_set_page_dirty(pfn_to_page(pfn));
+	}
 
 	return old_spte;
 }
@@ -639,8 +645,8 @@ static bool mmu_spte_age(u64 *sptep)
 		 * Capture the dirty status of the page, so that it doesn't get
 		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (is_writable_pte(spte))
-			kvm_set_pfn_dirty(spte_to_pfn(spte));
+		if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));
 
 		spte = mark_spte_for_access_track(spte);
 		mmu_spte_update_no_track(sptep, spte);
@@ -1278,8 +1284,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
 {
 	bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
 					       (unsigned long *)sptep);
-	if (was_writable && !spte_ad_enabled(*sptep))
-		kvm_set_pfn_dirty(spte_to_pfn(*sptep));
+	if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));
 
 	return was_writable;
 }
@@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	bool host_writable = !fault || fault->map_writable;
 	bool prefetch = !fault || fault->prefetch;
 	bool write_fault = fault && fault->write;
+	bool is_refcounted = !fault || fault->is_refcounted_page;
 
 	pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
 		 *sptep, write_fault, gfn);
@@ -2969,7 +2976,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	}
 
 	wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
-			   true, host_writable, &spte);
+			   true, host_writable, is_refcounted, &spte);
 
 	if (*sptep == spte) {
 		ret = RET_PF_SPURIOUS;
@@ -4299,8 +4306,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	struct kvm_follow_pfn foll = {
 		.slot = slot,
 		.gfn = fault->gfn,
-		.flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
+		.flags = fault->write ? FOLL_WRITE : 0,
 		.allow_write_mapping = true,
+		.guarded_by_mmu_notifier = true,
 	};
 
 	/*
@@ -4317,6 +4325,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			fault->slot = NULL;
 			fault->pfn = KVM_PFN_NOSLOT;
 			fault->map_writable = false;
+			fault->is_refcounted_page = false;
 			return RET_PF_CONTINUE;
 		}
 		/*
@@ -4366,6 +4375,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 success:
 	fault->hva = foll.hva;
 	fault->map_writable = foll.writable;
+	fault->is_refcounted_page = foll.is_refcounted_page;
 	return RET_PF_CONTINUE;
 }
 
@@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	return r;
 }
 
@@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 
 out_unlock:
 	read_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	return r;
 }
 #endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d39af5639ce9..55790085884f 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -240,6 +240,7 @@ struct kvm_page_fault {
 	kvm_pfn_t pfn;
 	hva_t hva;
 	bool map_writable;
+	bool is_refcounted_page;
 
 	/*
 	 * Indicates the guest is trying to write a gfn that contains one or
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0662e0278e70..3284e7bd9619 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -829,7 +829,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	return r;
 }
 
@@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
  */
 static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
 {
-	bool host_writable;
+	bool host_writable, is_refcounted;
 	gpa_t first_pte_gpa;
 	u64 *sptep, spte;
 	struct kvm_memory_slot *slot;
@@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
 	sptep = &sp->spt[i];
 	spte = *sptep;
 	host_writable = spte & shadow_host_writable_mask;
+	// TODO: is this correct?
+	is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 	make_spte(vcpu, sp, slot, pte_access, gfn,
 		  spte_to_pfn(spte), spte, true, false,
-		  host_writable, &spte);
+		  host_writable, is_refcounted, &spte);
 
 	return mmu_spte_update(sptep, spte);
 }
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index cf2c6426a6fc..46c681dc45e6 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte)
+	       bool host_writable, bool is_refcounted, u64 *new_spte)
 {
 	int level = sp->role.level;
 	u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 
 	if (level > PG_LEVEL_4K)
 		spte |= PT_PAGE_SIZE_MASK;
+	else if (is_refcounted)
+		spte |= SPTE_MMU_PAGE_REFCOUNTED;
 
 	if (shadow_memtype_mask)
 		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1279db2eab44..be93dd061ae3 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -95,6 +95,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
 /* Defined only to keep the above static asserts readable. */
 #undef SHADOW_ACC_TRACK_SAVED_MASK
 
+/*
+ * Indicates that the SPTE refers to a page with a valid refcount.
+ */
+#define SPTE_MMU_PAGE_REFCOUNTED        BIT_ULL(59)
+
 /*
  * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
  * the memslots generation and is derived as follows:
@@ -332,6 +337,11 @@ static inline bool is_dirty_spte(u64 spte)
 	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static inline bool is_refcounted_page_pte(u64 spte)
+{
+	return spte & SPTE_MMU_PAGE_REFCOUNTED;
+}
+
 static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
 				int level)
 {
@@ -462,7 +472,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte);
+	       bool host_writable, bool is_refcounted, u64 *new_spte);
 u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
 		      	      union kvm_mmu_page_role role, int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 512163d52194..a9b1b14d2e26 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -474,6 +474,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	bool is_refcounted = is_refcounted_page_pte(old_spte);
 
 	WARN_ON(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON(level < PG_LEVEL_4K);
@@ -538,9 +539,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	if (is_leaf != was_leaf)
 		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
 
-	if (was_leaf && is_dirty_spte(old_spte) &&
+	if (was_leaf && is_dirty_spte(old_spte) && is_refcounted &&
 	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
 
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
@@ -552,9 +553,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
 
-	if (was_leaf && is_accessed_spte(old_spte) &&
+	if (was_leaf && is_accessed_spte(old_spte) && is_refcounted &&
 	    (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+		kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
 }
 
 /*
@@ -988,8 +989,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
 	else
 		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
-					 fault->pfn, iter->old_spte, fault->prefetch, true,
-					 fault->map_writable, &new_spte);
+				   fault->pfn, iter->old_spte, fault->prefetch, true,
+				   fault->map_writable, fault->is_refcounted_page,
+				   &new_spte);
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
@@ -1205,8 +1207,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 		 * Capture the dirty status of the page, so that it doesn't get
 		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (is_writable_pte(iter->old_spte))
-			kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
+		if (is_writable_pte(iter->old_spte) &&
+		    is_refcounted_page_pte(iter->old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter->old_spte)));
 
 		new_spte = mark_spte_for_access_track(iter->old_spte);
 		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
@@ -1626,7 +1629,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 		trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
 					       iter.old_spte,
 					       iter.old_spte & ~dbit);
-		kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
+		if (is_refcounted_page_pte(iter.old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter.old_spte)));
 	}
 
 	rcu_read_unlock();
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
@ 2023-07-04  7:50   ` David Stevens
  0 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-04  7:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Michael Ellerman, Peter Xu, linux-arm-kernel,
	kvmarm, linux-kernel, linuxppc-dev, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
memory into the guest that is backed by un-refcounted struct pages - for
example, higher order non-compound pages allocated by the amdgpu driver
via ttm_pool_alloc_page.

The bulk of this change is tracking the is_refcounted_page flag so that
non-refcounted pages don't trigger page_count() == 0 warnings. This is
done by storing the flag in an unused bit in the sptes.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/x86/kvm/mmu/mmu.c          | 44 +++++++++++++++++++++------------
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 arch/x86/kvm/mmu/paging_tmpl.h  |  9 ++++---
 arch/x86/kvm/mmu/spte.c         |  4 ++-
 arch/x86/kvm/mmu/spte.h         | 12 ++++++++-
 arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++++-------
 6 files changed, 62 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e44ab512c3a1..b1607e314497 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -553,12 +553,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 
 	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
 		flush = true;
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+		if (is_refcounted_page_pte(old_spte))
+			kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
 	}
 
 	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
 		flush = true;
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		if (is_refcounted_page_pte(old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
 	}
 
 	return flush;
@@ -596,14 +598,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
 	 * before they are reclaimed.  Sanity check that, if the pfn is backed
 	 * by a refcounted page, the refcount is elevated.
 	 */
-	page = kvm_pfn_to_refcounted_page(pfn);
-	WARN_ON(page && !page_count(page));
+	if (is_refcounted_page_pte(old_spte)) {
+		page = kvm_pfn_to_refcounted_page(pfn);
+		WARN_ON(!page || !page_count(page));
+	}
 
-	if (is_accessed_spte(old_spte))
-		kvm_set_pfn_accessed(pfn);
+	if (is_refcounted_page_pte(old_spte)) {
+		if (is_accessed_spte(old_spte))
+			kvm_set_page_accessed(pfn_to_page(pfn));
 
-	if (is_dirty_spte(old_spte))
-		kvm_set_pfn_dirty(pfn);
+		if (is_dirty_spte(old_spte))
+			kvm_set_page_dirty(pfn_to_page(pfn));
+	}
 
 	return old_spte;
 }
@@ -639,8 +645,8 @@ static bool mmu_spte_age(u64 *sptep)
 		 * Capture the dirty status of the page, so that it doesn't get
 		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (is_writable_pte(spte))
-			kvm_set_pfn_dirty(spte_to_pfn(spte));
+		if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));
 
 		spte = mark_spte_for_access_track(spte);
 		mmu_spte_update_no_track(sptep, spte);
@@ -1278,8 +1284,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
 {
 	bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
 					       (unsigned long *)sptep);
-	if (was_writable && !spte_ad_enabled(*sptep))
-		kvm_set_pfn_dirty(spte_to_pfn(*sptep));
+	if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));
 
 	return was_writable;
 }
@@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	bool host_writable = !fault || fault->map_writable;
 	bool prefetch = !fault || fault->prefetch;
 	bool write_fault = fault && fault->write;
+	bool is_refcounted = !fault || fault->is_refcounted_page;
 
 	pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
 		 *sptep, write_fault, gfn);
@@ -2969,7 +2976,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	}
 
 	wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
-			   true, host_writable, &spte);
+			   true, host_writable, is_refcounted, &spte);
 
 	if (*sptep == spte) {
 		ret = RET_PF_SPURIOUS;
@@ -4299,8 +4306,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	struct kvm_follow_pfn foll = {
 		.slot = slot,
 		.gfn = fault->gfn,
-		.flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
+		.flags = fault->write ? FOLL_WRITE : 0,
 		.allow_write_mapping = true,
+		.guarded_by_mmu_notifier = true,
 	};
 
 	/*
@@ -4317,6 +4325,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			fault->slot = NULL;
 			fault->pfn = KVM_PFN_NOSLOT;
 			fault->map_writable = false;
+			fault->is_refcounted_page = false;
 			return RET_PF_CONTINUE;
 		}
 		/*
@@ -4366,6 +4375,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 success:
 	fault->hva = foll.hva;
 	fault->map_writable = foll.writable;
+	fault->is_refcounted_page = foll.is_refcounted_page;
 	return RET_PF_CONTINUE;
 }
 
@@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	return r;
 }
 
@@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 
 out_unlock:
 	read_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	return r;
 }
 #endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d39af5639ce9..55790085884f 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -240,6 +240,7 @@ struct kvm_page_fault {
 	kvm_pfn_t pfn;
 	hva_t hva;
 	bool map_writable;
+	bool is_refcounted_page;
 
 	/*
 	 * Indicates the guest is trying to write a gfn that contains one or
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0662e0278e70..3284e7bd9619 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -829,7 +829,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	return r;
 }
 
@@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
  */
 static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
 {
-	bool host_writable;
+	bool host_writable, is_refcounted;
 	gpa_t first_pte_gpa;
 	u64 *sptep, spte;
 	struct kvm_memory_slot *slot;
@@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
 	sptep = &sp->spt[i];
 	spte = *sptep;
 	host_writable = spte & shadow_host_writable_mask;
+	// TODO: is this correct?
+	is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 	make_spte(vcpu, sp, slot, pte_access, gfn,
 		  spte_to_pfn(spte), spte, true, false,
-		  host_writable, &spte);
+		  host_writable, is_refcounted, &spte);
 
 	return mmu_spte_update(sptep, spte);
 }
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index cf2c6426a6fc..46c681dc45e6 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte)
+	       bool host_writable, bool is_refcounted, u64 *new_spte)
 {
 	int level = sp->role.level;
 	u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 
 	if (level > PG_LEVEL_4K)
 		spte |= PT_PAGE_SIZE_MASK;
+	else if (is_refcounted)
+		spte |= SPTE_MMU_PAGE_REFCOUNTED;
 
 	if (shadow_memtype_mask)
 		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1279db2eab44..be93dd061ae3 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -95,6 +95,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
 /* Defined only to keep the above static asserts readable. */
 #undef SHADOW_ACC_TRACK_SAVED_MASK
 
+/*
+ * Indicates that the SPTE refers to a page with a valid refcount.
+ */
+#define SPTE_MMU_PAGE_REFCOUNTED        BIT_ULL(59)
+
 /*
  * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
  * the memslots generation and is derived as follows:
@@ -332,6 +337,11 @@ static inline bool is_dirty_spte(u64 spte)
 	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static inline bool is_refcounted_page_pte(u64 spte)
+{
+	return spte & SPTE_MMU_PAGE_REFCOUNTED;
+}
+
 static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
 				int level)
 {
@@ -462,7 +472,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte);
+	       bool host_writable, bool is_refcounted, u64 *new_spte);
 u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
 		      	      union kvm_mmu_page_role role, int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 512163d52194..a9b1b14d2e26 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -474,6 +474,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	bool is_refcounted = is_refcounted_page_pte(old_spte);
 
 	WARN_ON(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON(level < PG_LEVEL_4K);
@@ -538,9 +539,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	if (is_leaf != was_leaf)
 		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
 
-	if (was_leaf && is_dirty_spte(old_spte) &&
+	if (was_leaf && is_dirty_spte(old_spte) && is_refcounted &&
 	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
 
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
@@ -552,9 +553,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
 
-	if (was_leaf && is_accessed_spte(old_spte) &&
+	if (was_leaf && is_accessed_spte(old_spte) && is_refcounted &&
 	    (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+		kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
 }
 
 /*
@@ -988,8 +989,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
 	else
 		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
-					 fault->pfn, iter->old_spte, fault->prefetch, true,
-					 fault->map_writable, &new_spte);
+				   fault->pfn, iter->old_spte, fault->prefetch, true,
+				   fault->map_writable, fault->is_refcounted_page,
+				   &new_spte);
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
@@ -1205,8 +1207,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 		 * Capture the dirty status of the page, so that it doesn't get
 		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (is_writable_pte(iter->old_spte))
-			kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
+		if (is_writable_pte(iter->old_spte) &&
+		    is_refcounted_page_pte(iter->old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter->old_spte)));
 
 		new_spte = mark_spte_for_access_track(iter->old_spte);
 		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
@@ -1626,7 +1629,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 		trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
 					       iter.old_spte,
 					       iter.old_spte & ~dbit);
-		kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
+		if (is_refcounted_page_pte(iter.old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter.old_spte)));
 	}
 
 	rcu_read_unlock();
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread
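
A short illustration of the SPTE_MMU_PAGE_REFCOUNTED tracking used in the
patch above: a minimal, self-contained userspace sketch in plain C. The bit
value and the is_refcounted_page_pte() helper mirror the names in the diff;
the PFN layout, the make_spte_sketch() shape and everything else are
simplified assumptions for illustration only, not KVM code.

/*
 * Sketch: stash a software "this pfn has a usable struct page refcount"
 * flag in a spare bit of a 64-bit SPTE-style value, and consult it before
 * doing any accessed/dirty page bookkeeping.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BIT_ULL(n)                 (1ULL << (n))
#define SPTE_MMU_PAGE_REFCOUNTED   BIT_ULL(59)	/* spare software bit, as in the patch */
#define SPTE_PFN_SHIFT             12		/* assumed 4K pages */

static uint64_t make_spte_sketch(uint64_t pfn, bool is_refcounted)
{
	uint64_t spte = pfn << SPTE_PFN_SHIFT;

	if (is_refcounted)
		spte |= SPTE_MMU_PAGE_REFCOUNTED;
	return spte;
}

static bool is_refcounted_page_pte(uint64_t spte)
{
	return spte & SPTE_MMU_PAGE_REFCOUNTED;
}

int main(void)
{
	uint64_t refcounted = make_spte_sketch(0x1234, true);
	uint64_t raw = make_spte_sketch(0x5678, false);

	/* Only refcounted mappings get the accessed/dirty page updates. */
	printf("0x%llx refcounted? %d\n",
	       (unsigned long long)refcounted, is_refcounted_page_pte(refcounted));
	printf("0x%llx refcounted? %d\n",
	       (unsigned long long)raw, is_refcounted_page_pte(raw));
	return 0;
}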

* [PATCH v7 6/8] KVM: arm64: Migrate to __kvm_follow_pfn
  2023-07-04  7:50 ` David Stevens
  (?)
@ 2023-07-04  7:50   ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-04  7:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Michael Ellerman, Peter Xu, linux-arm-kernel,
	kvmarm, linux-kernel, linuxppc-dev, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/arm64/kvm/mmu.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6db9ef288ec3..c706530d304d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1334,7 +1334,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  unsigned long fault_status)
 {
 	int ret = 0;
-	bool write_fault, writable, force_pte = false;
+	bool write_fault = kvm_is_write_fault(vcpu);
+	bool force_pte = false;
 	bool exec_fault, mte_allowed;
 	bool device = false;
 	unsigned long mmu_seq;
@@ -1342,16 +1343,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
 	struct vm_area_struct *vma;
 	short vma_shift;
-	gfn_t gfn;
 	kvm_pfn_t pfn;
 	bool logging_active = memslot_is_logging(memslot);
 	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
 	long vma_pagesize, fault_granule;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
+	struct kvm_follow_pfn foll = {
+		.slot = memslot,
+		.flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
+		.allow_write_mapping = true,
+	};
 
 	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
-	write_fault = kvm_is_write_fault(vcpu);
 	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
 	VM_BUG_ON(write_fault && exec_fault);
 
@@ -1425,7 +1429,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
 		fault_ipa &= ~(vma_pagesize - 1);
 
-	gfn = fault_ipa >> PAGE_SHIFT;
+	foll.gfn = fault_ipa >> PAGE_SHIFT;
 	mte_allowed = kvm_vma_mte_allowed(vma);
 
 	/* Don't use the VMA after the unlock -- it may have vanished */
@@ -1433,7 +1437,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	/*
 	 * Read mmu_invalidate_seq so that KVM can detect if the results of
-	 * vma_lookup() or __gfn_to_pfn_memslot() become stale prior to
+	 * vma_lookup() or __kvm_follow_pfn() become stale prior to
 	 * acquiring kvm->mmu_lock.
 	 *
 	 * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
@@ -1442,8 +1446,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	mmap_read_unlock(current->mm);
 
-	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-				   write_fault, &writable, NULL);
+	pfn = __kvm_follow_pfn(&foll);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
@@ -1468,7 +1471,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		 * Only actually map the page as writable if this was a write
 		 * fault.
 		 */
-		writable = false;
+		foll.writable = false;
 	}
 
 	if (exec_fault && device)
@@ -1508,7 +1511,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		}
 	}
 
-	if (writable)
+	if (foll.writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
 	if (exec_fault)
@@ -1534,9 +1537,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 					     KVM_PGTABLE_WALK_SHARED);
 
 	/* Mark the page dirty only if the fault is handled successfully */
-	if (writable && !ret) {
+	if (foll.writable && !ret) {
 		kvm_set_pfn_dirty(pfn);
-		mark_page_dirty_in_slot(kvm, memslot, gfn);
+		mark_page_dirty_in_slot(kvm, memslot, foll.gfn);
 	}
 
 out_unlock:
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread
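
The patch above drops the long __gfn_to_pfn_memslot() argument list in
favour of a single kvm_follow_pfn request whose inputs (slot, gfn,
FOLL_-style flags, allow_write_mapping) and outputs (hva, writable) travel
together. Below is a minimal, self-contained userspace sketch of that
calling convention; the field names follow the series, but the struct
layout, the FOLL_WRITE value and the lookup body are stand-ins for
illustration, not the kernel implementation.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FOLL_WRITE	0x01	/* stand-in: request a writable mapping */

struct follow_pfn_sketch {
	/* inputs */
	uint64_t gfn;
	unsigned int flags;
	bool allow_write_mapping;
	/* outputs, filled by the lookup */
	uint64_t hva;
	bool writable;
};

/* Stand-in for __kvm_follow_pfn(): resolve gfn -> pfn, report writability. */
static uint64_t follow_pfn_sketch(struct follow_pfn_sketch *foll)
{
	foll->hva = 0x7f0000000000ULL + (foll->gfn << 12);	/* fake hva */
	foll->writable = foll->allow_write_mapping &&
			 (foll->flags & FOLL_WRITE);
	return foll->gfn + 0x100000;				/* fake pfn */
}

int main(void)
{
	bool write_fault = true;
	struct follow_pfn_sketch foll = {
		.gfn = 0x42,
		.flags = write_fault ? FOLL_WRITE : 0,
		.allow_write_mapping = true,
	};
	uint64_t pfn = follow_pfn_sketch(&foll);

	/* The caller reads foll.writable/foll.hva instead of extra out-params. */
	printf("pfn=0x%llx hva=0x%llx writable=%d\n",
	       (unsigned long long)pfn,
	       (unsigned long long)foll.hva, foll.writable);
	return 0;
}

Packing inputs and outputs into one struct means new capabilities (such as
the guarded_by_mmu_notifier flag used in the x86 patch) can be added without
touching every caller.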

* [PATCH v7 7/8] KVM: PPC: Migrate to __kvm_follow_pfn
  2023-07-04  7:50 ` David Stevens
  (?)
@ 2023-07-04  7:50   ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-04  7:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Michael Ellerman, Peter Xu, linux-arm-kernel,
	kvmarm, linux-kernel, linuxppc-dev, kvm, David Stevens

From: David Stevens <stevensd@chromium.org>

Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn. As part of the
refactoring, remove the redundant calls to get_user_page_fast_only,
since the check for !async && !atomic was removed from the KVM generic
code in b9b33da2aa74. Also, remove the kvm_ro parameter because the KVM
generic code handles RO memslots.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
I have checked that this patch compiles, but I don't have the hardware
to test it myself.

 arch/powerpc/include/asm/kvm_book3s.h  |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c    | 38 +++++++++-----------
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 50 +++++++++++---------------
 arch/powerpc/kvm/book3s_hv_nested.c    |  4 +--
 4 files changed, 38 insertions(+), 56 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index bbf5e2c5fe09..bf48c511e700 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -202,7 +202,7 @@ extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool nested,
 extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 				unsigned long gpa,
 				struct kvm_memory_slot *memslot,
-				bool writing, bool kvm_ro,
+				bool writing,
 				pte_t *inserted_pte, unsigned int *levelp);
 extern int kvmppc_init_vm_radix(struct kvm *kvm);
 extern void kvmppc_free_radix(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 7f765d5ad436..9a4715e73937 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -523,6 +523,9 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
 	unsigned long rcbits;
 	long mmio_update;
 	pte_t pte, *ptep;
+	struct kvm_follow_pfn foll = {
+		.allow_write_mapping = true,
+	};
 
 	if (kvm_is_radix(kvm))
 		return kvmppc_book3s_radix_page_fault(vcpu, ea, dsisr);
@@ -599,29 +602,20 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
 	page = NULL;
 	writing = (dsisr & DSISR_ISSTORE) != 0;
 	/* If writing != 0, then the HPTE must allow writing, if we get here */
-	write_ok = writing;
-	hva = gfn_to_hva_memslot(memslot, gfn);
 
-	/*
-	 * Do a fast check first, since __gfn_to_pfn_memslot doesn't
-	 * do it with !atomic && !async, which is how we call it.
-	 * We always ask for write permission since the common case
-	 * is that the page is writable.
-	 */
-	if (get_user_page_fast_only(hva, FOLL_WRITE, &page)) {
-		write_ok = true;
-	} else {
-		/* Call KVM generic code to do the slow-path check */
-		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-					   writing, &write_ok, NULL);
-		if (is_error_noslot_pfn(pfn))
-			return -EFAULT;
-		page = NULL;
-		if (pfn_valid(pfn)) {
-			page = pfn_to_page(pfn);
-			if (PageReserved(page))
-				page = NULL;
-		}
+	foll.slot = memslot;
+	foll.gfn = gfn;
+	foll.flags = FOLL_GET | (writing ? FOLL_WRITE : 0);
+	pfn = __kvm_follow_pfn(&foll);
+	if (is_error_noslot_pfn(pfn))
+		return -EFAULT;
+	page = NULL;
+	write_ok = foll.writable;
+	hva = foll.hva;
+	if (pfn_valid(pfn)) {
+		page = pfn_to_page(pfn);
+		if (PageReserved(page))
+			page = NULL;
 	}
 
 	/*
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 461307b89c3a..339d1efcb6c9 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -815,47 +815,39 @@ bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool nested, bool writing,
 int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 				   unsigned long gpa,
 				   struct kvm_memory_slot *memslot,
-				   bool writing, bool kvm_ro,
+				   bool writing,
 				   pte_t *inserted_pte, unsigned int *levelp)
 {
 	struct kvm *kvm = vcpu->kvm;
 	struct page *page = NULL;
 	unsigned long mmu_seq;
-	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
-	bool upgrade_write = false;
-	bool *upgrade_p = &upgrade_write;
+	unsigned long hva, pfn, gfn = gpa >> PAGE_SHIFT;
+	bool upgrade_write;
 	pte_t pte, *ptep;
 	unsigned int shift, level;
 	int ret;
 	bool large_enable;
+	struct kvm_follow_pfn foll = {
+		.slot = memslot,
+		.gfn = gfn,
+		.flags = FOLL_GET | (writing ? FOLL_WRITE : 0),
+		.allow_write_mapping = true,
+	};
 
 	/* used to check for invalidations in progress */
 	mmu_seq = kvm->mmu_invalidate_seq;
 	smp_rmb();
 
-	/*
-	 * Do a fast check first, since __gfn_to_pfn_memslot doesn't
-	 * do it with !atomic && !async, which is how we call it.
-	 * We always ask for write permission since the common case
-	 * is that the page is writable.
-	 */
-	hva = gfn_to_hva_memslot(memslot, gfn);
-	if (!kvm_ro && get_user_page_fast_only(hva, FOLL_WRITE, &page)) {
-		upgrade_write = true;
-	} else {
-		unsigned long pfn;
-
-		/* Call KVM generic code to do the slow-path check */
-		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-					   writing, upgrade_p, NULL);
-		if (is_error_noslot_pfn(pfn))
-			return -EFAULT;
-		page = NULL;
-		if (pfn_valid(pfn)) {
-			page = pfn_to_page(pfn);
-			if (PageReserved(page))
-				page = NULL;
-		}
+	pfn = __kvm_follow_pfn(&foll);
+	if (is_error_noslot_pfn(pfn))
+		return -EFAULT;
+	page = NULL;
+	hva = foll.hva;
+	upgrade_write = foll.writable;
+	if (pfn_valid(pfn)) {
+		page = pfn_to_page(pfn);
+		if (PageReserved(page))
+			page = NULL;
 	}
 
 	/*
@@ -944,7 +936,6 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,
 	struct kvm_memory_slot *memslot;
 	long ret;
 	bool writing = !!(dsisr & DSISR_ISSTORE);
-	bool kvm_ro = false;
 
 	/* Check for unusual errors */
 	if (dsisr & DSISR_UNSUPP_MMU) {
@@ -997,7 +988,6 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,
 					ea, DSISR_ISSTORE | DSISR_PROTFAULT);
 			return RESUME_GUEST;
 		}
-		kvm_ro = true;
 	}
 
 	/* Failed to set the reference/change bits */
@@ -1015,7 +1005,7 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,
 
 	/* Try to insert a pte */
 	ret = kvmppc_book3s_instantiate_page(vcpu, gpa, memslot, writing,
-					     kvm_ro, NULL, NULL);
+					     NULL, NULL);
 
 	if (ret == 0 || ret == -EAGAIN)
 		ret = RESUME_GUEST;
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 377d0b4a05ee..6d531051df04 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -1497,7 +1497,6 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
 	unsigned long n_gpa, gpa, gfn, perm = 0UL;
 	unsigned int shift, l1_shift, level;
 	bool writing = !!(dsisr & DSISR_ISSTORE);
-	bool kvm_ro = false;
 	long int ret;
 
 	if (!gp->l1_gr_to_hr) {
@@ -1577,7 +1576,6 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
 					ea, DSISR_ISSTORE | DSISR_PROTFAULT);
 			return RESUME_GUEST;
 		}
-		kvm_ro = true;
 	}
 
 	/* 2. Find the host pte for this L1 guest real address */
@@ -1599,7 +1597,7 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
 	if (!pte_present(pte) || (writing && !(pte_val(pte) & _PAGE_WRITE))) {
 		/* No suitable pte found -> try to insert a mapping */
 		ret = kvmppc_book3s_instantiate_page(vcpu, gpa, memslot,
-					writing, kvm_ro, &pte, &level);
+					writing, &pte, &level);
 		if (ret == -EAGAIN)
 			return RESUME_GUEST;
 		else if (ret)
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH v7 8/8] KVM: remove __gfn_to_pfn_memslot
@ 2023-07-04  7:50   ` David Stevens
  0 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-04  7:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Marc Zyngier, linux-kernel, Peter Xu, David Stevens, kvmarm,
	linuxppc-dev, linux-arm-kernel

From: David Stevens <stevensd@chromium.org>

All callers have been migrated to __kvm_follow_pfn.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 virt/kvm/kvm_main.c | 33 ---------------------------------
 1 file changed, 33 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f7b41f220b6..5b5afd70f239 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2712,39 +2712,6 @@ kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
 }
 EXPORT_SYMBOL_GPL(__kvm_follow_pfn);
 
-kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
-			       bool atomic, bool interruptible, bool *async,
-			       bool write_fault, bool *writable, hva_t *hva)
-{
-	kvm_pfn_t pfn;
-	struct kvm_follow_pfn foll = {
-		.slot = slot,
-		.gfn = gfn,
-		.flags = FOLL_GET,
-		.atomic = atomic,
-		.allow_write_mapping = !!writable,
-	};
-
-	if (write_fault)
-		foll.flags |= FOLL_WRITE;
-	if (async)
-		foll.flags |= FOLL_NOWAIT;
-	if (interruptible)
-		foll.flags |= FOLL_INTERRUPTIBLE;
-
-	pfn = __kvm_follow_pfn(&foll);
-	if (pfn == KVM_PFN_ERR_NEEDS_IO) {
-		*async = true;
-		pfn = KVM_PFN_ERR_FAULT;
-	}
-	if (hva)
-		*hva = foll.hva;
-	if (writable)
-		*writable = foll.writable;
-	return pfn;
-}
-EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
-
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
  2023-07-04  7:50   ` David Stevens
  (?)
@ 2023-07-05  3:10     ` Yu Zhang
  -1 siblings, 0 replies; 165+ messages in thread
From: Yu Zhang @ 2023-07-05  3:10 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

> @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
>   * The slow path to get the pfn of the specified host virtual address,
>   * 1 indicates success, -errno is returned if error is detected.
>   */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> -			   bool interruptible, bool *writable, kvm_pfn_t *pfn)
> +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  {
> -	unsigned int flags = FOLL_HWPOISON;
> +	unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
>  	struct page *page;
>  	int npages;
>  
>  	might_sleep();
>  
> -	if (writable)
> -		*writable = write_fault;
> -
> -	if (write_fault)
> -		flags |= FOLL_WRITE;
> -	if (async)
> -		flags |= FOLL_NOWAIT;
> -	if (interruptible)
> -		flags |= FOLL_INTERRUPTIBLE;
> -
> -	npages = get_user_pages_unlocked(addr, 1, &page, flags);
> +	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
>  	if (npages != 1)
>  		return npages;
>  
> +	foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> +
>  	/* map read fault as writable if possible */
> -	if (unlikely(!write_fault) && writable) {
> +	if (unlikely(!foll->writable) && foll->allow_write_mapping) {

I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
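I.e., to keep the original write_fault semantics, something along these
lines (untested sketch, just to illustrate):

	foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;

	/* map read fault as writable if possible */
	if (unlikely(!(foll->flags & FOLL_WRITE)) && foll->allow_write_mapping) {
		struct page *wpage;

		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
			foll->writable = true;
			put_page(page);
			page = wpage;
		}
	}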

>  		struct page *wpage;
>  
> -		if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> -			*writable = true;
> +		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> +			foll->writable = true;
>  			put_page(page);
>  			page = wpage;
>  		}
> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
>  	return get_page_unless_zero(page);
>  }
>  
...

> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> +			       bool atomic, bool interruptible, bool *async,
> +			       bool write_fault, bool *writable, hva_t *hva)
> +{
> +	kvm_pfn_t pfn;
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = 0,
> +		.atomic = atomic,
> +		.allow_write_mapping = !!writable,
> +	};
> +
> +	if (write_fault)
> +		foll.flags |= FOLL_WRITE;
> +	if (async)
> +		foll.flags |= FOLL_NOWAIT;
> +	if (interruptible)
> +		foll.flags |= FOLL_INTERRUPTIBLE;
> +
> +	pfn = __kvm_follow_pfn(&foll);
> +	if (pfn == KVM_PFN_ERR_NEEDS_IO) {

Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e.,
	if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))?
Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault
seems unnecessary.
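In other words, __gfn_to_pfn_memslot() could end up looking roughly like
this (untested; async is non-NULL whenever FOLL_NOWAIT is set here):

	pfn = __kvm_follow_pfn(&foll);
	if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))
		*async = true;
	if (hva)
		*hva = foll.hva;
	if (writable)
		*writable = foll.writable;
	return pfn;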

> +		*async = true;
> +		pfn = KVM_PFN_ERR_FAULT;
> +	}
> +	if (hva)
> +		*hva = foll.hva;
> +	if (writable)
> +		*writable = foll.writable;
> +	return pfn;
>  }
>  EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>  

B.R.
Yu

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
  2023-07-04  7:50   ` David Stevens
  (?)
@ 2023-07-05  7:23     ` Yu Zhang
  -1 siblings, 0 replies; 165+ messages in thread
From: Yu Zhang @ 2023-07-05  7:23 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

>  
> +void kvm_set_page_accessed(struct page *page);
> +void kvm_set_page_dirty(struct page *page);
> +
No one calls these two routines in this patch. How about moving this change to
[PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn?

...

> @@ -2930,17 +2933,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
>  	return !PageReserved(page);
>  }
>  
> -static void kvm_set_page_dirty(struct page *page)
> +void kvm_set_page_dirty(struct page *page)
>  {
>  	if (kvm_is_ad_tracked_page(page))
>  		SetPageDirty(page);
>  }
> +EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
>  
> -static void kvm_set_page_accessed(struct page *page)
> +void kvm_set_page_accessed(struct page *page)
>  {
>  	if (kvm_is_ad_tracked_page(page))
>  		mark_page_accessed(page);
>  }
> +EXPORT_SYMBOL_GPL(kvm_set_page_accessed);

Same here.

B.R.
Yu

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 4/8] KVM: x86/mmu: Migrate to __kvm_follow_pfn
  2023-07-04  7:50   ` David Stevens
  (?)
@ 2023-07-05  8:07     ` Yu Zhang
  -1 siblings, 0 replies; 165+ messages in thread
From: Yu Zhang @ 2023-07-05  8:07 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Tue, Jul 04, 2023 at 04:50:49PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  arch/x86/kvm/mmu/mmu.c | 35 +++++++++++++++++++++++++----------
>  1 file changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ec169f5c7dce..e44ab512c3a1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4296,7 +4296,12 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_memory_slot *slot = fault->slot;
> -	bool async;
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = fault->gfn,
> +		.flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
> +		.allow_write_mapping = true,
> +	};
>  
>  	/*
>  	 * Retry the page fault if the gfn hit a memslot that is being deleted
> @@ -4325,12 +4330,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  			return RET_PF_EMULATE;
>  	}
>  
> -	async = false;
> -	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> -					  fault->write, &fault->map_writable,
> -					  &fault->hva);
> -	if (!async)
> -		return RET_PF_CONTINUE; /* *pfn has correct page already */
> +	foll.flags |= FOLL_NOWAIT;
> +	fault->pfn = __kvm_follow_pfn(&foll);
> +
> +	if (!is_error_noslot_pfn(fault->pfn))
> +		goto success;
> +
> +	if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
> +		return RET_PF_CONTINUE;

IIUC, FOLL_NOWAIT is set only when we want an async fault, so
KVM_PFN_ERR_NEEDS_IO may not be necessary?
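I.e., since FOLL_NOWAIT is always set on this path, the caller could in
principle get by without the new error code. Untested sketch, assuming a
bare KVM_PFN_ERR_FAULT under FOLL_NOWAIT is only used to mean "needs
async handling":

	foll.flags |= FOLL_NOWAIT;
	fault->pfn = __kvm_follow_pfn(&foll);

	if (!is_error_noslot_pfn(fault->pfn))
		goto success;

	/* Assumption: with FOLL_NOWAIT set, FAULT means "go the async route". */
	if (fault->pfn != KVM_PFN_ERR_FAULT)
		return RET_PF_CONTINUE;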

B.R.
Yu

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
  2023-07-04  7:50   ` David Stevens
  (?)
@ 2023-07-05  8:47     ` Zhi Wang
  -1 siblings, 0 replies; 165+ messages in thread
From: Zhi Wang @ 2023-07-05  8:47 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Tue,  4 Jul 2023 16:50:47 +0900
David Stevens <stevensd@chromium.org> wrote:

> From: David Stevens <stevensd@chromium.org>
> 
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> __kvm_follow_pfn refactors the old API's arguments into a struct and,
> where possible, combines the boolean arguments into a single flags
> argument.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  include/linux/kvm_host.h |  16 ++++
>  virt/kvm/kvm_main.c      | 171 ++++++++++++++++++++++-----------------
>  virt/kvm/kvm_mm.h        |   3 +-
>  virt/kvm/pfncache.c      |   8 +-
>  4 files changed, 122 insertions(+), 76 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9d3ac7720da9..ef2763c2b12e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -97,6 +97,7 @@
>  #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
>  #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
>  #define KVM_PFN_ERR_SIGPENDING	(KVM_PFN_ERR_MASK + 3)
> +#define KVM_PFN_ERR_NEEDS_IO	(KVM_PFN_ERR_MASK + 4)
>  
>  /*
>   * error pfns indicate that the gfn is in slot but faild to
> @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
>  void kvm_release_page_clean(struct page *page);
>  void kvm_release_page_dirty(struct page *page);
>  
> +struct kvm_follow_pfn {
> +	const struct kvm_memory_slot *slot;
> +	gfn_t gfn;
> +	unsigned int flags;
> +	bool atomic;
> +	/* Allow a read fault to create a writeable mapping. */
> +	bool allow_write_mapping;
> +
> +	/* Outputs of __kvm_follow_pfn */
> +	hva_t hva;
> +	bool writable;
> +};
> +
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> +
>  kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
>  kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		      bool *writable);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 371bd783ff2b..b13f22861d2f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>   * true indicates success, otherwise false is returned.  It's also the
>   * only part that runs if we can in atomic context.
>   */
> -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> -			    bool *writable, kvm_pfn_t *pfn)
> +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  {
>  	struct page *page[1];
> +	bool write_fault = foll->flags & FOLL_WRITE;
>  
>  	/*
>  	 * Fast pin a writable pfn only if it is a write fault request
>  	 * or the caller allows to map a writable pfn for a read fault
>  	 * request.
>  	 */
> -	if (!(write_fault || writable))
> +	if (!(write_fault || foll->allow_write_mapping))
>  		return false;
>  
> -	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> +	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
>  		*pfn = page_to_pfn(page[0]);
> -
> -		if (writable)
> -			*writable = true;
> +		foll->writable = foll->allow_write_mapping;
>  		return true;
>  	}
>  
> @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
>   * The slow path to get the pfn of the specified host virtual address,
>   * 1 indicates success, -errno is returned if error is detected.
>   */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> -			   bool interruptible, bool *writable, kvm_pfn_t *pfn)
> +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  {
> -	unsigned int flags = FOLL_HWPOISON;
> +	unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
>  	struct page *page;
>  	int npages;
>  
>  	might_sleep();
>  
> -	if (writable)
> -		*writable = write_fault;
> -
> -	if (write_fault)
> -		flags |= FOLL_WRITE;
> -	if (async)
> -		flags |= FOLL_NOWAIT;
> -	if (interruptible)
> -		flags |= FOLL_INTERRUPTIBLE;
> -
> -	npages = get_user_pages_unlocked(addr, 1, &page, flags);
> +	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
>  	if (npages != 1)
>  		return npages;
>  
> +	foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> +
>  	/* map read fault as writable if possible */
> -	if (unlikely(!write_fault) && writable) {
> +	if (unlikely(!foll->writable) && foll->allow_write_mapping) {
>  		struct page *wpage;
>  
> -		if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> -			*writable = true;
> +		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> +			foll->writable = true;
>  			put_page(page);
>  			page = wpage;
>  		}
> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
>  	return get_page_unless_zero(page);
>  }
>  
> -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> -			       unsigned long addr, bool write_fault,
> -			       bool *writable, kvm_pfn_t *p_pfn)
> +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> +			       kvm_pfn_t *p_pfn)
>  {
>  	kvm_pfn_t pfn;
>  	pte_t *ptep;
>  	spinlock_t *ptl;
> +	bool write_fault = foll->flags & FOLL_WRITE;
>  	int r;
>  
> -	r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> +	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
>  	if (r) {
>  		/*
>  		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
>  		 * not call the fault handler, so do it here.
>  		 */
>  		bool unlocked = false;
> -		r = fixup_user_fault(current->mm, addr,
> +		r = fixup_user_fault(current->mm, foll->hva,
>  				     (write_fault ? FAULT_FLAG_WRITE : 0),
>  				     &unlocked);
>  		if (unlocked)
> @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  		if (r)
>  			return r;
>  
> -		r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> +		r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
>  		if (r)
>  			return r;
>  	}
> @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (writable)
> -		*writable = pte_write(*ptep);
> +	foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
>  	pfn = pte_pfn(*ptep);
>  
>  	/*
> @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   * 2): @write_fault = false && @writable, @writable will tell the caller
>   *     whether the mapping is writable.
>   */
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> -		     bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
>  {
>  	struct vm_area_struct *vma;
>  	kvm_pfn_t pfn;
>  	int npages, r;
>  
>  	/* we can do it either atomically or asynchronously, not both */
> -	BUG_ON(atomic && async);
> +	BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
>  
> -	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> +	if (hva_to_pfn_fast(foll, &pfn))
>  		return pfn;
>  
> -	if (atomic)
> +	if (foll->atomic)
>  		return KVM_PFN_ERR_FAULT;
>  
> -	npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> -				 writable, &pfn);
> +	npages = hva_to_pfn_slow(foll, &pfn);
>  	if (npages == 1)
>  		return pfn;
>  	if (npages == -EINTR)
> @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>  
>  	mmap_read_lock(current->mm);
>  	if (npages == -EHWPOISON ||
> -	      (!async && check_user_page_hwpoison(addr))) {
> +	      (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
>  		pfn = KVM_PFN_ERR_HWPOISON;
>  		goto exit;
>  	}
>  
>  retry:
> -	vma = vma_lookup(current->mm, addr);
> +	vma = vma_lookup(current->mm, foll->hva);
>  
>  	if (vma == NULL)
>  		pfn = KVM_PFN_ERR_FAULT;
>  	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> -		r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> +		r = hva_to_pfn_remapped(vma, foll, &pfn);
>  		if (r == -EAGAIN)
>  			goto retry;
>  		if (r < 0)
>  			pfn = KVM_PFN_ERR_FAULT;
>  	} else {
> -		if (async && vma_is_valid(vma, write_fault))
> -			*async = true;
> -		pfn = KVM_PFN_ERR_FAULT;
> +		if ((foll->flags & FOLL_NOWAIT) &&
> +		    vma_is_valid(vma, foll->flags & FOLL_WRITE))
> +			pfn = KVM_PFN_ERR_NEEDS_IO;
> +		else
> +			pfn = KVM_PFN_ERR_FAULT;
>  	}
>  exit:
>  	mmap_read_unlock(current->mm);
>  	return pfn;
>  }
>  
> -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> -			       bool atomic, bool interruptible, bool *async,
> -			       bool write_fault, bool *writable, hva_t *hva)
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
>  {
> -	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> -
> -	if (hva)
> -		*hva = addr;
> +	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> +				      foll->flags & FOLL_WRITE);
>  
> -	if (addr == KVM_HVA_ERR_RO_BAD) {
> -		if (writable)
> -			*writable = false;
> +	if (foll->hva == KVM_HVA_ERR_RO_BAD)
>  		return KVM_PFN_ERR_RO_FAULT;
> -	}
>  

Can you explain why updating foll->writable = false (previously *writable
= false) is omitted here?

In the callers where the struct kvm_follow_pfn is initialized, e.g.
__gfn_to_pfn_memslot()/gfn_to_pfn_prot(), .writable is not explicitly
initialized. IIUC, they expect __kvm_follow_pfn() to update it and return
.writable to the upper caller.

As one of the outputs, it would be better to either initialize it in the
caller or update it unconditionally in __kvm_follow_pfn(). Otherwise
__gfn_to_pfn_memslot()/gfn_to_pfn_prot() may return random stack data to
the caller via bool *writable, which doesn't sound nice.
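E.g. something as simple as this at the top of __kvm_follow_pfn() would
cover it (just a sketch of what I mean, untested):

	kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
	{
		/* Make sure the outputs are well-defined on every return path. */
		foll->writable = false;

		foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
					      foll->flags & FOLL_WRITE);

		if (foll->hva == KVM_HVA_ERR_RO_BAD)
			return KVM_PFN_ERR_RO_FAULT;

		if (kvm_is_error_hva(foll->hva))
			return KVM_PFN_NOSLOT;

		if (memslot_is_readonly(foll->slot))
			foll->allow_write_mapping = false;

		return hva_to_pfn(foll);
	}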

BTW: it seems both "writable" and "writeable" are used in this patch. I am
wondering whether we can make the spelling consistent.

> -	if (kvm_is_error_hva(addr)) {
> -		if (writable)
> -			*writable = false;
> +	if (kvm_is_error_hva(foll->hva))
>  		return KVM_PFN_NOSLOT;
> -	}
>  
> -	/* Do not map writable pfn in the readonly memslot. */
> -	if (writable && memslot_is_readonly(slot)) {
> -		*writable = false;
> -		writable = NULL;
> -	}
> +	if (memslot_is_readonly(foll->slot))
> +		foll->allow_write_mapping = false;
> +
> +	return hva_to_pfn(foll);
> +}
> +EXPORT_SYMBOL_GPL(__kvm_follow_pfn);
>  
> -	return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> -			  writable);
> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> +			       bool atomic, bool interruptible, bool *async,
> +			       bool write_fault, bool *writable, hva_t *hva)
> +{
> +	kvm_pfn_t pfn;
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = 0,
> +		.atomic = atomic,
> +		.allow_write_mapping = !!writable,
> +	};
> +
> +	if (write_fault)
> +		foll.flags |= FOLL_WRITE;
> +	if (async)
> +		foll.flags |= FOLL_NOWAIT;
> +	if (interruptible)
> +		foll.flags |= FOLL_INTERRUPTIBLE;
> +
> +	pfn = __kvm_follow_pfn(&foll);
> +	if (pfn == KVM_PFN_ERR_NEEDS_IO) {
> +		*async = true;
> +		pfn = KVM_PFN_ERR_FAULT;
> +	}
> +	if (hva)
> +		*hva = foll.hva;
> +	if (writable)
> +		*writable = foll.writable;
> +	return pfn;
>  }
>  EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>  
>  kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		      bool *writable)
>  {
> -	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
> -				    NULL, write_fault, writable, NULL);
> +	kvm_pfn_t pfn;
> +	struct kvm_follow_pfn foll = {
> +		.slot = gfn_to_memslot(kvm, gfn),
> +		.gfn = gfn,
> +		.flags = write_fault ? FOLL_WRITE : 0,
> +		.allow_write_mapping = !!writable,
> +	};
> +	pfn = __kvm_follow_pfn(&foll);
> +	if (writable)
> +		*writable = foll.writable;
> +	return pfn;
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
>  
>  kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> -	return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
> -				    NULL, NULL);
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = FOLL_WRITE,
> +	};
> +	return __kvm_follow_pfn(&foll);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
>  
>  kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> -	return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
> -				    NULL, NULL);
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = FOLL_WRITE,
> +		.atomic = true,
> +	};
> +	return __kvm_follow_pfn(&foll);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
>  
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index 180f1a09e6ba..ed896aee5396 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -20,8 +20,7 @@
>  #define KVM_MMU_UNLOCK(kvm)		spin_unlock(&(kvm)->mmu_lock)
>  #endif /* KVM_HAVE_MMU_RWLOCK */
>  
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> -		     bool *async, bool write_fault, bool *writable);
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);
>  
>  #ifdef CONFIG_HAVE_KVM_PFNCACHE
>  void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 2d6aba677830..e3fefa753a51 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  	kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
>  	void *new_khva = NULL;
>  	unsigned long mmu_seq;
> +	struct kvm_follow_pfn foll = {
> +		.slot = gpc->memslot,
> +		.gfn = gpa_to_gfn(gpc->gpa),
> +		.flags = FOLL_WRITE,
> +		.hva = gpc->uhva,
> +	};
>  
>  	lockdep_assert_held(&gpc->refresh_lock);
>  
> @@ -183,7 +189,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  		}
>  
>  		/* We always request a writeable mapping */
> -		new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
> +		new_pfn = hva_to_pfn(&foll);
>  		if (is_error_noslot_pfn(new_pfn))
>  			goto out_error;
>  


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
@ 2023-07-05  8:47     ` Zhi Wang
  0 siblings, 0 replies; 165+ messages in thread
From: Zhi Wang @ 2023-07-05  8:47 UTC (permalink / raw)
  To: David Stevens
  Cc: Marc Zyngier, kvm, Sean Christopherson, linux-kernel, Peter Xu,
	kvmarm, linuxppc-dev, linux-arm-kernel

On Tue,  4 Jul 2023 16:50:47 +0900
David Stevens <stevensd@chromium.org> wrote:

> From: David Stevens <stevensd@chromium.org>
> 
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> __kvm_follow_pfn refactors the old API's arguments into a struct and,
> where possible, combines the boolean arguments into a single flags
> argument.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  include/linux/kvm_host.h |  16 ++++
>  virt/kvm/kvm_main.c      | 171 ++++++++++++++++++++++-----------------
>  virt/kvm/kvm_mm.h        |   3 +-
>  virt/kvm/pfncache.c      |   8 +-
>  4 files changed, 122 insertions(+), 76 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9d3ac7720da9..ef2763c2b12e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -97,6 +97,7 @@
>  #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
>  #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
>  #define KVM_PFN_ERR_SIGPENDING	(KVM_PFN_ERR_MASK + 3)
> +#define KVM_PFN_ERR_NEEDS_IO	(KVM_PFN_ERR_MASK + 4)
>  
>  /*
>   * error pfns indicate that the gfn is in slot but faild to
> @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
>  void kvm_release_page_clean(struct page *page);
>  void kvm_release_page_dirty(struct page *page);
>  
> +struct kvm_follow_pfn {
> +	const struct kvm_memory_slot *slot;
> +	gfn_t gfn;
> +	unsigned int flags;
> +	bool atomic;
> +	/* Allow a read fault to create a writeable mapping. */
> +	bool allow_write_mapping;
> +
> +	/* Outputs of __kvm_follow_pfn */
> +	hva_t hva;
> +	bool writable;
> +};
> +
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> +
>  kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
>  kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		      bool *writable);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 371bd783ff2b..b13f22861d2f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>   * true indicates success, otherwise false is returned.  It's also the
>   * only part that runs if we can in atomic context.
>   */
> -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> -			    bool *writable, kvm_pfn_t *pfn)
> +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  {
>  	struct page *page[1];
> +	bool write_fault = foll->flags & FOLL_WRITE;
>  
>  	/*
>  	 * Fast pin a writable pfn only if it is a write fault request
>  	 * or the caller allows to map a writable pfn for a read fault
>  	 * request.
>  	 */
> -	if (!(write_fault || writable))
> +	if (!(write_fault || foll->allow_write_mapping))
>  		return false;
>  
> -	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> +	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
>  		*pfn = page_to_pfn(page[0]);
> -
> -		if (writable)
> -			*writable = true;
> +		foll->writable = foll->allow_write_mapping;
>  		return true;
>  	}
>  
> @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
>   * The slow path to get the pfn of the specified host virtual address,
>   * 1 indicates success, -errno is returned if error is detected.
>   */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> -			   bool interruptible, bool *writable, kvm_pfn_t *pfn)
> +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  {
> -	unsigned int flags = FOLL_HWPOISON;
> +	unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
>  	struct page *page;
>  	int npages;
>  
>  	might_sleep();
>  
> -	if (writable)
> -		*writable = write_fault;
> -
> -	if (write_fault)
> -		flags |= FOLL_WRITE;
> -	if (async)
> -		flags |= FOLL_NOWAIT;
> -	if (interruptible)
> -		flags |= FOLL_INTERRUPTIBLE;
> -
> -	npages = get_user_pages_unlocked(addr, 1, &page, flags);
> +	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
>  	if (npages != 1)
>  		return npages;
>  
> +	foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> +
>  	/* map read fault as writable if possible */
> -	if (unlikely(!write_fault) && writable) {
> +	if (unlikely(!foll->writable) && foll->allow_write_mapping) {
>  		struct page *wpage;
>  
> -		if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> -			*writable = true;
> +		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> +			foll->writable = true;
>  			put_page(page);
>  			page = wpage;
>  		}
> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
>  	return get_page_unless_zero(page);
>  }
>  
> -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> -			       unsigned long addr, bool write_fault,
> -			       bool *writable, kvm_pfn_t *p_pfn)
> +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> +			       kvm_pfn_t *p_pfn)
>  {
>  	kvm_pfn_t pfn;
>  	pte_t *ptep;
>  	spinlock_t *ptl;
> +	bool write_fault = foll->flags & FOLL_WRITE;
>  	int r;
>  
> -	r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> +	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
>  	if (r) {
>  		/*
>  		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
>  		 * not call the fault handler, so do it here.
>  		 */
>  		bool unlocked = false;
> -		r = fixup_user_fault(current->mm, addr,
> +		r = fixup_user_fault(current->mm, foll->hva,
>  				     (write_fault ? FAULT_FLAG_WRITE : 0),
>  				     &unlocked);
>  		if (unlocked)
> @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  		if (r)
>  			return r;
>  
> -		r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> +		r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
>  		if (r)
>  			return r;
>  	}
> @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (writable)
> -		*writable = pte_write(*ptep);
> +	foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
>  	pfn = pte_pfn(*ptep);
>  
>  	/*
> @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   * 2): @write_fault = false && @writable, @writable will tell the caller
>   *     whether the mapping is writable.
>   */
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> -		     bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
>  {
>  	struct vm_area_struct *vma;
>  	kvm_pfn_t pfn;
>  	int npages, r;
>  
>  	/* we can do it either atomically or asynchronously, not both */
> -	BUG_ON(atomic && async);
> +	BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
>  
> -	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> +	if (hva_to_pfn_fast(foll, &pfn))
>  		return pfn;
>  
> -	if (atomic)
> +	if (foll->atomic)
>  		return KVM_PFN_ERR_FAULT;
>  
> -	npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> -				 writable, &pfn);
> +	npages = hva_to_pfn_slow(foll, &pfn);
>  	if (npages == 1)
>  		return pfn;
>  	if (npages == -EINTR)
> @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>  
>  	mmap_read_lock(current->mm);
>  	if (npages == -EHWPOISON ||
> -	      (!async && check_user_page_hwpoison(addr))) {
> +	      (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
>  		pfn = KVM_PFN_ERR_HWPOISON;
>  		goto exit;
>  	}
>  
>  retry:
> -	vma = vma_lookup(current->mm, addr);
> +	vma = vma_lookup(current->mm, foll->hva);
>  
>  	if (vma == NULL)
>  		pfn = KVM_PFN_ERR_FAULT;
>  	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> -		r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> +		r = hva_to_pfn_remapped(vma, foll, &pfn);
>  		if (r == -EAGAIN)
>  			goto retry;
>  		if (r < 0)
>  			pfn = KVM_PFN_ERR_FAULT;
>  	} else {
> -		if (async && vma_is_valid(vma, write_fault))
> -			*async = true;
> -		pfn = KVM_PFN_ERR_FAULT;
> +		if ((foll->flags & FOLL_NOWAIT) &&
> +		    vma_is_valid(vma, foll->flags & FOLL_WRITE))
> +			pfn = KVM_PFN_ERR_NEEDS_IO;
> +		else
> +			pfn = KVM_PFN_ERR_FAULT;
>  	}
>  exit:
>  	mmap_read_unlock(current->mm);
>  	return pfn;
>  }
>  
> -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> -			       bool atomic, bool interruptible, bool *async,
> -			       bool write_fault, bool *writable, hva_t *hva)
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
>  {
> -	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> -
> -	if (hva)
> -		*hva = addr;
> +	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> +				      foll->flags & FOLL_WRITE);
>  
> -	if (addr == KVM_HVA_ERR_RO_BAD) {
> -		if (writable)
> -			*writable = false;
> +	if (foll->hva == KVM_HVA_ERR_RO_BAD)
>  		return KVM_PFN_ERR_RO_FAULT;
> -	}
>  

Can you explain why updating foll->writable = false (previously
*writable = false) is omitted here?

In the callers where the struct kvm_follow_pfn is initialized, e.g.
__gfn_to_pfn_memslot()/gfn_to_pfn_prot(), .writable is not initialized.
IIUC, they expect __kvm_follow_pfn() to update it and return .writable
to the upper caller.

Since it is one of the outputs, it would be better to either initialize
it in the caller or update it in __kvm_follow_pfn(). Otherwise
__gfn_to_pfn_memslot()/gfn_to_pfn_prot() will return random stack data
to the caller via bool *writable, which doesn't sound nice.

BTW: it seems both "writable" and "writeable" are used in this patch.
Maybe we can make them consistent.
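
To make the concern concrete, a hypothetical caller (not one in this
series) that skips the designated initializer would hand indeterminate
stack data back through *writable:

        /* hypothetical caller, not from this series */
        struct kvm_follow_pfn foll;     /* no initializer: .writable starts as stack garbage */
        kvm_pfn_t pfn;

        foll.slot = slot;
        foll.gfn = gfn;
        foll.flags = write_fault ? FOLL_WRITE : 0;
        foll.atomic = false;
        foll.allow_write_mapping = !!writable;

        pfn = __kvm_follow_pfn(&foll);
        if (writable)
                *writable = foll.writable;      /* garbage on the paths that never set it */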

> -	if (kvm_is_error_hva(addr)) {
> -		if (writable)
> -			*writable = false;
> +	if (kvm_is_error_hva(foll->hva))
>  		return KVM_PFN_NOSLOT;
> -	}
>  
> -	/* Do not map writable pfn in the readonly memslot. */
> -	if (writable && memslot_is_readonly(slot)) {
> -		*writable = false;
> -		writable = NULL;
> -	}
> +	if (memslot_is_readonly(foll->slot))
> +		foll->allow_write_mapping = false;
> +
> +	return hva_to_pfn(foll);
> +}
> +EXPORT_SYMBOL_GPL(__kvm_follow_pfn);
>  
> -	return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> -			  writable);
> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> +			       bool atomic, bool interruptible, bool *async,
> +			       bool write_fault, bool *writable, hva_t *hva)
> +{
> +	kvm_pfn_t pfn;
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = 0,
> +		.atomic = atomic,
> +		.allow_write_mapping = !!writable,
> +	};
> +
> +	if (write_fault)
> +		foll.flags |= FOLL_WRITE;
> +	if (async)
> +		foll.flags |= FOLL_NOWAIT;
> +	if (interruptible)
> +		foll.flags |= FOLL_INTERRUPTIBLE;
> +
> +	pfn = __kvm_follow_pfn(&foll);
> +	if (pfn == KVM_PFN_ERR_NEEDS_IO) {
> +		*async = true;
> +		pfn = KVM_PFN_ERR_FAULT;
> +	}
> +	if (hva)
> +		*hva = foll.hva;
> +	if (writable)
> +		*writable = foll.writable;
> +	return pfn;
>  }
>  EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>  
>  kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		      bool *writable)
>  {
> -	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
> -				    NULL, write_fault, writable, NULL);
> +	kvm_pfn_t pfn;
> +	struct kvm_follow_pfn foll = {
> +		.slot = gfn_to_memslot(kvm, gfn),
> +		.gfn = gfn,
> +		.flags = write_fault ? FOLL_WRITE : 0,
> +		.allow_write_mapping = !!writable,
> +	};
> +	pfn = __kvm_follow_pfn(&foll);
> +	if (writable)
> +		*writable = foll.writable;
> +	return pfn;
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
>  
>  kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> -	return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
> -				    NULL, NULL);
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = FOLL_WRITE,
> +	};
> +	return __kvm_follow_pfn(&foll);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
>  
>  kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> -	return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
> -				    NULL, NULL);
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = FOLL_WRITE,
> +		.atomic = true,
> +	};
> +	return __kvm_follow_pfn(&foll);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
>  
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index 180f1a09e6ba..ed896aee5396 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -20,8 +20,7 @@
>  #define KVM_MMU_UNLOCK(kvm)		spin_unlock(&(kvm)->mmu_lock)
>  #endif /* KVM_HAVE_MMU_RWLOCK */
>  
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> -		     bool *async, bool write_fault, bool *writable);
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);
>  
>  #ifdef CONFIG_HAVE_KVM_PFNCACHE
>  void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 2d6aba677830..e3fefa753a51 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  	kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
>  	void *new_khva = NULL;
>  	unsigned long mmu_seq;
> +	struct kvm_follow_pfn foll = {
> +		.slot = gpc->memslot,
> +		.gfn = gpa_to_gfn(gpc->gpa),
> +		.flags = FOLL_WRITE,
> +		.hva = gpc->uhva,
> +	};
>  
>  	lockdep_assert_held(&gpc->refresh_lock);
>  
> @@ -183,7 +189,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  		}
>  
>  		/* We always request a writeable mapping */
> -		new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
> +		new_pfn = hva_to_pfn(&foll);
>  		if (is_error_noslot_pfn(new_pfn))
>  			goto out_error;
>  


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
  2023-07-05  8:47     ` Zhi Wang
@ 2023-07-05  9:08       ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-05  9:08 UTC (permalink / raw)
  To: Zhi Wang
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Jul 5, 2023 at 5:47 PM Zhi Wang <zhi.wang.linux@gmail.com> wrote:
>
> On Tue,  4 Jul 2023 16:50:47 +0900
> David Stevens <stevensd@chromium.org> wrote:
>
> > From: David Stevens <stevensd@chromium.org>
> >
> > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> > __kvm_follow_pfn refactors the old API's arguments into a struct and,
> > where possible, combines the boolean arguments into a single flags
> > argument.
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > ---
> >  include/linux/kvm_host.h |  16 ++++
> >  virt/kvm/kvm_main.c      | 171 ++++++++++++++++++++++-----------------
> >  virt/kvm/kvm_mm.h        |   3 +-
> >  virt/kvm/pfncache.c      |   8 +-
> >  4 files changed, 122 insertions(+), 76 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 9d3ac7720da9..ef2763c2b12e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -97,6 +97,7 @@
> >  #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> >  #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> >  #define KVM_PFN_ERR_SIGPENDING       (KVM_PFN_ERR_MASK + 3)
> > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
> >
> >  /*
> >   * error pfns indicate that the gfn is in slot but faild to
> > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> >  void kvm_release_page_clean(struct page *page);
> >  void kvm_release_page_dirty(struct page *page);
> >
> > +struct kvm_follow_pfn {
> > +     const struct kvm_memory_slot *slot;
> > +     gfn_t gfn;
> > +     unsigned int flags;
> > +     bool atomic;
> > +     /* Allow a read fault to create a writeable mapping. */
> > +     bool allow_write_mapping;
> > +
> > +     /* Outputs of __kvm_follow_pfn */
> > +     hva_t hva;
> > +     bool writable;
> > +};
> > +
> > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> > +
> >  kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> >  kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> >                     bool *writable);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 371bd783ff2b..b13f22861d2f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> >   * true indicates success, otherwise false is returned.  It's also the
> >   * only part that runs if we can in atomic context.
> >   */
> > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > -                         bool *writable, kvm_pfn_t *pfn)
> > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> >  {
> >       struct page *page[1];
> > +     bool write_fault = foll->flags & FOLL_WRITE;
> >
> >       /*
> >        * Fast pin a writable pfn only if it is a write fault request
> >        * or the caller allows to map a writable pfn for a read fault
> >        * request.
> >        */
> > -     if (!(write_fault || writable))
> > +     if (!(write_fault || foll->allow_write_mapping))
> >               return false;
> >
> > -     if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> > +     if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> >               *pfn = page_to_pfn(page[0]);
> > -
> > -             if (writable)
> > -                     *writable = true;
> > +             foll->writable = foll->allow_write_mapping;
> >               return true;
> >       }
> >
> > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> >   * The slow path to get the pfn of the specified host virtual address,
> >   * 1 indicates success, -errno is returned if error is detected.
> >   */
> > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > -                        bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> >  {
> > -     unsigned int flags = FOLL_HWPOISON;
> > +     unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> >       struct page *page;
> >       int npages;
> >
> >       might_sleep();
> >
> > -     if (writable)
> > -             *writable = write_fault;
> > -
> > -     if (write_fault)
> > -             flags |= FOLL_WRITE;
> > -     if (async)
> > -             flags |= FOLL_NOWAIT;
> > -     if (interruptible)
> > -             flags |= FOLL_INTERRUPTIBLE;
> > -
> > -     npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > +     npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> >       if (npages != 1)
> >               return npages;
> >
> > +     foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > +
> >       /* map read fault as writable if possible */
> > -     if (unlikely(!write_fault) && writable) {
> > +     if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> >               struct page *wpage;
> >
> > -             if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> > -                     *writable = true;
> > +             if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> > +                     foll->writable = true;
> >                       put_page(page);
> >                       page = wpage;
> >               }
> > @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> >       return get_page_unless_zero(page);
> >  }
> >
> > -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > -                            unsigned long addr, bool write_fault,
> > -                            bool *writable, kvm_pfn_t *p_pfn)
> > +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> > +                            kvm_pfn_t *p_pfn)
> >  {
> >       kvm_pfn_t pfn;
> >       pte_t *ptep;
> >       spinlock_t *ptl;
> > +     bool write_fault = foll->flags & FOLL_WRITE;
> >       int r;
> >
> > -     r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> > +     r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> >       if (r) {
> >               /*
> >                * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> >                * not call the fault handler, so do it here.
> >                */
> >               bool unlocked = false;
> > -             r = fixup_user_fault(current->mm, addr,
> > +             r = fixup_user_fault(current->mm, foll->hva,
> >                                    (write_fault ? FAULT_FLAG_WRITE : 0),
> >                                    &unlocked);
> >               if (unlocked)
> > @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> >               if (r)
> >                       return r;
> >
> > -             r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> > +             r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> >               if (r)
> >                       return r;
> >       }
> > @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> >               goto out;
> >       }
> >
> > -     if (writable)
> > -             *writable = pte_write(*ptep);
> > +     foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
> >       pfn = pte_pfn(*ptep);
> >
> >       /*
> > @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> >   * 2): @write_fault = false && @writable, @writable will tell the caller
> >   *     whether the mapping is writable.
> >   */
> > -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> > -                  bool *async, bool write_fault, bool *writable)
> > +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
> >  {
> >       struct vm_area_struct *vma;
> >       kvm_pfn_t pfn;
> >       int npages, r;
> >
> >       /* we can do it either atomically or asynchronously, not both */
> > -     BUG_ON(atomic && async);
> > +     BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
> >
> > -     if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> > +     if (hva_to_pfn_fast(foll, &pfn))
> >               return pfn;
> >
> > -     if (atomic)
> > +     if (foll->atomic)
> >               return KVM_PFN_ERR_FAULT;
> >
> > -     npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> > -                              writable, &pfn);
> > +     npages = hva_to_pfn_slow(foll, &pfn);
> >       if (npages == 1)
> >               return pfn;
> >       if (npages == -EINTR)
> > @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> >
> >       mmap_read_lock(current->mm);
> >       if (npages == -EHWPOISON ||
> > -           (!async && check_user_page_hwpoison(addr))) {
> > +           (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
> >               pfn = KVM_PFN_ERR_HWPOISON;
> >               goto exit;
> >       }
> >
> >  retry:
> > -     vma = vma_lookup(current->mm, addr);
> > +     vma = vma_lookup(current->mm, foll->hva);
> >
> >       if (vma == NULL)
> >               pfn = KVM_PFN_ERR_FAULT;
> >       else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> > -             r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> > +             r = hva_to_pfn_remapped(vma, foll, &pfn);
> >               if (r == -EAGAIN)
> >                       goto retry;
> >               if (r < 0)
> >                       pfn = KVM_PFN_ERR_FAULT;
> >       } else {
> > -             if (async && vma_is_valid(vma, write_fault))
> > -                     *async = true;
> > -             pfn = KVM_PFN_ERR_FAULT;
> > +             if ((foll->flags & FOLL_NOWAIT) &&
> > +                 vma_is_valid(vma, foll->flags & FOLL_WRITE))
> > +                     pfn = KVM_PFN_ERR_NEEDS_IO;
> > +             else
> > +                     pfn = KVM_PFN_ERR_FAULT;
> >       }
> >  exit:
> >       mmap_read_unlock(current->mm);
> >       return pfn;
> >  }
> >
> > -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > -                            bool atomic, bool interruptible, bool *async,
> > -                            bool write_fault, bool *writable, hva_t *hva)
> > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> >  {
> > -     unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> > -
> > -     if (hva)
> > -             *hva = addr;
> > +     foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> > +                                   foll->flags & FOLL_WRITE);
> >
> > -     if (addr == KVM_HVA_ERR_RO_BAD) {
> > -             if (writable)
> > -                     *writable = false;
> > +     if (foll->hva == KVM_HVA_ERR_RO_BAD)
> >               return KVM_PFN_ERR_RO_FAULT;
> > -     }
> >
>
> Can you explain why updating foll->writable = false (previously
> *writable = false) is omitted here?
>
> In the callers where the struct kvm_follow_pfn is initialized, e.g.
> __gfn_to_pfn_memslot()/gfn_to_pfn_prot(), .writable is not initialized.
> IIUC, they expect __kvm_follow_pfn() to update it and return .writable
> to the upper caller.
>
> Since it is one of the outputs, it would be better to either initialize
> it in the caller or update it in __kvm_follow_pfn(). Otherwise
> __gfn_to_pfn_memslot()/gfn_to_pfn_prot() will return random stack data
> to the caller via bool *writable, which doesn't sound nice.

Entries omitted from a designated initializer are initialized to zero,
so .writable does get initialized in all of the callers added in this
series. That said, you're right that explicitly setting it to false is
a good idea, in case someone someday adds a caller that doesn't declare
its kvm_follow_pfn with an initializer.
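
As a sketch, the explicit clear would look like this (the rest of the
body is as in this patch; the foll->writable = false line is only the
change discussed here, not something v7 does):

        kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
        {
                /*
                 * Callers in this series use designated initializers, so the
                 * .writable output is already zero-initialized; clearing it
                 * here also covers a future caller that doesn't.
                 */
                foll->writable = false;

                foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
                                              foll->flags & FOLL_WRITE);

                if (foll->hva == KVM_HVA_ERR_RO_BAD)
                        return KVM_PFN_ERR_RO_FAULT;

                if (kvm_is_error_hva(foll->hva))
                        return KVM_PFN_NOSLOT;

                if (memslot_is_readonly(foll->slot))
                        foll->allow_write_mapping = false;

                return hva_to_pfn(foll);
        }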

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
  2023-07-05  3:10     ` Yu Zhang
@ 2023-07-05  9:22       ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-05  9:22 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
>
> > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> >   * The slow path to get the pfn of the specified host virtual address,
> >   * 1 indicates success, -errno is returned if error is detected.
> >   */
> > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > -                        bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> >  {
> > -     unsigned int flags = FOLL_HWPOISON;
> > +     unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> >       struct page *page;
> >       int npages;
> >
> >       might_sleep();
> >
> > -     if (writable)
> > -             *writable = write_fault;
> > -
> > -     if (write_fault)
> > -             flags |= FOLL_WRITE;
> > -     if (async)
> > -             flags |= FOLL_NOWAIT;
> > -     if (interruptible)
> > -             flags |= FOLL_INTERRUPTIBLE;
> > -
> > -     npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > +     npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> >       if (npages != 1)
> >               return npages;
> >
> > +     foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > +
> >       /* map read fault as writable if possible */
> > -     if (unlikely(!write_fault) && writable) {
> > +     if (unlikely(!foll->writable) && foll->allow_write_mapping) {
>
> I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.

The two statements are logically equivalent, although I guess using
!(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
verbose.
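
Spelled out with local bools purely for illustration, given the
assignment the patch makes just above the check:

        foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;

        bool as_written = !foll->writable && foll->allow_write_mapping;
        bool suggested = !(foll->flags & FOLL_WRITE) && foll->allow_write_mapping;

        /*
         * Both are false when allow_write_mapping is false, and when it is
         * true, !foll->writable collapses to !(foll->flags & FOLL_WRITE),
         * so as_written == suggested on every path.
         */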

> >               struct page *wpage;
> >
> > -             if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> > -                     *writable = true;
> > +             if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> > +                     foll->writable = true;
> >                       put_page(page);
> >                       page = wpage;
> >               }
> > @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> >       return get_page_unless_zero(page);
> >  }
> >
> ...
>
> > +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > +                            bool atomic, bool interruptible, bool *async,
> > +                            bool write_fault, bool *writable, hva_t *hva)
> > +{
> > +     kvm_pfn_t pfn;
> > +     struct kvm_follow_pfn foll = {
> > +             .slot = slot,
> > +             .gfn = gfn,
> > +             .flags = 0,
> > +             .atomic = atomic,
> > +             .allow_write_mapping = !!writable,
> > +     };
> > +
> > +     if (write_fault)
> > +             foll.flags |= FOLL_WRITE;
> > +     if (async)
> > +             foll.flags |= FOLL_NOWAIT;
> > +     if (interruptible)
> > +             foll.flags |= FOLL_INTERRUPTIBLE;
> > +
> > +     pfn = __kvm_follow_pfn(&foll);
> > +     if (pfn == KVM_PFN_ERR_NEEDS_IO) {
>
> Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e.,
>         if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))?
> Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault
> seems unnecessary.

There are cases where the fault does not fall within a vma, or where
the target vma's flags don't support the fault's access permissions.
In those cases, telling the caller to retry the fault asynchronously
won't cause problems per se, but it's wasteful and a bit confusing.
Having hva_to_pfn detect whether it may be possible to resolve the
fault asynchronously, and return KVM_PFN_ERR_NEEDS_IO only in that
case, seems like a good idea. It also matches what the existing code
does.
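
For illustration, the caller-side handling this enables (the if-block
is from __gfn_to_pfn_memslot() in this patch; the comment is mine):

        pfn = __kvm_follow_pfn(&foll);
        /*
         * KVM_PFN_ERR_NEEDS_IO means hva_to_pfn() found a valid vma whose
         * permissions match the fault and only FOLL_NOWAIT stopped it.
         * Checking for FOLL_NOWAIT alone would also report "async" for the
         * no-vma and wrong-permission cases above.
         */
        if (pfn == KVM_PFN_ERR_NEEDS_IO) {
                *async = true;
                pfn = KVM_PFN_ERR_FAULT;
        }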

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-04  7:50   ` David Stevens
  (?)
@ 2023-07-05 10:18     ` Yu Zhang
  -1 siblings, 0 replies; 165+ messages in thread
From: Yu Zhang @ 2023-07-05 10:18 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> memory into the guest that is backed by un-refcounted struct pages - for
> example, higher order non-compound pages allocated by the amdgpu driver
> via ttm_pool_alloc_page.

I guess you mean the tail pages of the higher order non-compound pages?
As for the head page, its refcount is said to be set to one only
coincidentally[*], and it shall not be considered refcounted.  IIUC, the
refcount of this head page will be increased and then decreased soon in
hva_to_pfn_remapped(), so this may not be a problem(?). But treating
this head page differently, as a refcounted one (e.g., to set the A/D
flags), is weird.

Or maybe I missed some context, e.g., can the head page be allocated to
the guest at all?
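
For reference, a minimal sketch of the allocation pattern in question
(illustration only, not taken from the series): a higher order
allocation made without __GFP_COMP leaves a refcount of one on the
head page and zero on every tail page.

        /* e.g. order-4, non-compound, as ttm_pool_alloc_page may do */
        struct page *head = alloc_pages(GFP_KERNEL, 4);

        WARN_ON(page_count(head) != 1);         /* head looks refcounted */
        WARN_ON(page_count(head + 1) != 0);     /* tail pages do not */

        /* the block is freed as a whole, not page by page */
        __free_pages(head, 4);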


> 
> The bulk of this change is tracking the is_refcounted_page flag so that
> non-refcounted pages don't trigger page_count() == 0 warnings. This is
> done by storing the flag in an unused bit in the sptes.

Also, maybe we should mention this only works on x86-64. 

> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 44 +++++++++++++++++++++------------
>  arch/x86/kvm/mmu/mmu_internal.h |  1 +
>  arch/x86/kvm/mmu/paging_tmpl.h  |  9 ++++---
>  arch/x86/kvm/mmu/spte.c         |  4 ++-
>  arch/x86/kvm/mmu/spte.h         | 12 ++++++++-
>  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++++-------
>  6 files changed, 62 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e44ab512c3a1..b1607e314497 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c

...

> @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
>  	bool host_writable = !fault || fault->map_writable;
>  	bool prefetch = !fault || fault->prefetch;
>  	bool write_fault = fault && fault->write;
> +	bool is_refcounted = !fault || fault->is_refcounted_page;

Just wondering: what if a non-refcounted page is prefetched? Is that even
possible in practice?

...
>  
> @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>   */
>  static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
>  {
> -	bool host_writable;
> +	bool host_writable, is_refcounted;
>  	gpa_t first_pte_gpa;
>  	u64 *sptep, spte;
>  	struct kvm_memory_slot *slot;
> @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
>  	sptep = &sp->spt[i];
>  	spte = *sptep;
>  	host_writable = spte & shadow_host_writable_mask;
> +	// TODO: is this correct?
> +	is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
>  	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>  	make_spte(vcpu, sp, slot, pte_access, gfn,
>  		  spte_to_pfn(spte), spte, true, false,
> -		  host_writable, &spte);
> +		  host_writable, is_refcounted, &spte);

Could we restrict that a non-refcounted page shall not be used as a shadow page?

[*] https://lore.kernel.org/all/8caf3008-dcf3-985a-631e-e019b277c6f0@amd.com/

B.R.
Yu

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-04  7:50   ` David Stevens
  (?)
@ 2023-07-05 10:25     ` Yu Zhang
  -1 siblings, 0 replies; 165+ messages in thread
From: Yu Zhang @ 2023-07-05 10:25 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> memory into the guest that is backed by un-refcounted struct pages - for
> example, higher order non-compound pages allocated by the amdgpu driver
> via ttm_pool_alloc_page.
> 
> The bulk of this change is tracking the is_refcounted_page flag so that
> non-refcounted pages don't trigger page_count() == 0 warnings. This is
> done by storing the flag in an unused bit in the sptes.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 44 +++++++++++++++++++++------------
>  arch/x86/kvm/mmu/mmu_internal.h |  1 +
>  arch/x86/kvm/mmu/paging_tmpl.h  |  9 ++++---
>  arch/x86/kvm/mmu/spte.c         |  4 ++-
>  arch/x86/kvm/mmu/spte.h         | 12 ++++++++-
>  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++++-------
>  6 files changed, 62 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e44ab512c3a1..b1607e314497 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -553,12 +553,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>  
>  	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
>  		flush = true;
> -		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> +		if (is_refcounted_page_pte(old_spte))
> +			kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
>  	}
>  
>  	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
>  		flush = true;
> -		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> +		if (is_refcounted_page_pte(old_spte))
> +			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
>  	}
>  
>  	return flush;
> @@ -596,14 +598,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
>  	 * before they are reclaimed.  Sanity check that, if the pfn is backed
>  	 * by a refcounted page, the refcount is elevated.
>  	 */
> -	page = kvm_pfn_to_refcounted_page(pfn);
> -	WARN_ON(page && !page_count(page));
> +	if (is_refcounted_page_pte(old_spte)) {
> +		page = kvm_pfn_to_refcounted_page(pfn);
> +		WARN_ON(!page || !page_count(page));
> +	}
>  
> -	if (is_accessed_spte(old_spte))
> -		kvm_set_pfn_accessed(pfn);
> +	if (is_refcounted_page_pte(old_spte)) {
> +		if (is_accessed_spte(old_spte))
> +			kvm_set_page_accessed(pfn_to_page(pfn));
>  
> -	if (is_dirty_spte(old_spte))
> -		kvm_set_pfn_dirty(pfn);
> +		if (is_dirty_spte(old_spte))
> +			kvm_set_page_dirty(pfn_to_page(pfn));
> +	}
>  
>  	return old_spte;
>  }
> @@ -639,8 +645,8 @@ static bool mmu_spte_age(u64 *sptep)
>  		 * Capture the dirty status of the page, so that it doesn't get
>  		 * lost when the SPTE is marked for access tracking.
>  		 */
> -		if (is_writable_pte(spte))
> -			kvm_set_pfn_dirty(spte_to_pfn(spte));
> +		if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
> +			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));
>  
>  		spte = mark_spte_for_access_track(spte);
>  		mmu_spte_update_no_track(sptep, spte);
> @@ -1278,8 +1284,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
>  {
>  	bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
>  					       (unsigned long *)sptep);
> -	if (was_writable && !spte_ad_enabled(*sptep))
> -		kvm_set_pfn_dirty(spte_to_pfn(*sptep));
> +	if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
> +		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));
>  
>  	return was_writable;
>  }
> @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
>  	bool host_writable = !fault || fault->map_writable;
>  	bool prefetch = !fault || fault->prefetch;
>  	bool write_fault = fault && fault->write;
> +	bool is_refcounted = !fault || fault->is_refcounted_page;
>  
>  	pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
>  		 *sptep, write_fault, gfn);
> @@ -2969,7 +2976,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
>  	}
>  
>  	wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
> -			   true, host_writable, &spte);
> +			   true, host_writable, is_refcounted, &spte);
>  
>  	if (*sptep == spte) {
>  		ret = RET_PF_SPURIOUS;
> @@ -4299,8 +4306,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	struct kvm_follow_pfn foll = {
>  		.slot = slot,
>  		.gfn = fault->gfn,
> -		.flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
> +		.flags = fault->write ? FOLL_WRITE : 0,
>  		.allow_write_mapping = true,
> +		.guarded_by_mmu_notifier = true,
>  	};
>  
>  	/*
> @@ -4317,6 +4325,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  			fault->slot = NULL;
>  			fault->pfn = KVM_PFN_NOSLOT;
>  			fault->map_writable = false;
> +			fault->is_refcounted_page = false;
>  			return RET_PF_CONTINUE;
>  		}
>  		/*
> @@ -4366,6 +4375,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  success:
>  	fault->hva = foll.hva;
>  	fault->map_writable = foll.writable;
> +	fault->is_refcounted_page = foll.is_refcounted_page;
>  	return RET_PF_CONTINUE;
>  }
>  
> @@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  
>  out_unlock:
>  	write_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);
> +	if (fault->is_refcounted_page)
> +		kvm_set_page_accessed(pfn_to_page(fault->pfn));
>  	return r;
>  }
>  
> @@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>  
>  out_unlock:
>  	read_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);

Yet kvm_release_pfn() can still be triggered for gfns mapped via
kvm_vcpu_map(). What if the guest uses a non-refcounted page (e.g., as a
vmcs12)? Although I believe this is not going to happen in the real
world...
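
For context, a rough sketch of the path being referred to; the details
(gfn_to_pfn() with FOLL_GET, release at unmap time) are an assumption
based on the existing kvm_main.c, not something this patch changes:

        /* e.g. nested VMX mapping a guest page such as the vmcs12 */
        struct kvm_host_map map;

        /* internally resolves the pfn via the refcounting gfn_to_pfn() path */
        if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
                return -EFAULT;

        /* ... access map.hva ... */

        /* this is where kvm_release_pfn() is still triggered */
        kvm_vcpu_unmap(vcpu, &map, true);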

B.R.
Yu

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
  2023-07-05  9:22       ` David Stevens
  (?)
@ 2023-07-05 10:53         ` Yu Zhang
  -1 siblings, 0 replies; 165+ messages in thread
From: Yu Zhang @ 2023-07-05 10:53 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> >
> > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > >   * The slow path to get the pfn of the specified host virtual address,
> > >   * 1 indicates success, -errno is returned if error is detected.
> > >   */
> > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > -                        bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > >  {
> > > -     unsigned int flags = FOLL_HWPOISON;
> > > +     unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > >       struct page *page;
> > >       int npages;
> > >
> > >       might_sleep();
> > >
> > > -     if (writable)
> > > -             *writable = write_fault;
> > > -
> > > -     if (write_fault)
> > > -             flags |= FOLL_WRITE;
> > > -     if (async)
> > > -             flags |= FOLL_NOWAIT;
> > > -     if (interruptible)
> > > -             flags |= FOLL_INTERRUPTIBLE;
> > > -
> > > -     npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > +     npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > >       if (npages != 1)
> > >               return npages;
> > >
> > > +     foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > +
> > >       /* map read fault as writable if possible */
> > > -     if (unlikely(!write_fault) && writable) {
> > > +     if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> >
> > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
> 
> The two statements are logically equivalent, although I guess using
> !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> verbose.

Well, as the comment says, we want to try to map the read fault as
writable whenever possible. And __gfn_to_pfn_memslot() only sets
FOLL_WRITE for write faults. So I guess using !foll->writable will not
allow this. Did I miss anything?
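
To spell the two cases out, here is the check from hva_to_pfn_slow() as
quoted above, with illustrative comments added:

        foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;

        /* map read fault as writable if possible */
        if (unlikely(!foll->writable) && foll->allow_write_mapping) {
                /*
                 * If allow_write_mapping is false, this branch is never
                 * taken either way.  If it is true, foll->writable was
                 * just set to (foll->flags & FOLL_WRITE), so
                 * !foll->writable and !(foll->flags & FOLL_WRITE) select
                 * exactly the same (read) faults here.
                 */
                /* ... try get_user_page_fast_only(FOLL_WRITE) ... */
        }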

> > > +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > > +                            bool atomic, bool interruptible, bool *async,
> > > +                            bool write_fault, bool *writable, hva_t *hva)
> > > +{
> > > +     kvm_pfn_t pfn;
> > > +     struct kvm_follow_pfn foll = {
> > > +             .slot = slot,
> > > +             .gfn = gfn,
> > > +             .flags = 0,
> > > +             .atomic = atomic,
> > > +             .allow_write_mapping = !!writable,
> > > +     };
> > > +
> > > +     if (write_fault)
> > > +             foll.flags |= FOLL_WRITE;
> > > +     if (async)
> > > +             foll.flags |= FOLL_NOWAIT;
> > > +     if (interruptible)
> > > +             foll.flags |= FOLL_INTERRUPTIBLE;
> > > +
> > > +     pfn = __kvm_follow_pfn(&foll);
> > > +     if (pfn == KVM_PFN_ERR_NEEDS_IO) {
> >
> > Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e.,
> >         if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))?
> > Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault
> > seems unnecessary.
> 
> There are the cases where the fault does not fall within a vma or when
> the target vma's flags don't support the fault's access permissions.
> In those cases, continuing to try to resolve the fault won't cause
> problems per-se, but it's wasteful and a bit confusing. Having
> hva_to_pfn detect whether or not it may be possible to resolve the
> fault asynchronously and return KVM_PFN_ERR_NEEDS_IO if so seems like
> a good idea. It also matches what the existing code does.

Got it. Sounds reasonable. And thanks! :)

B.R.
Yu

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
  2023-07-04  7:50   ` David Stevens
  (?)
@ 2023-07-05 11:56     ` Yu Zhang
  -1 siblings, 0 replies; 165+ messages in thread
From: Yu Zhang @ 2023-07-05 11:56 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Tue, Jul 04, 2023 at 04:50:48PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> callers to resolve a gfn when the associated pfn has a valid struct page
> that isn't being actively refcounted (e.g. tail pages of non-compound
> higher order pages). For a caller to safely omit FOLL_GET, all usages of
> the returned pfn must be guarded by a mmu notifier.
> 
> This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> is set when the returned pfn has an associated struct page with a valid
> refcount. Callers that don't pass FOLL_GET should remember this value
> and use it to avoid places like kvm_is_ad_tracked_page that assume a
> non-zero refcount.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  include/linux/kvm_host.h | 10 ++++++
>  virt/kvm/kvm_main.c      | 67 +++++++++++++++++++++-------------------
>  virt/kvm/pfncache.c      |  2 +-
>  3 files changed, 47 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ef2763c2b12e..a45308c7d2d9 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
>  void kvm_release_page_clean(struct page *page);
>  void kvm_release_page_dirty(struct page *page);
>  
> +void kvm_set_page_accessed(struct page *page);
> +void kvm_set_page_dirty(struct page *page);
> +
>  struct kvm_follow_pfn {
>  	const struct kvm_memory_slot *slot;
>  	gfn_t gfn;
> @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
>  	bool atomic;
>  	/* Allow a read fault to create a writeable mapping. */
>  	bool allow_write_mapping;
> +	/*
> +	 * Usage of the returned pfn will be guared by a mmu notifier. Must
> +	 * be true if FOLL_GET is not set.
> +	 */
> +	bool guarded_by_mmu_notifier;

And how? Is there any place that checks the invalidate seq?
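
For reference, a rough sketch of the pattern a FOLL_GET-less caller is
presumably expected to follow, modeled on the existing x86 fault path;
the helper names (mmu_invalidate_seq, mmu_invalidate_retry_hva,
RET_PF_RETRY) are assumptions taken from current kernels, not from this
series:

        /* snapshot the notifier sequence before resolving the pfn */
        fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
        smp_rmb();

        fault->pfn = __kvm_follow_pfn(&foll);   /* no reference taken */

        write_lock(&vcpu->kvm->mmu_lock);
        if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva)) {
                /* an invalidation raced with us; the pfn may be stale */
                write_unlock(&vcpu->kvm->mmu_lock);
                return RET_PF_RETRY;
        }
        /* ... install the spte while still holding mmu_lock ... */
        write_unlock(&vcpu->kvm->mmu_lock);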

B.R.
Yu

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
  2023-07-04  7:50   ` David Stevens
  (?)
@ 2023-07-05 13:19     ` Zhi Wang
  -1 siblings, 0 replies; 165+ messages in thread
From: Zhi Wang @ 2023-07-05 13:19 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Tue,  4 Jul 2023 16:50:48 +0900
David Stevens <stevensd@chromium.org> wrote:

> From: David Stevens <stevensd@chromium.org>
> 
> Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> callers to resolve a gfn when the associated pfn has a valid struct page
> that isn't being actively refcounted (e.g. tail pages of non-compound
> higher order pages). For a caller to safely omit FOLL_GET, all usages of
> the returned pfn must be guarded by a mmu notifier.
> 
> This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> is set when the returned pfn has an associated struct page with a valid
> refcount. Callers that don't pass FOLL_GET should remember this value
> and use it to avoid places like kvm_is_ad_tracked_page that assume a
> non-zero refcount.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  include/linux/kvm_host.h | 10 ++++++
>  virt/kvm/kvm_main.c      | 67 +++++++++++++++++++++-------------------
>  virt/kvm/pfncache.c      |  2 +-
>  3 files changed, 47 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ef2763c2b12e..a45308c7d2d9 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
>  void kvm_release_page_clean(struct page *page);
>  void kvm_release_page_dirty(struct page *page);
>  
> +void kvm_set_page_accessed(struct page *page);
> +void kvm_set_page_dirty(struct page *page);
> +
>  struct kvm_follow_pfn {
>  	const struct kvm_memory_slot *slot;
>  	gfn_t gfn;
> @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
>  	bool atomic;
>  	/* Allow a read fault to create a writeable mapping. */
>  	bool allow_write_mapping;
> +	/*
> +	 * Usage of the returned pfn will be guared by a mmu notifier. Must
                                              ^guarded
> +	 * be true if FOLL_GET is not set.
> +	 */
> +	bool guarded_by_mmu_notifier;
>
It seems no one sets guarded_by_mmu_notifier in this patch. Is
guarded_by_mmu_notifier always equal to !(foll->flags & FOLL_GET) and set
by the caller of __kvm_follow_pfn()?

If yes, do we have to use FOLL_GET to resolve a GFN associated with a
tail page? It seems gup can tolerate gup_flags without FOLL_GET, but that
is more of a temporary solution. I don't think it is a good idea to play
tricks with a temporary solution; it feels more like we are abusing that
tolerance.

Is a flag like guarded_by_mmu_notifier (perhaps with a better name) enough
to indicate a tail page?
 
>  	/* Outputs of __kvm_follow_pfn */
>  	hva_t hva;
>  	bool writable;
> +	/* True if the returned pfn is for a page with a valid refcount. */
> +	bool is_refcounted_page;
>  };
>  
>  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b13f22861d2f..0f7b41f220b6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2502,6 +2502,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
>  		*pfn = page_to_pfn(page[0]);
>  		foll->writable = foll->allow_write_mapping;
> +		foll->is_refcounted_page = true;
> +		if (!(foll->flags & FOLL_GET))
> +			put_page(page[0]);
>  		return true;
>  	}
>  
> @@ -2525,6 +2528,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  		return npages;
>  
>  	foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> +	foll->is_refcounted_page = true;
>  
>  	/* map read fault as writable if possible */
>  	if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> @@ -2537,6 +2541,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  		}
>  	}
>  	*pfn = page_to_pfn(page);
> +	if (!(foll->flags & FOLL_GET))
> +		put_page(page);
>  	return npages;
>  }
>  
> @@ -2551,16 +2557,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
>  	return true;
>  }
>  
> -static int kvm_try_get_pfn(kvm_pfn_t pfn)
> -{
> -	struct page *page = kvm_pfn_to_refcounted_page(pfn);
> -
> -	if (!page)
> -		return 1;
> -
> -	return get_page_unless_zero(page);
> -}
> -
>  static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
>  			       kvm_pfn_t *p_pfn)
>  {
> @@ -2568,6 +2564,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
>  	pte_t *ptep;
>  	spinlock_t *ptl;
>  	bool write_fault = foll->flags & FOLL_WRITE;
> +	struct page *page;
>  	int r;
>  
>  	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> @@ -2599,24 +2596,27 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
>  	pfn = pte_pfn(*ptep);
>  
>  	/*
> -	 * Get a reference here because callers of *hva_to_pfn* and
> -	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
> -	 * returned pfn.  This is only needed if the VMA has VM_MIXEDMAP
> -	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
> -	 * simply do nothing for reserved pfns.
> -	 *
> -	 * Whoever called remap_pfn_range is also going to call e.g.
> -	 * unmap_mapping_range before the underlying pages are freed,
> -	 * causing a call to our MMU notifier.
> +	 * Now deal with reference counting. If kvm_pfn_to_refcounted_page
> +	 * returns NULL, then there's no refcount to worry about.
>  	 *
> -	 * Certain IO or PFNMAP mappings can be backed with valid
> -	 * struct pages, but be allocated without refcounting e.g.,
> -	 * tail pages of non-compound higher order allocations, which
> -	 * would then underflow the refcount when the caller does the
> -	 * required put_page. Don't allow those pages here.
> +	 * Otherwise, certain IO or PFNMAP mappings can be backed with valid
> +	 * struct pages but be allocated without refcounting e.g., tail pages of
> +	 * non-compound higher order allocations. If FOLL_GET is set and we
> +	 * increment such a refcount, then when that pfn is eventually passed to
> +	 * kvm_release_pfn_clean, its refcount would hit zero and be incorrectly
> +	 * freed. Therefore don't allow those pages here when FOLL_GET is set.
>  	 */ 
> -	if (!kvm_try_get_pfn(pfn))
> +	page = kvm_pfn_to_refcounted_page(pfn);
> +	if (!page)
> +		goto out;
> +
> +	if (get_page_unless_zero(page)) {
> +		foll->is_refcounted_page = true;
> +		if (!(foll->flags & FOLL_GET))
> +			put_page(page);
> +	} else if (foll->flags & FOLL_GET) {
>  		r = -EFAULT;
> +	}
>  
>  out:
>  	pte_unmap_unlock(ptep, ptl);
> @@ -2693,6 +2693,9 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
>  
>  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
>  {
> +	if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll->guarded_by_mmu_notifier))
> +		return KVM_PFN_ERR_FAULT;
> +
>  	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
>  				      foll->flags & FOLL_WRITE);
>  
> @@ -2717,7 +2720,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>  	struct kvm_follow_pfn foll = {
>  		.slot = slot,
>  		.gfn = gfn,
> -		.flags = 0,
> +		.flags = FOLL_GET,
>  		.atomic = atomic,
>  		.allow_write_mapping = !!writable,
>  	};
> @@ -2749,7 +2752,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  	struct kvm_follow_pfn foll = {
>  		.slot = gfn_to_memslot(kvm, gfn),
>  		.gfn = gfn,
> -		.flags = write_fault ? FOLL_WRITE : 0,
> +		.flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
>  		.allow_write_mapping = !!writable,
>  	};
>  	pfn = __kvm_follow_pfn(&foll);
> @@ -2764,7 +2767,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
>  	struct kvm_follow_pfn foll = {
>  		.slot = slot,
>  		.gfn = gfn,
> -		.flags = FOLL_WRITE,
> +		.flags = FOLL_GET | FOLL_WRITE,
>  	};
>  	return __kvm_follow_pfn(&foll);
>  }
> @@ -2775,7 +2778,7 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
>  	struct kvm_follow_pfn foll = {
>  		.slot = slot,
>  		.gfn = gfn,
> -		.flags = FOLL_WRITE,
> +		.flags = FOLL_GET | FOLL_WRITE,
>  		.atomic = true,
>  	};
>  	return __kvm_follow_pfn(&foll);
> @@ -2930,17 +2933,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
>  	return !PageReserved(page);
>  }
>  
> -static void kvm_set_page_dirty(struct page *page)
> +void kvm_set_page_dirty(struct page *page)
>  {
>  	if (kvm_is_ad_tracked_page(page))
>  		SetPageDirty(page);
>  }
> +EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
>  
> -static void kvm_set_page_accessed(struct page *page)
> +void kvm_set_page_accessed(struct page *page)
>  {
>  	if (kvm_is_ad_tracked_page(page))
>  		mark_page_accessed(page);
>  }
> +EXPORT_SYMBOL_GPL(kvm_set_page_accessed);
>  
>  void kvm_release_page_clean(struct page *page)
>  {
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index e3fefa753a51..87caafce3dd0 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -147,7 +147,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  	struct kvm_follow_pfn foll = {
>  		.slot = gpc->memslot,
>  		.gfn = gpa_to_gfn(gpc->gpa),
> -		.flags = FOLL_WRITE,
> +		.flags = FOLL_WRITE | FOLL_GET,
>  		.hva = gpc->uhva,
>  	};
>  


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-05 10:18     ` Yu Zhang
  (?)
@ 2023-07-05 14:17       ` Yu Zhang
  -1 siblings, 0 replies; 165+ messages in thread
From: Yu Zhang @ 2023-07-05 14:17 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

> > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >   */
> >  static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
> >  {
> > -	bool host_writable;
> > +	bool host_writable, is_refcounted;
> >  	gpa_t first_pte_gpa;
> >  	u64 *sptep, spte;
> >  	struct kvm_memory_slot *slot;
> > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
> >  	sptep = &sp->spt[i];
> >  	spte = *sptep;
> >  	host_writable = spte & shadow_host_writable_mask;
> > +	// TODO: is this correct?
> > +	is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
> >  	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> >  	make_spte(vcpu, sp, slot, pte_access, gfn,
> >  		  spte_to_pfn(spte), spte, true, false,
> > -		  host_writable, &spte);
> > +		  host_writable, is_refcounted, &spte);
> 
> Could we restrict that a non-refcounted page shall not be used as shadow page?

Oh, sorry. It's not about the shadow page. It's about the guest page being
mapped as non-refcounted. Silly me...

B.R.
Yu

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
  2023-07-04  7:50   ` David Stevens
  (?)
@ 2023-07-06  1:34     ` Isaku Yamahata
  -1 siblings, 0 replies; 165+ messages in thread
From: Isaku Yamahata @ 2023-07-06  1:34 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm,
	isaku.yamahata

On Tue, Jul 04, 2023 at 04:50:47PM +0900,
David Stevens <stevensd@chromium.org> wrote:

> From: David Stevens <stevensd@chromium.org>
> 
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> __kvm_follow_pfn refactors the old API's arguments into a struct and,
> where possible, combines the boolean arguments into a single flags
> argument.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  include/linux/kvm_host.h |  16 ++++
>  virt/kvm/kvm_main.c      | 171 ++++++++++++++++++++++-----------------
>  virt/kvm/kvm_mm.h        |   3 +-
>  virt/kvm/pfncache.c      |   8 +-
>  4 files changed, 122 insertions(+), 76 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9d3ac7720da9..ef2763c2b12e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -97,6 +97,7 @@
>  #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
>  #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
>  #define KVM_PFN_ERR_SIGPENDING	(KVM_PFN_ERR_MASK + 3)
> +#define KVM_PFN_ERR_NEEDS_IO	(KVM_PFN_ERR_MASK + 4)
>  
>  /*
>   * error pfns indicate that the gfn is in slot but faild to
> @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
>  void kvm_release_page_clean(struct page *page);
>  void kvm_release_page_dirty(struct page *page);
>  
> +struct kvm_follow_pfn {
> +	const struct kvm_memory_slot *slot;
> +	gfn_t gfn;
> +	unsigned int flags;
> +	bool atomic;
> +	/* Allow a read fault to create a writeable mapping. */
> +	bool allow_write_mapping;

Maybe make the input-only arguments const?
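
A rough sketch of that suggestion (illustrative only, not part of the patch):
the pure-input fields could be const-qualified, though flags and
allow_write_mapping are written after initialization (by __gfn_to_pfn_memslot()
and __kvm_follow_pfn() respectively), so those two would need reworking first.

struct kvm_follow_pfn {
	const struct kvm_memory_slot *const slot;
	const gfn_t gfn;
	/* Still built up after initialization in __gfn_to_pfn_memslot(). */
	unsigned int flags;
	const bool atomic;
	/*
	 * Allow a read fault to create a writeable mapping.  Left non-const
	 * here because __kvm_follow_pfn() clears it for readonly memslots.
	 */
	bool allow_write_mapping;

	/* Outputs of __kvm_follow_pfn */
	hva_t hva;
	bool writable;
};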


> +
> +	/* Outputs of __kvm_follow_pfn */
> +	hva_t hva;
> +	bool writable;
> +};
> +
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> +
>  kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
>  kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		      bool *writable);
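
As a concrete illustration of the conversion this refactoring enables (a
sketch lifted from the gfn_to_pfn_memslot() change later in this patch), a
call such as

	pfn = __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
				   NULL, NULL);

becomes

	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_WRITE,
	};

	pfn = __kvm_follow_pfn(&foll);
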
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 371bd783ff2b..b13f22861d2f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>   * true indicates success, otherwise false is returned.  It's also the
>   * only part that runs if we can in atomic context.
>   */
> -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> -			    bool *writable, kvm_pfn_t *pfn)
> +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  {
>  	struct page *page[1];
> +	bool write_fault = foll->flags & FOLL_WRITE;
>  
>  	/*
>  	 * Fast pin a writable pfn only if it is a write fault request
>  	 * or the caller allows to map a writable pfn for a read fault
>  	 * request.
>  	 */
> -	if (!(write_fault || writable))
> +	if (!(write_fault || foll->allow_write_mapping))
>  		return false;
>  
> -	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> +	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
>  		*pfn = page_to_pfn(page[0]);
> -
> -		if (writable)
> -			*writable = true;
> +		foll->writable = foll->allow_write_mapping;
>  		return true;
>  	}
>  
> @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
>   * The slow path to get the pfn of the specified host virtual address,
>   * 1 indicates success, -errno is returned if error is detected.
>   */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> -			   bool interruptible, bool *writable, kvm_pfn_t *pfn)
> +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  {
> -	unsigned int flags = FOLL_HWPOISON;
> +	unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;

Although adding FOLL_GET doesn't affect the behavior of
get_user_pages_unlocked(), I wondered how this affects the next change in the
series. It's better to mention it in the commit message: the
get_user_pages_*() calls made by hva_to_pfn_{fast, slow} imply FOLL_GET,
but __kvm_follow_pfn() itself doesn't imply FOLL_GET.
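
For reference, the follow-up patch in this series ("KVM: Make __kvm_follow_pfn
not imply FOLL_GET") resolves that interaction roughly like this (a sketch
paraphrasing that patch, with the example_ name invented): gup is always given
FOLL_GET, and the reference it takes is dropped again when the kvm_follow_pfn
caller did not request one.

static int example_hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
{
	/* gup itself always takes a reference on the returned page. */
	unsigned int gup_flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
	struct page *page;
	int npages;

	npages = get_user_pages_unlocked(foll->hva, 1, &page, gup_flags);
	if (npages != 1)
		return npages;

	*pfn = page_to_pfn(page);
	foll->is_refcounted_page = true;
	/* ...but the reference is only kept if the caller asked for it. */
	if (!(foll->flags & FOLL_GET))
		put_page(page);
	return npages;
}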


>  	struct page *page;
>  	int npages;
>  
>  	might_sleep();
>  
> -	if (writable)
> -		*writable = write_fault;
> -
> -	if (write_fault)
> -		flags |= FOLL_WRITE;
> -	if (async)
> -		flags |= FOLL_NOWAIT;
> -	if (interruptible)
> -		flags |= FOLL_INTERRUPTIBLE;
> -
> -	npages = get_user_pages_unlocked(addr, 1, &page, flags);
> +	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
>  	if (npages != 1)
>  		return npages;
>  
> +	foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> +
>  	/* map read fault as writable if possible */
> -	if (unlikely(!write_fault) && writable) {
> +	if (unlikely(!foll->writable) && foll->allow_write_mapping) {
>  		struct page *wpage;
>  
> -		if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> -			*writable = true;
> +		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> +			foll->writable = true;
>  			put_page(page);
>  			page = wpage;
>  		}
> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
>  	return get_page_unless_zero(page);
>  }
>  
> -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> -			       unsigned long addr, bool write_fault,
> -			       bool *writable, kvm_pfn_t *p_pfn)
> +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> +			       kvm_pfn_t *p_pfn)
>  {
>  	kvm_pfn_t pfn;
>  	pte_t *ptep;
>  	spinlock_t *ptl;
> +	bool write_fault = foll->flags & FOLL_WRITE;
>  	int r;
>  
> -	r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> +	r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
>  	if (r) {
>  		/*
>  		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
>  		 * not call the fault handler, so do it here.
>  		 */
>  		bool unlocked = false;
> -		r = fixup_user_fault(current->mm, addr,
> +		r = fixup_user_fault(current->mm, foll->hva,
>  				     (write_fault ? FAULT_FLAG_WRITE : 0),
>  				     &unlocked);
>  		if (unlocked)
> @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  		if (r)
>  			return r;
>  
> -		r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> +		r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
>  		if (r)
>  			return r;
>  	}
> @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (writable)
> -		*writable = pte_write(*ptep);
> +	foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
>  	pfn = pte_pfn(*ptep);
>  
>  	/*
> @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   * 2): @write_fault = false && @writable, @writable will tell the caller
>   *     whether the mapping is writable.
>   */
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> -		     bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
>  {
>  	struct vm_area_struct *vma;
>  	kvm_pfn_t pfn;
>  	int npages, r;
>  
>  	/* we can do it either atomically or asynchronously, not both */
> -	BUG_ON(atomic && async);
> +	BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
>  
> -	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> +	if (hva_to_pfn_fast(foll, &pfn))
>  		return pfn;
>  
> -	if (atomic)
> +	if (foll->atomic)
>  		return KVM_PFN_ERR_FAULT;
>  
> -	npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> -				 writable, &pfn);
> +	npages = hva_to_pfn_slow(foll, &pfn);
>  	if (npages == 1)
>  		return pfn;
>  	if (npages == -EINTR)
> @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>  
>  	mmap_read_lock(current->mm);
>  	if (npages == -EHWPOISON ||
> -	      (!async && check_user_page_hwpoison(addr))) {
> +	      (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
>  		pfn = KVM_PFN_ERR_HWPOISON;
>  		goto exit;
>  	}
>  
>  retry:
> -	vma = vma_lookup(current->mm, addr);
> +	vma = vma_lookup(current->mm, foll->hva);
>  
>  	if (vma == NULL)
>  		pfn = KVM_PFN_ERR_FAULT;
>  	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> -		r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> +		r = hva_to_pfn_remapped(vma, foll, &pfn);
>  		if (r == -EAGAIN)
>  			goto retry;
>  		if (r < 0)
>  			pfn = KVM_PFN_ERR_FAULT;
>  	} else {
> -		if (async && vma_is_valid(vma, write_fault))
> -			*async = true;
> -		pfn = KVM_PFN_ERR_FAULT;
> +		if ((foll->flags & FOLL_NOWAIT) &&
> +		    vma_is_valid(vma, foll->flags & FOLL_WRITE))
> +			pfn = KVM_PFN_ERR_NEEDS_IO;
> +		else
> +			pfn = KVM_PFN_ERR_FAULT;
>  	}
>  exit:
>  	mmap_read_unlock(current->mm);
>  	return pfn;
>  }
>  
> -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> -			       bool atomic, bool interruptible, bool *async,
> -			       bool write_fault, bool *writable, hva_t *hva)
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
>  {
> -	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> -
> -	if (hva)
> -		*hva = addr;
> +	foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> +				      foll->flags & FOLL_WRITE);
>  
> -	if (addr == KVM_HVA_ERR_RO_BAD) {
> -		if (writable)
> -			*writable = false;
> +	if (foll->hva == KVM_HVA_ERR_RO_BAD)
>  		return KVM_PFN_ERR_RO_FAULT;
> -	}
>  
> -	if (kvm_is_error_hva(addr)) {
> -		if (writable)
> -			*writable = false;
> +	if (kvm_is_error_hva(foll->hva))
>  		return KVM_PFN_NOSLOT;
> -	}
>  
> -	/* Do not map writable pfn in the readonly memslot. */
> -	if (writable && memslot_is_readonly(slot)) {
> -		*writable = false;
> -		writable = NULL;
> -	}
> +	if (memslot_is_readonly(foll->slot))
> +		foll->allow_write_mapping = false;
> +
> +	return hva_to_pfn(foll);
> +}
> +EXPORT_SYMBOL_GPL(__kvm_follow_pfn);
>  
> -	return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> -			  writable);
> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> +			       bool atomic, bool interruptible, bool *async,
> +			       bool write_fault, bool *writable, hva_t *hva)
> +{
> +	kvm_pfn_t pfn;
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = 0,
> +		.atomic = atomic,
> +		.allow_write_mapping = !!writable,
> +	};
> +
> +	if (write_fault)
> +		foll.flags |= FOLL_WRITE;
> +	if (async)
> +		foll.flags |= FOLL_NOWAIT;
> +	if (interruptible)
> +		foll.flags |= FOLL_INTERRUPTIBLE;
> +
> +	pfn = __kvm_follow_pfn(&foll);
> +	if (pfn == KVM_PFN_ERR_NEEDS_IO) {
> +		*async = true;
> +		pfn = KVM_PFN_ERR_FAULT;
> +	}
> +	if (hva)
> +		*hva = foll.hva;
> +	if (writable)
> +		*writable = foll.writable;
> +	return pfn;
>  }
>  EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>  
>  kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
>  		      bool *writable)
>  {
> -	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
> -				    NULL, write_fault, writable, NULL);
> +	kvm_pfn_t pfn;
> +	struct kvm_follow_pfn foll = {
> +		.slot = gfn_to_memslot(kvm, gfn),
> +		.gfn = gfn,
> +		.flags = write_fault ? FOLL_WRITE : 0,
> +		.allow_write_mapping = !!writable,
> +	};
> +	pfn = __kvm_follow_pfn(&foll);
> +	if (writable)
> +		*writable = foll.writable;
> +	return pfn;
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
>  
>  kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> -	return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
> -				    NULL, NULL);
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = FOLL_WRITE,
> +	};
> +	return __kvm_follow_pfn(&foll);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
>  
>  kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
>  {
> -	return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
> -				    NULL, NULL);
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = gfn,
> +		.flags = FOLL_WRITE,
> +		.atomic = true,
> +	};
> +	return __kvm_follow_pfn(&foll);
>  }
>  EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
>  
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index 180f1a09e6ba..ed896aee5396 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -20,8 +20,7 @@
>  #define KVM_MMU_UNLOCK(kvm)		spin_unlock(&(kvm)->mmu_lock)
>  #endif /* KVM_HAVE_MMU_RWLOCK */
>  
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> -		     bool *async, bool write_fault, bool *writable);
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);
>  
>  #ifdef CONFIG_HAVE_KVM_PFNCACHE
>  void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 2d6aba677830..e3fefa753a51 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  	kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
>  	void *new_khva = NULL;
>  	unsigned long mmu_seq;
> +	struct kvm_follow_pfn foll = {
> +		.slot = gpc->memslot,
> +		.gfn = gpa_to_gfn(gpc->gpa),
> +		.flags = FOLL_WRITE,
> +		.hva = gpc->uhva,
> +	};
>  
>  	lockdep_assert_held(&gpc->refresh_lock);
>  
> @@ -183,7 +189,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  		}
>  
>  		/* We always request a writeable mapping */
> -		new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
> +		new_pfn = hva_to_pfn(&foll);
>  		if (is_error_noslot_pfn(new_pfn))
>  			goto out_error;
>  
> -- 
> 2.41.0.255.g8b1d071c50-goog
> 

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 4/8] KVM: x86/mmu: Migrate to __kvm_follow_pfn
  2023-07-04  7:50   ` David Stevens
@ 2023-07-06  1:54     ` Isaku Yamahata
  -1 siblings, 0 replies; 165+ messages in thread
From: Isaku Yamahata @ 2023-07-06  1:54 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm,
	isaku.yamahata

On Tue, Jul 04, 2023 at 04:50:49PM +0900,
David Stevens <stevensd@chromium.org> wrote:

> From: David Stevens <stevensd@chromium.org>
> 
> Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  arch/x86/kvm/mmu/mmu.c | 35 +++++++++++++++++++++++++----------
>  1 file changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ec169f5c7dce..e44ab512c3a1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4296,7 +4296,12 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_memory_slot *slot = fault->slot;
> -	bool async;
> +	struct kvm_follow_pfn foll = {
> +		.slot = slot,
> +		.gfn = fault->gfn,
> +		.flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
> +		.allow_write_mapping = true,
> +	};
>  
>  	/*
>  	 * Retry the page fault if the gfn hit a memslot that is being deleted
> @@ -4325,12 +4330,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  			return RET_PF_EMULATE;
>  	}
>  
> -	async = false;
> -	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> -					  fault->write, &fault->map_writable,
> -					  &fault->hva);
> -	if (!async)
> -		return RET_PF_CONTINUE; /* *pfn has correct page already */
> +	foll.flags |= FOLL_NOWAIT;
> +	fault->pfn = __kvm_follow_pfn(&foll);
> +
> +	if (!is_error_noslot_pfn(fault->pfn))

Since struct kvm_follow_pfn already carries output fields, we could return the
pfn through it as well. Can we make __kvm_follow_pfn() return an int instead of
kvm_pfn_t? The KVM_PFN_* error values seem widely used, though.
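
Just to illustrate the suggestion, a rough sketch (the kvm_pfn_t pfn field
and the int return type below are hypothetical, not part of this patch):

	struct kvm_follow_pfn {
		const struct kvm_memory_slot *slot;
		gfn_t gfn;
		unsigned int flags;
		bool atomic;
		/* Allow a read fault to create a writeable mapping. */
		bool allow_write_mapping;

		/* Outputs of __kvm_follow_pfn */
		hva_t hva;
		bool writable;
		kvm_pfn_t pfn;		/* hypothetical extra output */
	};

	/* 0 on success, negative error code otherwise */
	int __kvm_follow_pfn(struct kvm_follow_pfn *foll);

Callers would then read foll.pfn on success instead of the return value.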


> +		goto success;
> +
> +	if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
> +		return RET_PF_CONTINUE;
>  
>  	if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
>  		trace_kvm_try_async_get_page(fault->addr, fault->gfn);
> @@ -4348,9 +4355,17 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	 * to wait for IO.  Note, gup always bails if it is unable to quickly
>  	 * get a page and a fatal signal, i.e. SIGKILL, is pending.
>  	 */
> -	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
> -					  fault->write, &fault->map_writable,
> -					  &fault->hva);
> +	foll.flags |= FOLL_INTERRUPTIBLE;
> +	foll.flags &= ~FOLL_NOWAIT;
> +	fault->pfn = __kvm_follow_pfn(&foll);
> +
> +	if (!is_error_noslot_pfn(fault->pfn))
> +		goto success;
> +
> +	return RET_PF_CONTINUE;
> +success:
> +	fault->hva = foll.hva;
> +	fault->map_writable = foll.writable;
>  	return RET_PF_CONTINUE;
>  }
>  
> -- 
> 2.41.0.255.g8b1d071c50-goog
> 

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-04  7:50   ` David Stevens
@ 2023-07-06  2:10     ` Isaku Yamahata
  -1 siblings, 0 replies; 165+ messages in thread
From: Isaku Yamahata @ 2023-07-06  2:10 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm,
	isaku.yamahata

On Tue, Jul 04, 2023 at 04:50:50PM +0900,
David Stevens <stevensd@chromium.org> wrote:

> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index cf2c6426a6fc..46c681dc45e6 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	       const struct kvm_memory_slot *slot,
>  	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
>  	       u64 old_spte, bool prefetch, bool can_unsync,
> -	       bool host_writable, u64 *new_spte)
> +	       bool host_writable, bool is_refcounted, u64 *new_spte)
>  {
>  	int level = sp->role.level;
>  	u64 spte = SPTE_MMU_PRESENT_MASK;
> @@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  
>  	if (level > PG_LEVEL_4K)
>  		spte |= PT_PAGE_SIZE_MASK;
> +	else if (is_refcounted)
> +		spte |= SPTE_MMU_PAGE_REFCOUNTED;

Is REFCOUNTED for 4K pages only? What guarantees that a large page doesn't have
FOLL_GET? Or can we set the bit for large pages as well?


>  
>  	if (shadow_memtype_mask)
>  		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-05 10:18     ` Yu Zhang
@ 2023-07-06  4:52       ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-06  4:52 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > memory into the guest that is backed by un-refcounted struct pages - for
> > example, higher order non-compound pages allocated by the amdgpu driver
> > via ttm_pool_alloc_page.
>
> I guess you mean the tail pages of the higher order non-compound pages?
> And as to the head page, it is said to be set to one coincidentally[*],
> and shall not be considered as refcounted.  IIUC, refcount of this head
> page will be increased and decreased soon in hva_to_pfn_remapped(), so
> this may not be a problem(?). But treating this head page differently,
> as a refcounted one (e.g., to set the A/D flags), is weird.
>
> Or maybe I missed some context, e.g., can the head page be allocated to
> the guest at all?

Yes, this is to allow mapping the tail pages of higher order
non-compound pages - I should have been more precise in my wording.
The head pages can already be mapped into the guest.

Treating the head and tail pages the same way would require changing how
KVM behaves in a situation it supports today (rather than just adding
support for an unsupported situation). Currently, without this series,
KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
guest. When that happens, KVM sets the A/D flags. I'm not sure whether
that's actually valid behavior, nor do I know whether anyone actually
cares about it. But it's what KVM does today, and I would shy away
from modifying that behavior without good reason.

> >
> > The bulk of this change is tracking the is_refcounted_page flag so that
> > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > done by storing the flag in an unused bit in the sptes.
>
> Also, maybe we should mention this only works on x86-64.
>
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          | 44 +++++++++++++++++++++------------
> >  arch/x86/kvm/mmu/mmu_internal.h |  1 +
> >  arch/x86/kvm/mmu/paging_tmpl.h  |  9 ++++---
> >  arch/x86/kvm/mmu/spte.c         |  4 ++-
> >  arch/x86/kvm/mmu/spte.h         | 12 ++++++++-
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++++-------
> >  6 files changed, 62 insertions(+), 30 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e44ab512c3a1..b1607e314497 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
>
> ...
>
> > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> >       bool host_writable = !fault || fault->map_writable;
> >       bool prefetch = !fault || fault->prefetch;
> >       bool write_fault = fault && fault->write;
> > +     bool is_refcounted = !fault || fault->is_refcounted_page;
>
> Just wonder, what if a non-refcounted page is prefetched?  Or is it possible in
> practice?

Prefetching is still done via gfn_to_page_many_atomic, which sets
FOLL_GET. That's fixable, but it's not something this series currently
does.
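
To make that concrete, the prefetch path currently looks roughly like this
(simplified sketch, not the exact mmu.c code):

	struct page *pages[PTE_PREFETCH_NUM];
	int i, npages;

	npages = gfn_to_page_many_atomic(slot, gfn, pages, PTE_PREFETCH_NUM);
	if (npages <= 0)
		return -1;

	for (i = 0; i < npages; i++) {
		/* ... install the spte for page_to_pfn(pages[i]) ... */
		put_page(pages[i]);	/* drop the reference taken above */
	}

so it still depends on the pages being refcounted.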

> ...
> >
> > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >   */
> >  static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
> >  {
> > -     bool host_writable;
> > +     bool host_writable, is_refcounted;
> >       gpa_t first_pte_gpa;
> >       u64 *sptep, spte;
> >       struct kvm_memory_slot *slot;
> > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
> >       sptep = &sp->spt[i];
> >       spte = *sptep;
> >       host_writable = spte & shadow_host_writable_mask;
> > +     // TODO: is this correct?
> > +     is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
> >       slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> >       make_spte(vcpu, sp, slot, pte_access, gfn,
> >                 spte_to_pfn(spte), spte, true, false,
> > -               host_writable, &spte);
> > +               host_writable, is_refcounted, &spte);
>
> Could we restrict that a non-refcounted page shall not be used as shadow page?

I'm not very familiar with the shadow mmu, so my response might not
make sense. But do you mean not allowing non-refcounted pages to be used as
the guest page tables shadowed by a kvm_mmu_page? It would probably be
possible to do that, and I doubt anyone would care about the
restriction. But as far as I can tell, the guest page table is only
accessed via kvm_vcpu_read_guest_atomic, which handles non-refcounted
pages just fine.
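
For reference, that access is a gpa/hva-based copy rather than anything that
takes a page reference. A rough sketch of the read in the sync path (names
per the quoted paging_tmpl.h code above, sketch only):

	pt_element_t gpte;

	if (kvm_vcpu_read_guest_atomic(vcpu, first_pte_gpa + i * sizeof(gpte),
				       &gpte, sizeof(gpte)))
		return 0;

so no struct page refcount is involved there.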

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-06  2:10     ` Isaku Yamahata
@ 2023-07-06  5:18       ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-06  5:18 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Thu, Jul 6, 2023 at 11:10 AM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:50PM +0900,
> David Stevens <stevensd@chromium.org> wrote:
>
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index cf2c6426a6fc..46c681dc45e6 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >              const struct kvm_memory_slot *slot,
> >              unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
> >              u64 old_spte, bool prefetch, bool can_unsync,
> > -            bool host_writable, u64 *new_spte)
> > +            bool host_writable, bool is_refcounted, u64 *new_spte)
> >  {
> >       int level = sp->role.level;
> >       u64 spte = SPTE_MMU_PRESENT_MASK;
> > @@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >
> >       if (level > PG_LEVEL_4K)
> >               spte |= PT_PAGE_SIZE_MASK;
> > +     else if (is_refcounted)
> > +             spte |= SPTE_MMU_PAGE_REFCOUNTED;
>
> Is REFCOUNTED for 4K page only?  What guarantees that large page doesn't have
> FOLL_GET? or can we set the bit for large page?

Oh, you're right, it should apply to >4K pages as well. This was based
on stale thinking from earlier versions of this series.

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
  2023-07-05 10:53         ` Yu Zhang
@ 2023-07-06  5:29           ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-06  5:29 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
>
> On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > >
> > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > >   * The slow path to get the pfn of the specified host virtual address,
> > > >   * 1 indicates success, -errno is returned if error is detected.
> > > >   */
> > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > > -                        bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > >  {
> > > > -     unsigned int flags = FOLL_HWPOISON;
> > > > +     unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > >       struct page *page;
> > > >       int npages;
> > > >
> > > >       might_sleep();
> > > >
> > > > -     if (writable)
> > > > -             *writable = write_fault;
> > > > -
> > > > -     if (write_fault)
> > > > -             flags |= FOLL_WRITE;
> > > > -     if (async)
> > > > -             flags |= FOLL_NOWAIT;
> > > > -     if (interruptible)
> > > > -             flags |= FOLL_INTERRUPTIBLE;
> > > > -
> > > > -     npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > > +     npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > >       if (npages != 1)
> > > >               return npages;
> > > >
> > > > +     foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > > +
> > > >       /* map read fault as writable if possible */
> > > > -     if (unlikely(!write_fault) && writable) {
> > > > +     if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > >
> > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
> >
> > The two statements are logically equivalent, although I guess using
> > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> > verbose.
>
> Well, as the comment says, we wanna try to map the read fault as writable
> whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE
> for write faults. So I guess using !foll->writable will not allow this.
> Did I miss anything?

We just set the foll->writable out parameter to be equal to
((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking
a = foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, the
check simplifies as:

  !(a && b) && b  ->  (!a || !b) && b  ->  (!a && b) || (!b && b)  ->  !a && b
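
For completeness, the full truth table confirms the simplification:

  a b | a && b | !(a && b) && b | !a && b
  0 0 |   0    |       0        |    0
  0 1 |   0    |       1        |    1
  1 0 |   0    |       0        |    0
  1 1 |   1    |       0        |    0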

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
@ 2023-07-06  5:29           ` David Stevens
  0 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-06  5:29 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Marc Zyngier, kvm, Sean Christopherson, linux-kernel, Peter Xu,
	kvmarm, linuxppc-dev, linux-arm-kernel

On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
>
> On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > >
> > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > >   * The slow path to get the pfn of the specified host virtual address,
> > > >   * 1 indicates success, -errno is returned if error is detected.
> > > >   */
> > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > > -                        bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > >  {
> > > > -     unsigned int flags = FOLL_HWPOISON;
> > > > +     unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > >       struct page *page;
> > > >       int npages;
> > > >
> > > >       might_sleep();
> > > >
> > > > -     if (writable)
> > > > -             *writable = write_fault;
> > > > -
> > > > -     if (write_fault)
> > > > -             flags |= FOLL_WRITE;
> > > > -     if (async)
> > > > -             flags |= FOLL_NOWAIT;
> > > > -     if (interruptible)
> > > > -             flags |= FOLL_INTERRUPTIBLE;
> > > > -
> > > > -     npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > > +     npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > >       if (npages != 1)
> > > >               return npages;
> > > >
> > > > +     foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > > +
> > > >       /* map read fault as writable if possible */
> > > > -     if (unlikely(!write_fault) && writable) {
> > > > +     if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > >
> > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
> >
> > The two statements are logically equivalent, although I guess using
> > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> > verbose.
>
> Well, as the comment says, we wanna try to map the read fault as writable
> whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE
> for write faults. So I guess using !foll->writable will not allow this.
> Did I miss anything?

We just set the foll->writable out parameter to be equal to
((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a =
foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have
!(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a &&
b.

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
  2023-07-06  1:34     ` Isaku Yamahata
  (?)
@ 2023-07-06  5:52       ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-06  5:52 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Thu, Jul 6, 2023 at 10:34 AM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:47PM +0900,
> David Stevens <stevensd@chromium.org> wrote:
>
> > From: David Stevens <stevensd@chromium.org>
> >
> > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> > __kvm_follow_pfn refactors the old API's arguments into a struct and,
> > where possible, combines the boolean arguments into a single flags
> > argument.
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > ---
> >  include/linux/kvm_host.h |  16 ++++
> >  virt/kvm/kvm_main.c      | 171 ++++++++++++++++++++++-----------------
> >  virt/kvm/kvm_mm.h        |   3 +-
> >  virt/kvm/pfncache.c      |   8 +-
> >  4 files changed, 122 insertions(+), 76 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 9d3ac7720da9..ef2763c2b12e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -97,6 +97,7 @@
> >  #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> >  #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> >  #define KVM_PFN_ERR_SIGPENDING       (KVM_PFN_ERR_MASK + 3)
> > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
> >
> >  /*
> >   * error pfns indicate that the gfn is in slot but faild to
> > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> >  void kvm_release_page_clean(struct page *page);
> >  void kvm_release_page_dirty(struct page *page);
> >
> > +struct kvm_follow_pfn {
> > +     const struct kvm_memory_slot *slot;
> > +     gfn_t gfn;
> > +     unsigned int flags;
> > +     bool atomic;
> > +     /* Allow a read fault to create a writeable mapping. */
> > +     bool allow_write_mapping;
>
> Maybe, make them const for input arguments?

Unfortunately using const isn't straightforward as long as the kernel
continues to use -Wdeclaration-after-statement. If these fields were
const, then they would need to be specified in the initializer when
declaring the variable, but that's not necessarily always possible.
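
To make the tension concrete, a contrived sketch (kernel-style
pseudo-code, not the real kvm_follow_pfn definition):

struct follow_args {
	const gfn_t gfn;		/* suppose the input fields were const */
	const unsigned int flags;
};

static void example(gfn_t gfn, bool write_fault, bool interruptible)
{
	unsigned int flags = 0;
	struct follow_args foll = { .gfn = gfn, .flags = flags };

	if (write_fault)
		flags |= FOLL_WRITE;
	if (interruptible)
		flags |= FOLL_INTERRUPTIBLE;

	/*
	 * Trivial cases could be folded into the initializer, but once
	 * the flags depend on earlier statements there is no good spot
	 * left: "foll.flags |= ..." won't compile (read-only member),
	 * and moving the declaration down here would trip
	 * -Wdeclaration-after-statement.
	 */
}

In practice callers often need to massage the flags after the struct
is declared, which is why the fields stay non-const.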

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
  2023-07-05 11:56     ` Yu Zhang
  (?)
@ 2023-07-06  6:09       ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-06  6:09 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Jul 5, 2023 at 8:56 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:48PM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> > callers to resolve a gfn when the associated pfn has a valid struct page
> > that isn't being actively refcounted (e.g. tail pages of non-compound
> > higher order pages). For a caller to safely omit FOLL_GET, all usages of
> > the returned pfn must be guarded by a mmu notifier.
> >
> > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> > is set when the returned pfn has an associated struct page with a valid
> > refcount. Callers that don't pass FOLL_GET should remember this value
> > and use it to avoid places like kvm_is_ad_tracked_page that assume a
> > non-zero refcount.
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > ---
> >  include/linux/kvm_host.h | 10 ++++++
> >  virt/kvm/kvm_main.c      | 67 +++++++++++++++++++++-------------------
> >  virt/kvm/pfncache.c      |  2 +-
> >  3 files changed, 47 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index ef2763c2b12e..a45308c7d2d9 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> >  void kvm_release_page_clean(struct page *page);
> >  void kvm_release_page_dirty(struct page *page);
> >
> > +void kvm_set_page_accessed(struct page *page);
> > +void kvm_set_page_dirty(struct page *page);
> > +
> >  struct kvm_follow_pfn {
> >       const struct kvm_memory_slot *slot;
> >       gfn_t gfn;
> > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> >       bool atomic;
> >       /* Allow a read fault to create a writeable mapping. */
> >       bool allow_write_mapping;
> > +     /*
> > +      * Usage of the returned pfn will be guared by a mmu notifier. Must
> > +      * be true if FOLL_GET is not set.
> > +      */
> > +     bool guarded_by_mmu_notifier;
>
> And how? Any place to check the invalidate seq?

kvm_follow_pfn can't meaningfully validate the seq number, since the
mmu notifier locking is handled by the caller. This is more of a
sanity check that the API is being used properly, as proposed here
[1]. I did deviate from the proposal with a bool instead of some type
of integer, since the exact value of mmu_seq wouldn't be useful.

[1] https://lore.kernel.org/all/ZGvUsf7lMkrNDHuE@google.com/#t
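
For reference, the caller-side pattern the flag is asserting about
looks roughly like this (a sketch, not code from the series; helper
names such as mmu_invalidate_seq/mmu_invalidate_retry_hva are the
current in-tree ones, and the locking is simplified):

static kvm_pfn_t example_guarded_faultin(struct kvm *kvm,
					 struct kvm_memory_slot *slot,
					 gfn_t gfn)
{
	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_WRITE,		/* note: no FOLL_GET */
		.guarded_by_mmu_notifier = true,
	};
	unsigned long mmu_seq = kvm->mmu_invalidate_seq;
	kvm_pfn_t pfn;

	/* Pairs with the invalidation side; read before the lookup. */
	smp_rmb();

	pfn = __kvm_follow_pfn(&foll);
	if (is_error_noslot_pfn(pfn))
		return pfn;

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry_hva(kvm, mmu_seq, foll.hva)) {
		/* Raced with an invalidation; let the fault be retried. */
		write_unlock(&kvm->mmu_lock);
		return KVM_PFN_ERR_FAULT;
	}
	/* ... install pfn in the stage-2/TDP page tables ... */
	write_unlock(&kvm->mmu_lock);

	return pfn;
}

The bool just lets __kvm_follow_pfn warn if a caller drops FOLL_GET
without claiming to follow this pattern.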

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
  2023-07-05 13:19     ` Zhi Wang
  (?)
@ 2023-07-06  6:49       ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-06  6:49 UTC (permalink / raw)
  To: Zhi Wang
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang <zhi.wang.linux@gmail.com> wrote:
>
> On Tue,  4 Jul 2023 16:50:48 +0900
> David Stevens <stevensd@chromium.org> wrote:
>
> > From: David Stevens <stevensd@chromium.org>
> >
> > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> > callers to resolve a gfn when the associated pfn has a valid struct page
> > that isn't being actively refcounted (e.g. tail pages of non-compound
> > higher order pages). For a caller to safely omit FOLL_GET, all usages of
> > the returned pfn must be guarded by a mmu notifier.
> >
> > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> > is set when the returned pfn has an associated struct page with a valid
> > refcount. Callers that don't pass FOLL_GET should remember this value
> > and use it to avoid places like kvm_is_ad_tracked_page that assume a
> > non-zero refcount.
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > ---
> >  include/linux/kvm_host.h | 10 ++++++
> >  virt/kvm/kvm_main.c      | 67 +++++++++++++++++++++-------------------
> >  virt/kvm/pfncache.c      |  2 +-
> >  3 files changed, 47 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index ef2763c2b12e..a45308c7d2d9 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> >  void kvm_release_page_clean(struct page *page);
> >  void kvm_release_page_dirty(struct page *page);
> >
> > +void kvm_set_page_accessed(struct page *page);
> > +void kvm_set_page_dirty(struct page *page);
> > +
> >  struct kvm_follow_pfn {
> >       const struct kvm_memory_slot *slot;
> >       gfn_t gfn;
> > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> >       bool atomic;
> >       /* Allow a read fault to create a writeable mapping. */
> >       bool allow_write_mapping;
> > +     /*
> > +      * Usage of the returned pfn will be guared by a mmu notifier. Must
>                                               ^guarded
> > +      * be true if FOLL_GET is not set.
> > +      */
> > +     bool guarded_by_mmu_notifier;
> >
> It seems no one sets the guarded_by_mmu_notifier in this patch. Is
> guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the
> caller of __kvm_follow_pfn()?

Yes, this is the case.

> If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> temporary solution. I don't think it is a good idea to play tricks with
> a temporary solution, more like we are abusing the toleration.

I'm not sure I understand what you're getting at. This series never
calls gup without FOLL_GET.

This series aims to provide kvm_follow_pfn as a unified API on top of
gup+follow_pte. Since one of the major clients of this API uses an mmu
notifier, it makes sense to support returning a pfn without taking a
reference. And we indeed need to do that for certain types of memory.

> Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough to
> indicate a tail page?

What do you mean by indicating a tail page? Do you mean indicating
that the returned pfn refers to a non-refcounted page? That's specified
by is_refcounted_page.
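
To spell out the expected caller-side handling (a sketch only;
kvm_set_page_accessed/kvm_set_page_dirty are the helpers exported
earlier in this patch):

	/*
	 * Only touch the struct page if the lookup said one exists and
	 * is actually refcounted; this is what keeps the
	 * kvm_is_ad_tracked_page()-style refcount assertions away from
	 * non-refcounted tail pages.
	 */
	if (foll.is_refcounted_page) {
		struct page *page = pfn_to_page(pfn);

		if (foll.writable)
			kvm_set_page_dirty(page);
		kvm_set_page_accessed(page);
	}
	/* else: nothing to release; the pfn was never refcounted. */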

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-06  4:52       ` David Stevens
  (?)
@ 2023-07-06  7:19         ` Yu Zhang
  -1 siblings, 0 replies; 165+ messages in thread
From: Yu Zhang @ 2023-07-06  7:19 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Thu, Jul 06, 2023 at 01:52:08PM +0900, David Stevens wrote:
> On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> >
> > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > From: David Stevens <stevensd@chromium.org>
> > >
> > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > memory into the guest that is backed by un-refcounted struct pages - for
> > > example, higher order non-compound pages allocated by the amdgpu driver
> > > via ttm_pool_alloc_page.
> >
> > I guess you mean the tail pages of the higher order non-compound pages?
> > And as to the head page, it is said to be set to one coincidentally[*],
> > and shall not be considered as refcounted.  IIUC, refcount of this head
> > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > this may not be a problem(?). But treating this head page differently,
> > as a refcounted one (e.g., to set the A/D flags), is weird.
> >
> > Or maybe I missed some context, e.g., can the head page be allocated to
> > the guest at all?
> 
> Yes, this is to allow mapping the tail pages of higher order
> non-compound pages - I should have been more precise in my wording.
> The head pages can already be mapped into the guest.
> 
> Treating the head and tail pages consistently would require changing how KVM
> behaves in a situation it supports today (rather than just adding
> support for an unsupported situation). Currently, without this series,
> KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
> guest. When that happens, KVM sets the A/D flags. I'm not sure whether
> that's actually valid behavior, nor do I know whether anyone actually
> cares about it. But it's what KVM does today, and I would shy away
> from modifying that behavior without good reason.

I know the A/D status of the refcounted, VM_PFNMAP|VM_IO backed pages
will be recorded. And I have no idea if this is a necessary requirement
either.

But it feels awkward to see the head and the tail pages of a non-compound
allocation treated inconsistently. After all, the head page just happens
to have a refcount of 1; it is not a real refcounted page.

So I would suggest mentioning such different behavior in the commit
message at least. :)
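
For anyone following along, the allocation pattern under discussion is
roughly this (illustrative only, not the actual ttm_pool code):

	/* An order-2 allocation without __GFP_COMP: not a compound page. */
	struct page *head = alloc_pages(GFP_KERNEL, 2);

	/*
	 * page_count(head)     == 1  <- the "coincidental" refcount above
	 * page_count(head + 1) == 0  <- tail pages this series wants to map
	 * page_count(head + 3) == 0
	 */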

> > >
> > > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > >   */
> > >  static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
> > >  {
> > > -     bool host_writable;
> > > +     bool host_writable, is_refcounted;
> > >       gpa_t first_pte_gpa;
> > >       u64 *sptep, spte;
> > >       struct kvm_memory_slot *slot;
> > > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
> > >       sptep = &sp->spt[i];
> > >       spte = *sptep;
> > >       host_writable = spte & shadow_host_writable_mask;
> > > +     // TODO: is this correct?
> > > +     is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
> > >       slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > >       make_spte(vcpu, sp, slot, pte_access, gfn,
> > >                 spte_to_pfn(spte), spte, true, false,
> > > -               host_writable, &spte);
> > > +               host_writable, is_refcounted, &spte);
> >
> > Could we restrict that a non-refcounted page shall not be used as shadow page?
> 
> I'm not very familiar with the shadow mmu, so my response might not
> make sense. But do you mean not allowing non-refcounted pages as the
> guest page tables shadowed by a kvm_mmu_page? It would probably be
> possible to do that, and I doubt anyone would care about the
> restriction. But as far as I can tell, the guest page table is only
> accessed via kvm_vcpu_read_guest_atomic, which handles non-refcounted
> pages just fine.

Sorry, my brain just got baked... Pls just ignore this question :)

B.R.
Yu

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
  2023-07-06  5:29           ` David Stevens
  (?)
@ 2023-07-06 14:52             ` Yu Zhang
  -1 siblings, 0 replies; 165+ messages in thread
From: Yu Zhang @ 2023-07-06 14:52 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Thu, Jul 06, 2023 at 02:29:24PM +0900, David Stevens wrote:
> On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> >
> > On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> > > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > >
> > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > > >   * The slow path to get the pfn of the specified host virtual address,
> > > > >   * 1 indicates success, -errno is returned if error is detected.
> > > > >   */
> > > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > > > -                        bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > > >  {
> > > > > -     unsigned int flags = FOLL_HWPOISON;
> > > > > +     unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > > >       struct page *page;
> > > > >       int npages;
> > > > >
> > > > >       might_sleep();
> > > > >
> > > > > -     if (writable)
> > > > > -             *writable = write_fault;
> > > > > -
> > > > > -     if (write_fault)
> > > > > -             flags |= FOLL_WRITE;
> > > > > -     if (async)
> > > > > -             flags |= FOLL_NOWAIT;
> > > > > -     if (interruptible)
> > > > > -             flags |= FOLL_INTERRUPTIBLE;
> > > > > -
> > > > > -     npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > > > +     npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > > >       if (npages != 1)
> > > > >               return npages;
> > > > >
> > > > > +     foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > > > +
> > > > >       /* map read fault as writable if possible */
> > > > > -     if (unlikely(!write_fault) && writable) {
> > > > > +     if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > > >
> > > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
> > >
> > > The two statements are logically equivalent, although I guess using
> > > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> > > verbose.
> >
> > Well, as the comment says, we wanna try to map the read fault as writable
> > whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE
> > for write faults. So I guess using !foll->writable will not allow this.
> > Did I miss anything?
> 
> We just set the foll->writable out parameter to be equal to
> ((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a =
> foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have
> !(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a &&
> b.

Ouch, my bad again... I typed "!foll->writable", but missed the "!" in
my head while calculating... Thanks! :)

B.R.
Yu

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-06  4:52       ` David Stevens
  (?)
@ 2023-07-06 15:58         ` Isaku Yamahata
  -1 siblings, 0 replies; 165+ messages in thread
From: Isaku Yamahata @ 2023-07-06 15:58 UTC (permalink / raw)
  To: David Stevens
  Cc: Yu Zhang, Sean Christopherson, Marc Zyngier, Michael Ellerman,
	Peter Xu, linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev,
	kvm, isaku.yamahata

On Thu, Jul 06, 2023 at 01:52:08PM +0900,
David Stevens <stevensd@chromium.org> wrote:

> On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> >
> > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > From: David Stevens <stevensd@chromium.org>
> > >
> > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > memory into the guest that is backed by un-refcounted struct pages - for
> > > example, higher order non-compound pages allocated by the amdgpu driver
> > > via ttm_pool_alloc_page.
> >
> > I guess you mean the tail pages of the higher order non-compound pages?
> > And as to the head page, it is said to be set to one coincidentally[*],
> > and shall not be considered as refcounted.  IIUC, refcount of this head
> > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > this may not be a problem(?). But treating this head page differently,
> > as a refcounted one (e.g., to set the A/D flags), is weird.
> >
> > Or maybe I missed some context, e.g., can the head page be allocated to
> > guest at all?
> 
> Yes, this is to allow mapping the tail pages of higher order
> non-compound pages - I should have been more precise in my wording.
> The head pages can already be mapped into the guest.
> 
> Treating the head and tail pages differently would require changing how KVM
> behaves in a situation it supports today (rather than just adding
> support for an unsupported situation). Currently, without this series,
> KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
> guest. When that happens, KVM sets the A/D flags. I'm not sure whether
> that's actually valid behavior, nor do I know whether anyone actually
> cares about it. But it's what KVM does today, and I would shy away
> from modifying that behavior without good reason.
> 
> > >
> > > The bulk of this change is tracking the is_refcounted_page flag so that
> > > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > > done by storing the flag in an unused bit in the sptes.
> >
> > Also, maybe we should mention this only works on x86-64.
> >
> > >
> > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c          | 44 +++++++++++++++++++++------------
> > >  arch/x86/kvm/mmu/mmu_internal.h |  1 +
> > >  arch/x86/kvm/mmu/paging_tmpl.h  |  9 ++++---
> > >  arch/x86/kvm/mmu/spte.c         |  4 ++-
> > >  arch/x86/kvm/mmu/spte.h         | 12 ++++++++-
> > >  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++++-------
> > >  6 files changed, 62 insertions(+), 30 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index e44ab512c3a1..b1607e314497 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> >
> > ...
> >
> > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > >       bool host_writable = !fault || fault->map_writable;
> > >       bool prefetch = !fault || fault->prefetch;
> > >       bool write_fault = fault && fault->write;
> > > +     bool is_refcounted = !fault || fault->is_refcounted_page;
> >
> > Just wonder, what if a non-refcounted page is prefetched?  Or is it possible in
> > practice?
> 
> Prefetching is still done via gfn_to_page_many_atomic, which sets
> FOLL_GET. That's fixable, but it's not something this series currently
> does.

So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this
hunk.  kvm_set_page_{dirty, accessed} won't be called as expected for prefetched
spte.  If I read the patch correctly, REFCOUNTED bit in SPTE should represent
whether the corresponding page is ref-countable or not, right?

Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte)
is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit
the issue, though.

Thanks,
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-06 15:58         ` Isaku Yamahata
@ 2023-07-07  1:35           ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-07  1:35 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Yu Zhang, Sean Christopherson, Marc Zyngier, Michael Ellerman,
	Peter Xu, linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev,
	kvm

On Fri, Jul 7, 2023 at 12:58 AM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
>
> On Thu, Jul 06, 2023 at 01:52:08PM +0900,
> David Stevens <stevensd@chromium.org> wrote:
>
> > On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > >
> > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > > From: David Stevens <stevensd@chromium.org>
> > > >
> > > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > > memory into the guest that is backed by un-refcounted struct pages - for
> > > > example, higher order non-compound pages allocated by the amdgpu driver
> > > > via ttm_pool_alloc_page.
> > >
> > > I guess you mean the tail pages of the higher order non-compound pages?
> > > And as to the head page, it is said to be set to one coincidentally[*],
> > > and shall not be considered as refcounted.  IIUC, refcount of this head
> > > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > > this may not be a problem(?). But treating this head page differently,
> > > as a refcounted one (e.g., to set the A/D flags), is weird.
> > >
> > > Or maybe I missed some context, e.g., can the head page be allocated to
> > > guest at all?
> >
> > Yes, this is to allow mapping the tail pages of higher order
> > non-compound pages - I should have been more precise in my wording.
> > The head pages can already be mapped into the guest.
> >
> > Treating the head and tail pages differently would require changing how KVM
> > behaves in a situation it supports today (rather than just adding
> > support for an unsupported situation). Currently, without this series,
> > KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
> > guest. When that happens, KVM sets the A/D flags. I'm not sure whether
> > that's actually valid behavior, nor do I know whether anyone actually
> > cares about it. But it's what KVM does today, and I would shy away
> > from modifying that behavior without good reason.
> >
> > > >
> > > > The bulk of this change is tracking the is_refcounted_page flag so that
> > > > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > > > done by storing the flag in an unused bit in the sptes.
> > >
> > > Also, maybe we should mention this only works on x86-64.
> > >
> > > >
> > > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > > ---
> > > >  arch/x86/kvm/mmu/mmu.c          | 44 +++++++++++++++++++++------------
> > > >  arch/x86/kvm/mmu/mmu_internal.h |  1 +
> > > >  arch/x86/kvm/mmu/paging_tmpl.h  |  9 ++++---
> > > >  arch/x86/kvm/mmu/spte.c         |  4 ++-
> > > >  arch/x86/kvm/mmu/spte.h         | 12 ++++++++-
> > > >  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++++-------
> > > >  6 files changed, 62 insertions(+), 30 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index e44ab512c3a1..b1607e314497 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > >
> > > ...
> > >
> > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > >       bool host_writable = !fault || fault->map_writable;
> > > >       bool prefetch = !fault || fault->prefetch;
> > > >       bool write_fault = fault && fault->write;
> > > > +     bool is_refcounted = !fault || fault->is_refcounted_page;
> > >
> > > Just wonder, what if a non-refcounted page is prefetched?  Or is it possible in
> > > practice?
> >
> > Prefetching is still done via gfn_to_page_many_atomic, which sets
> > FOLL_GET. That's fixable, but it's not something this series currently
> > does.
>
> So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this
> hunk.  kvm_set_page_{dirty, accessed} won't be called as expected for prefetched
> spte.  If I read the patch correctly, REFCOUNTED bit in SPTE should represent
> whether the corresponding page is ref-countable or not, right?
>
> Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte)
> is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit
> the issue, though.
>

direct_pte_prefetch_many and prefetch_gpte both pass NULL for the
fault parameter, so is_refcounted will evaluate to true. So the spte's
refcounted bit will get set in that case.

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-07  1:35           ` David Stevens
@ 2023-07-10 16:34             ` Isaku Yamahata
  -1 siblings, 0 replies; 165+ messages in thread
From: Isaku Yamahata @ 2023-07-10 16:34 UTC (permalink / raw)
  To: David Stevens
  Cc: Isaku Yamahata, Yu Zhang, Sean Christopherson, Marc Zyngier,
	Michael Ellerman, Peter Xu, linux-arm-kernel, kvmarm,
	linux-kernel, linuxppc-dev, kvm

On Fri, Jul 07, 2023 at 10:35:02AM +0900,
David Stevens <stevensd@chromium.org> wrote:

> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index e44ab512c3a1..b1607e314497 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > >
> > > > ...
> > > >
> > > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > > >       bool host_writable = !fault || fault->map_writable;
> > > > >       bool prefetch = !fault || fault->prefetch;
> > > > >       bool write_fault = fault && fault->write;
> > > > > +     bool is_refcounted = !fault || fault->is_refcounted_page;
> > > >
> > > > Just wonder, what if a non-refcounted page is prefetched?  Or is it possible in
> > > > practice?
> > >
> > > Prefetching is still done via gfn_to_page_many_atomic, which sets
> > > FOLL_GET. That's fixable, but it's not something this series currently
> > > does.
> >
> > So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this
> > hunk.  kvm_set_page_{dirty, accessed} won't be called as expected for prefetched
> > spte.  If I read the patch correctly, REFCOUNTED bit in SPTE should represent
> > whether the corresponding page is ref-countable or not, right?
> >
> > Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte)
> > is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit
> > the issue, though.
> >
> 
> direct_pte_prefetch_many and prefetch_gpte both pass NULL for the
> fault parameter, so is_refcounted will evaluate to true. So the spte's
> refcounted bit will get set in that case.

Oops, my bad.  My point is "unconditionally".  Is the bit always set for
non-refcountable pages?  Or are non-refcountable pages not prefetched?
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-10 16:34             ` Isaku Yamahata
@ 2023-07-11  2:59               ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-11  2:59 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Yu Zhang, Sean Christopherson, Marc Zyngier, Michael Ellerman,
	Peter Xu, linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev,
	kvm

On Tue, Jul 11, 2023 at 1:34 AM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
>
> On Fri, Jul 07, 2023 at 10:35:02AM +0900,
> David Stevens <stevensd@chromium.org> wrote:
>
> > > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > > index e44ab512c3a1..b1607e314497 100644
> > > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > >
> > > > > ...
> > > > >
> > > > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > > > >       bool host_writable = !fault || fault->map_writable;
> > > > > >       bool prefetch = !fault || fault->prefetch;
> > > > > >       bool write_fault = fault && fault->write;
> > > > > > +     bool is_refcounted = !fault || fault->is_refcounted_page;
> > > > >
> > > > > Just wonder, what if a non-refcounted page is prefetched?  Or is it possible in
> > > > > practice?
> > > >
> > > > Prefetching is still done via gfn_to_page_many_atomic, which sets
> > > > FOLL_GET. That's fixable, but it's not something this series currently
> > > > does.
> > >
> > > So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this
> > > hunk.  kvm_set_page_{dirty, accessed} won't be called as expected for prefetched
> > > spte.  If I read the patch correctly, REFCOUNTED bit in SPTE should represent
> > > whether the corresponding page is ref-countable or not, right?
> > >
> > > Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte)
> > > is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit
> > > the issue, though.
> > >
> >
> > direct_pte_prefetch_many and prefetch_gpte both pass NULL for the
> > fault parameter, so is_refcounted will evaluate to true. So the spte's
> > refcounted bit will get set in that case.
>
> Oops, my bad.  My point is "unconditionally".  Is the bit always set for
> non-refcountable pages?  Or non-refcountable pages are not prefeched?

The bit is never set for non-refcounted pages, and is always set for
refcounted pages. The current series never prefetches non-refcounted
pages, since it continues to use the gfn_to_page_many_atomic API.
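
(A rough illustration of that rule, not the actual patch code: the quoted
mmu_set_spte() hunk computes the bit as sketched below, so the prefetch
paths, which pass a NULL fault, always mark the SPTE as refcounted.)

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one field of struct kvm_page_fault used here. */
struct fault_info {
	bool is_refcounted_page;
};

/* Mirrors "bool is_refcounted = !fault || fault->is_refcounted_page;" */
static bool spte_refcounted_bit(const struct fault_info *fault)
{
	return !fault || fault->is_refcounted_page;
}

int main(void)
{
	struct fault_info non_refcounted = { .is_refcounted_page = false };

	/* Prefetch paths (direct_pte_prefetch_many, prefetch_gpte) pass NULL. */
	return spte_refcounted_bit(NULL) && !spte_refcounted_bit(&non_refcounted) ? 0 : 1;
}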

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
  2023-07-06  6:49       ` David Stevens
@ 2023-07-11 17:33         ` Zhi Wang
  -1 siblings, 0 replies; 165+ messages in thread
From: Zhi Wang @ 2023-07-11 17:33 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Thu, 6 Jul 2023 15:49:39 +0900
David Stevens <stevensd@chromium.org> wrote:

> On Wed, Jul 5, 2023 at 10:19___PM Zhi Wang <zhi.wang.linux@gmail.com> wrote:
> >
> > On Tue,  4 Jul 2023 16:50:48 +0900
> > David Stevens <stevensd@chromium.org> wrote:
> >  
> > > From: David Stevens <stevensd@chromium.org>
> > >
> > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> > > callers to resolve a gfn when the associated pfn has a valid struct page
> > > that isn't being actively refcounted (e.g. tail pages of non-compound
> > > higher order pages). For a caller to safely omit FOLL_GET, all usages of
> > > the returned pfn must be guarded by a mmu notifier.
> > >
> > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> > > is set when the returned pfn has an associated struct page with a valid
> > > refcount. Callers that don't pass FOLL_GET should remember this value
> > > and use it to avoid places like kvm_is_ad_tracked_page that assume a
> > > non-zero refcount.
> > >
> > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > ---
> > >  include/linux/kvm_host.h | 10 ++++++
> > >  virt/kvm/kvm_main.c      | 67 +++++++++++++++++++++-------------------
> > >  virt/kvm/pfncache.c      |  2 +-
> > >  3 files changed, 47 insertions(+), 32 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index ef2763c2b12e..a45308c7d2d9 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > >  void kvm_release_page_clean(struct page *page);
> > >  void kvm_release_page_dirty(struct page *page);
> > >
> > > +void kvm_set_page_accessed(struct page *page);
> > > +void kvm_set_page_dirty(struct page *page);
> > > +
> > >  struct kvm_follow_pfn {
> > >       const struct kvm_memory_slot *slot;
> > >       gfn_t gfn;
> > > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> > >       bool atomic;
> > >       /* Allow a read fault to create a writeable mapping. */
> > >       bool allow_write_mapping;
> > > +     /*
> > > +      * Usage of the returned pfn will be guared by a mmu notifier. Must  
> >                                               ^guarded  
> > > +      * be true if FOLL_GET is not set.
> > > +      */
> > > +     bool guarded_by_mmu_notifier;
> > >  
> > It seems no one sets the guarded_by_mmu_notifier in this patch. Is
> > guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the
> > caller of __kvm_follow_pfn()?  
> 
> Yes, this is the case.
> 
> > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> > temporary solution. I don't think it is a good idea to play tricks with
> > a temporary solution, more like we are abusing the toleration.  
> 
> I'm not sure I understand what you're getting at. This series never
> calls gup without FOLL_GET.
> 
> This series aims to provide kvm_follow_pfn as a unified API on top of
> gup+follow_pte. Since one of the major clients of this API uses an mmu
> notifier, it makes sense to support returning a pfn without taking a
> reference. And we indeed need to do that for certain types of memory.
> 

I don't have a problem with taking a pfn without taking a ref. I am
questioning whether using !FOLL_GET in struct kvm_follow_pfn to indicate
taking a pfn without a ref is a good idea, when there is another flag that
actually expresses that.

I can understand that using FOLL_XXX in kvm_follow_pfn saves some
translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
flags. However, FOLL_XXX is for GUP. Using FOLL_XXX to reflect GUP's
requirements in a code path that is going to call GUP is reasonable.

But using FOLL_XXX for purposes that are not related to the GUP call really
feels off. Those flags can change in the future because of GUP requirements,
and then people would have to figure out what is actually happening with
FOLL_GET here, since it is not actually tied to GUP calls.
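
(A rough sketch of the kind of separation being argued for here, with purely
illustrative names rather than the actual kvm_follow_pfn definition: keep the
GUP flags for GUP, and express the no-reference case with its own field.)

#include <stdbool.h>

/* Illustrative only -- not the interface proposed by the series. */
struct follow_pfn_request {
	unsigned int gup_flags;		/* forwarded to get_user_pages() unchanged */
	bool take_reference;		/* take a reference on the returned page */
	bool guarded_by_mmu_notifier;	/* usage of the pfn is mmu-notifier protected */
};

/* A caller must either take a reference or be guarded by an mmu notifier. */
static bool follow_pfn_request_valid(const struct follow_pfn_request *req)
{
	return req->take_reference || req->guarded_by_mmu_notifier;
}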


> > Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough to
> > indicate a tail page?  
> 
> What do you mean by to indicate a tail page? Do you mean to indicate
> that the returned pfn refers to non-refcounted page? That's specified
> by is_refcounted_page.
>

I figured out the reason why I got confused.

+	 * Otherwise, certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages but be allocated without refcounting e.g., tail pages of
+	 * non-compound higher order allocations. If FOLL_GET is set and we
+	 * increment such a refcount, then when that pfn is eventually passed to
+	 * kvm_release_pfn_clean, its refcount would hit zero and be incorrectly
+	 * freed. Therefore don't allow those pages here when FOLL_GET is set.
 	 */ 

The above statement only explains the wrong behavior, but doesn't explain
the expected behavior. It would be better to explain that, for the
mmu-notifier-guarded case (!FOLL_GET), we put back the reference taken by
GUP. The FOLL_GET handling really confused me a lot.

-	if (!kvm_try_get_pfn(pfn))
+	page = kvm_pfn_to_refcounted_page(pfn);
+	if (!page)
+		goto out;
+
+	if (get_page_unless_zero(page)) {
+		foll->is_refcounted_page = true;
+		if (!(foll->flags & FOLL_GET))
+			put_page(page);
+	} else if (foll->flags & FOLL_GET) {
 		r = -EFAULT;
+	}
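
(Restating the behavior of the hunk above as a stand-alone decision function,
assuming my reading of it is right; this is illustrative, not kernel code.)

#include <stdbool.h>

enum remapped_outcome {
	NO_REFERENCE,		/* no refcounted struct page, or caller didn't want a ref */
	KEEP_REFERENCE,		/* FOLL_GET set and the refcount was successfully elevated */
	DROP_TEMP_REFERENCE,	/* no FOLL_GET: put_page() the temporary reference */
	FAIL_EFAULT,		/* FOLL_GET set but the page's refcount is zero */
};

/*
 * page_found: kvm_pfn_to_refcounted_page() returned a page
 * ref_taken:  get_page_unless_zero() succeeded on that page
 */
static enum remapped_outcome remapped_pfn_outcome(bool page_found, bool ref_taken,
						  bool foll_get)
{
	if (!page_found)
		return NO_REFERENCE;	/* the "goto out" case, no error */
	if (ref_taken)
		return foll_get ? KEEP_REFERENCE : DROP_TEMP_REFERENCE;
	return foll_get ? FAIL_EFAULT : NO_REFERENCE;
}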

> -David


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
  2023-07-05  9:08       ` David Stevens
@ 2023-07-11 17:37         ` Zhi Wang
  -1 siblings, 0 replies; 165+ messages in thread
From: Zhi Wang @ 2023-07-11 17:37 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, 5 Jul 2023 18:08:17 +0900
David Stevens <stevensd@chromium.org> wrote:

> On Wed, Jul 5, 2023 at 5:47___PM Zhi Wang <zhi.wang.linux@gmail.com> wrote:
> >
> > On Tue,  4 Jul 2023 16:50:47 +0900
> > David Stevens <stevensd@chromium.org> wrote:
> >  
> > > From: David Stevens <stevensd@chromium.org>
> > >
> > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> > > __kvm_follow_pfn refactors the old API's arguments into a struct and,
> > > where possible, combines the boolean arguments into a single flags
> > > argument.
> > >
> > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > ---
> > >  include/linux/kvm_host.h |  16 ++++
> > >  virt/kvm/kvm_main.c      | 171 ++++++++++++++++++++++-----------------
> > >  virt/kvm/kvm_mm.h        |   3 +-
> > >  virt/kvm/pfncache.c      |   8 +-
> > >  4 files changed, 122 insertions(+), 76 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 9d3ac7720da9..ef2763c2b12e 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -97,6 +97,7 @@
> > >  #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> > >  #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> > >  #define KVM_PFN_ERR_SIGPENDING       (KVM_PFN_ERR_MASK + 3)
> > > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
> > >
> > >  /*
> > >   * error pfns indicate that the gfn is in slot but faild to
> > > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > >  void kvm_release_page_clean(struct page *page);
> > >  void kvm_release_page_dirty(struct page *page);
> > >
> > > +struct kvm_follow_pfn {
> > > +     const struct kvm_memory_slot *slot;
> > > +     gfn_t gfn;
> > > +     unsigned int flags;
> > > +     bool atomic;
> > > +     /* Allow a read fault to create a writeable mapping. */
> > > +     bool allow_write_mapping;
> > > +
> > > +     /* Outputs of __kvm_follow_pfn */
> > > +     hva_t hva;
> > > +     bool writable;
> > > +};
> > > +
> > > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> > > +
> > >  kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> > >  kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> > >                     bool *writable);
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 371bd783ff2b..b13f22861d2f 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> > >   * true indicates success, otherwise false is returned.  It's also the
> > >   * only part that runs if we can in atomic context.
> > >   */
> > > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > -                         bool *writable, kvm_pfn_t *pfn)
> > > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > >  {
> > >       struct page *page[1];
> > > +     bool write_fault = foll->flags & FOLL_WRITE;
> > >
> > >       /*
> > >        * Fast pin a writable pfn only if it is a write fault request
> > >        * or the caller allows to map a writable pfn for a read fault
> > >        * request.
> > >        */
> > > -     if (!(write_fault || writable))
> > > +     if (!(write_fault || foll->allow_write_mapping))
> > >               return false;
> > >
> > > -     if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> > > +     if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> > >               *pfn = page_to_pfn(page[0]);
> > > -
> > > -             if (writable)
> > > -                     *writable = true;
> > > +             foll->writable = foll->allow_write_mapping;
> > >               return true;
> > >       }
> > >
> > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > >   * The slow path to get the pfn of the specified host virtual address,
> > >   * 1 indicates success, -errno is returned if error is detected.
> > >   */
> > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > -                        bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > >  {
> > > -     unsigned int flags = FOLL_HWPOISON;
> > > +     unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > >       struct page *page;
> > >       int npages;
> > >
> > >       might_sleep();
> > >
> > > -     if (writable)
> > > -             *writable = write_fault;
> > > -
> > > -     if (write_fault)
> > > -             flags |= FOLL_WRITE;
> > > -     if (async)
> > > -             flags |= FOLL_NOWAIT;
> > > -     if (interruptible)
> > > -             flags |= FOLL_INTERRUPTIBLE;
> > > -
> > > -     npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > +     npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > >       if (npages != 1)
> > >               return npages;
> > >
> > > +     foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > +
> > >       /* map read fault as writable if possible */
> > > -     if (unlikely(!write_fault) && writable) {
> > > +     if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > >               struct page *wpage;
> > >
> > > -             if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> > > -                     *writable = true;
> > > +             if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> > > +                     foll->writable = true;
> > >                       put_page(page);
> > >                       page = wpage;
> > >               }
> > > @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> > >       return get_page_unless_zero(page);
> > >  }
> > >
> > > -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > > -                            unsigned long addr, bool write_fault,
> > > -                            bool *writable, kvm_pfn_t *p_pfn)
> > > +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> > > +                            kvm_pfn_t *p_pfn)
> > >  {
> > >       kvm_pfn_t pfn;
> > >       pte_t *ptep;
> > >       spinlock_t *ptl;
> > > +     bool write_fault = foll->flags & FOLL_WRITE;
> > >       int r;
> > >
> > > -     r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> > > +     r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> > >       if (r) {
> > >               /*
> > >                * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> > >                * not call the fault handler, so do it here.
> > >                */
> > >               bool unlocked = false;
> > > -             r = fixup_user_fault(current->mm, addr,
> > > +             r = fixup_user_fault(current->mm, foll->hva,
> > >                                    (write_fault ? FAULT_FLAG_WRITE : 0),
> > >                                    &unlocked);
> > >               if (unlocked)
> > > @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > >               if (r)
> > >                       return r;
> > >
> > > -             r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> > > +             r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> > >               if (r)
> > >                       return r;
> > >       }
> > > @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > >               goto out;
> > >       }
> > >
> > > -     if (writable)
> > > -             *writable = pte_write(*ptep);
> > > +     foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
> > >       pfn = pte_pfn(*ptep);
> > >
> > >       /*
> > > @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > >   * 2): @write_fault = false && @writable, @writable will tell the caller
> > >   *     whether the mapping is writable.
> > >   */
> > > -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> > > -                  bool *async, bool write_fault, bool *writable)
> > > +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
> > >  {
> > >       struct vm_area_struct *vma;
> > >       kvm_pfn_t pfn;
> > >       int npages, r;
> > >
> > >       /* we can do it either atomically or asynchronously, not both */
> > > -     BUG_ON(atomic && async);
> > > +     BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
> > >
> > > -     if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> > > +     if (hva_to_pfn_fast(foll, &pfn))
> > >               return pfn;
> > >
> > > -     if (atomic)
> > > +     if (foll->atomic)
> > >               return KVM_PFN_ERR_FAULT;
> > >
> > > -     npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> > > -                              writable, &pfn);
> > > +     npages = hva_to_pfn_slow(foll, &pfn);
> > >       if (npages == 1)
> > >               return pfn;
> > >       if (npages == -EINTR)
> > > @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> > >
> > >       mmap_read_lock(current->mm);
> > >       if (npages == -EHWPOISON ||
> > > -           (!async && check_user_page_hwpoison(addr))) {
> > > +           (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
> > >               pfn = KVM_PFN_ERR_HWPOISON;
> > >               goto exit;
> > >       }
> > >
> > >  retry:
> > > -     vma = vma_lookup(current->mm, addr);
> > > +     vma = vma_lookup(current->mm, foll->hva);
> > >
> > >       if (vma == NULL)
> > >               pfn = KVM_PFN_ERR_FAULT;
> > >       else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> > > -             r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> > > +             r = hva_to_pfn_remapped(vma, foll, &pfn);
> > >               if (r == -EAGAIN)
> > >                       goto retry;
> > >               if (r < 0)
> > >                       pfn = KVM_PFN_ERR_FAULT;
> > >       } else {
> > > -             if (async && vma_is_valid(vma, write_fault))
> > > -                     *async = true;
> > > -             pfn = KVM_PFN_ERR_FAULT;
> > > +             if ((foll->flags & FOLL_NOWAIT) &&
> > > +                 vma_is_valid(vma, foll->flags & FOLL_WRITE))
> > > +                     pfn = KVM_PFN_ERR_NEEDS_IO;
> > > +             else
> > > +                     pfn = KVM_PFN_ERR_FAULT;
> > >       }
> > >  exit:
> > >       mmap_read_unlock(current->mm);
> > >       return pfn;
> > >  }
> > >
> > > -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > > -                            bool atomic, bool interruptible, bool *async,
> > > -                            bool write_fault, bool *writable, hva_t *hva)
> > > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> > >  {
> > > -     unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> > > -
> > > -     if (hva)
> > > -             *hva = addr;
> > > +     foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> > > +                                   foll->flags & FOLL_WRITE);
> > >
> > > -     if (addr == KVM_HVA_ERR_RO_BAD) {
> > > -             if (writable)
> > > -                     *writable = false;
> > > +     if (foll->hva == KVM_HVA_ERR_RO_BAD)
> > >               return KVM_PFN_ERR_RO_FAULT;
> > > -     }
> > >  
> >
> > Can you explain why updating foll->writable = false (previously *writeable
> > = false) is omitted here?
> >
> > In the caller where the struct kvm_follow_pfn is initialized, e.g.
> > __gfn_to_pfn_memslot()/gfn_to_pfn_prot(), .writable is not initialized.
> > IIUC, they expect __kvm_follow_pfn() to update it and return .writable to
> > the upper caller.
> >
> > As one of the outputs, it would be better to initialize it either in the
> > caller or update it in __kvm_follow_pfn(). Otherwise
> > __gfn_to_pfn_memslot()/gfn_to_pfn_prot() will return random stack data to
> > the caller via bool *writable. It doesn't sound nice.
> 
> Entries omitted from an initializer are initialized to zero, so
> .writable does get initialized in all of the patches in this series
> via designated initializers. Although you're right that explicitly
> setting it to false is a good idea, in case someday someone adds a
> caller that doesn't use an initializer when declaring its
> kvm_follow_pfn.
>

Nice trick and nice to know that. :) Agreed that setting it explicitly improves
readability and protects against future callers that don't use an initializer.
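
As an aside, a minimal illustration of the C rule being relied on here
(the field names follow the series, the values are made up):

	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_GET | FOLL_WRITE,
		/*
		 * Members not named above (.hva, .writable, ...) are
		 * zero-initialized by the designated initializer, so
		 * .writable starts out false even though it isn't written
		 * explicitly. A caller that declares the struct without any
		 * initializer would get indeterminate values instead.
		 */
	};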

> -David


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
  2023-07-11 17:33         ` Zhi Wang
@ 2023-07-11 21:59           ` Sean Christopherson
  -1 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2023-07-11 21:59 UTC (permalink / raw)
  To: Zhi Wang
  Cc: David Stevens, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Tue, Jul 11, 2023, Zhi Wang wrote:
> On Thu, 6 Jul 2023 15:49:39 +0900
> David Stevens <stevensd@chromium.org> wrote:
> 
> > On Wed, Jul 5, 2023 at 10:19___PM Zhi Wang <zhi.wang.linux@gmail.com> wrote:
> > >
> > > On Tue,  4 Jul 2023 16:50:48 +0900
> > > David Stevens <stevensd@chromium.org> wrote:
> > > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> > > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> > > temporary solution. I don't think it is a good idea to play tricks with
> > > a temporary solution, more like we are abusing the toleration.  
> > 
> > I'm not sure I understand what you're getting at. This series never
> > calls gup without FOLL_GET.
> > 
> > This series aims to provide kvm_follow_pfn as a unified API on top of
> > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > notifier, it makes sense to support returning a pfn without taking a
> > reference. And we indeed need to do that for certain types of memory.
> > 
> 
> I am not having prob with taking a pfn without taking a ref. I am
> questioning if using !FOLL_GET in struct kvm_follow_pfn to indicate taking
> a pfn without a ref is a good idea or not, while there is another flag
> actually showing it.
> 
> I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
> flags. However FOLL_XXX is for GUP. Using FOLL_XXX for reflecting the
> requirements of GUP in the code path that going to call GUP is reasonable.
> 
> But using FOLL_XXX with purposes that are not related to GUP call really
> feels off.

I agree, assuming you're talking specifically about the logic in hva_to_pfn_remapped()
that handles non-refcounted pages, i.e. this

	if (get_page_unless_zero(page)) {
		foll->is_refcounted_page = true;
		if (!(foll->flags & FOLL_GET))
			put_page(page);
	} else if (foll->flags & FOLL_GET) {
		r = -EFAULT;
	}

should be

	if (get_page_unless_zero(page)) {
		foll->is_refcounted_page = true;
		if (!(foll->flags & FOLL_GET))
			put_page(page);
	} else if (!foll->guarded_by_mmu_notifier) {
		r = -EFAULT;
	}

because it's not the desire to grab a reference that makes getting non-refcounted
pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.

Though that highlights that checking guarded_by_mmu_notifier should be done for
*all* non-refcounted pfns, not just non-refcounted struct page memory.

As for the other usage of FOLL_GET in this series (using it to conditionally do
put_page()), IMO that's very much related to the GUP call.  Invoking put_page()
is a hack to work around the fact that GUP doesn't provide a way to get the pfn
without grabbing a reference to the page.  In an ideal world, KVM would NOT pass
FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM
wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.

I do think it's worth providing a helper to consolidate and document that hacky
code, e.g. add a kvm_follow_refcounted_pfn() helper.

All in all, I think the below (completely untested) is what we want?

David (and others), I am planning on doing a full review of this series "soon",
but it will likely be a few weeks until that happens.  I jumped in on this
specific thread because this caught my eye and I really don't want to throw out
*all* of the FOLL_GET usage.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b5afd70f239..90d424990e0a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
        return rc == -EHWPOISON;
 }
 
+static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
+                                          struct page *page)
+{
+       kvm_pfn_t pfn = page_to_pfn(page);
+
+       foll->is_refcounted_page = true;
+
+       /*
+        * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
+        * doesn't want to grab a reference, but gup() doesn't support getting
+        * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
+        * changes, drop this and simply don't pass FOLL_GET to gup().
+        */
+       if (!(foll->flags & FOLL_GET))
+               put_page(page);
+
+       return pfn;
+}
+
 /*
  * The fast path to get the writable pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned.  It's also the
@@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
                return false;
 
        if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
-               *pfn = page_to_pfn(page[0]);
                foll->writable = foll->allow_write_mapping;
-               foll->is_refcounted_page = true;
-               if (!(foll->flags & FOLL_GET))
-                       put_page(page[0]);
+
+               *pfn = kvm_follow_refcounted_pfn(foll, page[0]);
                return true;
        }
 
@@ -2528,7 +2545,6 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
                return npages;
 
        foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
-       foll->is_refcounted_page = true;
 
        /* map read fault as writable if possible */
        if (unlikely(!foll->writable) && foll->allow_write_mapping) {
@@ -2540,9 +2556,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
                        page = wpage;
                }
        }
-       *pfn = page_to_pfn(page);
-       if (!(foll->flags & FOLL_GET))
-               put_page(page);
+
+       *pfn = kvm_follow_refcounted_pfn(foll, page);
        return npages;
 }
 
@@ -2610,17 +2625,16 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
        if (!page)
                goto out;
 
-       if (get_page_unless_zero(page)) {
-               foll->is_refcounted_page = true;
-               if (!(foll->flags & FOLL_GET))
-                       put_page(page);
-       } else if (foll->flags & FOLL_GET) {
-               r = -EFAULT;
-       }
-
+       if (get_page_unless_zero(page))
+               WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
 out:
        pte_unmap_unlock(ptep, ptl);
-       *p_pfn = pfn;
+
+       if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
+           !allow_unsafe_mappings)
+               r = -EFAULT;
+       else
+               *p_pfn = pfn;
 
        return r;
 }
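
As a usage sketch on top of the above (hypothetical caller, not taken from
the series): a fault path that is covered by mmu notifier invalidation could
then drop FOLL_GET entirely and remember is_refcounted_page for the later
accessed/dirty updates, e.g.:

	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_WRITE,	/* deliberately no FOLL_GET */
		.guarded_by_mmu_notifier = true,
	};
	kvm_pfn_t pfn = __kvm_follow_pfn(&foll);

	/*
	 * No reference is held here, so the pfn is only valid until the next
	 * mmu notifier invalidation. foll.is_refcounted_page tells the caller
	 * whether kvm_set_page_accessed()/kvm_set_page_dirty() may be used on
	 * the backing page later on. (Error-pfn checking omitted for brevity.)
	 */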


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
@ 2023-07-11 21:59           ` Sean Christopherson
  0 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2023-07-11 21:59 UTC (permalink / raw)
  To: Zhi Wang
  Cc: kvm, Marc Zyngier, linux-kernel, Peter Xu, David Stevens, kvmarm,
	linuxppc-dev, linux-arm-kernel

On Tue, Jul 11, 2023, Zhi Wang wrote:
> On Thu, 6 Jul 2023 15:49:39 +0900
> David Stevens <stevensd@chromium.org> wrote:
> 
> > On Wed, Jul 5, 2023 at 10:19___PM Zhi Wang <zhi.wang.linux@gmail.com> wrote:
> > >
> > > On Tue,  4 Jul 2023 16:50:48 +0900
> > > David Stevens <stevensd@chromium.org> wrote:
> > > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> > > It seems gup can tolerate gup_flags without FOLL_GET, but that is more of a
> > > temporary accommodation. I don't think it is a good idea to play tricks with
> > > a temporary solution; it feels like we are abusing that tolerance.
> > 
> > I'm not sure I understand what you're getting at. This series never
> > calls gup without FOLL_GET.
> > 
> > This series aims to provide kvm_follow_pfn as a unified API on top of
> > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > notifier, it makes sense to support returning a pfn without taking a
> > reference. And we indeed need to do that for certain types of memory.
> > 
> 
> I do not have a problem with taking a pfn without taking a ref. I am
> questioning whether using !FOLL_GET in struct kvm_follow_pfn to indicate
> taking a pfn without a ref is a good idea, when there is another flag that
> actually conveys it.
> 
> I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
> flags. However, FOLL_XXX is for GUP. Using FOLL_XXX to reflect the
> requirements of GUP in a code path that is going to call GUP is reasonable.
> 
> But using FOLL_XXX for purposes that are not related to the GUP call really
> feels off.

I agree, assuming you're talking specifically about the logic in hva_to_pfn_remapped()
that handles non-refcounted pages, i.e. this

	if (get_page_unless_zero(page)) {
		foll->is_refcounted_page = true;
		if (!(foll->flags & FOLL_GET))
			put_page(page);
	} else if (foll->flags & FOLL_GET) {
		r = -EFAULT;
	}

should be

	if (get_page_unless_zero(page)) {
		foll->is_refcounted_page = true;
		if (!(foll->flags & FOLL_GET))
			put_page(page);
	} else if (!foll->guarded_by_mmu_notifier) {
		r = -EFAULT;
	}

because it's not the desire to grab a reference that makes getting non-refcounted
pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.

Though that highlights that checking guarded_by_mmu_notifier should be done for
*all* non-refcounted pfns, not just non-refcounted struct page memory.

As for the other usage of FOLL_GET in this series (using it to conditionally do
put_page()), IMO that's very much related to the GUP call.  Invoking put_page()
is a hack to workaround the fact that GUP doesn't provide a way to get the pfn
without grabbing a reference to the page.  In an ideal world, KVM would NOT pass
FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM
wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.

I do think it's worth providing a helper to consolidate and document that hacky
code, e.g. add a kvm_follow_refcounted_pfn() helper.

All in all, I think the below (completely untested) is what we want?

David (and others), I am planning on doing a full review of this series "soon",
but it will likely be a few weeks until that happens.  I jumped in on this
specific thread because this caught my eye and I really don't want to throw out
*all* of the FOLL_GET usage.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b5afd70f239..90d424990e0a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
        return rc == -EHWPOISON;
 }
 
+static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
+                                          struct page *page)
+{
+       kvm_pfn_t pfn = page_to_pfn(page);
+
+       foll->is_refcounted_page = true;
+
+       /*
+        * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
+        * doesn't want to grab a reference, but gup() doesn't support getting
+        * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
+        * changes, drop this and simply don't pass FOLL_GET to gup().
+        */
+       if (!(foll->flags & FOLL_GET))
+               put_page(page);
+
+       return pfn;
+}
+
 /*
  * The fast path to get the writable pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned.  It's also the
@@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
                return false;
 
        if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
-               *pfn = page_to_pfn(page[0]);
                foll->writable = foll->allow_write_mapping;
-               foll->is_refcounted_page = true;
-               if (!(foll->flags & FOLL_GET))
-                       put_page(page[0]);
+
+               *pfn = kvm_follow_refcounted_pfn(foll, page[0]);
                return true;
        }
 
@@ -2528,7 +2545,6 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
                return npages;
 
        foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
-       foll->is_refcounted_page = true;
 
        /* map read fault as writable if possible */
        if (unlikely(!foll->writable) && foll->allow_write_mapping) {
@@ -2540,9 +2556,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
                        page = wpage;
                }
        }
-       *pfn = page_to_pfn(page);
-       if (!(foll->flags & FOLL_GET))
-               put_page(page);
+
+       *pfn = kvm_follow_refcounted_pfn(foll, page);
        return npages;
 }
 
@@ -2610,17 +2625,16 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
        if (!page)
                goto out;
 
-       if (get_page_unless_zero(page)) {
-               foll->is_refcounted_page = true;
-               if (!(foll->flags & FOLL_GET))
-                       put_page(page);
-       } else if (foll->flags & FOLL_GET) {
-               r = -EFAULT;
-       }
-
+       if (get_page_unless_zero(page))
+               WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
 out:
        pte_unmap_unlock(ptep, ptl);
-       *p_pfn = pfn;
+
+       if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
+           !allow_unsafe_mappings)
+               r = -EFAULT;
+       else
+               *p_pfn = pfn;
 
        return r;
 }


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-04  7:50   ` David Stevens
  (?)
@ 2023-07-19  6:09     ` Yan Zhao
  -1 siblings, 0 replies; 165+ messages in thread
From: Yan Zhao @ 2023-07-19  6:09 UTC (permalink / raw)
  To: David Stevens
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> @@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  
>  out_unlock:
>  	write_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);
> +	if (fault->is_refcounted_page)
> +		kvm_set_page_accessed(pfn_to_page(fault->pfn));
For a refcounted page, since KVM now puts its ref early in kvm_faultin_pfn(),
should this kvm_set_page_accessed() be placed before unlocking mmu_lock?

Otherwise, if the user unmaps a region (which triggers kvm_unmap_gfn_range()
with mmu_lock held for write) and then releases the page, and those two steps
happen after the page_count() check in kvm_set_page_accessed() but before
mark_page_accessed(), the latter may mark as accessed a page that has already
been released or that no longer belongs to the current process.

Is that right?

>  	return r;
>  }
>  
> @@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>  
>  out_unlock:
>  	read_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);
> +	if (fault->is_refcounted_page)
> +		kvm_set_page_accessed(pfn_to_page(fault->pfn));
>  	return r;
>  }
Ditto.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-19  6:09     ` Yan Zhao
  (?)
@ 2023-07-19  7:16       ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-07-19  7:16 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Jul 19, 2023 at 3:35 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > @@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >
> >  out_unlock:
> >       write_unlock(&vcpu->kvm->mmu_lock);
> > -     kvm_release_pfn_clean(fault->pfn);
> > +     if (fault->is_refcounted_page)
> > +             kvm_set_page_accessed(pfn_to_page(fault->pfn));
> For a refcounted page, since KVM now puts its ref early in kvm_faultin_pfn(),
> should this kvm_set_page_accessed() be placed before unlocking mmu_lock?
>
> Otherwise, if the user unmaps a region (which triggers kvm_unmap_gfn_range()
> with mmu_lock held for write) and then releases the page, and those two steps
> happen after the page_count() check in kvm_set_page_accessed() but before
> mark_page_accessed(), the latter may mark as accessed a page that has already
> been released or that no longer belongs to the current process.
>
> Is that right?

Yes, good catch. During some testing last week, I actually found this
bug thanks to the WARN_ON the first patch in this series added to
kvm_is_ad_tracked_page. I'll fix it in the next revision, after Sean
gets a chance to comment on the series.

Thanks,
David
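
For reference, here is a minimal sketch of the reordering Yan suggests, i.e.
marking the page accessed while mmu_lock is still held.  Names are taken from
the quoted diff; this is only a sketch, not the actual next-revision patch:

out_unlock:
	/*
	 * Mark the page accessed before dropping mmu_lock so that a racing
	 * kvm_unmap_gfn_range() cannot free the page between the page_count()
	 * check in kvm_set_page_accessed() and mark_page_accessed().
	 */
	if (fault->is_refcounted_page)
		kvm_set_page_accessed(pfn_to_page(fault->pfn));
	write_unlock(&vcpu->kvm->mmu_lock);
	return r;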

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
  2023-07-06 14:52             ` Yu Zhang
  (?)
@ 2023-08-04 22:03               ` Sean Christopherson
  -1 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2023-08-04 22:03 UTC (permalink / raw)
  To: Yu Zhang
  Cc: David Stevens, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Thu, Jul 06, 2023, Yu Zhang wrote:
> On Thu, Jul 06, 2023 at 02:29:24PM +0900, David Stevens wrote:
> > On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > >
> > > On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> > > > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > > >
> > > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > > > >   * The slow path to get the pfn of the specified host virtual address,
> > > > > >   * 1 indicates success, -errno is returned if error is detected.
> > > > > >   */
> > > > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > > > > -                        bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > > > >  {
> > > > > > -     unsigned int flags = FOLL_HWPOISON;
> > > > > > +     unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > > > >       struct page *page;
> > > > > >       int npages;
> > > > > >
> > > > > >       might_sleep();
> > > > > >
> > > > > > -     if (writable)
> > > > > > -             *writable = write_fault;
> > > > > > -
> > > > > > -     if (write_fault)
> > > > > > -             flags |= FOLL_WRITE;
> > > > > > -     if (async)
> > > > > > -             flags |= FOLL_NOWAIT;
> > > > > > -     if (interruptible)
> > > > > > -             flags |= FOLL_INTERRUPTIBLE;
> > > > > > -
> > > > > > -     npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > > > > +     npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > > > >       if (npages != 1)
> > > > > >               return npages;
> > > > > >
> > > > > > +     foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > > > > +
> > > > > >       /* map read fault as writable if possible */
> > > > > > -     if (unlikely(!write_fault) && writable) {
> > > > > > +     if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > > > >
> > > > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
> > > >
> > > > The two statements are logically equivalent, although I guess using
> > > > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> > > > verbose.
> > >
> > > Well, as the comment says, we wanna try to map the read fault as writable
> > > whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE
> > > for write faults. So I guess using !foll->writable will not allow this.
> > > Did I miss anything?
> > 
> > We just set the foll->writable out parameter to be equal to
> > ((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a =
> > foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have
> > !(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a &&
> > b.
> 
> Ouch, my bad again... I typed "!foll->writable", but missed the "!" in
> my head while calculating... Thanks! :)

The code is funky and confusing though.  Specifically, FOLL_WRITE without
allow_write_mapping is nonsensical, and yields the even more nonsensical output
of a successful FOLL_WRITE with foll->writable==%false.

It "works" because callers only consume foll->writable when foll->allow_write_mapping
is true, but relying on that is ugly and completely unnecessary.   Similarly, the
"allow" terminology is misleading.  FOLL_WRITE *always* allows writable mappings.

This wasn't as much of a problem in the previous code because the lower levels took
the pointer, i.e. avoided the "allow" terminology entirely.

So we should either keep that behavior, i.e. replace "bool allow_write_mapping"
with "bool *writable", or rename allow_write_mapping to something like
opportunistically_map_writable, and then unconditionally set foll->writable
whenever KVM obtains a writable mapping, i.e. regardless of whether the original
fault was a read or a write.

My vote is for the latter.  If opportunistically_map_writable is too verbose,
try_map_writable would be another option.  Hmm, I'll make "try_map_writable" my
official vote.

Ah, and I also vote to use an if-elif instead of unconditionally setting foll->writable.
That makes the relationship between FOLL_WRITE and try_map_writable a bit more
obvious IMO.  E.g.

static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
{
	struct page *page[1];

	/*
	 * Fast pin a writable pfn only if it is a write fault request
	 * or the caller allows to map a writable pfn for a read fault
	 * request.
	 */
	if (!((foll->flags & FOLL_WRITE) || foll->try_map_writable))
		return false;

	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
		*pfn = page_to_pfn(page[0]);
		foll->writable = true;
		return true;
	}

	return false;
}

/*
 * The slow path to get the pfn of the specified host virtual address,
 * 1 indicates success, -errno is returned if error is detected.
 */
static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
{
	unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
	struct page *page;
	int npages;

	might_sleep();

	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
	if (npages != 1)
		return npages;

	if (foll->flags & FOLL_WRITE) {
		foll->writable = true;
	} else if (foll->try_map_writable) {
		struct page *wpage;

		/* map read fault as writable if possible */
		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
			foll->writable = true;
			put_page(page);
			page = wpage;
		}
	}
	*pfn = page_to_pfn(page);
	return npages;
}
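
As a rough illustration of the rename discussed above, struct kvm_follow_pfn
could end up looking something like the sketch below.  The field list is
abbreviated and the exact layout is an assumption, not the code from the
posted series:

struct kvm_follow_pfn {
	/* inputs */
	struct kvm_memory_slot *slot;
	gfn_t gfn;
	unsigned int flags;		/* FOLL_* flags passed down to gup() */
	bool atomic;
	bool try_map_writable;		/* was: allow_write_mapping */

	/* outputs */
	unsigned long hva;
	bool writable;			/* set whenever a writable mapping is obtained */
	bool is_refcounted_page;
};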


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
  2023-07-04  7:50   ` David Stevens
  (?)
@ 2023-08-04 22:13     ` Sean Christopherson
  -1 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2023-08-04 22:13 UTC (permalink / raw)
  To: David Stevens
  Cc: Marc Zyngier, Michael Ellerman, Peter Xu, linux-arm-kernel,
	kvmarm, linux-kernel, linuxppc-dev, kvm

On Tue, Jul 04, 2023, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> __kvm_follow_pfn refactors the old API's arguments into a struct and,
> where possible, combines the boolean arguments into a single flags
> argument.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  include/linux/kvm_host.h |  16 ++++
>  virt/kvm/kvm_main.c      | 171 ++++++++++++++++++++++-----------------
>  virt/kvm/kvm_mm.h        |   3 +-
>  virt/kvm/pfncache.c      |   8 +-
>  4 files changed, 122 insertions(+), 76 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9d3ac7720da9..ef2763c2b12e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -97,6 +97,7 @@
>  #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
>  #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
>  #define KVM_PFN_ERR_SIGPENDING	(KVM_PFN_ERR_MASK + 3)
> +#define KVM_PFN_ERR_NEEDS_IO	(KVM_PFN_ERR_MASK + 4)

Hmm, ideally KVM_PFN_ERR_NEEDS_IO would be introduced in a separate prep patch,
e.g. by changing "bool *async" to "bool no_wait".  At a glance, I can't tell if
that's feasible though, so consider it more of a "wish" than a request.
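
To make that idea concrete, here is a hedged sketch of what such a prep patch
could do to hva_to_pfn(), whose current signature is quoted further down in
this mail.  The no_wait parameter and the example caller are illustrative,
not code from the series:

kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
		     bool no_wait, bool write_fault, bool *writable);

static kvm_pfn_t example_faultin(unsigned long addr, bool write_fault,
				 bool *writable)
{
	/* no_wait=true: fail fast with KVM_PFN_ERR_NEEDS_IO instead of sleeping */
	kvm_pfn_t pfn = hva_to_pfn(addr, false, false, true, write_fault, writable);

	/* I/O is needed; set up an async #PF here, or retry with waiting allowed */
	if (pfn == KVM_PFN_ERR_NEEDS_IO)
		pfn = hva_to_pfn(addr, false, false, false, write_fault, writable);

	return pfn;
}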

> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
>  	return get_page_unless_zero(page);
>  }
>  
> -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> -			       unsigned long addr, bool write_fault,
> -			       bool *writable, kvm_pfn_t *p_pfn)
> +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> +			       kvm_pfn_t *p_pfn)

Please wrap.  KVM still honors the 80 char soft limit unless there's a reason not
to, and in this case it's already wrapping

static int hva_to_pfn_remapped(struct vm_area_struct *vma,
			       struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn)

> @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (writable)
> -		*writable = pte_write(*ptep);
> +	foll->writable = pte_write(*ptep) && foll->allow_write_mapping;

Similar to feedback in my other response, don't condition this on try_map_writable,
i.e. just do:

	foll->writable = pte_write(...);

>  	pfn = pte_pfn(*ptep);
>  
>  	/*
> @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   * 2): @write_fault = false && @writable, @writable will tell the caller
>   *     whether the mapping is writable.
>   */
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> -		     bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
>  {
>  	struct vm_area_struct *vma;
>  	kvm_pfn_t pfn;
>  	int npages, r;
>  
>  	/* we can do it either atomically or asynchronously, not both */
> -	BUG_ON(atomic && async);
> +	BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
>  
> -	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> +	if (hva_to_pfn_fast(foll, &pfn))
>  		return pfn;
>  
> -	if (atomic)
> +	if (foll->atomic)
>  		return KVM_PFN_ERR_FAULT;
>  
> -	npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> -				 writable, &pfn);
> +	npages = hva_to_pfn_slow(foll, &pfn);
>  	if (npages == 1)
>  		return pfn;
>  	if (npages == -EINTR)
> @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>  
>  	mmap_read_lock(current->mm);
>  	if (npages == -EHWPOISON ||
> -	      (!async && check_user_page_hwpoison(addr))) {
> +	      (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {

Opportunistically align the indentation, as an added bonus that makes the line
length a few chars shorter, i.e.

	if (npages == -EHWPOISON ||
	    (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
		pfn = KVM_PFN_ERR_HWPOISON;
		goto exit;
	}

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 4/8] KVM: x86/mmu: Migrate to __kvm_follow_pfn
  2023-07-05  8:07     ` Yu Zhang
  (?)
@ 2023-08-04 22:30       ` Sean Christopherson
  -1 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2023-08-04 22:30 UTC (permalink / raw)
  To: Yu Zhang
  Cc: David Stevens, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Jul 05, 2023, Yu Zhang wrote:
> On Tue, Jul 04, 2023 at 04:50:49PM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> > 
> > Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn.

Please turn up your changelog verbosity from ~2 to ~8.  E.g. explain the transition
from async => FOLL_NOWAIT+KVM_PFN_ERR_NEEDS_IO, there's no reason to force readers
to suss that out on their own.

> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 35 +++++++++++++++++++++++++----------
> >  1 file changed, 25 insertions(+), 10 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index ec169f5c7dce..e44ab512c3a1 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4296,7 +4296,12 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> >  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  {
> >  	struct kvm_memory_slot *slot = fault->slot;
> > -	bool async;
> > +	struct kvm_follow_pfn foll = {
> > +		.slot = slot,
> > +		.gfn = fault->gfn,
> > +		.flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
> > +		.allow_write_mapping = true,
> > +	};
> >  
> >  	/*
> >  	 * Retry the page fault if the gfn hit a memslot that is being deleted
> > @@ -4325,12 +4330,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  			return RET_PF_EMULATE;
> >  	}
> >  
> > -	async = false;
> > -	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> > -					  fault->write, &fault->map_writable,
> > -					  &fault->hva);
> > -	if (!async)
> > -		return RET_PF_CONTINUE; /* *pfn has correct page already */
> > +	foll.flags |= FOLL_NOWAIT;
> > +	fault->pfn = __kvm_follow_pfn(&foll);
> > +
> > +	if (!is_error_noslot_pfn(fault->pfn))
> > +		goto success;
> > +
> > +	if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
> > +		return RET_PF_CONTINUE;
> 
> IIUC, FOLL_NOWAIT is set only when we want an async fault. So
> KVM_PFN_ERR_NEEDS_IO may not be necessary?

But FOLL_NOWAIT is set above.  This logic is essentially saying "bail immediately
if __gfn_to_pfn_memslot() returned a fatal error".

A comment would definitely be helpful though.  How about?

	/*
	 * If __kvm_follow_pfn() failed because I/O is needed to fault in the
	 * page, then either set up an asynchronous #PF to do the I/O, or if
	 * doing an async #PF isn't possible, retry __kvm_follow_pfn() with
	 * I/O allowed. All other failures are fatal, i.e. retrying won't help.
	 */
	if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
		return RET_PF_CONTINUE;

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-06  4:52       ` David Stevens
  (?)
@ 2023-08-04 22:45         ` Sean Christopherson
  -1 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2023-08-04 22:45 UTC (permalink / raw)
  To: David Stevens
  Cc: Yu Zhang, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Thu, Jul 06, 2023, David Stevens wrote:
> On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> >
> > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > From: David Stevens <stevensd@chromium.org>
> > >
> > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > memory into the guest that is backed by un-refcounted struct pages - for
> > > example, higher order non-compound pages allocated by the amdgpu driver
> > > via ttm_pool_alloc_page.
> >
> > I guess you mean the tail pages of the higher order non-compound pages?
> > And as to the head page, it is said to be set to one coincidentally[*],
> > and shall not be considered as refcounted.  IIUC, refcount of this head
> > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > this may not be a problem(?). But treating this head page differently,
> > as a refcounted one (e.g., to set the A/D flags), is weird.
> >
> > Or maybe I missed some context, e.g., can the head page be allocated to
> > guest at all?
> 
> Yes, this is to allow mapping the tail pages of higher order
> non-compound pages - I should have been more precise in my wording.
> The head pages can already be mapped into the guest.

Recording for posterity (or to make an incorrect statement and get corrected),
because I recently had a conversation about the head page not actually being
refcounted.  (I can't remember with whom I had the conversation, but I'm pretty
sure it wasn't an imaginary friend).

Even though whatever allocates the page doesn't explicitly refcount the head page,
__free_pages() will still do the right thing and keep the head page around
until its last reference is put.  And my understanding is that even though it's
a "head" page, it's not a PG_head page, i.e. not a compound page, and so it is
treated as an order-0 page when KVM invokes put_page().

void __free_pages(struct page *page, unsigned int order)
{
	/* get PageHead before we drop reference */
	int head = PageHead(page);

	if (put_page_testzero(page))  <=== will evaluate false if KVM holds a ref
		free_the_page(page, order);
	else if (!head)  <=== will be false for non-compound pages
		while (order-- > 0)
			free_the_page(page + (1 << order), order);
}
EXPORT_SYMBOL(__free_pages);
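
To make the head/tail distinction concrete, here is a hedged illustration of
the refcounts for a non-compound higher order allocation.  The allocation
pattern is assumed purely for the example; it is not the actual amdgpu/ttm
code.

	/* Order-3 allocation without __GFP_COMP, i.e. non-compound. */
	struct page *page = alloc_pages(GFP_KERNEL, 3);

	WARN_ON(PageHead(page));		/* not a compound head */
	WARN_ON(page_count(page) != 1);		/* the "head" page is refcounted */
	WARN_ON(page_count(page + 1) != 0);	/* the tail pages are not */

	get_page(page);		/* e.g. KVM maps the head pfn and takes a ref */
	__free_pages(page, 3);	/* frees the tail pages, the head sticks around... */
	put_page(page);		/* ...until the last reference is put */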

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 0/8] KVM: allow mapping non-refcounted pages
  2023-07-04  7:50 ` David Stevens
  (?)
@ 2023-08-04 22:47   ` Sean Christopherson
  -1 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2023-08-04 22:47 UTC (permalink / raw)
  To: David Stevens
  Cc: Marc Zyngier, Michael Ellerman, Peter Xu, linux-arm-kernel,
	kvmarm, linux-kernel, linuxppc-dev, kvm

On Tue, Jul 04, 2023, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> that is backed by struct pages that aren't currently being refcounted
> (e.g. tail pages of non-compound higher order allocations) into the
> guest.

Apologies for the slow review, I'm done with feedback for this version.

FWIW, it's probably a bit too late to catch 6.6, especially since we need acks from
ARM and PPC, but 6.7 should be very doable unless someone outright objects.

Thanks for being persistent!

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 4/8] KVM: x86/mmu: Migrate to __kvm_follow_pfn
  2023-07-06  1:54     ` Isaku Yamahata
  (?)
@ 2023-08-24  8:03       ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-08-24  8:03 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Marc Zyngier, kvm, Sean Christopherson, linux-kernel, Peter Xu,
	kvmarm, linuxppc-dev, linux-arm-kernel

On Thu, Jul 6, 2023 at 10:54 AM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:49PM +0900,
> David Stevens <stevensd@chromium.org> wrote:
>
> > From: David Stevens <stevensd@chromium.org>
> >
> > Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn.
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 35 +++++++++++++++++++++++++----------
> >  1 file changed, 25 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index ec169f5c7dce..e44ab512c3a1 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4296,7 +4296,12 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> >  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  {
> >       struct kvm_memory_slot *slot = fault->slot;
> > -     bool async;
> > +     struct kvm_follow_pfn foll = {
> > +             .slot = slot,
> > +             .gfn = fault->gfn,
> > +             .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
> > +             .allow_write_mapping = true,
> > +     };
> >
> >       /*
> >        * Retry the page fault if the gfn hit a memslot that is being deleted
> > @@ -4325,12 +4330,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >                       return RET_PF_EMULATE;
> >       }
> >
> > -     async = false;
> > -     fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> > -                                       fault->write, &fault->map_writable,
> > -                                       &fault->hva);
> > -     if (!async)
> > -             return RET_PF_CONTINUE; /* *pfn has correct page already */
> > +     foll.flags |= FOLL_NOWAIT;
> > +     fault->pfn = __kvm_follow_pfn(&foll);
> > +
> > +     if (!is_error_noslot_pfn(fault->pfn))
>
> We have pfn in struct kvm_follow_pfn as output. Can we make __kvm_follow_pfn()
> return int instead of kvm_pfn_t?  KVM_PFN_* seems widely used, though.

Switching __kvm_follow_pfn to return an int isn't difficult, but doing
it cleanly would require reworking the kvm_pfn_t/KVM_PFN_ERR_* API,
which, as you said, is quite widely used. That's a larger scope than
I want to take on in this patch series.
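
For illustration, the alternative shape being discussed would look roughly
like the following; this is hypothetical, not something the series
implements:

	/* Returns 0 with foll->pfn filled in, or -errno on failure. */
	int __kvm_follow_pfn(struct kvm_follow_pfn *foll);

versus the current

	kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);

and every KVM_PFN_ERR_* consumer would then need an errno-based equivalent.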

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-07-05 10:25     ` Yu Zhang
  (?)
@ 2023-08-24  8:03       ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-08-24  8:03 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Sean Christopherson, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Jul 5, 2023 at 7:25 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > memory into the guest that is backed by un-refcounted struct pages - for
> > example, higher order non-compound pages allocated by the amdgpu driver
> > via ttm_pool_alloc_page.
> >
> > The bulk of this change is tracking the is_refcounted_page flag so that
> > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > done by storing the flag in an unused bit in the sptes.
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          | 44 +++++++++++++++++++++------------
> >  arch/x86/kvm/mmu/mmu_internal.h |  1 +
> >  arch/x86/kvm/mmu/paging_tmpl.h  |  9 ++++---
> >  arch/x86/kvm/mmu/spte.c         |  4 ++-
> >  arch/x86/kvm/mmu/spte.h         | 12 ++++++++-
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++++++++-------
> >  6 files changed, 62 insertions(+), 30 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e44ab512c3a1..b1607e314497 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -553,12 +553,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> >
> >       if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> >               flush = true;
> > -             kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > +             if (is_refcounted_page_pte(old_spte))
> > +                     kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
> >       }
> >
> >       if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
> >               flush = true;
> > -             kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> > +             if (is_refcounted_page_pte(old_spte))
> > +                     kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
> >       }
> >
> >       return flush;
> > @@ -596,14 +598,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
> >        * before they are reclaimed.  Sanity check that, if the pfn is backed
> >        * by a refcounted page, the refcount is elevated.
> >        */
> > -     page = kvm_pfn_to_refcounted_page(pfn);
> > -     WARN_ON(page && !page_count(page));
> > +     if (is_refcounted_page_pte(old_spte)) {
> > +             page = kvm_pfn_to_refcounted_page(pfn);
> > +             WARN_ON(!page || !page_count(page));
> > +     }
> >
> > -     if (is_accessed_spte(old_spte))
> > -             kvm_set_pfn_accessed(pfn);
> > +     if (is_refcounted_page_pte(old_spte)) {
> > +             if (is_accessed_spte(old_spte))
> > +                     kvm_set_page_accessed(pfn_to_page(pfn));
> >
> > -     if (is_dirty_spte(old_spte))
> > -             kvm_set_pfn_dirty(pfn);
> > +             if (is_dirty_spte(old_spte))
> > +                     kvm_set_page_dirty(pfn_to_page(pfn));
> > +     }
> >
> >       return old_spte;
> >  }
> > @@ -639,8 +645,8 @@ static bool mmu_spte_age(u64 *sptep)
> >                * Capture the dirty status of the page, so that it doesn't get
> >                * lost when the SPTE is marked for access tracking.
> >                */
> > -             if (is_writable_pte(spte))
> > -                     kvm_set_pfn_dirty(spte_to_pfn(spte));
> > +             if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
> > +                     kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));
> >
> >               spte = mark_spte_for_access_track(spte);
> >               mmu_spte_update_no_track(sptep, spte);
> > @@ -1278,8 +1284,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
> >  {
> >       bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
> >                                              (unsigned long *)sptep);
> > -     if (was_writable && !spte_ad_enabled(*sptep))
> > -             kvm_set_pfn_dirty(spte_to_pfn(*sptep));
> > +     if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
> > +             kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));
> >
> >       return was_writable;
> >  }
> > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> >       bool host_writable = !fault || fault->map_writable;
> >       bool prefetch = !fault || fault->prefetch;
> >       bool write_fault = fault && fault->write;
> > +     bool is_refcounted = !fault || fault->is_refcounted_page;
> >
> >       pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
> >                *sptep, write_fault, gfn);
> > @@ -2969,7 +2976,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> >       }
> >
> >       wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
> > -                        true, host_writable, &spte);
> > +                        true, host_writable, is_refcounted, &spte);
> >
> >       if (*sptep == spte) {
> >               ret = RET_PF_SPURIOUS;
> > @@ -4299,8 +4306,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >       struct kvm_follow_pfn foll = {
> >               .slot = slot,
> >               .gfn = fault->gfn,
> > -             .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
> > +             .flags = fault->write ? FOLL_WRITE : 0,
> >               .allow_write_mapping = true,
> > +             .guarded_by_mmu_notifier = true,
> >       };
> >
> >       /*
> > @@ -4317,6 +4325,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >                       fault->slot = NULL;
> >                       fault->pfn = KVM_PFN_NOSLOT;
> >                       fault->map_writable = false;
> > +                     fault->is_refcounted_page = false;
> >                       return RET_PF_CONTINUE;
> >               }
> >               /*
> > @@ -4366,6 +4375,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  success:
> >       fault->hva = foll.hva;
> >       fault->map_writable = foll.writable;
> > +     fault->is_refcounted_page = foll.is_refcounted_page;
> >       return RET_PF_CONTINUE;
> >  }
> >
> > @@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >
> >  out_unlock:
> >       write_unlock(&vcpu->kvm->mmu_lock);
> > -     kvm_release_pfn_clean(fault->pfn);
> > +     if (fault->is_refcounted_page)
> > +             kvm_set_page_accessed(pfn_to_page(fault->pfn));
> >       return r;
> >  }
> >
> > @@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> >
> >  out_unlock:
> >       read_unlock(&vcpu->kvm->mmu_lock);
> > -     kvm_release_pfn_clean(fault->pfn);
>
> Yet kvm_release_pfn() can still be triggered for the kvm_vcpu_maped gfns.
> What if guest uses a non-referenced page(e.g., as a vmcs12)? Although I
> believe this is not gonna happen in real world...

kvm_vcpu_map still uses gfn_to_pfn, which eventually passes FOLL_GET
to __kvm_follow_pfn. So if a guest tries to use a non-refcounted page
like that, then kvm_vcpu_map will fail and the guest will probably
crash. It won't trigger any bugs in the host, though.
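
For reference, a minimal sketch of that legacy path with the intermediate
layers collapsed.  The wrapper below is illustrative only and the real code
may go through additional helpers, but the important part is that FOLL_GET
stays set:

	kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
	{
		struct kvm_follow_pfn foll = {
			.slot = gfn_to_memslot(kvm, gfn),
			.gfn = gfn,
			.flags = FOLL_GET | FOLL_WRITE,
		};

		return __kvm_follow_pfn(&foll);
	}

With FOLL_GET set, a non-refcounted pfn makes __kvm_follow_pfn() return an
error, so kvm_vcpu_map() fails cleanly instead of creating a mapping that
KVM can't hold a reference to.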

It is unfortunate that the guest will be able to use certain types of
memory for some purposes but not for others. However, while it is
theoretically fixable, it's an unreasonable amount of work for
something that, as you say, nobody really cares about in practice [1].

[1] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-08-24  8:03       ` David Stevens
  (?)
@ 2023-08-24 15:15         ` Sean Christopherson
  -1 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2023-08-24 15:15 UTC (permalink / raw)
  To: David Stevens
  Cc: Yu Zhang, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Thu, Aug 24, 2023, David Stevens wrote:
> On Wed, Jul 5, 2023 at 7:25 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> >
> > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > @@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> > >
> > >  out_unlock:
> > >       read_unlock(&vcpu->kvm->mmu_lock);
> > > -     kvm_release_pfn_clean(fault->pfn);
> >
> > Yet kvm_release_pfn() can still be triggered for the kvm_vcpu_maped gfns.
> > What if guest uses a non-referenced page(e.g., as a vmcs12)? Although I
> > believe this is not gonna happen in real world...
> 
> kvm_vcpu_map still uses gfn_to_pfn, which eventually passes FOLL_GET
> to __kvm_follow_pfn. So if a guest tries to use a non-refcounted page
> like that, then kvm_vcpu_map will fail and the guest will probably
> crash. It won't trigger any bugs in the host, though.
> 
> It is unfortunate that the guest will be able to use certain types of
> memory for some purposes but not for others. However, while it is
> theoretically fixable, it's an unreasonable amount of work for
> something that, as you say, nobody really cares about in practice [1].
> 
> [1] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/

There are use cases that care, which is why I suggested allow_unsafe_kmap.
Specifically, AWS manages their pool of guest memory in userspace and maps it all
via /dev/mem.  Without that module param to let userspace opt-in, this series will
break such setups.  It still arguably is a breaking change since it requires
userspace to opt-in, but allowing such behavior by default is simply not a viable
option, and I don't have much sympathy since so much of this mess has its origins
in commit e45adf665a53 ("KVM: Introduce a new guest mapping API").

The use case that no one cares about (AFAIK) is allowing _untrusted_ userspace
to back guest RAM with arbitrary memory.  In other words, I want KVM to allow
(by default) mapping device memory into the guest for things like vGPUs, without
having to do the massive and invasive overhaul needed to safely allow backing guest
RAM with completely arbitrary memory.
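
For reference, a minimal sketch of that opt-in.  Only the allow_unsafe_kmap
name comes from this discussion; the definition below and the 0444
permissions are illustrative:

	/* Opt-in for mapping memory that KVM cannot refcount via kvm_vcpu_map(). */
	static bool allow_unsafe_kmap;
	module_param(allow_unsafe_kmap, bool, 0444);

with the actual gating done wherever the series ends up rejecting
non-refcounted pfns on the kmap-style paths.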

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-08-24 15:15         ` Sean Christopherson
  (?)
@ 2023-08-25  1:38           ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-08-25  1:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Marc Zyngier, linux-kernel, Peter Xu, Yu Zhang, kvmarm,
	linuxppc-dev, linux-arm-kernel

On Fri, Aug 25, 2023 at 12:15 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Aug 24, 2023, David Stevens wrote:
> > On Wed, Jul 5, 2023 at 7:25 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > >
> > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > > @@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> > > >
> > > >  out_unlock:
> > > >       read_unlock(&vcpu->kvm->mmu_lock);
> > > > -     kvm_release_pfn_clean(fault->pfn);
> > >
> > > Yet kvm_release_pfn() can still be triggered for the kvm_vcpu_maped gfns.
> > > What if guest uses a non-referenced page(e.g., as a vmcs12)? Although I
> > > believe this is not gonna happen in real world...
> >
> > kvm_vcpu_map still uses gfn_to_pfn, which eventually passes FOLL_GET
> > to __kvm_follow_pfn. So if a guest tries to use a non-refcounted page
> > like that, then kvm_vcpu_map will fail and the guest will probably
> > crash. It won't trigger any bugs in the host, though.
> >
> > It is unfortunate that the guest will be able to use certain types of
> > memory for some purposes but not for others. However, while it is
> > theoretically fixable, it's an unreasonable amount of work for
> > something that, as you say, nobody really cares about in practice [1].
> >
> > [1] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/
>
> There are use cases that care, which is why I suggested allow_unsafe_kmap.
> Specifically, AWS manages their pool of guest memory in userspace and maps it all
> via /dev/mem.  Without that module param to let userspace opt-in, this series will
> break such setups.  It still arguably is a breaking change since it requires
> userspace to opt-in, but allowing such behavior by default is simply not a viable
> option, and I don't have much sympathy since so much of this mess has its origins
> in commit e45adf665a53 ("KVM: Introduce a new guest mapping API").
>
> The use cases that no one cares about (AFAIK) is allowing _untrusted_ userspace
> to back guest RAM with arbitrary memory.  In other words, I want KVM to allow
> (by default) mapping device memory into the guest for things like vGPUs, without
> having to do the massive and invasive overhaul needed to safely allow backing guest
> RAM with completely arbitrary memory.

Do you specifically want the allow_unsafe_kmap breaking change? v7 of
this series should have supported everything that is currently
supported by KVM, but you're right that the v8 version of
hva_to_pfn_remapped doesn't support mapping
!kvm_pfn_to_refcounted_page() pages. That could be supported
explicitly with allow_unsafe_kmap as you suggested, or it could be
supported implicitly with this implementation:

@@ -2774,8 +2771,14 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
         * would then underflow the refcount when the caller does the
         * required put_page. Don't allow those pages here.
         */
-       if (!kvm_try_get_pfn(pfn))
-               r = -EFAULT;
+       page = kvm_pfn_to_refcounted_page(pfn);
+       if (page) {
+               if (get_page_unless_zero(page))
+                       WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
+
+               if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier)
+                       r = -EFAULT;
+       }

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  2023-08-25  1:38           ` David Stevens
  (?)
@ 2023-08-31 21:18             ` Sean Christopherson
  -1 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2023-08-31 21:18 UTC (permalink / raw)
  To: David Stevens
  Cc: Yu Zhang, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Fri, Aug 25, 2023, David Stevens wrote:
> On Fri, Aug 25, 2023 at 12:15 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Aug 24, 2023, David Stevens wrote:
> > > On Wed, Jul 5, 2023 at 7:25 PM Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > >
> > > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > > > @@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> > > > >
> > > > >  out_unlock:
> > > > >       read_unlock(&vcpu->kvm->mmu_lock);
> > > > > -     kvm_release_pfn_clean(fault->pfn);
> > > >
> > > > Yet kvm_release_pfn() can still be triggered for the kvm_vcpu_maped gfns.
> > > > What if guest uses a non-referenced page(e.g., as a vmcs12)? Although I
> > > > believe this is not gonna happen in real world...
> > >
> > > kvm_vcpu_map still uses gfn_to_pfn, which eventually passes FOLL_GET
> > > to __kvm_follow_pfn. So if a guest tries to use a non-refcounted page
> > > like that, then kvm_vcpu_map will fail and the guest will probably
> > > crash. It won't trigger any bugs in the host, though.
> > >
> > > It is unfortunate that the guest will be able to use certain types of
> > > memory for some purposes but not for others. However, while it is
> > > theoretically fixable, it's an unreasonable amount of work for
> > > something that, as you say, nobody really cares about in practice [1].
> > >
> > > [1] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/
> >
> > There are use cases that care, which is why I suggested allow_unsafe_kmap.
> > Specifically, AWS manages their pool of guest memory in userspace and maps it all
> > via /dev/mem.  Without that module param to let userspace opt-in, this series will
> > break such setups.  It still arguably is a breaking change since it requires
> > userspace to opt-in, but allowing such behavior by default is simply not a viable
> > option, and I don't have much sympathy since so much of this mess has its origins
> > in commit e45adf665a53 ("KVM: Introduce a new guest mapping API").
> >
> > The use cases that no one cares about (AFAIK) is allowing _untrusted_ userspace
> > to back guest RAM with arbitrary memory.  In other words, I want KVM to allow
> > (by default) mapping device memory into the guest for things like vGPUs, without
> > having to do the massive and invasive overhaul needed to safely allow backing guest
> > RAM with completely arbitrary memory.
> 
> Do you specifically want the allow_unsafe_kmap breaking change? v7 of
> this series should have supported everything that is currently
> supported by KVM, but you're right that the v8 version of
> hva_to_pfn_remapped doesn't support mapping
> !kvm_pfn_to_refcounted_page() pages. That could be supported
> explicitly with allow_unsafe_kmap as you suggested,

I think it needs to be explicit, i.e. needs the admin to opt-in to the unsafe
behavior.
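
Something like the below in kvm_main.c is what I'm thinking of, purely as a
sketch (the exact name and permissions are still up for discussion):

  /* Hypothetical admin opt-in; default off so the unsafe behavior is explicit. */
  static bool allow_unsafe_mappings;
  module_param(allow_unsafe_mappings, bool, 0444);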

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
  2023-07-11 21:59           ` Sean Christopherson
  (?)
@ 2023-09-05  8:26             ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-09-05  8:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Zhi Wang, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Jul 12, 2023 at 7:00 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 11, 2023, Zhi Wang wrote:
> > On Thu, 6 Jul 2023 15:49:39 +0900
> > David Stevens <stevensd@chromium.org> wrote:
> >
> > > On Wed, Jul 5, 2023 at 10:19___PM Zhi Wang <zhi.wang.linux@gmail.com> wrote:
> > > >
> > > > On Tue,  4 Jul 2023 16:50:48 +0900
> > > > David Stevens <stevensd@chromium.org> wrote:
> > > > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> > > > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> > > > temporary solution. I don't think it is a good idea to play tricks with
> > > > a temporary solution, more like we are abusing the toleration.
> > >
> > > I'm not sure I understand what you're getting at. This series never
> > > calls gup without FOLL_GET.
> > >
> > > This series aims to provide kvm_follow_pfn as a unified API on top of
> > > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > > notifier, it makes sense to support returning a pfn without taking a
> > > reference. And we indeed need to do that for certain types of memory.
> > >
> >
> > I am not having prob with taking a pfn without taking a ref. I am
> > questioning if using !FOLL_GET in struct kvm_follow_pfn to indicate taking
> > a pfn without a ref is a good idea or not, while there is another flag
> > actually showing it.
> >
> > I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> > translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
> > flags. However FOLL_XXX is for GUP. Using FOLL_XXX for reflecting the
> > requirements of GUP in the code path that going to call GUP is reasonable.
> >
> > But using FOLL_XXX with purposes that are not related to GUP call really
> > feels off.
>
> I agree, assuming you're talking specifically about the logic in hva_to_pfn_remapped()
> that handles non-refcounted pages, i.e. this
>
>         if (get_page_unless_zero(page)) {
>                 foll->is_refcounted_page = true;
>                 if (!(foll->flags & FOLL_GET))
>                         put_page(page);
>         } else if (foll->flags & FOLL_GET) {
>                 r = -EFAULT;
>         }
>
> should be
>
>         if (get_page_unless_zero(page)) {
>                 foll->is_refcounted_page = true;
>                 if (!(foll->flags & FOLL_GET))
>                         put_page(page);
>         else if (!foll->guarded_by_mmu_notifier)
>                 r = -EFAULT;
>
> because it's not the desire to grab a reference that makes getting non-refcounted
> pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.
>
> Though that highlights that checking guarded_by_mmu_notifier should be done for
> *all* non-refcounted pfns, not just non-refcounted struct page memory.

I think things are getting confused here because there are multiple
things which "safe" refers to. There are three different definitions
that I think are relevant here:

1) "safe" in the sense that KVM doesn't corrupt page reference counts
2) "safe" in the sense that KVM doesn't access pfns after they have been freed
3) "safe" in the sense that KVM doesn't use stale hva -> pfn translations

For property 1, FOLL_GET is important. If the caller passes FOLL_GET,
then they expect to be able to pass the returned pfn to
kvm_release_pfn. This means that when FOLL_GET is set, if
kvm_pfn_to_refcounted_page returns a page, then hva_to_pfn_remapped
must take a reference count to avoid eventually corrupting the page
ref count. I guess replacing the FOLL_GET check with
!guarded_by_mmu_notifier is logically equivalent because
__kvm_follow_pfn requires that at least one of guarded_by_mmu_notifier
and FOLL_GET is set. But since we're concerned about a property of the
refcount, I think that checking FOLL_GET is clearer.

For property 2, FOLL_GET is also important. If guarded_by_mmu_notifier
is set, then we're all good here. If guarded_by_mmu_notifier is not
set, then the check in __kvm_follow_pfn guarantees that FOLL_GET is
set. For struct page memory, we're safe because KVM will hold a
reference as long as it's still using the page. For non struct page
memory, we're not safe - this is where the breaking change of
allow_unsafe_mappings would go. Note that for non-refcounted struct
page, we can't use the allow_unsafe_mappings escape hatch. Since
FOLL_GET was requested, if we returned such a page, then the caller
would eventually corrupt the page refcount via kvm_release_pfn.

Property 3 would be nice, but we've already concluded that guarding
all translations with mmu notifiers is infeasible. So maintaining
property 2 is the best we can hope for.
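
To make the constraint I keep referring to concrete, the requirement that
__kvm_follow_pfn enforces on its callers is roughly the following (hypothetical
helper, written out only for illustration):

  /*
   * A caller must either ask for a reference (FOLL_GET) or be plugged into
   * the MMU notifiers; otherwise nothing bounds the lifetime of the pfn.
   */
  static bool kvm_follow_pfn_args_valid(struct kvm_follow_pfn *foll)
  {
          return (foll->flags & FOLL_GET) || foll->guarded_by_mmu_notifier;
  }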

> As for the other usage of FOLL_GET in this series (using it to conditionally do
> put_page()), IMO that's very much related to the GUP call.  Invoking put_page()
> is a hack to workaround the fact that GUP doesn't provide a way to get the pfn
> without grabbing a reference to the page.  In an ideal world, KVM would NOT pass
> FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM
> wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.
>
> I do think it's worth providing a helper to consolidate and document that hacky
> code, e.g. add a kvm_follow_refcounted_pfn() helper.
>
> All in all, I think the below (completely untested) is what we want?
>
> David (and others), I am planning on doing a full review of this series "soon",
> but it will likely be a few weeks until that happens.  I jumped in on this
> specific thread because this caught my eye and I really don't want to throw out
> *all* of the FOLL_GET usage.
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5b5afd70f239..90d424990e0a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>         return rc == -EHWPOISON;
>  }
>
> +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> +                                          struct page *page)
> +{
> +       kvm_pfn_t pfn = page_to_pfn(page);
> +
> +       foll->is_refcounted_page = true;
> +
> +       /*
> +        * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
> +        * doesn't want to grab a reference, but gup() doesn't support getting
> +        * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
> +        * changes, drop this and simply don't pass FOLL_GET to gup().
> +        */
> +       if (!(foll->flags & FOLL_GET))
> +               put_page(page);
> +
> +       return pfn;
> +}
> +
>  /*
>   * The fast path to get the writable pfn which will be stored in @pfn,
>   * true indicates success, otherwise false is returned.  It's also the
> @@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>                 return false;
>
>         if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> -               *pfn = page_to_pfn(page[0]);
>                 foll->writable = foll->allow_write_mapping;
> -               foll->is_refcounted_page = true;
> -               if (!(foll->flags & FOLL_GET))
> -                       put_page(page[0]);
> +
> +               *pfn = kvm_follow_refcounted_pfn(foll, page[0]);
>                 return true;
>         }
>
> @@ -2528,7 +2545,6 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>                 return npages;
>
>         foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> -       foll->is_refcounted_page = true;
>
>         /* map read fault as writable if possible */
>         if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> @@ -2540,9 +2556,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>                         page = wpage;
>                 }
>         }
> -       *pfn = page_to_pfn(page);
> -       if (!(foll->flags & FOLL_GET))
> -               put_page(page);
> +
> +       *pfn = kvm_follow_refcounted_pfn(foll, page);
>         return npages;
>  }
>
> @@ -2610,17 +2625,16 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
>         if (!page)
>                 goto out;
>
> -       if (get_page_unless_zero(page)) {
> -               foll->is_refcounted_page = true;
> -               if (!(foll->flags & FOLL_GET))
> -                       put_page(page);
> -       } else if (foll->flags & FOLL_GET) {
> -               r = -EFAULT;
> -       }
> -
> +       if (get_page_unless_zero(page))
> +               WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
>  out:
>         pte_unmap_unlock(ptep, ptl);
> -       *p_pfn = pfn;
> +
> +       if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
> +           !allow_unsafe_mappings)
> +               r = -EFAULT;
> +       else
> +               *p_pfn = pfn;
>
>         return r;
>  }
>

As I pointed out above, this suggestion is broken because a FOLL_GET
&& !guarded_by_mmu_notifier request (e.g. kvm_vcpu_map) for a
non-refcounted page will result in the refcount eventually being
corrupted.

What do you think of this implementation? If it makes sense, I can
send out an updated patch series.

/*
 * If FOLL_GET is set, then the caller wants us to take a reference to
 * keep the pfn alive. If FOLL_GET isn't set, then __kvm_follow_pfn
 * guarantees that guarded_by_mmu_notifier is set, so there aren't any
 * use-after-free concerns.
 */
page = kvm_pfn_to_refcounted_page(pfn);
if (page) {
        if (get_page_unless_zero(page)) {
                WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
        } else if (foll->flags & FOLL_GET) {
                /*
                 * Certain IO or PFNMAP mappings can be backed with
                 * valid struct pages but be allocated without
                 * refcounting e.g., tail pages of non-compound higher
                 * order allocations. The caller asked for a ref, but
                 * we can't take one, since releasing such a ref would
                 * free the page.
                 */
                r = -EFAULT;
        }
} else if (foll->flags & FOLL_GET) {
        /*
         * When there's no struct page to refcount and no MMU notifier,
         * then KVM can't be guaranteed to avoid use-after-free. However,
         * there are valid reasons to set up such mappings. If userspace
         * is trusted and willing to forego kernel safety guarantees,
         * allow this check to be bypassed.
         */
        if (!foll->guarded_by_mmu_notifier && !allow_unsafe_mappings)
                r = -EFAULT;
}

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
  2023-09-05  8:26             ` David Stevens
  (?)
@ 2023-09-06  0:45               ` Sean Christopherson
  -1 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2023-09-06  0:45 UTC (permalink / raw)
  To: David Stevens
  Cc: Zhi Wang, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Tue, Sep 05, 2023, David Stevens wrote:
> On Wed, Jul 12, 2023 at 7:00 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Jul 11, 2023, Zhi Wang wrote:
> > > On Thu, 6 Jul 2023 15:49:39 +0900
> > > David Stevens <stevensd@chromium.org> wrote:
> > >
> > > > On Wed, Jul 5, 2023 at 10:19___PM Zhi Wang <zhi.wang.linux@gmail.com> wrote:
> > > > >
> > > > > On Tue,  4 Jul 2023 16:50:48 +0900
> > > > > David Stevens <stevensd@chromium.org> wrote:
> > > > > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> > > > > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> > > > > temporary solution. I don't think it is a good idea to play tricks with
> > > > > a temporary solution, more like we are abusing the toleration.
> > > >
> > > > I'm not sure I understand what you're getting at. This series never
> > > > calls gup without FOLL_GET.
> > > >
> > > > This series aims to provide kvm_follow_pfn as a unified API on top of
> > > > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > > > notifier, it makes sense to support returning a pfn without taking a
> > > > reference. And we indeed need to do that for certain types of memory.
> > > >
> > >
> > > I am not having prob with taking a pfn without taking a ref. I am
> > > questioning if using !FOLL_GET in struct kvm_follow_pfn to indicate taking
> > > a pfn without a ref is a good idea or not, while there is another flag
> > > actually showing it.
> > >
> > > I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> > > translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
> > > flags. However FOLL_XXX is for GUP. Using FOLL_XXX for reflecting the
> > > requirements of GUP in the code path that going to call GUP is reasonable.
> > >
> > > But using FOLL_XXX with purposes that are not related to GUP call really
> > > feels off.
> >
> > I agree, assuming you're talking specifically about the logic in hva_to_pfn_remapped()
> > that handles non-refcounted pages, i.e. this
> >
> >         if (get_page_unless_zero(page)) {
> >                 foll->is_refcounted_page = true;
> >                 if (!(foll->flags & FOLL_GET))
> >                         put_page(page);
> >         } else if (foll->flags & FOLL_GET) {
> >                 r = -EFAULT;
> >         }
> >
> > should be
> >
> >         if (get_page_unless_zero(page)) {
> >                 foll->is_refcounted_page = true;
> >                 if (!(foll->flags & FOLL_GET))
> >                         put_page(page);
> >         else if (!foll->guarded_by_mmu_notifier)
> >                 r = -EFAULT;
> >
> > because it's not the desire to grab a reference that makes getting non-refcounted
> > pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.
> >
> > Though that highlights that checking guarded_by_mmu_notifier should be done for
> > *all* non-refcounted pfns, not just non-refcounted struct page memory.
> 
> I think things are getting confused here because there are multiple
> things which "safe" refers to. There are three different definitions
> that I think are relevant here:
> 
> 1) "safe" in the sense that KVM doesn't corrupt page reference counts
> 2) "safe" in the sense that KVM doesn't access pfns after they have been freed
> 3) "safe" in the sense that KVM doesn't use stale hva -> pfn translations
>
> For property 1, FOLL_GET is important. If the caller passes FOLL_GET,
> then they expect to be able to pass the returned pfn to
> kvm_release_pfn. This means that when FOLL_GET is set, if
> kvm_pfn_to_refcounted_page returns a page, then hva_to_pfn_remapped
> must take a reference count to avoid eventually corrupting the page
> ref count. I guess replacing the FOLL_GET check with
> !guarded_by_mmu_notifier is logically equivalent because
> __kvm_follow_pfn requires that at least one of guarded_by_mmu_notifier
> and FOLL_GET is set. But since we're concerned about a property of the
> refcount, I think that checking FOLL_GET is clearer.
> 
> For property 2, FOLL_GET is also important. If guarded_by_mmu_notifier
> is set, then we're all good here. If guarded_by_mmu_notifier is not
> set, then the check in __kvm_follow_pfn guarantees that FOLL_GET is
> set. For struct page memory, we're safe because KVM will hold a
> reference as long as it's still using the page. For non struct page
> memory, we're not safe - this is where the breaking change of
> allow_unsafe_mappings would go. Note that for non-refcounted struct
> page, we can't use the allow_unsafe_mappings escape hatch. Since
> FOLL_GET was requested, if we returned such a page, then the caller
> would eventually corrupt the page refcount via kvm_release_pfn.

Yes we can.  The caller simply needs to be made aware of is_refcounted_page.   I
didn't include that in the snippet below because I didn't want to write the entire
patch.  The whole point of adding is_refcounted_page is so that callers can
identify exactly what type of page was at the end of the trail that was followed.
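
As a rough illustration of what "made aware of is_refcounted_page" could look
like on the release side, e.g. for a kvm_vcpu_map()-style caller.  The struct
and helper below are hypothetical sketches for illustration only, not part of
the posted series; only kvm_release_pfn_clean()/kvm_release_pfn_dirty() are
existing KVM helpers.

	/*
	 * Hypothetical sketch: remember at map time whether the page was
	 * refcounted, and only drop a reference at unmap time if FOLL_GET
	 * actually took one.
	 */
	struct example_host_map {
		kvm_pfn_t pfn;
		void *hva;
		bool dirty;
		bool refcounted;	/* copied from foll.is_refcounted_page */
	};

	static void example_unmap(struct example_host_map *map)
	{
		if (!map->hva)
			return;

		/* Only release a reference if one was actually taken. */
		if (map->refcounted) {
			if (map->dirty)
				kvm_release_pfn_dirty(map->pfn);
			else
				kvm_release_pfn_clean(map->pfn);
		}

		map->hva = NULL;
	}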

> Property 3 would be nice, but we've already concluded that guarding
> all translations with mmu notifiers is infeasible. So maintaining
> property 2 is the best we can hope for.

No, #3 is just a variant of #2.  Unless you're talking about not making guarantees
about guest accesses being ordered with respect to VMA/memslot updates, but I
don't think that's the case.

> > As for the other usage of FOLL_GET in this series (using it to conditionally do
> > put_page()), IMO that's very much related to the GUP call.  Invoking put_page()
> > is a hack to work around the fact that GUP doesn't provide a way to get the pfn
> > without grabbing a reference to the page.  In an ideal world, KVM would NOT pass
> > FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM
> > wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.
> >
> > I do think it's worth providing a helper to consolidate and document that hacky
> > code, e.g. add a kvm_follow_refcounted_pfn() helper.
> >
> > All in all, I think the below (completely untested) is what we want?
> >
> > David (and others), I am planning on doing a full review of this series "soon",
> > but it will likely be a few weeks until that happens.  I jumped in on this
> > specific thread because this caught my eye and I really don't want to throw out
> > *all* of the FOLL_GET usage.
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 5b5afd70f239..90d424990e0a 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> >         return rc == -EHWPOISON;
> >  }
> >
> > +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> > +                                          struct page *page)
> > +{
> > +       kvm_pfn_t pfn = page_to_pfn(page);
> > +
> > +       foll->is_refcounted_page = true;
> > +
> > +       /*
> > +        * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
> > +        * doesn't want to grab a reference, but gup() doesn't support getting
> > +        * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
> > +        * changes, drop this and simply don't pass FOLL_GET to gup().
> > +        */
> > +       if (!(foll->flags & FOLL_GET))
> > +               put_page(page);
> > +
> > +       return pfn;
> > +}
> > +
> >  /*
> >   * The fast path to get the writable pfn which will be stored in @pfn,
> >   * true indicates success, otherwise false is returned.  It's also the
> > @@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> >                 return false;
> >
> >         if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> > -               *pfn = page_to_pfn(page[0]);
> >                 foll->writable = foll->allow_write_mapping;
> > -               foll->is_refcounted_page = true;
> > -               if (!(foll->flags & FOLL_GET))
> > -                       put_page(page[0]);
> > +
> > +               *pfn = kvm_follow_refcounted_pfn(foll, page[0]);
> >                 return true;
> >         }
> >
> > @@ -2528,7 +2545,6 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> >                 return npages;
> >
> >         foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > -       foll->is_refcounted_page = true;
> >
> >         /* map read fault as writable if possible */
> >         if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > @@ -2540,9 +2556,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> >                         page = wpage;
> >                 }
> >         }
> > -       *pfn = page_to_pfn(page);
> > -       if (!(foll->flags & FOLL_GET))
> > -               put_page(page);
> > +
> > +       *pfn = kvm_follow_refcounted_pfn(foll, page);
> >         return npages;
> >  }
> >
> > @@ -2610,17 +2625,16 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
> >         if (!page)
> >                 goto out;
> >
> > -       if (get_page_unless_zero(page)) {
> > -               foll->is_refcounted_page = true;
> > -               if (!(foll->flags & FOLL_GET))
> > -                       put_page(page);
> > -       } else if (foll->flags & FOLL_GET) {
> > -               r = -EFAULT;
> > -       }
> > -
> > +       if (get_page_unless_zero(page))
> > +               WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
> >  out:
> >         pte_unmap_unlock(ptep, ptl);
> > -       *p_pfn = pfn;
> > +
> > +       if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
> > +           !allow_unsafe_mappings)
> > +               r = -EFAULT;
> > +       else
> > +               *p_pfn = pfn;
> >
> >         return r;
> >  }
> >
> 
> As I pointed out above, this suggestion is broken because a FOLL_GET
> && !guarded_by_mmu_notifier request (e.g. kvm_vcpu_map) for a
> non-refcounted page will result in the refcount eventually being
> corrupted.

I don't think so, unless I'm misunderstanding the concern.  It just wasn't a
complete patch, and wasn't intended to be.

> What do you think of this implementation? If it makes sense, I can
> send out an updated patch series.
>
> /*
>  * If FOLL_GET is set, then the caller wants us to take a reference to
>  * keep the pfn alive. If FOLL_GET isn't set, then __kvm_follow_pfn
>  * guarantees that guarded_by_mmu_notifier is set, so there aren't any
>  * use-after-free concerns.
>  */
> page = kvm_pfn_to_refcounted_page(pfn);
> if (page) {
>         if (get_page_unless_zero(page)) {
>                 WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
>         } else if (foll->flags & FOLL_GET) {
>                 /*
>                  * Certain IO or PFNMAP mappings can be backed with
>                  * valid struct pages but be allocated without
>                  * refcounting e.g., tail pages of non-compound higher
>                  * order allocations. The caller asked for a ref, but
>                  * we can't take one, since releasing such a ref would
>                  * free the page.
>                  */
>                 r = -EFAULT;
>         }
> } else if (foll->flags & FOLL_GET) {
>         /*
>          * When there's no struct page to refcount and no MMU notifier,
>          * then KVM can't be guaranteed to avoid use-after-free. However,
>          * there are valid reasons to set up such mappings. If userspace
>          * is trusted and willing to forego kernel safety guarantees,
>          * allow this check to be bypassed.
>          */
>         if (foll->guarded_by_mmu_notifier && !allow_unsafe_mappings)

I assume you mean:

	if (!foll->guarded_by_mmu_notifier && !allow_unsafe_mappings)

>                 r = -EFAULT;
> }

Please no.  I don't want to overload FOLL_GET or have dependencies between
FOLL_GET and guarded_by_mmu_notifier.  The guest page fault path should be able
to omit FOLL_GET, and it obviously should set guarded_by_mmu_notifier.  And
kvm_vcpu_map() should be able to set FOLL_GET, omit guarded_by_mmu_notifier, _and_
play nice with "unsafe", non-refcounted memory when allow_unsafe_mappings is true.

The above fits your use case because nothing in KVM needs to do kvm_vcpu_map()
on the GPU's buffer, but as a general solution the semantics are very odd.  E.g.
in long form, they are:

  Get a reference if there's a refcounted page, fail with -EFAULT if there's
  a struct page but it's not refcounted, and fail with -EFAULT if there's no struct
  page unless the caller is protected by mmu_notifiers or unsafe mappings are allowed.

That's rather bonkers.  What I instead want is FOLL_GET to be:

  Get a reference if there's a refcounted page.

And then allow_unsafe_mappings is simply:

  Allow mapping non-refcounted memory into the guest even if the mapping isn't
  guarded by mmu_notifier events.
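
A minimal sketch of how those two rules could be encoded, with FOLL_GET and
guarded_by_mmu_notifier treated as fully independent inputs.  The helper name
is made up for illustration; the field names follow the snippets quoted above
and are otherwise assumptions.

	/*
	 * Hypothetical helper, not from the posted series: decide whether a
	 * resolved pfn may be returned to the caller.
	 */
	static bool example_pfn_allowed(struct kvm_follow_pfn *foll,
					bool allow_unsafe_mappings)
	{
		/*
		 * Refcounted pages are always fine; whether a reference was
		 * taken is purely a function of FOLL_GET.
		 */
		if (foll->is_refcounted_page)
			return true;

		/* Non-refcounted pfns are safe when mmu_notifier events cover them. */
		if (foll->guarded_by_mmu_notifier)
			return true;

		/* Otherwise only proceed if userspace explicitly opted in. */
		return allow_unsafe_mappings;
	}

With something like that in place, the -EFAULT in the hva_to_pfn_remapped()
diff above amounts to "fail exactly when example_pfn_allowed() is false".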

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
  2023-09-06  0:45               ` Sean Christopherson
  (?)
@ 2023-09-06  3:24                 ` David Stevens
  -1 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-09-06  3:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Zhi Wang, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Sep 6, 2023 at 9:45 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Sep 05, 2023, David Stevens wrote:
> > On Wed, Jul 12, 2023 at 7:00 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Tue, Jul 11, 2023, Zhi Wang wrote:
> > > > On Thu, 6 Jul 2023 15:49:39 +0900
> > > > David Stevens <stevensd@chromium.org> wrote:
> > > >
> > > > > On Wed, Jul 5, 2023 at 10:19___PM Zhi Wang <zhi.wang.linux@gmail.com> wrote:
> > > > > >
> > > > > > On Tue,  4 Jul 2023 16:50:48 +0900
> > > > > > David Stevens <stevensd@chromium.org> wrote:
> > > > > > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> > > > > > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> > > > > > temporary solution. I don't think it is a good idea to play tricks with
> > > > > > a temporary solution, more like we are abusing the toleration.
> > > > >
> > > > > I'm not sure I understand what you're getting at. This series never
> > > > > calls gup without FOLL_GET.
> > > > >
> > > > > This series aims to provide kvm_follow_pfn as a unified API on top of
> > > > > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > > > > notifier, it makes sense to support returning a pfn without taking a
> > > > > reference. And we indeed need to do that for certain types of memory.
> > > > >
> > > >
> > > > I don't have a problem with taking a pfn without taking a ref. I am
> > > > questioning whether using !FOLL_GET in struct kvm_follow_pfn to indicate
> > > > taking a pfn without a ref is a good idea, when there is another flag
> > > > that actually expresses it.
> > > >
> > > > I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> > > > translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
> > > > flags. However, FOLL_XXX is for GUP. Using FOLL_XXX to reflect the
> > > > requirements of GUP in the code path that is going to call GUP is
> > > > reasonable.
> > > >
> > > > But using FOLL_XXX for purposes that are not related to the GUP call
> > > > really feels off.
> > >
> > > I agree, assuming you're talking specifically about the logic in hva_to_pfn_remapped()
> > > that handles non-refcounted pages, i.e. this
> > >
> > >         if (get_page_unless_zero(page)) {
> > >                 foll->is_refcounted_page = true;
> > >                 if (!(foll->flags & FOLL_GET))
> > >                         put_page(page);
> > >         } else if (foll->flags & FOLL_GET) {
> > >                 r = -EFAULT;
> > >         }
> > >
> > > should be
> > >
> > >         if (get_page_unless_zero(page)) {
> > >                 foll->is_refcounted_page = true;
> > >                 if (!(foll->flags & FOLL_GET))
> > >                         put_page(page);
> > >         } else if (!foll->guarded_by_mmu_notifier) {
> > >                 r = -EFAULT;
> > >         }
> > >
> > > because it's not the desire to grab a reference that makes getting non-refcounted
> > > pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.
> > >
> > > Though that highlights that checking guarded_by_mmu_notifier should be done for
> > > *all* non-refcounted pfns, not just non-refcounted struct page memory.
> >
> > I think things are getting confused here because there are multiple
> > things which "safe" refers to. There are three different definitions
> > that I think are relevant here:
> >
> > 1) "safe" in the sense that KVM doesn't corrupt page reference counts
> > 2) "safe" in the sense that KVM doesn't access pfns after they have been freed
> > 3) "safe" in the sense that KVM doesn't use stale hva -> pfn translations
> >
> > For property 1, FOLL_GET is important. If the caller passes FOLL_GET,
> > then they expect to be able to pass the returned pfn to
> > kvm_release_pfn. This means that when FOLL_GET is set, if
> > kvm_pfn_to_refcounted_page returns a page, then hva_to_pfn_remapped
> > must take a reference count to avoid eventually corrupting the page
> > ref count. I guess replacing the FOLL_GET check with
> > !guarded_by_mmu_notifier is logically equivalent because
> > __kvm_follow_pfn requires that at least one of guarded_by_mmu_notifier
> > and FOLL_GET is set. But since we're concerned about a property of the
> > refcount, I think that checking FOLL_GET is clearer.
> >
> > For property 2, FOLL_GET is also important. If guarded_by_mmu_notifier
> > is set, then we're all good here. If guarded_by_mmu_notifier is not
> > set, then the check in __kvm_follow_pfn guarantees that FOLL_GET is
> > set. For struct page memory, we're safe because KVM will hold a
> > reference as long as it's still using the page. For non struct page
> > memory, we're not safe - this is where the breaking change of
> > allow_unsafe_mappings would go. Note that for non-refcounted struct
> > page, we can't use the allow_unsafe_mappings escape hatch. Since
> > FOLL_GET was requested, if we returned such a page, then the caller
> > would eventually corrupt the page refcount via kvm_release_pfn.
>
> Yes we can.  The caller simply needs to be made aware of is_refcounted_page.   I
> didn't include that in the snippet below because I didn't want to write the entire
> patch.  The whole point of adding is_refcounted_page is so that callers can
> identify exactly what type of page was at the end of the trail that was followed.

Are you asking me to completely migrate every caller of any gfn_to_pfn
variant to __kvm_follow_pfn, so that they can respect
is_refcounted_page? That's the only way to make it safe for
allow_unsafe_mappings to apply to non-refcounted pages. That is
decidedly not simple. Or is kvm_vcpu_map the specific call site you
care about? At best, I can try to migrate x86, and then just add some
sort of compatibility shim for other architectures that rejects
non-refcounted pages.
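
For what it's worth, such a shim could be fairly small.  A rough sketch, where
the wrapper name is made up and the __kvm_follow_pfn() signature and struct
fields are assumptions based on the snippets in this thread:

	/*
	 * Hypothetical compatibility wrapper, not from the series: legacy
	 * gfn_to_pfn-style callers never see a non-refcounted page, so they
	 * can keep releasing every pfn they are handed.
	 */
	kvm_pfn_t example_gfn_to_pfn_refcounted(struct kvm_follow_pfn *foll)
	{
		kvm_pfn_t pfn;

		foll->flags |= FOLL_GET;
		pfn = __kvm_follow_pfn(foll);
		if (is_error_noslot_pfn(pfn))
			return pfn;

		/* Reject pages the caller couldn't release safely. */
		if (!foll->is_refcounted_page)
			return KVM_PFN_ERR_FAULT;

		return pfn;
	}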

> > Property 3 would be nice, but we've already concluded that guarding
> > all translations with mmu notifiers is infeasible. So maintaining
> > property 2 is the best we can hope for.
>
> No, #3 is just a variant of #2.  Unless you're talking about not making guarantees
> about guest accesses being ordered with respect to VMA/memslot updates, but I
> don't think that's the case.

I'm talking about the fact that kvm_vcpu_map is busted with respect to
VMA updates. It won't corrupt host memory because the
mapping keeps a reference to the page, but it will continue to use
stale translations. From [1], it sounds like you've granted that
fixing that is not feasible, so I just wanted to make sure that this
isn't the "unsafe" referred to by allow_unsafe_mappings.

[1] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/

> > > As for the other usage of FOLL_GET in this series (using it to conditionally do
> > > put_page()), IMO that's very much related to the GUP call.  Invoking put_page()
> > > is a hack to work around the fact that GUP doesn't provide a way to get the pfn
> > > without grabbing a reference to the page.  In an ideal world, KVM would NOT pass
> > > FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM
> > > wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.
> > >
> > > I do think it's worth providing a helper to consolidate and document that hacky
> > > code, e.g. add a kvm_follow_refcounted_pfn() helper.
> > >
> > > All in all, I think the below (completely untested) is what we want?
> > >
> > > David (and others), I am planning on doing a full review of this series "soon",
> > > but it will likely be a few weeks until that happens.  I jumped in on this
> > > specific thread because this caught my eye and I really don't want to throw out
> > > *all* of the FOLL_GET usage.
> > >
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 5b5afd70f239..90d424990e0a 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> > >         return rc == -EHWPOISON;
> > >  }
> > >
> > > +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> > > +                                          struct page *page)
> > > +{
> > > +       kvm_pfn_t pfn = page_to_pfn(page);
> > > +
> > > +       foll->is_refcounted_page = true;
> > > +
> > > +       /*
> > > +        * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
> > > +        * doesn't want to grab a reference, but gup() doesn't support getting
> > > +        * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
> > > +        * changes, drop this and simply don't pass FOLL_GET to gup().
> > > +        */
> > > +       if (!(foll->flags & FOLL_GET))
> > > +               put_page(page);
> > > +
> > > +       return pfn;
> > > +}
> > > +
> > >  /*
> > >   * The fast path to get the writable pfn which will be stored in @pfn,
> > >   * true indicates success, otherwise false is returned.  It's also the
> > > @@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > >                 return false;
> > >
> > >         if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> > > -               *pfn = page_to_pfn(page[0]);
> > >                 foll->writable = foll->allow_write_mapping;
> > > -               foll->is_refcounted_page = true;
> > > -               if (!(foll->flags & FOLL_GET))
> > > -                       put_page(page[0]);
> > > +
> > > +               *pfn = kvm_follow_refcounted_pfn(foll, page[0]);
> > >                 return true;
> > >         }
> > >
> > > @@ -2528,7 +2545,6 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > >                 return npages;
> > >
> > >         foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > -       foll->is_refcounted_page = true;
> > >
> > >         /* map read fault as writable if possible */
> > >         if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > > @@ -2540,9 +2556,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > >                         page = wpage;
> > >                 }
> > >         }
> > > -       *pfn = page_to_pfn(page);
> > > -       if (!(foll->flags & FOLL_GET))
> > > -               put_page(page);
> > > +
> > > +       *pfn = kvm_follow_refcounted_pfn(foll, page);
> > >         return npages;
> > >  }
> > >
> > > @@ -2610,17 +2625,16 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
> > >         if (!page)
> > >                 goto out;
> > >
> > > -       if (get_page_unless_zero(page)) {
> > > -               foll->is_refcounted_page = true;
> > > -               if (!(foll->flags & FOLL_GET))
> > > -                       put_page(page);
> > > -       } else if (foll->flags & FOLL_GET) {
> > > -               r = -EFAULT;
> > > -       }
> > > -
> > > +       if (get_page_unless_zero(page))
> > > +               WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
> > >  out:
> > >         pte_unmap_unlock(ptep, ptl);
> > > -       *p_pfn = pfn;
> > > +
> > > +       if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
> > > +           !allow_unsafe_mappings)
> > > +               r = -EFAULT;
> > > +       else
> > > +               *p_pfn = pfn;
> > >
> > >         return r;
> > >  }
> > >
> >
> > As I pointed out above, this suggestion is broken because a FOLL_GET
> > && !guarded_by_mmu_notifier request (e.g. kvm_vcpu_map) for a
> > non-refcounted page will result in the refcount eventually being
> > corrupted.
>
> I don't think so, unless I'm misunderstanding the concern.  It just wasn't a
> complete patch, and wasn't intended to be.

I guess I misunderstood the scale of the changes you're suggesting.

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
@ 2023-09-06  3:24                 ` David Stevens
  0 siblings, 0 replies; 165+ messages in thread
From: David Stevens @ 2023-09-06  3:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Zhi Wang, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Sep 6, 2023 at 9:45 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Sep 05, 2023, David Stevens wrote:
> > On Wed, Jul 12, 2023 at 7:00 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Tue, Jul 11, 2023, Zhi Wang wrote:
> > > > On Thu, 6 Jul 2023 15:49:39 +0900
> > > > David Stevens <stevensd@chromium.org> wrote:
> > > >
> > > > > On Wed, Jul 5, 2023 at 10:19___PM Zhi Wang <zhi.wang.linux@gmail.com> wrote:
> > > > > >
> > > > > > On Tue,  4 Jul 2023 16:50:48 +0900
> > > > > > David Stevens <stevensd@chromium.org> wrote:
> > > > > > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> > > > > > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> > > > > > temporary solution. I don't think it is a good idea to play tricks with
> > > > > > a temporary solution, more like we are abusing the toleration.
> > > > >
> > > > > I'm not sure I understand what you're getting at. This series never
> > > > > calls gup without FOLL_GET.
> > > > >
> > > > > This series aims to provide kvm_follow_pfn as a unified API on top of
> > > > > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > > > > notifier, it makes sense to support returning a pfn without taking a
> > > > > reference. And we indeed need to do that for certain types of memory.
> > > > >
> > > >
> > > > I am not having prob with taking a pfn without taking a ref. I am
> > > > questioning if using !FOLL_GET in struct kvm_follow_pfn to indicate taking
> > > > a pfn without a ref is a good idea or not, while there is another flag
> > > > actually showing it.
> > > >
> > > > I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> > > > translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
> > > > flags. However FOLL_XXX is for GUP. Using FOLL_XXX for reflecting the
> > > > requirements of GUP in the code path that going to call GUP is reasonable.
> > > >
> > > > But using FOLL_XXX with purposes that are not related to GUP call really
> > > > feels off.
> > >
> > > I agree, assuming you're talking specifically about the logic in hva_to_pfn_remapped()
> > > that handles non-refcounted pages, i.e. this
> > >
> > >         if (get_page_unless_zero(page)) {
> > >                 foll->is_refcounted_page = true;
> > >                 if (!(foll->flags & FOLL_GET))
> > >                         put_page(page);
> > >         } else if (foll->flags & FOLL_GET) {
> > >                 r = -EFAULT;
> > >         }
> > >
> > > should be
> > >
> > >         if (get_page_unless_zero(page)) {
> > >                 foll->is_refcounted_page = true;
> > >                 if (!(foll->flags & FOLL_GET))
> > >                         put_page(page);
> > >         else if (!foll->guarded_by_mmu_notifier)
> > >                 r = -EFAULT;
> > >
> > > because it's not the desire to grab a reference that makes getting non-refcounted
> > > pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.
> > >
> > > Though that highlights that checking guarded_by_mmu_notifier should be done for
> > > *all* non-refcounted pfns, not just non-refcounted struct page memory.
> >
> > I think things are getting confused here because there are multiple
> > things which "safe" refers to. There are three different definitions
> > that I think are relevant here:
> >
> > 1) "safe" in the sense that KVM doesn't corrupt page reference counts
> > 2) "safe" in the sense that KVM doesn't access pfns after they have been freed
> > 3) "safe" in the sense that KVM doesn't use stale hva -> pfn translations
> >
> > For property 1, FOLL_GET is important. If the caller passes FOLL_GET,
> > then they expect to be able to pass the returned pfn to
> > kvm_release_pfn. This means that when FOLL_GET is set, if
> > kvm_pfn_to_refcounted_page returns a page, then hva_to_pfn_remapped
> > must take a reference count to avoid eventually corrupting the page
> > ref count. I guess replacing the FOLL_GET check with
> > !guarded_by_mmu_notifier is logically equivalent because
> > __kvm_follow_pfn requires that at least one of guarded_by_mmu_notifier
> > and FOLL_GET is set. But since we're concerned about a property of the
> > refcount, I think that checking FOLL_GET is clearer.
> >
> > For property 2, FOLL_GET is also important. If guarded_by_mmu_notifier
> > is set, then we're all good here. If guarded_by_mmu_notifier is not
> > set, then the check in __kvm_follow_pfn guarantees that FOLL_GET is
> > set. For struct page memory, we're safe because KVM will hold a
> > reference as long as it's still using the page. For non struct page
> > memory, we're not safe - this is where the breaking change of
> > allow_unsafe_mappings would go. Note that for non-refcounted struct
> > page, we can't use the allow_unsafe_mappings escape hatch. Since
> > FOLL_GET was requested, if we returned such a page, then the caller
> > would eventually corrupt the page refcount via kvm_release_pfn.
>
> Yes we can.  The caller simply needs to be made aware of is_refcounted_page.   I
> didn't include that in the snippet below because I didn't want to write the entire
> patch.  The whole point of adding is_refcounted_page is so that callers can
> identify exactly what type of page was at the end of the trail that was followed.

Are you asking me to completely migrate every caller of any gfn_to_pfn
variant to __kvm_follow_pfn, so that they can respect
is_refcounted_page? That's the only way to make it safe for
allow_unsafe_mappings to apply to non-refcounted pages. That is
decidedly not simple. Or is kvm_vcpu_map the specific call site you
care about? At best, I can try to migrate x86, and then just add some
sort of compatibility shim for other architectures that rejects
non-refcounted pages.

> > Property 3 would be nice, but we've already concluded that guarding
> > all translations with mmu notifiers is infeasible. So maintaining
> > property 2 is the best we can hope for.
>
> No, #3 is just a variant of #2.  Unless you're talking about not making guarantees
> about guest accesses being ordered with respect to VMA/memslot updates, but I
> don't think that's the case.

I'm talking about the fact that kvm_vcpu_map is busted with respect to
updates to VMA updates. It won't corrupt host memory because the
mapping keeps a reference to the page, but it will continue to use
stale translations. From [1], it sounds like you've granted that
fixing that is not feasible, so I just wanted to make sure that this
isn't the "unsafe" referred to by allow_unsafe_mappings.

[1] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/

> > > As for the other usage of FOLL_GET in this series (using it to conditionally do
> > > put_page()), IMO that's very much related to the GUP call.  Invoking put_page()
> > > is a hack to workaround the fact that GUP doesn't provide a way to get the pfn
> > > without grabbing a reference to the page.  In an ideal world, KVM would NOT pass
> > > FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM
> > > wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.
> > >
> > > I do think it's worth providing a helper to consolidate and document that hacky
> > > code, e.g. add a kvm_follow_refcounted_pfn() helper.
> > >
> > > All in all, I think the below (completely untested) is what we want?
> > >
> > > David (and others), I am planning on doing a full review of this series "soon",
> > > but it will likely be a few weeks until that happens.  I jumped in on this
> > > specific thread because this caught my eye and I really don't want to throw out
> > > *all* of the FOLL_GET usage.
> > >
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 5b5afd70f239..90d424990e0a 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> > >         return rc == -EHWPOISON;
> > >  }
> > >
> > > +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> > > +                                          struct page *page)
> > > +{
> > > +       kvm_pfn_t pfn = page_to_pfn(page);
> > > +
> > > +       foll->is_refcounted_page = true;
> > > +
> > > +       /*
> > > +        * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
> > > +        * doesn't want to grab a reference, but gup() doesn't support getting
> > > +        * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
> > > +        * changes, drop this and simply don't pass FOLL_GET to gup().
> > > +        */
> > > +       if (!(foll->flags & FOLL_GET))
> > > +               put_page(page);
> > > +
> > > +       return pfn;
> > > +}
> > > +
> > >  /*
> > >   * The fast path to get the writable pfn which will be stored in @pfn,
> > >   * true indicates success, otherwise false is returned.  It's also the
> > > @@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > >                 return false;
> > >
> > >         if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> > > -               *pfn = page_to_pfn(page[0]);
> > >                 foll->writable = foll->allow_write_mapping;
> > > -               foll->is_refcounted_page = true;
> > > -               if (!(foll->flags & FOLL_GET))
> > > -                       put_page(page[0]);
> > > +
> > > +               *pfn = kvm_follow_refcounted_pfn(foll, page[0]);
> > >                 return true;
> > >         }
> > >
> > > @@ -2528,7 +2545,6 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > >                 return npages;
> > >
> > >         foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > -       foll->is_refcounted_page = true;
> > >
> > >         /* map read fault as writable if possible */
> > >         if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > > @@ -2540,9 +2556,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > >                         page = wpage;
> > >                 }
> > >         }
> > > -       *pfn = page_to_pfn(page);
> > > -       if (!(foll->flags & FOLL_GET))
> > > -               put_page(page);
> > > +
> > > +       *pfn = kvm_follow_refcounted_pfn(foll, page);
> > >         return npages;
> > >  }
> > >
> > > @@ -2610,17 +2625,16 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
> > >         if (!page)
> > >                 goto out;
> > >
> > > -       if (get_page_unless_zero(page)) {
> > > -               foll->is_refcounted_page = true;
> > > -               if (!(foll->flags & FOLL_GET))
> > > -                       put_page(page);
> > > -       } else if (foll->flags & FOLL_GET) {
> > > -               r = -EFAULT;
> > > -       }
> > > -
> > > +       if (get_page_unless_zero(page))
> > > +               WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
> > >  out:
> > >         pte_unmap_unlock(ptep, ptl);
> > > -       *p_pfn = pfn;
> > > +
> > > +       if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
> > > +           !allow_unsafe_mappings)
> > > +               r = -EFAULT;
> > > +       else
> > > +               *p_pfn = pfn;
> > >
> > >         return r;
> > >  }
> > >
> >
> > As I pointed out above, this suggestion is broken because a FOLL_GET
> > && !guarded_by_mmu_notifier request (e.g. kvm_vcpu_map) for a
> > non-refcounted page will result in the refcount eventually being
> > corrupted.
>
> I don't think so, unless I'm misunderstanding the concern.  It just wasn't a
> complete patch, and wasn't intended to be.

I guess I misunderstood the scale of the changes you're suggesting.

-David

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
  2023-09-06  3:24                 ` David Stevens
  (?)
@ 2023-09-06 22:03                   ` Sean Christopherson
  -1 siblings, 0 replies; 165+ messages in thread
From: Sean Christopherson @ 2023-09-06 22:03 UTC (permalink / raw)
  To: David Stevens
  Cc: Zhi Wang, Marc Zyngier, Michael Ellerman, Peter Xu,
	linux-arm-kernel, kvmarm, linux-kernel, linuxppc-dev, kvm

On Wed, Sep 06, 2023, David Stevens wrote:
> On Wed, Sep 6, 2023 at 9:45 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Sep 05, 2023, David Stevens wrote:
> > > For property 2, FOLL_GET is also important. If guarded_by_mmu_notifier
> > > is set, then we're all good here. If guarded_by_mmu_notifier is not
> > > set, then the check in __kvm_follow_pfn guarantees that FOLL_GET is
> > > set. For struct page memory, we're safe because KVM will hold a
> > > reference as long as it's still using the page. For non struct page
> > > memory, we're not safe - this is where the breaking change of
> > > allow_unsafe_mappings would go. Note that for non-refcounted struct
> > > page, we can't use the allow_unsafe_mappings escape hatch. Since
> > > FOLL_GET was requested, if we returned such a page, then the caller
> > > would eventually corrupt the page refcount via kvm_release_pfn.
> >
> > Yes we can.  The caller simply needs to be made aware of is_refcounted_page.   I
> > didn't include that in the snippet below because I didn't want to write the entire
> > patch.  The whole point of adding is_refcounted_page is so that callers can
> > identify exactly what type of page was at the end of the trail that was followed.
> 
> Are you asking me to completely migrate every caller of any gfn_to_pfn
> variant to __kvm_follow_pfn, so that they can respect
> is_refcounted_page? That's the only way to make it safe for
> allow_unsafe_mappings to apply to non-refcounted pages. That is
> decidedly not simple. Or is kvm_vcpu_map the specific call site you
> care about? At best, I can try to migrate x86, and then just add some
> sort of compatibility shim for other architectures that rejects
> non-refcounted pages.

Ah, I see your conundrum.  No, I don't think it's reasonable to require you to
convert all users in every architecture.  I'll still ask, just in case you're
feeling generous, but it's not a requirement :-)

The easiest way forward I can think of is to add yet another flag to kvm_follow_pfn,
e.g. allow_non_refcounted_struct_page, to communicate whether or not the caller
has been enlightened to play nice with non-refcounted struct page memory.  We'll
need that flag no matter what, otherwise we'd have to convert all users in a single
patch (LOL).  Then your series can simply stop at a reasonable point, e.g. convert
all x86 usage (including kvm_vcpu_map()), and leave converting everything else to
future work.

E.g. I think this would be the outro of hva_to_pfn_remapped():

        if (!page)
                goto out;

        if (get_page_unless_zero(page))
                WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
 out:
        pte_unmap_unlock(ptep, ptl);

        /*
         * TODO: Drop allow_non_refcounted_struct_page once all callers have
         * been taught to play nice with non-refcounted tail pages.
         */
        if (page && !foll->is_refcounted_page &&
            !foll->allow_non_refcounted_struct_page)
                r = -EFAULT;
        else if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
                 !allow_unsafe_mappings)
                r = -EFAULT;
        else
                *p_pfn = pfn;

        return r;
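
For completeness, the flag itself would just be another knob in the
kvm_follow_pfn struct, along the lines of the below (again completely
untested, and the surrounding field names are from memory, so treat it as
illustrative only):

struct kvm_follow_pfn {
        const struct kvm_memory_slot *slot;
        gfn_t gfn;
        unsigned int flags;

        /* The caller is plugged into mmu_notifiers for this mapping. */
        bool guarded_by_mmu_notifier;

        /*
         * The caller checks is_refcounted_page and skips kvm_release_*()
         * for pages that weren't refcounted, i.e. it can safely be handed
         * non-refcounted struct page memory.
         */
        bool allow_non_refcounted_struct_page;

        /* Outputs. */
        bool writable;
        bool is_refcounted_page;
};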

> > > Property 3 would be nice, but we've already concluded that guarding
> > > all translations with mmu notifiers is infeasible. So maintaining
> > > property 2 is the best we can hope for.
> >
> > No, #3 is just a variant of #2.  Unless you're talking about not making guarantees
> > about guest accesses being ordered with respect to VMA/memslot updates, but I
> > don't think that's the case.
> 
> I'm talking about the fact that kvm_vcpu_map is busted with respect to
> VMA updates. It won't corrupt host memory because the
> mapping keeps a reference to the page, but it will continue to use
> stale translations.

True.  But barring some crazy paravirt use case, userspace modifying a mapping
that is in active use is inherently broken; the guest will have no idea that memory
just got yanked away.

Hmm, though I suppose userspace could theoretically mprotect() a mapping to be
read-only, which would "work" for mmu_notifier paths but not kvm_vcpu_map().  But
KVM doesn't provide enough information on -EFAULT for userspace to do anything in
response to a write to read-only memory, so in practice that's likely inherently
broken too.

> From [1], it sounds like you've granted that fixing that is not feasible, so
> I just wanted to make sure that this isn't the "unsafe" referred to by
> allow_unsafe_mappings.

Right, this is not the "unsafe" I'm referring to.

> [1] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/

^ permalink raw reply	[flat|nested] 165+ messages in thread

end of thread, other threads:[~2023-09-06 22:04 UTC | newest]

Thread overview: 165+ messages
2023-07-04  7:50 [PATCH v7 0/8] KVM: allow mapping non-refcounted pages David Stevens
2023-07-04  7:50 ` David Stevens
2023-07-04  7:50 ` David Stevens
2023-07-04  7:50 ` [PATCH v7 1/8] KVM: Assert that a page's refcount is elevated when marking accessed/dirty David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-04  7:50 ` [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-05  3:10   ` Yu Zhang
2023-07-05  3:10     ` Yu Zhang
2023-07-05  3:10     ` Yu Zhang
2023-07-05  9:22     ` David Stevens
2023-07-05  9:22       ` David Stevens
2023-07-05  9:22       ` David Stevens
2023-07-05 10:53       ` Yu Zhang
2023-07-05 10:53         ` Yu Zhang
2023-07-05 10:53         ` Yu Zhang
2023-07-06  5:29         ` David Stevens
2023-07-06  5:29           ` David Stevens
2023-07-06  5:29           ` David Stevens
2023-07-06 14:52           ` Yu Zhang
2023-07-06 14:52             ` Yu Zhang
2023-07-06 14:52             ` Yu Zhang
2023-08-04 22:03             ` Sean Christopherson
2023-08-04 22:03               ` Sean Christopherson
2023-08-04 22:03               ` Sean Christopherson
2023-07-05  8:47   ` Zhi Wang
2023-07-05  8:47     ` Zhi Wang
2023-07-05  8:47     ` Zhi Wang
2023-07-05  9:08     ` David Stevens
2023-07-05  9:08       ` David Stevens
2023-07-05  9:08       ` David Stevens
2023-07-11 17:37       ` Zhi Wang
2023-07-11 17:37         ` Zhi Wang
2023-07-11 17:37         ` Zhi Wang
2023-07-06  1:34   ` Isaku Yamahata
2023-07-06  1:34     ` Isaku Yamahata
2023-07-06  1:34     ` Isaku Yamahata
2023-07-06  5:52     ` David Stevens
2023-07-06  5:52       ` David Stevens
2023-07-06  5:52       ` David Stevens
2023-08-04 22:13   ` Sean Christopherson
2023-08-04 22:13     ` Sean Christopherson
2023-08-04 22:13     ` Sean Christopherson
2023-07-04  7:50 ` [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-05  7:23   ` Yu Zhang
2023-07-05  7:23     ` Yu Zhang
2023-07-05  7:23     ` Yu Zhang
2023-07-05 11:56   ` Yu Zhang
2023-07-05 11:56     ` Yu Zhang
2023-07-05 11:56     ` Yu Zhang
2023-07-06  6:09     ` David Stevens
2023-07-06  6:09       ` David Stevens
2023-07-06  6:09       ` David Stevens
2023-07-05 13:19   ` Zhi Wang
2023-07-05 13:19     ` Zhi Wang
2023-07-05 13:19     ` Zhi Wang
2023-07-06  6:49     ` David Stevens
2023-07-06  6:49       ` David Stevens
2023-07-06  6:49       ` David Stevens
2023-07-11 17:33       ` Zhi Wang
2023-07-11 17:33         ` Zhi Wang
2023-07-11 17:33         ` Zhi Wang
2023-07-11 21:59         ` Sean Christopherson
2023-07-11 21:59           ` Sean Christopherson
2023-07-11 21:59           ` Sean Christopherson
2023-09-05  8:26           ` David Stevens
2023-09-05  8:26             ` David Stevens
2023-09-05  8:26             ` David Stevens
2023-09-06  0:45             ` Sean Christopherson
2023-09-06  0:45               ` Sean Christopherson
2023-09-06  0:45               ` Sean Christopherson
2023-09-06  3:24               ` David Stevens
2023-09-06  3:24                 ` David Stevens
2023-09-06  3:24                 ` David Stevens
2023-09-06 22:03                 ` Sean Christopherson
2023-09-06 22:03                   ` Sean Christopherson
2023-09-06 22:03                   ` Sean Christopherson
2023-07-04  7:50 ` [PATCH v7 4/8] KVM: x86/mmu: Migrate to __kvm_follow_pfn David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-05  8:07   ` Yu Zhang
2023-07-05  8:07     ` Yu Zhang
2023-07-05  8:07     ` Yu Zhang
2023-08-04 22:30     ` Sean Christopherson
2023-08-04 22:30       ` Sean Christopherson
2023-08-04 22:30       ` Sean Christopherson
2023-07-06  1:54   ` Isaku Yamahata
2023-07-06  1:54     ` Isaku Yamahata
2023-07-06  1:54     ` Isaku Yamahata
2023-08-24  8:03     ` David Stevens
2023-08-24  8:03       ` David Stevens
2023-08-24  8:03       ` David Stevens
2023-07-04  7:50 ` [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET " David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-05 10:18   ` Yu Zhang
2023-07-05 10:18     ` Yu Zhang
2023-07-05 10:18     ` Yu Zhang
2023-07-05 14:17     ` Yu Zhang
2023-07-05 14:17       ` Yu Zhang
2023-07-05 14:17       ` Yu Zhang
2023-07-06  4:52     ` David Stevens
2023-07-06  4:52       ` David Stevens
2023-07-06  4:52       ` David Stevens
2023-07-06  7:19       ` Yu Zhang
2023-07-06  7:19         ` Yu Zhang
2023-07-06  7:19         ` Yu Zhang
2023-07-06 15:58       ` Isaku Yamahata
2023-07-06 15:58         ` Isaku Yamahata
2023-07-06 15:58         ` Isaku Yamahata
2023-07-07  1:35         ` David Stevens
2023-07-07  1:35           ` David Stevens
2023-07-07  1:35           ` David Stevens
2023-07-10 16:34           ` Isaku Yamahata
2023-07-10 16:34             ` Isaku Yamahata
2023-07-10 16:34             ` Isaku Yamahata
2023-07-11  2:59             ` David Stevens
2023-07-11  2:59               ` David Stevens
2023-07-11  2:59               ` David Stevens
2023-08-04 22:45       ` Sean Christopherson
2023-08-04 22:45         ` Sean Christopherson
2023-08-04 22:45         ` Sean Christopherson
2023-07-05 10:25   ` Yu Zhang
2023-07-05 10:25     ` Yu Zhang
2023-07-05 10:25     ` Yu Zhang
2023-08-24  8:03     ` David Stevens
2023-08-24  8:03       ` David Stevens
2023-08-24  8:03       ` David Stevens
2023-08-24 15:15       ` Sean Christopherson
2023-08-24 15:15         ` Sean Christopherson
2023-08-24 15:15         ` Sean Christopherson
2023-08-25  1:38         ` David Stevens
2023-08-25  1:38           ` David Stevens
2023-08-25  1:38           ` David Stevens
2023-08-31 21:18           ` Sean Christopherson
2023-08-31 21:18             ` Sean Christopherson
2023-08-31 21:18             ` Sean Christopherson
2023-07-06  2:10   ` Isaku Yamahata
2023-07-06  2:10     ` Isaku Yamahata
2023-07-06  2:10     ` Isaku Yamahata
2023-07-06  5:18     ` David Stevens
2023-07-06  5:18       ` David Stevens
2023-07-06  5:18       ` David Stevens
2023-07-19  6:09   ` Yan Zhao
2023-07-19  6:09     ` Yan Zhao
2023-07-19  6:09     ` Yan Zhao
2023-07-19  7:16     ` David Stevens
2023-07-19  7:16       ` David Stevens
2023-07-19  7:16       ` David Stevens
2023-07-04  7:50 ` [PATCH v7 6/8] KVM: arm64: Migrate " David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-04  7:50 ` [PATCH v7 7/8] KVM: PPC: " David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-04  7:50 ` [PATCH v7 8/8] KVM: remove __gfn_to_pfn_memslot David Stevens
2023-07-04  7:50   ` David Stevens
2023-07-04  7:50   ` David Stevens
2023-08-04 22:47 ` [PATCH v7 0/8] KVM: allow mapping non-refcounted pages Sean Christopherson
2023-08-04 22:47   ` Sean Christopherson
2023-08-04 22:47   ` Sean Christopherson
