* [PATCH v2 0/3] KVM: MMU: Fix a refcount bug with ZONE_DEVICE pages
@ 2019-11-11 22:12 Sean Christopherson
  2019-11-11 22:12 ` [PATCH v2 1/3] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved Sean Christopherson
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Sean Christopherson @ 2019-11-11 22:12 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Adam Borowski,
	David Hildenbrand, Dan Williams

This mini-series fixes a suspected, but technically unconfirmed, bug in
KVM related to ZONE_DEVICE pages.  The suspected issue is that KVM treats
ZONE_DEVICE pages as reserved PFNs, and so never puts its references to
such pages when dropping them via KVM's generic kvm_release_pfn_clean().
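
For context, a minimal sketch of the suspected buggy path, reconstructed
from the pre-fix virt/kvm/kvm_main.c (trimmed to the relevant lines):

  void kvm_release_pfn_clean(kvm_pfn_t pfn)
  {
  	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
  		put_page(pfn_to_page(pfn));
  }

  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
  {
  	/*
  	 * ZONE_DEVICE pages set PG_reserved, so this returns true for
  	 * them, the put_page() above is skipped, and the gup()
  	 * reference is leaked.
  	 */
  	if (pfn_valid(pfn))
  		return PageReserved(pfn_to_page(pfn));

  	return true;
  }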

David Hildenbrand uncovered the bug during a discussion about removing
PG_reserved from ZONE_DEVICE pages, after Dan Williams pointed out[1] that
there was a bug report from Adam Borowski[2] that was likely related to
KVM's interaction with PageReserved().

Patch 1/3 contains the actual fix; patches 2/3 and 3/3 are minor cleanups
that are mostly unrelated, but are dependent on and were prompted by the
fix itself.

v2:
  - Remove the kvm_is_zone_device_pfn(pfn) check from kvm_get_pfn().  It's
    not entirely clear whether or not the hva_to_pfn_remapped() case is
    actually broken, e.g. KVM's page fault handler is likely ok, whereas
    not calling get_page() will definitely cause breakage as KVM would
    later call put_page() on the pfn/page (see the sketch after this
    list). [Paolo]

  - WARN if kvm_is_zone_device_pfn() is called without the underlying
    page being pinned.  This won't necessarily catch all bugs, e.g. if
    the above hva_to_pfn_remapped case is indeed broken, but will
    prevent completely bogus usage. [Dan]

  - Remove the is_error_pfn() check from transparent_hugepage_adjust()
    instead of carrying it forward into the new kvm_is_hugepage_allowed()
    helper. [Paolo]
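
For reference, the kvm_get_pfn() path mentioned above, as a trimmed
sketch of the pre-fix virt/kvm/kvm_main.c:

  static int hva_to_pfn_remapped(struct vm_area_struct *vma, unsigned long addr,
  			       bool *async, bool write_fault, bool *writable,
  			       kvm_pfn_t *p_pfn)
  {
  	unsigned long pfn;

  	/* ... pfn resolved from the vma, error handling elided ... */

  	/*
  	 * Callers of *hva_to_pfn* and *gfn_to_pfn* ultimately call
  	 * kvm_release_pfn_clean() on the returned pfn, so a reference
  	 * must be acquired here to balance the later put_page().
  	 */
  	kvm_get_pfn(pfn);
  	*p_pfn = pfn;
  	return 0;
  }

  void kvm_get_pfn(kvm_pfn_t pfn)
  {
  	if (!kvm_is_reserved_pfn(pfn))
  		get_page(pfn_to_page(pfn));
  }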

[1] https://lkml.kernel.org/r/01adb4cb-6092-638c-0bab-e61322be7cf5@redhat.com
[2] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl

Sean Christopherson (3):
  KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  KVM: x86/mmu: Remove superfluous is_error_pfn() check from THP adjust
  KVM: x86/mmu: Add helper to consolidate huge page promotion

 arch/x86/kvm/mmu.c       | 15 +++++++++------
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 26 +++++++++++++++++++++++---
 3 files changed, 33 insertions(+), 9 deletions(-)

-- 
2.24.0



* [PATCH v2 1/3] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-11 22:12 [PATCH v2 0/3] KVM: MMU: Fix a refcount bug with ZONE_DEVICE pages Sean Christopherson
@ 2019-11-11 22:12 ` Sean Christopherson
  2019-11-11 22:20   ` Paolo Bonzini
                     ` (2 more replies)
  2019-11-11 22:12 ` [PATCH v2 2/3] KVM: x86/mmu: Remove superfluous is_error_pfn() check from THP adjust Sean Christopherson
  2019-11-11 22:12 ` [PATCH v2 3/3] KVM: x86/mmu: Add helper to consolidate huge page promotion Sean Christopherson
  2 siblings, 3 replies; 8+ messages in thread
From: Sean Christopherson @ 2019-11-11 22:12 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Adam Borowski,
	David Hildenbrand, Dan Williams

Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
pages, e.g. put pages grabbed via gup().  But for flows such as setting
A/D bits or shifting refcounts for transparent huge pages, KVM needs to
avoid processing ZONE_DEVICE pages as the flows in question lack the
underlying machinery for proper handling of ZONE_DEVICE pages.

This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
when running a KVM guest backed with /dev/dax memory, as KVM straight up
doesn't put any references to ZONE_DEVICE pages acquired by gup().

Note, Dan Williams proposed an alternative solution of doing put_page()
on ZONE_DEVICE pages immediately after gup() in order to simplify the
auditing needed to ensure is_zone_device_page() is called if and only if
the backing device is pinned (via gup()).  But that approach would break
kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til
unmap() when accessing guest memory, unlike KVM's secondary MMU, which
coordinates with mmu_notifier invalidations to avoid creating stale
page references, i.e. doesn't rely on pages being pinned.
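
To illustrate that constraint, a usage sketch (gpa, data and len are
hypothetical; API shapes as of this series' era):

  struct kvm_host_map map;

  /* Pins the page, including ZONE_DEVICE pages grabbed via gup(). */
  if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
  	return -EFAULT;

  /* The page must stay pinned while its hva is being accessed... */
  memcpy(data, map.hva + offset_in_page(gpa), len);

  /* ...and the reference is only put here, not right after gup(). */
  kvm_vcpu_unmap(vcpu, &map, false);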

[*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl

Reported-by: Adam Borowski <kilobyte@angband.pl>
Debugged-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu.c       |  8 ++++----
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 26 +++++++++++++++++++++++---
 3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..bf82b1f2e834 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3306,7 +3306,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 * here.
 	 */
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
-	    level == PT_PAGE_TABLE_LEVEL &&
+	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
 	    PageTransCompoundMap(pfn_to_page(pfn)) &&
 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
 		unsigned long mask;
@@ -5914,9 +5914,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 * the guest, and the guest page table is using 4K page size
 		 * mapping if the indirect sp has level = 1.
 		 */
-		if (sp->role.direct &&
-			!kvm_is_reserved_pfn(pfn) &&
-			PageTransCompoundMap(pfn_to_page(pfn))) {
+		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
+		    !kvm_is_zone_device_pfn(pfn) &&
+		    PageTransCompoundMap(pfn_to_page(pfn))) {
 			pte_list_remove(rmap_head, sptep);
 
 			if (kvm_available_flush_tlb_with_range())
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a817e446c9aa..4ad1cd7d2d4d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -966,6 +966,7 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
+bool kvm_is_zone_device_pfn(kvm_pfn_t pfn);
 
 struct kvm_irq_ack_notifier {
 	struct hlist_node link;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b8534c6b8cf6..bc9d10a0a334 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -149,10 +149,30 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 	return 0;
 }
 
+bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
+{
+	/*
+	 * The metadata used by is_zone_device_page() to determine whether or
+	 * not a page is ZONE_DEVICE is guaranteed to be valid if and only if
+	 * the device has been pinned, e.g. by get_user_pages().  WARN if the
+	 * page_count() is zero to help detect bad usage of this helper.
+	 */
+	if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
+		return false;
+
+	return is_zone_device_page(pfn_to_page(pfn));
+}
+
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
+	/*
+	 * ZONE_DEVICE pages currently set PG_reserved, but from a refcounting
+	 * perspective they are "normal" pages, albeit with slightly different
+	 * usage rules.
+	 */
 	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+		return PageReserved(pfn_to_page(pfn)) &&
+		       !kvm_is_zone_device_pfn(pfn);
 
 	return true;
 }
@@ -1865,7 +1885,7 @@ EXPORT_SYMBOL_GPL(kvm_release_pfn_dirty);
 
 void kvm_set_pfn_dirty(kvm_pfn_t pfn)
 {
-	if (!kvm_is_reserved_pfn(pfn)) {
+	if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
 
 		SetPageDirty(page);
@@ -1875,7 +1895,7 @@ EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
 
 void kvm_set_pfn_accessed(kvm_pfn_t pfn)
 {
-	if (!kvm_is_reserved_pfn(pfn))
+	if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn))
 		mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
-- 
2.24.0



* [PATCH v2 2/3] KVM: x86/mmu: Remove superfluous is_error_pfn() check from THP adjust
  2019-11-11 22:12 [PATCH v2 0/3] KVM: MMU: Fix a refcount bug with ZONE_DEVICE pages Sean Christopherson
  2019-11-11 22:12 ` [PATCH v2 1/3] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved Sean Christopherson
@ 2019-11-11 22:12 ` Sean Christopherson
  2019-11-11 22:12 ` [PATCH v2 3/3] KVM: x86/mmu: Add helper to consolidate huge page promotion Sean Christopherson
  2 siblings, 0 replies; 8+ messages in thread
From: Sean Christopherson @ 2019-11-11 22:12 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Adam Borowski,
	David Hildenbrand, Dan Williams

Replace the is_error_noslot_pfn() check in transparent_hugepage_adjust()
with an is_noslot_pfn() check.  thp_adjust() cannot be reached with an
error pfn as it is always called after handle_abnormal_pfn(), which
aborts the page fault handler if an error pfn is encountered.  Don't
bother future-proofing thp_adjust() with a WARN on is_error_pfn(), as
calling thp_adjust() before handle_abnormal_pfn() is impossible for all
intents and purposes, e.g. thp_adjust() relies on being called after
mmu_notifier_retry() and while holding mmu_lock, thus moving it would
essentially require a complete rewrite of KVM's page fault handlers.

No functional change intended.
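
For reference, the pfn classification helpers involved (as found in
include/linux/kvm_host.h of this era; double-check the exact masks in
the tree at hand):

  #define KVM_PFN_ERR_MASK	(0x1ULL << 63)
  #define KVM_PFN_ERR_NOSLOT_MASK	(0x3ULL << 62)
  #define KVM_PFN_NOSLOT		(0x1ULL << 62)

  /* The gfn is in a slot, but translating it to a pfn failed. */
  static inline bool is_error_pfn(kvm_pfn_t pfn)
  {
  	return !!(pfn & KVM_PFN_ERR_MASK);
  }

  /* Either an error pfn or a noslot pfn. */
  static inline bool is_error_noslot_pfn(kvm_pfn_t pfn)
  {
  	return !!(pfn & KVM_PFN_ERR_NOSLOT_MASK);
  }

  /* The gfn is not backed by any memslot. */
  static inline bool is_noslot_pfn(kvm_pfn_t pfn)
  {
  	return pfn == KVM_PFN_NOSLOT;
  }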

Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bf82b1f2e834..c35c6fb2635a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3305,7 +3305,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
 	 * here.
 	 */
-	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
+	if (!is_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
 	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
 	    PageTransCompoundMap(pfn_to_page(pfn)) &&
 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
-- 
2.24.0



* [PATCH v2 3/3] KVM: x86/mmu: Add helper to consolidate huge page promotion
  2019-11-11 22:12 [PATCH v2 0/3] KVM: MMU: Fix a refcount bug with ZONE_DEVICE pages Sean Christopherson
  2019-11-11 22:12 ` [PATCH v2 1/3] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved Sean Christopherson
  2019-11-11 22:12 ` [PATCH v2 2/3] KVM: x86/mmu: Remove superfluous is_error_pfn() check from THP adjust Sean Christopherson
@ 2019-11-11 22:12 ` Sean Christopherson
  2 siblings, 0 replies; 8+ messages in thread
From: Sean Christopherson @ 2019-11-11 22:12 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Adam Borowski,
	David Hildenbrand, Dan Williams

Add a helper, kvm_is_hugepage_allowed(), to consolidate identical code
across transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte().

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c35c6fb2635a..95163f3abb24 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3292,6 +3292,12 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
 	return -EFAULT;
 }
 
+static bool kvm_is_hugepage_allowed(kvm_pfn_t pfn)
+{
+	return !kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn) &&
+	       PageTransCompoundMap(pfn_to_page(pfn));
+}
+
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 					gfn_t gfn, kvm_pfn_t *pfnp,
 					int *levelp)
@@ -3305,9 +3311,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
 	 * here.
 	 */
-	if (!is_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
-	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
-	    PageTransCompoundMap(pfn_to_page(pfn)) &&
+	if (!is_noslot_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
+	    kvm_is_hugepage_allowed(pfn) &&
 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
 		unsigned long mask;
 		/*
@@ -5914,9 +5919,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 * the guest, and the guest page table is using 4K page size
 		 * mapping if the indirect sp has level = 1.
 		 */
-		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
-		    !kvm_is_zone_device_pfn(pfn) &&
-		    PageTransCompoundMap(pfn_to_page(pfn))) {
+		if (sp->role.direct && kvm_is_hugepage_allowed(pfn)) {
 			pte_list_remove(rmap_head, sptep);
 
 			if (kvm_available_flush_tlb_with_range())
-- 
2.24.0



* Re: [PATCH v2 1/3] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-11 22:12 ` [PATCH v2 1/3] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved Sean Christopherson
@ 2019-11-11 22:20   ` Paolo Bonzini
  2019-11-11 22:39   ` Dan Williams
  2019-11-12  7:06   ` David Hildenbrand
  2 siblings, 0 replies; 8+ messages in thread
From: Paolo Bonzini @ 2019-11-11 22:20 UTC (permalink / raw)
  To: Sean Christopherson, Radim Krčmář
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Adam Borowski, David Hildenbrand, Dan Williams

On 11/11/19 23:12, Sean Christopherson wrote:
> Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
> instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
> like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
> pages, e.g. put pages grabbed via gup().  But for flows such as setting
> A/D bits or shifting refcounts for transparent huge pages, KVM needs to
> avoid processing ZONE_DEVICE pages as the flows in question lack the
> underlying machinery for proper handling of ZONE_DEVICE pages.
> 
> [... remainder of commit message and diff snipped ...]

Queued, thanks -- the other two will wait for after the merge window to
avoid pointless conflicts.

Paolo



* Re: [PATCH v2 1/3] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-11 22:12 ` [PATCH v2 1/3] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved Sean Christopherson
  2019-11-11 22:20   ` Paolo Bonzini
@ 2019-11-11 22:39   ` Dan Williams
  2019-11-11 22:43     ` Paolo Bonzini
  2019-11-12  7:06   ` David Hildenbrand
  2 siblings, 1 reply; 8+ messages in thread
From: Dan Williams @ 2019-11-11 22:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Mon, Nov 11, 2019 at 2:12 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> [... commit message snipped ...]
>
> Reported-by: Adam Borowski <kilobyte@angband.pl>
> Debugged-by: David Hildenbrand <david@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>

Acked-by: Dan Williams <dan.j.williams@intel.com>

> Cc: stable@vger.kernel.org

Perhaps add:

Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")

...since that was the first kernel that broke KVM's assumption about
which pfn types needed to have the reference count managed.


* Re: [PATCH v2 1/3] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-11 22:39   ` Dan Williams
@ 2019-11-11 22:43     ` Paolo Bonzini
  0 siblings, 0 replies; 8+ messages in thread
From: Paolo Bonzini @ 2019-11-11 22:43 UTC (permalink / raw)
  To: Dan Williams, Sean Christopherson
  Cc: Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On 11/11/19 23:39, Dan Williams wrote:
>> [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl
>>
>> Reported-by: Adam Borowski <kilobyte@angband.pl>
>> Debugged-by: David Hildenbrand <david@redhat.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> 
>> Cc: stable@vger.kernel.org
> Perhaps add:
> 
> Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
> 
> ...since that was the first kernel that broke KVM's assumption about
> which pfn types needed to have the reference count managed.
> 

Done, thanks!

Paolo



* Re: [PATCH v2 1/3] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-11 22:12 ` [PATCH v2 1/3] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved Sean Christopherson
  2019-11-11 22:20   ` Paolo Bonzini
  2019-11-11 22:39   ` Dan Williams
@ 2019-11-12  7:06   ` David Hildenbrand
  2 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2019-11-12  7:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Adam Borowski, David Hildenbrand, Dan Williams



> On 11.11.2019 at 23:12, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
> instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
> like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
> pages, e.g. put pages grabbed via gup().  But for flows such as setting
> A/D bits or shifting refcounts for transparent huge pages, KVM needs to
> avoid processing ZONE_DEVICE pages as the flows in question lack the
> underlying machinery for proper handling of ZONE_DEVICE pages.
> 
> [... rest of commit message and diffstat snipped ...]

Thanks for taking care of this! Other KVM-related code (PPC, vfio) also has reserved-page checks (see my series); I haven't yet looked at the details of how reserved pages are treated there. Will do so in the coming weeks, after the holidays.

Acked-by: David Hildenbrand <david@redhat.com>



