LKML Archive on lore.kernel.org
* [PATCH 0/2] KVM: MMU: Fix a refcount bug with ZONE_DEVICE pages
@ 2019-11-06 17:07 Sean Christopherson
  2019-11-06 17:07 ` [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved Sean Christopherson
  2019-11-06 17:07 ` [PATCH 2/2] KVM: x86/mmu: Add helper to consolidate huge page promotion Sean Christopherson
  0 siblings, 2 replies; 25+ messages in thread
From: Sean Christopherson @ 2019-11-06 17:07 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Adam Borowski,
	David Hildenbrand, Dan Williams

This mini-series fixes a suspected, but technically unconfirmed, bug in
KVM related to ZONE_DEVICE pages.  The suspected issue is that KVM treats
ZONE_DEVICE pages as reserved PFNs, and so doesn't put references to such
pages when releasing them via KVM's generic kvm_release_pfn_clean().
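For illustration, the suspected refcount imbalance can be modeled in a few
lines of plain userspace C. This is a sketch only: struct fake_page and both
is_reserved_*() helpers are invented stand-ins for KVM's PageReserved() /
is_zone_device_page() logic, not kernel code.

```c
#include <stdbool.h>

/* Toy model of a ZONE_DEVICE page: PG_reserved set, refcount pinned by gup(). */
struct fake_page {
    bool reserved;     /* PG_reserved */
    bool zone_device;  /* ZONE_DEVICE */
    int refcount;
};

/* Pre-fix check: any PG_reserved page counts as "reserved", ZONE_DEVICE included. */
static bool is_reserved_old(const struct fake_page *p)
{
    return p->reserved;
}

/* Post-fix check (patch 1/2): ZONE_DEVICE pages are exempted. */
static bool is_reserved_new(const struct fake_page *p)
{
    return p->reserved && !p->zone_device;
}

/* Model of kvm_release_pfn_clean(): put_page() happens only for non-reserved pages. */
static void release_pfn(struct fake_page *p,
                        bool (*is_reserved)(const struct fake_page *))
{
    if (!is_reserved(p))
        p->refcount--;  /* put_page() */
}
```

With the old check the pin taken by gup() is never dropped, which matches the
dev_pagemap_cleanup() hang described below.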

David Hildenbrand uncovered the bug during a discussion about removing
PG_reserved from ZONE_DEVICE pages, after Dan Williams pointed out[1] that
there was a bug report from Adam Borowski[2] that was likely related to
KVM's interaction with PageReserved().

Patch 1/2 contains the actual fix; patch 2/2 is a minor cleanup that is
mostly unrelated, but is dependent on, and was prompted by, the fix in
patch 1/2.

The fix itself is a bit more aggressive than what was proposed by David
and Dan, but I'm fairly confident it's the right direction for the long
term, and it also plays nice with the original PG_reserved removal series
that exposed the bug.

To be 100% clear, I haven't actually confirmed this fixes the bug reported
by Adam.

[1] https://lkml.kernel.org/r/01adb4cb-6092-638c-0bab-e61322be7cf5@redhat.com
[2] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl

Sean Christopherson (2):
  KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  KVM: x86/mmu: Add helper to consolidate huge page promotion

 arch/x86/kvm/mmu.c       | 15 +++++++++------
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 19 +++++++++++++++----
 3 files changed, 25 insertions(+), 10 deletions(-)

-- 
2.24.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-06 17:07 [PATCH 0/2] KVM: MMU: Fix a refcount bug with ZONE_DEVICE pages Sean Christopherson
@ 2019-11-06 17:07 ` Sean Christopherson
  2019-11-06 17:14   ` Paolo Bonzini
  2019-11-06 18:04   ` Dan Williams
  2019-11-06 17:07 ` [PATCH 2/2] KVM: x86/mmu: Add helper to consolidate huge page promotion Sean Christopherson
  1 sibling, 2 replies; 25+ messages in thread
From: Sean Christopherson @ 2019-11-06 17:07 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Adam Borowski,
	David Hildenbrand, Dan Williams

Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
pages, e.g. put pages grabbed via gup().  But KVM needs special handling
in other flows where ZONE_DEVICE pages lack the underlying machinery,
e.g. when setting accessed/dirty bits and shifting refcounts for
transparent huge pages.

This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
when running a KVM guest backed with /dev/dax memory, as KVM straight up
doesn't put any references to ZONE_DEVICE pages acquired by gup().

[*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl

Reported-by: Adam Borowski <kilobyte@angband.pl>
Debugged-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu.c       |  8 ++++----
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 19 +++++++++++++++----
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..bf82b1f2e834 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3306,7 +3306,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 * here.
 	 */
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
-	    level == PT_PAGE_TABLE_LEVEL &&
+	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
 	    PageTransCompoundMap(pfn_to_page(pfn)) &&
 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
 		unsigned long mask;
@@ -5914,9 +5914,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 * the guest, and the guest page table is using 4K page size
 		 * mapping if the indirect sp has level = 1.
 		 */
-		if (sp->role.direct &&
-			!kvm_is_reserved_pfn(pfn) &&
-			PageTransCompoundMap(pfn_to_page(pfn))) {
+		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
+		    !kvm_is_zone_device_pfn(pfn) &&
+		    PageTransCompoundMap(pfn_to_page(pfn))) {
 			pte_list_remove(rmap_head, sptep);
 
 			if (kvm_available_flush_tlb_with_range())
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a817e446c9aa..4ad1cd7d2d4d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -966,6 +966,7 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
+bool kvm_is_zone_device_pfn(kvm_pfn_t pfn);
 
 struct kvm_irq_ack_notifier {
 	struct hlist_node link;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b8534c6b8cf6..0a781b1fb8f0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,12 +151,23 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
+	/*
+	 * ZONE_DEVICE pages currently set PG_reserved, but from a refcounting
+	 * perspective they are "normal" pages, albeit with slightly different
+	 * usage rules.
+	 */
 	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+		return PageReserved(pfn_to_page(pfn)) &&
+		       !is_zone_device_page(pfn_to_page(pfn));
 
 	return true;
 }
 
+bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
+{
+	return pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn));
+}
+
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
@@ -1865,7 +1876,7 @@ EXPORT_SYMBOL_GPL(kvm_release_pfn_dirty);
 
 void kvm_set_pfn_dirty(kvm_pfn_t pfn)
 {
-	if (!kvm_is_reserved_pfn(pfn)) {
+	if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
 
 		SetPageDirty(page);
@@ -1875,14 +1886,14 @@ EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
 
 void kvm_set_pfn_accessed(kvm_pfn_t pfn)
 {
-	if (!kvm_is_reserved_pfn(pfn))
+	if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn))
 		mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
 
 void kvm_get_pfn(kvm_pfn_t pfn)
 {
-	if (!kvm_is_reserved_pfn(pfn))
+	if (!kvm_is_reserved_pfn(pfn) && !WARN_ON(kvm_is_zone_device_pfn(pfn)))
 		get_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_get_pfn);
-- 
2.24.0



* [PATCH 2/2] KVM: x86/mmu: Add helper to consolidate huge page promotion
  2019-11-06 17:07 [PATCH 0/2] KVM: MMU: Fix a refcount bug with ZONE_DEVICE pages Sean Christopherson
  2019-11-06 17:07 ` [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved Sean Christopherson
@ 2019-11-06 17:07 ` Sean Christopherson
  2019-11-06 17:22   ` Paolo Bonzini
  1 sibling, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2019-11-06 17:07 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Adam Borowski,
	David Hildenbrand, Dan Williams

Add a helper, kvm_is_hugepage_allowed(), to consolidate nearly identical
code between transparent_hugepage_adjust() and
kvm_mmu_zap_collapsible_spte().

Note, this effectively adds a superfluous is_error_noslot_pfn() check in
kvm_mmu_zap_collapsible_spte(), but the added overhead is essentially a
single uop on modern hardware, and taking the hit allows all the
PFN-related checks to be encapsulated in kvm_is_hugepage_allowed().

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bf82b1f2e834..8565653fce22 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3292,6 +3292,13 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
 	return -EFAULT;
 }
 
+static bool kvm_is_hugepage_allowed(kvm_pfn_t pfn)
+{
+	return !is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
+	       !kvm_is_zone_device_pfn(pfn) &&
+	       PageTransCompoundMap(pfn_to_page(pfn));
+}
+
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 					gfn_t gfn, kvm_pfn_t *pfnp,
 					int *levelp)
@@ -3305,9 +3312,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
 	 * here.
 	 */
-	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
-	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
-	    PageTransCompoundMap(pfn_to_page(pfn)) &&
+	if (level == PT_PAGE_TABLE_LEVEL && kvm_is_hugepage_allowed(pfn) &&
 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
 		unsigned long mask;
 		/*
@@ -5914,9 +5919,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 * the guest, and the guest page table is using 4K page size
 		 * mapping if the indirect sp has level = 1.
 		 */
-		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
-		    !kvm_is_zone_device_pfn(pfn) &&
-		    PageTransCompoundMap(pfn_to_page(pfn))) {
+		if (sp->role.direct && kvm_is_hugepage_allowed(pfn)) {
 			pte_list_remove(rmap_head, sptep);
 
 			if (kvm_available_flush_tlb_with_range())
-- 
2.24.0



* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-06 17:07 ` [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved Sean Christopherson
@ 2019-11-06 17:14   ` Paolo Bonzini
  2019-11-06 17:46     ` Sean Christopherson
  2019-11-06 18:04   ` Dan Williams
  1 sibling, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2019-11-06 17:14 UTC (permalink / raw)
  To: Sean Christopherson, Radim Krčmář
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Adam Borowski, David Hildenbrand, Dan Williams

On 06/11/19 18:07, Sean Christopherson wrote:
>  void kvm_get_pfn(kvm_pfn_t pfn)
>  {
> -	if (!kvm_is_reserved_pfn(pfn))
> +	if (!kvm_is_reserved_pfn(pfn) && !WARN_ON(kvm_is_zone_device_pfn(pfn)))
>  		get_page(pfn_to_page(pfn));
>  }
>  EXPORT_SYMBOL_GPL(kvm_get_pfn);

Can you call remap_pfn_range with a source address that is ZONE_DEVICE?
If so, you would get a WARN from the kvm_get_pfn call in
hva_to_pfn_remapped.

Paolo


* Re: [PATCH 2/2] KVM: x86/mmu: Add helper to consolidate huge page promotion
  2019-11-06 17:07 ` [PATCH 2/2] KVM: x86/mmu: Add helper to consolidate huge page promotion Sean Christopherson
@ 2019-11-06 17:22   ` Paolo Bonzini
  0 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2019-11-06 17:22 UTC (permalink / raw)
  To: Sean Christopherson, Radim Krčmář
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Adam Borowski, David Hildenbrand, Dan Williams

On 06/11/19 18:07, Sean Christopherson wrote:
>  	 */
> -	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> -	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> -	    PageTransCompoundMap(pfn_to_page(pfn)) &&
> +	if (level == PT_PAGE_TABLE_LEVEL && kvm_is_hugepage_allowed(pfn) &&
>  	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>  		unsigned long mask;
>  		/*
> @@ -5914,9 +5919,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  		 * the guest, and the guest page table is using 4K page size
>  		 * mapping if the indirect sp has level = 1.
>  		 */
> -		if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> -		    !kvm_is_zone_device_pfn(pfn) &&
> -		    PageTransCompoundMap(pfn_to_page(pfn))) {
> +		if (sp->role.direct && kvm_is_hugepage_allowed(pfn)) {
>  			pte_list_remove(rmap_head, sptep);

I don't think is_error_noslot_pfn(pfn) makes sense in
kvm_mmu_zap_collapsible_spte, so I'd rather keep it in
transparent_hugepage_adjust.  Actually, only is_noslot_pfn can be true at
this point---error pfns have already been sieved out earlier in
handle_abnormal_pfn---so perhaps

	if (WARN_ON_ONCE(is_error_pfn(pfn)) || is_noslot_pfn(pfn))
		return;

	if (level == PT_PAGE_TABLE_LEVEL &&
	    kvm_is_hugepage_allowed(pfn) &&
	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL))

would be the best option.

Paolo


* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-06 17:14   ` Paolo Bonzini
@ 2019-11-06 17:46     ` Sean Christopherson
  0 siblings, 0 replies; 25+ messages in thread
From: Sean Christopherson @ 2019-11-06 17:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Adam Borowski, David Hildenbrand, Dan Williams

On Wed, Nov 06, 2019 at 06:14:40PM +0100, Paolo Bonzini wrote:
> On 06/11/19 18:07, Sean Christopherson wrote:
> >  void kvm_get_pfn(kvm_pfn_t pfn)
> >  {
> > -	if (!kvm_is_reserved_pfn(pfn))
> > +	if (!kvm_is_reserved_pfn(pfn) && !WARN_ON(kvm_is_zone_device_pfn(pfn)))
> >  		get_page(pfn_to_page(pfn));
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_get_pfn);
> 
> Can you call remap_pfn_range with a source address that is ZONE_DEVICE?
>  If so, you would get a WARN from the kvm_get_pfn call in
> hva_to_pfn_remapped.

I don't know, at a quick glance I'm guessing it's possible via /dev/mem?
But, get_page() isn't sufficient to properly grab a ZONE_DEVICE page, so
the WARN is a good thing and intentional.

So assuming the answer is "yes", perhaps hva_to_pfn_remapped() should
add its own check on kvm_is_zone_device_pfn() and return -EINVAL or
something?  That'd likely end up killing the guest, but wouldn't break
the kernel.
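As a userspace sketch of that direction (the pfn threshold and the *_model()
function are invented for illustration; only the kvm_is_zone_device_pfn()
name mirrors the patch, and its body here is a mock):

```c
#include <errno.h>
#include <stdbool.h>

typedef unsigned long kvm_pfn_t;

/* Mock: pretend pfns at or above this threshold map ZONE_DEVICE memory. */
#define FAKE_FIRST_ZONE_DEVICE_PFN 0x100000UL

static bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
{
    return pfn >= FAKE_FIRST_ZONE_DEVICE_PFN;
}

/*
 * Model of the suggested hardening: rather than reaching kvm_get_pfn()
 * and tripping its WARN, reject the translation up front.  Returning
 * -EINVAL would likely kill the guest, but leaves the kernel's page
 * metadata consistent.
 */
static int hva_to_pfn_remapped_model(kvm_pfn_t pfn, kvm_pfn_t *out)
{
    if (kvm_is_zone_device_pfn(pfn))
        return -EINVAL;
    *out = pfn;  /* real code would kvm_get_pfn(pfn) here */
    return 0;
}
```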


* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-06 17:07 ` [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved Sean Christopherson
  2019-11-06 17:14   ` Paolo Bonzini
@ 2019-11-06 18:04   ` Dan Williams
  2019-11-06 20:26     ` Sean Christopherson
  2019-11-06 21:09     ` Paolo Bonzini
  1 sibling, 2 replies; 25+ messages in thread
From: Dan Williams @ 2019-11-06 18:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Wed, Nov 6, 2019 at 9:07 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
> instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
> like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
> pages, e.g. put pages grabbed via gup().  But KVM needs special handling
> in other flows where ZONE_DEVICE pages lack the underlying machinery,
> e.g. when setting accessed/dirty bits and shifting refcounts for
> transparent huge pages.
>
> This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
> when running a KVM guest backed with /dev/dax memory, as KVM straight up
> doesn't put any references to ZONE_DEVICE pages acquired by gup().
>
> [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl
>
> Reported-by: Adam Borowski <kilobyte@angband.pl>
> Debugged-by: David Hildenbrand <david@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/mmu.c       |  8 ++++----
>  include/linux/kvm_host.h |  1 +
>  virt/kvm/kvm_main.c      | 19 +++++++++++++++----
>  3 files changed, 20 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 24c23c66b226..bf82b1f2e834 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3306,7 +3306,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>          * here.
>          */
>         if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> -           level == PT_PAGE_TABLE_LEVEL &&
> +           !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
>             PageTransCompoundMap(pfn_to_page(pfn)) &&
>             !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>                 unsigned long mask;
> @@ -5914,9 +5914,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>                  * the guest, and the guest page table is using 4K page size
>                  * mapping if the indirect sp has level = 1.
>                  */
> -               if (sp->role.direct &&
> -                       !kvm_is_reserved_pfn(pfn) &&
> -                       PageTransCompoundMap(pfn_to_page(pfn))) {
> +               if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> +                   !kvm_is_zone_device_pfn(pfn) &&
> +                   PageTransCompoundMap(pfn_to_page(pfn))) {
>                         pte_list_remove(rmap_head, sptep);
>
>                         if (kvm_available_flush_tlb_with_range())
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a817e446c9aa..4ad1cd7d2d4d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -966,6 +966,7 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
>  void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
>
>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
> +bool kvm_is_zone_device_pfn(kvm_pfn_t pfn);
>
>  struct kvm_irq_ack_notifier {
>         struct hlist_node link;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b8534c6b8cf6..0a781b1fb8f0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -151,12 +151,23 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>
>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>  {
> +       /*
> +        * ZONE_DEVICE pages currently set PG_reserved, but from a refcounting
> +        * perspective they are "normal" pages, albeit with slightly different
> +        * usage rules.
> +        */
>         if (pfn_valid(pfn))
> -               return PageReserved(pfn_to_page(pfn));
> +               return PageReserved(pfn_to_page(pfn)) &&
> +                      !is_zone_device_page(pfn_to_page(pfn));

This is racy unless you can be certain that the pfn and resulting page
has already been pinned by get_user_pages(). This is why I told David
to steer clear of adding more is_zone_device_page() usage, it's
difficult to audit. Without an existing pin the metadata to determine
whether a page is ZONE_DEVICE or not could be in the process of being
torn down. Ideally KVM would pass around a struct { struct page *page,
unsigned long pfn } tuple so it would not have to do this "recall
context" dance on every pfn operation.
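A minimal sketch of the tuple Dan suggests (the struct and helper names are
hypothetical, not an existing KVM API; struct page is left opaque as it
would be outside mm code):

```c
#include <stddef.h>

struct page;  /* opaque here; real code would use the kernel's struct page */

/*
 * Sketch of the suggested tuple: resolve the struct page once, at lookup
 * time while the gup() pin is held, so later refcount operations consult
 * the cached page pointer instead of re-deriving metadata from a bare pfn
 * whose backing may be getting torn down concurrently.
 */
struct kvm_pfn_page {
    struct page *page;  /* NULL if the pfn has no refcounted struct page */
    unsigned long pfn;
};

/* Only pfns that were resolved to a pinned page participate in refcounting. */
static int kvm_pfn_page_has_page(const struct kvm_pfn_page *pp)
{
    return pp->page != NULL;
}
```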


* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-06 18:04   ` Dan Williams
@ 2019-11-06 20:26     ` Sean Christopherson
  2019-11-06 20:34       ` Dan Williams
  2019-11-06 21:09     ` Paolo Bonzini
  1 sibling, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2019-11-06 20:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Wed, Nov 06, 2019 at 10:04:22AM -0800, Dan Williams wrote:
> On Wed, Nov 6, 2019 at 9:07 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
> > instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
> > like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
> > pages, e.g. put pages grabbed via gup().  But KVM needs special handling
> > in other flows where ZONE_DEVICE pages lack the underlying machinery,
> > e.g. when setting accessed/dirty bits and shifting refcounts for
> > transparent huge pages.
> >
> > This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
> > when running a KVM guest backed with /dev/dax memory, as KVM straight up
> > doesn't put any references to ZONE_DEVICE pages acquired by gup().
> >
> > [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl
> >
> > Reported-by: Adam Borowski <kilobyte@angband.pl>
> > Debugged-by: David Hildenbrand <david@redhat.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> >  arch/x86/kvm/mmu.c       |  8 ++++----
> >  include/linux/kvm_host.h |  1 +
> >  virt/kvm/kvm_main.c      | 19 +++++++++++++++----
> >  3 files changed, 20 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 24c23c66b226..bf82b1f2e834 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3306,7 +3306,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> >          * here.
> >          */
> >         if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> > -           level == PT_PAGE_TABLE_LEVEL &&
> > +           !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> >             PageTransCompoundMap(pfn_to_page(pfn)) &&
> >             !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
> >                 unsigned long mask;
> > @@ -5914,9 +5914,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >                  * the guest, and the guest page table is using 4K page size
> >                  * mapping if the indirect sp has level = 1.
> >                  */
> > -               if (sp->role.direct &&
> > -                       !kvm_is_reserved_pfn(pfn) &&
> > -                       PageTransCompoundMap(pfn_to_page(pfn))) {
> > +               if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> > +                   !kvm_is_zone_device_pfn(pfn) &&
> > +                   PageTransCompoundMap(pfn_to_page(pfn))) {
> >                         pte_list_remove(rmap_head, sptep);
> >
> >                         if (kvm_available_flush_tlb_with_range())
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index a817e446c9aa..4ad1cd7d2d4d 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -966,6 +966,7 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
> >  void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
> >
> >  bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
> > +bool kvm_is_zone_device_pfn(kvm_pfn_t pfn);
> >
> >  struct kvm_irq_ack_notifier {
> >         struct hlist_node link;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index b8534c6b8cf6..0a781b1fb8f0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -151,12 +151,23 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> >
> >  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
> >  {
> > +       /*
> > +        * ZONE_DEVICE pages currently set PG_reserved, but from a refcounting
> > +        * perspective they are "normal" pages, albeit with slightly different
> > +        * usage rules.
> > +        */
> >         if (pfn_valid(pfn))
> > -               return PageReserved(pfn_to_page(pfn));
> > +               return PageReserved(pfn_to_page(pfn)) &&
> > +                      !is_zone_device_page(pfn_to_page(pfn));
> 
> This is racy unless you can be certain that the pfn and resulting page
> has already been pinned by get_user_pages(). This is why I told David
> to steer clear of adding more is_zone_device_page() usage, it's
> difficult to audit. Without an existing pin the metadata to determine
> whether a page is ZONE_DEVICE or not could be in the process of being
> torn down.

Ouch, that's brutal.  Makes perfect sense why you'd want to avoid
is_zone_device_page().

> Ideally KVM would pass around a struct { struct page *page,
> unsigned long pfn } tuple so it would not have to do this "recall
> context" dance on every pfn operation.

Implementing something of that nature is doable, but it's going to be a
significant overhaul as it will require updating every architecture.  So
yeah, let's go with David's suggestion of fixing the immediate bug with
your quick and dirty change and punting a more complete rework to future
cleanup.  There are still multiple KVM flows that don't correctly handle
ZONE_DEVICE, especially outside of x86, but I don't think there's a quick
fix for those issues.


* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-06 20:26     ` Sean Christopherson
@ 2019-11-06 20:34       ` Dan Williams
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2019-11-06 20:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Wed, Nov 6, 2019 at 12:26 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Wed, Nov 06, 2019 at 10:04:22AM -0800, Dan Williams wrote:
> > On Wed, Nov 6, 2019 at 9:07 AM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > >
> > > Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
> > > instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
> > > like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
> > > pages, e.g. put pages grabbed via gup().  But KVM needs special handling
> > > in other flows where ZONE_DEVICE pages lack the underlying machinery,
> > > e.g. when setting accessed/dirty bits and shifting refcounts for
> > > transparent huge pages.
> > >
> > > This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
> > > when running a KVM guest backed with /dev/dax memory, as KVM straight up
> > > doesn't put any references to ZONE_DEVICE pages acquired by gup().
> > >
> > > [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl
> > >
> > > Reported-by: Adam Borowski <kilobyte@angband.pl>
> > > Debugged-by: David Hildenbrand <david@redhat.com>
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > ---
> > >  arch/x86/kvm/mmu.c       |  8 ++++----
> > >  include/linux/kvm_host.h |  1 +
> > >  virt/kvm/kvm_main.c      | 19 +++++++++++++++----
> > >  3 files changed, 20 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > index 24c23c66b226..bf82b1f2e834 100644
> > > --- a/arch/x86/kvm/mmu.c
> > > +++ b/arch/x86/kvm/mmu.c
> > > @@ -3306,7 +3306,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > >          * here.
> > >          */
> > >         if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> > > -           level == PT_PAGE_TABLE_LEVEL &&
> > > +           !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> > >             PageTransCompoundMap(pfn_to_page(pfn)) &&
> > >             !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
> > >                 unsigned long mask;
> > > @@ -5914,9 +5914,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> > >                  * the guest, and the guest page table is using 4K page size
> > >                  * mapping if the indirect sp has level = 1.
> > >                  */
> > > -               if (sp->role.direct &&
> > > -                       !kvm_is_reserved_pfn(pfn) &&
> > > -                       PageTransCompoundMap(pfn_to_page(pfn))) {
> > > +               if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> > > +                   !kvm_is_zone_device_pfn(pfn) &&
> > > +                   PageTransCompoundMap(pfn_to_page(pfn))) {
> > >                         pte_list_remove(rmap_head, sptep);
> > >
> > >                         if (kvm_available_flush_tlb_with_range())
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index a817e446c9aa..4ad1cd7d2d4d 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -966,6 +966,7 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
> > >  void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
> > >
> > >  bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
> > > +bool kvm_is_zone_device_pfn(kvm_pfn_t pfn);
> > >
> > >  struct kvm_irq_ack_notifier {
> > >         struct hlist_node link;
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index b8534c6b8cf6..0a781b1fb8f0 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -151,12 +151,23 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > >
> > >  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
> > >  {
> > > +       /*
> > > +        * ZONE_DEVICE pages currently set PG_reserved, but from a refcounting
> > > +        * perspective they are "normal" pages, albeit with slightly different
> > > +        * usage rules.
> > > +        */
> > >         if (pfn_valid(pfn))
> > > -               return PageReserved(pfn_to_page(pfn));
> > > +               return PageReserved(pfn_to_page(pfn)) &&
> > > +                      !is_zone_device_page(pfn_to_page(pfn));
> >
> > This is racy unless you can be certain that the pfn and resulting page
> > has already been pinned by get_user_pages(). This is why I told David
> > to steer clear of adding more is_zone_device_page() usage, it's
> > difficult to audit. Without an existing pin the metadata to determine
> > whether a page is ZONE_DEVICE or not could be in the process of being
> > torn down.
>
> Ouch, that's brutal.  Makes perfect sense why you'd want to avoid
> is_zone_device_page().
>
> > Ideally KVM would pass around a struct { struct page *page,
> > unsigned long pfn } tuple so it would not have to do this "recall
> > context" dance on every pfn operation.
>
> Implementing something of that nature is doable, but it's going to be a
> significant overhaul as it will require updating every architecture.  So
> yeah, let's go with David's suggestion of fixing the immediate bug with
> your quick and dirty change and punting a more complete rework to future
> cleanup.  There are still multiple KVM flows that don't correctly handle
> ZONE_DEVICE, especially outside of x86, but I don't think there's a quick
> fix for those issues.

Sounds like a plan; I'll rework that fix with the change you suggested
and submit a formal one.


* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-06 18:04   ` Dan Williams
  2019-11-06 20:26     ` Sean Christopherson
@ 2019-11-06 21:09     ` Paolo Bonzini
  2019-11-06 21:30       ` Sean Christopherson
  2019-11-06 23:20       ` Dan Williams
  1 sibling, 2 replies; 25+ messages in thread
From: Paolo Bonzini @ 2019-11-06 21:09 UTC (permalink / raw)
  To: Dan Williams, Sean Christopherson
  Cc: Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On 06/11/19 19:04, Dan Williams wrote:
> On Wed, Nov 6, 2019 at 9:07 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
>>
>> Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
>> instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
>> like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
>> pages, e.g. put pages grabbed via gup().  But KVM needs special handling
>> in other flows where ZONE_DEVICE pages lack the underlying machinery,
>> e.g. when setting accessed/dirty bits and shifting refcounts for
>> transparent huge pages.
>>
>> This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
>> when running a KVM guest backed with /dev/dax memory, as KVM straight up
>> doesn't put any references to ZONE_DEVICE pages acquired by gup().
>>
>> [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl
>>
>> Reported-by: Adam Borowski <kilobyte@angband.pl>
>> Debugged-by: David Hildenbrand <david@redhat.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> ---
>>  arch/x86/kvm/mmu.c       |  8 ++++----
>>  include/linux/kvm_host.h |  1 +
>>  virt/kvm/kvm_main.c      | 19 +++++++++++++++----
>>  3 files changed, 20 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 24c23c66b226..bf82b1f2e834 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -3306,7 +3306,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>          * here.
>>          */
>>         if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
>> -           level == PT_PAGE_TABLE_LEVEL &&
>> +           !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
>>             PageTransCompoundMap(pfn_to_page(pfn)) &&
>>             !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>>                 unsigned long mask;
>> @@ -5914,9 +5914,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>>                  * the guest, and the guest page table is using 4K page size
>>                  * mapping if the indirect sp has level = 1.
>>                  */
>> -               if (sp->role.direct &&
>> -                       !kvm_is_reserved_pfn(pfn) &&
>> -                       PageTransCompoundMap(pfn_to_page(pfn))) {
>> +               if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
>> +                   !kvm_is_zone_device_pfn(pfn) &&
>> +                   PageTransCompoundMap(pfn_to_page(pfn))) {
>>                         pte_list_remove(rmap_head, sptep);
>>
>>                         if (kvm_available_flush_tlb_with_range())
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index a817e446c9aa..4ad1cd7d2d4d 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -966,6 +966,7 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
>>  void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
>>
>>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
>> +bool kvm_is_zone_device_pfn(kvm_pfn_t pfn);
>>
>>  struct kvm_irq_ack_notifier {
>>         struct hlist_node link;
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index b8534c6b8cf6..0a781b1fb8f0 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -151,12 +151,23 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>>
>>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>>  {
>> +       /*
>> +        * ZONE_DEVICE pages currently set PG_reserved, but from a refcounting
>> +        * perspective they are "normal" pages, albeit with slightly different
>> +        * usage rules.
>> +        */
>>         if (pfn_valid(pfn))
>> -               return PageReserved(pfn_to_page(pfn));
>> +               return PageReserved(pfn_to_page(pfn)) &&
>> +                      !is_zone_device_page(pfn_to_page(pfn));
> 
> This is racy unless you can be certain that the pfn and resulting page
> has already been pinned by get_user_pages().

What is the race exactly?

In general KVM does not use pfn's until after having gotten them from
get_user_pages (or follow_pfn for VM_IO | VM_PFNMAP vmas, for which
get_user_pages fails, but this is not an issue here).  It then creates
the page tables and releases the reference to the struct page.

Anything else happens _after_ the reference has been released, but still
from an mmu notifier; this is why KVM uses pfn_to_page quite pervasively.

If this is enough to avoid races, then I prefer Sean's patch.  If it is
racy, we need to fix kvm_set_pfn_accessed and kvm_set_pfn_dirty first,
and second at transparent_hugepage_adjust and kvm_mmu_zap_collapsible_spte:

- if accessed/dirty state need not be tracked properly for ZONE_DEVICE,
then I suppose David's patch is okay (though I'd like to have a big
comment explaining all the things that went on in these emails).  If
they need to work, however, Sean's patch 1 is the right thing to do.

- if we need Sean's patch 1, but it is racy to use is_zone_device_page,
we could first introduce a helper similar to kvm_is_hugepage_allowed()
from his patch 2, but using pfn_to_online_page() to filter out
ZONE_DEVICE pages.  This would cover both transparent_hugepage_adjust
and kvm_mmu_zap_collapsible_spte.

> This is why I told David
> to steer clear of adding more is_zone_device_page() usage, it's
> difficult to audit. Without an existing pin the metadata to determine
> whether a page is ZONE_DEVICE or not could be in the process of being
> torn down. Ideally KVM would pass around a struct { struct page *page,
> unsigned long pfn } tuple so it would not have to do this "recall
> context" dance on every pfn operation.

Unfortunately once KVM has created its own page tables, the struct page*
reference is lost, as the PFN is the only thing that is stored in there.

Paolo

* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-06 21:09     ` Paolo Bonzini
@ 2019-11-06 21:30       ` Sean Christopherson
  2019-11-06 23:20       ` Dan Williams
  1 sibling, 0 replies; 25+ messages in thread
From: Sean Christopherson @ 2019-11-06 21:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Dan Williams, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Wed, Nov 06, 2019 at 10:09:29PM +0100, Paolo Bonzini wrote:
> On 06/11/19 19:04, Dan Williams wrote:
> > On Wed, Nov 6, 2019 at 9:07 AM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > This is racy unless you can be certain that the pfn and resulting page
> > has already been pinned by get_user_pages().
> 
> What is the race exactly?

The check in kvm_get_pfn() is guaranteed to be racy, but that's already
broken with respect to ZONE_DEVICE.
 
> In general KVM does not use pfn's until after having gotten them from
> get_user_pages (or follow_pfn for VM_IO | VM_PFNMAP vmas, for which
> get_user_pages fails, but this is not an issue here).  It then creates
> the page tables and releases the reference to the struct page.
> 
> Anything else happens _after_ the reference has been released, but still
> from an mmu notifier; this is why KVM uses pfn_to_page quite pervasively.
> 
> If this is enough to avoid races, then I prefer Sean's patch.  If it is
> racy, we need to fix kvm_set_pfn_accessed and kvm_set_pfn_dirty first,
> and second at transparent_hugepage_adjust and kvm_mmu_zap_collapsible_spte:

transparent_hugepage_adjust() would be ok, as it's called before the original
reference is put back.

> - if accessed/dirty state need not be tracked properly for ZONE_DEVICE,
> then I suppose David's patch is okay (though I'd like to have a big
> comment explaining all the things that went on in these emails).  If
> they need to work, however, Sean's patch 1 is the right thing to do.
> 
> - if we need Sean's patch 1, but it is racy to use is_zone_device_page,
> we could first introduce a helper similar to kvm_is_hugepage_allowed()
> from his patch 2, but using pfn_to_online_page() to filter out
> ZONE_DEVICE pages.  This would cover both transparent_hugepage_adjust
> and kvm_mmu_zap_collapsible_spte.

After more thought, I agree with Paolo that adding kvm_is_zone_device_pfn()
is preferable if blocking with mmu_notifier is sufficient to avoid races.
The only caller that isn't protected by mmu_notifier (or holding a gup()-
obtained reference) is kvm_get_pfn(), but again, that's completely borked
anyway.  Adding a WARN_ON(page_count()) in kvm_is_zone_device_pfn() would
help with the auditing issue.

The scope of the changes required to completely avoid is_zone_device_page()
is massive, and probably would result in additional trickery :-(

> > This is why I told David
> > to steer clear of adding more is_zone_device_page() usage, it's
> > difficult to audit. Without an existing pin the metadata to determine
> > whether a page is ZONE_DEVICE or not could be in the process of being
> > torn down. Ideally KVM would pass around a struct { struct page *page,
> > unsigned long pfn } tuple so it would not have to do this "recall
> > context" dance on every pfn operation.
> 
> Unfortunately once KVM has created its own page tables, the struct page*
> reference is lost, as the PFN is the only thing that is stored in there.

Ya, the horrible idea I had in mind was to steal a bit from KVM_PFN_ERR_MASK
to track whether or not the PFN is associated with a struct page.

* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-06 21:09     ` Paolo Bonzini
  2019-11-06 21:30       ` Sean Christopherson
@ 2019-11-06 23:20       ` Dan Williams
  2019-11-06 23:39         ` Sean Christopherson
  1 sibling, 1 reply; 25+ messages in thread
From: Dan Williams @ 2019-11-06 23:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Wed, Nov 6, 2019 at 1:10 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 06/11/19 19:04, Dan Williams wrote:
> > On Wed, Nov 6, 2019 at 9:07 AM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> >>
> >> Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
> >> instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
> >> like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
> >> pages, e.g. put pages grabbed via gup().  But KVM needs special handling
> >> in other flows where ZONE_DEVICE pages lack the underlying machinery,
> >> e.g. when setting accessed/dirty bits and shifting refcounts for
> >> transparent huge pages.
> >>
> >> This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
> >> when running a KVM guest backed with /dev/dax memory, as KVM straight up
> >> doesn't put any references to ZONE_DEVICE pages acquired by gup().
> >>
> >> [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl
> >>
> >> Reported-by: Adam Borowski <kilobyte@angband.pl>
> >> Debugged-by: David Hildenbrand <david@redhat.com>
> >> Cc: Dan Williams <dan.j.williams@intel.com>
> >> Cc: stable@vger.kernel.org
> >> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> >> ---
> >>  arch/x86/kvm/mmu.c       |  8 ++++----
> >>  include/linux/kvm_host.h |  1 +
> >>  virt/kvm/kvm_main.c      | 19 +++++++++++++++----
> >>  3 files changed, 20 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 24c23c66b226..bf82b1f2e834 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -3306,7 +3306,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> >>          * here.
> >>          */
> >>         if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> >> -           level == PT_PAGE_TABLE_LEVEL &&
> >> +           !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> >>             PageTransCompoundMap(pfn_to_page(pfn)) &&
> >>             !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
> >>                 unsigned long mask;
> >> @@ -5914,9 +5914,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >>                  * the guest, and the guest page table is using 4K page size
> >>                  * mapping if the indirect sp has level = 1.
> >>                  */
> >> -               if (sp->role.direct &&
> >> -                       !kvm_is_reserved_pfn(pfn) &&
> >> -                       PageTransCompoundMap(pfn_to_page(pfn))) {
> >> +               if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> >> +                   !kvm_is_zone_device_pfn(pfn) &&
> >> +                   PageTransCompoundMap(pfn_to_page(pfn))) {
> >>                         pte_list_remove(rmap_head, sptep);
> >>
> >>                         if (kvm_available_flush_tlb_with_range())
> >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> index a817e446c9aa..4ad1cd7d2d4d 100644
> >> --- a/include/linux/kvm_host.h
> >> +++ b/include/linux/kvm_host.h
> >> @@ -966,6 +966,7 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
> >>  void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
> >>
> >>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
> >> +bool kvm_is_zone_device_pfn(kvm_pfn_t pfn);
> >>
> >>  struct kvm_irq_ack_notifier {
> >>         struct hlist_node link;
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index b8534c6b8cf6..0a781b1fb8f0 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -151,12 +151,23 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> >>
> >>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
> >>  {
> >> +       /*
> >> +        * ZONE_DEVICE pages currently set PG_reserved, but from a refcounting
> >> +        * perspective they are "normal" pages, albeit with slightly different
> >> +        * usage rules.
> >> +        */
> >>         if (pfn_valid(pfn))
> >> -               return PageReserved(pfn_to_page(pfn));
> >> +               return PageReserved(pfn_to_page(pfn)) &&
> >> +                      !is_zone_device_page(pfn_to_page(pfn));
> >
> > This is racy unless you can be certain that the pfn and resulting page
> > has already been pinned by get_user_pages().
>
> What is the race exactly?

The race is that nothing protects the page from being destroyed while
is_zone_device_page() is running. All other pages obtained from the
page allocator have an existing reference from the core-mm; otherwise
KVM would be touching free memory. In the case of ZONE_DEVICE the page
that KVM is looking up is only alive while the underlying device is
pinned. GUP will take the device reference and check that the device
is not already dying, where a pfn_valid() + get_page() sequence will
not.

>
> In general KVM does not use pfn's until after having gotten them from
> get_user_pages (or follow_pfn for VM_IO | VM_PFNMAP vmas, for which
> get_user_pages fails, but this is not an issue here).  It then creates
> the page tables and releases the reference to the struct page.

In this case I think it would be better to treat ZONE_DEVICE like the
VM_IO and VM_PFNMAP case, and never take a reference in the first
instance. This keeps the existing correct usages of
kvm_is_reserved_pfn() intact, and avoids the broken case of a missing
put_page().

> Anything else happens _after_ the reference has been released, but still
> from an mmu notifier; this is why KVM uses pfn_to_page quite pervasively.
>
> If this is enough to avoid races, then I prefer Sean's patch.  If it is
> racy, we need to fix kvm_set_pfn_accessed and kvm_set_pfn_dirty first,
> and second at transparent_hugepage_adjust and kvm_mmu_zap_collapsible_spte:

Those are already ok due to kvm_is_reserved_pfn() == true being the
right answer for ZONE_DEVICE.

> - if accessed/dirty state need not be tracked properly for ZONE_DEVICE,
> then I suppose David's patch is okay (though I'd like to have a big
> comment explaining all the things that went on in these emails).  If
> they need to work, however, Sean's patch 1 is the right thing to do.

They can't work, there's no core-mm managing ZONE_DEVICE pages, so
PageDirty() and THP housekeeping have no meaning.

> - if we need Sean's patch 1, but it is racy to use is_zone_device_page,
> we could first introduce a helper similar to kvm_is_hugepage_allowed()
> from his patch 2, but using pfn_to_online_page() to filter out
> ZONE_DEVICE pages.  This would cover both transparent_hugepage_adjust
> and kvm_mmu_zap_collapsible_spte.

After some more thought I'd feel more comfortable just collapsing the
ZONE_DEVICE case into the VM_IO/VM_PFNMAP case. I.e. with something
like this (untested) that just drops the reference immediately and let
kvm_is_reserved_pfn() do the right thing going forward.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d6f0696d98ef..d21689e2b4eb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1464,6 +1464,14 @@ static bool hva_to_pfn_fast(unsigned long addr,
bool write_fault,
        npages = __get_user_pages_fast(addr, 1, 1, page);
        if (npages == 1) {
                *pfn = page_to_pfn(page[0]);
+               /*
+                * ZONE_DEVICE pages are effectively VM_IO/VM_PFNMAP as
+                * far as KVM is concerned kvm_is_reserved_pfn() will
+                * prevent further unnecessary page management on this
+                * page.
+                */
+               if (is_zone_device_page(page[0]))
+                       put_page(page[0]);

                if (writable)
                        *writable = true;
@@ -1509,6 +1517,11 @@ static int hva_to_pfn_slow(unsigned long addr,
bool *async, bool write_fault,
                }
        }
        *pfn = page_to_pfn(page);
+
+       /* See comment in hva_to_pfn_fast. */
+       if (is_zone_device_page(page[0]))
+               put_page(page[0]);
+
        return npages;
 }

* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-06 23:20       ` Dan Williams
@ 2019-11-06 23:39         ` Sean Christopherson
  2019-11-07  0:01           ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2019-11-06 23:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Wed, Nov 06, 2019 at 03:20:11PM -0800, Dan Williams wrote:
> After some more thought I'd feel more comfortable just collapsing the
> ZONE_DEVICE case into the VM_IO/VM_PFNMAP case. I.e. with something
> like this (untested) that just drops the reference immediately and let
> kvm_is_reserved_pfn() do the right thing going forward.

This will break the page fault flow, as it will allow the page to be
whacked before KVM can ensure it will get proper notification from the
mmu_notifier.  E.g. KVM would install the PFN in its secondary MMU after
getting the invalidate notification for the PFN.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d6f0696d98ef..d21689e2b4eb 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1464,6 +1464,14 @@ static bool hva_to_pfn_fast(unsigned long addr,
> bool write_fault,
>         npages = __get_user_pages_fast(addr, 1, 1, page);
>         if (npages == 1) {
>                 *pfn = page_to_pfn(page[0]);
> +               /*
> +                * ZONE_DEVICE pages are effectively VM_IO/VM_PFNMAP as
> +                * far as KVM is concerned kvm_is_reserved_pfn() will
> +                * prevent further unnecessary page management on this
> +                * page.
> +                */
> +               if (is_zone_device_page(page[0]))
> +                       put_page(page[0]);
> 
>                 if (writable)
>                         *writable = true;
> @@ -1509,6 +1517,11 @@ static int hva_to_pfn_slow(unsigned long addr,
> bool *async, bool write_fault,
>                 }
>         }
>         *pfn = page_to_pfn(page);
> +
> +       /* See comment in hva_to_pfn_fast. */
> +       if (is_zone_device_page(page[0]))
> +               put_page(page[0]);
> +
>         return npages;
>  }

* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-06 23:39         ` Sean Christopherson
@ 2019-11-07  0:01           ` Dan Williams
  2019-11-07  5:48             ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2019-11-07  0:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Wed, Nov 6, 2019 at 3:39 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Wed, Nov 06, 2019 at 03:20:11PM -0800, Dan Williams wrote:
> > After some more thought I'd feel more comfortable just collapsing the
> > ZONE_DEVICE case into the VM_IO/VM_PFNMAP case. I.e. with something
> > like this (untested) that just drops the reference immediately and let
> > kvm_is_reserved_pfn() do the right thing going forward.
>
> This will break the page fault flow, as it will allow the page to be
> whacked before KVM can ensure it will get proper notification from the
> mmu_notifier.  E.g. KVM would install the PFN in its secondary MMU after
> getting the invalidate notification for the PFN.

How do mmu notifiers get held off by page references and does that
machinery work with ZONE_DEVICE? Why is this not a concern for the
VM_IO and VM_PFNMAP case?

* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-07  0:01           ` Dan Williams
@ 2019-11-07  5:48             ` Dan Williams
  2019-11-07 11:12               ` Paolo Bonzini
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2019-11-07  5:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Wed, Nov 6, 2019 at 4:01 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Nov 6, 2019 at 3:39 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Wed, Nov 06, 2019 at 03:20:11PM -0800, Dan Williams wrote:
> > > After some more thought I'd feel more comfortable just collapsing the
> > > ZONE_DEVICE case into the VM_IO/VM_PFNMAP case. I.e. with something
> > > like this (untested) that just drops the reference immediately and let
> > > kvm_is_reserved_pfn() do the right thing going forward.
> >
> > This will break the page fault flow, as it will allow the page to be
> > whacked before KVM can ensure it will get proper notification from the
> > mmu_notifier.  E.g. KVM would install the PFN in its secondary MMU after
> > getting the invalidate notification for the PFN.
>
> How do mmu notifiers get held off by page references and does that
> machinery work with ZONE_DEVICE? Why is this not a concern for the
> VM_IO and VM_PFNMAP case?

Put another way, I see no protection against truncate/invalidate
afforded by a page pin. If you need guarantees that the page remains
valid in the VMA until KVM can install a mmu notifier, that needs to
happen under the mmap_sem as far as I can see. Otherwise gup just
weakly asserts "this pinned page was valid in this vma at one point in
time".

* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-07  5:48             ` Dan Williams
@ 2019-11-07 11:12               ` Paolo Bonzini
  2019-11-07 15:36                 ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2019-11-07 11:12 UTC (permalink / raw)
  To: Dan Williams, Sean Christopherson
  Cc: Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On 07/11/19 06:48, Dan Williams wrote:
>> How do mmu notifiers get held off by page references and does that
>> machinery work with ZONE_DEVICE? Why is this not a concern for the
>> VM_IO and VM_PFNMAP case?
> Put another way, I see no protection against truncate/invalidate
> afforded by a page pin. If you need guarantees that the page remains
> valid in the VMA until KVM can install a mmu notifier that needs to
> happen under the mmap_sem as far as I can see. Otherwise gup just
> weakly asserts "this pinned page was valid in this vma at one point in
> time".

The MMU notifier is installed before gup, so any invalidation will be
preceded by a call to the MMU notifier.  In turn,
invalidate_range_start/end is called with mmap_sem held so there should
be no race.

However, as Sean mentioned, early put_page of ZONE_DEVICE pages would be
racy, because we need to keep the reference between the gup and the last
time we use the corresponding struct page.

Based on this, I think Sean's patches should work fine, and I prefer
them over David's approach.  Either way, adding some documentation is in
order.

Paolo

* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-07 11:12               ` Paolo Bonzini
@ 2019-11-07 15:36                 ` Dan Williams
  2019-11-07 15:58                   ` Sean Christopherson
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2019-11-07 15:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Thu, Nov 7, 2019 at 3:12 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 07/11/19 06:48, Dan Williams wrote:
> >> How do mmu notifiers get held off by page references and does that
> >> machinery work with ZONE_DEVICE? Why is this not a concern for the
> >> VM_IO and VM_PFNMAP case?
> > Put another way, I see no protection against truncate/invalidate
> > afforded by a page pin. If you need guarantees that the page remains
> > valid in the VMA until KVM can install a mmu notifier that needs to
> > happen under the mmap_sem as far as I can see. Otherwise gup just
> > weakly asserts "this pinned page was valid in this vma at one point in
> > time".
>
> The MMU notifier is installed before gup, so any invalidation will be
> preceded by a call to the MMU notifier.  In turn,
> invalidate_range_start/end is called with mmap_sem held so there should
> be no race.
>
> However, as Sean mentioned, early put_page of ZONE_DEVICE pages would be
> racy, because we need to keep the reference between the gup and the last
> time we use the corresponding struct page.

If KVM is establishing the mmu_notifier before gup then there is
nothing left to do with that ZONE_DEVICE page, so I'm struggling to
see what further qualification of kvm_is_reserved_pfn() buys the
implementation.

However, if you're attracted to the explicitness of Sean's approach
can I at least ask for comments asserting that KVM knows it already
holds a reference on that page so the is_zone_device_page() usage is
safe?

David and I are otherwise trying to reduce is_zone_device_page() to
easy to audit "obviously safe" cases and converting the others with
additional synchronization.

* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-07 15:36                 ` Dan Williams
@ 2019-11-07 15:58                   ` Sean Christopherson
  2019-11-09  1:43                     ` Sean Christopherson
  0 siblings, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2019-11-07 15:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Thu, Nov 07, 2019 at 07:36:45AM -0800, Dan Williams wrote:
> On Thu, Nov 7, 2019 at 3:12 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 07/11/19 06:48, Dan Williams wrote:
> > >> How do mmu notifiers get held off by page references and does that
> > >> machinery work with ZONE_DEVICE? Why is this not a concern for the
> > >> VM_IO and VM_PFNMAP case?
> > > Put another way, I see no protection against truncate/invalidate
> > > afforded by a page pin. If you need guarantees that the page remains
> > > valid in the VMA until KVM can install a mmu notifier that needs to
> > > happen under the mmap_sem as far as I can see. Otherwise gup just
> > > weakly asserts "this pinned page was valid in this vma at one point in
> > > time".
> >
> > The MMU notifier is installed before gup, so any invalidation will be
> > preceded by a call to the MMU notifier.  In turn,
> > invalidate_range_start/end is called with mmap_sem held so there should
> > be no race.
> >
> > However, as Sean mentioned, early put_page of ZONE_DEVICE pages would be
> > racy, because we need to keep the reference between the gup and the last
> > time we use the corresponding struct page.
> 
> If KVM is establishing the mmu_notifier before gup then there is
> nothing left to do with that ZONE_DEVICE page, so I'm struggling to
> see what further qualification of kvm_is_reserved_pfn() buys the
> implementation.

Insertion into KVM's secondary MMU is mutually exclusive with an invalidate
from the mmu_notifier.  KVM holds a reference to the to-be-inserted page
until the page has been inserted, which ensures that the page is pinned and
thus won't be invalidated until after the page is inserted.  This prevents
an invalidate from racing with insertion.  Dropping the reference
immediately after gup() would allow the invalidate to run prior to the page
being inserted, and so KVM would map the stale PFN into the guest's page
tables after it was invalidated in the host.

The other benefit of treating ZONE_DEVICE pages as not-reserved is that it
allows KVM to do proper sanity checks, e.g. when removing pages from its
secondary MMU, KVM asserts that page_count() for present pages is non-zero
to help detect bugs where KVM didn't properly zap the page from its MMU.
Enabling the assertion for ZONE_DEVICE pages would be very nice to have, as
it's the most explicit indicator that we done messed up; without the
associated WARN, errors manifest as random corruption or weird behavior in
the guest.

> However, if you're attracted to the explicitness of Sean's approach
> can I at least ask for comments asserting that KVM knows it already
> holds a reference on that page so the is_zone_device_page() usage is
> safe?

Ya, I'll update the patch to add comments and a WARN.

> David and I are otherwise trying to reduce is_zone_device_page() to
> easy to audit "obviously safe" cases and converting the others with
> additional synchronization.

* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-07 15:58                   ` Sean Christopherson
@ 2019-11-09  1:43                     ` Sean Christopherson
  2019-11-09  2:00                       ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2019-11-09  1:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Thu, Nov 07, 2019 at 07:58:46AM -0800, Sean Christopherson wrote:
> On Thu, Nov 07, 2019 at 07:36:45AM -0800, Dan Williams wrote:
> > On Thu, Nov 7, 2019 at 3:12 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > >
> > > On 07/11/19 06:48, Dan Williams wrote:
> > > >> How do mmu notifiers get held off by page references and does that
> > > >> machinery work with ZONE_DEVICE? Why is this not a concern for the
> > > >> VM_IO and VM_PFNMAP case?
> > > > Put another way, I see no protection against truncate/invalidate
> > > > afforded by a page pin. If you need guarantees that the page remains
> > > > valid in the VMA until KVM can install a mmu notifier that needs to
> > > > happen under the mmap_sem as far as I can see. Otherwise gup just
> > > > weakly asserts "this pinned page was valid in this vma at one point in
> > > > time".
> > >
> > > The MMU notifier is installed before gup, so any invalidation will be
> > > preceded by a call to the MMU notifier.  In turn,
> > > invalidate_range_start/end is called with mmap_sem held so there should
> > > be no race.
> > >
> > > However, as Sean mentioned, early put_page of ZONE_DEVICE pages would be
> > > racy, because we need to keep the reference between the gup and the last
> > > time we use the corresponding struct page.
> > 
> > If KVM is establishing the mmu_notifier before gup then there is
> > nothing left to do with that ZONE_DEVICE page, so I'm struggling to
> > see what further qualification of kvm_is_reserved_pfn() buys the
> > implementation.
> 
> Insertion into KVM's secondary MMU is mutually exclusive with an invalidate
> from the mmu_notifier.  KVM holds a reference to the to-be-inserted page
> until the page has been inserted, which ensures that the page is pinned and
> thus won't be invalidated until after the page is inserted.  This prevents
> an invalidate from racing with insertion.  Dropping the reference
> immediately after gup() would allow the invalidate to run prior to the page
> being inserted, and so KVM would map the stale PFN into the guest's page
> tables after it was invalidated in the host.

My previous analysis is wrong, although I did sort of come to the right
conclusion.

The part that's wrong is that KVM does not rely on pinning a page/pfn when
installing the pfn into its secondary MMU (guest page tables).  Instead,
KVM keeps track of mmu_notifier invalidate requests and cancels insertion
if an invalidate occurred at any point between the start of hva_to_pfn(),
i.e. the get_user_pages() call, and acquiring KVM's mmu lock (which must
also be grabbed by mmu_notifier invalidate).  So for any pfn, regardless
of whether it's backed by a struct page, KVM inserts a pfn if and only if
it is guaranteed to get an mmu_notifier invalidate for the pfn (and isn't
already invalidated).

In the page fault flow, KVM doesn't care whether or not the pfn remains
valid in the associated vma.  In other words, Dan's idea of immediately
doing put_page() on ZONE_DEVICE pages would work for *page faults*...

...but not for all the other flows where KVM uses gfn_to_pfn(), and thus
get_user_pages().  When accessing entire pages of guest memory, e.g. for
nested virtualization, KVM gets the page associated with a gfn, maps it
with kmap() to get a kernel address and keeps the mapping/page until it's
done reading/writing the page.  Immediately putting ZONE_DEVICE pages
would result in use-after-free scenarios for these flows.

For non-page pfns, KVM explicitly memremap()'s the pfn and again keeps the
mapping until it's done accessing guest memory.

The above is nicely encapsulated in kvm_vcpu_map(), added by KarimAllah in
commit e45adf665a53 ("KVM: Introduce a new guest mapping API").

For the hva_to_pfn_remapped() -> kvm_get_pfn() case, I think the page
fault flow is in good shape by way of the mmu_notifier sequencing.  The
kvm_get_pfn() call might temporarily keep a page alive, but it shouldn't
break anything (I think).  Not sure about kvm_vcpu_map()...


* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-09  1:43                     ` Sean Christopherson
@ 2019-11-09  2:00                       ` Dan Williams
  2019-11-11 18:27                         ` Sean Christopherson
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2019-11-09  2:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Fri, Nov 8, 2019 at 5:43 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, Nov 07, 2019 at 07:58:46AM -0800, Sean Christopherson wrote:
> > On Thu, Nov 07, 2019 at 07:36:45AM -0800, Dan Williams wrote:
> > > On Thu, Nov 7, 2019 at 3:12 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > >
> > > > On 07/11/19 06:48, Dan Williams wrote:
> > > > >> How do mmu notifiers get held off by page references and does that
> > > > >> machinery work with ZONE_DEVICE? Why is this not a concern for the
> > > > >> VM_IO and VM_PFNMAP case?
> > > > > Put another way, I see no protection against truncate/invalidate
> > > > > afforded by a page pin. If you need guarantees that the page remains
> > > > > valid in the VMA until KVM can install a mmu notifier that needs to
> > > > > happen under the mmap_sem as far as I can see. Otherwise gup just
> > > > > weakly asserts "this pinned page was valid in this vma at one point in
> > > > > time".
> > > >
> > > > The MMU notifier is installed before gup, so any invalidation will be
> > > > preceded by a call to the MMU notifier.  In turn,
> > > > invalidate_range_start/end is called with mmap_sem held so there should
> > > > be no race.
> > > >
> > > > However, as Sean mentioned, early put_page of ZONE_DEVICE pages would be
> > > > racy, because we need to keep the reference between the gup and the last
> > > > time we use the corresponding struct page.
> > >
> > > If KVM is establishing the mmu_notifier before gup then there is
> > > nothing left to do with that ZONE_DEVICE page, so I'm struggling to
> > > see what further qualification of kvm_is_reserved_pfn() buys the
> > > implementation.
> >
> > Insertion into KVM's secondary MMU is mutually exclusive with an invalidate
> > from the mmu_notifier.  KVM holds a reference to the to-be-inserted page
> > until the page has been inserted, which ensures that the page is pinned and
> > thus won't be invalidated until after the page is inserted.  This prevents
> > an invalidate from racing with insertion.  Dropping the reference
> > immediately after gup() would allow the invalidate to run prior to the page
> > being inserted, and so KVM would map the stale PFN into the guest's page
> > tables after it was invalidated in the host.
>
> My previous analysis is wrong, although I did sort of come to the right
> conclusion.
>
> The part that's wrong is that KVM does not rely on pinning a page/pfn when
> installing the pfn into its secondary MMU (guest page tables).  Instead,
> KVM keeps track of mmu_notifier invalidate requests and cancels insertion
> > if an invalidate occurred at any point between the start of hva_to_pfn(),
> i.e. the get_user_pages() call, and acquiring KVM's mmu lock (which must
> also be grabbed by mmu_notifier invalidate).  So for any pfn, regardless
> of whether it's backed by a struct page, KVM inserts a pfn if and only if
> it is guaranteed to get an mmu_notifier invalidate for the pfn (and isn't
> already invalidated).
>
> In the page fault flow, KVM doesn't care whether or not the pfn remains
> valid in the associated vma.  In other words, Dan's idea of immediately
> doing put_page() on ZONE_DEVICE pages would work for *page faults*...
>
> ...but not for all the other flows where KVM uses gfn_to_pfn(), and thus
> get_user_pages().  When accessing entire pages of guest memory, e.g. for
> nested virtualization, KVM gets the page associated with a gfn, maps it
> with kmap() to get a kernel address and keeps the mapping/page until it's
> done reading/writing the page.  Immediately putting ZONE_DEVICE pages
> would result in use-after-free scenarios for these flows.

Thanks for this clarification. I do want to point out, though, that
ZONE_DEVICE pages go idle; they don't get freed. As long as KVM drops
its usage on invalidate it's perfectly fine for KVM to operate on idle
ZONE_DEVICE pages. The common case is that ZONE_DEVICE pages are
accessed and mapped while idle. Only direct-I/O temporarily marks them
busy to synchronize with invalidate. KVM obviates that need by
coordinating with mmu-notifiers instead.


* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-09  2:00                       ` Dan Williams
@ 2019-11-11 18:27                         ` Sean Christopherson
  2019-11-11 19:15                           ` Paolo Bonzini
  2019-11-12  0:51                           ` Dan Williams
  0 siblings, 2 replies; 25+ messages in thread
From: Sean Christopherson @ 2019-11-11 18:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Fri, Nov 08, 2019 at 06:00:46PM -0800, Dan Williams wrote:
> On Fri, Nov 8, 2019 at 5:43 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> > On Thu, Nov 07, 2019 at 07:58:46AM -0800, Sean Christopherson wrote:
> > > Insertion into KVM's secondary MMU is mutually exclusive with an invalidate
> > > from the mmu_notifier.  KVM holds a reference to the to-be-inserted page
> > > until the page has been inserted, which ensures that the page is pinned and
> > > thus won't be invalidated until after the page is inserted.  This prevents
> > > an invalidate from racing with insertion.  Dropping the reference
> > > immediately after gup() would allow the invalidate to run prior to the page
> > > being inserted, and so KVM would map the stale PFN into the guest's page
> > > tables after it was invalidated in the host.
> >
> > My previous analysis is wrong, although I did sort of come to the right
> > conclusion.
> >
> > The part that's wrong is that KVM does not rely on pinning a page/pfn when
> > installing the pfn into its secondary MMU (guest page tables).  Instead,
> > KVM keeps track of mmu_notifier invalidate requests and cancels insertion
> > > if an invalidate occurred at any point between the start of hva_to_pfn(),
> > i.e. the get_user_pages() call, and acquiring KVM's mmu lock (which must
> > also be grabbed by mmu_notifier invalidate).  So for any pfn, regardless
> > of whether it's backed by a struct page, KVM inserts a pfn if and only if
> > it is guaranteed to get an mmu_notifier invalidate for the pfn (and isn't
> > already invalidated).
> >
> > In the page fault flow, KVM doesn't care whether or not the pfn remains
> > valid in the associated vma.  In other words, Dan's idea of immediately
> > doing put_page() on ZONE_DEVICE pages would work for *page faults*...
> >
> > ...but not for all the other flows where KVM uses gfn_to_pfn(), and thus
> > get_user_pages().  When accessing entire pages of guest memory, e.g. for
> > nested virtualization, KVM gets the page associated with a gfn, maps it
> > with kmap() to get a kernel address and keeps the mapping/page until it's
> > done reading/writing the page.  Immediately putting ZONE_DEVICE pages
> > would result in use-after-free scenarios for these flows.
> 
> Thanks for this clarification. I do want to point out, though, that
> ZONE_DEVICE pages go idle; they don't get freed. As long as KVM drops
> its usage on invalidate it's perfectly fine for KVM to operate on idle
> ZONE_DEVICE pages. The common case is that ZONE_DEVICE pages are
> accessed and mapped while idle. Only direct-I/O temporarily marks them
> busy to synchronize with invalidate. KVM obviates that need by
> coordinating with mmu-notifiers instead.

Only the KVM MMU, e.g. page fault handler, coordinates via mmu_notifier,
the kvm_vcpu_map() case would continue using pages across an invalidate.

Or did I misunderstand?


* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-11 18:27                         ` Sean Christopherson
@ 2019-11-11 19:15                           ` Paolo Bonzini
  2019-11-12  0:51                           ` Dan Williams
  1 sibling, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2019-11-11 19:15 UTC (permalink / raw)
  To: Sean Christopherson, Dan Williams
  Cc: Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On 11/11/19 19:27, Sean Christopherson wrote:
>> Thanks for this clarification. I do want to put out though that
>> ZONE_DEVICE pages go idle, they don't get freed. As long as KVM drops
>> its usage on invalidate it's perfectly fine for KVM to operate on idle
>> ZONE_DEVICE pages. The common case is that ZONE_DEVICE pages are
>> accessed and mapped while idle. Only direct-I/O temporarily marks them
>> busy to synchronize with invalidate. KVM obviates that need by
>> coordinating with mmu-notifiers instead.
> Only the KVM MMU, e.g. page fault handler, coordinates via mmu_notifier,
> the kvm_vcpu_map() case would continue using pages across an invalidate.

Yes, and it gets/puts the page correctly.

Paolo



* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-11 18:27                         ` Sean Christopherson
  2019-11-11 19:15                           ` Paolo Bonzini
@ 2019-11-12  0:51                           ` Dan Williams
  2019-11-12 10:19                             ` Paolo Bonzini
  1 sibling, 1 reply; 25+ messages in thread
From: Dan Williams @ 2019-11-12  0:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Mon, Nov 11, 2019 at 10:27 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Nov 08, 2019 at 06:00:46PM -0800, Dan Williams wrote:
> > On Fri, Nov 8, 2019 at 5:43 PM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > > On Thu, Nov 07, 2019 at 07:58:46AM -0800, Sean Christopherson wrote:
> > > > Insertion into KVM's secondary MMU is mutually exclusive with an invalidate
> > > > from the mmu_notifier.  KVM holds a reference to the to-be-inserted page
> > > > until the page has been inserted, which ensures that the page is pinned and
> > > > thus won't be invalidated until after the page is inserted.  This prevents
> > > > an invalidate from racing with insertion.  Dropping the reference
> > > > immediately after gup() would allow the invalidate to run prior to the page
> > > > being inserted, and so KVM would map the stale PFN into the guest's page
> > > > tables after it was invalidated in the host.
> > >
> > > My previous analysis is wrong, although I did sort of come to the right
> > > conclusion.
> > >
> > > The part that's wrong is that KVM does not rely on pinning a page/pfn when
> > > installing the pfn into its secondary MMU (guest page tables).  Instead,
> > > KVM keeps track of mmu_notifier invalidate requests and cancels insertion
> > > if an invalidate occurred at any point between the start of hva_to_pfn(),
> > > i.e. the get_user_pages() call, and acquiring KVM's mmu lock (which must
> > > also be grabbed by mmu_notifier invalidate).  So for any pfn, regardless
> > > of whether it's backed by a struct page, KVM inserts a pfn if and only if
> > > it is guaranteed to get an mmu_notifier invalidate for the pfn (and isn't
> > > already invalidated).
> > >
> > > In the page fault flow, KVM doesn't care whether or not the pfn remains
> > > valid in the associated vma.  In other words, Dan's idea of immediately
> > > doing put_page() on ZONE_DEVICE pages would work for *page faults*...
> > >
> > > ...but not for all the other flows where KVM uses gfn_to_pfn(), and thus
> > > get_user_pages().  When accessing entire pages of guest memory, e.g. for
> > > nested virtualization, KVM gets the page associated with a gfn, maps it
> > > with kmap() to get a kernel address and keeps the mapping/page until it's
> > > done reading/writing the page.  Immediately putting ZONE_DEVICE pages
> > > would result in use-after-free scenarios for these flows.
> >
> > Thanks for this clarification. I do want to point out, though, that
> > ZONE_DEVICE pages go idle; they don't get freed. As long as KVM drops
> > its usage on invalidate it's perfectly fine for KVM to operate on idle
> > ZONE_DEVICE pages. The common case is that ZONE_DEVICE pages are
> > accessed and mapped while idle. Only direct-I/O temporarily marks them
> > busy to synchronize with invalidate. KVM obviates that need by
> > coordinating with mmu-notifiers instead.
>
> Only the KVM MMU, e.g. page fault handler, coordinates via mmu_notifier,
> the kvm_vcpu_map() case would continue using pages across an invalidate.
>
> Or did I misunderstand?

Not sure *I* understand KVM's implementation.

[ Note, I already acked the solution since it fixes the problem at
hand (thanks btw!), so feel free to drop this discussion if you don't
have time to hit me with the clue bat ]

An elevated page reference count for file mapped pages causes the
filesystem (for a dax mode file) to wait for that reference count to
drop to 1 before allowing the truncate to proceed. For a page cache
backed file mapping (non-dax) the reference count is not considered in
the truncate path. It does prevent the page from getting freed in the
page cache case, but the association to the file is lost for truncate.

As long as any memory the guest expects to be persistent is backed by
mmu-notifier coordination we're all good, otherwise an elevated
reference count does not coordinate with truncate in a reliable way.


* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-12  0:51                           ` Dan Williams
@ 2019-11-12 10:19                             ` Paolo Bonzini
  2019-11-12 16:57                               ` Sean Christopherson
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2019-11-12 10:19 UTC (permalink / raw)
  To: Dan Williams, Sean Christopherson
  Cc: Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On 12/11/19 01:51, Dan Williams wrote:
> An elevated page reference count for file mapped pages causes the
> filesystem (for a dax mode file) to wait for that reference count to
> drop to 1 before allowing the truncate to proceed. For a page cache
> backed file mapping (non-dax) the reference count is not considered in
> the truncate path. It does prevent the page from getting freed in the
> page cache case, but the association to the file is lost for truncate.

KVM support for file-backed guest memory is limited.  It is not
completely broken, in fact cases such as hugetlbfs are in use routinely,
but corner cases such as truncate aren't covered well indeed.

Paolo

> As long as any memory the guest expects to be persistent is backed by
> mmu-notifier coordination we're all good, otherwise an elevated
> reference count does not coordinate with truncate in a reliable way.
> 



* Re: [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
  2019-11-12 10:19                             ` Paolo Bonzini
@ 2019-11-12 16:57                               ` Sean Christopherson
  0 siblings, 0 replies; 25+ messages in thread
From: Sean Christopherson @ 2019-11-12 16:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Dan Williams, Radim Krčmář,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	KVM list, Linux Kernel Mailing List, Adam Borowski,
	David Hildenbrand

On Tue, Nov 12, 2019 at 11:19:44AM +0100, Paolo Bonzini wrote:
> On 12/11/19 01:51, Dan Williams wrote:
> > An elevated page reference count for file mapped pages causes the
> > filesystem (for a dax mode file) to wait for that reference count to
> > drop to 1 before allowing the truncate to proceed. For a page cache
> > backed file mapping (non-dax) the reference count is not considered in
> > the truncate path. It does prevent the page from getting freed in the
> > page cache case, but the association to the file is lost for truncate.
> 
> KVM support for file-backed guest memory is limited.  It is not
> completely broken, in fact cases such as hugetlbfs are in use routinely,
> but corner cases such as truncate aren't covered well indeed.

KVM's actual MMU should be ok since it coordinates with the mmu_notifier.

kvm_vcpu_map() is where KVM could run afoul of page cache truncation.
This is the other main use of hva_to_pfn*(), where KVM directly accesses
guest memory (which could be file-backed) without coordinating with the
mmu_notifier.  IIUC, an ill-timed page cache truncation could result in a
write from KVM effectively being dropped due to writeback racing with
KVM's write to the page.  If that's true, then I think KVM would need
to move to the proposed pin_user_pages() to ensure its "DMA" isn't lost.

> > As long as any memory the guest expects to be persistent is backed by
> > mmu-notifier coordination we're all good, otherwise an elevated
> > reference count does not coordinate with truncate in a reliable way.

KVM itself is (mostly) blissfully unaware of any such expectations.  The
userspace VMM, e.g. Qemu, is ultimately responsible for ensuring the guest
sees a valid model, e.g. that persistent memory (as presented to the guest)
is actually persistent (from the guest's perspective).

The big caveat is the truncation issue above.



Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
2019-11-06 17:07 [PATCH 0/2] KVM: MMU: Fix a refcount bug with ZONE_DEVICE pages Sean Christopherson
2019-11-06 17:07 ` [PATCH 1/2] KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved Sean Christopherson
2019-11-06 17:14   ` Paolo Bonzini
2019-11-06 17:46     ` Sean Christopherson
2019-11-06 18:04   ` Dan Williams
2019-11-06 20:26     ` Sean Christopherson
2019-11-06 20:34       ` Dan Williams
2019-11-06 21:09     ` Paolo Bonzini
2019-11-06 21:30       ` Sean Christopherson
2019-11-06 23:20       ` Dan Williams
2019-11-06 23:39         ` Sean Christopherson
2019-11-07  0:01           ` Dan Williams
2019-11-07  5:48             ` Dan Williams
2019-11-07 11:12               ` Paolo Bonzini
2019-11-07 15:36                 ` Dan Williams
2019-11-07 15:58                   ` Sean Christopherson
2019-11-09  1:43                     ` Sean Christopherson
2019-11-09  2:00                       ` Dan Williams
2019-11-11 18:27                         ` Sean Christopherson
2019-11-11 19:15                           ` Paolo Bonzini
2019-11-12  0:51                           ` Dan Williams
2019-11-12 10:19                             ` Paolo Bonzini
2019-11-12 16:57                               ` Sean Christopherson
2019-11-06 17:07 ` [PATCH 2/2] KVM: x86/mmu: Add helper to consolidate huge page promotion Sean Christopherson
2019-11-06 17:22   ` Paolo Bonzini
