* [RFC PATCH] kvm: Use huge pages for DAX-backed files
@ 2018-10-29 21:07 Barret Rhoden
  2018-10-29 22:25 ` Dan Williams
       [not found] ` <20181029210716.212159-1-brho-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  0 siblings, 2 replies; 16+ messages in thread
From: Barret Rhoden @ 2018-10-29 21:07 UTC (permalink / raw)
  To: Dan Williams, Dave Jiang, Ross Zwisler, Vishal Verma,
	Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, yu.c.zhang-ral2JQCrhuEAvxtiuMwx3w,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, x86-DgEjT+Ai2ygdnm+yROfE0A,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, H. Peter Anvin,
	yi.z.zhang-ral2JQCrhuEAvxtiuMwx3w

This change allows KVM to map DAX-backed files made of huge pages with
huge mappings in the EPT/TDP.

DAX pages are not PageTransCompound.  The existing check is trying to
determine if the mapping for the pfn is a huge mapping or not.  For
non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.

For DAX, we can check the page table itself.  Actually, we might always
be able to walk the page table, even for PageTransCompound pages, but
it's probably a little slower.

Note that KVM already faulted in the page (or huge page) in the host's
page table, and we hold the KVM mmu spinlock (grabbed before checking
the mmu seq).  Based on the other comments about not worrying about a
pmd split, we might be able to safely walk the page table without
holding the mm sem.

This patch relies on kvm_is_reserved_pfn() being false for DAX pages,
which I've hacked up for testing this code.  That change should
eventually happen:

https://lore.kernel.org/lkml/20181022084659.GA84523@tiger-server/

Another issue is that kvm_mmu_zap_collapsible_spte() also uses
PageTransCompoundMap() to detect huge pages, but we don't have a way to
get the HVA easily.  Can we just aggressively zap DAX pages there?

Alternatively, is there a better way to track at the struct page level
whether or not a page is huge-mapped?  Maybe the DAX huge pages mark
themselves as TransCompound or something similar, and we don't need to
special case DAX/ZONE_DEVICE pages.

Signed-off-by: Barret Rhoden <brho-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 arch/x86/kvm/mmu.c | 71 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 70 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cf5f572f2305..9f3e0f83a2dd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3152,6 +3152,75 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
 	return -EFAULT;
 }
 
+static unsigned long pgd_mapping_size(struct mm_struct *mm, unsigned long addr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		return 0;
+
+	p4d = p4d_offset(pgd, addr);
+	if (!p4d_present(*p4d))
+		return 0;
+	if (p4d_huge(*p4d))
+		return P4D_SIZE;
+
+	pud = pud_offset(p4d, addr);
+	if (!pud_present(*pud))
+		return 0;
+	if (pud_huge(*pud))
+		return PUD_SIZE;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd))
+		return 0;
+	if (pmd_huge(*pmd))
+		return PMD_SIZE;
+
+	pte = pte_offset_map(pmd, addr);
+	if (!pte_present(*pte))
+		return 0;
+	return PAGE_SIZE;
+}
+
+static bool pfn_is_pmd_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	unsigned long hva, map_sz;
+
+	if (!is_zone_device_page(page))
+		return PageTransCompoundMap(page);
+
+	/*
+	 * DAX pages do not use compound pages.  The page should have already
+	 * been mapped into the host-side page table during try_async_pf(), so
+	 * we can check the page tables directly.
+	 */
+	hva = gfn_to_hva(kvm, gfn);
+	if (kvm_is_error_hva(hva))
+		return false;
+
+	/*
+	 * Our caller grabbed the KVM mmu_lock with a successful
+	 * mmu_notifier_retry, so we're safe to walk the page table.
+	 */
+	map_sz = pgd_mapping_size(current->mm, hva);
+	switch (map_sz) {
+	case PMD_SIZE:
+		return true;
+	case P4D_SIZE:
+	case PUD_SIZE:
+		printk_once(KERN_INFO "KVM THP promo found a very large page");
+		return false;
+	}
+	return false;
+}
+
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 					gfn_t *gfnp, kvm_pfn_t *pfnp,
 					int *levelp)
@@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 */
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
 	    level == PT_PAGE_TABLE_LEVEL &&
-	    PageTransCompoundMap(pfn_to_page(pfn)) &&
+	    pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&
 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
 		unsigned long mask;
 		/*
-- 
2.19.1.568.g152ad8e336-goog

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
  2018-10-29 21:07 [RFC PATCH] kvm: Use huge pages for DAX-backed files Barret Rhoden
@ 2018-10-29 22:25 ` Dan Williams
       [not found]   ` <CAPcyv4gJUjuSKwy7i2wuKR=Vz-AkDrxnGya5qkg7XTFxuXbtzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-10-31  8:52   ` Paolo Bonzini
       [not found] ` <20181029210716.212159-1-brho-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 16+ messages in thread
From: Dan Williams @ 2018-10-29 22:25 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: X86 ML, Zhang, Yu C, KVM list, rkrcmar, linux-nvdimm,
	Linux Kernel Mailing List, Ingo Molnar, Borislav Petkov, zwisler,
	Paolo Bonzini, Thomas Gleixner, H. Peter Anvin

On Mon, Oct 29, 2018 at 2:07 PM Barret Rhoden <brho@google.com> wrote:
>
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
>
> DAX pages are not PageTransCompound.  The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not.  For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>
> For DAX, we can check the page table itself.  Actually, we might always
> be able to walk the page table, even for PageTransCompound pages, but
> it's probably a little slower.
>
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock (grabbed before checking
> the mmu seq).  Based on the other comments about not worrying about a
> pmd split, we might be able to safely walk the page table without
> holding the mm sem.
>
> This patch relies on kvm_is_reserved_pfn() being false for DAX pages,
> which I've hacked up for testing this code.  That change should
> eventually happen:
>
> https://lore.kernel.org/lkml/20181022084659.GA84523@tiger-server/
>
> Another issue is that kvm_mmu_zap_collapsible_spte() also uses
> PageTransCompoundMap() to detect huge pages, but we don't have a way to
> get the HVA easily.  Can we just aggressively zap DAX pages there?
>
> Alternatively, is there a better way to track at the struct page level
> whether or not a page is huge-mapped?  Maybe the DAX huge pages mark
> themselves as TransCompound or something similar, and we don't need to
> special case DAX/ZONE_DEVICE pages.
>
> Signed-off-by: Barret Rhoden <brho@google.com>
> ---
>  arch/x86/kvm/mmu.c | 71 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 70 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index cf5f572f2305..9f3e0f83a2dd 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3152,6 +3152,75 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>         return -EFAULT;
>  }
>
> +static unsigned long pgd_mapping_size(struct mm_struct *mm, unsigned long addr)
> +{
> +       pgd_t *pgd;
> +       p4d_t *p4d;
> +       pud_t *pud;
> +       pmd_t *pmd;
> +       pte_t *pte;
> +
> +       pgd = pgd_offset(mm, addr);
> +       if (!pgd_present(*pgd))
> +               return 0;
> +
> +       p4d = p4d_offset(pgd, addr);
> +       if (!p4d_present(*p4d))
> +               return 0;
> +       if (p4d_huge(*p4d))
> +               return P4D_SIZE;
> +
> +       pud = pud_offset(p4d, addr);
> +       if (!pud_present(*pud))
> +               return 0;
> +       if (pud_huge(*pud))
> +               return PUD_SIZE;
> +
> +       pmd = pmd_offset(pud, addr);
> +       if (!pmd_present(*pmd))
> +               return 0;
> +       if (pmd_huge(*pmd))
> +               return PMD_SIZE;
> +
> +       pte = pte_offset_map(pmd, addr);
> +       if (!pte_present(*pte))
> +               return 0;
> +       return PAGE_SIZE;
> +}
> +
> +static bool pfn_is_pmd_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +       struct page *page = pfn_to_page(pfn);
> +       unsigned long hva, map_sz;
> +
> +       if (!is_zone_device_page(page))
> +               return PageTransCompoundMap(page);
> +
> +       /*
> +        * DAX pages do not use compound pages.  The page should have already
> +        * been mapped into the host-side page table during try_async_pf(), so
> +        * we can check the page tables directly.
> +        */
> +       hva = gfn_to_hva(kvm, gfn);
> +       if (kvm_is_error_hva(hva))
> +               return false;
> +
> +       /*
> +        * Our caller grabbed the KVM mmu_lock with a successful
> +        * mmu_notifier_retry, so we're safe to walk the page table.
> +        */
> +       map_sz = pgd_mapping_size(current->mm, hva);
> +       switch (map_sz) {
> +       case PMD_SIZE:
> +               return true;
> +       case P4D_SIZE:
> +       case PUD_SIZE:
> +               printk_once(KERN_INFO "KVM THP promo found a very large page");

Why not allow PUD_SIZE? The device-dax interface supports PUD mappings.

> +               return false;
> +       }
> +       return false;
> +}

The above 2 functions are  similar to what we need to do for
determining the blast radius of a memory error, see
dev_pagemap_mapping_shift() and its usage in add_to_kill().
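
For reference, that helper walks the vma's page tables and derives the
mapping size from the *_devmap() entries.  Roughly (paraphrasing from
memory; the actual code in mm/memory-failure.c may differ in the details):

static unsigned long dev_pagemap_mapping_shift(struct page *page,
					       struct vm_area_struct *vma)
{
	unsigned long address = vma_address(page, vma);
	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	pgd = pgd_offset(vma->vm_mm, address);
	if (!pgd_present(*pgd))
		return 0;
	p4d = p4d_offset(pgd, address);
	if (!p4d_present(*p4d))
		return 0;
	pud = pud_offset(p4d, address);
	if (!pud_present(*pud))
		return 0;
	if (pud_devmap(*pud))
		return PUD_SHIFT;
	pmd = pmd_offset(pud, address);
	if (!pmd_present(*pmd))
		return 0;
	if (pmd_devmap(*pmd))
		return PMD_SHIFT;
	pte = pte_offset_map(pmd, address);
	if (!pte_present(*pte))
		return 0;
	if (pte_devmap(*pte))
		return PAGE_SHIFT;
	return 0;
}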

> +
>  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>                                         gfn_t *gfnp, kvm_pfn_t *pfnp,
>                                         int *levelp)
> @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>          */
>         if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
>             level == PT_PAGE_TABLE_LEVEL &&
> -           PageTransCompoundMap(pfn_to_page(pfn)) &&
> +           pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&

I'm wondering if we're adding an explicit is_zone_device_page() check
in this path to determine the page mapping size if that can be a
replacement for the kvm_is_reserved_pfn() check. In other words, the
goal of fixing up PageReserved() was to preclude the need for DAX-page
special casing in KVM, but if we already need to add some special casing
for page size determination, might as well bypass the
kvm_is_reserved_pfn() dependency as well.

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
       [not found]   ` <CAPcyv4gJUjuSKwy7i2wuKR=Vz-AkDrxnGya5qkg7XTFxuXbtzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-10-30  0:28     ` Barret Rhoden
  2018-10-30  3:10       ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Barret Rhoden @ 2018-10-30  0:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: X86 ML, Zhang, Yu C, KVM list, rkrcmar-H+wXaHxf7aLQT0dZR+AlfA,
	linux-nvdimm, Linux Kernel Mailing List, Ingo Molnar,
	Borislav Petkov, zwisler-DgEjT+Ai2ygdnm+yROfE0A, Paolo Bonzini,
	Thomas Gleixner, H. Peter Anvin, Zhang, Yi Z

On 2018-10-29 at 15:25 Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> > +       /*
> > +        * Our caller grabbed the KVM mmu_lock with a successful
> > +        * mmu_notifier_retry, so we're safe to walk the page table.
> > +        */
> > +       map_sz = pgd_mapping_size(current->mm, hva);
> > +       switch (map_sz) {
> > +       case PMD_SIZE:
> > +               return true;
> > +       case P4D_SIZE:
> > +       case PUD_SIZE:
> > +               printk_once(KERN_INFO "KVM THP promo found a very large page");  
> 
> Why not allow PUD_SIZE? The device-dax interface supports PUD mappings.

The place where I use that helper seemed to care about PMDs (compared
to huge pages larger than PUDs), I think due to THP.  Though it also
checks "level == PT_PAGE_TABLE_LEVEL", so it's probably a moot point.

I can change it from pfn_is_pmd_mapped -> pfn_is_huge_mapped and allow
any huge mapping that is appropriate: so PUD or PMD for DAX, PMD for
non-DAX, IIUC.
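
Something like the below, reusing pgd_mapping_size() from this patch
(untested sketch):

static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
{
	struct page *page = pfn_to_page(pfn);
	unsigned long hva;

	if (!is_zone_device_page(page))
		return PageTransCompoundMap(page);

	hva = gfn_to_hva(kvm, gfn);
	if (kvm_is_error_hva(hva))
		return false;

	/* Any host mapping bigger than 4K counts: PMD or PUD for DAX. */
	return pgd_mapping_size(current->mm, hva) > PAGE_SIZE;
}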

> 
> > +               return false;
> > +       }
> > +       return false;
> > +}  
> 
> The above 2 functions are  similar to what we need to do for
> determining the blast radius of a memory error, see
> dev_pagemap_mapping_shift() and its usage in add_to_kill().

Great.  I don't know if I have access in the KVM code to the VMA to use
those functions directly, but I can extract the guts of
dev_pagemap_mapping_shift() or something and put it in mm/util.c.
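
Maybe the extracted helper would take an mm and an address instead of a
VMA, something like (hypothetical name and signature):

	/* e.g. in mm/util.c; the name here is made up */
	unsigned long dev_pagemap_mapping_size(struct mm_struct *mm,
					       unsigned long address);

and KVM could call it with (current->mm, hva) from the fault path.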

> >  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> >                                         gfn_t *gfnp, kvm_pfn_t *pfnp,
> >                                         int *levelp)
> > @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> >          */
> >         if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> >             level == PT_PAGE_TABLE_LEVEL &&
> > -           PageTransCompoundMap(pfn_to_page(pfn)) &&
> > +           pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&  
> 
> I'm wondering if we're adding an explicit is_zone_device_page() check
> in this path to determine the page mapping size if that can be a
> replacement for the kvm_is_reserved_pfn() check. In other words, the
> goal of fixing up PageReserved() was to preclude the need for DAX-page
> special casing in KVM, but if we already need to add some special casing
> for page size determination, might as well bypass the
> kvm_is_reserved_pfn() dependency as well.

kvm_is_reserved_pfn() is used in some other places, like
kvm_set_pfn_dirty() and kvm_set_pfn_accessed().  Maybe the way those
treat DAX pages matters on a case-by-case basis?  

There are other callers of kvm_is_reserved_pfn() such as
kvm_pfn_to_page() and gfn_to_page().  I'm not familiar (yet) with how
struct pages and DAX work together, and whether or not the callers of
those pfn_to_page() functions have expectations about the 'type' of
struct page they get back.

It looks like another time that this popped up was kvm_is_mmio_pfn(),
though that wasn't exactly checking kvm_is_reserved_pfn(), and it
special cased based on the memory type / PAT business.

Thanks,

Barret

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
  2018-10-30  0:28     ` Barret Rhoden
@ 2018-10-30  3:10       ` Dan Williams
       [not found]         ` <CAPcyv4gQztHrJ3--rhU4ZpaZyyqdqE0=gx50CRArHKiXwfYC+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-10-31  3:05         ` Yu Zhang
  0 siblings, 2 replies; 16+ messages in thread
From: Dan Williams @ 2018-10-30  3:10 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: X86 ML, Zhang, Yu C, KVM list, rkrcmar, linux-nvdimm,
	Linux Kernel Mailing List, Ingo Molnar, Borislav Petkov, zwisler,
	Paolo Bonzini, Thomas Gleixner, H. Peter Anvin

On Mon, Oct 29, 2018 at 5:29 PM Barret Rhoden <brho@google.com> wrote:
>
> On 2018-10-29 at 15:25 Dan Williams <dan.j.williams@intel.com> wrote:
> > > +       /*
> > > +        * Our caller grabbed the KVM mmu_lock with a successful
> > > +        * mmu_notifier_retry, so we're safe to walk the page table.
> > > +        */
> > > +       map_sz = pgd_mapping_size(current->mm, hva);
> > > +       switch (map_sz) {
> > > +       case PMD_SIZE:
> > > +               return true;
> > > +       case P4D_SIZE:
> > > +       case PUD_SIZE:
> > > +               printk_once(KERN_INFO "KVM THP promo found a very large page");
> >
> > Why not allow PUD_SIZE? The device-dax interface supports PUD mappings.
>
> The place where I use that helper seemed to care about PMDs (compared
> to huge pages larger than PUDs), I think due to THP.  Though it also
> checks "level == PT_PAGE_TABLE_LEVEL", so it's probably a moot point.
>
> I can change it from pfn_is_pmd_mapped -> pfn_is_huge_mapped and allow
> any huge mapping that is appropriate: so PUD or PMD for DAX, PMD for
> non-DAX, IIUC.

Yes, THP stops at PMDs, but DAX and hugetlbfs support PUD level mappings.

> > > +               return false;
> > > +       }
> > > +       return false;
> > > +}
> >
> > The above 2 functions are  similar to what we need to do for
> > determining the blast radius of a memory error, see
> > dev_pagemap_mapping_shift() and its usage in add_to_kill().
>
> Great.  I don't know if I have access in the KVM code to the VMA to use
> those functions directly, but I can extract the guts of
> dev_pagemap_mapping_shift() or something and put it in mm/util.c.

Sounds good.

> > >  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > >                                         gfn_t *gfnp, kvm_pfn_t *pfnp,
> > >                                         int *levelp)
> > > @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > >          */
> > >         if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> > >             level == PT_PAGE_TABLE_LEVEL &&
> > > -           PageTransCompoundMap(pfn_to_page(pfn)) &&
> > > +           pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&
> >
> > I'm wondering if we're adding an explicit is_zone_device_page() check
> > in this path to determine the page mapping size if that can be a
> > replacement for the kvm_is_reserved_pfn() check. In other words, the
> > goal of fixing up PageReserved() was to preclude the need for DAX-page
> > special casing in KVM, but if we already need to add some special casing
> > for page size determination, might as well bypass the
> > kvm_is_reserved_pfn() dependency as well.
>
> kvm_is_reserved_pfn() is used in some other places, like
> kvm_set_pfn_dirty() and kvm_set_pfn_accessed().  Maybe the way those
> treat DAX pages matters on a case-by-case basis?
>
> There are other callers of kvm_is_reserved_pfn() such as
> kvm_pfn_to_page() and gfn_to_page().  I'm not familiar (yet) with how
> struct pages and DAX work together, and whether or not the callers of
> those pfn_to_page() functions have expectations about the 'type' of
> struct page they get back.
>

The property of DAX pages that requires special coordination is the
fact that the device hosting the pages can be disabled at will. The
get_dev_pagemap() api is the interface to pin a device-pfn so that you
can safely perform a pfn_to_page() operation.
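
The usual pattern is something like this (sketch, error handling trimmed):

	struct dev_pagemap *pgmap;

	pgmap = get_dev_pagemap(pfn, NULL);	/* pins the hosting device */
	if (pgmap) {
		struct page *page = pfn_to_page(pfn);

		/* ... safe to use the page here ... */
		put_dev_pagemap(pgmap);		/* drop the pin */
	}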

Have the pages that kvm uses in this path already been pinned by vfio?

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
       [not found]         ` <CAPcyv4gQztHrJ3--rhU4ZpaZyyqdqE0=gx50CRArHKiXwfYC+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-10-30 19:45           ` Barret Rhoden
  2018-10-31  8:49             ` Paolo Bonzini
  0 siblings, 1 reply; 16+ messages in thread
From: Barret Rhoden @ 2018-10-30 19:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: X86 ML, Zhang, Yu C, KVM list, rkrcmar-H+wXaHxf7aLQT0dZR+AlfA,
	linux-nvdimm, Linux Kernel Mailing List, Ingo Molnar,
	Borislav Petkov, zwisler-DgEjT+Ai2ygdnm+yROfE0A, Paolo Bonzini,
	Thomas Gleixner, H. Peter Anvin, Zhang, Yi Z

On 2018-10-29 at 20:10 Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> > > >  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > > >                                         gfn_t *gfnp, kvm_pfn_t *pfnp,
> > > >                                         int *levelp)
> > > > @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > > >          */
> > > >         if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> > > >             level == PT_PAGE_TABLE_LEVEL &&
> > > > -           PageTransCompoundMap(pfn_to_page(pfn)) &&
> > > > +           pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&  
> > >
> > > I'm wondering if we're adding an explicit is_zone_device_page() check
> > > in this path to determine the page mapping size if that can be a
> > > replacement for the kvm_is_reserved_pfn() check. In other words, the
> > > goal of fixing up PageReserved() was to preclude the need for DAX-page
> > > special casing in KVM, but if we already need to add some special casing
> > > for page size determination, might as well bypass the
> > > kvm_is_reserved_pfn() dependency as well.  
> >
> > kvm_is_reserved_pfn() is used in some other places, like
> > kvm_set_pfn_dirty() and kvm_set_pfn_accessed().  Maybe the way those
> > treat DAX pages matters on a case-by-case basis?
> >
> > There are other callers of kvm_is_reserved_pfn() such as
> > kvm_pfn_to_page() and gfn_to_page().  I'm not familiar (yet) with how
> > struct pages and DAX work together, and whether or not the callers of
> > those pfn_to_page() functions have expectations about the 'type' of
> > struct page they get back.
> >  
> 
> The property of DAX pages that requires special coordination is the
> fact that the device hosting the pages can be disabled at will. The
> get_dev_pagemap() api is the interface to pin a device-pfn so that you
> can safely perform a pfn_to_page() operation.
> 
> Have the pages that kvm uses in this path already been pinned by vfio?

I'm not aware of any explicit pinning, but it might be happening under
the hood.  These pages are just generic guest RAM, but they are present
in a host-side mapping.  I ran into this when looking at EPT fault
handling.  In the code I changed, a physical page was faulted into the
task's page table; then, while the kvm->mmu_lock is held, KVM makes an
EPT mapping to the same physical page.  That mmu_lock seems to prevent
any concurrent host-side unmappings, though I'm not familiar with the mm
notifier stuff.

One usage of kvm_is_reserved_pfn() in KVM code is like this:

static struct page *kvm_pfn_to_page(kvm_pfn_t pfn)
{
        if (is_error_noslot_pfn(pfn))
                return KVM_ERR_PTR_BAD_PAGE;

        if (kvm_is_reserved_pfn(pfn)) {
                WARN_ON(1);
                return KVM_ERR_PTR_BAD_PAGE;
        }

        return pfn_to_page(pfn);
}

I think there's no guarantee the kvm->mmu_lock is held in the generic
case.  Here's one case where it wasn't (from walking through the code):

handle_exception
-handle_ud
--kvm_emulate_instruction
---x86_emulate_instruction
----x86_emulate_insn
-----writeback
------segmented_cmpxchg
-------emulator_cmpxchg_emulated
--------kvm_vcpu_gfn_to_page
---------kvm_pfn_to_page

There are probably other rules related to gfn_to_page that keep the
page alive, maybe just during interrupt/vmexit context?  Whatever keeps
those pages alive for normal memory might grab that devmap reference
under the hood for DAX mappings.

Thanks,
Barret

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
  2018-10-30  3:10       ` Dan Williams
       [not found]         ` <CAPcyv4gQztHrJ3--rhU4ZpaZyyqdqE0=gx50CRArHKiXwfYC+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-10-31  3:05         ` Yu Zhang
  1 sibling, 0 replies; 16+ messages in thread
From: Yu Zhang @ 2018-10-31  3:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Barret Rhoden, rkrcmar, Zhang, Yu C  <yu.c.zhang@intel.com>,
	KVM list <kvm@vger.kernel.org>,
	linux-nvdimm, X86 ML, Linux Kernel Mailing List, Ingo Molnar,
	Borislav Petkov, zwisler, Paolo Bonzini, Thomas Gleixner,
	H. Peter Anvin

On Mon, Oct 29, 2018 at 08:10:52PM -0700, Dan Williams wrote:
> On Mon, Oct 29, 2018 at 5:29 PM Barret Rhoden <brho@google.com> wrote:
> >
> > On 2018-10-29 at 15:25 Dan Williams <dan.j.williams@intel.com> wrote:
> > > > +       /*
> > > > +        * Our caller grabbed the KVM mmu_lock with a successful
> > > > +        * mmu_notifier_retry, so we're safe to walk the page table.
> > > > +        */
> > > > +       map_sz = pgd_mapping_size(current->mm, hva);
> > > > +       switch (map_sz) {
> > > > +       case PMD_SIZE:
> > > > +               return true;
> > > > +       case P4D_SIZE:
> > > > +       case PUD_SIZE:
> > > > +               printk_once(KERN_INFO "KVM THP promo found a very large page");
> > >
> > > Why not allow PUD_SIZE? The device-dax interface supports PUD mappings.
> >
> > The place where I use that helper seemed to care about PMDs (compared
> > to huge pages larger than PUDs), I think due to THP.  Though it also
> > checks "level == PT_PAGE_TABLE_LEVEL", so it's probably a moot point.
> >
> > I can change it from pfn_is_pmd_mapped -> pfn_is_huge_mapped and allow
> > any huge mapping that is appropriate: so PUD or PMD for DAX, PMD for
> > non-DAX, IIUC.
> 
> Yes, THP stops at PMDs, but DAX and hugetlbfs support PUD level mappings.
> 
> > > > +               return false;
> > > > +       }
> > > > +       return false;
> > > > +}
> > >
> > > The above 2 functions are  similar to what we need to do for
> > > determining the blast radius of a memory error, see
> > > dev_pagemap_mapping_shift() and its usage in add_to_kill().
> >
> > Great.  I don't know if I have access in the KVM code to the VMA to use
> > those functions directly, but I can extract the guts of
> > dev_pagemap_mapping_shift() or something and put it in mm/util.c.
> 
> Sounds good.
> 
> > > >  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > > >                                         gfn_t *gfnp, kvm_pfn_t *pfnp,
> > > >                                         int *levelp)
> > > > @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > > >          */
> > > >         if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> > > >             level == PT_PAGE_TABLE_LEVEL &&
> > > > -           PageTransCompoundMap(pfn_to_page(pfn)) &&
> > > > +           pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&
> > >
> > > I'm wondering if we're adding an explicit is_zone_device_page() check
> > > in this path to determine the page mapping size if that can be a
> > > replacement for the kvm_is_reserved_pfn() check. In other words, the
> > > goal of fixing up PageReserved() was to preclude the need for DAX-page
> > > special casing in KVM, but if we already need to add some special casing
> > > for page size determination, might as well bypass the
> > > kvm_is_reserved_pfn() dependency as well.
> >
> > kvm_is_reserved_pfn() is used in some other places, like
> > kvm_set_pfn_dirty() and kvm_set_pfn_accessed().  Maybe the way those
> > treat DAX pages matters on a case-by-case basis?
> >
> > There are other callers of kvm_is_reserved_pfn() such as
> > kvm_pfn_to_page() and gfn_to_page().  I'm not familiar (yet) with how
> > struct pages and DAX work together, and whether or not the callers of
> > those pfn_to_page() functions have expectations about the 'type' of
> > struct page they get back.
> >
> 
> The property of DAX pages that requires special coordination is the
> fact that the device hosting the pages can be disabled at will. The
> get_dev_pagemap() api is the interface to pin a device-pfn so that you
> can safely perform a pfn_to_page() operation.
> 
> Have the pages that kvm uses in this path already been pinned by vfio?

My understanding is that they could be, if there is a device assigned to
the VM.  Otherwise, they will not be.

B.R.
Yu

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
  2018-10-30 19:45           ` Barret Rhoden
@ 2018-10-31  8:49             ` Paolo Bonzini
  2018-11-02 20:32               ` Barret Rhoden
  0 siblings, 1 reply; 16+ messages in thread
From: Paolo Bonzini @ 2018-10-31  8:49 UTC (permalink / raw)
  To: Barret Rhoden, Dan Williams
  Cc: X86 ML, Zhang, Yu C, KVM list, rkrcmar, linux-nvdimm,
	Linux Kernel Mailing List, Ingo Molnar, Borislav Petkov, zwisler,
	Thomas Gleixner, H. Peter Anvin, Zhang, Yi Z

On 30/10/2018 20:45, Barret Rhoden wrote:
> On 2018-10-29 at 20:10 Dan Williams <dan.j.williams@intel.com> wrote:
>> The property of DAX pages that requires special coordination is the
>> fact that the device hosting the pages can be disabled at will. The
>> get_dev_pagemap() api is the interface to pin a device-pfn so that you
>> can safely perform a pfn_to_page() operation.
>>
>> Have the pages that kvm uses in this path already been pinned by vfio?

No, VFIO is not involved here.

The pages that KVM uses are never pinned.  Soon after we grab them and
we build KVM's page table, we do put_page in mmu_set_spte (via
kvm_release_pfn_clean).  From that point on the MMU notifier will take
care of invalidating SPT before the page disappears from the mm's page
table.
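
Roughly:

  hva_to_pfn()             ->  gup takes a reference on the page
  mmu_set_spte()           ->  SPTE installed
  kvm_release_pfn_clean()  ->  put_page(), no long-term pin kept
  ...
  MMU notifier invalidate  ->  KVM zaps the SPTE before the mm unmaps
                               or frees the page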

> One usage of kvm_is_reserved_pfn() in KVM code is like this:
> 
> static struct page *kvm_pfn_to_page(kvm_pfn_t pfn)
> {  
>         if (is_error_noslot_pfn(pfn))
>                 return KVM_ERR_PTR_BAD_PAGE; 
>     
>         if (kvm_is_reserved_pfn(pfn)) {                      
>                 WARN_ON(1);
>                 return KVM_ERR_PTR_BAD_PAGE;                         
>         }
> 
>         return pfn_to_page(pfn);                                                  
> }
> 
> I think there's no guarantee the kvm->mmu_lock is held in the generic
> case.

Indeed, it's not.

> There are probably other rules related to gfn_to_page that keep the
> page alive, maybe just during interrupt/vmexit context?  Whatever keeps
> those pages alive for normal memory might grab that devmap reference
> under the hood for DAX mappings.

Nothing keeps the page alive except for the MMU notifier (which of
course cannot run in atomic context, since its callers take the mmap_sem).

Paolo

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
  2018-10-29 22:25 ` Dan Williams
       [not found]   ` <CAPcyv4gJUjuSKwy7i2wuKR=Vz-AkDrxnGya5qkg7XTFxuXbtzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-10-31  8:52   ` Paolo Bonzini
  2018-10-31 21:16     ` Dan Williams
  1 sibling, 1 reply; 16+ messages in thread
From: Paolo Bonzini @ 2018-10-31  8:52 UTC (permalink / raw)
  To: Dan Williams, Barret Rhoden
  Cc: X86 ML, Zhang, Yu C, KVM list, rkrcmar, linux-nvdimm,
	Linux Kernel Mailing List, Ingo Molnar, Borislav Petkov, zwisler,
	Thomas Gleixner, H. Peter Anvin, Zhang, Yi Z

On 29/10/2018 23:25, Dan Williams wrote:
> I'm wondering if we're adding an explicit is_zone_device_page() check
> in this path to determine the page mapping size if that can be a
> replacement for the kvm_is_reserved_pfn() check. In other words, the
> goal of fixing up PageReserved() was to preclude the need for DAX-page
> special casing in KVM, but if we already need to add some special casing
> for page size determination, might as well bypass the
> kvm_is_reserved_pfn() dependency as well.

No, please don't.  The kvm_is_reserved_pfn() check is for correctness,
the page-size check is for optimization.  In theory you could have a
ZONE_DEVICE area that is smaller than 2MB and thus does not use huge pages.

Paolo

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
  2018-10-31  8:52   ` Paolo Bonzini
@ 2018-10-31 21:16     ` Dan Williams
  2018-11-06 10:22       ` Paolo Bonzini
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2018-10-31 21:16 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Barret Rhoden, X86 ML, Zhang,

On Wed, Oct 31, 2018 at 1:52 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 29/10/2018 23:25, Dan Williams wrote:
> > I'm wondering if we're adding an explicit is_zone_device_page() check
> > in this path to determine the page mapping size if that can be a
> > replacement for the kvm_is_reserved_pfn() check. In other words, the
> > goal of fixing up PageReserved() was to preclude the need for DAX-page
> > special casing in KVM, but if we already need to add some special casing
> > for page size determination, might as well bypass the
> > kvm_is_reserved_pfn() dependency as well.
>
> No, please don't.  The kvm_is_reserved_pfn() check is for correctness,
> the page-size check is for optimization.  In theory you could have a
> ZONE_DEVICE area that is smaller than 2MB and thus does not use huge pages.

To be clear, I was not suggesting that a ZONE_DEVICE check would be
sufficient to determine a 2MB page. I was suggesting that, given there
is a debate about removing the "reserved" designation for dax pages,
that debate is moot if kvm is going to add interrogation code to
figure out the size of dax mappings.

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
  2018-10-31  8:49             ` Paolo Bonzini
@ 2018-11-02 20:32               ` Barret Rhoden
  2018-11-06 10:19                 ` Paolo Bonzini
  0 siblings, 1 reply; 16+ messages in thread
From: Barret Rhoden @ 2018-11-02 20:32 UTC (permalink / raw)
  To: Paolo Bonzini, Dan Williams
  Cc: Dave Jiang, zwisler, Vishal L Verma, rkrcmar, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, linux-nvdimm,
	Linux Kernel Mailing List, H. Peter Anvin, X86 ML, KVM list,
	Zhang, Yu C, Zhang, Yi Z

On 2018-10-31 at 09:49 Paolo Bonzini <pbonzini@redhat.com> wrote:
> > On 2018-10-29 at 20:10 Dan Williams <dan.j.williams@intel.com> wrote:  
> >> The property of DAX pages that requires special coordination is the
> >> fact that the device hosting the pages can be disabled at will. The
> >> get_dev_pagemap() api is the interface to pin a device-pfn so that you
> >> can safely perform a pfn_to_page() operation.
> >>
> >> Have the pages that kvm uses in this path already been pinned by vfio?  
> 
> No, VFIO is not involved here.
> 
> The pages that KVM uses are never pinned.  Soon after we grab them and
> we build KVM's page table, we do put_page in mmu_set_spte (via
> kvm_release_pfn_clean).  From that point on the MMU notifier will take
> care of invalidating SPT before the page disappears from the mm's page
> table.

I looked into this a little, and I think it's safe in terms of DAX's
get_dev_pagemap() refcounts to have kvm_is_reserved_pfn() treat
DAX pages as not reserved.  kvm_is_reserved_pfn() is used before a
pfn_to_page() call, so the concern was that the pages didn't have a DAX
refcount.

Many of the times that KVM looks at the PFN, it's holding the KVM
mmu_lock, which keeps the pages in the host-side VMA (IIUC).
Functions like kvm_set_pfn_dirty() fall into this category.

The other times, such as the vmexit path I mentioned before, go through
a gfn_to_pfn call.  Under the hood, those call one of the
get_user_pages() functions during hva_to_pfn, and those do the
right thing w.r.t. get_dev_pagemap().
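
IIUC, the devmap part of gup's pte walk is roughly the below
(paraphrasing; details may differ), and that is where the pagemap
reference gets taken:

	page = vm_normal_page(vma, address, pte);
	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
		/*
		 * Device pages are only valid while the hosting pgmap is
		 * pinned, so take that reference before returning the page.
		 */
		pgmap = get_dev_pagemap(pte_pfn(pte), NULL);
		if (pgmap)
			page = pte_page(pte);
		else
			goto no_page;
	}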

One of the other things I noticed was some places in KVM make a
distinction between kvm_is_reserved_pfn and PageReserved:

void kvm_set_pfn_dirty(kvm_pfn_t pfn)
{
        if (!kvm_is_reserved_pfn(pfn)) {
                struct page *page = pfn_to_page(pfn);

                if (!PageReserved(page))
                        SetPageDirty(page);
        }
}

I think we want to SetPageDirty for DAX, so making PageReserved be true
for DAX seems like the way to go, or we'll need more KVM-specific
changes.  Apologies if this was discussed in the previous thread on this
topic and is redundant.

Thanks,

Barret

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
  2018-11-02 20:32               ` Barret Rhoden
@ 2018-11-06 10:19                 ` Paolo Bonzini
       [not found]                   ` <876d5a71-8dda-4728-5329-4e169777ba4a-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Paolo Bonzini @ 2018-11-06 10:19 UTC (permalink / raw)
  To: Barret Rhoden, Dan Williams
  Cc: Dave Jiang, zwisler, Vishal L Verma, rkrcmar, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, linux-nvdimm,
	Linux Kernel Mailing List, H. Peter Anvin, X86 ML, KVM list,
	Zhang, Yu C, Zhang, Yi Z

On 02/11/2018 21:32, Barret Rhoden wrote:
> One of the other things I noticed was some places in KVM make a
> distinction between kvm_is_reserved_pfn and PageReserved:
> 
> void kvm_set_pfn_dirty(kvm_pfn_t pfn)
> {
>         if (!kvm_is_reserved_pfn(pfn)) {
>                 struct page *page = pfn_to_page(pfn);
> 
>                 if (!PageReserved(page))
>                         SetPageDirty(page);
>         }
> }
> 
> I think we want to SetPageDirty for DAX, so making PageReserved be true
> for DAX seems like the way to go, or we'll need more KVM-specific
> changes.  Apologies if this was discussed in the previous thread on this
> topic and is redundant.

Isn't it the opposite?  We want SetPageDirty, so PageReserved must _not_
be true.

Paolo

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
  2018-10-31 21:16     ` Dan Williams
@ 2018-11-06 10:22       ` Paolo Bonzini
  0 siblings, 0 replies; 16+ messages in thread
From: Paolo Bonzini @ 2018-11-06 10:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Barret Rhoden, Dave Jiang, zwisler, Vishal L Verma, rkrcmar,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, linux-nvdimm,
	Linux Kernel Mailing List, H. Peter Anvin, X86 ML, KVM list,
	Zhang, Yu C, Zhang, Yi Z

On 31/10/2018 22:16, Dan Williams wrote:
>> No, please don't.  The kvm_is_reserved_pfn() check is for correctness,
>> the page-size check is for optimization.  In theory you could have a
>> ZONE_DEVICE area that is smaller than 2MB and thus does not use huge pages.
> To be clear, I was not suggesting that a ZONE_DEVICE check would be
> sufficient to determine a 2MB page. I was suggesting that given there
> is a debate about removing the "reserved" designation for dax pages
> that debate is moot if kvm is going to add interrogation code to
> figure out the size of dax mappings.

Oh indeed.  And in general it's okay for me to add more ZONE_DEVICE
checks to complement the existing PageReserved checks, if DAX pages are
not going to be reserved anymore.

In some cases, such as the "if (!PageReserved(page)) SetPageDirty(page)"
that Barret noted, it may even be more correct for KVM if DAX pages stop
being reserved.  All this given my very limited knowledge of MM though...

Paolo

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
       [not found]                   ` <876d5a71-8dda-4728-5329-4e169777ba4a-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-11-06 16:22                     ` Barret Rhoden
  0 siblings, 0 replies; 16+ messages in thread
From: Barret Rhoden @ 2018-11-06 16:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: X86 ML, Zhang, Yu C, KVM list, rkrcmar-H+wXaHxf7aLQT0dZR+AlfA,
	linux-nvdimm, Linux Kernel Mailing List, Ingo Molnar,
	Borislav Petkov, zwisler-DgEjT+Ai2ygdnm+yROfE0A, H. Peter Anvin,
	Thomas Gleixner, Zhang, Yi Z

On 2018-11-06 at 11:19 Paolo Bonzini <pbonzini-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > void kvm_set_pfn_dirty(kvm_pfn_t pfn)
> > {
> >         if (!kvm_is_reserved_pfn(pfn)) {
> >                 struct page *page = pfn_to_page(pfn);
> > 
> >                 if (!PageReserved(page))
> >                         SetPageDirty(page);
> >         }
> > }
> > 
> > I think we want to SetPageDirty for DAX, so making PageReserved be true
> > for DAX seems like the way to go, or we'll need more KVM-specific
> > changes.  Apologies if this was discussed in the previous thread on this
> > topic and is redundant.  
> 
> Isn't it the opposite?  We want SetPageDirty, so PageReserved must _not_
> be true.

You're right on that, I had it backwards.  The other DAX work is making
it so that DAX pages are not reserved, so the only extra '!' was in my
head.  

Thanks,

Barret

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
       [not found] ` <20181029210716.212159-1-brho-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2018-11-06 21:05   ` Barret Rhoden
  2018-11-06 21:16     ` Paolo Bonzini
  0 siblings, 1 reply; 16+ messages in thread
From: Barret Rhoden @ 2018-11-06 21:05 UTC (permalink / raw)
  To: Dan Williams, Dave Jiang, Ross Zwisler, Vishal Verma,
	Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, yu.c.zhang-ral2JQCrhuEAvxtiuMwx3w,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, x86-DgEjT+Ai2ygdnm+yROfE0A,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, H. Peter Anvin,
	yi.z.zhang-ral2JQCrhuEAvxtiuMwx3w

On 2018-10-29 at 17:07 Barret Rhoden <brho-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> Another issue is that kvm_mmu_zap_collapsible_spte() also uses
> PageTransCompoundMap() to detect huge pages, but we don't have a way to
> get the HVA easily.  Can we just aggressively zap DAX pages there?

Any thoughts about this?  Is there a way to determine the HVA or GFN in
this function:

static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
                                         struct kvm_rmap_head *rmap_head)
{
        u64 *sptep;
        struct rmap_iterator iter;
        int need_tlb_flush = 0;
        kvm_pfn_t pfn;
        struct kvm_mmu_page *sp;

restart:
        for_each_rmap_spte(rmap_head, &iter, sptep) {
                sp = page_header(__pa(sptep));
                pfn = spte_to_pfn(*sptep);

                /*
                 * We cannot do huge page mapping for indirect shadow pages,
                 * which are found on the last rmap (level = 1) when not using
                 * tdp; such shadow pages are synced with the page table in
                 * the guest, and the guest page table is using 4K page size
                 * mapping if the indirect sp has level = 1.
                 */
                if (sp->role.direct &&
                        !kvm_is_reserved_pfn(pfn) &&
                        PageTransCompoundMap(pfn_to_page(pfn))) {
                        pte_list_remove(rmap_head, sptep);
                        need_tlb_flush = 1;
                        goto restart;
                }
        }

        return need_tlb_flush;
}

If not, I was thinking of changing that loop to always remove PTEs for
DAX mappings, with the understanding that they'll get faulted back in
later.  Ideally, we'd like to check if the page is huge, but DAX can't
use the PageTransCompoundMap check.
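
E.g., something like this (untested sketch):

                if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
                        (is_zone_device_page(pfn_to_page(pfn)) ||
                         PageTransCompoundMap(pfn_to_page(pfn)))) {
                        /*
                         * DAX/ZONE_DEVICE pages are never TransCompound,
                         * so zap them unconditionally and let them get
                         * faulted back in (hopefully huge) later.
                         */
                        pte_list_remove(rmap_head, sptep);
                        need_tlb_flush = 1;
                        goto restart;
                }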

Thanks,

Barret

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
  2018-11-06 21:05   ` Barret Rhoden
@ 2018-11-06 21:16     ` Paolo Bonzini
  2018-11-06 21:17       ` Barret Rhoden
  0 siblings, 1 reply; 16+ messages in thread
From: Paolo Bonzini @ 2018-11-06 21:16 UTC (permalink / raw)
  To: Barret Rhoden, Dan Williams, Dave Jiang, Ross Zwisler,
	Vishal Verma, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: kvm, yu.c.zhang, linux-nvdimm, x86, linux-kernel, H. Peter Anvin,
	yi.z.zhang

On 06/11/2018 22:05, Barret Rhoden wrote:
> On 2018-10-29 at 17:07 Barret Rhoden <brho@google.com> wrote:
>> Another issue is that kvm_mmu_zap_collapsible_spte() also uses
>> PageTransCompoundMap() to detect huge pages, but we don't have a way to
>> get the HVA easily.  Can we just aggressively zap DAX pages there?
> 
> Any thoughts about this?  Is there a way to determine the HVA or GFN in
> this function:

Yes, iter.gfn is the gfn inside the loop and iter.level is the level
(1=PTE, 2=PDE, ...).  iter.level of course is unusable here, similar to
*levelp in transparent_hugepage_adjust, but you can use iter.gfn and
gfn_to_hva.

Paolo

> static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,                        
>                                          struct kvm_rmap_head *rmap_head)        
> {       
>         u64 *sptep;                                                              
>         struct rmap_iterator iter;                                               
>         int need_tlb_flush = 0;
>         kvm_pfn_t pfn;
>         struct kvm_mmu_page *sp;
>                                                                                  
> restart:        
>         for_each_rmap_spte(rmap_head, &iter, sptep) {
>                 sp = page_header(__pa(sptep));
>                 pfn = spte_to_pfn(*sptep);
> 
>                 /*
>                  * We cannot do huge page mapping for indirect shadow pages,     
>                  * which are found on the last rmap (level = 1) when not using   
>                  * tdp; such shadow pages are synced with the page table in      
>                  * the guest, and the guest page table is using 4K page size     
>                  * mapping if the indirect sp has level = 1.                     
>                  */     
>                 if (sp->role.direct &&                                           
>                         !kvm_is_reserved_pfn(pfn) &&                             
>                         PageTransCompoundMap(pfn_to_page(pfn))) {                
>                         pte_list_remove(rmap_head, sptep);                       
>                         need_tlb_flush = 1;                                      
>                         goto restart;                                            
>                 }                                                                
>         }
>                                    
>         return need_tlb_flush;                                                   
> }    
> 
> If not, I was thinking of changing that loop to always remove PTEs for
> DAX mappings, with the understanding that they'll get faulted back in
> later.  Ideally, we'd like to check if the page is huge, but DAX can't
> use the PageTransCompoundMap check.
> 
> Thanks,
> 
> Barret
> 
> 
> 

* Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
  2018-11-06 21:16     ` Paolo Bonzini
@ 2018-11-06 21:17       ` Barret Rhoden
  0 siblings, 0 replies; 16+ messages in thread
From: Barret Rhoden @ 2018-11-06 21:17 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Dan Williams, Dave Jiang, Ross Zwisler, Vishal Verma,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, linux-nvdimm,
	linux-kernel, H. Peter Anvin, x86, kvm, yu.c.zhang, yi.z.zhang

On 2018-11-06 at 22:16 Paolo Bonzini <pbonzini@redhat.com> wrote:
> Yes, iter.gfn is the gfn inside the loop and iter.level is the level
> (1=PTE, 2=PDE, ...).  iter.level of course is unusable here, similar to
> *levelp in transparent_hugepage_adjust, but you can use iter.gfn and
> gfn_to_hva.

Great, thanks!
