From: Dan Williams <dan.j.williams@intel.com>
To: Barret Rhoden <brho@google.com>
Cc: Dave Jiang <dave.jiang@intel.com>,
zwisler@kernel.org, Vishal L Verma <vishal.l.verma@intel.com>,
Paolo Bonzini <pbonzini@redhat.com>,
rkrcmar@redhat.com, Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
linux-nvdimm <linux-nvdimm@lists.01.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
"H. Peter Anvin" <hpa@zytor.com>, X86 ML <x86@kernel.org>,
KVM list <kvm@vger.kernel.org>,
"Zhang, Yu C" <yu.c.zhang@intel.com>,
"Zhang, Yi Z" <yi.z.zhang@intel.com>
Subject: Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files
Date: Mon, 29 Oct 2018 15:25:42 -0700 [thread overview]
Message-ID: <CAPcyv4gJUjuSKwy7i2wuKR=Vz-AkDrxnGya5qkg7XTFxuXbtzw@mail.gmail.com> (raw)
In-Reply-To: <20181029210716.212159-1-brho@google.com>
On Mon, Oct 29, 2018 at 2:07 PM Barret Rhoden <brho@google.com> wrote:
>
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
>
> DAX pages are not PageTransCompound. The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not. For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>
> For DAX, we can check the page table itself. Actually, we might always
> be able to walk the page table, even for PageTransCompound pages, but
> it's probably a little slower.
>
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock (grabbed before checking
> the mmu seq). Based on the other comments about not worrying about a
> pmd split, we might be able to safely walk the page table without
> holding the mm sem.
>
> This patch relies on kvm_is_reserved_pfn() being false for DAX pages,
> which I've hacked up for testing this code. That change should
> eventually happen:
>
> https://lore.kernel.org/lkml/20181022084659.GA84523@tiger-server/
>
> Another issue is that kvm_mmu_zap_collapsible_spte() also uses
> PageTransCompoundMap() to detect huge pages, but we don't have a way to
> get the HVA easily. Can we just aggressively zap DAX pages there?
>
> Alternatively, is there a better way to track at the struct page level
> whether or not a page is huge-mapped? Maybe the DAX huge pages mark
> themselves as TransCompound or something similar, and we don't need to
> special case DAX/ZONE_DEVICE pages.
>
> Signed-off-by: Barret Rhoden <brho@google.com>
> ---
> arch/x86/kvm/mmu.c | 71 +++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 70 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index cf5f572f2305..9f3e0f83a2dd 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3152,6 +3152,75 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> return -EFAULT;
> }
>
> +static unsigned long pgd_mapping_size(struct mm_struct *mm, unsigned long addr)
> +{
> + pgd_t *pgd;
> + p4d_t *p4d;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + return 0;
> +
> + p4d = p4d_offset(pgd, addr);
> + if (!p4d_present(*p4d))
> + return 0;
> + if (p4d_huge(*p4d))
> + return P4D_SIZE;
> +
> + pud = pud_offset(p4d, addr);
> + if (!pud_present(*pud))
> + return 0;
> + if (pud_huge(*pud))
> + return PUD_SIZE;
> +
> + pmd = pmd_offset(pud, addr);
> + if (!pmd_present(*pmd))
> + return 0;
> + if (pmd_huge(*pmd))
> + return PMD_SIZE;
> +
> + pte = pte_offset_map(pmd, addr);
> + if (!pte_present(*pte)) {
> + pte_unmap(pte);
> + return 0;
> + }
> + pte_unmap(pte);
> + return PAGE_SIZE;
> +}
> +
> +static bool pfn_is_pmd_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> + struct page *page = pfn_to_page(pfn);
> + unsigned long hva, map_sz;
> +
> + if (!is_zone_device_page(page))
> + return PageTransCompoundMap(page);
> +
> + /*
> + * DAX pages do not use compound pages. The page should have already
> + * been mapped into the host-side page table during try_async_pf(), so
> + * we can check the page tables directly.
> + */
> + hva = gfn_to_hva(kvm, gfn);
> + if (kvm_is_error_hva(hva))
> + return false;
> +
> + /*
> + * Our caller grabbed the KVM mmu_lock with a successful
> + * mmu_notifier_retry, so we're safe to walk the page table.
> + */
> + map_sz = pgd_mapping_size(current->mm, hva);
> + switch (map_sz) {
> + case PMD_SIZE:
> + return true;
> + case P4D_SIZE:
> + case PUD_SIZE:
> + printk_once(KERN_INFO "KVM THP promo found a very large page\n");
Why not allow PUD_SIZE? The device-dax interface supports PUD mappings.
> + return false;
> + }
> + return false;
> +}
The above two functions are similar to what we need to do for
determining the blast radius of a memory error; see
dev_pagemap_mapping_shift() and its usage in add_to_kill().
> +
> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> gfn_t *gfnp, kvm_pfn_t *pfnp,
> int *levelp)
> @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> */
> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> level == PT_PAGE_TABLE_LEVEL &&
> - PageTransCompoundMap(pfn_to_page(pfn)) &&
> + pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&
I'm wondering: if we're adding an explicit is_zone_device_page() check
in this path to determine the page mapping size, can that replace the
kvm_is_reserved_pfn() check? In other words, the goal of fixing up
PageReserved() was to preclude the need for DAX-page special casing in
KVM, but if we already need some special casing for page size
determination, we might as well drop the kvm_is_reserved_pfn()
dependency as well.