From: Hugh Dickins <hughd@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Andres Lagar-Cavilla <andreslc@google.com>,
Yang Shi <yang.shi@linaro.org>, Ning Qu <quning@gmail.com>,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
linux-mm@kvack.org
Subject: [PATCH 17/31] kvm: teach kvm to map page teams as huge pages.
Date: Tue, 5 Apr 2016 14:41:14 -0700 (PDT) [thread overview]
Message-ID: <alpine.LSU.2.11.1604051439340.5965@eggly.anvils> (raw)
In-Reply-To: <alpine.LSU.2.11.1604051403210.5965@eggly.anvils>
From: Andres Lagar-Cavilla <andreslc@google.com>
Include a small treatise on the locking rules around page teams.
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---
Cc'ed to kvm@vger.kernel.org as an FYI: this patch is not expected to
go into the tree in the next few weeks, and depends upon a pageteam.h
not yet available outside this patchset. The context is a huge tmpfs
patchset which implements huge pagecache transparently on tmpfs,
using a team of small pages rather than one compound page:
please refer to linux-mm or linux-kernel for more context.
arch/x86/kvm/mmu.c | 130 ++++++++++++++++++++++++++++++-----
arch/x86/kvm/paging_tmpl.h | 3
2 files changed, 117 insertions(+), 16 deletions(-)
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -32,6 +32,7 @@
#include <linux/module.h>
#include <linux/swap.h>
#include <linux/hugetlb.h>
+#include <linux/pageteam.h>
#include <linux/compiler.h>
#include <linux/srcu.h>
#include <linux/slab.h>
@@ -2799,33 +2800,132 @@ static int kvm_handle_bad_page(struct kv
return -EFAULT;
}
+/*
+ * We are holding kvm->mmu_lock, serializing against mmu notifiers.
+ * We have a ref on page.
+ *
+ * A team of tmpfs 512 pages can be mapped as an integral hugepage as long as
+ * the team is not disbanded. The head page is !PageTeam if disbanded.
+ *
+ * Huge tmpfs pages are disbanded for page freeing, shrinking, or swap out.
+ *
+ * Freeing (punch hole, truncation):
+ * shmem_undo_range
+ * disband
+ * lock head page
+ * unmap_mapping_range
+ * zap_page_range_single
+ * mmu_notifier_invalidate_range_start
+ * __split_huge_pmd or zap_huge_pmd
+ * remap_team_by_ptes
+ * mmu_notifier_invalidate_range_end
+ * unlock head page
+ * pagevec_release
+ * pages are freed
+ * If we race with disband MMUN will fix us up. The head page lock also
+ * serializes any gup() against resolving the page team.
+ *
+ * Shrinker, disbands, but once a page team is fully banded up it no longer is
+ * tagged as shrinkable in the radix tree and hence can't be shrunk.
+ * shmem_shrink_hugehole
+ * shmem_choose_hugehole
+ * disband
+ * migrate_pages
+ * try_to_unmap
+ * mmu_notifier_invalidate_page
+ * Double-indemnity: if we race with disband, MMUN will fix us up.
+ *
+ * Swap out:
+ * shrink_page_list
+ * try_to_unmap
+ * unmap_team_by_pmd
+ * mmu_notifier_invalidate_range
+ * pageout
+ * shmem_writepage
+ * disband
+ * free_hot_cold_page_list
+ * pages are freed
+ * If we race with disband, no one will come to fix us up. So, we check for a
+ * pmd mapping, serializing against the MMUN in unmap_team_by_pmd, which will
+ * break the pmd mapping if it runs before us (or invalidate our mapping if ran
+ * after).
+ */
+static bool is_huge_tmpfs(struct kvm_vcpu *vcpu,
+ unsigned long address, struct page *page)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ struct page *head;
+
+ if (!PageTeam(page))
+ return false;
+ /*
+ * This strictly assumes PMD-level huge-ing.
+ * Which is the only thing KVM can handle here.
+ */
+ if (((address & (HPAGE_PMD_SIZE - 1)) >> PAGE_SHIFT) !=
+ (page->index & (HPAGE_PMD_NR-1)))
+ return false;
+ head = team_head(page);
+ if (!PageTeam(head))
+ return false;
+ /*
+ * Attempt at early discard. If the head races into becoming SwapCache,
+ * and thus having a bogus team_usage, we'll know for sure next.
+ */
+ if (!team_pmd_mapped(head))
+ return false;
+ /*
+ * Copied from page_check_address_transhuge, to avoid making it
+ * a module-visible symbol. Simplify it. No need for page table lock,
+ * as mmu notifier serialization ensures we are on either side of
+ * unmap_team_by_pmd or remap_team_by_ptes.
+ */
+ address &= HPAGE_PMD_MASK;
+ pgd = pgd_offset(vcpu->kvm->mm, address);
+ if (!pgd_present(*pgd))
+ return false;
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return false;
+ pmd = pmd_offset(pud, address);
+ if (!pmd_trans_huge(*pmd))
+ return false;
+ return pmd_page(*pmd) == head;
+}
+
+static bool is_transparent_hugepage(struct kvm_vcpu *vcpu,
+ unsigned long address, kvm_pfn_t pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+
+ return PageTransCompound(page) || is_huge_tmpfs(vcpu, address, page);
+}
+
static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
- gfn_t *gfnp, kvm_pfn_t *pfnp,
- int *levelp)
+ unsigned long address, gfn_t *gfnp,
+ kvm_pfn_t *pfnp, int *levelp)
{
kvm_pfn_t pfn = *pfnp;
gfn_t gfn = *gfnp;
int level = *levelp;
/*
- * Check if it's a transparent hugepage. If this would be an
- * hugetlbfs page, level wouldn't be set to
- * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
- * here.
+ * Check if it's a transparent hugepage, either anon or huge tmpfs.
+ * If this were a hugetlbfs page, level wouldn't be set to
+ * PT_PAGE_TABLE_LEVEL and no adjustment would be done here.
*/
if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
level == PT_PAGE_TABLE_LEVEL &&
- PageTransCompound(pfn_to_page(pfn)) &&
+ is_transparent_hugepage(vcpu, address, pfn) &&
!mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
unsigned long mask;
/*
* mmu_notifier_retry was successful and we hold the
- * mmu_lock here, so the pmd can't become splitting
- * from under us, and in turn
- * __split_huge_page_refcount() can't run from under
- * us and we can safely transfer the refcount from
- * PG_tail to PG_head as we switch the pfn to tail to
- * head.
+ * mmu_lock here, so the pmd can't be split under us,
+ * so we can safely transfer the refcount from PG_tail
+ * to PG_head as we switch the pfn from tail to head.
*/
*levelp = level = PT_DIRECTORY_LEVEL;
mask = KVM_PAGES_PER_HPAGE(level) - 1;
@@ -3038,7 +3138,7 @@ static int nonpaging_map(struct kvm_vcpu
goto out_unlock;
make_mmu_pages_available(vcpu);
if (likely(!force_pt_level))
- transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
+ transparent_hugepage_adjust(vcpu, hva, &gfn, &pfn, &level);
r = __direct_map(vcpu, write, map_writable, level, gfn, pfn, prefault);
spin_unlock(&vcpu->kvm->mmu_lock);
@@ -3578,7 +3678,7 @@ static int tdp_page_fault(struct kvm_vcp
goto out_unlock;
make_mmu_pages_available(vcpu);
if (likely(!force_pt_level))
- transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
+ transparent_hugepage_adjust(vcpu, hva, &gfn, &pfn, &level);
r = __direct_map(vcpu, write, map_writable, level, gfn, pfn, prefault);
spin_unlock(&vcpu->kvm->mmu_lock);
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -800,7 +800,8 @@ static int FNAME(page_fault)(struct kvm_
kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
make_mmu_pages_available(vcpu);
if (!force_pt_level)
- transparent_hugepage_adjust(vcpu, &walker.gfn, &pfn, &level);
+ transparent_hugepage_adjust(vcpu, hva, &walker.gfn, &pfn,
+ &level);
r = FNAME(fetch)(vcpu, addr, &walker, write_fault,
level, pfn, map_writable, prefault);
++vcpu->stat.pf_fixed;
next prev parent reply other threads:[~2016-04-05 21:41 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-04-05 21:10 [PATCH 00/31] huge tmpfs: THPagecache implemented by teams Hugh Dickins
2016-04-05 21:12 ` [PATCH 01/31] huge tmpfs: prepare counts in meminfo, vmstat and SysRq-m Hugh Dickins
2016-04-11 11:05 ` Kirill A. Shutemov
2016-04-17 2:28 ` Hugh Dickins
2016-04-05 21:13 ` [PATCH 02/31] huge tmpfs: include shmem freeholes in available memory Hugh Dickins
2016-04-05 21:15 ` [PATCH 03/31] huge tmpfs: huge=N mount option and /proc/sys/vm/shmem_huge Hugh Dickins
2016-04-11 11:17 ` Kirill A. Shutemov
2016-04-17 2:00 ` Hugh Dickins
2016-04-05 21:16 ` [PATCH 04/31] huge tmpfs: try to allocate huge pages, split into a team Hugh Dickins
2016-04-05 21:17 ` [PATCH 05/31] huge tmpfs: avoid team pages in a few places Hugh Dickins
2016-04-05 21:20 ` [PATCH 06/31] huge tmpfs: shrinker to migrate and free underused holes Hugh Dickins
2016-04-05 21:21 ` [PATCH 07/31] huge tmpfs: get_unmapped_area align & fault supply huge page Hugh Dickins
2016-04-05 21:23 ` [PATCH 08/31] huge tmpfs: try_to_unmap_one use page_check_address_transhuge Hugh Dickins
2016-04-05 21:24 ` [PATCH 09/31] huge tmpfs: avoid premature exposure of new pagetable Hugh Dickins
2016-04-11 11:54 ` Kirill A. Shutemov
2016-04-17 1:49 ` Hugh Dickins
2016-04-05 21:25 ` [PATCH 10/31] huge tmpfs: map shmem by huge page pmd or by page team ptes Hugh Dickins
2016-04-05 21:29 ` [PATCH 11/31] huge tmpfs: disband split huge pmds on race or memory failure Hugh Dickins
2016-04-05 21:33 ` [PATCH 12/31] huge tmpfs: extend get_user_pages_fast to shmem pmd Hugh Dickins
2016-04-06 7:00 ` Ingo Molnar
2016-04-07 2:53 ` Hugh Dickins
2016-04-13 8:58 ` Ingo Molnar
2016-04-05 21:34 ` [PATCH 13/31] huge tmpfs: use Unevictable lru with variable hpage_nr_pages Hugh Dickins
2016-04-05 21:35 ` [PATCH 14/31] huge tmpfs: fix Mlocked meminfo, track huge & unhuge mlocks Hugh Dickins
2016-04-05 21:37 ` [PATCH 15/31] huge tmpfs: fix Mapped meminfo, track huge & unhuge mappings Hugh Dickins
2016-04-05 21:39 ` [PATCH 16/31] kvm: plumb return of hva when resolving page fault Hugh Dickins
2016-04-05 21:41 ` Hugh Dickins [this message]
2016-04-05 23:37 ` [PATCH 17/31] kvm: teach kvm to map page teams as huge pages Paolo Bonzini
2016-04-06 1:12 ` Hugh Dickins
2016-04-06 6:47 ` Paolo Bonzini
2016-04-05 21:44 ` [PATCH 18/31] huge tmpfs: mem_cgroup move charge on shmem " Hugh Dickins
2016-04-05 21:46 ` [PATCH 19/31] huge tmpfs: mem_cgroup shmem_pmdmapped accounting Hugh Dickins
2016-04-05 21:47 ` [PATCH 20/31] huge tmpfs: mem_cgroup shmem_hugepages accounting Hugh Dickins
2016-04-05 21:49 ` [PATCH 21/31] huge tmpfs: show page team flag in pageflags Hugh Dickins
2016-04-05 21:51 ` [PATCH 22/31] huge tmpfs: /proc/<pid>/smaps show ShmemHugePages Hugh Dickins
2016-04-05 21:53 ` [PATCH 23/31] huge tmpfs recovery: framework for reconstituting huge pages Hugh Dickins
2016-04-06 10:28 ` Mika Penttilä
2016-04-07 2:05 ` Hugh Dickins
2016-04-05 21:54 ` [PATCH 24/31] huge tmpfs recovery: shmem_recovery_populate to fill huge page Hugh Dickins
2016-04-05 21:56 ` [PATCH 25/31] huge tmpfs recovery: shmem_recovery_remap & remap_team_by_pmd Hugh Dickins
2016-04-05 21:58 ` [PATCH 26/31] huge tmpfs recovery: shmem_recovery_swapin to read from swap Hugh Dickins
2016-04-05 22:00 ` [PATCH 27/31] huge tmpfs recovery: tweak shmem_getpage_gfp to fill team Hugh Dickins
2016-04-05 22:02 ` [PATCH 28/31] huge tmpfs recovery: debugfs stats to complete this phase Hugh Dickins
2016-04-05 22:03 ` [PATCH 29/31] huge tmpfs recovery: page migration call back into shmem Hugh Dickins
2016-04-05 22:05 ` [PATCH 30/31] huge tmpfs: shmem_huge_gfpmask and shmem_recovery_gfpmask Hugh Dickins
2016-04-05 22:07 ` [PATCH 31/31] huge tmpfs: no kswapd by default on sync allocations Hugh Dickins
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LSU.2.11.1604051439340.5965@eggly.anvils \
--to=hughd@google.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=andreslc@google.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=quning@gmail.com \
--cc=yang.shi@linaro.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).