kvm.vger.kernel.org archive mirror
From: Sean Christopherson <seanjc@google.com>
To: David Matlack <dmatlack@google.com>
Cc: Vipin Sharma <vipinsh@google.com>,
	pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] KVM: x86/mmu: Make page tables for eager page splitting NUMA aware
Date: Mon, 1 Aug 2022 23:56:21 +0000	[thread overview]
Message-ID: <YuhoJUoPBOu5eMz8@google.com> (raw)
In-Reply-To: <YuhPT2drgqL+osLl@google.com>

On Mon, Aug 01, 2022, David Matlack wrote:
> On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote:
> That being said, KVM currently has a gap where a guest doing a lot of
> remote memory accesses when touching memory for the first time will
> cause KVM to allocate the TDP page tables on the arguably wrong node.

Userspace can solve this by setting the NUMA policy on a VMA or shared-object
basis.  E.g. create dedicated memslots for each NUMA node, then bind each of the
backing stores to the appropriate host node.

If there is a gap, e.g. a backing store we want to use doesn't properly support
mempolicy for shared mappings, then we should enhance the backing store.

> > We can improve TDP MMU eager page splitting by making
> > tdp_mmu_alloc_sp_for_split() NUMA-aware. Specifically, when splitting a
> > huge page, allocate the new lower level page tables on the same node as the
> > huge page.
> > 
> > __get_free_page() is replaced by alloc_pages_node(). This introduces two
> > functional changes.
> > 
> >   1. __get_free_page() drops the __GFP_HIGHMEM gfp flag via its call to
> >   __get_free_pages(). This should not be an issue, as __GFP_HIGHMEM is
> >   not passed by tdp_mmu_alloc_sp_for_split() anyway.
> > 
> >   2. __get_free_page() calls alloc_pages(), which uses the thread's
> >   mempolicy to choose the NUMA node. With this commit, the thread's
> >   mempolicy is no longer used; the first preference is to allocate on
> >   the node where the huge page resides.
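In code, the described change would look roughly like the following. This is a simplified sketch, not the verbatim diff; the huge_spte variable and the surrounding plumbing are assumed from the patch's context in tdp_mmu.c.

```c
/* Before: node picked by the calling thread's mempolicy. */
sp->spt = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);

/* After (sketch): allocate on the node that holds the huge page being
 * split, derived from the existing huge SPTE's PFN. */
int nid = page_to_nid(pfn_to_page(spte_to_pfn(huge_spte)));
struct page *page = alloc_pages_node(nid, GFP_KERNEL_ACCOUNT, 0);

sp->spt = page ? page_address(page) : NULL;
```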
> 
> It would be worth noting that userspace could change the mempolicy of
> the thread doing eager splitting to prefer allocating from the target
> NUMA node, as an alternative approach.
> 
> I don't prefer the alternative though since it bleeds details from KVM
> into userspace, such as the fact that enabling dirty logging does eager
> page splitting, which allocates page tables. 

As above, if userspace cares about vNUMA, then it already needs to be aware of
some KVM/kernel details.  Separate memslots aren't strictly necessary, e.g.
userspace could stitch together contiguous VMAs to create a single mega-memslot,
but that seems like it'd be more work than just creating separate memslots.

And because eager page splitting for dirty logging runs with mmu_lock held for
read, userspace might also benefit from per-node memslots, as it can do the
splitting on multiple tasks/CPUs.

Regardless of what we do, the behavior needs to be documented, i.e. KVM details will
bleed into userspace.  E.g. if KVM is overriding the per-task NUMA policy, then
that should be documented.

> It's also unnecessary since KVM can infer an appropriate NUMA placement
> without the help of userspace, and I can't think of a reason for userspace to
> prefer a different policy.

I can't think of a reason why userspace would want to have a different policy for
the task that's enabling dirty logging, but I also can't think of a reason why
KVM should go out of its way to ignore that policy.

IMO this is a "bug" in dirty_log_perf_test, though it's probably a good idea to
document how to effectively configure vNUMA-aware memslots.


Thread overview: 11+ messages
2022-08-01 15:19 [PATCH] KVM: x86/mmu: Make page tables for eager page splitting NUMA aware Vipin Sharma
2022-08-01 22:10 ` David Matlack
2022-08-01 23:56   ` Sean Christopherson [this message]
2022-08-02 16:31     ` David Matlack
2022-08-02 17:22       ` Sean Christopherson
2022-08-02 17:42         ` Paolo Bonzini
2022-08-02 19:12           ` Sean Christopherson
2022-08-03 15:08             ` Paolo Bonzini
2022-08-05 23:30               ` Vipin Sharma
2022-08-09 16:52                 ` David Matlack
2022-08-09 23:51                   ` Sean Christopherson
