From: Vipin Sharma <vipinsh@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
	Sean Christopherson <seanjc@google.com>,
	David Matlack <dmatlack@google.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] KVM: x86/mmu: Make page tables for eager page splitting NUMA aware
Date: Fri, 5 Aug 2022 16:30:11 -0700	[thread overview]
Message-ID: <CAHVum0c=s8DH=p8zJcGzYDsfLY_qHEmvD1uF58h5WoUk6ZF8rQ@mail.gmail.com> (raw)
In-Reply-To: <cedcced0-b92c-07bd-ef2b-272ae58fdf40@redhat.com>

On Wed, Aug 3, 2022 at 8:08 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 8/2/22 21:12, Sean Christopherson wrote:
> > Ah crud, I misread the patch.  I was thinking tdp_mmu_spte_to_nid() was getting
> > the node for the existing shadow page, but it's actually getting the node for the
> > underlying physical huge page.
> >
> > That means this patch is broken now that KVM doesn't require a "struct page" to
> > create huge SPTEs (commit a8ac499bb6ab ("KVM: x86/mmu: Don't require refcounted
> > "struct page" to create huge SPTEs")).  I.e. this will explode as pfn_to_page()
> > may return garbage.
> >
> >       return page_to_nid(pfn_to_page(spte_to_pfn(spte)));
>
> I was about to say that yesterday.  However, my knowledge of struct page
> things has proven spotty enough that I wasn't 100% sure of
> that.  But yeah, with a fresh mind it's quite obvious that anything that
> goes through hva_to_pfn_remap and similar paths will fail.
>
> > That said, I still don't like this patch, at least not as a one-off thing.  I do
> > like the idea of having page table allocations follow the node requested for the
> > target pfn, what I don't like is having one-off behavior that silently overrides
> > userspace policy.
>
> Yes, I totally agree with making it all or nothing.
>
> > I would much rather we solve the problem for all page table allocations, maybe
> > with a KVM_CAP or module param?  Unfortunately, that's a very non-trivial change
> > because it will probably require having a per-node cache in each of the per-vCPU
> > caches.
>
> A module parameter is fine.  If you care about performance to this
> level, your userspace is probably homogeneous enough.
>

Thank you all for the feedback and suggestions.

Regarding dirty_log_perf_test, I will send out a patch to add an
option which pins vCPUs to physical CPUs using sched_setaffinity()
calls. This should make the test results more accurate.
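
Roughly, the pinning would look something like this (just a sketch;
pin_self_to_cpu() and where the test parses the option are
placeholders, not the actual patch):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling (vCPU) thread to a single physical CPU. */
static void pin_self_to_cpu(int cpu)
{
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);

        /* pid 0 means "the calling thread". */
        if (sched_setaffinity(0, sizeof(mask), &mask)) {
                perror("sched_setaffinity");
                exit(1);
        }
}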

It seems we agree on two things:

1. Page table page allocation should behave the same during page
faults and during eager page splitting.
2. Page table pages should be allocated on the same node where the
underlying guest memory page resides.

Here are the two approaches. Please provide feedback on which one
looks more appropriate before I start spamming your inboxes with my
patches.

Approach A:
Have a per-NUMA-node cache for page table page allocations.

Instead of having only one mmu_shadow_page_cache per vCPU, provide one
cache per node, either as:

mmu_shadow_page_cache[MAX_NUMNODES]

or as:

mmu_shadow_page_cache->objects[MAX_NUMNODES * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE]

We can decrease KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE from its current
value of 40 to control memory consumption.
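
To make the first option a bit more concrete, a rough sketch (the
array field and mmu_topup_shadow_page_caches() are illustrative
placeholders, not a finished patch; per-node GFP handling for the
cache pages is omitted here):

/* Sketch only: one mmu_shadow_page_cache per node in kvm_vcpu_arch. */
struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];

/* Top up one cache per online node before taking mmu_lock.  A smaller
 * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE bounds the total memory used. */
static int mmu_topup_shadow_page_caches(struct kvm_vcpu *vcpu)
{
        int nid, r;

        for_each_online_node(nid) {
                r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
                                               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);
                if (r)
                        return r;
        }

        return 0;
}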

When a fault happens, use the pfn to find which node the page belongs
to and use the corresponding cache to allocate page table pages:

struct page *page = kvm_pfn_to_refcounted_page(pfn);
int nid;

if (page)
        nid = page_to_nid(page);
else
        nid = numa_node_id();

...
tdp_mmu_alloc_sp(nid, vcpu);
...

static struct kvm_mmu_page *tdp_mmu_alloc_sp(int nid, struct kvm_vcpu *vcpu)
{
        ...
        sp->spt = kvm_mmu_memory_cache_alloc(nid,
                                             &vcpu->arch.mmu_shadow_page_cache);
        ...
}
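
One way to implement the nid-aware allocation used above, for the
per-node array option, would be a small wrapper over the existing
kvm_mmu_memory_cache_alloc() (sketch only; the wrapper name is made
up):

/* Sketch only: pick the per-node cache, falling back to the local node
 * when the pfn's node is unknown. */
static void *mmu_memory_cache_alloc_node(struct kvm_mmu_memory_cache *caches,
                                         int nid)
{
        if (nid < 0 || nid >= MAX_NUMNODES)
                nid = numa_node_id();

        return kvm_mmu_memory_cache_alloc(&caches[nid]);
}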


Since we are changing cache allocation for page table pages, should we
also make similar changes for other caches like mmu_page_header_cache,
mmu_gfn_array_cache, and mmu_pte_list_desc_cache? I am not sure how
good an idea this is.

Approach B:
Request the page from the specific node on the fault path, with the
option to fall back to the original cache and the default task policy.

This is what Sean's rough patch looks like.
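
For reference, not Sean's actual diff but a sketch of the idea: try
the pfn's node first and fall back to the existing per-vCPU cache,
which follows the default task policy (locking and atomic-context
details glossed over):

/* Sketch of approach B (illustrative): prefer the pfn's node, otherwise
 * fall back to the regular per-vCPU cache. */
static void *alloc_spt_on_node(struct kvm_vcpu *vcpu, int nid)
{
        struct page *page;

        page = alloc_pages_node(nid, GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
        if (page)
                return page_address(page);

        return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
}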
