From: Peter Xu <peterx@redhat.com>
To: David Matlack <dmatlack@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, Ben Gardon <bgardon@google.com>,
	Joerg Roedel <joro@8bytes.org>, Jim Mattson <jmattson@google.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Sean Christopherson <seanjc@google.com>,
	Janis Schoetterl-Glausch <scgl@linux.vnet.ibm.com>,
	Junaid Shahid <junaids@google.com>,
	Oliver Upton <oupton@google.com>,
	Harish Barathvajasankar <hbarath@google.com>,
	Peter Shier <pshier@google.com>
Subject: Re: [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
Date: Wed, 1 Dec 2021 12:03:49 +0800
Message-ID: <Yab0JRVmwyr1GL3Y@xz-m1.local>
In-Reply-To: <CALzav=ex+5y7-5a-8Vum2-eOKuKYe=RU9NvrS82H=sTwj2mqaw@mail.gmail.com>

On Tue, Nov 30, 2021 at 04:17:01PM -0800, David Matlack wrote:
> On Tue, Nov 30, 2021 at 4:16 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 6768ef9c0891..4e78ef2dd352 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > > >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > > >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > > >
> > > > +             /*
> > > > +              * Try to proactively split any large pages down to 4KB so that
> > > > +              * vCPUs don't have to take write-protection faults.
> > > > +              */
> > > > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > > +
> > > >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> > > >
> > > >               /* Cross two large pages? */
> > >
> > > Is it intentional to try the split every time, even if we may have already
> > > split the range?  As I remember, Paolo mentioned that we can skip the split
> > > if it's not the 1st CLEAR_LOG on the same range, and IIUC that makes sense.
> > >
> > > But indeed I don't see a trivial way to know whether this is the first clear
> > > of this range.  Maybe we can somehow maintain a count of "how many huge
> > > pages are there under the current kvm_mmu_page node"?  Then if the root sp's
> > > counter == 0, we can skip the split.  Just a wild idea..
> > >
> > > Or maybe the split is intentionally attempted unconditionally for some
> > > reason?  If so, it would be great to mention that either in the commit
> > > message or in a comment.
> >
> > Thanks for calling this out. Could the same be said about the existing
> > code that unconditionally tries to write-protect 2M+ pages?

They're different because write protection is undone (the spte becomes
writable again) when vCPU threads write to the pages, so it always needs to
be redone.

A huge page split, on the other hand, is never undone once it has happened
during dirty tracking, so it's a one-time thing.

> > I aimed to keep parity with the write-protection calls (always try to split
> > before write-protecting), but I agree there may be opportunities to skip
> > the split altogether.

So IMHO it's not about parity, but about how easily it can be implemented,
and whether that added complexity will be worth it.

Besides the above per-sp accounting idea, there are other ways to do this
too, e.g., keeping a bitmap that tracks which ranges have already been split:
a 2M granule would be enough for that bitmap on x86.  We'd initialize the
bitmap to all ones when dirty logging starts for a memslot.
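
Untested, just to illustrate the idea -- the split_bitmap field is made up,
and this ignores that a clear range can cross two 2M regions:

	/* One bit per 2M region of the slot; a set bit means "this range
	 * may still contain huge pages that need splitting". */
	nbits = DIV_ROUND_UP(slot->npages, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M));
	slot->split_bitmap = bitmap_alloc(nbits, GFP_KERNEL_ACCOUNT);
	bitmap_fill(slot->split_bitmap, nbits);

	/* Then in CLEAR_DIRTY_LOG, only attempt the split once per region: */
	idx = (start - slot->base_gfn) / KVM_PAGES_PER_HPAGE(PG_LEVEL_2M);
	if (test_bit(idx, slot->split_bitmap)) {
		kvm_mmu_try_split_large_pages(kvm, slot, start, end,
					      PG_LEVEL_4K);
		clear_bit(idx, slot->split_bitmap);
	}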

But again maybe it turns out we don't really want that complexity.

IMHO a good start would be the perf numbers (which I asked for in the cover
letter) comparing the overhead of the 2nd+ iterations of CLEAR_LOG with and
without eager page splitting.

> >
> > By the way, looking at this code again I think I see some potential bugs:
> >  - I don't think I ever free split_caches in the initially-all-set case.

I saw that it's freed in kvm_mmu_try_split_large_pages(), no?

> >  - What happens if splitting fails during the first CLEAR_LOG but
> > succeeds during the second CLEAR_LOG? We would end up propagating the
> > write-protection on the 2M page down to the 4K pages. This might cause
> > issues if using PML.

Hmm, that analysis looks correct.. I'm wondering what would actually happen
in that case.

Firstly, this should be rare, since the 1st split attempt should succeed in
99% of cases.

Then, if the split failed at the 1st attempt, we'd have write-protected the
sptes during the split even though PML is in use.  When they're written,
we'll go through the fast page fault path and record the writes there too, I
think, since we'll apply the dirty bit to the new spte, so it'll just skip
PML.  It looks like we'd be using a mixture of PML + write protection, but
all dirty pages would still be captured as expected?..
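
If I'm remembering the fast page fault path right, the interaction is
roughly as below -- simplified from memory, not the exact upstream code:

	/*
	 * fast_page_fault(), very simplified: the spte was left
	 * write-protected by the failed split, so the guest write faults,
	 * we make the spte writable locklessly, and the dirty page gets
	 * logged through the write-protection path instead of via PML.
	 */
	if (fault->write && spte_can_locklessly_be_made_writable(spte)) {
		new_spte = spte | PT_WRITABLE_MASK | shadow_dirty_mask;
		if (fast_pf_fix_direct_spte(vcpu, fault, sptep, spte,
					    new_spte))
			ret = RET_PF_FIXED;
	}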

There could be leftover write protection when stopping dirty logging, but
that doesn't seem directly harmful either.  It does make things a bit messy,
at least.

-- 
Peter Xu

