Re: [RFC PATCH] kvm/x86: Keep root hpa in prev_roots as much as possible

From: Lai Jiangshan <jiangshanlai@gmail.com>
To: Sean Christopherson <seanjc@google.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	kvm@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>,
	Lai Jiangshan <laijs@linux.alibaba.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>
Subject: Re: [RFC PATCH] kvm/x86: Keep root hpa in prev_roots as much as possible
Date: Tue, 3 Aug 2021 09:19:17 +0800
Message-ID: <CAJhGHyCU-Om3NWLVg-kbUE7FZD1nNZft8+KeCDH3cr_FDaitXQ@mail.gmail.com>
In-Reply-To: <YQLuBDZ2MlNlIoH4@google.com>

On Fri, Jul 30, 2021 at 2:06 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, May 26, 2021, Lai Jiangshan wrote:
> > From: Lai Jiangshan <laijs@linux.alibaba.com>
> >
> > Pagetable roots in prev_roots[] are likely to be reused soon and
> > there is no much overhead to keep it with a new need_sync field
> > introduced.
> >
> > With the help of the new need_sync field, pagetable roots are
> > kept as much as possible, and they will be re-synced before reused
> > instead of being dropped.
> >
> > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > ---
> >
> > This patch is just for RFC.
> >   Is the idea Ok?
>
> Yes, the idea is definitely a good one.
>
> >   If the idea is Ok, we need to reused one bit from pgd or hpa
> >     as need_sync to save memory.  Which one is better?
>
> Ha, we can do this without increasing the memory footprint and without co-opting
> a bit from pgd or hpa.  Because of compiler alignment/padding, the u8s and bools
> between mmu_role and prev_roots already occupy 8 bytes, even though the actual
> size is 4 bytes.  In total, we need room for 4 roots (3 previous + current), i.e.
> 4 bytes.  If a separate array is used, no additional memory is consumed and no
> masking is needed when reading/writing e.g. pgd.
>
> The cost is an extra swap() when updating the prev_roots LRU, but that's peanuts
> and would likely be offset by masking anyways.
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 99f37781a6fc..13bb3c3a60b4 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -424,10 +424,12 @@ struct kvm_mmu {
>         hpa_t root_hpa;
>         gpa_t root_pgd;
>         union kvm_mmu_role mmu_role;
> +       bool root_unsync;
>         u8 root_level;
>         u8 shadow_root_level;
>         u8 ept_ad;
>         bool direct_map;
> +       bool unsync_roots[KVM_MMU_NUM_PREV_ROOTS];
>         struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
>

Hello

I think that is too complicated.  It is also hard to accept keeping "unsync"
outside of struct kvm_mmu_root_info when the two should be bound to each other.

How about this:
- KVM_MMU_NUM_PREV_ROOTS
+ KVM_MMU_NUM_CACHED_ROOTS
- mmu->prev_roots[KVM_MMU_NUM_PREV_ROOTS]
+ mmu->cached_roots[KVM_MMU_NUM_CACHED_ROOTS]
- mmu->root_hpa
+ mmu->cached_roots[0].hpa
- mmu->root_pgd
+ mmu->cached_roots[0].pgd

And use bit 63 of @pgd to record that no sync has been requested for the
root since its last sync.
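
A minimal userspace sketch of the layout above, not kernel code.  It assumes
bit 63 is never part of a valid pgd value.  ROOT_PGD_SYNCED, struct mmu_sketch
and all the helper names are hypothetical, and the role check done by
is_root_usable() is left out to keep it short:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;
typedef uint64_t hpa_t;

/* Slot 0 is the current root; slots 1..3 take the place of prev_roots[]. */
#define KVM_MMU_NUM_CACHED_ROOTS 4

/* Hypothetical name: bit 63 set means "no sync requested since the last sync". */
#define ROOT_PGD_SYNCED (1ULL << 63)

struct kvm_mmu_root_info {
	gpa_t pgd;	/* guest pgd, with the flag stored in bit 63 */
	hpa_t hpa;
};

/* Stand-in for the relevant part of struct kvm_mmu. */
struct mmu_sketch {
	struct kvm_mmu_root_info cached_roots[KVM_MMU_NUM_CACHED_ROOTS];
};

/* The pgd value with the flag bit masked off. */
static gpa_t root_pgd(const struct kvm_mmu_root_info *root)
{
	return root->pgd & ~ROOT_PGD_SYNCED;
}

static bool root_needs_sync(const struct kvm_mmu_root_info *root)
{
	return !(root->pgd & ROOT_PGD_SYNCED);
}

/* E.g. on INVEPT/INVPCID: keep the root, but force a sync before reuse. */
static void root_request_sync(struct kvm_mmu_root_info *root)
{
	root->pgd &= ~ROOT_PGD_SYNCED;
}

static void root_mark_synced(struct kvm_mmu_root_info *root)
{
	root->pgd |= ROOT_PGD_SYNCED;
}

/* Install a freshly built root, which starts out in sync. */
static void root_set(struct kvm_mmu_root_info *root, gpa_t pgd, hpa_t hpa)
{
	assert(!(pgd & ROOT_PGD_SYNCED));	/* assumption: bit 63 is free */
	root->pgd = pgd | ROOT_PGD_SYNCED;
	root->hpa = hpa;
}

/*
 * Move the root matching new_pgd to slot 0, shifting the more recently
 * used roots down by one.  Returns false if no cached root matches.
 * On success, *needs_sync tells the caller whether to sync and flush.
 */
static bool switch_to_cached_root(struct mmu_sketch *mmu, gpa_t new_pgd,
				  bool *needs_sync)
{
	struct kvm_mmu_root_info found;
	int i;

	for (i = 0; i < KVM_MMU_NUM_CACHED_ROOTS; i++)
		if (root_pgd(&mmu->cached_roots[i]) == new_pgd)
			break;

	if (i >= KVM_MMU_NUM_CACHED_ROOTS)
		return false;

	found = mmu->cached_roots[i];
	for (; i > 0; i--)
		mmu->cached_roots[i] = mmu->cached_roots[i - 1];
	mmu->cached_roots[0] = found;

	*needs_sync = root_needs_sync(&found);
	return true;
}
```

With this, the flag travels with the root through the LRU reordering, so no
second swap() is needed, at the cost of masking on every read of the pgd.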

Thanks
Lai.

>         /*
>
>
> >  arch/x86/include/asm/kvm_host.h |  3 ++-
> >  arch/x86/kvm/mmu/mmu.c          |  6 ++++++
> >  arch/x86/kvm/vmx/nested.c       | 12 ++++--------
> >  arch/x86/kvm/x86.c              |  9 +++++----
> >  4 files changed, 17 insertions(+), 13 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 55efbacfc244..19a337cf7aa6 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -354,10 +354,11 @@ struct rsvd_bits_validate {
> >  struct kvm_mmu_root_info {
> >       gpa_t pgd;
> >       hpa_t hpa;
> > +     bool need_sync;
>
> Hmm, use "unsync" instead of "need_sync", purely to match the existing terminology
> in KVM's MMU for this sort of behavior.
>
> >  };
> >
> >  #define KVM_MMU_ROOT_INFO_INVALID \
> > -     ((struct kvm_mmu_root_info) { .pgd = INVALID_PAGE, .hpa = INVALID_PAGE })
> > +     ((struct kvm_mmu_root_info) { .pgd = INVALID_PAGE, .hpa = INVALID_PAGE, .need_sync = true})
> >
> >  #define KVM_MMU_NUM_PREV_ROOTS 3
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 5e60b00e8e50..147827135549 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3878,6 +3878,7 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> >
> >       root.pgd = mmu->root_pgd;
> >       root.hpa = mmu->root_hpa;
> > +     root.need_sync = false;
> >
> >       if (is_root_usable(&root, new_pgd, new_role))
> >               return true;
> > @@ -3892,6 +3893,11 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> >       mmu->root_hpa = root.hpa;
> >       mmu->root_pgd = root.pgd;
> >
> > +     if (i < KVM_MMU_NUM_PREV_ROOTS && root.need_sync) {
>
> Probably makes sense to write this as:
>
>         if (i >= KVM_MMU_NUM_PREV_ROOTS)
>                 return false;
>
>         if (root.need_sync) {
>                 kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
>                 kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>         }
>         return true;
>
> The "i < KVM_MMU_NUM_PREV_ROOTS == success" logic is just confusing enough that
> it'd be nice to write it only once.
>
> And that would also play nicely with deferring a sync for the "current" root
> (see below), e.g.
>
>         ...
>         unsync = mmu->root_unsync;
>
>         if (is_root_usable(&root, new_pgd, new_role))
>                 goto found_root;
>
>         for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
>                 swap(root, mmu->prev_roots[i]);
>                 swap(unsync, mmu->unsync_roots[i]);
>
>                 if (is_root_usable(&root, new_pgd, new_role))
>                         break;
>         }
>
>         if (i >= KVM_MMU_NUM_PREV_ROOTS)
>                 return false;
>
> found_root:
>         if (unsync) {
>                 kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
>                 kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>         }
>         return true;
>
> > +             kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > +             kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> > +     }
> > +
> >       return i < KVM_MMU_NUM_PREV_ROOTS;
> >  }
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 6058a65a6ede..ab7069ac6dc5 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -5312,7 +5312,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
> >  {
> >       struct vcpu_vmx *vmx = to_vmx(vcpu);
> >       u32 vmx_instruction_info, types;
> > -     unsigned long type, roots_to_free;
> > +     unsigned long type;
> >       struct kvm_mmu *mmu;
> >       gva_t gva;
> >       struct x86_exception e;
> > @@ -5361,29 +5361,25 @@ static int handle_invept(struct kvm_vcpu *vcpu)
> >                       return nested_vmx_fail(vcpu,
> >                               VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> >
> > -             roots_to_free = 0;
> >               if (nested_ept_root_matches(mmu->root_hpa, mmu->root_pgd,
> >                                           operand.eptp))
> > -                     roots_to_free |= KVM_MMU_ROOT_CURRENT;
> > +                     kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
>
> For a non-RFC series, I think this should do two things:
>
>   1. Separate INVEPT from INVPCID, i.e. do only INVPCID first.
>   2. Enhance INVEPT to SYNC+FLUSH the current root instead of freeing it
>
> As alluded to above, this can be done by deferring the sync+flush (which can't
> be done right away because INVEPT runs in L1 context, whereas KVM needs to sync+flush
> L2 EPT context).
>
> >               for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
> >                       if (nested_ept_root_matches(mmu->prev_roots[i].hpa,
> >                                                   mmu->prev_roots[i].pgd,
> >                                                   operand.eptp))
> > -                             roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
> > +                             mmu->prev_roots[i].need_sync = true;
> >               }
> >               break;
> >       case VMX_EPT_EXTENT_GLOBAL:
> > -             roots_to_free = KVM_MMU_ROOTS_ALL;
> > +             kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOTS_ALL);
> >               break;
> >       default:
> >               BUG();
> >               break;
> >       }
> >
> > -     if (roots_to_free)
> > -             kvm_mmu_free_roots(vcpu, mmu, roots_to_free);
> > -
> >       return nested_vmx_succeed(vcpu);
> >  }