From: Xiao Guangrong <guangrong.xiao@linux.intel.com>
To: Andres Lagar-Cavilla <andreslc@google.com>,
	Wanpeng Li <wanpeng.li@linux.intel.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Paolo Bonzini <pbonzini@redhat.com>,
	Eric Northup <digitaleric@google.com>
Subject: Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes
Date: Mon, 13 Apr 2015 09:45:39 +0800
Message-ID: <552B1FC3.4070604@linux.intel.com>
In-Reply-To: <CAJu=L59ogYcW_VGVXeEyjVY65ExNrr_P61TdnpdXuWe9cHEpNw@mail.gmail.com>



On 04/11/2015 02:05 AM, Andres Lagar-Cavilla wrote:
> On Fri, Apr 3, 2015 at 12:40 AM, Wanpeng Li <wanpeng.li@linux.intel.com> wrote:
>>
>> There are two scenarios that call for collapsing small sptes back into
>> large sptes.
>> - Dirty logging tracks sptes at 4k granularity, so large sptes are split.
>>    The large sptes will be reallocated on the destination machine, and the
>>    guest on the source machine will be destroyed once live migration succeeds.
>>    However, the guest on the source machine will continue to run if live
>>    migration fails for some reason, and the sptes then stay small, which
>>    leads to bad performance.
>> - Our customers write tools that track the dirty speed of guests via the
>>    EPT D bit/PML in order to determine the most appropriate guest to live
>>    migrate; the sptes will still stay small after the dirty speed tracking.
>>
>> This patch introduces lazy collapsing of small sptes into large sptes: the
>> memory region is scanned in the ioctl context when dirty logging is stopped,
>> the sptes that can be collapsed into large pages are dropped during the scan,
>> and later #PFs reallocate all the large sptes.
>>
>> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
>> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
>
> Hi, apologies for the late review (vacation), but I wanted to bring
> attention to a few matters:

No problem, your comments are really valuable to us. :)

>
>>
>> ---
>> v2 -> v3:
>>   * update comments
>>   * fix infinite for loop
>> v1 -> v2:
>>   * use 'bool' instead of 'int'
>>   * add more comments
>>   * fix failing to get the next spte after dropping the current spte
>>
>>   arch/x86/include/asm/kvm_host.h |  2 ++
>>   arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
>>   arch/x86/kvm/x86.c              | 19 +++++++++++
>>   3 files changed, 94 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 30b28dc..91b5bdb 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>>   void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>>   void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>                                        struct kvm_memory_slot *memslot);
>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>> +                                       struct kvm_memory_slot *memslot);
>>   void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>>                                     struct kvm_memory_slot *memslot);
>>   void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index cee7592..ba002a0 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>                  kvm_flush_remote_tlbs(kvm);
>>   }
>>
>> +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>> +               unsigned long *rmapp)
>> +{
>> +       u64 *sptep;
>> +       struct rmap_iterator iter;
>> +       int need_tlb_flush = 0;
>> +       pfn_t pfn;
>> +       struct kvm_mmu_page *sp;
>> +
>> +       for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>> +               BUG_ON(!(*sptep & PT_PRESENT_MASK));
>> +
>> +               sp = page_header(__pa(sptep));
>> +               pfn = spte_to_pfn(*sptep);
>> +
>> +               /*
>> +                * Only support EPT for now; we still need to figure out an
>> +                * efficient way to let this code know what mapping level is
>> +                * used in the guest.
>> +                */
>> +               if (sp->role.direct &&
>> +                       !kvm_is_reserved_pfn(pfn) &&
>> +                       PageTransCompound(pfn_to_page(pfn))) {
>
> Not your fault, but PageTransCompound is rather unfortunate naming, as it
> also yields true for PageHuge. Suggestion: document that this check also
> covers static hugetlbfs, or switch to a PageCompound() check.
>
> A slightly bolder approach would be to refactor and reuse the nearly
> identical check done in transparent_hugepage_adjust, instead of
> open-coding here. In essence this code is asking for the same check,
> plus the out-of-band check for static hugepages.

I agree.
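
Just to make the suggestion concrete, a possible shape for that refactoring
(only a sketch; the helper name and its placement are hypothetical, not
existing code):

/*
 * True if the pfn is backed by a compound page, i.e. a transparent
 * hugepage or a static hugetlbfs page, and so could be mapped by a
 * large spte.
 */
static bool pfn_is_huge_backed(pfn_t pfn)
{
	if (kvm_is_reserved_pfn(pfn))
		return false;

	return PageTransCompound(pfn_to_page(pfn));
}

Then both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
could call the helper instead of open-coding the check, and the hugetlbfs
case would be documented in a single place.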

>
>
>> +                       drop_spte(kvm, sptep);
>> +                       sptep = rmap_get_first(*rmapp, &iter);
>> +                       need_tlb_flush = 1;
>> +               } else
>> +                       sptep = rmap_get_next(&iter);
>> +       }
>> +
>> +       return need_tlb_flush;
>> +}
>> +
>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>> +                       struct kvm_memory_slot *memslot)
>> +{
>> +       bool flush = false;
>> +       unsigned long *rmapp;
>> +       unsigned long last_index, index;
>> +       gfn_t gfn_start, gfn_end;
>> +
>> +       spin_lock(&kvm->mmu_lock);
>> +
>> +       gfn_start = memslot->base_gfn;
>> +       gfn_end = memslot->base_gfn + memslot->npages - 1;
>> +
>> +       if (gfn_start >= gfn_end)
>> +               goto out;
>
> I don't understand the value of this check here. Are we looking for a
> broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
> about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
> 2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
> would be trouble in many other ways.
>
> All this to say: please remove the above 5 lines and make the code simpler.

Yes, this check is unnecessary indeed.
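
With that check gone, the setup at the top of the function reduces to
something like this (sketch only):

	spin_lock(&kvm->mmu_lock);

	rmapp = memslot->arch.rmap[0];
	last_index = gfn_to_index(memslot->base_gfn + memslot->npages - 1,
				  memslot->base_gfn, PT_PAGE_TABLE_LEVEL);

i.e. the gfn_start/gfn_end locals and the early goto go away.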

>
>> +
>> +       rmapp = memslot->arch.rmap[0];
>> +       last_index = gfn_to_index(gfn_end, memslot->base_gfn,
>> +                                       PT_PAGE_TABLE_LEVEL);
>> +
>> +       for (index = 0; index <= last_index; ++index, ++rmapp) {
>
> One could argue that the cleaner iteration should be over the gfn
> space covered by the memslot, thus leaving the gfn <--> rmap <--> spte
> interactions hidden under the hood of __gfn_to_rmap. That yields much
> cleaner (IMHO) code:
>
>      for (gfn = memslot->base_gfn; gfn < memslot->base_gfn +
> memslot->npages; gfn++) {
>          flush |= kvm_mmu_zap_collapsible_spte(kvm, __gfn_to_rmap(gfn,
> 1, memslot));
>          ....
>
> Now you can also get rid of index, last_index and rmapp. And more
> importantly, the code is more understandable, and follows pattern as
> established in x86/kvm/mmu.

I do not have a strong opinion on this. The current code already uses this
style as well; please refer to kvm_mmu_slot_remove_write_access().
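
For reference, the pattern I mean is roughly the following (paraphrased
from that function, not an exact copy):

	last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
	for (index = 0; index <= last_index; ++index, ++rmapp) {
		if (*rmapp)
			flush |= __rmap_write_protect(kvm, rmapp, false);

		if (need_resched() || spin_needbreak(&kvm->mmu_lock))
			cond_resched_lock(&kvm->mmu_lock);
	}

so the new function simply follows the existing index/rmapp walk rather
than a gfn-based one.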

>
>> +               if (*rmapp)
>> +                       flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
>> +
>> +               if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>> +                       if (flush) {
>> +                               kvm_flush_remote_tlbs(kvm);
>> +                               flush = false;
>> +                       }
>> +                       cond_resched_lock(&kvm->mmu_lock);
>
> Relinquishing this spinlock is problematic, because
> commit_memory_region has not gotten around to removing write
> protection. Are you certain no new write-protected PTEs will be
> inserted by a racing fault that sneaks in while the spinlock is
> relinquished?
>

I do not clearly understand the problem. New spte creation is based on the
host mapping, i.e. the huge mapping in the shadow page table is kept in sync
with the huge mapping on the host. Could you please detail the problem?

Wanpeng, could you please post a patch to address Andres's comments?

