Message-ID: <552B1FC3.4070604@linux.intel.com>
Date: Mon, 13 Apr 2015 09:45:39 +0800
From: Xiao Guangrong
To: Andres Lagar-Cavilla, Wanpeng Li
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Paolo Bonzini, Eric Northup
Subject: Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes
References: <1428046825-6905-1-git-send-email-wanpeng.li@linux.intel.com>
In-Reply-To:

On 04/11/2015 02:05 AM, Andres Lagar-Cavilla wrote:
> On Fri, Apr 3, 2015 at 12:40 AM, Wanpeng Li wrote:
>>
>> There are two scenarios that call for collapsing small sptes back into
>> large sptes.
>> - Dirty logging tracks sptes at 4k granularity, so large sptes have to
>>   be split.  If live migration succeeds, the large sptes are reallocated
>>   in the destination machine and the guest in the source machine is
>>   destroyed.  However, if live migration fails for some reason, the
>>   guest in the source machine continues to run with the sptes still
>>   small, which leads to bad performance.
>> - Our customers write tools that track the dirty speed of guests via
>>   the EPT D bit/PML in order to determine the most suitable guest to
>>   live migrate; the sptes also stay small after such tracking.
>>
>> This patch introduces lazy collapsing of small sptes into large sptes:
>> the memory region is scanned in the ioctl context when dirty logging is
>> stopped, the sptes that can be collapsed into large pages are dropped
>> during the scan, and later #PFs re-establish the large sptes.
>>
>> Reviewed-by: Xiao Guangrong
>> Signed-off-by: Wanpeng Li
>
> Hi, apologies for the late review (vacation), but I wanted to bring
> attention to a few matters:

No problem, your comments are really valuable to us. :)
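(For anyone reading along without the full patch in front of them: the
x86.c hunk is not quoted below, but per the changelog the scan is
triggered from the ioctl context when dirty logging is stopped,
presumably from kvm_arch_commit_memory_region(). A rough sketch only of
what that call site could look like, so the exact condition may differ:

        /*
         * Sketch only: trigger the lazy collapse once dirty logging has
         * just been disabled for this memslot.
         */
        if ((change != KVM_MR_DELETE) &&
            (old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
            !(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
                kvm_mmu_zap_collapsible_sptes(kvm, new);

The scan then runs under that ioctl, with later page faults rebuilding
the large sptes.)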
>
>>
>> ---
>> v2 -> v3:
>>  * update comments
>>  * fix infinite for loop
>> v1 -> v2:
>>  * use 'bool' instead of 'int'
>>  * add more comments
>>  * fix failure to get the next spte after dropping the current spte
>>
>>  arch/x86/include/asm/kvm_host.h |  2 ++
>>  arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
>>  arch/x86/kvm/x86.c              | 19 +++++++++++
>>  3 files changed, 94 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 30b28dc..91b5bdb 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>                                        struct kvm_memory_slot *memslot);
>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>> +                                   struct kvm_memory_slot *memslot);
>>  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>>                                     struct kvm_memory_slot *memslot);
>>  void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index cee7592..ba002a0 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>          kvm_flush_remote_tlbs(kvm);
>>  }
>>
>> +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>> +                                         unsigned long *rmapp)
>> +{
>> +        u64 *sptep;
>> +        struct rmap_iterator iter;
>> +        int need_tlb_flush = 0;
>> +        pfn_t pfn;
>> +        struct kvm_mmu_page *sp;
>> +
>> +        for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>> +                BUG_ON(!(*sptep & PT_PRESENT_MASK));
>> +
>> +                sp = page_header(__pa(sptep));
>> +                pfn = spte_to_pfn(*sptep);
>> +
>> +                /*
>> +                 * Only EPT is supported for now; we still need to
>> +                 * figure out an efficient way for this code to know
>> +                 * what mapping level the guest uses.
>> +                 */
>> +                if (sp->role.direct &&
>> +                        !kvm_is_reserved_pfn(pfn) &&
>> +                        PageTransCompound(pfn_to_page(pfn))) {
>
> Not your fault, but PageTransCompound is very unhappy naming, as it
> also yields true for PageHuge. Suggestion: document that this check
> covers static hugetlbfs, or switch to a PageCompound() check.
>
> A slightly bolder approach would be to refactor and reuse the nearly
> identical check done in transparent_hugepage_adjust, instead of
> open-coding it here. In essence this code is asking for the same check,
> plus the out-of-band check for static hugepages.

I agree.

>
>
>> +                        drop_spte(kvm, sptep);
>> +                        sptep = rmap_get_first(*rmapp, &iter);
>> +                        need_tlb_flush = 1;
>> +                } else
>> +                        sptep = rmap_get_next(&iter);
>> +        }
>> +
>> +        return need_tlb_flush;
>> +}
>> +
>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>> +                                   struct kvm_memory_slot *memslot)
>> +{
>> +        bool flush = false;
>> +        unsigned long *rmapp;
>> +        unsigned long last_index, index;
>> +        gfn_t gfn_start, gfn_end;
>> +
>> +        spin_lock(&kvm->mmu_lock);
>> +
>> +        gfn_start = memslot->base_gfn;
>> +        gfn_end = memslot->base_gfn + memslot->npages - 1;
>> +
>> +        if (gfn_start >= gfn_end)
>> +                goto out;
>
> I don't understand the value of this check here. Are we looking for a
> broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
> about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
> 2^31. A 64-bit overflow would be caused by a gigantic gfn_start, which
> would be trouble in many other ways.
>
> All this to say: please remove the above 5 lines and make the code
> simpler.

Yes, this check is indeed unnecessary.
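The prologue of kvm_mmu_zap_collapsible_sptes() then simply becomes
something like this (untested sketch, the scan loop stays as in the
patch below):

        void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
                                           struct kvm_memory_slot *memslot)
        {
                unsigned long *rmapp;
                unsigned long last_index, index;
                bool flush = false;

                spin_lock(&kvm->mmu_lock);

                rmapp = memslot->arch.rmap[0];
                last_index = gfn_to_index(memslot->base_gfn + memslot->npages - 1,
                                          memslot->base_gfn, PT_PAGE_TABLE_LEVEL);

                /* ... scan rmapp[0..last_index] as in the loop below,
                 * then flush if needed and unlock ... */
        }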
>
>> +
>> +        rmapp = memslot->arch.rmap[0];
>> +        last_index = gfn_to_index(gfn_end, memslot->base_gfn,
>> +                                  PT_PAGE_TABLE_LEVEL);
>> +
>> +        for (index = 0; index <= last_index; ++index, ++rmapp) {
>
> One could argue that the cleaner iteration should be over the gfn
> space covered by the memslot, thus leaving the gfn <--> rmap <--> spte
> interactions hidden under the hood of __gfn_to_rmap. That yields much
> cleaner (IMHO) code:
>
>     for (gfn = memslot->base_gfn;
>          gfn <= memslot->base_gfn + memslot->npages; gfn++) {
>             flush |= kvm_mmu_zap_collapsible_spte(kvm,
>                             __gfn_to_rmap(gfn, 1, memslot));
>     ....
>
> Now you can also get rid of index, last_index and rmapp. And more
> importantly, the code is more understandable and follows the pattern
> established in x86/kvm/mmu.

I do not have a strong opinion on it. The current code uses this style
as well; please refer to kvm_mmu_slot_remove_write_access().

>
>> +                if (*rmapp)
>> +                        flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
>> +
>> +                if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>> +                        if (flush) {
>> +                                kvm_flush_remote_tlbs(kvm);
>> +                                flush = false;
>> +                        }
>> +                        cond_resched_lock(&kvm->mmu_lock);
>
> Relinquishing this spinlock is problematic, because
> commit_memory_region has not gotten around to removing write
> protection. Are you certain no new write-protected PTEs will be
> inserted by a racing fault that sneaks in while the spinlock is
> relinquished?
>

I do not see the problem clearly yet: new spte creation is based on the
host mapping, i.e., a huge mapping in the shadow page table is kept in
sync with the huge mapping on the host side. Could you please detail
the problem?

Wanpeng, could you please post a patch to address Andres's comments?
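Regarding the earlier suggestion to reuse the check from
transparent_hugepage_adjust(): the pfn-based part of it could be
factored into a small helper shared by both paths, e.g. something like
the sketch below (the helper name is made up, no such helper exists
today):

        /*
         * Sketch only: the pfn-based half of the huge-page check.
         * PageCompound() also catches static hugetlbfs pages, which is
         * the out-of-band case mentioned above.
         */
        static bool kvm_pfn_can_map_large(pfn_t pfn)
        {
                return !is_error_noslot_pfn(pfn) &&
                       !kvm_is_reserved_pfn(pfn) &&
                       PageCompound(pfn_to_page(pfn));
        }

The condition in kvm_mmu_zap_collapsible_spte() would then read
"if (sp->role.direct && kvm_pfn_can_map_large(pfn))".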