* [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes
@ 2015-04-03  7:40 Wanpeng Li
  2015-04-07 15:41 ` Paolo Bonzini
  2015-04-10 18:05 ` Andres Lagar-Cavilla
  0 siblings, 2 replies; 13+ messages in thread
From: Wanpeng Li @ 2015-04-03  7:40 UTC (permalink / raw)
  To: kvm, linux-kernel; +Cc: Paolo Bonzini, Xiao Guangrong, Wanpeng Li

There are two scenarios that call for collapsing small sptes back into
large sptes.
- Dirty logging tracks sptes at 4K granularity, so large sptes are split.
  If live migration succeeds, the large sptes are rebuilt on the
  destination machine and the guest on the source machine is destroyed.
  However, if live migration fails for some reason, the guest keeps
  running on the source machine with the sptes still small, which leads
  to bad performance.
- Our customers write tools that track the dirty rate of guests via the
  EPT D bit/PML in order to pick the most suitable guest to live migrate;
  the sptes still stay small after the dirty-rate tracking.

This patch introduces lazy collapsing of small sptes into large sptes:
when dirty logging is stopped, the memory region is scanned in the ioctl
context, the sptes that can be collapsed into large pages are dropped
during the scan, and later #PFs rebuild all the large sptes.

Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
---
v2 -> v3:
 * update comments 
 * fix infinite for loop
v1 -> v2:
 * use 'bool' instead of 'int'
 * add more comments
 * fix failure to get the next spte after dropping the current spte

 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              | 19 +++++++++++
 3 files changed, 94 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 30b28dc..91b5bdb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      struct kvm_memory_slot *memslot);
+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
+					struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cee7592..ba002a0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_flush_remote_tlbs(kvm);
 }
 
+static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
+		unsigned long *rmapp)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	int need_tlb_flush = 0;
+	pfn_t pfn;
+	struct kvm_mmu_page *sp;
+
+	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
+		BUG_ON(!(*sptep & PT_PRESENT_MASK));
+
+		sp = page_header(__pa(sptep));
+		pfn = spte_to_pfn(*sptep);
+
+		/*
+		 * Lets support EPT only for now, there still needs to figure
+		 * out an efficient way to let these codes be aware what mapping
+		 * level used in guest.
+		 */
+		if (sp->role.direct &&
+			!kvm_is_reserved_pfn(pfn) &&
+			PageTransCompound(pfn_to_page(pfn))) {
+			drop_spte(kvm, sptep);
+			sptep = rmap_get_first(*rmapp, &iter);
+			need_tlb_flush = 1;
+		} else
+			sptep = rmap_get_next(&iter);
+	}
+
+	return need_tlb_flush;
+}
+
+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
+			struct kvm_memory_slot *memslot)
+{
+	bool flush = false;
+	unsigned long *rmapp;
+	unsigned long last_index, index;
+	gfn_t gfn_start, gfn_end;
+
+	spin_lock(&kvm->mmu_lock);
+
+	gfn_start = memslot->base_gfn;
+	gfn_end = memslot->base_gfn + memslot->npages - 1;
+
+	if (gfn_start >= gfn_end)
+		goto out;
+
+	rmapp = memslot->arch.rmap[0];
+	last_index = gfn_to_index(gfn_end, memslot->base_gfn,
+					PT_PAGE_TABLE_LEVEL);
+
+	for (index = 0; index <= last_index; ++index, ++rmapp) {
+		if (*rmapp)
+			flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
+
+		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+			if (flush) {
+				kvm_flush_remote_tlbs(kvm);
+				flush = false;
+			}
+			cond_resched_lock(&kvm->mmu_lock);
+		}
+	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+
+out:
+	spin_unlock(&kvm->mmu_lock);
+}
+
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   struct kvm_memory_slot *memslot)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 50861dd..a6cd10b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7647,6 +7647,25 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	new = id_to_memslot(kvm->memslots, mem->slot);
 
 	/*
+	 * Dirty logging tracks sptes in 4k granularity, so large sptes are
+	 * split, the large sptes will be reallocated in the destination
+	 * machine and the guest in the source machine will be destroyed
+	 * when live migration successfully. However, the guest in the source
+	 * machine will continue to run if live migration fail due to some
+	 * reasons, the sptes still keep small which lead to bad performance.
+	 *
+	 * Lazy collapse small sptes into large sptes is intended to handle
+	 * this, the memory region will be scanned on the ioctl context when
+	 * dirty log is stopped, the ones which can be collapsed into large
+	 * pages will be dropped during the scan, it depends the on later #PF
+	 * to reallocate all large sptes.
+	 */
+	if ((change != KVM_MR_DELETE) &&
+		(old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
+		!(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
+		kvm_mmu_zap_collapsible_sptes(kvm, new);
+
+	/*
 	 * Set up write protection and/or dirty logging for the new slot.
 	 *
 	 * For KVM_MR_DELETE and KVM_MR_MOVE, the shadow pages of old slot have
-- 
1.9.1



* Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes
  2015-04-03  7:40 [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes Wanpeng Li
@ 2015-04-07 15:41 ` Paolo Bonzini
  2015-04-10 18:05 ` Andres Lagar-Cavilla
  1 sibling, 0 replies; 13+ messages in thread
From: Paolo Bonzini @ 2015-04-07 15:41 UTC (permalink / raw)
  To: Wanpeng Li, kvm, linux-kernel; +Cc: Xiao Guangrong



On 03/04/2015 09:40, Wanpeng Li wrote:
> There are two scenarios for the requirement of collapsing small sptes
> into large sptes.
> - dirty logging tracks sptes in 4k granularity, so large sptes are split,
>   the large sptes will be reallocated in the destination machine and the
>   guest in the source machine will be destroyed when live migration successfully.
>   However, the guest in the source machine will continue to run if live migration
>   fail due to some reasons, the sptes still keep small which lead to bad
>   performance.
> - our customers write tools to track the dirty speed of guests by EPT D bit/PML
>   in order to determine the most appropriate one to be live migrated, however
>   sptes will still keep small after tracking dirty speed.
> 
> This patch introduce lazy collapse small sptes into large sptes, the memory region
> will be scanned on the ioctl context when dirty log is stopped, the ones which can
> be collapsed into large pages will be dropped during the scan, it depends the on
> later #PF to reallocate all large sptes.
> 
> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
> ---
> v2 -> v3:
>  * update comments 
>  * fix infinite for loop
> v1 -> v2:
>  * use 'bool' instead of 'int'
>  * add more comments
>  * fix can not get the next spte after drop the current spte
> 
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c              | 19 +++++++++++
>  3 files changed, 94 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 30b28dc..91b5bdb 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  				      struct kvm_memory_slot *memslot);
> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> +					struct kvm_memory_slot *memslot);
>  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>  				   struct kvm_memory_slot *memslot);
>  void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index cee7592..ba002a0 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  		kvm_flush_remote_tlbs(kvm);
>  }
>  
> +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> +		unsigned long *rmapp)
> +{
> +	u64 *sptep;
> +	struct rmap_iterator iter;
> +	int need_tlb_flush = 0;
> +	pfn_t pfn;
> +	struct kvm_mmu_page *sp;
> +
> +	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
> +		BUG_ON(!(*sptep & PT_PRESENT_MASK));
> +
> +		sp = page_header(__pa(sptep));
> +		pfn = spte_to_pfn(*sptep);
> +
> +		/*
> +		 * Lets support EPT only for now, there still needs to figure
> +		 * out an efficient way to let these codes be aware what mapping
> +		 * level used in guest.
> +		 */
> +		if (sp->role.direct &&
> +			!kvm_is_reserved_pfn(pfn) &&
> +			PageTransCompound(pfn_to_page(pfn))) {
> +			drop_spte(kvm, sptep);
> +			sptep = rmap_get_first(*rmapp, &iter);
> +			need_tlb_flush = 1;
> +		} else
> +			sptep = rmap_get_next(&iter);
> +	}
> +
> +	return need_tlb_flush;
> +}
> +
> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> +			struct kvm_memory_slot *memslot)
> +{
> +	bool flush = false;
> +	unsigned long *rmapp;
> +	unsigned long last_index, index;
> +	gfn_t gfn_start, gfn_end;
> +
> +	spin_lock(&kvm->mmu_lock);
> +
> +	gfn_start = memslot->base_gfn;
> +	gfn_end = memslot->base_gfn + memslot->npages - 1;
> +
> +	if (gfn_start >= gfn_end)
> +		goto out;
> +
> +	rmapp = memslot->arch.rmap[0];
> +	last_index = gfn_to_index(gfn_end, memslot->base_gfn,
> +					PT_PAGE_TABLE_LEVEL);
> +
> +	for (index = 0; index <= last_index; ++index, ++rmapp) {
> +		if (*rmapp)
> +			flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
> +
> +		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> +			if (flush) {
> +				kvm_flush_remote_tlbs(kvm);
> +				flush = false;
> +			}
> +			cond_resched_lock(&kvm->mmu_lock);
> +		}
> +	}
> +
> +	if (flush)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +out:
> +	spin_unlock(&kvm->mmu_lock);
> +}
> +
>  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>  				   struct kvm_memory_slot *memslot)
>  {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 50861dd..a6cd10b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7647,6 +7647,25 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>  	new = id_to_memslot(kvm->memslots, mem->slot);
>  
>  	/*
> +	 * Dirty logging tracks sptes in 4k granularity, so large sptes are
> +	 * split, the large sptes will be reallocated in the destination
> +	 * machine and the guest in the source machine will be destroyed
> +	 * when live migration successfully. However, the guest in the source
> +	 * machine will continue to run if live migration fail due to some
> +	 * reasons, the sptes still keep small which lead to bad performance.
> +	 *
> +	 * Lazy collapse small sptes into large sptes is intended to handle
> +	 * this, the memory region will be scanned on the ioctl context when
> +	 * dirty log is stopped, the ones which can be collapsed into large
> +	 * pages will be dropped during the scan, it depends the on later #PF
> +	 * to reallocate all large sptes.
> +	 */
> +	if ((change != KVM_MR_DELETE) &&
> +		(old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
> +		!(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
> +		kvm_mmu_zap_collapsible_sptes(kvm, new);
> +
> +	/*
>  	 * Set up write protection and/or dirty logging for the new slot.
>  	 *
>  	 * For KVM_MR_DELETE and KVM_MR_MOVE, the shadow pages of old slot have
> 


Applied just with editing of the comments and commit message.

Thanks to you and Xiao.

Paolo


* Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes
  2015-04-03  7:40 [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes Wanpeng Li
  2015-04-07 15:41 ` Paolo Bonzini
@ 2015-04-10 18:05 ` Andres Lagar-Cavilla
  2015-04-13  1:45   ` Xiao Guangrong
  2015-04-14  5:25   ` [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes Wanpeng Li
  1 sibling, 2 replies; 13+ messages in thread
From: Andres Lagar-Cavilla @ 2015-04-10 18:05 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: kvm, linux-kernel, Paolo Bonzini, Xiao Guangrong, Eric Northup

On Fri, Apr 3, 2015 at 12:40 AM, Wanpeng Li <wanpeng.li@linux.intel.com> wrote:
>
> There are two scenarios for the requirement of collapsing small sptes
> into large sptes.
> - dirty logging tracks sptes in 4k granularity, so large sptes are split,
>   the large sptes will be reallocated in the destination machine and the
>   guest in the source machine will be destroyed when live migration successfully.
>   However, the guest in the source machine will continue to run if live migration
>   fail due to some reasons, the sptes still keep small which lead to bad
>   performance.
> - our customers write tools to track the dirty speed of guests by EPT D bit/PML
>   in order to determine the most appropriate one to be live migrated, however
>   sptes will still keep small after tracking dirty speed.
>
> This patch introduce lazy collapse small sptes into large sptes, the memory region
> will be scanned on the ioctl context when dirty log is stopped, the ones which can
> be collapsed into large pages will be dropped during the scan, it depends the on
> later #PF to reallocate all large sptes.
>
> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>

Hi, apologies for late review (vacation), but wanted to bring
attention to a few matters:

>
> ---
> v2 -> v3:
>  * update comments
>  * fix infinite for loop
> v1 -> v2:
>  * use 'bool' instead of 'int'
>  * add more comments
>  * fix can not get the next spte after drop the current spte
>
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c              | 19 +++++++++++
>  3 files changed, 94 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 30b28dc..91b5bdb 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>                                       struct kvm_memory_slot *memslot);
> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> +                                       struct kvm_memory_slot *memslot);
>  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>                                    struct kvm_memory_slot *memslot);
>  void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index cee7592..ba002a0 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>                 kvm_flush_remote_tlbs(kvm);
>  }
>
> +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> +               unsigned long *rmapp)
> +{
> +       u64 *sptep;
> +       struct rmap_iterator iter;
> +       int need_tlb_flush = 0;
> +       pfn_t pfn;
> +       struct kvm_mmu_page *sp;
> +
> +       for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
> +               BUG_ON(!(*sptep & PT_PRESENT_MASK));
> +
> +               sp = page_header(__pa(sptep));
> +               pfn = spte_to_pfn(*sptep);
> +
> +               /*
> +                * Lets support EPT only for now, there still needs to figure
> +                * out an efficient way to let these codes be aware what mapping
> +                * level used in guest.
> +                */
> +               if (sp->role.direct &&
> +                       !kvm_is_reserved_pfn(pfn) &&
> +                       PageTransCompound(pfn_to_page(pfn))) {

Not your fault, but PageTransCompound is unhappily named, as it also
yields true for PageHuge. Suggestion: document that this check covers
static hugetlbfs, or switch to a PageCompound() check.

A slightly bolder approach would be to refactor and reuse the nearly
identical check done in transparent_hugepage_adjust, instead of
open-coding here. In essence this code is asking for the same check,
plus the out-of-band check for static hugepages.
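
For concreteness, a rough sketch of the kind of helper I have in mind
(name made up, completely untested):

	/*
	 * True if the host backs @pfn with a compound page (THP or static
	 * hugetlbfs), i.e. the small spte could be collapsed into a large
	 * one on the next fault.  A refactored version of the check in
	 * transparent_hugepage_adjust() could live here as well.
	 */
	static bool host_pfn_is_huge(pfn_t pfn)
	{
		if (kvm_is_reserved_pfn(pfn))
			return false;

		return PageCompound(pfn_to_page(pfn));
	}

and then the condition above reduces to "sp->role.direct &&
host_pfn_is_huge(pfn)".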


> +                       drop_spte(kvm, sptep);
> +                       sptep = rmap_get_first(*rmapp, &iter);
> +                       need_tlb_flush = 1;
> +               } else
> +                       sptep = rmap_get_next(&iter);
> +       }
> +
> +       return need_tlb_flush;
> +}
> +
> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> +                       struct kvm_memory_slot *memslot)
> +{
> +       bool flush = false;
> +       unsigned long *rmapp;
> +       unsigned long last_index, index;
> +       gfn_t gfn_start, gfn_end;
> +
> +       spin_lock(&kvm->mmu_lock);
> +
> +       gfn_start = memslot->base_gfn;
> +       gfn_end = memslot->base_gfn + memslot->npages - 1;
> +
> +       if (gfn_start >= gfn_end)
> +               goto out;

I don't understand the value of this check here. Are we looking for a
broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
would be trouble in many other ways.

All this to say: please remove the above 5 lines and make code simpler.

> +
> +       rmapp = memslot->arch.rmap[0];
> +       last_index = gfn_to_index(gfn_end, memslot->base_gfn,
> +                                       PT_PAGE_TABLE_LEVEL);
> +
> +       for (index = 0; index <= last_index; ++index, ++rmapp) {

One could argue that the cleaner iteration should be over the gfn
space covered by the memslot, thus leaving the gfn <--> rmap <--> spte
interactions hidden under the hood of __gfn_to_rmap. That yields much
cleaner (IMHO) code:

    for (gfn = memslot->base_gfn;
         gfn < memslot->base_gfn + memslot->npages; gfn++) {
        flush |= kvm_mmu_zap_collapsible_spte(kvm,
                        __gfn_to_rmap(gfn, 1, memslot));
        ....

Now you can also get rid of index, last_index and rmapp. And more
importantly, the code is more understandable, and follows the pattern
established in x86/kvm/mmu.

> +               if (*rmapp)
> +                       flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
> +
> +               if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> +                       if (flush) {
> +                               kvm_flush_remote_tlbs(kvm);
> +                               flush = false;
> +                       }
> +                       cond_resched_lock(&kvm->mmu_lock);

Relinquishing this spinlock is problematic, because
commit_memory_region has not gotten around to removing write
protection. Are you certain no new write-protected PTEs will be
inserted by a racing fault that sneaks in while the spinlock is
relinquished?

If that can happen, you can decide that it does not really matter, or
that the collapsing is best effort, but either of those decisions should
be documented.

If you absolutely want to go for complete removal of PTEs, then you
need to restart the loop (costly!), or retain your structure that
iterates over the rmapp array, but restart from the beginning after each
spinlock release (less costly, but still not happy).
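
Something along these lines (untested sketch, reusing the names from the
patch above):

	restart:
		rmapp = memslot->arch.rmap[0];
		for (index = 0; index <= last_index; ++index, ++rmapp) {
			if (*rmapp)
				flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);

			if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
				if (flush) {
					kvm_flush_remote_tlbs(kvm);
					flush = false;
				}
				/* returns nonzero iff the lock was dropped */
				if (cond_resched_lock(&kvm->mmu_lock))
					goto restart;
			}
		}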

Thanks
Andres

> +               }
> +       }
> +
> +       if (flush)
> +               kvm_flush_remote_tlbs(kvm);
> +
> +out:
> +       spin_unlock(&kvm->mmu_lock);
> +}
> +
>  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>                                    struct kvm_memory_slot *memslot)
>  {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 50861dd..a6cd10b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7647,6 +7647,25 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>         new = id_to_memslot(kvm->memslots, mem->slot);
>
>         /*
> +        * Dirty logging tracks sptes in 4k granularity, so large sptes are
> +        * split, the large sptes will be reallocated in the destination
> +        * machine and the guest in the source machine will be destroyed
> +        * when live migration successfully. However, the guest in the source
> +        * machine will continue to run if live migration fail due to some
> +        * reasons, the sptes still keep small which lead to bad performance.
> +        *
> +        * Lazy collapse small sptes into large sptes is intended to handle
> +        * this, the memory region will be scanned on the ioctl context when
> +        * dirty log is stopped, the ones which can be collapsed into large
> +        * pages will be dropped during the scan, it depends the on later #PF
> +        * to reallocate all large sptes.
> +        */
> +       if ((change != KVM_MR_DELETE) &&
> +               (old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
> +               !(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
> +               kvm_mmu_zap_collapsible_sptes(kvm, new);
> +
> +       /*
>          * Set up write protection and/or dirty logging for the new slot.
>          *
>          * For KVM_MR_DELETE and KVM_MR_MOVE, the shadow pages of old slot have
> --
> 1.9.1
>




-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com


* Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes
  2015-04-10 18:05 ` Andres Lagar-Cavilla
@ 2015-04-13  1:45   ` Xiao Guangrong
  2015-04-13  5:59     ` Wanpeng Li
  2015-04-13  6:31     ` Andres Lagar-Cavilla
  2015-04-14  5:25   ` [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes Wanpeng Li
  1 sibling, 2 replies; 13+ messages in thread
From: Xiao Guangrong @ 2015-04-13  1:45 UTC (permalink / raw)
  To: Andres Lagar-Cavilla, Wanpeng Li
  Cc: kvm, linux-kernel, Paolo Bonzini, Eric Northup



On 04/11/2015 02:05 AM, Andres Lagar-Cavilla wrote:
> On Fri, Apr 3, 2015 at 12:40 AM, Wanpeng Li <wanpeng.li@linux.intel.com> wrote:
>>
>> There are two scenarios for the requirement of collapsing small sptes
>> into large sptes.
>> - dirty logging tracks sptes in 4k granularity, so large sptes are split,
>>    the large sptes will be reallocated in the destination machine and the
>>    guest in the source machine will be destroyed when live migration successfully.
>>    However, the guest in the source machine will continue to run if live migration
>>    fail due to some reasons, the sptes still keep small which lead to bad
>>    performance.
>> - our customers write tools to track the dirty speed of guests by EPT D bit/PML
>>    in order to determine the most appropriate one to be live migrated, however
>>    sptes will still keep small after tracking dirty speed.
>>
>> This patch introduce lazy collapse small sptes into large sptes, the memory region
>> will be scanned on the ioctl context when dirty log is stopped, the ones which can
>> be collapsed into large pages will be dropped during the scan, it depends the on
>> later #PF to reallocate all large sptes.
>>
>> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
>> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
>
> Hi, apologies for late review (vacation), but wanted to bring
> attention to a few matters:

No problem, your comments are really valuable to us. :)

>
>>
>> ---
>> v2 -> v3:
>>   * update comments
>>   * fix infinite for loop
>> v1 -> v2:
>>   * use 'bool' instead of 'int'
>>   * add more comments
>>   * fix can not get the next spte after drop the current spte
>>
>>   arch/x86/include/asm/kvm_host.h |  2 ++
>>   arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
>>   arch/x86/kvm/x86.c              | 19 +++++++++++
>>   3 files changed, 94 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 30b28dc..91b5bdb 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>>   void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>>   void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>                                        struct kvm_memory_slot *memslot);
>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>> +                                       struct kvm_memory_slot *memslot);
>>   void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>>                                     struct kvm_memory_slot *memslot);
>>   void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index cee7592..ba002a0 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>                  kvm_flush_remote_tlbs(kvm);
>>   }
>>
>> +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>> +               unsigned long *rmapp)
>> +{
>> +       u64 *sptep;
>> +       struct rmap_iterator iter;
>> +       int need_tlb_flush = 0;
>> +       pfn_t pfn;
>> +       struct kvm_mmu_page *sp;
>> +
>> +       for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>> +               BUG_ON(!(*sptep & PT_PRESENT_MASK));
>> +
>> +               sp = page_header(__pa(sptep));
>> +               pfn = spte_to_pfn(*sptep);
>> +
>> +               /*
>> +                * Lets support EPT only for now, there still needs to figure
>> +                * out an efficient way to let these codes be aware what mapping
>> +                * level used in guest.
>> +                */
>> +               if (sp->role.direct &&
>> +                       !kvm_is_reserved_pfn(pfn) &&
>> +                       PageTransCompound(pfn_to_page(pfn))) {
>
> Not your fault, but PageTransCompound is very unhappy naming, as it
> also yields true for PageHuge. Suggestion: document this check covers
> static hugetlbfs, or switch to PageCompound() check.
>
> A slightly bolder approach would be to refactor and reuse the nearly
> identical check done in transparent_hugepage_adjust, instead of
> open-coding here. In essence this code is asking for the same check,
> plus the out-of-band check for static hugepages.

I agree.

>
>
>> +                       drop_spte(kvm, sptep);
>> +                       sptep = rmap_get_first(*rmapp, &iter);
>> +                       need_tlb_flush = 1;
>> +               } else
>> +                       sptep = rmap_get_next(&iter);
>> +       }
>> +
>> +       return need_tlb_flush;
>> +}
>> +
>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>> +                       struct kvm_memory_slot *memslot)
>> +{
>> +       bool flush = false;
>> +       unsigned long *rmapp;
>> +       unsigned long last_index, index;
>> +       gfn_t gfn_start, gfn_end;
>> +
>> +       spin_lock(&kvm->mmu_lock);
>> +
>> +       gfn_start = memslot->base_gfn;
>> +       gfn_end = memslot->base_gfn + memslot->npages - 1;
>> +
>> +       if (gfn_start >= gfn_end)
>> +               goto out;
>
> I don't understand the value of this check here. Are we looking for a
> broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
> about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
> 2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
> would be trouble in many other ways.
>
> All this to say: please remove the above 5 lines and make code simpler.

Yes, this check is unnecessary indeed.

>
>> +
>> +       rmapp = memslot->arch.rmap[0];
>> +       last_index = gfn_to_index(gfn_end, memslot->base_gfn,
>> +                                       PT_PAGE_TABLE_LEVEL);
>> +
>> +       for (index = 0; index <= last_index; ++index, ++rmapp) {
>
> One could argue that the cleaner iteration should be over the gfn
> space covered by the memslot, thus leaving the gfn <--> rmap <--> spte
> interactions hidden under the hood of __gfn_to_rmap. That yields much
> cleaner (IMHO) code:
>
>      for (gfn = memslot->base_gfn; gfn <= memslot->base_gfn +
> memslot->npages; gfn++) {
>          flush |= kvm_mmu_zap_collapsible_spte(kvm, __gfn_to_rmap(gfn,
> 1, memslot));
>          ....
>
> Now you can also get rid of index, last_index and rmapp. And more
> importantly, the code is more understandable, and follows pattern as
> established in x86/kvm/mmu.

I do not have a strong opinion on it. The current code also uses this style;
please refer to kvm_mmu_slot_remove_write_access().

>
>> +               if (*rmapp)
>> +                       flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
>> +
>> +               if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>> +                       if (flush) {
>> +                               kvm_flush_remote_tlbs(kvm);
>> +                               flush = false;
>> +                       }
>> +                       cond_resched_lock(&kvm->mmu_lock);
>
> Relinquishing this spinlock is problematic, because
> commit_memory_region has not gotten around to removing write
> protection. Are you certain no new write-protected PTEs will be
> inserted by a racing fault that sneaks in while the spinlock is
> relinquished?
>

I do not clearly see the problem: new spte creation is based on the host
mapping, i.e. a huge mapping in the shadow page table will be synced with
the huge mapping on the host. Could you please describe the problem in
more detail?

Wanpeng, could you please post a patch to address Andres's comments?



* Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes
  2015-04-13  1:45   ` Xiao Guangrong
@ 2015-04-13  5:59     ` Wanpeng Li
  2015-04-13  6:31     ` Andres Lagar-Cavilla
  1 sibling, 0 replies; 13+ messages in thread
From: Wanpeng Li @ 2015-04-13  5:59 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Andres Lagar-Cavilla, kvm, linux-kernel, Paolo Bonzini, Eric Northup

On Mon, Apr 13, 2015 at 09:45:39AM +0800, Xiao Guangrong wrote:
>
>
>On 04/11/2015 02:05 AM, Andres Lagar-Cavilla wrote:
>>On Fri, Apr 3, 2015 at 12:40 AM, Wanpeng Li <wanpeng.li@linux.intel.com> wrote:
>>>
>>>There are two scenarios for the requirement of collapsing small sptes
>>>into large sptes.
>>>- dirty logging tracks sptes in 4k granularity, so large sptes are split,
>>>   the large sptes will be reallocated in the destination machine and the
>>>   guest in the source machine will be destroyed when live migration successfully.
>>>   However, the guest in the source machine will continue to run if live migration
>>>   fail due to some reasons, the sptes still keep small which lead to bad
>>>   performance.
>>>- our customers write tools to track the dirty speed of guests by EPT D bit/PML
>>>   in order to determine the most appropriate one to be live migrated, however
>>>   sptes will still keep small after tracking dirty speed.
>>>
>>>This patch introduce lazy collapse small sptes into large sptes, the memory region
>>>will be scanned on the ioctl context when dirty log is stopped, the ones which can
>>>be collapsed into large pages will be dropped during the scan, it depends the on
>>>later #PF to reallocate all large sptes.
>>>
>>>Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
>>>Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
>>
>>Hi, apologies for late review (vacation), but wanted to bring
>>attention to a few matters:
>
>No problem, your comments are really valuable to us. :)
>
>>
>>>
>>>---
>>>v2 -> v3:
>>>  * update comments
>>>  * fix infinite for loop
>>>v1 -> v2:
>>>  * use 'bool' instead of 'int'
>>>  * add more comments
>>>  * fix can not get the next spte after drop the current spte
>>>
>>>  arch/x86/include/asm/kvm_host.h |  2 ++
>>>  arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
>>>  arch/x86/kvm/x86.c              | 19 +++++++++++
>>>  3 files changed, 94 insertions(+)
>>>
>>>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>index 30b28dc..91b5bdb 100644
>>>--- a/arch/x86/include/asm/kvm_host.h
>>>+++ b/arch/x86/include/asm/kvm_host.h
>>>@@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>>>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>>>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>>                                       struct kvm_memory_slot *memslot);
>>>+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>>+                                       struct kvm_memory_slot *memslot);
>>>  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>>>                                    struct kvm_memory_slot *memslot);
>>>  void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
>>>diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>index cee7592..ba002a0 100644
>>>--- a/arch/x86/kvm/mmu.c
>>>+++ b/arch/x86/kvm/mmu.c
>>>@@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>>                 kvm_flush_remote_tlbs(kvm);
>>>  }
>>>
>>>+static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>>>+               unsigned long *rmapp)
>>>+{
>>>+       u64 *sptep;
>>>+       struct rmap_iterator iter;
>>>+       int need_tlb_flush = 0;
>>>+       pfn_t pfn;
>>>+       struct kvm_mmu_page *sp;
>>>+
>>>+       for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>>>+               BUG_ON(!(*sptep & PT_PRESENT_MASK));
>>>+
>>>+               sp = page_header(__pa(sptep));
>>>+               pfn = spte_to_pfn(*sptep);
>>>+
>>>+               /*
>>>+                * Lets support EPT only for now, there still needs to figure
>>>+                * out an efficient way to let these codes be aware what mapping
>>>+                * level used in guest.
>>>+                */
>>>+               if (sp->role.direct &&
>>>+                       !kvm_is_reserved_pfn(pfn) &&
>>>+                       PageTransCompound(pfn_to_page(pfn))) {
>>
>>Not your fault, but PageTransCompound is very unhappy naming, as it
>>also yields true for PageHuge. Suggestion: document this check covers
>>static hugetlbfs, or switch to PageCompound() check.
>>
>>A slightly bolder approach would be to refactor and reuse the nearly
>>identical check done in transparent_hugepage_adjust, instead of
>>open-coding here. In essence this code is asking for the same check,
>>plus the out-of-band check for static hugepages.
>
>I agree.
>
>>
>>
>>>+                       drop_spte(kvm, sptep);
>>>+                       sptep = rmap_get_first(*rmapp, &iter);
>>>+                       need_tlb_flush = 1;
>>>+               } else
>>>+                       sptep = rmap_get_next(&iter);
>>>+       }
>>>+
>>>+       return need_tlb_flush;
>>>+}
>>>+
>>>+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>>+                       struct kvm_memory_slot *memslot)
>>>+{
>>>+       bool flush = false;
>>>+       unsigned long *rmapp;
>>>+       unsigned long last_index, index;
>>>+       gfn_t gfn_start, gfn_end;
>>>+
>>>+       spin_lock(&kvm->mmu_lock);
>>>+
>>>+       gfn_start = memslot->base_gfn;
>>>+       gfn_end = memslot->base_gfn + memslot->npages - 1;
>>>+
>>>+       if (gfn_start >= gfn_end)
>>>+               goto out;
>>
>>I don't understand the value of this check here. Are we looking for a
>>broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
>>about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
>>2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
>>would be trouble in many other ways.
>>
>>All this to say: please remove the above 5 lines and make code simpler.
>
>Yes, this check is unnecessary indeed.
>
>>
>>>+
>>>+       rmapp = memslot->arch.rmap[0];
>>>+       last_index = gfn_to_index(gfn_end, memslot->base_gfn,
>>>+                                       PT_PAGE_TABLE_LEVEL);
>>>+
>>>+       for (index = 0; index <= last_index; ++index, ++rmapp) {
>>
>>One could argue that the cleaner iteration should be over the gfn
>>space covered by the memslot, thus leaving the gfn <--> rmap <--> spte
>>interactions hidden under the hood of __gfn_to_rmap. That yields much
>>cleaner (IMHO) code:
>>
>>     for (gfn = memslot->base_gfn; gfn <= memslot->base_gfn +
>>memslot->npages; gfn++) {
>>         flush |= kvm_mmu_zap_collapsible_spte(kvm, __gfn_to_rmap(gfn,
>>1, memslot));
>>         ....
>>
>>Now you can also get rid of index, last_index and rmapp. And more
>>importantly, the code is more understandable, and follows pattern as
>>established in x86/kvm/mmu.
>
>Do not have strong opinion on it. Current code also has this style, please
>refer to kvm_mmu_slot_remove_write_access().
>
>>
>>>+               if (*rmapp)
>>>+                       flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
>>>+
>>>+               if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>>>+                       if (flush) {
>>>+                               kvm_flush_remote_tlbs(kvm);
>>>+                               flush = false;
>>>+                       }
>>>+                       cond_resched_lock(&kvm->mmu_lock);
>>
>>Relinquishing this spinlock is problematic, because
>>commit_memory_region has not gotten around to removing write
>>protection. Are you certain no new write-protected PTEs will be
>>inserted by a racing fault that sneaks in while the spinlock is
>>relinquished?
>>
>
>I do not know clearly about the problem, new spte creation will be based on
>the host mapping, i.e, the huge mapping on shadow page table will be
>sync-ed with huge mapping on host. Could you please detail the problem?
>
>Wanpeng, could you please post a patch to address Andres's comments?

Yeah, I will post a patch based on this one after the merge window. :)

Regards,
Wanpeng Li 


* Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes
  2015-04-13  1:45   ` Xiao Guangrong
  2015-04-13  5:59     ` Wanpeng Li
@ 2015-04-13  6:31     ` Andres Lagar-Cavilla
  2015-04-14  4:04       ` [PATCH] KVM: MMU: fix comment in kvm_mmu_zap_collapsible_spte Xiao Guangrong
  1 sibling, 1 reply; 13+ messages in thread
From: Andres Lagar-Cavilla @ 2015-04-13  6:31 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Wanpeng Li, kvm, linux-kernel, Paolo Bonzini, Eric Northup

On Sun, Apr 12, 2015 at 6:45 PM, Xiao Guangrong
<guangrong.xiao@linux.intel.com> wrote:
>
>
> On 04/11/2015 02:05 AM, Andres Lagar-Cavilla wrote:
>>
>> On Fri, Apr 3, 2015 at 12:40 AM, Wanpeng Li <wanpeng.li@linux.intel.com>
>> wrote:
>>>
>>>
>>> There are two scenarios for the requirement of collapsing small sptes
>>> into large sptes.
>>> - dirty logging tracks sptes in 4k granularity, so large sptes are split,
>>>    the large sptes will be reallocated in the destination machine and the
>>>    guest in the source machine will be destroyed when live migration
>>> successfully.
>>>    However, the guest in the source machine will continue to run if live
>>> migration
>>>    fail due to some reasons, the sptes still keep small which lead to bad
>>>    performance.
>>> - our customers write tools to track the dirty speed of guests by EPT D
>>> bit/PML
>>>    in order to determine the most appropriate one to be live migrated,
>>> however
>>>    sptes will still keep small after tracking dirty speed.
>>>
>>> This patch introduce lazy collapse small sptes into large sptes, the
>>> memory region
>>> will be scanned on the ioctl context when dirty log is stopped, the ones
>>> which can
>>> be collapsed into large pages will be dropped during the scan, it depends
>>> the on
>>> later #PF to reallocate all large sptes.
>>>
>>> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
>>> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
>>
>>
>> Hi, apologies for late review (vacation), but wanted to bring
>> attention to a few matters:
>
>
> No problem, your comments are really valuable to us. :)

Glad to know, thanks.

>
>
>>
>>>
>>> ---
>>> v2 -> v3:
>>>   * update comments
>>>   * fix infinite for loop
>>> v1 -> v2:
>>>   * use 'bool' instead of 'int'
>>>   * add more comments
>>>   * fix can not get the next spte after drop the current spte
>>>
>>>   arch/x86/include/asm/kvm_host.h |  2 ++
>>>   arch/x86/kvm/mmu.c              | 73
>>> +++++++++++++++++++++++++++++++++++++++++
>>>   arch/x86/kvm/x86.c              | 19 +++++++++++
>>>   3 files changed, 94 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>> b/arch/x86/include/asm/kvm_host.h
>>> index 30b28dc..91b5bdb 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64
>>> accessed_mask,
>>>   void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>>>   void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>>                                        struct kvm_memory_slot *memslot);
>>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>> +                                       struct kvm_memory_slot *memslot);
>>>   void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>>>                                     struct kvm_memory_slot *memslot);
>>>   void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>> index cee7592..ba002a0 100644
>>> --- a/arch/x86/kvm/mmu.c
>>> +++ b/arch/x86/kvm/mmu.c
>>> @@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm
>>> *kvm,
>>>                  kvm_flush_remote_tlbs(kvm);
>>>   }
>>>
>>> +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>>> +               unsigned long *rmapp)
>>> +{
>>> +       u64 *sptep;
>>> +       struct rmap_iterator iter;
>>> +       int need_tlb_flush = 0;
>>> +       pfn_t pfn;
>>> +       struct kvm_mmu_page *sp;
>>> +
>>> +       for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>>> +               BUG_ON(!(*sptep & PT_PRESENT_MASK));
>>> +
>>> +               sp = page_header(__pa(sptep));
>>> +               pfn = spte_to_pfn(*sptep);
>>> +
>>> +               /*
>>> +                * Lets support EPT only for now, there still needs to
>>> figure
>>> +                * out an efficient way to let these codes be aware what
>>> mapping
>>> +                * level used in guest.
>>> +                */
>>> +               if (sp->role.direct &&
>>> +                       !kvm_is_reserved_pfn(pfn) &&
>>> +                       PageTransCompound(pfn_to_page(pfn))) {
>>
>>
>> Not your fault, but PageTransCompound is very unhappy naming, as it
>> also yields true for PageHuge. Suggestion: document this check covers
>> static hugetlbfs, or switch to PageCompound() check.
>>
>> A slightly bolder approach would be to refactor and reuse the nearly
>> identical check done in transparent_hugepage_adjust, instead of
>> open-coding here. In essence this code is asking for the same check,
>> plus the out-of-band check for static hugepages.
>
>
> I agree.
>
>
>>
>>
>>> +                       drop_spte(kvm, sptep);
>>> +                       sptep = rmap_get_first(*rmapp, &iter);
>>> +                       need_tlb_flush = 1;
>>> +               } else
>>> +                       sptep = rmap_get_next(&iter);
>>> +       }
>>> +
>>> +       return need_tlb_flush;
>>> +}
>>> +
>>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>> +                       struct kvm_memory_slot *memslot)
>>> +{
>>> +       bool flush = false;
>>> +       unsigned long *rmapp;
>>> +       unsigned long last_index, index;
>>> +       gfn_t gfn_start, gfn_end;
>>> +
>>> +       spin_lock(&kvm->mmu_lock);
>>> +
>>> +       gfn_start = memslot->base_gfn;
>>> +       gfn_end = memslot->base_gfn + memslot->npages - 1;
>>> +
>>> +       if (gfn_start >= gfn_end)
>>> +               goto out;
>>
>>
>> I don't understand the value of this check here. Are we looking for a
>> broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
>> about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
>> 2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
>> would be trouble in many other ways.
>>
>> All this to say: please remove the above 5 lines and make code simpler.
>
>
> Yes, this check is unnecessary indeed.
>
>>
>>> +
>>> +       rmapp = memslot->arch.rmap[0];
>>> +       last_index = gfn_to_index(gfn_end, memslot->base_gfn,
>>> +                                       PT_PAGE_TABLE_LEVEL);
>>> +
>>> +       for (index = 0; index <= last_index; ++index, ++rmapp) {
>>
>>
>> One could argue that the cleaner iteration should be over the gfn
>> space covered by the memslot, thus leaving the gfn <--> rmap <--> spte
>> interactions hidden under the hood of __gfn_to_rmap. That yields much
>> cleaner (IMHO) code:
>>
>>      for (gfn = memslot->base_gfn; gfn <= memslot->base_gfn +
>> memslot->npages; gfn++) {
>>          flush |= kvm_mmu_zap_collapsible_spte(kvm, __gfn_to_rmap(gfn,
>> 1, memslot));
>>          ....
>>
>> Now you can also get rid of index, last_index and rmapp. And more
>> importantly, the code is more understandable, and follows pattern as
>> established in x86/kvm/mmu.
>
>
> Do not have strong opinion on it. Current code also has this style, please
> refer to kvm_mmu_slot_remove_write_access().
>

Ugh. You're right. Not your fight to pick, I guess.

>>
>>> +               if (*rmapp)
>>> +                       flush |= kvm_mmu_zap_collapsible_spte(kvm,
>>> rmapp);
>>> +
>>> +               if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>>> +                       if (flush) {
>>> +                               kvm_flush_remote_tlbs(kvm);
>>> +                               flush = false;
>>> +                       }
>>> +                       cond_resched_lock(&kvm->mmu_lock);
>>
>>
>> Relinquishing this spinlock is problematic, because
>> commit_memory_region has not gotten around to removing write
>> protection. Are you certain no new write-protected PTEs will be
>> inserted by a racing fault that sneaks in while the spinlock is
>> relinquished?
>>
>
> I do not know clearly about the problem, new spte creation will be based on
> the host mapping, i.e, the huge mapping on shadow page table will be
> sync-ed with huge mapping on host. Could you please detail the problem?
>

It seems that, prior to calling commit_memory_region, the new memslot
with dirty_bitmap = NULL is made effective. This will ensure
force_pt_level is false in tdp_page_fault, and huge mappings are
observed. So all is well and I was spreading FUD. N.B.: what about the
software shadow MMU case?
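
(Paraphrasing the tdp fault path from memory, not exact code:

	force_pt_level = mapping_level_dirty_bitmap(vcpu, gfn);
	if (likely(!force_pt_level)) {
		/* free to pick a huge level again */
		level = mapping_level(vcpu, gfn);
		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
	} else
		level = PT_PAGE_TABLE_LEVEL;

so once dirty_bitmap is gone, the fault handler can install large
mappings again.)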

> Wanpeng, could you please post a patch to address Andres's comments?
>
Awesome, thanks
Andres


-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com


* [PATCH] KVM: MMU: fix comment in kvm_mmu_zap_collapsible_spte
  2015-04-13  6:31     ` Andres Lagar-Cavilla
@ 2015-04-14  4:04       ` Xiao Guangrong
  2015-04-15 15:05         ` Paolo Bonzini
  0 siblings, 1 reply; 13+ messages in thread
From: Xiao Guangrong @ 2015-04-14  4:04 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Wanpeng Li, kvm, linux-kernel, Paolo Bonzini, Eric Northup

The soft mmu uses direct shadow pages to fill a guest large mapping with small
pages when a huge mapping is disallowed on the host, so zapping direct shadow
pages works well for both the soft mmu and the hard mmu.

Fix the comment to reflect this.

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
---
  arch/x86/kvm/mmu.c | 8 +++++---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 146f295..68c5487 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4481,9 +4481,11 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
  		pfn = spte_to_pfn(*sptep);

  		/*
-		 * Only EPT supported for now; otherwise, one would need to
-		 * find out efficiently whether the guest page tables are
-		 * also using huge pages.
+		 * We can not do huge page mapping for the indirect shadow
+		 * page (sp) found on the last rmap (level = 1 ) since
+		 * indirect sp is synced with the page table in guest and
+		 * indirect sp->level = 1 means the guest page table is
+		 * using 4K page size mapping.
  		 */
  		if (sp->role.direct &&
  			!kvm_is_reserved_pfn(pfn) &&
-- 
2.1.0




* Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes
  2015-04-10 18:05 ` Andres Lagar-Cavilla
  2015-04-13  1:45   ` Xiao Guangrong
@ 2015-04-14  5:25   ` Wanpeng Li
  2015-04-14  6:06     ` Andres Lagar-Cavilla
  1 sibling, 1 reply; 13+ messages in thread
From: Wanpeng Li @ 2015-04-14  5:25 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: kvm, linux-kernel, Paolo Bonzini, Xiao Guangrong, Eric Northup

Hi Andres,
On Fri, Apr 10, 2015 at 11:05:26AM -0700, Andres Lagar-Cavilla wrote:
[...]
>> +               if (sp->role.direct &&
>> +                       !kvm_is_reserved_pfn(pfn) &&
>> +                       PageTransCompound(pfn_to_page(pfn))) {
>
>Not your fault, but PageTransCompound is very unhappy naming, as it
>also yields true for PageHuge. Suggestion: document this check covers
>static hugetlbfs, or switch to PageCompound() check.
>
>A slightly bolder approach would be to refactor and reuse the nearly
>identical check done in transparent_hugepage_adjust, instead of
>open-coding here. In essence this code is asking for the same check,
>plus the out-of-band check for static hugepages.

A PageCompound() check still returns true for both transparent huge pages
and hugetlbfs pages; a !PageHuge(page) && PageTransHuge(page) check is
guaranteed to catch only transparent huge pages, just as in my old commit
e76d30e20be5fc ("mm/hwpoison: fix test for a transparent huge page").
I will send a patch to fix this.
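
That is, something along these lines (untested; helper name made up, and
compound_head() is there only because the spte's pfn may point into the
middle of a huge page):

	static bool pfn_is_transparent_hugepage(pfn_t pfn)
	{
		struct page *page = compound_head(pfn_to_page(pfn));

		/* true for THP, false for 4k pages and hugetlbfs pages */
		return !PageHuge(page) && PageTransHuge(page);
	}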

>
>
>> +                       drop_spte(kvm, sptep);
>> +                       sptep = rmap_get_first(*rmapp, &iter);
>> +                       need_tlb_flush = 1;
>> +               } else
>> +                       sptep = rmap_get_next(&iter);
>> +       }
>> +
>> +       return need_tlb_flush;
>> +}
>> +
>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>> +                       struct kvm_memory_slot *memslot)
>> +{
>> +       bool flush = false;
>> +       unsigned long *rmapp;
>> +       unsigned long last_index, index;
>> +       gfn_t gfn_start, gfn_end;
>> +
>> +       spin_lock(&kvm->mmu_lock);
>> +
>> +       gfn_start = memslot->base_gfn;
>> +       gfn_end = memslot->base_gfn + memslot->npages - 1;
>> +
>> +       if (gfn_start >= gfn_end)
>> +               goto out;
>
>I don't understand the value of this check here. Are we looking for a
>broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
>about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
>2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
>would be trouble in many other ways.
>
>All this to say: please remove the above 5 lines and make code simpler.

I will send a patch to cleanup it. Thanks for your review. :)

Regards,
Wanpeng Li



* Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes
  2015-04-14  5:25   ` [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes Wanpeng Li
@ 2015-04-14  6:06     ` Andres Lagar-Cavilla
  2015-04-14  6:38       ` Wanpeng Li
  0 siblings, 1 reply; 13+ messages in thread
From: Andres Lagar-Cavilla @ 2015-04-14  6:06 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: kvm, linux-kernel, Paolo Bonzini, Xiao Guangrong, Eric Northup

On Mon, Apr 13, 2015 at 10:25 PM, Wanpeng Li <wanpeng.li@linux.intel.com> wrote:
> Hi Andres,
> On Fri, Apr 10, 2015 at 11:05:26AM -0700, Andres Lagar-Cavilla wrote:
> [...]
>>> +               if (sp->role.direct &&
>>> +                       !kvm_is_reserved_pfn(pfn) &&
>>> +                       PageTransCompound(pfn_to_page(pfn))) {
>>
>>Not your fault, but PageTransCompound is very unhappy naming, as it
>>also yields true for PageHuge. Suggestion: document this check covers
>>static hugetlbfs, or switch to PageCompound() check.
>>
>>A slightly bolder approach would be to refactor and reuse the nearly
>>identical check done in transparent_hugepage_adjust, instead of
>>open-coding here. In essence this code is asking for the same check,
>>plus the out-of-band check for static hugepages.
>
> PageCompound() check still return true for both transparent huge pages
> and hugetlbfs pages, !PageHuge(page) && PageTransHuge(page) check can
> guarantee to catch the right transparent huge pages just as my old commit
> e76d30e20be5fc ("mm/hwpoison: fix test for a transparent huge page").
> I will send a patch to fix this.
>
Why would you want to "fix" it that way? Aren't static hugepages supported?

(PageAnon is an inline check and much cheaper than !PageHuge(), which
is an actual function call)

Please consider my suggestion about refactoring the similar checks in
transparent_hugepage_adjust.

Thanks a ton
Andres
>>
>>
>>> +                       drop_spte(kvm, sptep);
>>> +                       sptep = rmap_get_first(*rmapp, &iter);
>>> +                       need_tlb_flush = 1;
>>> +               } else
>>> +                       sptep = rmap_get_next(&iter);
>>> +       }
>>> +
>>> +       return need_tlb_flush;
>>> +}
>>> +
>>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>> +                       struct kvm_memory_slot *memslot)
>>> +{
>>> +       bool flush = false;
>>> +       unsigned long *rmapp;
>>> +       unsigned long last_index, index;
>>> +       gfn_t gfn_start, gfn_end;
>>> +
>>> +       spin_lock(&kvm->mmu_lock);
>>> +
>>> +       gfn_start = memslot->base_gfn;
>>> +       gfn_end = memslot->base_gfn + memslot->npages - 1;
>>> +
>>> +       if (gfn_start >= gfn_end)
>>> +               goto out;
>>
>>I don't understand the value of this check here. Are we looking for a
>>broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
>>about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
>>2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
>>would be trouble in many other ways.
>>
>>All this to say: please remove the above 5 lines and make code simpler.
>
> I will send a patch to cleanup it. Thanks for your review. :)
>
> Regards,
> Wanpeng Li
>



-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com


* Re: [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes
  2015-04-14  6:06     ` Andres Lagar-Cavilla
@ 2015-04-14  6:38       ` Wanpeng Li
  0 siblings, 0 replies; 13+ messages in thread
From: Wanpeng Li @ 2015-04-14  6:38 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: kvm, linux-kernel, Paolo Bonzini, Xiao Guangrong, Eric Northup

On Mon, Apr 13, 2015 at 11:06:25PM -0700, Andres Lagar-Cavilla wrote:
>On Mon, Apr 13, 2015 at 10:25 PM, Wanpeng Li <wanpeng.li@linux.intel.com> wrote:
>> Hi Andres,
>> On Fri, Apr 10, 2015 at 11:05:26AM -0700, Andres Lagar-Cavilla wrote:
>> [...]
>>>> +               if (sp->role.direct &&
>>>> +                       !kvm_is_reserved_pfn(pfn) &&
>>>> +                       PageTransCompound(pfn_to_page(pfn))) {
>>>
>>>Not your fault, but PageTransCompound is very unhappy naming, as it
>>>also yields true for PageHuge. Suggestion: document this check covers
>>>static hugetlbfs, or switch to PageCompound() check.
>>>
>>>A slightly bolder approach would be to refactor and reuse the nearly
>>>identical check done in transparent_hugepage_adjust, instead of
>>>open-coding here. In essence this code is asking for the same check,
>>>plus the out-of-band check for static hugepages.
>>
>> The PageCompound() check still returns true for both transparent huge
>> pages and hugetlbfs pages; a !PageHuge(page) && PageTransHuge(page)
>> check is guaranteed to catch only the right transparent huge pages,
>> just as in my old commit e76d30e20be5fc ("mm/hwpoison: fix test for a
>> transparent huge page"). I will send a patch to fix this.
>>
>Why would you want to "fix" it that way? Aren't static hugepages supported?
>
>(PageAnon is an inline check and much cheaper than !PageHuge(), which
>is an actual function call)
>
>Please consider my suggestion about refactoring the similar checks in
>transparent_hugepage_adjust.

Ok, will do. :)

Regards,
Wanpeng Li 

>
>Thanks a ton
>Andres
>>>
>>>
>>>> +                       drop_spte(kvm, sptep);
>>>> +                       sptep = rmap_get_first(*rmapp, &iter);
>>>> +                       need_tlb_flush = 1;
>>>> +               } else
>>>> +                       sptep = rmap_get_next(&iter);
>>>> +       }
>>>> +
>>>> +       return need_tlb_flush;
>>>> +}
>>>> +
>>>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>>> +                       struct kvm_memory_slot *memslot)
>>>> +{
>>>> +       bool flush = false;
>>>> +       unsigned long *rmapp;
>>>> +       unsigned long last_index, index;
>>>> +       gfn_t gfn_start, gfn_end;
>>>> +
>>>> +       spin_lock(&kvm->mmu_lock);
>>>> +
>>>> +       gfn_start = memslot->base_gfn;
>>>> +       gfn_end = memslot->base_gfn + memslot->npages - 1;
>>>> +
>>>> +       if (gfn_start >= gfn_end)
>>>> +               goto out;
>>>
>>>I don't understand the value of this check here. Are we looking for a
>>>broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
>>>about these things? npages is capped at KVM_MEM_MAX_NR_PAGES, i.e.
>>>2^31. A 64-bit overflow would require a gigantic gfn_start, which
>>>would be trouble in many other ways.
>>>
>>>All this to say: please remove the above 5 lines and make the code simpler.
>>
>> I will send a patch to clean it up. Thanks for your review. :)
>>
>> Regards,
>> Wanpeng Li
>>
>
>
>
>-- 
>Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com


* Re: [PATCH] KVM: MMU: fix comment in kvm_mmu_zap_collapsible_spte
  2015-04-14  4:04       ` [PATCH] KVM: MMU: fix comment in kvm_mmu_zap_collapsible_spte Xiao Guangrong
@ 2015-04-15 15:05         ` Paolo Bonzini
  2015-04-16  0:38           ` Xiao Guangrong
  2015-04-16  8:14           ` Wanpeng Li
  0 siblings, 2 replies; 13+ messages in thread
From: Paolo Bonzini @ 2015-04-15 15:05 UTC (permalink / raw)
  To: Xiao Guangrong, Andres Lagar-Cavilla
  Cc: Wanpeng Li, kvm, linux-kernel, Eric Northup



> -         * Only EPT supported for now; otherwise, one would need to
> -         * find out efficiently whether the guest page tables are
> -         * also using huge pages.
> +         * We can not do huge page mapping for the indirect shadow
> +         * page (sp) found on the last rmap (level = 1 ) since
> +         * indirect sp is synced with the page table in guest and
> +         * indirect sp->level = 1 means the guest page table is
> +         * using 4K page size mapping.

What about:

+		 * We cannot do huge page mapping for indirect shadow pages,
+		 * which are found on the last rmap (level = 1) when not using
+		 * tdp; such shadow pages are synced with the page table in
+		 * the guest, and the guest page table is using 4K page size
+		 * mapping if the indirect sp has level = 1.
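
For context, in kvm_mmu_zap_collapsible_spte that comment would sit
right above the check that skips indirect shadow pages, i.e. roughly:

		/*
		 * We cannot do huge page mapping for indirect shadow pages,
		 * which are found on the last rmap (level = 1) when not using
		 * tdp; such shadow pages are synced with the page table in
		 * the guest, and the guest page table is using 4K page size
		 * mapping if the indirect sp has level = 1.
		 */
		if (sp->role.direct &&
		    !kvm_is_reserved_pfn(pfn) &&
		    PageTransCompound(pfn_to_page(pfn))) {
			drop_spte(kvm, sptep);
			sptep = rmap_get_first(*rmapp, &iter);
			need_tlb_flush = 1;
		} else
			sptep = rmap_get_next(&iter);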


* Re: [PATCH] KVM: MMU: fix comment in kvm_mmu_zap_collapsible_spte
  2015-04-15 15:05         ` Paolo Bonzini
@ 2015-04-16  0:38           ` Xiao Guangrong
  2015-04-16  8:14           ` Wanpeng Li
  1 sibling, 0 replies; 13+ messages in thread
From: Xiao Guangrong @ 2015-04-16  0:38 UTC (permalink / raw)
  To: Paolo Bonzini, Andres Lagar-Cavilla
  Cc: Wanpeng Li, kvm, linux-kernel, Eric Northup



On 04/15/2015 11:05 PM, Paolo Bonzini wrote:
>
>
>> -         * Only EPT supported for now; otherwise, one would need to
>> -         * find out efficiently whether the guest page tables are
>> -         * also using huge pages.
>> +         * We can not do huge page mapping for the indirect shadow
>> +         * page (sp) found on the last rmap (level = 1 ) since
>> +         * indirect sp is synced with the page table in guest and
>> +         * indirect sp->level = 1 means the guest page table is
>> +         * using 4K page size mapping.
>
> What about:
>
> +		 * We cannot do huge page mapping for indirect shadow pages,
> +		 * which are found on the last rmap (level = 1) when not using
> +		 * tdp; such shadow pages are synced with the page table in
> +		 * the guest, and the guest page table is using 4K page size
> +		 * mapping if the indirect sp has level = 1.

Yeah, much better. Thanks for your improvement, Paolo!


* Re: [PATCH] KVM: MMU: fix comment in kvm_mmu_zap_collapsible_spte
  2015-04-15 15:05         ` Paolo Bonzini
  2015-04-16  0:38           ` Xiao Guangrong
@ 2015-04-16  8:14           ` Wanpeng Li
  1 sibling, 0 replies; 13+ messages in thread
From: Wanpeng Li @ 2015-04-16  8:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Xiao Guangrong, Andres Lagar-Cavilla, Wanpeng Li, kvm,
	linux-kernel, Eric Northup

On Wed, Apr 15, 2015 at 05:05:48PM +0200, Paolo Bonzini wrote:
>
>
>> -         * Only EPT supported for now; otherwise, one would need to
>> -         * find out efficiently whether the guest page tables are
>> -         * also using huge pages.
>> +         * We can not do huge page mapping for the indirect shadow
>> +         * page (sp) found on the last rmap (level = 1 ) since
>> +         * indirect sp is synced with the page table in guest and
>> +         * indirect sp->level = 1 means the guest page table is
>> +         * using 4K page size mapping.
>
>What about:
>
>+		 * We cannot do huge page mapping for indirect shadow pages,
>+		 * which are found on the last rmap (level = 1) when not using
>+		 * tdp; such shadow pages are synced with the page table in
>+		 * the guest, and the guest page table is using 4K page size
>+		 * mapping if the indirect sp has level = 1.

Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>


end of thread, other threads:[~2015-04-16  8:32 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-03  7:40 [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes Wanpeng Li
2015-04-07 15:41 ` Paolo Bonzini
2015-04-10 18:05 ` Andres Lagar-Cavilla
2015-04-13  1:45   ` Xiao Guangrong
2015-04-13  5:59     ` Wanpeng Li
2015-04-13  6:31     ` Andres Lagar-Cavilla
2015-04-14  4:04       ` [PATCH] KVM: MMU: fix comment in kvm_mmu_zap_collapsible_spte Xiao Guangrong
2015-04-15 15:05         ` Paolo Bonzini
2015-04-16  0:38           ` Xiao Guangrong
2015-04-16  8:14           ` Wanpeng Li
2015-04-14  5:25   ` [PATCH v3] kvm: mmu: lazy collapse small sptes into large sptes Wanpeng Li
2015-04-14  6:06     ` Andres Lagar-Cavilla
2015-04-14  6:38       ` Wanpeng Li
