linux-kernel.vger.kernel.org archive mirror
* [PATCH] kvm: mmu: lazy collapse small sptes into large sptes
@ 2015-03-29 23:48 Wanpeng Li
From: Wanpeng Li @ 2015-03-29 23:48 UTC
  To: kvm, linux-kernel
  Cc: Marcelo Tosatti, Paolo Bonzini, Xiao Guangrong, Wanpeng Li

There are two scenarios that require collapsing small sptes into large
sptes.
- Dirty logging tracks sptes at 4K granularity, so large sptes are split.
  When live migration succeeds, the large sptes are reallocated on the
  destination machine and the guest on the source machine is destroyed.
  However, if live migration fails for some reason, the guest on the
  source machine continues to run with small sptes, which leads to bad
  performance.
- Our customers write tools that track the dirty rate of guests with the
  EPT D bit/PML in order to determine the most appropriate guest to live
  migrate; however, the sptes remain small after the dirty rate has been
  tracked.

This patch introduces lazy collapsing of small sptes into large sptes.
The memory region is scanned in ioctl context when dirty logging is
stopped, the sptes that can be collapsed into large pages are dropped
during the scan, and later #PFs are relied on to reallocate the large
sptes.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.c              | 66 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |  5 ++++
 3 files changed, 73 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a236e39..73de5d3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -859,6 +859,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      struct kvm_memory_slot *memslot);
+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
+					struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cee7592..d25ced1 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4465,6 +4465,72 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_flush_remote_tlbs(kvm);
 }
 
+static int kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
+		unsigned long *rmapp)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	int need_tlb_flush = 0;
+	pfn_t pfn;
+	struct kvm_mmu_page *sp;
+
+	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
+		BUG_ON(!(*sptep & PT_PRESENT_MASK));
+
+		sp = page_header(__pa(sptep));
+		pfn = spte_to_pfn(*sptep);
+		if (sp->role.direct &&
+			!kvm_is_reserved_pfn(pfn) &&
+			PageTransCompound(pfn_to_page(pfn))) {
+			drop_spte(kvm, sptep);
+			need_tlb_flush = 1;
+		}
+		sptep = rmap_get_next(&iter);
+	}
+
+	return need_tlb_flush;
+}
+
+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
+			struct kvm_memory_slot *memslot)
+{
+	bool flush = false;
+	unsigned long *rmapp;
+	unsigned long last_index, index;
+	gfn_t gfn_start, gfn_end;
+
+	spin_lock(&kvm->mmu_lock);
+
+	gfn_start = memslot->base_gfn;
+	gfn_end = memslot->base_gfn + memslot->npages - 1;
+
+	if (gfn_start >= gfn_end)
+		goto out;
+
+	rmapp = memslot->arch.rmap[0];
+	last_index = gfn_to_index(gfn_end, memslot->base_gfn,
+					PT_PAGE_TABLE_LEVEL);
+
+	for (index = 0; index <= last_index; ++index, ++rmapp) {
+		if (*rmapp)
+			flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
+
+		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+			if (flush) {
+				kvm_flush_remote_tlbs(kvm);
+				flush = false;
+			}
+			cond_resched_lock(&kvm->mmu_lock);
+		}
+	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+
+out:
+	spin_unlock(&kvm->mmu_lock);
+}
+
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   struct kvm_memory_slot *memslot)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c5f7e03..6037389 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7618,6 +7618,11 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	/* It's OK to get 'new' slot here as it has already been installed */
 	new = id_to_memslot(kvm->memslots, mem->slot);
 
+	if ((change != KVM_MR_DELETE) &&
+		(old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
+		!(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
+		kvm_mmu_zap_collapsible_sptes(kvm, new);
+
 	/*
 	 * Set up write protection and/or dirty logging for the new slot.
 	 *
-- 
1.9.1



* Re: [PATCH] kvm: mmu: lazy collapse small sptes into large sptes
@ 2015-04-03  4:25 Xiao Guangrong
From: Xiao Guangrong @ 2015-04-03  4:25 UTC
  To: Wanpeng Li, kvm, linux-kernel; +Cc: Marcelo Tosatti, Paolo Bonzini



On 03/30/2015 07:48 AM, Wanpeng Li wrote:
> There are two scenarios that require collapsing small sptes into large
> sptes.
> - Dirty logging tracks sptes at 4K granularity, so large sptes are split.
>    When live migration succeeds, the large sptes are reallocated on the
>    destination machine and the guest on the source machine is destroyed.
>    However, if live migration fails for some reason, the guest on the
>    source machine continues to run with small sptes, which leads to bad
>    performance.
> - Our customers write tools that track the dirty rate of guests with the
>    EPT D bit/PML in order to determine the most appropriate guest to live
>    migrate; however, the sptes remain small after the dirty rate has been
>    tracked.
>
> This patch introduces lazy collapsing of small sptes into large sptes.
> The memory region is scanned in ioctl context when dirty logging is
> stopped, the sptes that can be collapsed into large pages are dropped
> during the scan, and later #PFs are relied on to reallocate the large
> sptes.
>
> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  2 ++
>   arch/x86/kvm/mmu.c              | 66 +++++++++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/x86.c              |  5 ++++
>   3 files changed, 73 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a236e39..73de5d3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -859,6 +859,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>   void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>   void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>   				      struct kvm_memory_slot *memslot);
> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> +					struct kvm_memory_slot *memslot);
>   void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>   				   struct kvm_memory_slot *memslot);
>   void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index cee7592..d25ced1 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4465,6 +4465,72 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>   		kvm_flush_remote_tlbs(kvm);
>   }
>
> +static int kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> +		unsigned long *rmapp)

This can return 'bool' instead of 'int'; you already use 'bool' for the
flush flag in kvm_mmu_zap_collapsible_sptes().

> +{
> +	u64 *sptep;
> +	struct rmap_iterator iter;
> +	int need_tlb_flush = 0;
> +	pfn_t pfn;
> +	struct kvm_mmu_page *sp;
> +
> +	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
> +		BUG_ON(!(*sptep & PT_PRESENT_MASK));
> +
> +		sp = page_header(__pa(sptep));
> +		pfn = spte_to_pfn(*sptep);
> +		if (sp->role.direct &&

This only handles direct mappings; please add a comment explaining
why.

> +			!kvm_is_reserved_pfn(pfn) &&
> +			PageTransCompound(pfn_to_page(pfn))) {
> +			drop_spte(kvm, sptep);
> +			need_tlb_flush = 1;
> +		}
> +		sptep = rmap_get_next(&iter);

You cannot get the next spte after dropping the current one: drop_spte()
removes it from the rmap, which invalidates the iterator. Please
refer to kvm_unmap_rmapp().

> +	}
> +
> +	return need_tlb_flush;
> +}
> +
> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> +			struct kvm_memory_slot *memslot)
> +{
> +	bool flush = false;
> +	unsigned long *rmapp;
> +	unsigned long last_index, index;
> +	gfn_t gfn_start, gfn_end;
> +
> +	spin_lock(&kvm->mmu_lock);
> +
> +	gfn_start = memslot->base_gfn;
> +	gfn_end = memslot->base_gfn + memslot->npages - 1;
> +
> +	if (gfn_start >= gfn_end)
> +		goto out;
> +
> +	rmapp = memslot->arch.rmap[0];
> +	last_index = gfn_to_index(gfn_end, memslot->base_gfn,
> +					PT_PAGE_TABLE_LEVEL);
> +
> +	for (index = 0; index <= last_index; ++index, ++rmapp) {
> +		if (*rmapp)
> +			flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
> +
> +		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> +			if (flush) {
> +				kvm_flush_remote_tlbs(kvm);
> +				flush = false;
> +			}
> +			cond_resched_lock(&kvm->mmu_lock);
> +		}
> +	}
> +
> +	if (flush)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +out:
> +	spin_unlock(&kvm->mmu_lock);
> +}
> +
>   void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>   				   struct kvm_memory_slot *memslot)
>   {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c5f7e03..6037389 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7618,6 +7618,11 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>   	/* It's OK to get 'new' slot here as it has already been installed */
>   	new = id_to_memslot(kvm->memslots, mem->slot);
>
> +	if ((change != KVM_MR_DELETE) &&
> +		(old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
> +		!(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
> +		kvm_mmu_zap_collapsible_sptes(kvm, new);
> +

Please add a comment here explaining when and why the sptes are zapped.



* Re: [PATCH] kvm: mmu: lazy collapse small sptes into large sptes
@ 2015-04-03  6:10 Wanpeng Li
From: Wanpeng Li @ 2015-04-03  6:10 UTC
  To: Xiao Guangrong
  Cc: Wanpeng Li, kvm, linux-kernel, Marcelo Tosatti, Paolo Bonzini

On Fri, Apr 03, 2015 at 12:25:14PM +0800, Xiao Guangrong wrote:
>Please add a comment here explaining when and why the sptes are zapped.

I will handle all your comments in v2. Thanks for your review. ;)

Regards,
Wanpeng Li 

