linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH 0/9] rework KVM mmu_shrink() code
@ 2010-06-15 13:55 Dave Hansen
  2010-06-15 13:55 ` [RFC][PATCH 1/9] abstract kvm x86 mmu->n_free_mmu_pages Dave Hansen
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: Dave Hansen @ 2010-06-15 13:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Dave Hansen

This is a big RFC for the moment.  These need a bunch more
runtime testing.

--

We've seen contention in the mmu_shrink() function.  This patch
set reworks it to hopefully be more scalable to large numbers
of CPUs, as well as large numbers of running VMs.

The patches are ordered with increasing invasiveness.

These seem to boot and run fine.  I'm running about 40 VMs at
once, while doing "echo 3 > /proc/sys/vm/drop_caches", and
killing/restarting VMs constantly.

Seems to be relatively stable, and seems to keep the numbers
of kvm_mmu_page_header objects down.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH 1/9] abstract kvm x86 mmu->n_free_mmu_pages
  2010-06-15 13:55 [RFC][PATCH 0/9] rework KVM mmu_shrink() code Dave Hansen
@ 2010-06-15 13:55 ` Dave Hansen
  2010-06-16  8:40   ` Avi Kivity
  2010-06-15 13:55 ` [RFC][PATCH 2/9] rename x86 kvm->arch.n_alloc_mmu_pages Dave Hansen
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2010-06-15 13:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Dave Hansen


First of all, I think "free" is a poor name for this value.  In
this context, it means, "the number of mmu pages which this kvm
instance should be able to allocate."  To me, "free" implies
much more strongly that the objects are there and ready for use.  I think
"available" is a much better description, especially when you
see how it is calculated.

In this patch, we abstract its use into a function.  We'll soon
replace the function's contents by calculating the value in a
different way.


Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/kvm/mmu.c |    6 +++---
 linux-2.6.git-dave/arch/x86/kvm/mmu.h |    7 ++++++-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff -puN arch/x86/kvm/mmu.c~abstract_kvm_free_mmu_pages arch/x86/kvm/mmu.c
--- linux-2.6.git/arch/x86/kvm/mmu.c~abstract_kvm_free_mmu_pages	2010-06-09 15:14:28.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-09 15:14:28.000000000 -0700
@@ -1522,7 +1522,7 @@ void kvm_mmu_change_mmu_pages(struct kvm
 {
 	int used_pages;
 
-	used_pages = kvm->arch.n_alloc_mmu_pages - kvm->arch.n_free_mmu_pages;
+	used_pages = kvm->arch.n_alloc_mmu_pages - kvm_mmu_available_pages(kvm);
 	used_pages = max(0, used_pages);
 
 	/*
@@ -2752,7 +2752,7 @@ EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page
 
 void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
 {
-	while (vcpu->kvm->arch.n_free_mmu_pages < KVM_REFILL_PAGES &&
+	while (kvm_mmu_available_pages(vcpu->kvm) < KVM_REFILL_PAGES &&
 	       !list_empty(&vcpu->kvm->arch.active_mmu_pages)) {
 		struct kvm_mmu_page *sp;
 
@@ -2933,7 +2933,7 @@ static int mmu_shrink(int nr_to_scan, gf
 		idx = srcu_read_lock(&kvm->srcu);
 		spin_lock(&kvm->mmu_lock);
 		npages = kvm->arch.n_alloc_mmu_pages -
-			 kvm->arch.n_free_mmu_pages;
+			 kvm_mmu_available_pages(kvm);
 		cache_count += npages;
 		if (!kvm_freed && nr_to_scan > 0 && npages > 0) {
 			freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
diff -puN arch/x86/kvm/mmu.h~abstract_kvm_free_mmu_pages arch/x86/kvm/mmu.h
--- linux-2.6.git/arch/x86/kvm/mmu.h~abstract_kvm_free_mmu_pages	2010-06-09 15:14:28.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/mmu.h	2010-06-09 15:14:28.000000000 -0700
@@ -50,9 +50,14 @@
 
 int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
 
+static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
+{
+	return kvm->arch.n_free_mmu_pages;
+}
+
 static inline void kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
 {
-	if (unlikely(vcpu->kvm->arch.n_free_mmu_pages < KVM_MIN_FREE_MMU_PAGES))
+	if (unlikely(kvm_mmu_available_pages(vcpu->kvm)< KVM_MIN_FREE_MMU_PAGES))
 		__kvm_mmu_free_some_pages(vcpu);
 }
 
_

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH 2/9] rename x86 kvm->arch.n_alloc_mmu_pages
  2010-06-15 13:55 [RFC][PATCH 0/9] rework KVM mmu_shrink() code Dave Hansen
  2010-06-15 13:55 ` [RFC][PATCH 1/9] abstract kvm x86 mmu->n_free_mmu_pages Dave Hansen
@ 2010-06-15 13:55 ` Dave Hansen
  2010-06-15 13:55 ` [RFC][PATCH 3/9] replace x86 kvm n_free_mmu_pages with n_used_mmu_pages Dave Hansen
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2010-06-15 13:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Dave Hansen


Again, I think this is a poor choice of names.  This value truly
means, "the number of pages which _may_ be allocated".  But,
reading the value, "n_alloc_mmu_pages", I'm unable to think of
anything it should mean other than "the number of allocated mmu
pages", which is dead wrong.

It's really the high watermark, so let's give it a name to match:
n_max_mmu_pages.  This change will make the next few patches
much more obvious and easy to read.


Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/include/asm/kvm_host.h |    2 +-
 linux-2.6.git-dave/arch/x86/kvm/mmu.c              |    8 ++++----
 linux-2.6.git-dave/arch/x86/kvm/x86.c              |    2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff -puN arch/x86/include/asm/kvm_host.h~rename-kvm_alloc arch/x86/include/asm/kvm_host.h
--- linux-2.6.git/arch/x86/include/asm/kvm_host.h~rename-kvm_alloc	2010-06-09 15:14:28.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/include/asm/kvm_host.h	2010-06-09 15:14:28.000000000 -0700
@@ -382,7 +382,7 @@ struct kvm_arch {
 
 	unsigned int n_free_mmu_pages;
 	unsigned int n_requested_mmu_pages;
-	unsigned int n_alloc_mmu_pages;
+	unsigned int n_max_mmu_pages;
 	atomic_t invlpg_counter;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	/*
diff -puN arch/x86/kvm/mmu.c~rename-kvm_alloc arch/x86/kvm/mmu.c
--- linux-2.6.git/arch/x86/kvm/mmu.c~rename-kvm_alloc	2010-06-09 15:14:28.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-09 15:14:28.000000000 -0700
@@ -1522,7 +1522,7 @@ void kvm_mmu_change_mmu_pages(struct kvm
 {
 	int used_pages;
 
-	used_pages = kvm->arch.n_alloc_mmu_pages - kvm_mmu_available_pages(kvm);
+	used_pages = kvm->arch.n_max_mmu_pages - kvm_mmu_available_pages(kvm);
 	used_pages = max(0, used_pages);
 
 	/*
@@ -1546,9 +1546,9 @@ void kvm_mmu_change_mmu_pages(struct kvm
 	}
 	else
 		kvm->arch.n_free_mmu_pages += kvm_nr_mmu_pages
-					 - kvm->arch.n_alloc_mmu_pages;
+					 - kvm->arch.n_max_mmu_pages;
 
-	kvm->arch.n_alloc_mmu_pages = kvm_nr_mmu_pages;
+	kvm->arch.n_max_mmu_pages = kvm_nr_mmu_pages;
 }
 
 static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
@@ -2932,7 +2932,7 @@ static int mmu_shrink(int nr_to_scan, gf
 
 		idx = srcu_read_lock(&kvm->srcu);
 		spin_lock(&kvm->mmu_lock);
-		npages = kvm->arch.n_alloc_mmu_pages -
+		npages = kvm->arch.n_max_mmu_pages -
 			 kvm_mmu_available_pages(kvm);
 		cache_count += npages;
 		if (!kvm_freed && nr_to_scan > 0 && npages > 0) {
diff -puN arch/x86/kvm/x86.c~rename-kvm_alloc arch/x86/kvm/x86.c
--- linux-2.6.git/arch/x86/kvm/x86.c~rename-kvm_alloc	2010-06-09 15:14:28.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/x86.c	2010-06-09 15:14:28.000000000 -0700
@@ -2557,7 +2557,7 @@ static int kvm_vm_ioctl_set_nr_mmu_pages
 
 static int kvm_vm_ioctl_get_nr_mmu_pages(struct kvm *kvm)
 {
-	return kvm->arch.n_alloc_mmu_pages;
+	return kvm->arch.n_max_mmu_pages;
 }
 
 gfn_t unalias_gfn_instantiation(struct kvm *kvm, gfn_t gfn)
_

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH 3/9] replace x86 kvm n_free_mmu_pages with n_used_mmu_pages
  2010-06-15 13:55 [RFC][PATCH 0/9] rework KVM mmu_shrink() code Dave Hansen
  2010-06-15 13:55 ` [RFC][PATCH 1/9] abstract kvm x86 mmu->n_free_mmu_pages Dave Hansen
  2010-06-15 13:55 ` [RFC][PATCH 2/9] rename x86 kvm->arch.n_alloc_mmu_pages Dave Hansen
@ 2010-06-15 13:55 ` Dave Hansen
  2010-06-16 14:25   ` Marcelo Tosatti
  2010-06-15 13:55 ` [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value Dave Hansen
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2010-06-15 13:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Dave Hansen


I think doing this makes the code much more readable.  That's
borne out by the fact that this patch removes code.  "used"
also happens to be the number that we need to return back to
the slab code when our shrinker gets called.  Keeping this
value as opposed to free makes the next patch simpler.

So, 'struct kvm' is kzalloc()'d.  'struct kvm_arch' is a
structure member (and not a pointer) of 'struct kvm'.  That
means they start out zeroed.  I _think_ they get initialized
properly by kvm_mmu_change_mmu_pages().  But, that only happens
via kvm ioctls.

I have a suspicion that the values are actually inconsistent
until those ioctls get called; "free" and "alloc" are both zero.
But, the VM can't really get run until these ioctl()s get called
anyway.  There are also some checks for negative "used_pages"
values which confused me.  It might all tie together.

Anyway, another benefit of storing 'used' instead of 'free' is
that the values are consistent from the moment the structure is
allocated: no negative "used" value.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/include/asm/kvm_host.h |    2 -
 linux-2.6.git-dave/arch/x86/kvm/mmu.c              |   29 +++++++--------------
 linux-2.6.git-dave/arch/x86/kvm/mmu.h              |    3 +-
 3 files changed, 13 insertions(+), 21 deletions(-)

diff -puN arch/x86/include/asm/kvm_host.h~replace-free-with-used arch/x86/include/asm/kvm_host.h
--- linux-2.6.git/arch/x86/include/asm/kvm_host.h~replace-free-with-used	2010-06-09 15:14:29.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/include/asm/kvm_host.h	2010-06-09 15:14:29.000000000 -0700
@@ -380,7 +380,7 @@ struct kvm_mem_aliases {
 struct kvm_arch {
 	struct kvm_mem_aliases *aliases;
 
-	unsigned int n_free_mmu_pages;
+	unsigned int n_used_mmu_pages;
 	unsigned int n_requested_mmu_pages;
 	unsigned int n_max_mmu_pages;
 	atomic_t invlpg_counter;
diff -puN arch/x86/kvm/mmu.c~replace-free-with-used arch/x86/kvm/mmu.c
--- linux-2.6.git/arch/x86/kvm/mmu.c~replace-free-with-used	2010-06-09 15:14:29.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-09 15:14:29.000000000 -0700
@@ -898,7 +898,7 @@ static void kvm_mmu_free_page(struct kvm
 	__free_page(virt_to_page(sp->spt));
 	__free_page(virt_to_page(sp->gfns));
 	kfree(sp);
-	++kvm->arch.n_free_mmu_pages;
+	--kvm->arch.n_used_mmu_pages;
 }
 
 static unsigned kvm_page_table_hashfn(gfn_t gfn)
@@ -919,7 +919,7 @@ static struct kvm_mmu_page *kvm_mmu_allo
 	bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
 	sp->multimapped = 0;
 	sp->parent_pte = parent_pte;
-	--vcpu->kvm->arch.n_free_mmu_pages;
+	++vcpu->kvm->arch.n_used_mmu_pages;
 	return sp;
 }
 
@@ -1516,39 +1516,30 @@ static int kvm_mmu_zap_page(struct kvm *
 
 /*
  * Changing the number of mmu pages allocated to the vm
- * Note: if kvm_nr_mmu_pages is too small, you will get dead lock
+ * Note: if goal_nr_mmu_pages is too small, you will get dead lock
  */
-void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages)
+void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int goal_nr_mmu_pages)
 {
-	int used_pages;
-
-	used_pages = kvm->arch.n_max_mmu_pages - kvm_mmu_available_pages(kvm);
-	used_pages = max(0, used_pages);
-
 	/*
 	 * If we set the number of mmu pages to be smaller be than the
 	 * number of actived pages , we must to free some mmu pages before we
 	 * change the value
 	 */
 
-	if (used_pages > kvm_nr_mmu_pages) {
-		while (used_pages > kvm_nr_mmu_pages &&
+	if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) {
+		while (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages &&
 			!list_empty(&kvm->arch.active_mmu_pages)) {
 			struct kvm_mmu_page *page;
 
 			page = container_of(kvm->arch.active_mmu_pages.prev,
 					    struct kvm_mmu_page, link);
-			used_pages -= kvm_mmu_zap_page(kvm, page);
-			used_pages--;
+			kvm->arch.n_used_mmu_pages -= kvm_mmu_zap_page(kvm, page);
+			kvm->arch.n_used_mmu_pages--;
 		}
-		kvm_nr_mmu_pages = used_pages;
-		kvm->arch.n_free_mmu_pages = 0;
+		goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages;
 	}
-	else
-		kvm->arch.n_free_mmu_pages += kvm_nr_mmu_pages
-					 - kvm->arch.n_max_mmu_pages;
 
-	kvm->arch.n_max_mmu_pages = kvm_nr_mmu_pages;
+	kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages;
 }
 
 static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
diff -puN arch/x86/kvm/mmu.h~replace-free-with-used arch/x86/kvm/mmu.h
--- linux-2.6.git/arch/x86/kvm/mmu.h~replace-free-with-used	2010-06-09 15:14:29.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/mmu.h	2010-06-09 15:14:29.000000000 -0700
@@ -52,7 +52,8 @@ int kvm_mmu_get_spte_hierarchy(struct kv
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
-	return kvm->arch.n_free_mmu_pages;
+	return kvm->arch.n_max_mmu_pages -
+		kvm->arch.n_used_mmu_pages;
 }
 
 static inline void kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
_

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value
  2010-06-15 13:55 [RFC][PATCH 0/9] rework KVM mmu_shrink() code Dave Hansen
                   ` (2 preceding siblings ...)
  2010-06-15 13:55 ` [RFC][PATCH 3/9] replace x86 kvm n_free_mmu_pages with n_used_mmu_pages Dave Hansen
@ 2010-06-15 13:55 ` Dave Hansen
  2010-06-16  8:48   ` Avi Kivity
  2010-06-15 13:55 ` [RFC][PATCH 5/9] break out some mmu_shrink() code Dave Hansen
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2010-06-15 13:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Dave Hansen


Note: this is the real meat of the patch set.  It can be applied up
to this point, and everything will probably be improved, at least
a bit.

Of slab shrinkers, the VM code says:

 * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
 * querying the cache size, so a fastpath for that case is appropriate.

and it *means* it.  Look at how it calls the shrinkers:

	nr_before = (*shrinker->shrink)(0, gfp_mask);
	shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);

So, if you do anything stupid in your shrinker, the VM will doubly
punish you.

The mmu_shrink() function takes the global kvm_lock, then acquires
every VM's kvm->mmu_lock in sequence.  If we have 100 VMs, then
we're going to take 101 locks.  We do it twice, so each call takes
202 locks.  If we're under memory pressure, we can have each cpu
trying to do this.  It can get really hairy, and we've seen lock
spinning in mmu_shrink() be the dominant entry in profiles.

This is guaranteed to optimize at least half of those lock
acquisitions away.  It removes the need to take any of the locks
when simply trying to count objects.


Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/kvm/mmu.c |   33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

diff -puN arch/x86/kvm/mmu.c~make_global_used_value arch/x86/kvm/mmu.c
--- linux-2.6.git/arch/x86/kvm/mmu.c~make_global_used_value	2010-06-09 15:14:30.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-09 15:14:30.000000000 -0700
@@ -891,6 +891,19 @@ static int is_empty_shadow_page(u64 *spt
 }
 #endif
 
+/*
+ * This value is the sum of all of the kvm instances's
+ * kvm->arch.n_used_mmu_pages values.  We need a global,
+ * aggregate version in order to make the slab shrinker
+ * faster
+ */
+static unsigned int kvm_total_used_mmu_pages;
+static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
+{
+	kvm->arch.n_used_mmu_pages += nr;
+	kvm_total_used_mmu_pages += nr;
+}
+
 static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
@@ -898,7 +911,7 @@ static void kvm_mmu_free_page(struct kvm
 	__free_page(virt_to_page(sp->spt));
 	__free_page(virt_to_page(sp->gfns));
 	kfree(sp);
-	--kvm->arch.n_used_mmu_pages;
+	kvm_mod_used_mmu_pages(kvm, -1);
 }
 
 static unsigned kvm_page_table_hashfn(gfn_t gfn)
@@ -919,7 +932,7 @@ static struct kvm_mmu_page *kvm_mmu_allo
 	bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
 	sp->multimapped = 0;
 	sp->parent_pte = parent_pte;
-	++vcpu->kvm->arch.n_used_mmu_pages;
+	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
 	return sp;
 }
 
@@ -2914,21 +2927,20 @@ static int mmu_shrink(int nr_to_scan, gf
 {
 	struct kvm *kvm;
 	struct kvm *kvm_freed = NULL;
-	int cache_count = 0;
+
+	if (nr_to_scan == 0)
+		goto out;
 
 	spin_lock(&kvm_lock);
 
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		int npages, idx, freed_pages;
+		int idx, freed_pages;
 
 		idx = srcu_read_lock(&kvm->srcu);
 		spin_lock(&kvm->mmu_lock);
-		npages = kvm->arch.n_max_mmu_pages -
-			 kvm_mmu_available_pages(kvm);
-		cache_count += npages;
-		if (!kvm_freed && nr_to_scan > 0 && npages > 0) {
+		if (!kvm_freed && nr_to_scan > 0 &&
+		    kvm->arch.n_used_mmu_pages > 0) {
 			freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
-			cache_count -= freed_pages;
 			kvm_freed = kvm;
 		}
 		nr_to_scan--;
@@ -2941,7 +2953,8 @@ static int mmu_shrink(int nr_to_scan, gf
 
 	spin_unlock(&kvm_lock);
 
-	return cache_count;
+out:
+	return kvm_total_used_mmu_pages;
 }
 
 static struct shrinker mmu_shrinker = {
_

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH 5/9] break out some mmu_shrink() code
  2010-06-15 13:55 [RFC][PATCH 0/9] rework KVM mmu_shrink() code Dave Hansen
                   ` (3 preceding siblings ...)
  2010-06-15 13:55 ` [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value Dave Hansen
@ 2010-06-15 13:55 ` Dave Hansen
  2010-06-15 13:55 ` [RFC][PATCH 6/9] remove kvm_freed variable Dave Hansen
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2010-06-15 13:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Dave Hansen


This basically takes the loop body and sticks it in
its own function for readability.  Don't pay too much
attention to the use of nr_scanned in here.  It's a bit
wonky but it'll change in a minute anyway.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/kvm/mmu.c |   35 ++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff -puN arch/x86/kvm/mmu.c~optimize_shrinker arch/x86/kvm/mmu.c
--- linux-2.6.git/arch/x86/kvm/mmu.c~optimize_shrinker	2010-06-09 15:14:30.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-09 15:14:30.000000000 -0700
@@ -2923,6 +2923,26 @@ static int kvm_mmu_remove_some_alloc_mmu
 	return kvm_mmu_zap_page(kvm, page) + 1;
 }
 
+static int shrink_kvm_mmu(struct kvm *kvm, int nr_to_scan)
+{
+	int idx, freed_pages;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+	if (kvm->arch.n_used_mmu_pages > 0)
+		freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	/*
+	 * This should optimally return the number of objects (mmu pages)
+	 * that we have scanned.  But, for now, just return the number
+	 * that we were able to free.
+	 */
+	return freed_pages;
+}
+
 static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
 {
 	struct kvm *kvm;
@@ -2934,20 +2954,15 @@ static int mmu_shrink(int nr_to_scan, gf
 	spin_lock(&kvm_lock);
 
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		int idx, freed_pages;
+		if (nr_to_scan <= 0)
+			break;
 
-		idx = srcu_read_lock(&kvm->srcu);
-		spin_lock(&kvm->mmu_lock);
-		if (!kvm_freed && nr_to_scan > 0 &&
-		    kvm->arch.n_used_mmu_pages > 0) {
-			freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
+		shrink_kvm_mmu(kvm, nr_to_scan);
+		if (!kvm_freed)
 			kvm_freed = kvm;
-		}
 		nr_to_scan--;
-
-		spin_unlock(&kvm->mmu_lock);
-		srcu_read_unlock(&kvm->srcu, idx);
 	}
+
 	if (kvm_freed)
 		list_move_tail(&kvm_freed->vm_list, &vm_list);
 
_

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH 6/9] remove kvm_freed variable
  2010-06-15 13:55 [RFC][PATCH 0/9] rework KVM mmu_shrink() code Dave Hansen
                   ` (4 preceding siblings ...)
  2010-06-15 13:55 ` [RFC][PATCH 5/9] break out some mmu_shrink() code Dave Hansen
@ 2010-06-15 13:55 ` Dave Hansen
  2010-06-15 13:55 ` [RFC][PATCH 7/9] make kvm_get_kvm() more robust Dave Hansen
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2010-06-15 13:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Dave Hansen


In mmu_shrink(), we walk the vm_list to try and free some
of the mmu objects.  If we did this over and over again, we
would unfairly bias the shrinking at the beginning of the
list.  So, if a KVM instance gets successfully shrunk, then
we move it to the end of the vm_list.  This keeps things
fair.

With the global mmu object counter, we no longer need to
look at each kvm object on the vm_list during each
mmu_shrink() operation.  So, move the list manipulation
to inside of the loop, and get rid of the 'kvm_freed'
variable since we no longer need to store it.

This also removes the manipulation of the 'nr_to_scan'
variable.  Its use in here was questionable, especially
since it was being decremented even in cases where no
scanning was taking place: when building the counter.
Interestingly enough, removing it here does not affect
the reclaim behavior at all.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/kvm/mmu.c |   15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff -puN arch/x86/kvm/mmu.c~optimize_shrinker-2 arch/x86/kvm/mmu.c
--- linux-2.6.git/arch/x86/kvm/mmu.c~optimize_shrinker-2	2010-06-14 10:24:48.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-14 10:24:48.000000000 -0700
@@ -2951,7 +2951,6 @@ static int shrink_kvm_mmu(struct kvm *kv
 static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
 {
 	struct kvm *kvm;
-	struct kvm *kvm_freed = NULL;
 
 	if (nr_to_scan == 0)
 		goto out;
@@ -2959,18 +2958,14 @@ static int mmu_shrink(int nr_to_scan, gf
 	spin_lock(&kvm_lock);
 
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		if (nr_to_scan <= 0)
-			break;
+		int freed = shrink_kvm_mmu(kvm, nr_to_scan);
+		if (!freed)
+			continue;
 
-		shrink_kvm_mmu(kvm, nr_to_scan);
-		if (!kvm_freed)
-			kvm_freed = kvm;
-		nr_to_scan--;
+		list_move_tail(&kvm->vm_list, &vm_list);
+		break;
 	}
 
-	if (kvm_freed)
-		list_move_tail(&kvm_freed->vm_list, &vm_list);
-
 	spin_unlock(&kvm_lock);
 
 out:
_

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH 7/9] make kvm_get_kvm() more robust
  2010-06-15 13:55 [RFC][PATCH 0/9] rework KVM mmu_shrink() code Dave Hansen
                   ` (5 preceding siblings ...)
  2010-06-15 13:55 ` [RFC][PATCH 6/9] remove kvm_freed variable Dave Hansen
@ 2010-06-15 13:55 ` Dave Hansen
  2010-06-15 13:55 ` [RFC][PATCH 8/9] reduce kvm_lock hold times in mmu_shrink() Dave Hansen
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2010-06-15 13:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Dave Hansen


The comment tells most of the story here.  This patch guarantees
that once a user decrements kvm->users_count to 0 that no one
will increment it again.

We'll need this in a moment because we are going to use
kvm->users_count as a more generic refcount.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/include/linux/kvm_host.h |    2 -
 linux-2.6.git-dave/virt/kvm/kvm_main.c      |   32 +++++++++++++++++++++++++---
 2 files changed, 30 insertions(+), 4 deletions(-)

diff -puN include/linux/kvm_host.h~make-kvm_get_kvm-more-robust include/linux/kvm_host.h
--- linux-2.6.git/include/linux/kvm_host.h~make-kvm_get_kvm-more-robust	2010-06-11 08:39:16.000000000 -0700
+++ linux-2.6.git-dave/include/linux/kvm_host.h	2010-06-11 08:39:16.000000000 -0700
@@ -247,7 +247,7 @@ int kvm_init(void *opaque, unsigned vcpu
 		  struct module *module);
 void kvm_exit(void);
 
-void kvm_get_kvm(struct kvm *kvm);
+int kvm_get_kvm(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);
 
 static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
diff -puN virt/kvm/kvm_main.c~make-kvm_get_kvm-more-robust virt/kvm/kvm_main.c
--- linux-2.6.git/virt/kvm/kvm_main.c~make-kvm_get_kvm-more-robust	2010-06-11 08:39:16.000000000 -0700
+++ linux-2.6.git-dave/virt/kvm/kvm_main.c	2010-06-11 08:39:16.000000000 -0700
@@ -496,9 +496,30 @@ static void kvm_destroy_vm(struct kvm *k
 	mmdrop(mm);
 }
 
-void kvm_get_kvm(struct kvm *kvm)
+/*
+ * Once the counter goes to 0, we destroy the
+ * kvm object.  Do not allow additional refs
+ * to be obtained once this occurs.
+ *
+ * Any calls which are done via the kvm fd
+ * could use atomic_inc().  That is because
+ * ->users_count is set to 1 when the kvm fd
+ * is created, and stays at least 1 while
+ * the fd exists.
+ *
+ * But, those calls are currently rare, so do
+ * this (more expensive) atomic_add_unless()
+ * to keep the number of functions down.
+ *
+ * Returns 0 if the reference was obtained
+ * successfully.
+ */
+int kvm_get_kvm(struct kvm *kvm)
 {
-	atomic_inc(&kvm->users_count);
+	int did_add = atomic_add_unless(&kvm->users_count, 1, 0);
+	if (did_add)
+		return 0;
+	return -EBUSY;
 }
 EXPORT_SYMBOL_GPL(kvm_get_kvm);
 
@@ -1332,7 +1353,12 @@ static int kvm_vm_ioctl_create_vcpu(stru
 	BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
 
 	/* Now it's all set up, let userspace reach it */
-	kvm_get_kvm(kvm);
+	r = kvm_get_kvm(kvm);
+	/*
+	 * Getting called via the kvm fd _should_ guarantee
+	 * that we can always get a reference.
+	 */
+	WARN_ON(r);
 	r = create_vcpu_fd(vcpu);
 	if (r < 0) {
 		kvm_put_kvm(kvm);
diff -puN arch/x86/kvm/mmu.c~make-kvm_get_kvm-more-robust arch/x86/kvm/mmu.c
_

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH 8/9] reduce kvm_lock hold times in mmu_shrink()
  2010-06-15 13:55 [RFC][PATCH 0/9] rework KVM mmu_shrink() code Dave Hansen
                   ` (6 preceding siblings ...)
  2010-06-15 13:55 ` [RFC][PATCH 7/9] make kvm_get_kvm() more robust Dave Hansen
@ 2010-06-15 13:55 ` Dave Hansen
  2010-06-16  8:54   ` Avi Kivity
  2010-06-15 13:55 ` [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive Dave Hansen
  2010-06-16  8:38 ` [RFC][PATCH 0/9] rework KVM mmu_shrink() code Avi Kivity
  9 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2010-06-15 13:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Dave Hansen


mmu_shrink() is effectively single-threaded since the global
kvm_lock is held over the entire function.

I believe its only use here is for synchronization of the
vm_list.  Instead of using the kvm_lock to ensure
consistency of the list, we instead obtain a kvm_get_kvm()
reference.  This keeps the kvm object on the vm_list while
we shrink it.

Since we don't need the lock to maintain the list any more,
we can drop it.  We'll reacquire it if we need to get another
object off.

This leads to a larger number of atomic ops, but reduces
lock hold times: the typical latency vs. throughput debate.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/kvm/mmu.c |   48 ++++++++++++++++++++++++++--------
 linux-2.6.git-dave/kernel/profile.c   |    2 +
 2 files changed, 40 insertions(+), 10 deletions(-)

diff -puN arch/x86/kvm/mmu.c~optimize_shrinker-3 arch/x86/kvm/mmu.c
--- linux-2.6.git/arch/x86/kvm/mmu.c~optimize_shrinker-3	2010-06-11 08:39:17.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-11 08:39:17.000000000 -0700
@@ -2930,7 +2930,8 @@ static int kvm_mmu_remove_some_alloc_mmu
 
 static int shrink_kvm_mmu(struct kvm *kvm, int nr_to_scan)
 {
-	int idx, freed_pages;
+	int idx;
+	int freed_pages = 0;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
@@ -2950,23 +2951,50 @@ static int shrink_kvm_mmu(struct kvm *kv
 
 static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
 {
+	int err;
+	int freed;
 	struct kvm *kvm;
 
 	if (nr_to_scan == 0)
 		goto out;
 
+retry:
+	nr_to_scan--;
 	spin_lock(&kvm_lock);
-
-	list_for_each_entry(kvm, &vm_list, vm_list) {
-		int freed = shrink_kvm_mmu(kvm, nr_to_scan);
-		if (!freed)
-			continue;
-
-		list_move_tail(&kvm->vm_list, &vm_list);
-		break;
+	if (list_empty(&vm_list)) {
+		spin_unlock(&kvm_lock);
+		goto out;
 	}
-
+	kvm = list_first_entry(&vm_list, struct kvm, vm_list);
+	/*
+	 * With a reference to the kvm object, it can not go away
+	 * nor get removed from the vm_list.
+	 */
+	err = kvm_get_kvm(kvm);
+	/* Did someone race and start destroying the kvm object? */
+	if (err) {
+		spin_unlock(&kvm_lock);
+		goto retry;
+	}
+	/*
+	 * Stick this kvm on the end of the list so the next
+	 * iteration will shrink a different one.  Do this here
+	 * so that we normally don't have to reacquire the lock.
+	 */
+	list_move_tail(&kvm->vm_list, &vm_list);
+	/*
+	 * Which lets us release the global lock, holding it for
+	 * the minimal amount of time possible, and ensuring that
+	 * we don't hold it during the (presumably slow) shrink
+	 * operation itself.
+	 */
 	spin_unlock(&kvm_lock);
+	freed = shrink_kvm_mmu(kvm, nr_to_scan);
+
+	kvm_put_kvm(kvm);
+
+	if (!freed && nr_to_scan > 0)
+		goto retry;
 
 out:
 	return kvm_total_used_mmu_pages;
diff -puN virt/kvm/kvm_main.c~optimize_shrinker-3 virt/kvm/kvm_main.c
diff -puN kernel/profile.c~optimize_shrinker-3 kernel/profile.c
--- linux-2.6.git/kernel/profile.c~optimize_shrinker-3	2010-06-11 09:09:43.000000000 -0700
+++ linux-2.6.git-dave/kernel/profile.c	2010-06-11 09:12:24.000000000 -0700
@@ -314,6 +314,8 @@ void profile_hits(int type, void *__pc, 
 	if (prof_on != type || !prof_buffer)
 		return;
 	pc = min((pc - (unsigned long)_stext) >> prof_shift, prof_len - 1);
+	if ((pc == prof_len - 1) && printk_ratelimit())
+		printk("profile_hits(%d, %p, %d)\n", type, __pc, nr_hits);
 	i = primary = (pc & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT;
 	secondary = (~(pc << 1) & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT;
 	cpu = get_cpu();
_

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive
  2010-06-15 13:55 [RFC][PATCH 0/9] rework KVM mmu_shrink() code Dave Hansen
                   ` (7 preceding siblings ...)
  2010-06-15 13:55 ` [RFC][PATCH 8/9] reduce kvm_lock hold times in mmu_shrink() Dave Hansen
@ 2010-06-15 13:55 ` Dave Hansen
  2010-06-16  9:24   ` Avi Kivity
  2010-06-16  8:38 ` [RFC][PATCH 0/9] rework KVM mmu_shrink() code Avi Kivity
  9 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2010-06-15 13:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Dave Hansen


In a previous patch, we removed the 'nr_to_scan' tracking.
It was not being used to track the number of objects
scanned, so we stopped using it entirely.  Here, we
start using it again.

The theory here is simple; if we already have the refcount
and the kvm->mmu_lock, then we should do as much work as
possible under the lock.  The downside is that we're less
fair about the KVM instances from which we reclaim.  Each
call to mmu_shrink() will tend to "pick on" one instance,
after which it gets moved to the end of the list and left
alone for a while.

If mmu_shrink() has already done a significant amount of
scanning, the use of 'nr_to_scan' inside shrink_kvm_mmu()
will also ensure that we do not over-reclaim when we have
already done a lot of work in this call.

In the end, this patch defines a "scan" as:
1. An attempt to acquire a refcount on a 'struct kvm'
2. freeing a kvm mmu page

This would probably be most ideal if we can expose some
of the work done by kvm_mmu_remove_some_alloc_mmu_pages()
as also counting as scanning, but I think we have churned
enough for the moment.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/kvm/mmu.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff -puN arch/x86/kvm/mmu.c~make-shrinker-more-aggressive arch/x86/kvm/mmu.c
--- linux-2.6.git/arch/x86/kvm/mmu.c~make-shrinker-more-aggressive	2010-06-14 11:30:44.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-14 11:38:04.000000000 -0700
@@ -2935,8 +2935,10 @@ static int shrink_kvm_mmu(struct kvm *kv
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
-	if (kvm->arch.n_used_mmu_pages > 0)
-		freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
+	while (nr_to_scan > 0 && kvm->arch.n_used_mmu_pages > 0) {
+		freed_pages += kvm_mmu_remove_some_alloc_mmu_pages(kvm);
+		nr_to_scan--;
+	}
 
 	spin_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, idx);
@@ -2952,7 +2954,6 @@ static int shrink_kvm_mmu(struct kvm *kv
 static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
 {
 	int err;
-	int freed;
 	struct kvm *kvm;
 
 	if (nr_to_scan == 0)
@@ -2989,11 +2990,11 @@ retry:
 	 * operation itself.
 	 */
 	spin_unlock(&kvm_lock);
-	freed = shrink_kvm_mmu(kvm, nr_to_scan);
+	nr_to_scan -= shrink_kvm_mmu(kvm, nr_to_scan);
 
 	kvm_put_kvm(kvm);
 
-	if (!freed && nr_to_scan > 0)
+	if (nr_to_scan > 0)
 		goto retry;
 
 out:
diff -puN arch/x86/kvm/x86.c~make-shrinker-more-aggressive arch/x86/kvm/x86.c
diff -puN arch/x86/include/asm/kvm_host.h~make-shrinker-more-aggressive arch/x86/include/asm/kvm_host.h
_

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 0/9] rework KVM mmu_shrink() code
  2010-06-15 13:55 [RFC][PATCH 0/9] rework KVM mmu_shrink() code Dave Hansen
                   ` (8 preceding siblings ...)
  2010-06-15 13:55 ` [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive Dave Hansen
@ 2010-06-16  8:38 ` Avi Kivity
  2010-06-16 15:03   ` Dave Hansen
  9 siblings, 1 reply; 31+ messages in thread
From: Avi Kivity @ 2010-06-16  8:38 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm

On 06/15/2010 04:55 PM, Dave Hansen wrote:
> This is a big RFC for the moment.  These need a bunch more
> runtime testing.
>
> --
>
> We've seen contention in the mmu_shrink() function.

First of all, that's surprising.  I tried to configure the shrinker so 
it would stay away from kvm unless memory was really tight.  The reason 
is that kvm mmu pages can cost as much as 1-2 ms of cpu time to build, 
perhaps even more, so we shouldn't drop them lightly.

It's certainly a neglected area that needs attention, though.

> This patch
> set reworks it to hopefully be more scalable to large numbers
> of CPUs, as well as large numbers of running VMs.
>
> The patches are ordered with increasing invasiveness.
>
> These seem to boot and run fine.  I'm running about 40 VMs at
once, while doing "echo 3 > /proc/sys/vm/drop_caches", and
> killing/restarting VMs constantly.
>    

Will drop_caches actually shrink the kvm caches too?  If so we probably 
need to add that to autotest since it's a really good stress test for 
the mmu.

> Seems to be relatively stable, and seems to keep the numbers
> of kvm_mmu_page_header objects down.
>    

That's not necessarily a good thing; those things are expensive to 
recreate.  Of course, when we do need to reclaim them, that should be 
efficient.

We also do a very bad job of selecting which page to reclaim.  We need 
to start using the accessed bit on sptes that point to shadow page 
tables, and then look those up and reclaim unreferenced pages sooner.  
With shadow paging there can be tons of unsync pages that are basically 
unused and can be reclaimed at no cost to future runtime.
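
For illustration, a rough sketch of the accessed-bit test described
above, as it might sit in arch/x86/kvm/mmu.c.  It assumes shadow
paging with a nonzero shadow_accessed_mask and ignores the
multimapped-parent case, so treat it as a direction rather than a
tested patch:

	static int sp_referenced_and_clear(struct kvm_mmu_page *sp)
	{
		u64 *parent = sp->multimapped ? NULL : sp->parent_pte;

		if (parent && (*parent & shadow_accessed_mask)) {
			/* clear it so the next aging pass sees fresh state */
			clear_bit(ffs(shadow_accessed_mask) - 1,
				  (unsigned long *)parent);
			return 1;	/* recently used, keep it for now */
		}
		return 0;		/* unreferenced, cheap to reclaim */
	}

An aging pass over kvm->arch.active_mmu_pages could then zap pages for
which this returns 0 before touching anything recently referenced.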

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 1/9] abstract kvm x86 mmu->n_free_mmu_pages
  2010-06-15 13:55 ` [RFC][PATCH 1/9] abstract kvm x86 mmu->n_free_mmu_pages Dave Hansen
@ 2010-06-16  8:40   ` Avi Kivity
  0 siblings, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2010-06-16  8:40 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm

On 06/15/2010 04:55 PM, Dave Hansen wrote:
> First of all, I think "free" is a poor name for this value.  In
> this context, it means, "the number of mmu pages which this kvm
> instance should be able to allocate."  To me, "free" implies
> much more strongly that the objects are there and ready for use.  I think
> "available" is a much better description, especially when you
> see how it is calculated.
>
>    

Agreed.  Note that if reclaim is improved, we can drop it completely and 
let the kernel trim the cache when it grows too large.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value
  2010-06-15 13:55 ` [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value Dave Hansen
@ 2010-06-16  8:48   ` Avi Kivity
  2010-06-16 15:06     ` Dave Hansen
  2010-06-16 16:55     ` Dave Hansen
  0 siblings, 2 replies; 31+ messages in thread
From: Avi Kivity @ 2010-06-16  8:48 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm

On 06/15/2010 04:55 PM, Dave Hansen wrote:
> Note: this is the real meat of the patch set.  It can be applied up
> to this point, and everything will probably be improved, at least
> a bit.
>
> Of slab shrinkers, the VM code says:
>
>   * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
>   * querying the cache size, so a fastpath for that case is appropriate.
>
> and it *means* it.  Look at how it calls the shrinkers:
>
> 	nr_before = (*shrinker->shrink)(0, gfp_mask);
> 	shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);
>
> So, if you do anything stupid in your shrinker, the VM will doubly
> punish you.
>    

Ouch.

> The mmu_shrink() function takes the global kvm_lock, then acquires
> every VM's kvm->mmu_lock in sequence.  If we have 100 VMs, then
> we're going to take 101 locks.  We do it twice, so each call takes
> 202 locks.  If we're under memory pressure, we can have each cpu
> trying to do this.  It can get really hairy, and we've seen lock
> spinning in mmu_shrink() be the dominant entry in profiles.
>
> This is guaranteed to optimize at least half of those lock
> acquisitions away.  It removes the need to take any of the locks
> when simply trying to count objects.
>
>
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
>
>   linux-2.6.git-dave/arch/x86/kvm/mmu.c |   33 +++++++++++++++++++++++----------
>   1 file changed, 23 insertions(+), 10 deletions(-)
>
> diff -puN arch/x86/kvm/mmu.c~make_global_used_value arch/x86/kvm/mmu.c
> --- linux-2.6.git/arch/x86/kvm/mmu.c~make_global_used_value	2010-06-09 15:14:30.000000000 -0700
> +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-09 15:14:30.000000000 -0700
> @@ -891,6 +891,19 @@ static int is_empty_shadow_page(u64 *spt
>   }
>   #endif
>
> +/*
> + * This value is the sum of all of the kvm instances's
> + * kvm->arch.n_used_mmu_pages values.  We need a global,
> + * aggregate version in order to make the slab shrinker
> + * faster
> + */
> +static unsigned int kvm_total_used_mmu_pages;
>    

The variable needs to be at the top of the file.

> +static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> +{
> +	kvm->arch.n_used_mmu_pages += nr;
> +	kvm_total_used_mmu_pages += nr;
>    

Needs an atomic operation, since there's no global lock here.  To avoid 
bouncing this cacheline around, make the variable percpu and make 
readers take a sum across all cpus.  Side benefit is that you no longer 
need an atomic but a local_t, which is considerably cheaper.

> +}
> +
>    
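
A minimal sketch of that per-cpu scheme, with illustrative names (not
part of this series): writers only touch their own cpu's local_t, and
the shrinker's counting path sums across cpus, accepting a slightly
stale result:

	#include <linux/percpu.h>
	#include <linux/cpumask.h>
	#include <asm/local.h>

	static DEFINE_PER_CPU(local_t, used_mmu_pages_pcpu);

	static void mod_total_used_mmu_pages(int nr)
	{
		local_add(nr, &get_cpu_var(used_mmu_pages_pcpu));
		put_cpu_var(used_mmu_pages_pcpu);
	}

	static long read_total_used_mmu_pages(void)
	{
		long sum = 0;
		int cpu;

		/* may be transiently off while updates are in flight */
		for_each_possible_cpu(cpu)
			sum += local_read(&per_cpu(used_mmu_pages_pcpu, cpu));
		return sum > 0 ? sum : 0;
	}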


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 8/9] reduce kvm_lock hold times in mmu_shrink()
  2010-06-15 13:55 ` [RFC][PATCH 8/9] reduce kvm_lock hold times in mmu_shrink() Dave Hansen
@ 2010-06-16  8:54   ` Avi Kivity
  0 siblings, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2010-06-16  8:54 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm

On 06/15/2010 04:55 PM, Dave Hansen wrote:
> mmu_shrink() is effectively single-threaded since the global
> kvm_lock is held over the entire function.
>
> I believe its only use here is for synchronization of the
> vm_list.  Instead of using the kvm_lock to ensure
> consistency of the list, we instead obtain a kvm_get_kvm()
> reference.  This keeps the kvm object on the vm_list while
> we shrink it.
>
> Since we don't need the lock to maintain the list any more,
> we can drop it.  We'll reacquire it if we need to get another
> object off.
>
> This leads to a larger number of atomic ops, but reduces
> lock hold times: the typical latency vs. throughput debate.
>
>    

<snip content>

> diff -puN kernel/profile.c~optimize_shrinker-3 kernel/profile.c
> --- linux-2.6.git/kernel/profile.c~optimize_shrinker-3	2010-06-11 09:09:43.000000000 -0700
> +++ linux-2.6.git-dave/kernel/profile.c	2010-06-11 09:12:24.000000000 -0700
> @@ -314,6 +314,8 @@ void profile_hits(int type, void *__pc,
>   	if (prof_on != type || !prof_buffer)
>   		return;
>   	pc = min((pc - (unsigned long)_stext) >> prof_shift, prof_len - 1);
> +	if ((pc == prof_len - 1) && printk_ratelimit())
> +		printk("profile_hits(%d, %p, %d)\n", type, __pc, nr_hits);
>   	i = primary = (pc & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT;
>   	secondary = (~(pc << 1) & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT;
>   	cpu = get_cpu();
> _
>    

Unrelated gunk...

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive
  2010-06-15 13:55 ` [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive Dave Hansen
@ 2010-06-16  9:24   ` Avi Kivity
  2010-06-16 15:25     ` Dave Hansen
  0 siblings, 1 reply; 31+ messages in thread
From: Avi Kivity @ 2010-06-16  9:24 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm

On 06/15/2010 04:55 PM, Dave Hansen wrote:
> In a previous patch, we removed the 'nr_to_scan' tracking.
> It was not being used to track the number of objects
> scanned, so we stopped using it entirely.  Here, we
> start using it again.
>
> The theory here is simple; if we already have the refcount
> and the kvm->mmu_lock, then we should do as much work as
> possible under the lock.  The downside is that we're less
> fair about the KVM instances from which we reclaim.  Each
> call to mmu_shrink() will tend to "pick on" one instance,
> after which it gets moved to the end of the list and left
> alone for a while.
>    

That also increases the latency hit, as well as a potential fault storm, 
on that instance.  Spreading out is less efficient, but smoother.

> If mmu_shrink() has already done a significant amount of
> scanning, the use of 'nr_to_scan' inside shrink_kvm_mmu()
> will also ensure that we do not over-reclaim when we have
> already done a lot of work in this call.
>
> In the end, this patch defines a "scan" as:
> 1. An attempt to acquire a refcount on a 'struct kvm'
> 2. freeing a kvm mmu page
>
> This would probably be most ideal if we can expose some
> of the work done by kvm_mmu_remove_some_alloc_mmu_pages()
> as also counting as scanning, but I think we have churned
> enough for the moment.
>    

It usually removes one page.

> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
>
>   linux-2.6.git-dave/arch/x86/kvm/mmu.c |   11 ++++++-----
>   1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff -puN arch/x86/kvm/mmu.c~make-shrinker-more-aggressive arch/x86/kvm/mmu.c
> --- linux-2.6.git/arch/x86/kvm/mmu.c~make-shrinker-more-aggressive	2010-06-14 11:30:44.000000000 -0700
> +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-14 11:38:04.000000000 -0700
> @@ -2935,8 +2935,10 @@ static int shrink_kvm_mmu(struct kvm *kv
>
>   	idx = srcu_read_lock(&kvm->srcu);
>   	spin_lock(&kvm->mmu_lock);
> -	if (kvm->arch.n_used_mmu_pages > 0)
> -		freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
> +	while (nr_to_scan > 0 && kvm->arch.n_used_mmu_pages > 0) {
> +		freed_pages += kvm_mmu_remove_some_alloc_mmu_pages(kvm);
> +		nr_to_scan--;
> +	}
>    

What tree are you patching?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 3/9] replace x86 kvm n_free_mmu_pages with n_used_mmu_pages
  2010-06-15 13:55 ` [RFC][PATCH 3/9] replace x86 kvm n_free_mmu_pages with n_used_mmu_pages Dave Hansen
@ 2010-06-16 14:25   ` Marcelo Tosatti
  2010-06-16 15:42     ` Dave Hansen
  0 siblings, 1 reply; 31+ messages in thread
From: Marcelo Tosatti @ 2010-06-16 14:25 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm

On Tue, Jun 15, 2010 at 06:55:22AM -0700, Dave Hansen wrote:
> 
> I think doing this makes the code much more readable.  That's
> borne out by the fact that this patch removes code.  "used"
> also happens to be the number that we need to return back to
> the slab code when our shrinker gets called.  Keeping this
> value as opposed to free makes the next patch simpler.
> 
> So, 'struct kvm' is kzalloc()'d.  'struct kvm_arch' is a
> structure member (and not a pointer) of 'struct kvm'.  That
> means they start out zeroed.  I _think_ they get initialized
> properly by kvm_mmu_change_mmu_pages().  But, that only happens
> via kvm ioctls.
> 
> I have a suspicion that the values are actually inconsistent
> until those ioctls get called; "free" and "alloc" are both zero.
> But, the VM can't really get run until these ioctl()s get called
> anyway.  There are also some checks for negative "used_pages"
> values which confused me.  It might all tie together.
> 
> Anyway, another benefit of storing 'used' instead of 'free' is
> that the values are consistent from the moment the structure is
> allocated: no negative "used" value.
> 
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
> 
>  linux-2.6.git-dave/arch/x86/include/asm/kvm_host.h |    2 -
>  linux-2.6.git-dave/arch/x86/kvm/mmu.c              |   29 +++++++--------------
>  linux-2.6.git-dave/arch/x86/kvm/mmu.h              |    3 +-
>  3 files changed, 13 insertions(+), 21 deletions(-)
> 
> diff -puN arch/x86/include/asm/kvm_host.h~replace-free-with-used arch/x86/include/asm/kvm_host.h
> --- linux-2.6.git/arch/x86/include/asm/kvm_host.h~replace-free-with-used	2010-06-09 15:14:29.000000000 -0700
> +++ linux-2.6.git-dave/arch/x86/include/asm/kvm_host.h	2010-06-09 15:14:29.000000000 -0700
> @@ -380,7 +380,7 @@ struct kvm_mem_aliases {
>  struct kvm_arch {
>  	struct kvm_mem_aliases *aliases;
>  
> -	unsigned int n_free_mmu_pages;
> +	unsigned int n_used_mmu_pages;
>  	unsigned int n_requested_mmu_pages;
>  	unsigned int n_max_mmu_pages;
>  	atomic_t invlpg_counter;
> diff -puN arch/x86/kvm/mmu.c~replace-free-with-used arch/x86/kvm/mmu.c
> --- linux-2.6.git/arch/x86/kvm/mmu.c~replace-free-with-used	2010-06-09 15:14:29.000000000 -0700
> +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-09 15:14:29.000000000 -0700
> @@ -898,7 +898,7 @@ static void kvm_mmu_free_page(struct kvm
>  	__free_page(virt_to_page(sp->spt));
>  	__free_page(virt_to_page(sp->gfns));
>  	kfree(sp);
> -	++kvm->arch.n_free_mmu_pages;
> +	--kvm->arch.n_used_mmu_pages;
>  }
>  
>  static unsigned kvm_page_table_hashfn(gfn_t gfn)
> @@ -919,7 +919,7 @@ static struct kvm_mmu_page *kvm_mmu_allo
>  	bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
>  	sp->multimapped = 0;
>  	sp->parent_pte = parent_pte;
> -	--vcpu->kvm->arch.n_free_mmu_pages;
> +	++vcpu->kvm->arch.n_used_mmu_pages;
>  	return sp;
>  }
>  
> @@ -1516,39 +1516,30 @@ static int kvm_mmu_zap_page(struct kvm *
>  
>  /*
>   * Changing the number of mmu pages allocated to the vm
> - * Note: if kvm_nr_mmu_pages is too small, you will get dead lock
> + * Note: if goal_nr_mmu_pages is too small, you will get dead lock
>   */
> -void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages)
> +void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int goal_nr_mmu_pages)
>  {
> -	int used_pages;
> -
> -	used_pages = kvm->arch.n_max_mmu_pages - kvm_mmu_available_pages(kvm);
> -	used_pages = max(0, used_pages);
> -
>  	/*
>  	 * If we set the number of mmu pages to be smaller be than the
>  	 * number of actived pages , we must to free some mmu pages before we
>  	 * change the value
>  	 */
>  
> -	if (used_pages > kvm_nr_mmu_pages) {
> -		while (used_pages > kvm_nr_mmu_pages &&
> +	if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) {
> +		while (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages &&
>  			!list_empty(&kvm->arch.active_mmu_pages)) {
>  			struct kvm_mmu_page *page;
>  
>  			page = container_of(kvm->arch.active_mmu_pages.prev,
>  					    struct kvm_mmu_page, link);
> -			used_pages -= kvm_mmu_zap_page(kvm, page);
> -			used_pages--;
> +			kvm->arch.n_used_mmu_pages -= kvm_mmu_zap_page(kvm, page);
> +			kvm->arch.n_used_mmu_pages--;

kvm->arch.n_used_mmu_pages is deaccounted for in kvm_mmu_zap_page -> kvm_mmu_free_page
already.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 0/9] rework KVM mmu_shrink() code
  2010-06-16  8:38 ` [RFC][PATCH 0/9] rework KVM mmu_shrink() code Avi Kivity
@ 2010-06-16 15:03   ` Dave Hansen
  2010-06-17  8:40     ` Avi Kivity
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2010-06-16 15:03 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-kernel, kvm

On Wed, 2010-06-16 at 11:38 +0300, Avi Kivity wrote:
> On 06/15/2010 04:55 PM, Dave Hansen wrote:
> > These seem to boot and run fine.  I'm running about 40 VMs at
> > once, while doing "echo 3 > /proc/sys/vm/drop_caches", and
> > killing/restarting VMs constantly.
> >    
> 
> Will drop_caches actually shrink the kvm caches too?  If so we probably 
> need to add that to autotest since it's a really good stress test for 
> the mmu.

I'm completely sure.  I crashed my machines several times this way
during testing.

> > Seems to be relatively stable, and seems to keep the numbers
> > of kvm_mmu_page_header objects down.
> >    
> 
> That's not necessarily a good thing; those things are expensive to 
> recreate.  Of course, when we do need to reclaim them, that should be 
> efficient.

Oh, I meant that I didn't break the shrinker completely.

> We also do a very bad job of selecting which page to reclaim.  We need 
> to start using the accessed bit on sptes that point to shadow page 
> tables, and then look those up and reclaim unreferenced pages sooner.  
> With shadow paging there can be tons of unsync pages that are basically 
> unused and can be reclaimed at no cost to future runtime.

Sounds like a good next step.

-- Dave


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value
  2010-06-16  8:48   ` Avi Kivity
@ 2010-06-16 15:06     ` Dave Hansen
  2010-06-17  8:43       ` Avi Kivity
  2010-06-16 16:55     ` Dave Hansen
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2010-06-16 15:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-kernel, kvm

On Wed, 2010-06-16 at 11:48 +0300, Avi Kivity wrote:
> On 06/15/2010 04:55 PM, Dave Hansen wrote:
> > +/*
> > + * This value is the sum of all of the kvm instances's
> > + * kvm->arch.n_used_mmu_pages values.  We need a global,
> > + * aggregate version in order to make the slab shrinker
> > + * faster
> > + */
> > +static unsigned int kvm_total_used_mmu_pages;
> >    
> 
> The variable needs to be at the top of the file.

Gotcha, will do.

> > +static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> > +{
> > +	kvm->arch.n_used_mmu_pages += nr;
> > +	kvm_total_used_mmu_pages += nr;
> >    
> 
> Needs an atomic operation, since there's no global lock here.  To avoid 
> bouncing this cacheline around, make the variable percpu and make 
> readers take a sum across all cpus.  Side benefit is that you no longer 
> need an atomic but a local_t, which is considerably cheaper.

That's a good point.  All of the modifications are done under locks, but
the fast path isn't any more.  I'll fix it up.

-- Dave


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive
  2010-06-16  9:24   ` Avi Kivity
@ 2010-06-16 15:25     ` Dave Hansen
  2010-06-17  8:37       ` Avi Kivity
  2010-06-18 15:49       ` Dave Hansen
  0 siblings, 2 replies; 31+ messages in thread
From: Dave Hansen @ 2010-06-16 15:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-kernel, kvm

On Wed, 2010-06-16 at 12:24 +0300, Avi Kivity wrote:
> On 06/15/2010 04:55 PM, Dave Hansen wrote:
> > In a previous patch, we removed the 'nr_to_scan' tracking.
> > It was not being used to track the number of objects
> > scanned, so we stopped using it entirely.  Here, we
> > start using it again.
> >
> > The theory here is simple; if we already have the refcount
> > and the kvm->mmu_lock, then we should do as much work as
> > possible under the lock.  The downside is that we're less
> > fair about the KVM instances from which we reclaim.  Each
> > call to mmu_shrink() will tend to "pick on" one instance,
> > after which it gets moved to the end of the list and left
> > alone for a while.
> >    
> 
> That also increases the latency hit, as well as a potential fault storm, 
> on that instance.  Spreading out is less efficient, but smoother.

This is probably something that we need to go back and actually measure.
My suspicion is that, when memory fills up and this shrinker is getting
called a lot, it will be naturally fair.  That list gets shuffled around
enough, and mmu_shrink() called often enough that no VMs get picked on
too unfairly.

I'll go back and see if I can quantify this a bit, though.

I do worry about the case where you really have only a single CPU going
into reclaim and a very small number of VMs on the system.  You're
basically guaranteeing that you'll throw away nr_to_scan of the poor
victim VM's mmu pages, with no penalty on the other guy.

> > If mmu_shrink() has already done a significant amount of
> > scanning, the use of 'nr_to_scan' inside shrink_kvm_mmu()
> > will also ensure that we do not over-reclaim when we have
> > already done a lot of work in this call.
> >
> > In the end, this patch defines a "scan" as:
> > 1. An attempt to acquire a refcount on a 'struct kvm'
> > 2. freeing a kvm mmu page
> >
> > This would probably be most ideal if we can expose some
> > of the work done by kvm_mmu_remove_some_alloc_mmu_pages()
> > as also counting as scanning, but I think we have churned
> > enough for the moment.
> 
> It usually removes one page.

Does it always just go right now and free it, or is there any real
scanning that has to go on?

> > diff -puN arch/x86/kvm/mmu.c~make-shrinker-more-aggressive arch/x86/kvm/mmu.c
> > --- linux-2.6.git/arch/x86/kvm/mmu.c~make-shrinker-more-aggressive	2010-06-14 11:30:44.000000000 -0700
> > +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-14 11:38:04.000000000 -0700
> > @@ -2935,8 +2935,10 @@ static int shrink_kvm_mmu(struct kvm *kv
> >
> >   	idx = srcu_read_lock(&kvm->srcu);
> >   	spin_lock(&kvm->mmu_lock);
> > -	if (kvm->arch.n_used_mmu_pages > 0)
> > -		freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
> > +	while (nr_to_scan > 0 && kvm->arch.n_used_mmu_pages > 0) {
> > +		freed_pages += kvm_mmu_remove_some_alloc_mmu_pages(kvm);
> > +		nr_to_scan--;
> > +	}
> >    
> 
> What tree are you patching?

These applied to Linus's latest as of yesterday.

-- Dave


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 3/9] replace x86 kvm n_free_mmu_pages with n_used_mmu_pages
  2010-06-16 14:25   ` Marcelo Tosatti
@ 2010-06-16 15:42     ` Dave Hansen
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2010-06-16 15:42 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, kvm

On Wed, 2010-06-16 at 11:25 -0300, Marcelo Tosatti wrote:
> > -     if (used_pages > kvm_nr_mmu_pages) {
> > -             while (used_pages > kvm_nr_mmu_pages &&
> > +     if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) {
> > +             while (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages &&
> >                       !list_empty(&kvm->arch.active_mmu_pages)) {
> >                       struct kvm_mmu_page *page;
> >  
> >                       page = container_of(kvm->arch.active_mmu_pages.prev,
> >                                           struct kvm_mmu_page, link);
> > -                     used_pages -= kvm_mmu_zap_page(kvm, page);
> > -                     used_pages--;
> > +                     kvm->arch.n_used_mmu_pages -= kvm_mmu_zap_page(kvm, page);
> > +                     kvm->arch.n_used_mmu_pages--;
> 
> kvm->arch.n_used_mmu_pages is deaccounted for in kvm_mmu_zap_page ->
> kvm_mmu_free_page
> already.

Ahh, I see that now.  Thanks!
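
So the loop should read roughly like this, with no manual accounting at
all, since kvm_mmu_zap_page() -> kvm_mmu_free_page() already adjusts
n_used_mmu_pages (shrink_to_goal() is just an illustrative stand-in for
the surrounding function):

static void shrink_to_goal(struct kvm *kvm, unsigned int goal_nr_mmu_pages)
{
	while (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages &&
	       !list_empty(&kvm->arch.active_mmu_pages)) {
		struct kvm_mmu_page *page;

		/* zap the coldest page; the accounting happens inside */
		page = container_of(kvm->arch.active_mmu_pages.prev,
				    struct kvm_mmu_page, link);
		kvm_mmu_zap_page(kvm, page);
	}
}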

-- Dave


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value
  2010-06-16  8:48   ` Avi Kivity
  2010-06-16 15:06     ` Dave Hansen
@ 2010-06-16 16:55     ` Dave Hansen
  2010-06-17  8:23       ` Avi Kivity
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2010-06-16 16:55 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-kernel, kvm

On Wed, 2010-06-16 at 11:48 +0300, Avi Kivity wrote:
> > +static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
> > +{
> > +     kvm->arch.n_used_mmu_pages += nr;
> > +     kvm_total_used_mmu_pages += nr;
> >    
> 
> Needs an atomic operation, since there's no global lock here.  To avoid 
> bouncing this cacheline around, make the variable percpu and make 
> readers take a sum across all cpus.  Side benefit is that you no longer 
> need an atomic but a local_t, which is considerably cheaper.

We do have the stuff in:

	include/linux/percpu_counter.h

the downside being that they're not precise and they're *HUGE* according
to the comment. :)

It's actually fairly difficult to do a counter which is precise,
scalable, and works well for small CPU counts when NR_CPUS is large.  Do
you mind if we just stick with a plain atomic_t for now?
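
For reference, this is the kind of thing I understand you to be
suggesting: a per-cpu local_t on the update side, with readers summing
across CPUs.  The names here are illustrative, and the read side is
only approximate, which is exactly the imprecision I'm worried about:

#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <asm/local.h>

static DEFINE_PER_CPU(local_t, used_mmu_pages);

/* update side: cheap, no shared cacheline to bounce */
static inline void mod_used_mmu_pages(int nr)
{
	local_add(nr, &get_cpu_var(used_mmu_pages));
	put_cpu_var(used_mmu_pages);
}

/* read side (shrinker query): approximate, but cheap enough */
static long total_used_mmu_pages(void)
{
	long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += local_read(&per_cpu(used_mmu_pages, cpu));
	return sum;
}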

-- Dave


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value
  2010-06-16 16:55     ` Dave Hansen
@ 2010-06-17  8:23       ` Avi Kivity
  0 siblings, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2010-06-17  8:23 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm

On 06/16/2010 07:55 PM, Dave Hansen wrote:
> On Wed, 2010-06-16 at 11:48 +0300, Avi Kivity wrote:
>    
>>> +static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>>> +{
>>> +     kvm->arch.n_used_mmu_pages += nr;
>>> +     kvm_total_used_mmu_pages += nr;
>>>
>>>        
>> Needs an atomic operation, since there's no global lock here.  To avoid
>> bouncing this cacheline around, make the variable percpu and make
>> readers take a sum across all cpus.  Side benefit is that you no longer
>> need an atomic but a local_t, which is considerably cheaper.
>>      
> We do have the stuff in:
>
> 	include/linux/percpu_counter.h
>
> the downside being that they're not precise and they're *HUGE* according
> to the comment. :)
>
> It's actually fairly difficult to do a counter which is precise,
> scalable, and works well for small CPU counts when NR_CPUS is large.  Do
> you mind if we just stick with a plain atomic_t for now?
>    

Do we really need something precise?

I'm not excited by adding a global atomic.  So far nothing in the kvm 
hot paths depends on global shared memory (though we have lots of per-vm 
shared memory).

Can we perhaps query the kmem_cache for a representation of the amount 
of objects?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive
  2010-06-16 15:25     ` Dave Hansen
@ 2010-06-17  8:37       ` Avi Kivity
  2010-06-18 15:49       ` Dave Hansen
  1 sibling, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2010-06-17  8:37 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm

On 06/16/2010 06:25 PM, Dave Hansen wrote:
>
>>> If mmu_shrink() has already done a significant amount of
>>> scanning, the use of 'nr_to_scan' inside shrink_kvm_mmu()
>>> will also ensure that we do not over-reclaim when we have
>>> already done a lot of work in this call.
>>>
>>> In the end, this patch defines a "scan" as:
>>> 1. An attempt to acquire a refcount on a 'struct kvm'
>>> 2. freeing a kvm mmu page
>>>
>>> This would probably be most ideal if we can expose some
>>> of the work done by kvm_mmu_remove_some_alloc_mmu_pages()
>>> as also counting as scanning, but I think we have churned
>>> enough for the moment.
>>>        
>> It usually removes one page.
>>      
> Does it always just go right now and free it, or is there any real
> scanning that has to go on?
>    

It picks a page from the tail of the LRU and frees it.  There is very 
little attempt to keep the LRU in LRU order, though.

We do need a scanner that looks at spte accessed bits if this isn't 
going to result in performance losses.
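
In other words, roughly the following; this is reconstructed from the
callers quoted in this thread rather than from either tree, so treat it
as a sketch:

static int kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm)
{
	struct kvm_mmu_page *page;

	/* no real scanning: take whatever sits at the tail of the LRU */
	page = container_of(kvm->arch.active_mmu_pages.prev,
			    struct kvm_mmu_page, link);
	/* the page itself, plus anything zapping it tore down */
	return kvm_mmu_zap_page(kvm, page) + 1;
}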

>>> diff -puN arch/x86/kvm/mmu.c~make-shrinker-more-aggressive arch/x86/kvm/mmu.c
>>> --- linux-2.6.git/arch/x86/kvm/mmu.c~make-shrinker-more-aggressive	2010-06-14 11:30:44.000000000 -0700
>>> +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c	2010-06-14 11:38:04.000000000 -0700
>>> @@ -2935,8 +2935,10 @@ static int shrink_kvm_mmu(struct kvm *kv
>>>
>>>    	idx = srcu_read_lock(&kvm->srcu);
>>>    	spin_lock(&kvm->mmu_lock);
>>> -	if (kvm->arch.n_used_mmu_pages > 0)
>>> -		freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
>>> +	while (nr_to_scan > 0 && kvm->arch.n_used_mmu_pages > 0) {
>>> +		freed_pages += kvm_mmu_remove_some_alloc_mmu_pages(kvm);
>>> +		nr_to_scan--;
>>> +	}
>>>
>>>        
>> What tree are you patching?
>>      
> These applied to Linus's latest as of yesterday.
>    

Please patch against kvm.git master (or next, which is usually a few 
not-yet-regression-tested patches ahead).  This code has changed.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 0/9] rework KVM mmu_shrink() code
  2010-06-16 15:03   ` Dave Hansen
@ 2010-06-17  8:40     ` Avi Kivity
  0 siblings, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2010-06-17  8:40 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm

On 06/16/2010 06:03 PM, Dave Hansen wrote:
> On Wed, 2010-06-16 at 11:38 +0300, Avi Kivity wrote:
>    
>> On 06/15/2010 04:55 PM, Dave Hansen wrote:
>>      
>>> These seem to boot and run fine.  I'm running about 40 VMs at
>>> once, while doing "echo 3 > /proc/sys/vm/drop_caches", and
>>> killing/restarting VMs constantly.
>>>
>>>        
>> Will drop_caches actually shrink the kvm caches too?  If so we probably
>> need to add that to autotest since it's a really good stress test for
>> the mmu.
>>      
> I'm completely sure.

Yes, easily seen from the code as well.

> I crashed my machines several times this way
> during testing.
>    

Hopefully only with your patches applied?

I'll try to run autotest from time to time with drop_caches running in 
the background.  Looks like an excellent way to stress out the mmu.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value
  2010-06-16 15:06     ` Dave Hansen
@ 2010-06-17  8:43       ` Avi Kivity
  0 siblings, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2010-06-17  8:43 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm

On 06/16/2010 06:06 PM, Dave Hansen wrote:
>
>
>>> +static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>>> +{
>>> +	kvm->arch.n_used_mmu_pages += nr;
>>> +	kvm_total_used_mmu_pages += nr;
>>>
>>>        
>> Needs an atomic operation, since there's no global lock here.  To avoid
>> bouncing this cacheline around, make the variable percpu and make
>> readers take a sum across all cpus.  Side benefit is that you no longer
>> need an atomic but a local_t, which is considerably cheaper.
>>      
> That's a good point.  All of the modifications are done under locks, but
> the fast path isn't any more.  I'll fix it up.
>
>    

Note, even before you moved the shrinker out of the lock, this was a 
problem.  kvm_mod_used_mmu_pages() is called not just from the shrinker 
code, we zap pages for other reasons as well.
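
i.e. even the plain atomic_t variant would need to look something like
this on every zap path (a sketch only; the series may well end up with
something cheaper):

#include <asm/atomic.h>

static atomic_t kvm_total_used_mmu_pages = ATOMIC_INIT(0);

static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
{
	/* the per-VM count is still protected by kvm->mmu_lock */
	kvm->arch.n_used_mmu_pages += nr;
	/* the global count is touched from every zap path, so it must be atomic */
	atomic_add(nr, &kvm_total_used_mmu_pages);
}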

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive
  2010-06-16 15:25     ` Dave Hansen
  2010-06-17  8:37       ` Avi Kivity
@ 2010-06-18 15:49       ` Dave Hansen
  2010-06-20  8:11         ` Avi Kivity
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2010-06-18 15:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-kernel, kvm

On Wed, 2010-06-16 at 08:25 -0700, Dave Hansen wrote:
> On Wed, 2010-06-16 at 12:24 +0300, Avi Kivity wrote:
> > On 06/15/2010 04:55 PM, Dave Hansen wrote:
> > > In a previous patch, we removed the 'nr_to_scan' tracking.
> > > It was not being used to track the number of objects
> > > scanned, so we stopped using it entirely.  Here, we
> > > start using it again.
> > >
> > > The theory here is simple; if we already have the refcount
> > > and the kvm->mmu_lock, then we should do as much work as
> > > possible under the lock.  The downside is that we're less
> > > fair about the KVM instances from which we reclaim.  Each
> > > call to mmu_shrink() will tend to "pick on" one instance,
> > > after which it gets moved to the end of the list and left
> > > alone for a while.
> > >    
> > 
> > That also increases the latency hit, as well as a potential fault storm, 
> > on that instance.  Spreading out is less efficient, but smoother.
> 
> This is probably something that we need to go back and actually measure.
> My suspicion is that, when memory fills up and this shrinker is getting
> called a lot, it will be naturally fair.  That list gets shuffled around
> enough, and mmu_shrink() called often enough that no VMs get picked on
> too unfairly.
> 
> I'll go back and see if I can quantify this a bit, though.

The shrink _query_ (mmu_shrink() with nr_to_scan=0) code is called
really, really often.  Like 5,000-10,000 times a second during lots of
VM pressure.  But, it's almost never called on to actually shrink
anything.

Over the 20 minutes or so that I tested, I saw about 700k calls to
mmu_shrink().  But, only 6 (yes, six) calls that had a non-zero
nr_to_scan.  I'm not sure whether this is because of the .seeks argument
to the shrinker or what, but the slab code stays far, far away from
making mmu_shrink() do much real work.
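
To be clear about the mechanism: with this shrinker API the same
callback handles both the query and the actual reclaim, and
nr_to_scan == 0 just asks for the cache size.  Stripped down, with the
reclaim loop elided and kvm_total_used_mmu_pages standing in for
whatever aggregate count we report:

static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
{
	if (nr_to_scan == 0)
		goto out;	/* query only: just report the size */

	/* ... walk vm_list and zap mmu pages here ... */

out:
	return kvm_total_used_mmu_pages;	/* what vmscan sizes against */
}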

That changes a few things.  I bet all the contention we were seeing was
just from nr_to_scan=0 calls and not from actual shrink operations.
Perhaps we should just stop this set after patch 4.

Any thoughts?

-- Dave


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive
  2010-06-18 15:49       ` Dave Hansen
@ 2010-06-20  8:11         ` Avi Kivity
  2010-06-22 16:32           ` Dave Hansen
  0 siblings, 1 reply; 31+ messages in thread
From: Avi Kivity @ 2010-06-20  8:11 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm

On 06/18/2010 06:49 PM, Dave Hansen wrote:
> On Wed, 2010-06-16 at 08:25 -0700, Dave Hansen wrote:
>    
>> On Wed, 2010-06-16 at 12:24 +0300, Avi Kivity wrote:
>>      
>>> On 06/15/2010 04:55 PM, Dave Hansen wrote:
>>>        
>>>> In a previous patch, we removed the 'nr_to_scan' tracking.
>>>> It was not being used to track the number of objects
>>>> scanned, so we stopped using it entirely.  Here, we
>>>> start using it again.
>>>>
>>>> The theory here is simple; if we already have the refcount
>>>> and the kvm->mmu_lock, then we should do as much work as
>>>> possible under the lock.  The downside is that we're less
>>>> fair about the KVM instances from which we reclaim.  Each
>>>> call to mmu_shrink() will tend to "pick on" one instance,
>>>> after which it gets moved to the end of the list and left
>>>> alone for a while.
>>>>
>>>>          
>>> That also increases the latency hit, as well as a potential fault storm,
>>> on that instance.  Spreading out is less efficient, but smoother.
>>>        
>> This is probably something that we need to go back and actually measure.
>> My suspicion is that, when memory fills up and this shrinker is getting
>> called a lot, it will be naturally fair.  That list gets shuffled around
>> enough, and mmu_shrink() called often enough that no VMs get picked on
>> too unfairly.
>>
>> I'll go back and see if I can quantify this a bit, though.
>>      
> The shrink _query_ (mmu_shrink() with nr_to_scan=0) code is called
> really, really often.  Like 5,000-10,000 times a second during lots of
> VM pressure.  But, it's almost never called on to actually shrink
> anything.
>
> Over the 20 minutes or so that I tested, I saw about 700k calls to
> mmu_shrink().  But, only 6 (yes, six) calls that had a non-zero
> nr_to_scan.  I'm not sure whether this is because of the .seeks argument
> to the shrinker or what, but the slab code stays far, far away from
> making mmu_shrink() do much real work.
>    

Certainly seems so from vmscan.c.

> That changes a few things.  I bet all the contention we were seeing was
> just from nr_to_scan=0 calls and not from actual shrink operations.
> Perhaps we should just stop this set after patch 4.
>    

At the very least, we should re-measure things.

Even afterwards, we might reduce .seeks in return for making the 
shrinker cleverer and eliminating the cap on mmu pages.  But I'm afraid 
the interface between vmscan and the shrinker is too simplistic; 
sometimes we can trim pages without much cost (unreferenced pages), but 
some pages are really critical for performance.  To see real 
improvement, we might need our own scanner.
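
For reference, .seeks is set when the shrinker is registered; roughly
like the following, where the large multiplier is what tells vmscan
these objects are expensive to recreate and keeps it away from the
cache.  The init function name and the exact constant here are
illustrative:

#include <linux/mm.h>	/* struct shrinker, register_shrinker(), DEFAULT_SEEKS */

static struct shrinker mmu_shrinker = {
	.shrink	= mmu_shrink,
	.seeks	= 10 * DEFAULT_SEEKS,
};

static int __init mmu_shrinker_setup(void)
{
	register_shrinker(&mmu_shrinker);
	return 0;
}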

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive
  2010-06-20  8:11         ` Avi Kivity
@ 2010-06-22 16:32           ` Dave Hansen
  2010-07-22  4:36             ` Avi Kivity
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2010-06-22 16:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-kernel, kvm

On Sun, 2010-06-20 at 11:11 +0300, Avi Kivity wrote:
> > That changes a few things.  I bet all the contention we were seeing was
> > just from nr_to_scan=0 calls and not from actual shrink operations.
> > Perhaps we should just stop this set after patch 4.
> >    
> 
> At the very least, we should re-measure things. 

Sure.  I'll go back to the folks that found this in the first place and
see how these patches affect the contention we were seeing.

-- Dave


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive
  2010-06-22 16:32           ` Dave Hansen
@ 2010-07-22  4:36             ` Avi Kivity
  2010-07-22  5:36               ` Dave Hansen
  0 siblings, 1 reply; 31+ messages in thread
From: Avi Kivity @ 2010-07-22  4:36 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm

On 06/22/2010 07:32 PM, Dave Hansen wrote:
> On Sun, 2010-06-20 at 11:11 +0300, Avi Kivity wrote:
>    
>>> That changes a few things.  I bet all the contention we were seeing was
>>> just from nr_to_scan=0 calls and not from actual shrink operations.
>>> Perhaps we should just stop this set after patch 4.
>>>
>>>        
>> At the very least, we should re-measure things.
>>      
> Sure.  I'll go back to the folks that found this in the first place and
> see how these patches affect the contention we were seeing.
>    

Dave, how did those tests go?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive
  2010-07-22  4:36             ` Avi Kivity
@ 2010-07-22  5:36               ` Dave Hansen
  2010-07-22  5:42                 ` Avi Kivity
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2010-07-22  5:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-kernel, kvm, Karl Rister

On Thu, 2010-07-22 at 07:36 +0300, Avi Kivity wrote:
> On 06/22/2010 07:32 PM, Dave Hansen wrote:
> > On Sun, 2010-06-20 at 11:11 +0300, Avi Kivity wrote:
> >>> That changes a few things.  I bet all the contention we were seeing was
> >>> just from nr_to_scan=0 calls and not from actual shrink operations.
> >>> Perhaps we should just stop this set after patch 4.
> >>>        
> >> At the very least, we should re-measure things.
> >>      
> > Sure.  I'll go back to the folks that found this in the first place and
> > see how these patches affect the contention we were seeing.
> 
> Dave, how did those tests go?

Still waiting on the folks that found this in the first place to
reproduce it and see if the patches help.  I'll go nudge them some more.

-- Dave


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive
  2010-07-22  5:36               ` Dave Hansen
@ 2010-07-22  5:42                 ` Avi Kivity
  0 siblings, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2010-07-22  5:42 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, kvm, Karl Rister

On 07/22/2010 08:36 AM, Dave Hansen wrote:
>
>> Dave, how did those tests go?
>>      
> Still waiting on the folks that found this in the first place to
> reproduce it and see if the patches help.  I'll go nudge them some more.
>    

The merge window is coming up soon, and I'd like these patches to go in.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 31+ messages in thread


Thread overview: 31+ messages
2010-06-15 13:55 [RFC][PATCH 0/9] rework KVM mmu_shrink() code Dave Hansen
2010-06-15 13:55 ` [RFC][PATCH 1/9] abstract kvm x86 mmu->n_free_mmu_pages Dave Hansen
2010-06-16  8:40   ` Avi Kivity
2010-06-15 13:55 ` [RFC][PATCH 2/9] rename x86 kvm->arch.n_alloc_mmu_pages Dave Hansen
2010-06-15 13:55 ` [RFC][PATCH 3/9] replace x86 kvm n_free_mmu_pages with n_used_mmu_pages Dave Hansen
2010-06-16 14:25   ` Marcelo Tosatti
2010-06-16 15:42     ` Dave Hansen
2010-06-15 13:55 ` [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value Dave Hansen
2010-06-16  8:48   ` Avi Kivity
2010-06-16 15:06     ` Dave Hansen
2010-06-17  8:43       ` Avi Kivity
2010-06-16 16:55     ` Dave Hansen
2010-06-17  8:23       ` Avi Kivity
2010-06-15 13:55 ` [RFC][PATCH 5/9] break out some mmu_skrink() code Dave Hansen
2010-06-15 13:55 ` [RFC][PATCH 6/9] remove kvm_freed variable Dave Hansen
2010-06-15 13:55 ` [RFC][PATCH 7/9] make kvm_get_kvm() more robust Dave Hansen
2010-06-15 13:55 ` [RFC][PATCH 8/9] reduce kvm_lock hold times in mmu_skrink() Dave Hansen
2010-06-16  8:54   ` Avi Kivity
2010-06-15 13:55 ` [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive Dave Hansen
2010-06-16  9:24   ` Avi Kivity
2010-06-16 15:25     ` Dave Hansen
2010-06-17  8:37       ` Avi Kivity
2010-06-18 15:49       ` Dave Hansen
2010-06-20  8:11         ` Avi Kivity
2010-06-22 16:32           ` Dave Hansen
2010-07-22  4:36             ` Avi Kivity
2010-07-22  5:36               ` Dave Hansen
2010-07-22  5:42                 ` Avi Kivity
2010-06-16  8:38 ` [RFC][PATCH 0/9] rework KVM mmu_shrink() code Avi Kivity
2010-06-16 15:03   ` Dave Hansen
2010-06-17  8:40     ` Avi Kivity
