* [PATCH 0/9] KVM: X86: Some light optimizations on rmap logic
From: Peter Xu @ 2021-06-24 18:13 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Maxim Levitsky, peterx,
	Sean Christopherson

(This is still based on a random 5.13-rc3-ish branch, but I can rebase if needed)

All things started from patch 1, which introduces a new statistic to keep the
"max rmap entry count per vm".  At that time I was just curious how many rmap
entries there normally are for a guest, and the result surprised me a bit.

For TDP mappings it's all fine, as the rmap count of a page is mostly either 0
or 1 depending on whether it has been faulted in.  It turns out that with EPT=N
there can be a huge number of pages with tens or even hundreds of rmap entries,
even for an idle guest.  Then I continued with the rest.

To better understand how many of those pages there are, I did patches 2-6,
which introduce the idea of per-arch per-vm debugfs nodes and add a debug file
that collects rmap statistics; it is similar to kvm_arch_create_vcpu_debugfs()
but for the vm rather than the vcpu.

I do think this should be the cleaner approach, as I also see other archs
creating their own per-vm debugfs nodes in ad-hoc ways:

---8<---
*** arch/arm64/kvm/vgic/vgic-debug.c:
vgic_debug_init[274]           debugfs_create_file("vgic-state", 0444, kvm->debugfs_dentry, kvm,

*** arch/powerpc/kvm/book3s_64_mmu_hv.c:
kvmppc_mmu_debugfs_init[2115]  debugfs_create_file("htab", 0400, kvm->arch.debugfs_dir, kvm,

*** arch/powerpc/kvm/book3s_64_mmu_radix.c:
kvmhv_radix_debugfs_init[1434] debugfs_create_file("radix", 0400, kvm->arch.debugfs_dir, kvm,

*** arch/powerpc/kvm/book3s_hv.c:
debugfs_vcpu_init[2395]        debugfs_create_file("timings", 0444, vcpu->arch.debugfs_dir, vcpu,

*** arch/powerpc/kvm/book3s_xics.c:
xics_debugfs_init[1027]        xics->dentry = debugfs_create_file(name, 0444, powerpc_debugfs_root,

*** arch/powerpc/kvm/book3s_xive.c:
xive_debugfs_init[2236]        xive->dentry = debugfs_create_file(name, S_IRUGO, powerpc_debugfs_root,

*** arch/powerpc/kvm/timing.c:
kvmppc_create_vcpu_debugfs[214] debugfs_file = debugfs_create_file(dbg_fname, 0666, kvm_debugfs_dir,
---8<---

PPC even has its own per-vm dir for that.  I think if patches 2-6 can be
accepted, then the next thing to consider is merging all these usages under the
same existing per-vm dentry using the per-arch hooks introduced here.
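
For illustration, below is a minimal sketch of what such a per-arch hook could
look like once merged.  This is only a sketch: the "htab" file name and the
debugfs_htab_fops symbol are borrowed from the existing PPC usage above and
assumed here, not taken from an actual conversion patch.

---8<---
/* Hypothetical example: moving an arch-private debugfs file under the
 * common per-vm dentry via the hook introduced in patch 3.
 */
int kvm_arch_create_vm_debugfs(struct kvm *kvm)
{
	/* kvm->debugfs_dentry is guaranteed to exist when this is called */
	debugfs_create_file("htab", 0400, kvm->debugfs_dentry, kvm,
			    &debugfs_htab_fops);
	return 0;
}
---8<---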

The last 3 patches (patches 7-9) are a few optimizations of the existing rmap
logic.  The major test case I used is rmap_fork [1], though it's admittedly not
the ideal one to show their effect, since it covers both rmap_add and
rmap_remove, and I don't have a good idea on how to optimize rmap_remove
without changing the array structure or adding a lot of overhead (e.g. sorting
the array, or replacing the array list with some tree-like structure).  However
it already shows some benefit with these changes, so I'm posting them.
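
For reference, here is a rough sketch of the kind of workload the test
exercises.  This is an assumption about its general shape, not the actual code
behind [1]: fork N children that keep touching the parent's memory, so that
with EPT=N many shadow page tables end up pointing at the same guest pages and
the rmap lists grow.

---8<---
/* rmap_fork-style sketch (hypothetical): ./rmap_fork <nr_children> */
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int i, n = argc > 1 ? atoi(argv[1]) : 500;
	size_t j, size = 64UL << 20;
	volatile char *buf = malloc(size);
	char sum = 0;

	memset((void *)buf, 1, size);		/* fault pages in the parent */

	for (i = 0; i < n; i++) {
		if (fork() == 0) {
			/* read the shared pages so each child maps them too */
			for (j = 0; j < size; j += 4096)
				sum += buf[j];
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return sum;
}
---8<---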

Applying patches 7-8 brings an overall perf boost of 38% when forking 500
children with the test I used.  I didn't run a perf test on patch 9.  More
details are in the commit logs.

Please review, thanks.

[1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3

Peter Xu (9):
  KVM: X86: Add per-vm stat for max rmap list size
  KVM: Introduce kvm_get_kvm_safe()
  KVM: Allow to have arch-specific per-vm debugfs files
  KVM: X86: Introduce pte_list_count() helper
  KVM: X86: Introduce kvm_mmu_slot_lpages() helpers
  KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file
  KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger
  KVM: X86: Optimize pte_list_desc with per-array counter
  KVM: X86: Optimize zapping rmap

 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/kvm/mmu/mmu.c          |  90 +++++++++++++++++-----
 arch/x86/kvm/mmu/mmu_internal.h |   1 +
 arch/x86/kvm/x86.c              | 131 +++++++++++++++++++++++++++++++-
 include/linux/kvm_host.h        |   2 +
 virt/kvm/kvm_main.c             |  36 +++++++--
 6 files changed, 233 insertions(+), 28 deletions(-)

-- 
2.31.1




* [PATCH 1/9] KVM: X86: Add per-vm stat for max rmap list size
From: Peter Xu @ 2021-06-24 18:13 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Maxim Levitsky, peterx,
	Sean Christopherson

Add a new statistic, max_mmu_rmap_size, which stores the maximum rmap list
size seen for the vm.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/mmu/mmu.c          | 2 ++
 arch/x86/kvm/x86.c              | 1 +
 3 files changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55efbacfc244..937bc8021f7a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1138,6 +1138,7 @@ struct kvm_vm_stat {
 	ulong lpages;
 	ulong nx_lpage_splits;
 	ulong max_mmu_page_hash_collisions;
+	ulong max_mmu_rmap_size;
 };
 
 struct kvm_vcpu_stat {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5e60b00e8e50..6dd338738118 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2602,6 +2602,8 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	if (is_shadow_present_pte(*sptep)) {
 		if (!was_rmapped) {
 			rmap_count = rmap_add(vcpu, sptep, gfn);
+			if (rmap_count > vcpu->kvm->stat.max_mmu_rmap_size)
+				vcpu->kvm->stat.max_mmu_rmap_size = rmap_count;
 			if (rmap_count > RMAP_RECYCLE_THRESHOLD)
 				rmap_recycle(vcpu, sptep, gfn);
 		}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bbc4e04e67ad..4aa3cc6ae5d4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -257,6 +257,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 	VM_STAT("largepages", lpages, .mode = 0444),
 	VM_STAT("nx_largepages_splitted", nx_lpage_splits, .mode = 0444),
 	VM_STAT("max_mmu_page_hash_collisions", max_mmu_page_hash_collisions),
+	VM_STAT("max_mmu_rmap_size", max_mmu_rmap_size),
 	{ NULL }
 };
 
-- 
2.31.1



* [PATCH 2/9] KVM: Introduce kvm_get_kvm_safe()
From: Peter Xu @ 2021-06-24 18:13 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Maxim Levitsky, peterx,
	Sean Christopherson

Introduce kvm_get_kvm_safe(), a version of kvm_get_kvm() that can be called
even during vm destruction.  Use it in kvm_debugfs_open() and remove the
verbose comment there.  This prepares it to be used elsewhere as well.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 17 +++++++++++------
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2f34487e21f2..53d7d09eebd7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -698,6 +698,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 void kvm_exit(void);
 
 void kvm_get_kvm(struct kvm *kvm);
+bool kvm_get_kvm_safe(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);
 bool file_is_kvm(struct file *file);
 void kvm_put_kvm_no_destroy(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d170f65e15b0..0b4f55370b18 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1049,6 +1049,16 @@ void kvm_get_kvm(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_get_kvm);
 
+/*
+ * A safe version of kvm_get_kvm() that makes sure the vm is not under
+ * destruction.  Returns true if the reference was taken, false otherwise.
+ */
+bool kvm_get_kvm_safe(struct kvm *kvm)
+{
+	return refcount_inc_not_zero(&kvm->users_count);
+}
+EXPORT_SYMBOL_GPL(kvm_get_kvm_safe);
+
 void kvm_put_kvm(struct kvm *kvm)
 {
 	if (refcount_dec_and_test(&kvm->users_count))
@@ -4713,12 +4723,7 @@ static int kvm_debugfs_open(struct inode *inode, struct file *file,
 	struct kvm_stat_data *stat_data = (struct kvm_stat_data *)
 					  inode->i_private;
 
-	/* The debugfs files are a reference to the kvm struct which
-	 * is still valid when kvm_destroy_vm is called.
-	 * To avoid the race between open and the removal of the debugfs
-	 * directory we test against the users count.
-	 */
-	if (!refcount_inc_not_zero(&stat_data->kvm->users_count))
+	if (!kvm_get_kvm_safe(stat_data->kvm))
 		return -ENOENT;
 
 	if (simple_attr_open(inode, file, get,
-- 
2.31.1



* [PATCH 3/9] KVM: Allow to have arch-specific per-vm debugfs files
From: Peter Xu @ 2021-06-24 18:13 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Maxim Levitsky, peterx,
	Sean Christopherson

Allow archs to create arch-specific nodes under the kvm->debugfs_dentry
directory besides the generic stats fields.  The new interface
kvm_arch_create_vm_debugfs() is defined but not yet used.  It is called after
kvm->debugfs_dentry is created, so the dentry can be referenced directly inside
kvm_arch_create_vm_debugfs().  Archs should define their own versions of it
when they want to create extra debugfs nodes.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 53d7d09eebd7..480baa55d93f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1009,6 +1009,7 @@ bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu);
 bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
+int kvm_arch_create_vm_debugfs(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0b4f55370b18..6648743f4dcf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -847,6 +847,7 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
 	char dir_name[ITOA_MAX_LEN * 2];
 	struct kvm_stat_data *stat_data;
 	struct kvm_stats_debugfs_item *p;
+	int ret;
 
 	if (!debugfs_initialized())
 		return 0;
@@ -872,6 +873,13 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
 				    kvm->debugfs_dentry, stat_data,
 				    &stat_fops_per_vm);
 	}
+
+	ret = kvm_arch_create_vm_debugfs(kvm);
+	if (ret) {
+		kvm_destroy_vm_debugfs(kvm);
+		return ret;
+	}
+
 	return 0;
 }
 
@@ -892,6 +900,17 @@ void __weak kvm_arch_pre_destroy_vm(struct kvm *kvm)
 {
 }
 
+/*
+ * Called after the per-vm debugfs directory is created.  kvm->debugfs_dentry
+ * is already set up at this point, so arch-specific debugfs entries can be
+ * created under it.  Cleanup is done automatically and recursively by
+ * kvm_destroy_vm_debugfs(), so a per-arch destroy interface is not needed.
+ */
+int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
+{
+	return 0;
+}
+
 static struct kvm *kvm_create_vm(unsigned long type)
 {
 	struct kvm *kvm = kvm_arch_alloc_vm();
-- 
2.31.1



* [PATCH 4/9] KVM: X86: Introduce pte_list_count() helper
From: Peter Xu @ 2021-06-24 18:13 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Maxim Levitsky, peterx,
	Sean Christopherson

This helper is used to count the number of rmap entries in the rmap list
pointed to by the kvm_rmap_head.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c          | 21 +++++++++++++++++++++
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6dd338738118..80263ecb1de3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -927,6 +927,27 @@ static void pte_list_remove(struct kvm_rmap_head *rmap_head, u64 *sptep)
 	__pte_list_remove(sptep, rmap_head);
 }
 
+unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
+{
+	struct pte_list_desc *desc;
+	unsigned int i, count = 0;
+
+	if (!rmap_head->val)
+		return 0;
+	else if (!(rmap_head->val & 1))
+		return 1;
+
+	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+
+	while (desc) {
+		for (i = 0; (i < PTE_LIST_EXT) && desc->sptes[i]; i++)
+			count++;
+		desc = desc->more;
+	}
+
+	return count;
+}
+
 static struct kvm_rmap_head *__gfn_to_rmap(gfn_t gfn, int level,
 					   struct kvm_memory_slot *slot)
 {
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d64ccb417c60..3cd1a878ffeb 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -126,6 +126,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    struct kvm_memory_slot *slot, u64 gfn);
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
 					u64 start_gfn, u64 pages);
+unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);
 
 /*
  * Return values of handle_mmio_page_fault, mmu.page_fault, and fast_page_fault().
-- 
2.31.1



* [PATCH 5/9] KVM: X86: Introduce kvm_mmu_slot_lpages() helpers
From: Peter Xu @ 2021-06-24 18:13 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Maxim Levitsky, peterx,
	Sean Christopherson

Introduce kvm_mmu_slot_lpages() to calculate the lpage_info and rmap array
sizes.  The other helper, __kvm_mmu_slot_lpages(), takes an extra npages
parameter rather than fetching it from the memslot pointer.  Start using the
latter in kvm_alloc_memslot_metadata().
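
For illustration, a worked example of the math (my own arithmetic for review
purposes, assuming gfn_to_index() returns the large-page bucket index of a gfn
relative to the slot base):

---8<---
/*
 * slot->base_gfn = 0x100, npages = 0x300  =>  last gfn = 0x3ff
 * level = 2M  =>  one bucket covers 512 gfns (9-bit shift)
 *
 * __kvm_mmu_slot_lpages(slot, 0x300, level)
 *   = gfn_to_index(0x3ff, 0x100, level) + 1
 *   = (0x3ff >> 9) - (0x100 >> 9) + 1
 *   = 1 - 0 + 1
 *   = 2 lpage_info/rmap entries at the 2M level
 */
---8<---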

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/x86.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4aa3cc6ae5d4..d2acbea2f3b5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -269,6 +269,20 @@ static struct kmem_cache *x86_fpu_cache;
 
 static struct kmem_cache *x86_emulator_cache;
 
+static inline unsigned long
+__kvm_mmu_slot_lpages(struct kvm_memory_slot *slot, unsigned long npages,
+		      int level)
+{
+	return gfn_to_index(slot->base_gfn + npages - 1,
+			    slot->base_gfn, level) + 1;
+}
+
+static inline unsigned long
+kvm_mmu_slot_lpages(struct kvm_memory_slot *slot, int level)
+{
+	return __kvm_mmu_slot_lpages(slot, slot->npages, level);
+}
+
 /*
  * When called, it means the previous get/set msr reached an invalid msr.
  * Return true if we want to ignore/silent this failed msr access.
@@ -10933,8 +10947,7 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
 		int lpages;
 		int level = i + 1;
 
-		lpages = gfn_to_index(slot->base_gfn + npages - 1,
-				      slot->base_gfn, level) + 1;
+		lpages = __kvm_mmu_slot_lpages(slot, npages, level);
 
 		slot->arch.rmap[i] =
 			kvcalloc(lpages, sizeof(*slot->arch.rmap[i]),
-- 
2.31.1



* [PATCH 6/9] KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file
From: Peter Xu @ 2021-06-24 18:13 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Maxim Levitsky, peterx,
	Sean Christopherson

Use this file to dump rmap statistics.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/x86.c | 113 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2acbea2f3b5..6dfae8375c44 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -30,6 +30,7 @@
 #include "hyperv.h"
 #include "lapic.h"
 #include "xen.h"
+#include "mmu/mmu_internal.h"
 
 #include <linux/clocksource.h>
 #include <linux/interrupt.h>
@@ -58,6 +59,7 @@
 #include <linux/sched/isolation.h>
 #include <linux/mem_encrypt.h>
 #include <linux/entry-kvm.h>
+#include <linux/debugfs.h>
 
 #include <trace/events/kvm.h>
 
@@ -10763,6 +10765,117 @@ int kvm_arch_post_init_vm(struct kvm *kvm)
 	return kvm_mmu_post_init_vm(kvm);
 }
 
+/*
+ * This covers statistics <1024 (11=log(1024)+1), which should be enough to
+ * cover RMAP_RECYCLE_THRESHOLD.
+ */
+#define  RMAP_LOG_SIZE  11
+
+static const char *kvm_lpage_str[KVM_NR_PAGE_SIZES] = { "4K", "2M", "1G" };
+
+static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v)
+{
+	struct kvm_rmap_head *rmap;
+	struct kvm *kvm = m->private;
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	unsigned int lpage_size, index;
+	/* Still small enough to be on the stack */
+	unsigned int *log[KVM_NR_PAGE_SIZES], *cur;
+	int i, j, k, l, ret;
+
+	memset(log, 0, sizeof(log));
+
+	ret = -ENOMEM;
+	for (i = 0; i < KVM_NR_PAGE_SIZES; i++) {
+		log[i] = kzalloc(RMAP_LOG_SIZE * sizeof(unsigned int), GFP_KERNEL);
+		if (!log[i])
+			goto out;
+	}
+
+	mutex_lock(&kvm->slots_lock);
+	write_lock(&kvm->mmu_lock);
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+		for (j = 0; j < slots->used_slots; j++) {
+			slot = &slots->memslots[j];
+			for (k = 0; k < KVM_NR_PAGE_SIZES; k++) {
+				rmap = slot->arch.rmap[k];
+				lpage_size = kvm_mmu_slot_lpages(slot, k + 1);
+				cur = log[k];
+				for (l = 0; l < lpage_size; l++) {
+					index = ffs(pte_list_count(&rmap[l]));
+					if (WARN_ON_ONCE(index >= RMAP_LOG_SIZE))
+						index = RMAP_LOG_SIZE - 1;
+					cur[index]++;
+				}
+			}
+		}
+	}
+
+	write_unlock(&kvm->mmu_lock);
+	mutex_unlock(&kvm->slots_lock);
+
+	/* index=0 counts no rmap; index=1 counts 1 rmap */
+	seq_printf(m, "Rmap_Count:\t0\t1\t");
+	for (i = 2; i < RMAP_LOG_SIZE; i++) {
+		j = 1 << (i - 1);
+		k = (1 << i) - 1;
+		seq_printf(m, "%d-%d\t", j, k);
+	}
+	seq_printf(m, "\n");
+
+	for (i = 0; i < KVM_NR_PAGE_SIZES; i++) {
+		seq_printf(m, "Level=%s:\t", kvm_lpage_str[i]);
+		cur = log[i];
+		for (j = 0; j < RMAP_LOG_SIZE; j++)
+			seq_printf(m, "%d\t", cur[j]);
+		seq_printf(m, "\n");
+	}
+
+	ret = 0;
+out:
+	for (i = 0; i < KVM_NR_PAGE_SIZES; i++)
+		if (log[i])
+			kfree(log[i]);
+
+	return ret;
+}
+
+static int kvm_mmu_rmaps_stat_open(struct inode *inode, struct file *file)
+{
+	struct kvm *kvm = inode->i_private;
+
+	if (!kvm_get_kvm_safe(kvm))
+		return -ENOENT;
+
+	return single_open(file, kvm_mmu_rmaps_stat_show, kvm);
+}
+
+static int kvm_mmu_rmaps_stat_release(struct inode *inode, struct file *file)
+{
+	struct kvm *kvm = inode->i_private;
+
+	kvm_put_kvm(kvm);
+
+	return single_release(inode, file);
+}
+
+static const struct file_operations mmu_rmaps_stat_fops = {
+	.open		= kvm_mmu_rmaps_stat_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= kvm_mmu_rmaps_stat_release,
+};
+
+int kvm_arch_create_vm_debugfs(struct kvm *kvm)
+{
+	debugfs_create_file("mmu_rmaps_stat", 0644, kvm->debugfs_dentry, kvm,
+			    &mmu_rmaps_stat_fops);
+	return 0;
+}
+
 static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
 {
 	vcpu_load(vcpu);
-- 
2.31.1



* [PATCH 7/9] KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger
From: Peter Xu @ 2021-06-24 18:13 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Maxim Levitsky, peterx,
	Sean Christopherson

Currently each rmap array element (pte_list_desc) only contains 3 entries.
However, for EPT=N there can be a lot of guest pages that have tens or even
hundreds of rmap entries.

A typical distribution for a 6G guest (even when idle) shows this in the rmap
count statistics:

Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
Level=1G:       32      0       0       0       0       0       0       0       0       0       0

With some more forks, some pages will grow even larger rmap counts.

This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the
general EPT=N use case: we dereference the list less often and the loops over
PTE_LIST_EXT become slightly more efficient, while the array is still not so
large that much space is wasted when it isn't full.
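
For reference, a sketch of the resulting struct size (my own arithmetic, not
part of the patch, assuming 64-bit pointers and 64-byte cache lines):

---8<---
/*
 * struct pte_list_desc with PTE_LIST_EXT == 15:
 *
 *   sptes[15]    15 * 8 = 120 bytes
 *   more                =   8 bytes
 *   ---------------------------------
 *   total               = 128 bytes  ==  two 64-byte cache lines
 */
---8<---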

It should not affect EPT=Y, since with EPT there is normally only zero or one
rmap entry per page, so no array is even allocated.

With a test case that forks 500 children and recycles them ("./rmap_fork 500"
[1]), this patch speeds up the fork time by about 22%.

    Before: 367.20 (+-4.58%)
    After:  302.00 (+-5.30%)

[1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 80263ecb1de3..8888ae291cb9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -135,8 +135,8 @@ module_param(dbg, bool, 0644);
 
 #include <trace/events/kvm.h>
 
-/* make pte_list_desc fit well in cache line */
-#define PTE_LIST_EXT 3
+/* make pte_list_desc fit well in cache lines */
+#define PTE_LIST_EXT 15
 
 struct pte_list_desc {
 	u64 *sptes[PTE_LIST_EXT];
-- 
2.31.1



* [PATCH 8/9] KVM: X86: Optimize pte_list_desc with per-array counter
From: Peter Xu @ 2021-06-24 18:15 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: peterx, Paolo Bonzini, Vitaly Kuznetsov, Maxim Levitsky,
	Sean Christopherson

Add a counter field to pte_list_desc, so as to simplify the add/remove/loop
logic.  E.g., we don't need to loop over the array any more in most cases.

This makes more sense now that the array size has been made larger; otherwise
the counter would just be wasted space.
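
For reference, the struct size sketch after this change (again my own
arithmetic, not part of the patch, assuming 64-bit pointers and 64-byte cache
lines):

---8<---
/*
 * struct pte_list_desc with PTE_LIST_EXT == 14 plus the counter:
 *
 *   sptes[14]     14 * 8 = 112 bytes
 *   spte_count           =   8 bytes
 *   more                 =   8 bytes
 *   ----------------------------------
 *   total                = 128 bytes  ==  still two cache lines,
 *                          hence PTE_LIST_EXT drops from 15 to 14
 */
---8<---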

Initially I wanted to store a tail pointer at the head of the array list so we
wouldn't need to traverse the list at least for pushing new entries (without
the counter we traverse both the list and the array).  However that would need
slightly more change without much benefit; e.g., now that we've grown the
number of entries per array, traversing the list is not so expensive.

So let's keep it simple, but still try to get as much benefit as we can with
just these few extra lines of changes (not to mention the code looks simpler
too without looping over the arrays).

With the same test case that forks 500 children and recycles them
("./rmap_fork 500" [1]), this patch further speeds up the total fork time by
about 14%, for a total speedup of 38% over the vanilla kernel:

        Vanilla:      367.20 (+-4.58%)
        3->15 slots:  302.00 (+-5.30%)
        Add counter:  265.20 (+-9.88%)

[1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8888ae291cb9..b21e52dfc27b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -136,10 +136,15 @@ module_param(dbg, bool, 0644);
 #include <trace/events/kvm.h>
 
 /* make pte_list_desc fit well in cache lines */
-#define PTE_LIST_EXT 15
+#define PTE_LIST_EXT 14
 
 struct pte_list_desc {
 	u64 *sptes[PTE_LIST_EXT];
+	/*
+	 * Stores number of entries stored in the pte_list_desc.  No need to be
+	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
+	 */
+	u64 spte_count;
 	struct pte_list_desc *more;
 };
 
@@ -830,7 +835,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
-	int i, count = 0;
+	int count = 0;
 
 	if (!rmap_head->val) {
 		rmap_printk("%p %llx 0->1\n", spte, *spte);
@@ -840,24 +845,24 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		desc = mmu_alloc_pte_list_desc(vcpu);
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
+		desc->spte_count = 2;
 		rmap_head->val = (unsigned long)desc | 1;
 		++count;
 	} else {
 		rmap_printk("%p %llx many->many\n", spte, *spte);
 		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-		while (desc->sptes[PTE_LIST_EXT-1]) {
+		while (desc->spte_count == PTE_LIST_EXT) {
 			count += PTE_LIST_EXT;
-
 			if (!desc->more) {
 				desc->more = mmu_alloc_pte_list_desc(vcpu);
 				desc = desc->more;
+				desc->spte_count = 0;
 				break;
 			}
 			desc = desc->more;
 		}
-		for (i = 0; desc->sptes[i]; ++i)
-			++count;
-		desc->sptes[i] = spte;
+		count += desc->spte_count;
+		desc->sptes[desc->spte_count++] = spte;
 	}
 	return count;
 }
@@ -873,8 +878,10 @@ pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
 		;
 	desc->sptes[i] = desc->sptes[j];
 	desc->sptes[j] = NULL;
+	desc->spte_count--;
 	if (j != 0)
 		return;
+	WARN_ON_ONCE(desc->spte_count);
 	if (!prev_desc && !desc->more)
 		rmap_head->val = 0;
 	else
@@ -930,7 +937,7 @@ static void pte_list_remove(struct kvm_rmap_head *rmap_head, u64 *sptep)
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
-	unsigned int i, count = 0;
+	unsigned int count = 0;
 
 	if (!rmap_head->val)
 		return 0;
@@ -940,8 +947,7 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
 
 	while (desc) {
-		for (i = 0; (i < PTE_LIST_EXT) && desc->sptes[i]; i++)
-			count++;
+		count += desc->spte_count;
 		desc = desc->more;
 	}
 
-- 
2.31.1



* [PATCH 9/9] KVM: X86: Optimize zapping rmap
From: Peter Xu @ 2021-06-24 18:15 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: peterx, Paolo Bonzini, Vitaly Kuznetsov, Maxim Levitsky,
	Sean Christopherson

Using rmap_get_first() and rmap_remove() to zap a huge rmap list can be slow.
The easy way out is to traverse the rmap list once, collecting the
accessed/dirty bits and freeing the slots along the way.

Provide pte_list_destroy() and do exactly that.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 45 +++++++++++++++++++++++++++++++-----------
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b21e52dfc27b..719fb6fd0aa0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -954,6 +954,38 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 	return count;
 }
 
+/* Return true if rmap existed and callback called, false otherwise */
+static bool pte_list_destroy(struct kvm_rmap_head *rmap_head,
+			     int (*callback)(u64 *sptep))
+{
+	struct pte_list_desc *desc, *next;
+	int i;
+
+	if (!rmap_head->val)
+		return false;
+
+	if (!(rmap_head->val & 1)) {
+		if (callback)
+			callback((u64 *)rmap_head->val);
+		goto out;
+	}
+
+	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+
+	while (desc) {
+		if (callback)
+			for (i = 0; i < desc->spte_count; i++)
+				callback(desc->sptes[i]);
+		next = desc->more;
+		mmu_free_pte_list_desc(desc);
+		desc = next;
+	}
+out:
+	/* rmap_head is meaningless now, remember to reset it */
+	rmap_head->val = 0;
+	return true;
+}
+
 static struct kvm_rmap_head *__gfn_to_rmap(gfn_t gfn, int level,
 					   struct kvm_memory_slot *slot)
 {
@@ -1310,18 +1342,7 @@ static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
 static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			  struct kvm_memory_slot *slot)
 {
-	u64 *sptep;
-	struct rmap_iterator iter;
-	bool flush = false;
-
-	while ((sptep = rmap_get_first(rmap_head, &iter))) {
-		rmap_printk("spte %p %llx.\n", sptep, *sptep);
-
-		pte_list_remove(rmap_head, sptep);
-		flush = true;
-	}
-
-	return flush;
+	return pte_list_destroy(rmap_head, mmu_spte_clear_track_bits);
 }
 
 static bool kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-- 
2.31.1



* Re: [PATCH 6/9] KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file
From: Peter Xu @ 2021-06-24 18:22 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Maxim Levitsky, Sean Christopherson

On Thu, Jun 24, 2021 at 02:13:53PM -0400, Peter Xu wrote:
> Use this file to dump rmap statistic information.

I should have pasted some example output here in the commit message, which I
think could help with review, but I forgot.  There's actually an example in
another commit message, but for reference it looks like this (6G guest):

# cat mmu_rmaps_stat
Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
Level=1G:       32      0       0       0       0       0       0       0       0       0       0

We've also got the SMM address space, so the counts are simply doubled.

I'll append this to the commit message in the new version (as long as I don't
get a NAK on this patch).

> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 113 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 113 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d2acbea2f3b5..6dfae8375c44 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -30,6 +30,7 @@
>  #include "hyperv.h"
>  #include "lapic.h"
>  #include "xen.h"
> +#include "mmu/mmu_internal.h"
>  
>  #include <linux/clocksource.h>
>  #include <linux/interrupt.h>
> @@ -58,6 +59,7 @@
>  #include <linux/sched/isolation.h>
>  #include <linux/mem_encrypt.h>
>  #include <linux/entry-kvm.h>
> +#include <linux/debugfs.h>
>  
>  #include <trace/events/kvm.h>
>  
> @@ -10763,6 +10765,117 @@ int kvm_arch_post_init_vm(struct kvm *kvm)
>  	return kvm_mmu_post_init_vm(kvm);
>  }
>  
> +/*
> + * This covers statistics <1024 (11=log(1024)+1), which should be enough to
> + * cover RMAP_RECYCLE_THRESHOLD.
> + */
> +#define  RMAP_LOG_SIZE  11
> +
> +static const char *kvm_lpage_str[KVM_NR_PAGE_SIZES] = { "4K", "2M", "1G" };
> +
> +static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v)
> +{
> +	struct kvm_rmap_head *rmap;
> +	struct kvm *kvm = m->private;
> +	struct kvm_memory_slot *slot;
> +	struct kvm_memslots *slots;
> +	unsigned int lpage_size, index;
> +	/* Still small enough to be on the stack */
> +	unsigned int *log[KVM_NR_PAGE_SIZES], *cur;
> +	int i, j, k, l, ret;
> +
> +	memset(log, 0, sizeof(log));
> +
> +	ret = -ENOMEM;
> +	for (i = 0; i < KVM_NR_PAGE_SIZES; i++) {
> +		log[i] = kzalloc(RMAP_LOG_SIZE * sizeof(unsigned int), GFP_KERNEL);
> +		if (!log[i])
> +			goto out;
> +	}
> +
> +	mutex_lock(&kvm->slots_lock);
> +	write_lock(&kvm->mmu_lock);
> +
> +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +		slots = __kvm_memslots(kvm, i);
> +		for (j = 0; j < slots->used_slots; j++) {
> +			slot = &slots->memslots[j];
> +			for (k = 0; k < KVM_NR_PAGE_SIZES; k++) {
> +				rmap = slot->arch.rmap[k];
> +				lpage_size = kvm_mmu_slot_lpages(slot, k + 1);
> +				cur = log[k];
> +				for (l = 0; l < lpage_size; l++) {
> +					index = ffs(pte_list_count(&rmap[l]));
> +					if (WARN_ON_ONCE(index >= RMAP_LOG_SIZE))
> +						index = RMAP_LOG_SIZE - 1;
> +					cur[index]++;
> +				}
> +			}
> +		}
> +	}
> +
> +	write_unlock(&kvm->mmu_lock);
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	/* index=0 counts no rmap; index=1 counts 1 rmap */
> +	seq_printf(m, "Rmap_Count:\t0\t1\t");
> +	for (i = 2; i < RMAP_LOG_SIZE; i++) {
> +		j = 1 << (i - 1);
> +		k = (1 << i) - 1;
> +		seq_printf(m, "%d-%d\t", j, k);
> +	}
> +	seq_printf(m, "\n");
> +
> +	for (i = 0; i < KVM_NR_PAGE_SIZES; i++) {
> +		seq_printf(m, "Level=%s:\t", kvm_lpage_str[i]);
> +		cur = log[i];
> +		for (j = 0; j < RMAP_LOG_SIZE; j++)
> +			seq_printf(m, "%d\t", cur[j]);
> +		seq_printf(m, "\n");
> +	}
> +
> +	ret = 0;
> +out:
> +	for (i = 0; i < KVM_NR_PAGE_SIZES; i++)
> +		if (log[i])
> +			kfree(log[i]);
> +
> +	return ret;
> +}
> +
> +static int kvm_mmu_rmaps_stat_open(struct inode *inode, struct file *file)
> +{
> +	struct kvm *kvm = inode->i_private;
> +
> +	if (!kvm_get_kvm_safe(kvm))
> +		return -ENOENT;
> +
> +	return single_open(file, kvm_mmu_rmaps_stat_show, kvm);
> +}
> +
> +static int kvm_mmu_rmaps_stat_release(struct inode *inode, struct file *file)
> +{
> +	struct kvm *kvm = inode->i_private;
> +
> +	kvm_put_kvm(kvm);
> +
> +	return single_release(inode, file);
> +}
> +
> +static const struct file_operations mmu_rmaps_stat_fops = {
> +	.open		= kvm_mmu_rmaps_stat_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= kvm_mmu_rmaps_stat_release,
> +};
> +
> +int kvm_arch_create_vm_debugfs(struct kvm *kvm)
> +{
> +	debugfs_create_file("mmu_rmaps_stat", 0644, kvm->debugfs_dentry, kvm,
> +			    &mmu_rmaps_stat_fops);
> +	return 0;
> +}
> +
>  static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
>  {
>  	vcpu_load(vcpu);
> -- 
> 2.31.1
> 

-- 
Peter Xu



* Re: [PATCH 8/9] KVM: X86: Optimize pte_list_desc with per-array counter
From: Peter Xu @ 2021-06-24 22:53 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Maxim Levitsky, Sean Christopherson

On Thu, Jun 24, 2021 at 02:15:20PM -0400, Peter Xu wrote:
> Add a counter field into pte_list_desc, so as to simplify the add/remove/loop
> logic.  E.g., we don't need to loop over the array any more for most reasons.
> 
> This will make more sense after we've switched the array size to be larger
> otherwise the counter will be a waste.
> 
> Initially I wanted to store a tail pointer at the head of the array list so we
> don't need to traverse the list at least for pushing new ones (if without the
> counter we traverse both the list and the array).  However that'll need
> slightly more change without a huge lot benefit, e.g., after we grow entry
> numbers per array the list traversing is not so expensive.
> 
> So let's be simple but still try to get as much benefit as we can with just
> these extra few lines of changes (not to mention the code looks easier too
> without looping over arrays).
> 
> I used the same a test case to fork 500 child and recycle them ("./rmap_fork
> 500" [1]), this patch further speeds up the total fork time of about 14%, which
> is a total of 38% of vanilla kernel:
> 
>         Vanilla:      367.20 (+-4.58%)
>         3->15 slots:  302.00 (+-5.30%)
>         Add counter:  265.20 (+-9.88%)
> 
> [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8888ae291cb9..b21e52dfc27b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -136,10 +136,15 @@ module_param(dbg, bool, 0644);
>  #include <trace/events/kvm.h>
>  
>  /* make pte_list_desc fit well in cache lines */
> -#define PTE_LIST_EXT 15
> +#define PTE_LIST_EXT 14
>  
>  struct pte_list_desc {
>  	u64 *sptes[PTE_LIST_EXT];
> +	/*
> +	 * Stores number of entries stored in the pte_list_desc.  No need to be
> +	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
> +	 */
> +	u64 spte_count;
>  	struct pte_list_desc *more;
>  };
>  
> @@ -830,7 +835,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
>  			struct kvm_rmap_head *rmap_head)
>  {
>  	struct pte_list_desc *desc;
> -	int i, count = 0;
> +	int count = 0;
>  
>  	if (!rmap_head->val) {
>  		rmap_printk("%p %llx 0->1\n", spte, *spte);
> @@ -840,24 +845,24 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
>  		desc = mmu_alloc_pte_list_desc(vcpu);
>  		desc->sptes[0] = (u64 *)rmap_head->val;
>  		desc->sptes[1] = spte;
> +		desc->spte_count = 2;
>  		rmap_head->val = (unsigned long)desc | 1;
>  		++count;
>  	} else {
>  		rmap_printk("%p %llx many->many\n", spte, *spte);
>  		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
> -		while (desc->sptes[PTE_LIST_EXT-1]) {
> +		while (desc->spte_count == PTE_LIST_EXT) {
>  			count += PTE_LIST_EXT;
> -
>  			if (!desc->more) {
>  				desc->more = mmu_alloc_pte_list_desc(vcpu);
>  				desc = desc->more;
> +				desc->spte_count = 0;
>  				break;
>  			}
>  			desc = desc->more;
>  		}
> -		for (i = 0; desc->sptes[i]; ++i)
> -			++count;
> -		desc->sptes[i] = spte;
> +		count += desc->spte_count;
> +		desc->sptes[desc->spte_count++] = spte;
>  	}
>  	return count;
>  }
> @@ -873,8 +878,10 @@ pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
>  		;
>  	desc->sptes[i] = desc->sptes[j];
>  	desc->sptes[j] = NULL;
> +	desc->spte_count--;
>  	if (j != 0)
>  		return;
> +	WARN_ON_ONCE(desc->spte_count);
>  	if (!prev_desc && !desc->more)
>  		rmap_head->val = 0;
>  	else
> @@ -930,7 +937,7 @@ static void pte_list_remove(struct kvm_rmap_head *rmap_head, u64 *sptep)
>  unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
>  {
>  	struct pte_list_desc *desc;
> -	unsigned int i, count = 0;
> +	unsigned int count = 0;
>  
>  	if (!rmap_head->val)
>  		return 0;
> @@ -940,8 +947,7 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
>  	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
>  
>  	while (desc) {
> -		for (i = 0; (i < PTE_LIST_EXT) && desc->sptes[i]; i++)
> -			count++;
> +		count += desc->spte_count;
>  		desc = desc->more;
>  	}

I think I still missed another loop in pte_list_desc_remove_entry() that we
can drop.  With some other cleanups, I plan to squash the below into this
patch too:

---8<---
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 719fb6fd0aa0..2d8c56eb36f8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -872,16 +872,13 @@ pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
                           struct pte_list_desc *desc, int i,
                           struct pte_list_desc *prev_desc)
 {
-       int j;
+       int j = desc->spte_count - 1;
 
-       for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
-               ;
        desc->sptes[i] = desc->sptes[j];
        desc->sptes[j] = NULL;
        desc->spte_count--;
-       if (j != 0)
+       if (desc->spte_count)
                return;
-       WARN_ON_ONCE(desc->spte_count);
        if (!prev_desc && !desc->more)
                rmap_head->val = 0;
        else
@@ -913,7 +910,7 @@ static void __pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head)
                desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
                prev_desc = NULL;
                while (desc) {
-                       for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) {
+                       for (i = 0; i < desc->spte_count; ++i) {
                                if (desc->sptes[i] == spte) {
                                        pte_list_desc_remove_entry(rmap_head,
                                                        desc, i, prev_desc);
---8<---

-- 
Peter Xu


