linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic
@ 2021-06-25 15:32 Peter Xu
  2021-06-25 15:32 ` [PATCH v2 1/9] KVM: X86: Add per-vm stat for max rmap list size Peter Xu
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Peter Xu @ 2021-06-25 15:32 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Vitaly Kuznetsov, Paolo Bonzini, Sean Christopherson, peterx,
	Maxim Levitsky

v2:\r
- Rebased to kvm-queue since I found quite a few conflicts already\r
- Add an example into patch commit message of "KVM: X86: Introduce\r
  mmu_rmaps_stat per-vm debugfs file"\r
- Cleanup more places in patch "KVM: X86: Optimize pte_list_desc with per-array\r
  counter" and squashed\r
\r
All things started from patch 1, which introduced a new statistic to keep "max\r
rmap entry count per vm".  At that time I was just curious about how many rmap\r
is there normally for a guest, and it surprised me a bit.\r
\r
For TDP mappings it's all fine as mostly rmap of a page is either 0 or 1\r
depending on faulted or not.  It turns out with EPT=N there seems to be a huge\r
number of pages that can have tens or hundreds of rmap entries even for an idle\r
guest.  Then I continued with the rest.\r
\r
To understand better on "how much of those pages", I did patch 2-6 which\r
introduced the idea of per-arch per-vm debugfs nodes, and added a debug file to\r
do statistics for rmap, which is similar to kvm_arch_create_vcpu_debugfs() but\r
for vm not vcpu.\r
\r
I did notice this should be the clean approach as I also see other archs\r
randomly create some per-vm debugfs nodes there:\r
\r
---8<---\r
*** arch/arm64/kvm/vgic/vgic-debug.c:\r
vgic_debug_init[274]           debugfs_create_file("vgic-state", 0444, kvm->debugfs_dentry, kvm,\r
\r
*** arch/powerpc/kvm/book3s_64_mmu_hv.c:\r
kvmppc_mmu_debugfs_init[2115]  debugfs_create_file("htab", 0400, kvm->arch.debugfs_dir, kvm,\r
\r
*** arch/powerpc/kvm/book3s_64_mmu_radix.c:\r
kvmhv_radix_debugfs_init[1434] debugfs_create_file("radix", 0400, kvm->arch.debugfs_dir, kvm,\r
\r
*** arch/powerpc/kvm/book3s_hv.c:\r
debugfs_vcpu_init[2395]        debugfs_create_file("timings", 0444, vcpu->arch.debugfs_dir, vcpu,\r
\r
*** arch/powerpc/kvm/book3s_xics.c:\r
xics_debugfs_init[1027]        xics->dentry = debugfs_create_file(name, 0444, powerpc_debugfs_root,\r
\r
*** arch/powerpc/kvm/book3s_xive.c:\r
xive_debugfs_init[2236]        xive->dentry = debugfs_create_file(name, S_IRUGO, powerpc_debugfs_root,\r
\r
*** arch/powerpc/kvm/timing.c:\r
kvmppc_create_vcpu_debugfs[214] debugfs_file = debugfs_create_file(dbg_fname, 0666, kvm_debugfs_dir,\r
---8<---\r
\r
PPC even has its own per-vm dir for that.  I think if patch 2-6 can be\r
considered to be accepted then the next thing to consider is to merge all these\r
usages to be under the same existing per-vm dentry with their per-arch hooks\r
introduced.\r
\r
The last 3 patches (patch 7-9) are a few optimizations of existing rmap logic.\r
The major test case I used is rmap_fork [1], however it's not really the ideal\r
one to show their effect for sure as that test I wrote covers both\r
rmap_add/remove, while I don't have good idea on optimizing rmap_remove without\r
changing the array structure or adding much overhead (e.g. sort the array, or\r
making a tree-like structure somehow to replace the array list).  However it\r
already shows some benefit with those changes, so I post them out.\r
\r
Applying patch 7-8 will bring a summary of 38% perf boost when I fork 500\r
childs with the test I used.  Didn't run perf test on patch 9.  More in the\r
commit log.\r
\r
Please review, thanks.\r
\r
[1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3\r
\r
Peter Xu (9):\r
  KVM: X86: Add per-vm stat for max rmap list size\r
  KVM: Introduce kvm_get_kvm_safe()\r
  KVM: Allow to have arch-specific per-vm debugfs files\r
  KVM: X86: Introduce pte_list_count() helper\r
  KVM: X86: Introduce kvm_mmu_slot_lpages() helpers\r
  KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file\r
  KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger\r
  KVM: X86: Optimize pte_list_desc with per-array counter\r
  KVM: X86: Optimize zapping rmap\r
\r
 arch/x86/include/asm/kvm_host.h |   1 +\r
 arch/x86/kvm/mmu/mmu.c          |  97 +++++++++++++++++------\r
 arch/x86/kvm/mmu/mmu_internal.h |   1 +\r
 arch/x86/kvm/x86.c              | 131 +++++++++++++++++++++++++++++++-\r
 include/linux/kvm_host.h        |   2 +\r
 virt/kvm/kvm_main.c             |  37 +++++++--\r
 6 files changed, 235 insertions(+), 34 deletions(-)\r
\r
-- \r
2.31.1\r
\r


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 1/9] KVM: X86: Add per-vm stat for max rmap list size
  2021-06-25 15:32 [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Peter Xu
@ 2021-06-25 15:32 ` Peter Xu
  2021-06-25 15:32 ` [PATCH v2 2/9] KVM: Introduce kvm_get_kvm_safe() Peter Xu
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2021-06-25 15:32 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Vitaly Kuznetsov, Paolo Bonzini, Sean Christopherson, peterx,
	Maxim Levitsky

Add a new statistic max_mmu_rmap_size, which stores the maximum size of rmap
for the vm.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/mmu/mmu.c          | 2 ++
 arch/x86/kvm/x86.c              | 1 +
 3 files changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 974cbfb1eefe..d798650ad793 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1209,6 +1209,7 @@ struct kvm_vm_stat {
 	u64 lpages;
 	u64 nx_lpage_splits;
 	u64 max_mmu_page_hash_collisions;
+	u64 max_mmu_rmap_size;
 };
 
 struct kvm_vcpu_stat {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b888385d1933..eb16c1dbbb32 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2696,6 +2696,8 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	if (is_shadow_present_pte(*sptep)) {
 		if (!was_rmapped) {
 			rmap_count = rmap_add(vcpu, sptep, gfn);
+			if (rmap_count > vcpu->kvm->stat.max_mmu_rmap_size)
+				vcpu->kvm->stat.max_mmu_rmap_size = rmap_count;
 			if (rmap_count > RMAP_RECYCLE_THRESHOLD)
 				rmap_recycle(vcpu, sptep, gfn);
 		}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8166ad113fb2..d83ccf35ce99 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -237,6 +237,7 @@ const struct _kvm_stats_desc kvm_vm_stats_desc[] = {
 	STATS_DESC_ICOUNTER(VM, mmu_unsync),
 	STATS_DESC_ICOUNTER(VM, lpages),
 	STATS_DESC_ICOUNTER(VM, nx_lpage_splits),
+	STATS_DESC_PCOUNTER(VM, max_mmu_rmap_size),
 	STATS_DESC_PCOUNTER(VM, max_mmu_page_hash_collisions)
 };
 static_assert(ARRAY_SIZE(kvm_vm_stats_desc) ==
-- 
2.31.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 2/9] KVM: Introduce kvm_get_kvm_safe()
  2021-06-25 15:32 [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Peter Xu
  2021-06-25 15:32 ` [PATCH v2 1/9] KVM: X86: Add per-vm stat for max rmap list size Peter Xu
@ 2021-06-25 15:32 ` Peter Xu
  2021-07-26 13:42   ` Paolo Bonzini
  2021-06-25 15:32 ` [PATCH v2 3/9] KVM: Allow to have arch-specific per-vm debugfs files Peter Xu
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2021-06-25 15:32 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Vitaly Kuznetsov, Paolo Bonzini, Sean Christopherson, peterx,
	Maxim Levitsky

Introduce this safe version of kvm_get_kvm() so that it can be called even
during vm destruction.  Use it in kvm_debugfs_open() and remove the verbose
comment.  Prepare to be used elsewhere.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 17 +++++++++++------
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ae7735b490b4..c6fcd75dd8b9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -720,6 +720,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 void kvm_exit(void);
 
 void kvm_get_kvm(struct kvm *kvm);
+bool kvm_get_kvm_safe(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);
 bool file_is_kvm(struct file *file);
 void kvm_put_kvm_no_destroy(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3dcc2abbfc60..79b0c1b7b284 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1120,6 +1120,16 @@ void kvm_get_kvm(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_get_kvm);
 
+/*
+ * Make sure the vm is not during destruction, which is a safe version of
+ * kvm_get_kvm().  Return true if kvm referenced successfully, false otherwise.
+ */
+bool kvm_get_kvm_safe(struct kvm *kvm)
+{
+	return refcount_inc_not_zero(&kvm->users_count);
+}
+EXPORT_SYMBOL_GPL(kvm_get_kvm_safe);
+
 void kvm_put_kvm(struct kvm *kvm)
 {
 	if (refcount_dec_and_test(&kvm->users_count))
@@ -4925,12 +4935,7 @@ static int kvm_debugfs_open(struct inode *inode, struct file *file,
 	struct kvm_stat_data *stat_data = (struct kvm_stat_data *)
 					  inode->i_private;
 
-	/* The debugfs files are a reference to the kvm struct which
-	 * is still valid when kvm_destroy_vm is called.
-	 * To avoid the race between open and the removal of the debugfs
-	 * directory we test against the users count.
-	 */
-	if (!refcount_inc_not_zero(&stat_data->kvm->users_count))
+	if (!kvm_get_kvm_safe(stat_data->kvm))
 		return -ENOENT;
 
 	if (simple_attr_open(inode, file, get,
-- 
2.31.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 3/9] KVM: Allow to have arch-specific per-vm debugfs files
  2021-06-25 15:32 [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Peter Xu
  2021-06-25 15:32 ` [PATCH v2 1/9] KVM: X86: Add per-vm stat for max rmap list size Peter Xu
  2021-06-25 15:32 ` [PATCH v2 2/9] KVM: Introduce kvm_get_kvm_safe() Peter Xu
@ 2021-06-25 15:32 ` Peter Xu
  2021-06-25 15:32 ` [PATCH v2 4/9] KVM: X86: Introduce pte_list_count() helper Peter Xu
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2021-06-25 15:32 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Vitaly Kuznetsov, Paolo Bonzini, Sean Christopherson, peterx,
	Maxim Levitsky

Allow archs to create arch-specific nodes under kvm->debugfs_dentry directory
besides the stats fields.  The new interface kvm_arch_create_vm_debugfs() is
defined but not yet used.  It's called after kvm->debugfs_dentry is created, so
it can be referenced directly in kvm_arch_create_vm_debugfs().  Arch should
define their own versions when they want to create extra debugfs nodes.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 20 +++++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c6fcd75dd8b9..8521d3492eb2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1035,6 +1035,7 @@ bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu);
 bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
+int kvm_arch_create_vm_debugfs(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 79b0c1b7b284..516ba8d25bda 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -895,7 +895,7 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
 	char dir_name[ITOA_MAX_LEN * 2];
 	struct kvm_stat_data *stat_data;
 	const struct _kvm_stats_desc *pdesc;
-	int i;
+	int i, ret;
 	int kvm_debugfs_num_entries = kvm_vm_stats_header.num_desc +
 				      kvm_vcpu_stats_header.num_desc;
 
@@ -940,6 +940,13 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
 				    kvm->debugfs_dentry, stat_data,
 				    &stat_fops_per_vm);
 	}
+
+	ret = kvm_arch_create_vm_debugfs(kvm);
+	if (ret) {
+		kvm_destroy_vm_debugfs(kvm);
+		return i;
+	}
+
 	return 0;
 }
 
@@ -960,6 +967,17 @@ void __weak kvm_arch_pre_destroy_vm(struct kvm *kvm)
 {
 }
 
+/*
+ * Called after per-vm debugfs created.  When called kvm->debugfs_dentry should
+ * be setup already, so we can create arch-specific debugfs entries under it.
+ * Cleanup should be automatic done in kvm_destroy_vm_debugfs() recursively, so
+ * a per-arch destroy interface is not needed.
+ */
+int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
+{
+	return 0;
+}
+
 static struct kvm *kvm_create_vm(unsigned long type)
 {
 	struct kvm *kvm = kvm_arch_alloc_vm();
-- 
2.31.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 4/9] KVM: X86: Introduce pte_list_count() helper
  2021-06-25 15:32 [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Peter Xu
                   ` (2 preceding siblings ...)
  2021-06-25 15:32 ` [PATCH v2 3/9] KVM: Allow to have arch-specific per-vm debugfs files Peter Xu
@ 2021-06-25 15:32 ` Peter Xu
  2021-06-25 15:32 ` [PATCH v2 5/9] KVM: X86: Introduce kvm_mmu_slot_lpages() helpers Peter Xu
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2021-06-25 15:32 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Vitaly Kuznetsov, Paolo Bonzini, Sean Christopherson, peterx,
	Maxim Levitsky

This helper is used to count the number of rmap entries in the rmap list
pointed by the kvm_rmap_head.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c          | 21 +++++++++++++++++++++
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index eb16c1dbbb32..b3f738a7c05e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -990,6 +990,27 @@ static void pte_list_remove(struct kvm_rmap_head *rmap_head, u64 *sptep)
 	__pte_list_remove(sptep, rmap_head);
 }
 
+unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
+{
+	struct pte_list_desc *desc;
+	unsigned int i, count = 0;
+
+	if (!rmap_head->val)
+		return 0;
+	else if (!(rmap_head->val & 1))
+		return 1;
+
+	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+
+	while (desc) {
+		for (i = 0; (i < PTE_LIST_EXT) && desc->sptes[i]; i++)
+			count++;
+		desc = desc->more;
+	}
+
+	return count;
+}
+
 static struct kvm_rmap_head *__gfn_to_rmap(gfn_t gfn, int level,
 					   struct kvm_memory_slot *slot)
 {
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 35567293c1fd..325b4242deed 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -131,6 +131,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    int min_level);
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
 					u64 start_gfn, u64 pages);
+unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);
 
 /*
  * Return values of handle_mmio_page_fault, mmu.page_fault, and fast_page_fault().
-- 
2.31.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 5/9] KVM: X86: Introduce kvm_mmu_slot_lpages() helpers
  2021-06-25 15:32 [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Peter Xu
                   ` (3 preceding siblings ...)
  2021-06-25 15:32 ` [PATCH v2 4/9] KVM: X86: Introduce pte_list_count() helper Peter Xu
@ 2021-06-25 15:32 ` Peter Xu
  2021-06-25 15:32 ` [PATCH v2 6/9] KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file Peter Xu
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2021-06-25 15:32 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Vitaly Kuznetsov, Paolo Bonzini, Sean Christopherson, peterx,
	Maxim Levitsky

Introduce kvm_mmu_slot_lpages() to calculcate lpage_info and rmap array size.
The other __kvm_mmu_slot_lpages() can take an extra parameter of npages rather
than fetching from the memslot pointer.  Start to use the latter one in
kvm_alloc_memslot_metadata().

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/x86.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d83ccf35ce99..b64708f9f27d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -301,6 +301,20 @@ static struct kmem_cache *x86_fpu_cache;
 
 static struct kmem_cache *x86_emulator_cache;
 
+static inline unsigned long
+__kvm_mmu_slot_lpages(struct kvm_memory_slot *slot, unsigned long npages,
+		      int level)
+{
+	return gfn_to_index(slot->base_gfn + npages - 1,
+			    slot->base_gfn, level) + 1;
+}
+
+static inline unsigned long
+kvm_mmu_slot_lpages(struct kvm_memory_slot *slot, int level)
+{
+	return __kvm_mmu_slot_lpages(slot, slot->npages, level);
+}
+
 /*
  * When called, it means the previous get/set msr reached an invalid msr.
  * Return true if we want to ignore/silent this failed msr access.
@@ -11388,8 +11402,7 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 		int lpages;
 		int level = i + 1;
 
-		lpages = gfn_to_index(slot->base_gfn + npages - 1,
-				      slot->base_gfn, level) + 1;
+		lpages = __kvm_mmu_slot_lpages(slot, npages, level);
 
 		linfo = kvcalloc(lpages, sizeof(*linfo), GFP_KERNEL_ACCOUNT);
 		if (!linfo)
-- 
2.31.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 6/9] KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file
  2021-06-25 15:32 [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Peter Xu
                   ` (4 preceding siblings ...)
  2021-06-25 15:32 ` [PATCH v2 5/9] KVM: X86: Introduce kvm_mmu_slot_lpages() helpers Peter Xu
@ 2021-06-25 15:32 ` Peter Xu
  2021-06-25 15:34 ` [PATCH v2 7/9] KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger Peter Xu
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2021-06-25 15:32 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Vitaly Kuznetsov, Paolo Bonzini, Sean Christopherson, peterx,
	Maxim Levitsky

Use this file to dump rmap statistic information.  The statistic is done by
calculating the rmap count and the result is log-2-based.

An example output of this looks like (idle 6GB guest, right after boot linux):

Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
Level=4K:       3086676 53045   12330   1272    502     121     76      2       0       0       0
Level=2M:       5947    231     0       0       0       0       0       0       0       0       0
Level=1G:       32      0       0       0       0       0       0       0       0       0       0

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/x86.c | 113 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b64708f9f27d..c3c51361cf6c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -30,6 +30,7 @@
 #include "hyperv.h"
 #include "lapic.h"
 #include "xen.h"
+#include "mmu/mmu_internal.h"
 
 #include <linux/clocksource.h>
 #include <linux/interrupt.h>
@@ -59,6 +60,7 @@
 #include <linux/mem_encrypt.h>
 #include <linux/entry-kvm.h>
 #include <linux/suspend.h>
+#include <linux/debugfs.h>
 
 #include <trace/events/kvm.h>
 
@@ -11138,6 +11140,117 @@ int kvm_arch_post_init_vm(struct kvm *kvm)
 	return kvm_mmu_post_init_vm(kvm);
 }
 
+/*
+ * This covers statistics <1024 (11=log(1024)+1), which should be enough to
+ * cover RMAP_RECYCLE_THRESHOLD.
+ */
+#define  RMAP_LOG_SIZE  11
+
+static const char *kvm_lpage_str[KVM_NR_PAGE_SIZES] = { "4K", "2M", "1G" };
+
+static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v)
+{
+	struct kvm_rmap_head *rmap;
+	struct kvm *kvm = m->private;
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	unsigned int lpage_size, index;
+	/* Still small enough to be on the stack */
+	unsigned int *log[KVM_NR_PAGE_SIZES], *cur;
+	int i, j, k, l, ret;
+
+	memset(log, 0, sizeof(log));
+
+	ret = -ENOMEM;
+	for (i = 0; i < KVM_NR_PAGE_SIZES; i++) {
+		log[i] = kzalloc(RMAP_LOG_SIZE * sizeof(unsigned int), GFP_KERNEL);
+		if (!log[i])
+			goto out;
+	}
+
+	mutex_lock(&kvm->slots_lock);
+	write_lock(&kvm->mmu_lock);
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+		for (j = 0; j < slots->used_slots; j++) {
+			slot = &slots->memslots[j];
+			for (k = 0; k < KVM_NR_PAGE_SIZES; k++) {
+				rmap = slot->arch.rmap[k];
+				lpage_size = kvm_mmu_slot_lpages(slot, k + 1);
+				cur = log[k];
+				for (l = 0; l < lpage_size; l++) {
+					index = ffs(pte_list_count(&rmap[l]));
+					if (WARN_ON_ONCE(index >= RMAP_LOG_SIZE))
+						index = RMAP_LOG_SIZE - 1;
+					cur[index]++;
+				}
+			}
+		}
+	}
+
+	write_unlock(&kvm->mmu_lock);
+	mutex_unlock(&kvm->slots_lock);
+
+	/* index=0 counts no rmap; index=1 counts 1 rmap */
+	seq_printf(m, "Rmap_Count:\t0\t1\t");
+	for (i = 2; i < RMAP_LOG_SIZE; i++) {
+		j = 1 << (i - 1);
+		k = (1 << i) - 1;
+		seq_printf(m, "%d-%d\t", j, k);
+	}
+	seq_printf(m, "\n");
+
+	for (i = 0; i < KVM_NR_PAGE_SIZES; i++) {
+		seq_printf(m, "Level=%s:\t", kvm_lpage_str[i]);
+		cur = log[i];
+		for (j = 0; j < RMAP_LOG_SIZE; j++)
+			seq_printf(m, "%d\t", cur[j]);
+		seq_printf(m, "\n");
+	}
+
+	ret = 0;
+out:
+	for (i = 0; i < KVM_NR_PAGE_SIZES; i++)
+		if (log[i])
+			kfree(log[i]);
+
+	return ret;
+}
+
+static int kvm_mmu_rmaps_stat_open(struct inode *inode, struct file *file)
+{
+	struct kvm *kvm = inode->i_private;
+
+	if (!kvm_get_kvm_safe(kvm))
+		return -ENOENT;
+
+	return single_open(file, kvm_mmu_rmaps_stat_show, kvm);
+}
+
+static int kvm_mmu_rmaps_stat_release(struct inode *inode, struct file *file)
+{
+	struct kvm *kvm = inode->i_private;
+
+	kvm_put_kvm(kvm);
+
+	return single_release(inode, file);
+}
+
+static const struct file_operations mmu_rmaps_stat_fops = {
+	.open		= kvm_mmu_rmaps_stat_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= kvm_mmu_rmaps_stat_release,
+};
+
+int kvm_arch_create_vm_debugfs(struct kvm *kvm)
+{
+	debugfs_create_file("mmu_rmaps_stat", 0644, kvm->debugfs_dentry, kvm,
+			    &mmu_rmaps_stat_fops);
+	return 0;
+}
+
 static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
 {
 	vcpu_load(vcpu);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 7/9] KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger
  2021-06-25 15:32 [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Peter Xu
                   ` (5 preceding siblings ...)
  2021-06-25 15:32 ` [PATCH v2 6/9] KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file Peter Xu
@ 2021-06-25 15:34 ` Peter Xu
  2021-07-28 21:01   ` Sean Christopherson
  2021-06-25 15:34 ` [PATCH v2 8/9] KVM: X86: Optimize pte_list_desc with per-array counter Peter Xu
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2021-06-25 15:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: peterx, Maxim Levitsky, Vitaly Kuznetsov, Sean Christopherson,
	Paolo Bonzini

Currently rmap array element only contains 3 entries.  However for EPT=N there
could have a lot of guest pages that got tens of even hundreds of rmap entry.

A normal distribution of a 6G guest (even if idle) shows this with rmap count
statistics:

Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
Level=1G:       32      0       0       0       0       0       0       0       0       0       0

If we do some more fork some pages will grow even larger rmap counts.

This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the general
use case of EPT=N as we do list reference less and the loops over PTE_LIST_EXT
will be slightly more efficient; but still not too large so less waste when
array not full.

It should not affecting EPT=Y since EPT normally only has zero or one rmap
entry for each page, so no array is even allocated.

With a test case to fork 500 child and recycle them ("./rmap_fork 500" [1]),
this patch speeds up fork time of about 22%.

    Before: 367.20 (+-4.58%)
    After:  302.00 (+-5.30%)

[1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b3f738a7c05e..9b093985a2ef 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -137,8 +137,8 @@ module_param(dbg, bool, 0644);
 
 #include <trace/events/kvm.h>
 
-/* make pte_list_desc fit well in cache line */
-#define PTE_LIST_EXT 3
+/* make pte_list_desc fit well in cache lines */
+#define PTE_LIST_EXT 15
 
 struct pte_list_desc {
 	u64 *sptes[PTE_LIST_EXT];
-- 
2.31.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 8/9] KVM: X86: Optimize pte_list_desc with per-array counter
  2021-06-25 15:32 [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Peter Xu
                   ` (6 preceding siblings ...)
  2021-06-25 15:34 ` [PATCH v2 7/9] KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger Peter Xu
@ 2021-06-25 15:34 ` Peter Xu
  2021-07-28 21:04   ` Sean Christopherson
  2021-06-25 15:34 ` [PATCH v2 9/9] KVM: X86: Optimize zapping rmap Peter Xu
  2021-07-26 13:05 ` [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Paolo Bonzini
  9 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2021-06-25 15:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: peterx, Maxim Levitsky, Vitaly Kuznetsov, Sean Christopherson,
	Paolo Bonzini

Add a counter field into pte_list_desc, so as to simplify the add/remove/loop
logic.  E.g., we don't need to loop over the array any more for most reasons.

This will make more sense after we've switched the array size to be larger
otherwise the counter will be a waste.

Initially I wanted to store a tail pointer at the head of the array list so we
don't need to traverse the list at least for pushing new ones (if without the
counter we traverse both the list and the array).  However that'll need
slightly more change without a huge lot benefit, e.g., after we grow entry
numbers per array the list traversing is not so expensive.

So let's be simple but still try to get as much benefit as we can with just
these extra few lines of changes (not to mention the code looks easier too
without looping over arrays).

I used the same a test case to fork 500 child and recycle them ("./rmap_fork
500" [1]), this patch further speeds up the total fork time of about 14%, which
is a total of 38% of vanilla kernel:

        Vanilla:      367.20 (+-4.58%)
        3->15 slots:  302.00 (+-5.30%)
        Add counter:  265.20 (+-9.88%)

[1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9b093985a2ef..ba0258bdebc4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -138,10 +138,15 @@ module_param(dbg, bool, 0644);
 #include <trace/events/kvm.h>
 
 /* make pte_list_desc fit well in cache lines */
-#define PTE_LIST_EXT 15
+#define PTE_LIST_EXT 14
 
 struct pte_list_desc {
 	u64 *sptes[PTE_LIST_EXT];
+	/*
+	 * Stores number of entries stored in the pte_list_desc.  No need to be
+	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
+	 */
+	u64 spte_count;
 	struct pte_list_desc *more;
 };
 
@@ -893,7 +898,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
-	int i, count = 0;
+	int count = 0;
 
 	if (!rmap_head->val) {
 		rmap_printk("%p %llx 0->1\n", spte, *spte);
@@ -903,24 +908,24 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		desc = mmu_alloc_pte_list_desc(vcpu);
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
+		desc->spte_count = 2;
 		rmap_head->val = (unsigned long)desc | 1;
 		++count;
 	} else {
 		rmap_printk("%p %llx many->many\n", spte, *spte);
 		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-		while (desc->sptes[PTE_LIST_EXT-1]) {
+		while (desc->spte_count == PTE_LIST_EXT) {
 			count += PTE_LIST_EXT;
-
 			if (!desc->more) {
 				desc->more = mmu_alloc_pte_list_desc(vcpu);
 				desc = desc->more;
+				desc->spte_count = 0;
 				break;
 			}
 			desc = desc->more;
 		}
-		for (i = 0; desc->sptes[i]; ++i)
-			++count;
-		desc->sptes[i] = spte;
+		count += desc->spte_count;
+		desc->sptes[desc->spte_count++] = spte;
 	}
 	return count;
 }
@@ -930,13 +935,12 @@ pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
 			   struct pte_list_desc *desc, int i,
 			   struct pte_list_desc *prev_desc)
 {
-	int j;
+	int j = desc->spte_count - 1;
 
-	for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
-		;
 	desc->sptes[i] = desc->sptes[j];
 	desc->sptes[j] = NULL;
-	if (j != 0)
+	desc->spte_count--;
+	if (desc->spte_count)
 		return;
 	if (!prev_desc && !desc->more)
 		rmap_head->val = 0;
@@ -969,7 +973,7 @@ static void __pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head)
 		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
 		prev_desc = NULL;
 		while (desc) {
-			for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) {
+			for (i = 0; i < desc->spte_count; ++i) {
 				if (desc->sptes[i] == spte) {
 					pte_list_desc_remove_entry(rmap_head,
 							desc, i, prev_desc);
@@ -993,7 +997,7 @@ static void pte_list_remove(struct kvm_rmap_head *rmap_head, u64 *sptep)
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
-	unsigned int i, count = 0;
+	unsigned int count = 0;
 
 	if (!rmap_head->val)
 		return 0;
@@ -1003,8 +1007,7 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
 
 	while (desc) {
-		for (i = 0; (i < PTE_LIST_EXT) && desc->sptes[i]; i++)
-			count++;
+		count += desc->spte_count;
 		desc = desc->more;
 	}
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 9/9] KVM: X86: Optimize zapping rmap
  2021-06-25 15:32 [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Peter Xu
                   ` (7 preceding siblings ...)
  2021-06-25 15:34 ` [PATCH v2 8/9] KVM: X86: Optimize pte_list_desc with per-array counter Peter Xu
@ 2021-06-25 15:34 ` Peter Xu
  2021-07-28 21:39   ` Sean Christopherson
  2021-07-26 13:05 ` [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Paolo Bonzini
  9 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2021-06-25 15:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: peterx, Maxim Levitsky, Vitaly Kuznetsov, Sean Christopherson,
	Paolo Bonzini

Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be
slow.  The easy way is to travers the rmap list, collecting the a/d bits and
free the slots along the way.

Provide a pte_list_destroy() and do exactly that.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 45 +++++++++++++++++++++++++++++++-----------
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ba0258bdebc4..45aac78dcabc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1014,6 +1014,38 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 	return count;
 }
 
+/* Return true if rmap existed and callback called, false otherwise */
+static bool pte_list_destroy(struct kvm_rmap_head *rmap_head,
+			     int (*callback)(u64 *sptep))
+{
+	struct pte_list_desc *desc, *next;
+	int i;
+
+	if (!rmap_head->val)
+		return false;
+
+	if (!(rmap_head->val & 1)) {
+		if (callback)
+			callback((u64 *)rmap_head->val);
+		goto out;
+	}
+
+	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+
+	while (desc) {
+		if (callback)
+			for (i = 0; i < desc->spte_count; i++)
+				callback(desc->sptes[i]);
+		next = desc->more;
+		mmu_free_pte_list_desc(desc);
+		desc = next;
+	}
+out:
+	/* rmap_head is meaningless now, remember to reset it */
+	rmap_head->val = 0;
+	return true;
+}
+
 static struct kvm_rmap_head *__gfn_to_rmap(gfn_t gfn, int level,
 					   struct kvm_memory_slot *slot)
 {
@@ -1403,18 +1435,7 @@ static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
 static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			  struct kvm_memory_slot *slot)
 {
-	u64 *sptep;
-	struct rmap_iterator iter;
-	bool flush = false;
-
-	while ((sptep = rmap_get_first(rmap_head, &iter))) {
-		rmap_printk("spte %p %llx.\n", sptep, *sptep);
-
-		pte_list_remove(rmap_head, sptep);
-		flush = true;
-	}
-
-	return flush;
+	return pte_list_destroy(rmap_head, mmu_spte_clear_track_bits);
 }
 
 static bool kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-- 
2.31.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic
  2021-06-25 15:32 [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Peter Xu
                   ` (8 preceding siblings ...)
  2021-06-25 15:34 ` [PATCH v2 9/9] KVM: X86: Optimize zapping rmap Peter Xu
@ 2021-07-26 13:05 ` Paolo Bonzini
  9 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2021-07-26 13:05 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, kvm
  Cc: Vitaly Kuznetsov, Sean Christopherson, Maxim Levitsky

On 25/06/21 17:32, Peter Xu wrote:
> v2:
> - Rebased to kvm-queue since I found quite a few conflicts already
> - Add an example into patch commit message of "KVM: X86: Introduce
>    mmu_rmaps_stat per-vm debugfs file"
> - Cleanup more places in patch "KVM: X86: Optimize pte_list_desc with per-array
>    counter" and squashed
> 
> All things started from patch 1, which introduced a new statistic to keep "max
> rmap entry count per vm".  At that time I was just curious about how many rmap
> is there normally for a guest, and it surprised me a bit.
> 
> For TDP mappings it's all fine as mostly rmap of a page is either 0 or 1
> depending on faulted or not.  It turns out with EPT=N there seems to be a huge
> number of pages that can have tens or hundreds of rmap entries even for an idle
> guest.  Then I continued with the rest.
> 
> To understand better on "how much of those pages", I did patch 2-6 which
> introduced the idea of per-arch per-vm debugfs nodes, and added a debug file to
> do statistics for rmap, which is similar to kvm_arch_create_vcpu_debugfs() but
> for vm not vcpu.
> 
> I did notice this should be the clean approach as I also see other archs
> randomly create some per-vm debugfs nodes there:
> 
> ---8<---
> *** arch/arm64/kvm/vgic/vgic-debug.c:
> vgic_debug_init[274]           debugfs_create_file("vgic-state", 0444, kvm->debugfs_dentry, kvm,
> 
> *** arch/powerpc/kvm/book3s_64_mmu_hv.c:
> kvmppc_mmu_debugfs_init[2115]  debugfs_create_file("htab", 0400, kvm->arch.debugfs_dir, kvm,
> 
> *** arch/powerpc/kvm/book3s_64_mmu_radix.c:
> kvmhv_radix_debugfs_init[1434] debugfs_create_file("radix", 0400, kvm->arch.debugfs_dir, kvm,
> 
> *** arch/powerpc/kvm/book3s_hv.c:
> debugfs_vcpu_init[2395]        debugfs_create_file("timings", 0444, vcpu->arch.debugfs_dir, vcpu,
> 
> *** arch/powerpc/kvm/book3s_xics.c:
> xics_debugfs_init[1027]        xics->dentry = debugfs_create_file(name, 0444, powerpc_debugfs_root,
> 
> *** arch/powerpc/kvm/book3s_xive.c:
> xive_debugfs_init[2236]        xive->dentry = debugfs_create_file(name, S_IRUGO, powerpc_debugfs_root,
> 
> *** arch/powerpc/kvm/timing.c:
> kvmppc_create_vcpu_debugfs[214] debugfs_file = debugfs_create_file(dbg_fname, 0666, kvm_debugfs_dir,
> ---8<---
> 
> PPC even has its own per-vm dir for that.  I think if patch 2-6 can be
> considered to be accepted then the next thing to consider is to merge all these
> usages to be under the same existing per-vm dentry with their per-arch hooks
> introduced.
> 
> The last 3 patches (patch 7-9) are a few optimizations of existing rmap logic.
> The major test case I used is rmap_fork [1], however it's not really the ideal
> one to show their effect for sure as that test I wrote covers both
> rmap_add/remove, while I don't have good idea on optimizing rmap_remove without
> changing the array structure or adding much overhead (e.g. sort the array, or
> making a tree-like structure somehow to replace the array list).  However it
> already shows some benefit with those changes, so I post them out.
> 
> Applying patch 7-8 will bring a summary of 38% perf boost when I fork 500
> childs with the test I used.  Didn't run perf test on patch 9.  More in the
> commit log.
> 
> Please review, thanks.
> 
> [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3
> 
> Peter Xu (9):
>    KVM: X86: Add per-vm stat for max rmap list size
>    KVM: Introduce kvm_get_kvm_safe()
>    KVM: Allow to have arch-specific per-vm debugfs files
>    KVM: X86: Introduce pte_list_count() helper
>    KVM: X86: Introduce kvm_mmu_slot_lpages() helpers
>    KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file
>    KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger
>    KVM: X86: Optimize pte_list_desc with per-array counter
>    KVM: X86: Optimize zapping rmap
> 
>   arch/x86/include/asm/kvm_host.h |   1 +
>   arch/x86/kvm/mmu/mmu.c          |  97 +++++++++++++++++------
>   arch/x86/kvm/mmu/mmu_internal.h |   1 +
>   arch/x86/kvm/x86.c              | 131 +++++++++++++++++++++++++++++++-
>   include/linux/kvm_host.h        |   2 +
>   virt/kvm/kvm_main.c             |  37 +++++++--
>   6 files changed, 235 insertions(+), 34 deletions(-)
> 

Looks good, thanks.  I queued it, but for now I have left out the 
statistics part; I would like to check the histogram patches too first.

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/9] KVM: Introduce kvm_get_kvm_safe()
  2021-06-25 15:32 ` [PATCH v2 2/9] KVM: Introduce kvm_get_kvm_safe() Peter Xu
@ 2021-07-26 13:42   ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2021-07-26 13:42 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, kvm
  Cc: Vitaly Kuznetsov, Sean Christopherson, Maxim Levitsky

On 25/06/21 17:32, Peter Xu wrote:
> -	/* The debugfs files are a reference to the kvm struct which
> -	 * is still valid when kvm_destroy_vm is called.
> -	 * To avoid the race between open and the removal of the debugfs
> -	 * directory we test against the users count.
> -	 */
> -	if (!refcount_inc_not_zero(&stat_data->kvm

Better keep the comment here (but nothing to do on your part).

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/9] KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger
  2021-06-25 15:34 ` [PATCH v2 7/9] KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger Peter Xu
@ 2021-07-28 21:01   ` Sean Christopherson
  0 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2021-07-28 21:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Maxim Levitsky, Vitaly Kuznetsov, Paolo Bonzini

On Fri, Jun 25, 2021, Peter Xu wrote:
> Currently rmap array element only contains 3 entries.  However for EPT=N there
> could have a lot of guest pages that got tens of even hundreds of rmap entry.
> 
> A normal distribution of a 6G guest (even if idle) shows this with rmap count
> statistics:
> 
> Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
> Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
> Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
> Level=1G:       32      0       0       0       0       0       0       0       0       0       0
> 
> If we do some more fork some pages will grow even larger rmap counts.
> 
> This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the general
> use case of EPT=N as we do list reference less and the loops over PTE_LIST_EXT
> will be slightly more efficient; but still not too large so less waste when
> array not full.
> 
> It should not affecting EPT=Y since EPT normally only has zero or one rmap
> entry for each page, so no array is even allocated.
> 
> With a test case to fork 500 child and recycle them ("./rmap_fork 500" [1]),
> this patch speeds up fork time of about 22%.
> 
>     Before: 367.20 (+-4.58%)
>     After:  302.00 (+-5.30%)
> 
> [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b3f738a7c05e..9b093985a2ef 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -137,8 +137,8 @@ module_param(dbg, bool, 0644);
>  
>  #include <trace/events/kvm.h>
>  
> -/* make pte_list_desc fit well in cache line */
> -#define PTE_LIST_EXT 3
> +/* make pte_list_desc fit well in cache lines */
> +#define PTE_LIST_EXT 15

Ha, I was going to say that this should be '14' to fit pte_list_desc within two
cache lines, but looks like Paolo fixed it up on commit.

Also, if the whole cache line thing actually matters, sptes[] and spte_count
should be swapped since spte_count is always read, whereas spte_count[7:14] will
be read iff there are 8+ SPTEs.

>  struct pte_list_desc {
>  	u64 *sptes[PTE_LIST_EXT];
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 8/9] KVM: X86: Optimize pte_list_desc with per-array counter
  2021-06-25 15:34 ` [PATCH v2 8/9] KVM: X86: Optimize pte_list_desc with per-array counter Peter Xu
@ 2021-07-28 21:04   ` Sean Christopherson
  2021-07-28 21:51     ` Peter Xu
  2021-07-30 15:45     ` Peter Xu
  0 siblings, 2 replies; 22+ messages in thread
From: Sean Christopherson @ 2021-07-28 21:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Maxim Levitsky, Vitaly Kuznetsov, Paolo Bonzini

On Fri, Jun 25, 2021, Peter Xu wrote:
> Add a counter field into pte_list_desc, so as to simplify the add/remove/loop
> logic.  E.g., we don't need to loop over the array any more for most reasons.
> 
> This will make more sense after we've switched the array size to be larger
> otherwise the counter will be a waste.
> 
> Initially I wanted to store a tail pointer at the head of the array list so we
> don't need to traverse the list at least for pushing new ones (if without the
> counter we traverse both the list and the array).  However that'll need
> slightly more change without a huge lot benefit, e.g., after we grow entry
> numbers per array the list traversing is not so expensive.
> 
> So let's be simple but still try to get as much benefit as we can with just
> these extra few lines of changes (not to mention the code looks easier too
> without looping over arrays).
> 
> I used the same a test case to fork 500 child and recycle them ("./rmap_fork
> 500" [1]), this patch further speeds up the total fork time of about 14%, which
> is a total of 38% of vanilla kernel:
> 
>         Vanilla:      367.20 (+-4.58%)
>         3->15 slots:  302.00 (+-5.30%)
>         Add counter:  265.20 (+-9.88%)
> 
> [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 33 ++++++++++++++++++---------------
>  1 file changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9b093985a2ef..ba0258bdebc4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -138,10 +138,15 @@ module_param(dbg, bool, 0644);
>  #include <trace/events/kvm.h>
>  
>  /* make pte_list_desc fit well in cache lines */
> -#define PTE_LIST_EXT 15
> +#define PTE_LIST_EXT 14

Doh, I looked at kvm/queue code before looking at the full series.

>  struct pte_list_desc {
>  	u64 *sptes[PTE_LIST_EXT];
> +	/*
> +	 * Stores number of entries stored in the pte_list_desc.  No need to be
> +	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
> +	 */
> +	u64 spte_count;

Per my feedback to the previous patch, this should be above sptes[] so that rmaps
with <8 SPTEs only touch one cache line.  No idea if it actually matters in
practice, but I can't see how it would harm anything.

>  	struct pte_list_desc *more;
>  };

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 9/9] KVM: X86: Optimize zapping rmap
  2021-06-25 15:34 ` [PATCH v2 9/9] KVM: X86: Optimize zapping rmap Peter Xu
@ 2021-07-28 21:39   ` Sean Christopherson
  2021-07-28 22:01     ` Peter Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2021-07-28 21:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Maxim Levitsky, Vitaly Kuznetsov, Paolo Bonzini

On Fri, Jun 25, 2021, Peter Xu wrote:
> Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be
> slow.  The easy way is to travers the rmap list, collecting the a/d bits and
> free the slots along the way.
> 
> Provide a pte_list_destroy() and do exactly that.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 45 +++++++++++++++++++++++++++++++-----------
>  1 file changed, 33 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ba0258bdebc4..45aac78dcabc 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1014,6 +1014,38 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
>  	return count;
>  }
>  
> +/* Return true if rmap existed and callback called, false otherwise */
> +static bool pte_list_destroy(struct kvm_rmap_head *rmap_head,
> +			     int (*callback)(u64 *sptep))
> +{
> +	struct pte_list_desc *desc, *next;
> +	int i;
> +
> +	if (!rmap_head->val)
> +		return false;
> +
> +	if (!(rmap_head->val & 1)) {
> +		if (callback)
> +			callback((u64 *)rmap_head->val);
> +		goto out;
> +	}
> +
> +	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
> +
> +	while (desc) {
> +		if (callback)
> +			for (i = 0; i < desc->spte_count; i++)
> +				callback(desc->sptes[i]);
> +		next = desc->more;
> +		mmu_free_pte_list_desc(desc);
> +		desc = next;

Alternatively, 

	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
	for ( ; desc; desc = next) {
		for (i = 0; i < desc->spte_count; i++)
			mmu_spte_clear_track_bits((u64 *)rmap_head->val);
		next = desc->more;
		mmu_free_pte_list_desc(desc);
	}

> +	}
> +out:
> +	/* rmap_head is meaningless now, remember to reset it */
> +	rmap_head->val = 0;
> +	return true;

Why implement this as a generic method with a callback?  gcc is suprisingly
astute in optimizing callback(), but I don't see the point of adding a complex
helper that has a single caller, and is extremely unlikely to gain new callers.
Or is there another "zap everything" case I'm missing?

E.g. why not this?

static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
			  const struct kvm_memory_slot *slot)
{
	struct pte_list_desc *desc, *next;
	int i;

	if (!rmap_head->val)
		return false;

	if (!(rmap_head->val & 1)) {
		mmu_spte_clear_track_bits((u64 *)rmap_head->val);
		goto out;
	}

	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
	for ( ; desc; desc = next) {
		for (i = 0; i < desc->spte_count; i++)
			mmu_spte_clear_track_bits(desc->sptes[i]);
		next = desc->more;
		mmu_free_pte_list_desc(desc);
	}
out:
	/* rmap_head is meaningless now, remember to reset it */
	rmap_head->val = 0;
	return true;
}

> +}
> +
>  static struct kvm_rmap_head *__gfn_to_rmap(gfn_t gfn, int level,
>  					   struct kvm_memory_slot *slot)
>  {
> @@ -1403,18 +1435,7 @@ static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
>  static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>  			  struct kvm_memory_slot *slot)
>  {
> -	u64 *sptep;
> -	struct rmap_iterator iter;
> -	bool flush = false;
> -
> -	while ((sptep = rmap_get_first(rmap_head, &iter))) {
> -		rmap_printk("spte %p %llx.\n", sptep, *sptep);
> -
> -		pte_list_remove(rmap_head, sptep);
> -		flush = true;
> -	}
> -
> -	return flush;
> +	return pte_list_destroy(rmap_head, mmu_spte_clear_track_bits);
>  }
>  
>  static bool kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 8/9] KVM: X86: Optimize pte_list_desc with per-array counter
  2021-07-28 21:04   ` Sean Christopherson
@ 2021-07-28 21:51     ` Peter Xu
  2021-07-29  9:33       ` Paolo Bonzini
  2021-07-30 15:45     ` Peter Xu
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Xu @ 2021-07-28 21:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Maxim Levitsky, Vitaly Kuznetsov, Paolo Bonzini

On Wed, Jul 28, 2021 at 09:04:30PM +0000, Sean Christopherson wrote:
> >  struct pte_list_desc {
> >  	u64 *sptes[PTE_LIST_EXT];
> > +	/*
> > +	 * Stores number of entries stored in the pte_list_desc.  No need to be
> > +	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
> > +	 */
> > +	u64 spte_count;
> 
> Per my feedback to the previous patch, this should be above sptes[] so that rmaps
> with <8 SPTEs only touch one cache line.  No idea if it actually matters in
> practice, but I can't see how it would harm anything.

Reasonable.  Not sure whether this would change the numbers a bit in the commit
message; it can be slightly better but also possible to be non-observable.
Paolo, let me know if you want me to repost/retest with the change (along with
keeping the comment in the other patch).

Thanks for looking!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 9/9] KVM: X86: Optimize zapping rmap
  2021-07-28 21:39   ` Sean Christopherson
@ 2021-07-28 22:01     ` Peter Xu
  2021-07-28 22:31       ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2021-07-28 22:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Maxim Levitsky, Vitaly Kuznetsov, Paolo Bonzini

On Wed, Jul 28, 2021 at 09:39:02PM +0000, Sean Christopherson wrote:
> On Fri, Jun 25, 2021, Peter Xu wrote:
> > Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be
> > slow.  The easy way is to travers the rmap list, collecting the a/d bits and
> > free the slots along the way.
> > 
> > Provide a pte_list_destroy() and do exactly that.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 45 +++++++++++++++++++++++++++++++-----------
> >  1 file changed, 33 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index ba0258bdebc4..45aac78dcabc 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1014,6 +1014,38 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
> >  	return count;
> >  }
> >  
> > +/* Return true if rmap existed and callback called, false otherwise */
> > +static bool pte_list_destroy(struct kvm_rmap_head *rmap_head,
> > +			     int (*callback)(u64 *sptep))
> > +{
> > +	struct pte_list_desc *desc, *next;
> > +	int i;
> > +
> > +	if (!rmap_head->val)
> > +		return false;
> > +
> > +	if (!(rmap_head->val & 1)) {
> > +		if (callback)
> > +			callback((u64 *)rmap_head->val);
> > +		goto out;
> > +	}
> > +
> > +	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
> > +
> > +	while (desc) {
> > +		if (callback)
> > +			for (i = 0; i < desc->spte_count; i++)
> > +				callback(desc->sptes[i]);
> > +		next = desc->more;
> > +		mmu_free_pte_list_desc(desc);
> > +		desc = next;
> 
> Alternatively, 
> 
> 	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
> 	for ( ; desc; desc = next) {
> 		for (i = 0; i < desc->spte_count; i++)
> 			mmu_spte_clear_track_bits((u64 *)rmap_head->val);
> 		next = desc->more;
> 		mmu_free_pte_list_desc(desc);
> 	}
> 
> > +	}
> > +out:
> > +	/* rmap_head is meaningless now, remember to reset it */
> > +	rmap_head->val = 0;
> > +	return true;
> 
> Why implement this as a generic method with a callback?  gcc is suprisingly
> astute in optimizing callback(), but I don't see the point of adding a complex
> helper that has a single caller, and is extremely unlikely to gain new callers.
> Or is there another "zap everything" case I'm missing?

No other case; it's just that pte_list_*() helpers will be more self-contained.
If that'll be a performance concern, no objection to hard code it.

> 
> E.g. why not this?
> 
> static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
> 			  const struct kvm_memory_slot *slot)
> {
> 	struct pte_list_desc *desc, *next;
> 	int i;
> 
> 	if (!rmap_head->val)
> 		return false;
> 
> 	if (!(rmap_head->val & 1)) {
> 		mmu_spte_clear_track_bits((u64 *)rmap_head->val);
> 		goto out;
> 	}
> 
> 	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
> 	for ( ; desc; desc = next) {
> 		for (i = 0; i < desc->spte_count; i++)
> 			mmu_spte_clear_track_bits(desc->sptes[i]);
> 		next = desc->more;
> 		mmu_free_pte_list_desc(desc);
> 	}
> out:
> 	/* rmap_head is meaningless now, remember to reset it */
> 	rmap_head->val = 0;
> 	return true;
> }

Looks good, but so far I've no strong opinion on this.  I'll leave it for Paolo
to decide.

Thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 9/9] KVM: X86: Optimize zapping rmap
  2021-07-28 22:01     ` Peter Xu
@ 2021-07-28 22:31       ` Sean Christopherson
  2021-07-29  9:35         ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2021-07-28 22:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Maxim Levitsky, Vitaly Kuznetsov, Paolo Bonzini

On Wed, Jul 28, 2021, Peter Xu wrote:
> On Wed, Jul 28, 2021 at 09:39:02PM +0000, Sean Christopherson wrote:
> > On Fri, Jun 25, 2021, Peter Xu wrote:
> > Why implement this as a generic method with a callback?  gcc is suprisingly
> > astute in optimizing callback(), but I don't see the point of adding a complex
> > helper that has a single caller, and is extremely unlikely to gain new callers.
> > Or is there another "zap everything" case I'm missing?
> 
> No other case; it's just that pte_list_*() helpers will be more self-contained.

Eh, but this flow is as much about rmaps as it is about pte_list.

> If that'll be a performance concern, no objection to hard code it.

It's more about unnecessary complexity than it is about performance, e.g. gcc-10
generates identical code for both version (which did surprise the heck out of me).

If we really want to isolate pte_list_destroy(), I would vote for something like
this (squashed in).   pte_list_remove() already calls mmu_spte_clear_track_bits(),
so that particular separation of concerns has already gone out the window.

 
-/* Return true if rmap existed and callback called, false otherwise */
-static bool pte_list_destroy(struct kvm_rmap_head *rmap_head,
-                            void (*callback)(u64 *sptep))
+static bool pte_list_destroy(struct kvm_rmap_head *rmap_head)
 {
        struct pte_list_desc *desc, *next;
        int i;
@@ -1013,20 +1011,16 @@ static bool pte_list_destroy(struct kvm_rmap_head *rmap_head,
                return false;
 
        if (!(rmap_head->val & 1)) {
-               if (callback)
-                       callback((u64 *)rmap_head->val);
+               mmu_spte_clear_track_bits((u64 *)rmap_head->val);
                goto out;
        }
 
        desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-
-       while (desc) {
-               if (callback)
-                       for (i = 0; i < desc->spte_count; i++)
-                               callback(desc->sptes[i]);
+       for ( ; desc; desc = next) {
+               for (i = 0; i < desc->spte_count; i++)
+                       mmu_spte_clear_track_bits(desc->sptes[i]);
                next = desc->more;
                mmu_free_pte_list_desc(desc);
-               desc = next;
        }
 out:
        /* rmap_head is meaningless now, remember to reset it */
@@ -1422,22 +1416,17 @@ static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
        return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn, PG_LEVEL_4K);
 }
 
-static void mmu_spte_clear_track_bits_cb(u64 *sptep)
-{
-       mmu_spte_clear_track_bits(sptep);
-}
-
 static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
                          const struct kvm_memory_slot *slot)
 {
-       return pte_list_destroy(rmap_head, mmu_spte_clear_track_bits_cb);
+       return pte_list_destroy(rmap_head);
 }
 
 static bool kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
                            struct kvm_memory_slot *slot, gfn_t gfn, int level,
                            pte_t unused)
 {
-       return kvm_zap_rmapp(kvm, rmap_head, slot);
+       return pte_list_destroy(rmap_head);
 }
 
 static bool kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 8/9] KVM: X86: Optimize pte_list_desc with per-array counter
  2021-07-28 21:51     ` Peter Xu
@ 2021-07-29  9:33       ` Paolo Bonzini
  2021-07-29 15:53         ` Peter Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2021-07-29  9:33 UTC (permalink / raw)
  To: Peter Xu, Sean Christopherson
  Cc: linux-kernel, kvm, Maxim Levitsky, Vitaly Kuznetsov

On 28/07/21 23:51, Peter Xu wrote:
> Reasonable.  Not sure whether this would change the numbers a bit in the commit
> message; it can be slightly better but also possible to be non-observable.
> Paolo, let me know if you want me to repost/retest with the change (along with
> keeping the comment in the other patch).

Yes, feel free please start from the patches in kvm/queue (there were 
some conflicts, so it will save you the rebasing work) and send v3 
according to Sean's callbacks.

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 9/9] KVM: X86: Optimize zapping rmap
  2021-07-28 22:31       ` Sean Christopherson
@ 2021-07-29  9:35         ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2021-07-29  9:35 UTC (permalink / raw)
  To: Sean Christopherson, Peter Xu
  Cc: linux-kernel, kvm, Maxim Levitsky, Vitaly Kuznetsov

On 29/07/21 00:31, Sean Christopherson wrote:
>> If that'll be a performance concern, no objection to hard code it.
> It's more about unnecessary complexity than it is about performance, e.g. gcc-10
> generates identical code for both version (which did surprise the heck out of me).

If you think of what's needed to produce decent (as fast as C) code out 
of STL code, that's not surprising. :)  Pretty cool that it lets people 
write nicer C code too, though.

> If we really want to isolate pte_list_destroy(), I would vote for something like
> this (squashed in).   pte_list_remove() already calls mmu_spte_clear_track_bits(),
> so that particular separation of concerns has already gone out the window.

Yes, that's fair enough.  Thanks for the review!

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 8/9] KVM: X86: Optimize pte_list_desc with per-array counter
  2021-07-29  9:33       ` Paolo Bonzini
@ 2021-07-29 15:53         ` Peter Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2021-07-29 15:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, linux-kernel, kvm, Maxim Levitsky, Vitaly Kuznetsov

On Thu, Jul 29, 2021 at 11:33:32AM +0200, Paolo Bonzini wrote:
> On 28/07/21 23:51, Peter Xu wrote:
> > Reasonable.  Not sure whether this would change the numbers a bit in the commit
> > message; it can be slightly better but also possible to be non-observable.
> > Paolo, let me know if you want me to repost/retest with the change (along with
> > keeping the comment in the other patch).
> 
> Yes, feel free please start from the patches in kvm/queue (there were some
> conflicts, so it will save you the rebasing work) and send v3 according to
> Sean's callbacks.

Will do.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 8/9] KVM: X86: Optimize pte_list_desc with per-array counter
  2021-07-28 21:04   ` Sean Christopherson
  2021-07-28 21:51     ` Peter Xu
@ 2021-07-30 15:45     ` Peter Xu
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Xu @ 2021-07-30 15:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Maxim Levitsky, Vitaly Kuznetsov, Paolo Bonzini

On Wed, Jul 28, 2021 at 09:04:30PM +0000, Sean Christopherson wrote:
> >  struct pte_list_desc {
> >  	u64 *sptes[PTE_LIST_EXT];
> > +	/*
> > +	 * Stores number of entries stored in the pte_list_desc.  No need to be
> > +	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
> > +	 */
> > +	u64 spte_count;
> 
> Per my feedback to the previous patch, this should be above sptes[] so that rmaps
> with <8 SPTEs only touch one cache line.  No idea if it actually matters in
> practice, but I can't see how it would harm anything.

Since at it, I'll further move "more" to be at the entry too, so I think it
optimizes full entries case too.

/*
 * Slight optimization of cacheline layout, by putting `more' and `spte_count'
 * at the start; then accessing it will only use one single cacheline for
 * either full (entries==PTE_LIST_EXT) case or entries<=6.
 */
struct pte_list_desc {
	struct pte_list_desc *more;
	/*
	 * Stores number of entries stored in the pte_list_desc.  No need to be
	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
	 */
	u64 spte_count;
	u64 *sptes[PTE_LIST_EXT];
};

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-07-30 15:46 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-25 15:32 [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Peter Xu
2021-06-25 15:32 ` [PATCH v2 1/9] KVM: X86: Add per-vm stat for max rmap list size Peter Xu
2021-06-25 15:32 ` [PATCH v2 2/9] KVM: Introduce kvm_get_kvm_safe() Peter Xu
2021-07-26 13:42   ` Paolo Bonzini
2021-06-25 15:32 ` [PATCH v2 3/9] KVM: Allow to have arch-specific per-vm debugfs files Peter Xu
2021-06-25 15:32 ` [PATCH v2 4/9] KVM: X86: Introduce pte_list_count() helper Peter Xu
2021-06-25 15:32 ` [PATCH v2 5/9] KVM: X86: Introduce kvm_mmu_slot_lpages() helpers Peter Xu
2021-06-25 15:32 ` [PATCH v2 6/9] KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file Peter Xu
2021-06-25 15:34 ` [PATCH v2 7/9] KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger Peter Xu
2021-07-28 21:01   ` Sean Christopherson
2021-06-25 15:34 ` [PATCH v2 8/9] KVM: X86: Optimize pte_list_desc with per-array counter Peter Xu
2021-07-28 21:04   ` Sean Christopherson
2021-07-28 21:51     ` Peter Xu
2021-07-29  9:33       ` Paolo Bonzini
2021-07-29 15:53         ` Peter Xu
2021-07-30 15:45     ` Peter Xu
2021-06-25 15:34 ` [PATCH v2 9/9] KVM: X86: Optimize zapping rmap Peter Xu
2021-07-28 21:39   ` Sean Christopherson
2021-07-28 22:01     ` Peter Xu
2021-07-28 22:31       ` Sean Christopherson
2021-07-29  9:35         ` Paolo Bonzini
2021-07-26 13:05 ` [PATCH v2 0/9] KVM: X86: Some light optimizations on rmap logic Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).