* [Patch v2 0/2] NUMA aware page table allocation
@ 2022-12-01 19:57 Vipin Sharma
  2022-12-01 19:57 ` [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node Vipin Sharma
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Vipin Sharma @ 2022-12-01 19:57 UTC (permalink / raw)
  To: dmatlack, bgardon, seanjc, pbonzini; +Cc: kvm, linux-kernel, Vipin Sharma

Hi,

This series improves page table accesses by allocating page table
pages on the same NUMA node where the underlying physical page
resides.

Currently, page table pages are allocated during page faults and page
splits. In both cases the location of the page table page depends on
the current thread's mempolicy. This can create suboptimal NUMA
placement of page tables, for example, when the thread doing eager
page splitting runs on a different NUMA node than the page it is
splitting.

Reviewers, please provide suggestions on the following:

1. The module parameter is true by default, which means this feature
   is enabled by default. Is this okay, or should I set it to false?

2. I haven't reduced KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE, on the
   assumption that the impact is limited because only online nodes are
   filled during the topup phase and, in many cases, some of those
   nodes will never be refilled again (rough worst-case numbers are
   sketched after this list). Please let me know if you want this
   reduced.

3. I have tried to keep everything in x86/mmu except for some changes
   in virt/kvm/kvm_main.c. I used a __weak function so that only
   x86/mmu sees the change and other architectures are unaffected. I
   hope this is the right approach.

4. I am not sure what the right way is to split patch 2. If you think
   it is too big for a single patch, please let me know how you would
   prefer it to be split.
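
Rough worst-case cache numbers for question 2, assuming x86's default
KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE of 40 objects and 4 KiB pages, and
given that a topup refills a cache up to capacity once it drops below
the requested minimum:

  per-vCPU shadow page caches = nr_online_nodes * 40 * 4 KiB
  e.g. on an 8 node host      = 8 * 40 * 4 KiB = 1280 KiB
  single cache today          = 40 * 4 KiB     = 160 KiB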

Thanks
Vipin

v2:
- All page table pages are now allocated on the underlying physical
  page's NUMA node.
- Introduced a module parameter, numa_aware_pagetable, to allow
  disabling this feature.
- Use kvm_pfn_to_refcounted_page() to get the page for a pfn.

v1: https://lore.kernel.org/all/20220801151928.270380-1-vipinsh@google.com/

Vipin Sharma (2):
  KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log
    enable on the underlying page's numa node
  KVM: x86/mmu: Allocate page table pages on NUMA node of underlying
    pages

 arch/x86/include/asm/kvm_host.h |   4 +-
 arch/x86/kvm/mmu/mmu.c          | 126 ++++++++++++++++++++++++--------
 arch/x86/kvm/mmu/paging_tmpl.h  |   4 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  26 ++++---
 include/linux/kvm_host.h        |  17 +++++
 include/linux/kvm_types.h       |   2 +
 virt/kvm/kvm_main.c             |   7 +-
 7 files changed, 141 insertions(+), 45 deletions(-)


base-commit: df0bb47baa95aad133820b149851d5b94cbc6790
-- 
2.39.0.rc0.267.gcb52ba06e7-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node
  2022-12-01 19:57 [Patch v2 0/2] NUMA aware page table allocation Vipin Sharma
@ 2022-12-01 19:57 ` Vipin Sharma
  2022-12-05 18:24   ` Ben Gardon
  2022-12-01 19:57 ` [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages Vipin Sharma
  2022-12-09  0:21 ` [Patch v2 0/2] NUMA aware page table allocation David Matlack
  2 siblings, 1 reply; 17+ messages in thread
From: Vipin Sharma @ 2022-12-01 19:57 UTC (permalink / raw)
  To: dmatlack, bgardon, seanjc, pbonzini; +Cc: kvm, linux-kernel, Vipin Sharma

Huge pages are split when dirty logging is enabled. New page table
pages are allocated based on the current thread's NUMA node or
mempolicy. This causes inefficient page table accesses if the
underlying page is on a different NUMA node.

Allocate page table pages on the same NUMA node as the underlying huge
page when dirty logging is enabled and huge pages are split.

During the pre-copy phase of live migration of a 416 vCPU, 11 TiB
memory VM on an 8 node host, the performance gain was in the range of
130% to 150%.

Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
 arch/x86/kvm/mmu.h         |  1 +
 arch/x86/kvm/mmu/mmu.c     | 19 +++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++----
 include/linux/kvm_host.h   | 15 +++++++++++++++
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 6bdaacb6faa0..c960fb096e5c 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -119,6 +119,7 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu);
 void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
+void *kvm_mmu_get_free_page(int nid, gfp_t gfp);
 
 static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4736d7849c60..0554dfc55553 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -90,6 +90,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
 static bool __read_mostly force_flush_and_sync_on_reuse;
 module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
 
+static bool __read_mostly numa_aware_pagetable = true;
+module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644);
+
 /*
  * When setting this variable to true it enables Two-Dimensional-Paging
  * where the hardware walks 2 page tables:
@@ -6984,3 +6987,19 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 	if (kvm->arch.nx_huge_page_recovery_thread)
 		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
+
+void *kvm_mmu_get_free_page(int nid, gfp_t gfp)
+{
+	struct page *spt_page;
+	void *address = NULL;
+
+	if (numa_aware_pagetable) {
+		spt_page = alloc_pages_node(nid, gfp, 0);
+		if (spt_page)
+			address = page_address(spt_page);
+	} else {
+		address = (void *)__get_free_page(gfp);
+	}
+
+	return address;
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 771210ce5181..1607afbfcc0b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1413,7 +1413,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }
 
-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
 {
 	struct kvm_mmu_page *sp;
 
@@ -1423,7 +1423,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
 	if (!sp)
 		return NULL;
 
-	sp->spt = (void *)__get_free_page(gfp);
+	sp->spt = kvm_mmu_get_free_page(nid, gfp);
+
 	if (!sp->spt) {
 		kmem_cache_free(mmu_page_header_cache, sp);
 		return NULL;
@@ -1437,6 +1438,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 						       bool shared)
 {
 	struct kvm_mmu_page *sp;
+	int nid;
+
+	nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(iter->old_spte));
 
 	/*
 	 * Since we are allocating while under the MMU lock we have to be
@@ -1447,7 +1451,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	 * If this allocation fails we drop the lock and retry with reclaim
 	 * allowed.
 	 */
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+	sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
 	if (sp)
 		return sp;
 
@@ -1459,7 +1463,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 		write_unlock(&kvm->mmu_lock);
 
 	iter->yielded = true;
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+	sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
 
 	if (shared)
 		read_lock(&kvm->mmu_lock);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8f874a964313..558ded73f660 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1596,6 +1596,21 @@ void kvm_arch_sync_events(struct kvm *kvm);
 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 
 struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn);
+
+/*
+ * Returns the nid of a 'struct page' if pfn is valid and backed by a refcounted
+ * page, NUMA_NO_NODE otherwise.
+ */
+static inline int kvm_pfn_to_refcounted_page_nid(kvm_pfn_t pfn)
+{
+	struct page *page = kvm_pfn_to_refcounted_page(pfn);
+
+	if (page)
+		return page_to_nid(page);
+	else
+		return NUMA_NO_NODE;
+}
+
 bool kvm_is_zone_device_page(struct page *page);
 
 struct kvm_irq_ack_notifier {
-- 
2.39.0.rc0.267.gcb52ba06e7-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages
  2022-12-01 19:57 [Patch v2 0/2] NUMA aware page table allocation Vipin Sharma
  2022-12-01 19:57 ` [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node Vipin Sharma
@ 2022-12-01 19:57 ` Vipin Sharma
  2022-12-05 18:17   ` Ben Gardon
  2022-12-09  0:27   ` David Matlack
  2022-12-09  0:21 ` [Patch v2 0/2] NUMA aware page table allocation David Matlack
  2 siblings, 2 replies; 17+ messages in thread
From: Vipin Sharma @ 2022-12-01 19:57 UTC (permalink / raw)
  To: dmatlack, bgardon, seanjc, pbonzini; +Cc: kvm, linux-kernel, Vipin Sharma

Page table pages of a VM are currently allocated based on the current
task's NUMA node or its mempolicy. This can cause suboptimal remote
accesses by a vCPU if it is accessing physical pages local to its NUMA
node while the page table pages mapping those physical pages were
created by some other vCPU that was on a different NUMA node or had a
different mempolicy.

Allocate page table pages on the same NUMA node where the underlying
physical page exists. Page tables at levels 5, 4, and 3 might not end
up on the same NUMA node as the pages they map, since they can span
multiple NUMA nodes.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
 arch/x86/include/asm/kvm_host.h |   4 +-
 arch/x86/kvm/mmu.h              |   1 -
 arch/x86/kvm/mmu/mmu.c          | 109 ++++++++++++++++++++++----------
 arch/x86/kvm/mmu/paging_tmpl.h  |   4 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  16 +++--
 include/linux/kvm_host.h        |   2 +
 include/linux/kvm_types.h       |   2 +
 virt/kvm/kvm_main.c             |   7 +-
 8 files changed, 101 insertions(+), 44 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 283cbb83d6ae..8a0293326abc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -782,7 +782,7 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu *walk_mmu;
 
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
-	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
+	struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
 	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
@@ -1415,7 +1415,7 @@ struct kvm_arch {
 	 *
 	 * Protected by kvm->slots_lock.
 	 */
-	struct kvm_mmu_memory_cache split_shadow_page_cache;
+	struct kvm_mmu_memory_cache split_shadow_page_cache[MAX_NUMNODES];
 	struct kvm_mmu_memory_cache split_page_header_cache;
 
 	/*
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index c960fb096e5c..6bdaacb6faa0 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -119,7 +119,6 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu);
 void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
-void *kvm_mmu_get_free_page(int nid, gfp_t gfp);
 
 static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0554dfc55553..ff7b17af8ab8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -648,31 +648,43 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 
 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
-	int r;
+	int r, i;
 
 	/* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
 	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
 				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
 	if (r)
 		return r;
-	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
-				       PT64_ROOT_MAX_LEVEL);
-	if (r)
-		return r;
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		if (node_online(i)) {
+			r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache[i],
+						       PT64_ROOT_MAX_LEVEL);
+			if (r)
+				return r;
+		}
+	}
+
 	if (maybe_indirect) {
 		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
 					       PT64_ROOT_MAX_LEVEL);
 		if (r)
 			return r;
 	}
+
 	return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
 					  PT64_ROOT_MAX_LEVEL);
 }
 
 static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
+	int i;
+
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+
+	for (i = 0; i < MAX_NUMNODES; i++)
+		kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache[i]);
+
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
@@ -2203,13 +2215,17 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
 
 static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 						    gfn_t gfn,
-						    union kvm_mmu_page_role role)
+						    union kvm_mmu_page_role role,
+						    int nid)
 {
-	struct shadow_page_caches caches = {
-		.page_header_cache = &vcpu->arch.mmu_page_header_cache,
-		.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
-		.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
-	};
+	struct shadow_page_caches caches;
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
+
+	caches.page_header_cache = &vcpu->arch.mmu_page_header_cache;
+	caches.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache[nid];
+	caches.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache;
 
 	return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
 }
@@ -2262,15 +2278,19 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
 
 static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
 						 u64 *sptep, gfn_t gfn,
-						 bool direct, unsigned int access)
+						 bool direct, unsigned int access,
+						 kvm_pfn_t pfn)
 {
 	union kvm_mmu_page_role role;
+	int nid;
 
 	if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
 		return ERR_PTR(-EEXIST);
 
 	role = kvm_mmu_child_role(sptep, direct, access);
-	return kvm_mmu_get_shadow_page(vcpu, gfn, role);
+	nid = kvm_pfn_to_refcounted_page_nid(pfn);
+
+	return kvm_mmu_get_shadow_page(vcpu, gfn, role, nid);
 }
 
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
@@ -3153,7 +3173,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (it.level == fault->goal_level)
 			break;
 
-		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true,
+					  ACC_ALL, fault->pfn);
 		if (sp == ERR_PTR(-EEXIST))
 			continue;
 
@@ -3579,7 +3600,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
 	WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
 	WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
 
-	sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
+	sp = kvm_mmu_get_shadow_page(vcpu, gfn, role, NUMA_NO_NODE);
 	++sp->root_count;
 
 	return __pa(sp->spt);
@@ -5853,15 +5874,20 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 
 int kvm_mmu_create(struct kvm_vcpu *vcpu)
 {
-	int ret;
+	int ret, i;
 
 	vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
 	vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
+	vcpu->arch.mmu_pte_list_desc_cache.node = NUMA_NO_NODE;
 
 	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
 	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
+	vcpu->arch.mmu_page_header_cache.node = NUMA_NO_NODE;
 
-	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		vcpu->arch.mmu_shadow_page_cache[i].gfp_zero = __GFP_ZERO;
+		vcpu->arch.mmu_shadow_page_cache[i].node = i;
+	}
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
 	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
@@ -6012,7 +6038,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 int kvm_mmu_init_vm(struct kvm *kvm)
 {
 	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
-	int r;
+	int r, i;
 
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
@@ -6029,20 +6055,29 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 
 	kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
 	kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
+	kvm->arch.split_page_header_cache.node = NUMA_NO_NODE;
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		kvm->arch.split_shadow_page_cache[i].gfp_zero = __GFP_ZERO;
+		kvm->arch.split_shadow_page_cache[i].node = i;
+	}
 
-	kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
 
 	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
 	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+	kvm->arch.split_desc_cache.node = NUMA_NO_NODE;
 
 	return 0;
 }
 
 static void mmu_free_vm_memory_caches(struct kvm *kvm)
 {
+	int i;
+
 	kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
 	kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
-	kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+	for (i = 0; i < MAX_NUMNODES; i++)
+		kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache[i]);
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
@@ -6150,7 +6185,7 @@ static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
 	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
 }
 
-static bool need_topup_split_caches_or_resched(struct kvm *kvm)
+static bool need_topup_split_caches_or_resched(struct kvm *kvm, int nid)
 {
 	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
 		return true;
@@ -6162,10 +6197,10 @@ static bool need_topup_split_caches_or_resched(struct kvm *kvm)
 	 */
 	return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
 	       need_topup(&kvm->arch.split_page_header_cache, 1) ||
-	       need_topup(&kvm->arch.split_shadow_page_cache, 1);
+	       need_topup(&kvm->arch.split_shadow_page_cache[nid], 1);
 }
 
-static int topup_split_caches(struct kvm *kvm)
+static int topup_split_caches(struct kvm *kvm, int nid)
 {
 	/*
 	 * Allocating rmap list entries when splitting huge pages for nested
@@ -6195,16 +6230,19 @@ static int topup_split_caches(struct kvm *kvm)
 	if (r)
 		return r;
 
-	return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
+	return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache[nid], 1);
 }
 
-static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
+static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm,
+							u64 *huge_sptep,
+							u64 huge_spte)
 {
 	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
 	struct shadow_page_caches caches = {};
 	union kvm_mmu_page_role role;
 	unsigned int access;
 	gfn_t gfn;
+	int nid;
 
 	gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
 	access = kvm_mmu_page_get_access(huge_sp, spte_index(huge_sptep));
@@ -6217,9 +6255,13 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
 	 */
 	role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
 
+	nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(huge_spte));
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
+
 	/* Direct SPs do not require a shadowed_info_cache. */
 	caches.page_header_cache = &kvm->arch.split_page_header_cache;
-	caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+	caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache[nid];
 
 	/* Safe to pass NULL for vCPU since requesting a direct SP. */
 	return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
@@ -6238,7 +6280,7 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm,
 	gfn_t gfn;
 	int index;
 
-	sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
+	sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep, huge_spte);
 
 	for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
 		sptep = &sp->spt[index];
@@ -6276,7 +6318,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
 					  u64 *huge_sptep)
 {
 	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
-	int level, r = 0;
+	int level, r = 0, nid;
 	gfn_t gfn;
 	u64 spte;
 
@@ -6284,13 +6326,16 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
 	gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
 	level = huge_sp->role.level;
 	spte = *huge_sptep;
+	nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(spte));
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
 
 	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
 		r = -ENOSPC;
 		goto out;
 	}
 
-	if (need_topup_split_caches_or_resched(kvm)) {
+	if (need_topup_split_caches_or_resched(kvm, nid)) {
 		write_unlock(&kvm->mmu_lock);
 		cond_resched();
 		/*
@@ -6298,7 +6343,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
 		 * rmap iterator should be restarted because the MMU lock was
 		 * dropped.
 		 */
-		r = topup_split_caches(kvm) ?: -EAGAIN;
+		r = topup_split_caches(kvm, nid) ?: -EAGAIN;
 		write_lock(&kvm->mmu_lock);
 		goto out;
 	}
@@ -6988,7 +7033,7 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
 
-void *kvm_mmu_get_free_page(int nid, gfp_t gfp)
+void *kvm_arch_mmu_get_free_page(int nid, gfp_t gfp)
 {
 	struct page *spt_page;
 	void *address = NULL;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0f6455072055..1b1039a1b178 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -652,7 +652,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		table_gfn = gw->table_gfn[it.level - 2];
 		access = gw->pt_access[it.level - 2];
 		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
-					  false, access);
+					  false, access, fault->pfn);
 
 		if (sp != ERR_PTR(-EEXIST)) {
 			/*
@@ -708,7 +708,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		validate_direct_spte(vcpu, it.sptep, direct_access);
 
 		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
-					  true, direct_access);
+					  true, direct_access, fault->pfn);
 		if (sp == ERR_PTR(-EEXIST))
 			continue;
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1607afbfcc0b..be0763e6b058 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -270,12 +270,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		    kvm_mmu_page_as_id(_root) != _as_id) {		\
 		} else
 
-static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, int nid)
 {
 	struct kvm_mmu_page *sp;
 
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
+
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid]);
 
 	return sp;
 }
@@ -327,7 +330,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 			goto out;
 	}
 
-	root = tdp_mmu_alloc_sp(vcpu);
+	root = tdp_mmu_alloc_sp(vcpu, NUMA_NO_NODE);
 	tdp_mmu_init_sp(root, NULL, 0, role);
 
 	refcount_set(&root->tdp_mmu_root_count, 1);
@@ -1159,7 +1162,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
-	int ret = RET_PF_RETRY;
+	int ret = RET_PF_RETRY, nid;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
 
@@ -1188,11 +1191,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		    !is_large_pte(iter.old_spte))
 			continue;
 
+		nid = kvm_pfn_to_refcounted_page_nid(fault->pfn);
 		/*
 		 * The SPTE is either non-present or points to a huge page that
 		 * needs to be split.
 		 */
-		sp = tdp_mmu_alloc_sp(vcpu);
+		sp = tdp_mmu_alloc_sp(vcpu, nid);
 		tdp_mmu_init_child_sp(sp, &iter);
 
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
@@ -1423,7 +1427,7 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
 	if (!sp)
 		return NULL;
 
-	sp->spt = kvm_mmu_get_free_page(nid, gfp);
+	sp->spt = kvm_arch_mmu_get_free_page(nid, gfp);
 
 	if (!sp->spt) {
 		kmem_cache_free(mmu_page_header_cache, sp);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 558ded73f660..07674955460b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1374,6 +1374,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
 
 void kvm_flush_remote_tlbs(struct kvm *kvm);
 
+void *kvm_arch_mmu_get_free_page(int nid, gfp_t gfp);
+
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
 int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
 int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 3ca3db020e0e..cb627cf1b4e1 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -96,6 +96,8 @@ struct kvm_mmu_memory_cache {
 	struct kmem_cache *kmem_cache;
 	int capacity;
 	void **objects;
+	/* Node on which memory should be allocated by default */
+	int node;
 };
 #endif
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1782c4555d94..4d59c9d48277 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -384,6 +384,11 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
 	kvm_arch_guest_memory_reclaimed(kvm);
 }
 
+void * __weak kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
+{
+	return (void *)__get_free_page(gfp_flags);
+}
+
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
 static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 					       gfp_t gfp_flags)
@@ -393,7 +398,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 	if (mc->kmem_cache)
 		return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
 	else
-		return (void *)__get_free_page(gfp_flags);
+		return kvm_arch_mmu_get_free_page(mc->node, gfp_flags);
 }
 
 int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
-- 
2.39.0.rc0.267.gcb52ba06e7-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages
  2022-12-01 19:57 ` [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages Vipin Sharma
@ 2022-12-05 18:17   ` Ben Gardon
  2022-12-05 23:40     ` Vipin Sharma
  2022-12-09  0:27   ` David Matlack
  1 sibling, 1 reply; 17+ messages in thread
From: Ben Gardon @ 2022-12-05 18:17 UTC (permalink / raw)
  To: Vipin Sharma; +Cc: dmatlack, seanjc, pbonzini, kvm, linux-kernel

On Thu, Dec 1, 2022 at 11:57 AM Vipin Sharma <vipinsh@google.com> wrote:
>
> Page table pages of a VM are currently allocated based on the current
> task's NUMA node or its mempolicy. This can cause suboptimal remote
> accesses by the vCPU if it is accessing physical pages local to its NUMA
> node but the page table pages mapping those physcal pages were created
> by some other vCPU which was on different NUMA node or had different
> policy.
>
> Allocate page table pages on the same NUMA node where underlying
> physical page exists. Page table at level 5, 4, and 3 might not end up
> on the same NUMA node as they can span multiple NUMA nodes.
>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   4 +-
>  arch/x86/kvm/mmu.h              |   1 -
>  arch/x86/kvm/mmu/mmu.c          | 109 ++++++++++++++++++++++----------
>  arch/x86/kvm/mmu/paging_tmpl.h  |   4 +-
>  arch/x86/kvm/mmu/tdp_mmu.c      |  16 +++--
>  include/linux/kvm_host.h        |   2 +
>  include/linux/kvm_types.h       |   2 +
>  virt/kvm/kvm_main.c             |   7 +-
>  8 files changed, 101 insertions(+), 44 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 283cbb83d6ae..8a0293326abc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -782,7 +782,7 @@ struct kvm_vcpu_arch {
>         struct kvm_mmu *walk_mmu;
>
>         struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> +       struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
>         struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
>         struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> @@ -1415,7 +1415,7 @@ struct kvm_arch {
>          *
>          * Protected by kvm->slots_lock.
>          */
> -       struct kvm_mmu_memory_cache split_shadow_page_cache;
> +       struct kvm_mmu_memory_cache split_shadow_page_cache[MAX_NUMNODES];
>         struct kvm_mmu_memory_cache split_page_header_cache;
>
>         /*
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index c960fb096e5c..6bdaacb6faa0 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -119,7 +119,6 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu);
>  void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu);
>  void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
>  void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
> -void *kvm_mmu_get_free_page(int nid, gfp_t gfp);
>
>  static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0554dfc55553..ff7b17af8ab8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -648,31 +648,43 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>
>  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  {
> -       int r;
> +       int r, i;

Nit: nid or node or node_id would be more descriptive than just i.

>
>         /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
>         r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
>                                        1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
>         if (r)
>                 return r;
> -       r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> -                                      PT64_ROOT_MAX_LEVEL);
> -       if (r)
> -               return r;
> +
> +       for (i = 0; i < MAX_NUMNODES; i++) {
> +               if (node_online(i)) {
> +                       r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache[i],
> +                                                      PT64_ROOT_MAX_LEVEL);
> +                       if (r)
> +                               return r;
> +               }
> +       }
> +

Nice, I was worried about wasting cache memory we were never going to
use, but this works great to at least ensure we're not allocating
cached memory for nodes that don't exist.



>         if (maybe_indirect) {
>                 r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
>                                                PT64_ROOT_MAX_LEVEL);
>                 if (r)
>                         return r;
>         }
> +
>         return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
>                                           PT64_ROOT_MAX_LEVEL);
>  }
>
>  static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>  {
> +       int i;
> +
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> +
> +       for (i = 0; i < MAX_NUMNODES; i++)
> +               kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache[i]);
> +
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
>  }
> @@ -2203,13 +2215,17 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
>
>  static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
>                                                     gfn_t gfn,
> -                                                   union kvm_mmu_page_role role)
> +                                                   union kvm_mmu_page_role role,
> +                                                   int nid)
>  {
> -       struct shadow_page_caches caches = {
> -               .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> -               .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> -               .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> -       };
> +       struct shadow_page_caches caches;
> +
> +       if (nid == NUMA_NO_NODE)
> +               nid = numa_mem_id();
> +
> +       caches.page_header_cache = &vcpu->arch.mmu_page_header_cache;
> +       caches.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache[nid];
> +       caches.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache;
>
>         return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
>  }
> @@ -2262,15 +2278,19 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
>
>  static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
>                                                  u64 *sptep, gfn_t gfn,
> -                                                bool direct, unsigned int access)
> +                                                bool direct, unsigned int access,
> +                                                kvm_pfn_t pfn)
>  {
>         union kvm_mmu_page_role role;
> +       int nid;
>
>         if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
>                 return ERR_PTR(-EEXIST);
>
>         role = kvm_mmu_child_role(sptep, direct, access);
> -       return kvm_mmu_get_shadow_page(vcpu, gfn, role);
> +       nid = kvm_pfn_to_refcounted_page_nid(pfn);
> +
> +       return kvm_mmu_get_shadow_page(vcpu, gfn, role, nid);
>  }
>
>  static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
> @@ -3153,7 +3173,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                 if (it.level == fault->goal_level)
>                         break;
>
> -               sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
> +               sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true,
> +                                         ACC_ALL, fault->pfn);
>                 if (sp == ERR_PTR(-EEXIST))
>                         continue;
>
> @@ -3579,7 +3600,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
>         WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
>         WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
>
> -       sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
> +       sp = kvm_mmu_get_shadow_page(vcpu, gfn, role, NUMA_NO_NODE);
>         ++sp->root_count;
>
>         return __pa(sp->spt);
> @@ -5853,15 +5874,20 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>
>  int kvm_mmu_create(struct kvm_vcpu *vcpu)
>  {
> -       int ret;
> +       int ret, i;
>
>         vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
>         vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> +       vcpu->arch.mmu_pte_list_desc_cache.node = NUMA_NO_NODE;
>
>         vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
>         vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> +       vcpu->arch.mmu_page_header_cache.node = NUMA_NO_NODE;
>
> -       vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> +       for (i = 0; i < MAX_NUMNODES; i++) {
> +               vcpu->arch.mmu_shadow_page_cache[i].gfp_zero = __GFP_ZERO;
> +               vcpu->arch.mmu_shadow_page_cache[i].node = i;
> +       }
>
>         vcpu->arch.mmu = &vcpu->arch.root_mmu;
>         vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> @@ -6012,7 +6038,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>  int kvm_mmu_init_vm(struct kvm *kvm)
>  {
>         struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
> -       int r;
> +       int r, i;
>
>         INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
>         INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
> @@ -6029,20 +6055,29 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>
>         kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
>         kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
> +       kvm->arch.split_page_header_cache.node = NUMA_NO_NODE;
> +
> +       for (i = 0; i < MAX_NUMNODES; i++) {
> +               kvm->arch.split_shadow_page_cache[i].gfp_zero = __GFP_ZERO;
> +               kvm->arch.split_shadow_page_cache[i].node = i;
> +       }
>
> -       kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
>         kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
>         kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> +       kvm->arch.split_desc_cache.node = NUMA_NO_NODE;
>
>         return 0;
>  }
>
>  static void mmu_free_vm_memory_caches(struct kvm *kvm)
>  {
> +       int i;
> +
>         kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
>         kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> -       kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
> +       for (i = 0; i < MAX_NUMNODES; i++)
> +               kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache[i]);
>  }
>
>  void kvm_mmu_uninit_vm(struct kvm *kvm)
> @@ -6150,7 +6185,7 @@ static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
>         return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
>  }
>
> -static bool need_topup_split_caches_or_resched(struct kvm *kvm)
> +static bool need_topup_split_caches_or_resched(struct kvm *kvm, int nid)
>  {
>         if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
>                 return true;
> @@ -6162,10 +6197,10 @@ static bool need_topup_split_caches_or_resched(struct kvm *kvm)
>          */
>         return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
>                need_topup(&kvm->arch.split_page_header_cache, 1) ||
> -              need_topup(&kvm->arch.split_shadow_page_cache, 1);
> +              need_topup(&kvm->arch.split_shadow_page_cache[nid], 1);
>  }
>
> -static int topup_split_caches(struct kvm *kvm)
> +static int topup_split_caches(struct kvm *kvm, int nid)
>  {
>         /*
>          * Allocating rmap list entries when splitting huge pages for nested
> @@ -6195,16 +6230,19 @@ static int topup_split_caches(struct kvm *kvm)
>         if (r)
>                 return r;
>
> -       return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
> +       return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache[nid], 1);
>  }
>
> -static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
> +static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm,
> +                                                       u64 *huge_sptep,
> +                                                       u64 huge_spte)
>  {
>         struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
>         struct shadow_page_caches caches = {};
>         union kvm_mmu_page_role role;
>         unsigned int access;
>         gfn_t gfn;
> +       int nid;
>
>         gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
>         access = kvm_mmu_page_get_access(huge_sp, spte_index(huge_sptep));
> @@ -6217,9 +6255,13 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
>          */
>         role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
>
> +       nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(huge_spte));
> +       if (nid == NUMA_NO_NODE)
> +               nid = numa_mem_id();
> +
>         /* Direct SPs do not require a shadowed_info_cache. */
>         caches.page_header_cache = &kvm->arch.split_page_header_cache;
> -       caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> +       caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache[nid];
>
>         /* Safe to pass NULL for vCPU since requesting a direct SP. */
>         return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> @@ -6238,7 +6280,7 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm,
>         gfn_t gfn;
>         int index;
>
> -       sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
> +       sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep, huge_spte);
>
>         for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
>                 sptep = &sp->spt[index];
> @@ -6276,7 +6318,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
>                                           u64 *huge_sptep)
>  {
>         struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> -       int level, r = 0;
> +       int level, r = 0, nid;
>         gfn_t gfn;
>         u64 spte;
>
> @@ -6284,13 +6326,16 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
>         gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
>         level = huge_sp->role.level;
>         spte = *huge_sptep;
> +       nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(spte));
> +       if (nid == NUMA_NO_NODE)
> +               nid = numa_mem_id();
>
>         if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
>                 r = -ENOSPC;
>                 goto out;
>         }
>
> -       if (need_topup_split_caches_or_resched(kvm)) {
> +       if (need_topup_split_caches_or_resched(kvm, nid)) {
>                 write_unlock(&kvm->mmu_lock);
>                 cond_resched();
>                 /*
> @@ -6298,7 +6343,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
>                  * rmap iterator should be restarted because the MMU lock was
>                  * dropped.
>                  */
> -               r = topup_split_caches(kvm) ?: -EAGAIN;
> +               r = topup_split_caches(kvm, nid) ?: -EAGAIN;
>                 write_lock(&kvm->mmu_lock);
>                 goto out;
>         }
> @@ -6988,7 +7033,7 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>                 kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
>  }
>
> -void *kvm_mmu_get_free_page(int nid, gfp_t gfp)
> +void *kvm_arch_mmu_get_free_page(int nid, gfp_t gfp)
>  {
>         struct page *spt_page;
>         void *address = NULL;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 0f6455072055..1b1039a1b178 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -652,7 +652,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>                 table_gfn = gw->table_gfn[it.level - 2];
>                 access = gw->pt_access[it.level - 2];
>                 sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
> -                                         false, access);
> +                                         false, access, fault->pfn);
>
>                 if (sp != ERR_PTR(-EEXIST)) {
>                         /*
> @@ -708,7 +708,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>                 validate_direct_spte(vcpu, it.sptep, direct_access);
>
>                 sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
> -                                         true, direct_access);
> +                                         true, direct_access, fault->pfn);
>                 if (sp == ERR_PTR(-EEXIST))
>                         continue;
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 1607afbfcc0b..be0763e6b058 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -270,12 +270,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>                     kvm_mmu_page_as_id(_root) != _as_id) {              \
>                 } else
>
> -static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> +static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, int nid)
>  {
>         struct kvm_mmu_page *sp;
>
> +       if (nid == NUMA_NO_NODE)
> +               nid = numa_mem_id();
> +
>         sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> -       sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> +       sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid]);
>
>         return sp;
>  }
> @@ -327,7 +330,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>                         goto out;
>         }
>
> -       root = tdp_mmu_alloc_sp(vcpu);
> +       root = tdp_mmu_alloc_sp(vcpu, NUMA_NO_NODE);
>         tdp_mmu_init_sp(root, NULL, 0, role);
>
>         refcount_set(&root->tdp_mmu_root_count, 1);
> @@ -1159,7 +1162,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>         struct kvm *kvm = vcpu->kvm;
>         struct tdp_iter iter;
>         struct kvm_mmu_page *sp;
> -       int ret = RET_PF_RETRY;
> +       int ret = RET_PF_RETRY, nid;
>
>         kvm_mmu_hugepage_adjust(vcpu, fault);
>
> @@ -1188,11 +1191,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                     !is_large_pte(iter.old_spte))
>                         continue;
>
> +               nid = kvm_pfn_to_refcounted_page_nid(fault->pfn);
>                 /*
>                  * The SPTE is either non-present or points to a huge page that
>                  * needs to be split.
>                  */
> -               sp = tdp_mmu_alloc_sp(vcpu);
> +               sp = tdp_mmu_alloc_sp(vcpu, nid);
>                 tdp_mmu_init_child_sp(sp, &iter);
>
>                 sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> @@ -1423,7 +1427,7 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
>         if (!sp)
>                 return NULL;
>
> -       sp->spt = kvm_mmu_get_free_page(nid, gfp);
> +       sp->spt = kvm_arch_mmu_get_free_page(nid, gfp);
>
>         if (!sp->spt) {
>                 kmem_cache_free(mmu_page_header_cache, sp);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 558ded73f660..07674955460b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1374,6 +1374,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
>
>  void kvm_flush_remote_tlbs(struct kvm *kvm);
>
> +void *kvm_arch_mmu_get_free_page(int nid, gfp_t gfp);
> +
>  #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
>  int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
>  int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 3ca3db020e0e..cb627cf1b4e1 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -96,6 +96,8 @@ struct kvm_mmu_memory_cache {
>         struct kmem_cache *kmem_cache;
>         int capacity;
>         void **objects;
> +       /* Node on which memory should be allocated by default */
> +       int node;
>  };
>  #endif
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..4d59c9d48277 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -384,6 +384,11 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
>         kvm_arch_guest_memory_reclaimed(kvm);
>  }
>
> +void * __weak kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
> +{
> +               return (void *)__get_free_page(gfp_flags);
> +}
> +

Rather than making this __weak, you could use #ifdef CONFIG_NUMA to
just put all the code in the arch-neutral function.
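
E.g., a rough, completely untested sketch of that alternative, reusing
the mc->node field introduced by this patch (illustration only, not a
drop-in replacement):

static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
					       gfp_t gfp_flags)
{
	gfp_flags |= mc->gfp_zero;

	if (mc->kmem_cache)
		return kmem_cache_alloc(mc->kmem_cache, gfp_flags);

#ifdef CONFIG_NUMA
	if (mc->node != NUMA_NO_NODE) {
		/* Prefer mc->node; the allocator can still fall back. */
		struct page *page = alloc_pages_node(mc->node, gfp_flags, 0);

		return page ? page_address(page) : NULL;
	}
#endif
	return (void *)__get_free_page(gfp_flags);
}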

>  #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
>  static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
>                                                gfp_t gfp_flags)
> @@ -393,7 +398,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
>         if (mc->kmem_cache)
>                 return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
>         else
> -               return (void *)__get_free_page(gfp_flags);
> +               return kvm_arch_mmu_get_free_page(mc->node, gfp_flags);
>  }
>
>  int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> --
> 2.39.0.rc0.267.gcb52ba06e7-goog
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node
  2022-12-01 19:57 ` [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node Vipin Sharma
@ 2022-12-05 18:24   ` Ben Gardon
  2022-12-05 18:47     ` Sean Christopherson
  0 siblings, 1 reply; 17+ messages in thread
From: Ben Gardon @ 2022-12-05 18:24 UTC (permalink / raw)
  To: Vipin Sharma; +Cc: dmatlack, seanjc, pbonzini, kvm, linux-kernel

On Thu, Dec 1, 2022 at 11:57 AM Vipin Sharma <vipinsh@google.com> wrote:
>
> Huge pages are split when dirty log is enabled. New page table pages are
> allocated based on the current thread NUMA node or mempolicy. This
> causes inefficient page table accesses if underlying page is on a
> different NUMA node
>
> Allocate page table pages on the same NUMA node as the underlying huge
> page when dirty log is enabled and huge pages are split.
>
> The performance gain during the pre-copy phase of live migrations of a
> 416 vCPUs and 11 TiB memory VM  on a 8 node host was seen in the range
> of 130% to 150%.
>
> Suggested-by: David Matlack <dmatlack@google.com>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
>  arch/x86/kvm/mmu.h         |  1 +
>  arch/x86/kvm/mmu/mmu.c     | 19 +++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++----
>  include/linux/kvm_host.h   | 15 +++++++++++++++
>  4 files changed, 43 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 6bdaacb6faa0..c960fb096e5c 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -119,6 +119,7 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu);
>  void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu);
>  void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
>  void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
> +void *kvm_mmu_get_free_page(int nid, gfp_t gfp);
>
>  static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4736d7849c60..0554dfc55553 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -90,6 +90,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
>  static bool __read_mostly force_flush_and_sync_on_reuse;
>  module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
>
> +static bool __read_mostly numa_aware_pagetable = true;
> +module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644);
> +

I'm usually all for having module params to control things, but in
this case I don't think it provides much value because whether this
NUMA optimization is useful or not is going to depend more on VM size
and workload than anything else. If we wanted to make this
configurable, a VM capability would probably be a better mechanism so
that userspace could leave it off when running small,
non-performance-sensitive VMs and turn it on when running large,
multi-node VMs. A whole-host module parameter seems overly
restrictive.
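
E.g., the userspace opt-in could look something like the below, where
KVM_CAP_NUMA_AWARE_PAGE_TABLES is a made-up name and number, purely
for illustration:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder value; no such capability exists today. */
#define KVM_CAP_NUMA_AWARE_PAGE_TABLES 0xffff

static int enable_numa_aware_page_tables(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_NUMA_AWARE_PAGE_TABLES,
	};

	/* Returns 0 on success, -1 with errno set on failure. */
	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}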

>  /*
>   * When setting this variable to true it enables Two-Dimensional-Paging
>   * where the hardware walks 2 page tables:
> @@ -6984,3 +6987,19 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>         if (kvm->arch.nx_huge_page_recovery_thread)
>                 kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
>  }
> +
> +void *kvm_mmu_get_free_page(int nid, gfp_t gfp)
> +{
> +       struct page *spt_page;
> +       void *address = NULL;
> +
> +       if (numa_aware_pagetable) {
> +               spt_page = alloc_pages_node(nid, gfp, 0);
> +               if (spt_page)
> +                       address = page_address(spt_page);
> +       } else {
> +               address = (void *)__get_free_page(gfp);
> +       }
> +
> +       return address;
> +}
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 771210ce5181..1607afbfcc0b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1413,7 +1413,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
>         return spte_set;
>  }
>
> -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
>  {
>         struct kvm_mmu_page *sp;
>
> @@ -1423,7 +1423,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
>         if (!sp)
>                 return NULL;
>
> -       sp->spt = (void *)__get_free_page(gfp);
> +       sp->spt = kvm_mmu_get_free_page(nid, gfp);
> +
>         if (!sp->spt) {
>                 kmem_cache_free(mmu_page_header_cache, sp);
>                 return NULL;
> @@ -1437,6 +1438,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>                                                        bool shared)
>  {
>         struct kvm_mmu_page *sp;
> +       int nid;
> +
> +       nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(iter->old_spte));
>
>         /*
>          * Since we are allocating while under the MMU lock we have to be
> @@ -1447,7 +1451,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>          * If this allocation fails we drop the lock and retry with reclaim
>          * allowed.
>          */
> -       sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> +       sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
>         if (sp)
>                 return sp;
>
> @@ -1459,7 +1463,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>                 write_unlock(&kvm->mmu_lock);
>
>         iter->yielded = true;
> -       sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
> +       sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
>
>         if (shared)
>                 read_lock(&kvm->mmu_lock);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8f874a964313..558ded73f660 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1596,6 +1596,21 @@ void kvm_arch_sync_events(struct kvm *kvm);
>  int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
>
>  struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn);
> +
> +/*
> + * Returns the nid of a 'struct page' if pfn is valid and backed by a refcounted
> + * page, NUMA_NO_NODE otherwise.
> + */
> +static inline int kvm_pfn_to_refcounted_page_nid(kvm_pfn_t pfn)
> +{
> +       struct page *page = kvm_pfn_to_refcounted_page(pfn);
> +
> +       if (page)
> +               return page_to_nid(page);
> +       else
> +               return NUMA_NO_NODE;
> +}
> +
>  bool kvm_is_zone_device_page(struct page *page);
>
>  struct kvm_irq_ack_notifier {
> --
> 2.39.0.rc0.267.gcb52ba06e7-goog
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node
  2022-12-05 18:24   ` Ben Gardon
@ 2022-12-05 18:47     ` Sean Christopherson
  2022-12-05 18:51       ` Ben Gardon
  0 siblings, 1 reply; 17+ messages in thread
From: Sean Christopherson @ 2022-12-05 18:47 UTC (permalink / raw)
  To: Ben Gardon; +Cc: Vipin Sharma, dmatlack, pbonzini, kvm, linux-kernel

Side topic, the shortlog is way, way too long.  The purpose of the shortlog is
to provide a synopsis of the change, not to describe the change in detail.

I also think this patch should be 2/2, with the more generic support added along
with the module param (or capability) in 1/2.  E.g. to yield something like

  KVM: x86/mmu: Add a module param to make per-vCPU SPTs NUMA aware
  KVM: x86/mmu: Honor NUMA awareness for per-VM page table allocations

On Mon, Dec 05, 2022, Ben Gardon wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 4736d7849c60..0554dfc55553 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -90,6 +90,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
> >  static bool __read_mostly force_flush_and_sync_on_reuse;
> >  module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
> >
> > +static bool __read_mostly numa_aware_pagetable = true;
> > +module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644);
> > +
> 
> I'm usually all for having module params to control things, but in
> this case I don't think it provides much value because whether this
> NUMA optimization is useful or not is going to depend more on VM size
> and workload than anything else. If we wanted to make this
> configurable, a VM capability would probably be a better mechanism so
> that userspace could leave it off when running small,
> non-performance-sensitive VMs

Would we actually want to turn it off in this case?  IIUC, @nid is just the
preferred node, i.e. failure to allocate for the preferred @nid will result in
falling back to other nodes, not outright failure.  So the pathological worst
case scenario would be that for a system with VMs that don't care about performance,
all of a node's memory is allocated due to all VMs starting on that node.
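
To be concrete, the allocation semantics assumed here look roughly like the
sketch below (the helper name is made up for illustration; the fallback
behavior is simply what alloc_pages_node() does when __GFP_THISNODE is not
set):

static void *numa_preferred_alloc(int nid, gfp_t gfp)
{
	struct page *page;

	if (nid == NUMA_NO_NODE)
		return (void *)__get_free_page(gfp);

	/*
	 * Without __GFP_THISNODE, @nid only picks the zonelist to start
	 * from; if that node is exhausted the allocator falls back to the
	 * other nodes instead of failing outright.
	 */
	page = alloc_pages_node(nid, gfp, 0);
	return page ? page_address(page) : NULL;
}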

On the flip side, if a system had a mix of VM shapes, I think we'd want even the
performance insensitive VMs to be NUMA aware so that they can be sequestered on
their own node(s), i.e. don't "steal" memory from the VMs that are performance
sensitive and have been affined to a single node.

> and turn it on when running large, multi-node VMs. A whole-host module
> parameter seems overly restrictive.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node
  2022-12-05 18:47     ` Sean Christopherson
@ 2022-12-05 18:51       ` Ben Gardon
  2022-12-05 21:07         ` Sean Christopherson
  0 siblings, 1 reply; 17+ messages in thread
From: Ben Gardon @ 2022-12-05 18:51 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Vipin Sharma, dmatlack, pbonzini, kvm, linux-kernel

On Mon, Dec 5, 2022 at 10:47 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Side topic, the shortlog is way, way too long.  The purpose of the shortlog is
> to provide a synopsis of the change, not to describe the change in detail.
>
> I also think this patch should be 2/2, with the more generic support added along
> with the module param (or capability) in 1/2.  E.g. to yield something like
>
>   KVM: x86/mmu: Add a module param to make per-vCPU SPTs NUMA aware
>   KVM: x86/mmu: Honor NUMA awareness for per-VM page table allocations
>
> On Mon, Dec 05, 2022, Ben Gardon wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 4736d7849c60..0554dfc55553 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -90,6 +90,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
> > >  static bool __read_mostly force_flush_and_sync_on_reuse;
> > >  module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
> > >
> > > +static bool __read_mostly numa_aware_pagetable = true;
> > > +module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644);
> > > +
> >
> > I'm usually all for having module params to control things, but in
> > this case I don't think it provides much value because whether this
> > NUMA optimization is useful or not is going to depend more on VM size
> > and workload than anything else. If we wanted to make this
> > configurable, a VM capability would probably be a better mechanism so
> > that userspace could leave it off when running small,
> > non-performance-sensitive VMs
>
> Would we actually want to turn it off in this case?  IIUC, @nid is just the
> preferred node, i.e. failure to allocate for the preferred @nid will result in
> falling back to other nodes, not outright failure.  So the pathological worst
> case scenario would be that for a system with VMs that don't care about performance,
> all of a nodes memory is allocated due to all VMs starting on that node.
>
> On the flip side, if a system had a mix of VM shapes, I think we'd want even the
> performance insensitive VMs to be NUMA aware so that they can be sequestered on
> their own node(s), i.e. don't "steal" memory from the VMs that are performance
> sensitive and have been affined to a single node.

Yeah, the only reason to turn it off would be to save memory. As a
strawman, if you had 100 1-vCPU VMs on a 2 node system, you'd have
4000 pages allocated in caches, doing nothing.

>
> > and turn it on when running large, multi-node VMs. A whole-host module
> > parameter seems overly restrictive.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node
  2022-12-05 18:51       ` Ben Gardon
@ 2022-12-05 21:07         ` Sean Christopherson
  2022-12-08 22:44           ` Vipin Sharma
  0 siblings, 1 reply; 17+ messages in thread
From: Sean Christopherson @ 2022-12-05 21:07 UTC (permalink / raw)
  To: Ben Gardon; +Cc: Vipin Sharma, dmatlack, pbonzini, kvm, linux-kernel

On Mon, Dec 05, 2022, Ben Gardon wrote:
> On Mon, Dec 5, 2022 at 10:47 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > +static bool __read_mostly numa_aware_pagetable = true;
> > > > +module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644);
> > > > +
> > >
> > > I'm usually all for having module params to control things, but in
> > > this case I don't think it provides much value because whether this
> > > NUMA optimization is useful or not is going to depend more on VM size
> > > and workload than anything else. If we wanted to make this
> > > configurable, a VM capability would probably be a better mechanism so
> > > that userspace could leave it off when running small,
> > > non-performance-sensitive VMs
> >
> > Would we actually want to turn it off in this case?  IIUC, @nid is just the
> > preferred node, i.e. failure to allocate for the preferred @nid will result in
> > falling back to other nodes, not outright failure.  So the pathological worst
> > case scenario would be that for a system with VMs that don't care about performance,
> > all of a nodes memory is allocated due to all VMs starting on that node.
> >
> > On the flip side, if a system had a mix of VM shapes, I think we'd want even the
> > performance insensitive VMs to be NUMA aware so that they can be sequestered on
> > their own node(s), i.e. don't "steal" memory from the VMs that are performance
> > sensitive and have been affined to a single node.
> 
> Yeah, the only reason to turn it off would be to save memory. As a
> strawman, if you had 100 1-vCPU VMs on a 2 node system, you'd have
> 4000 pages allocated in caches, doing nothing.

The management of the caches really should be addressed separately, and this is
the perfect excuse to finally get some traction on mmu_shrink_scan().

From a while back[1]:

 : I know we proposed dropping mmu_shrink_scan(), but the more I think about that,
 : the more I think that an outright drop is wrong.  The real issue is that KVM as
 : quite literally the dumbest possible "algorithm" for zapping possibly-in-use
 : shadow pages, and doesn't target the zapping to fit the cgroup that's under
 : pressure.
 : 
 : So for this, IMO rather than assume that freeing the caches when any memslot
 : disables dirty logging, I think it makes sense to initially keep the caches and
 : only free them at VM destruction.  Then in follow-up patches, if we want, free
 : the caches in the mmu_shrink_scan(), and/or add a function hook for toggling
 : eager_page_split to topup/free the caches accordingly.  That gives userspace
 : explicit control over when the caches are purged, and does the logical thing of
 : freeing the caches when eager_page_split is disabled.

The current mmu_shrink_scan() implementation is basically useless.  It's so naive
that relying on it to react to memory pressure all but guarantees that guest
performance will tank if the shrinker kicks in.  E.g. KVM zaps the oldest pages,
which are likely upper level SPTEs and thus most likely to be reused.  And prior
to the TDP MMU, a guest running nested VMs would more than likely zap L1's SPTEs
even if the majority of shadow pages were allocated for L2.

And as pointed out in the RFC to drop shrinker support entirely[1], for well over
a decade KVM zapped a single page in each call from the shrinker, i.e. it was
_really_ useless.  And although in [1] I said adding memcg would be easy, doing
so in a performant way would be quite difficult as the per-cpu counter would need
to be made memcg aware (didn't think about that before).

Lastly, given that I completely broke the shrink logic for ~6 months and someone
only noticed when n_max_mmu_pages kicked in (commit 8fc517267fb2 "KVM: x86: Zap
the oldest MMU pages, not the newest"), I highly doubt anyone is relying on the
current shrinker logic, i.e. KVM_SET_NR_MMU_PAGES is used for setups that actually
need to limit the number of MMU pages beyond KVM's default.

My vote is to do something like the below: repurpose KVM's shrinker integration
to only purge the shadow page caches.  That can be done with almost no impact on
guest performance, e.g. with a dedicated spinlock, reclaiming from the caches
wouldn't even need to kick the vCPU.  Supporting reclaim of the caches would allow
us to change the cache capacity and/or behavior without having to worry too much
about exploding memory, and would make KVM's shrinker support actually usable.

[1] https://lore.kernel.org/all/YqzRGj6ryBZfGBSl@google.com
[2] https://lore.kernel.org/all/20220503221357.943536-1-vipinsh@google.com

---
 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/mmu/mmu.c          | 131 ++++++++++++++++----------------
 2 files changed, 69 insertions(+), 64 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 283cbb83d6ae..f123bd985e1a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -786,6 +786,8 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
+	spinlock_t mmu_shadow_page_cache_lock;
+
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
 	 * In vcpu_run, we switch between the user and guest FPU contexts.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4736d7849c60..fca74cb1f2ce 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -157,7 +157,12 @@ struct kvm_shadow_walk_iterator {
 
 static struct kmem_cache *pte_list_desc_cache;
 struct kmem_cache *mmu_page_header_cache;
-static struct percpu_counter kvm_total_used_mmu_pages;
+
+/*
+ * The total number of allocated, unused MMU pages, i.e. the total number of
+ * unused pages sitting in kvm_mmu_memory_cache structures.
+ */
+static struct percpu_counter kvm_total_unused_mmu_pages;
 
 static void mmu_spte_set(u64 *sptep, u64 spte);
 
@@ -643,6 +648,23 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 	}
 }
 
+static int kvm_mmu_topup_sp_cache(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_shadow_page_cache;
+	int orig_nobjs = cache->nobjs;
+	int r;
+
+	spin_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
+	r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
+
+	if (cache->nobjs != orig_nobjs)
+		percpu_counter_add(&kvm_total_unused_mmu_pages,
+				   cache->nobjs - orig_nobjs);
+	spin_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
+
+	return r;
+}
+
 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
 	int r;
@@ -652,10 +674,11 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
 	if (r)
 		return r;
-	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
-				       PT64_ROOT_MAX_LEVEL);
+
+	r = kvm_mmu_topup_sp_cache(vcpu);
 	if (r)
 		return r;
+
 	if (maybe_indirect) {
 		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
 					       PT64_ROOT_MAX_LEVEL);
@@ -1681,28 +1704,23 @@ static int is_empty_shadow_page(u64 *spt)
 }
 #endif
 
-/*
- * This value is the sum of all of the kvm instances's
- * kvm->arch.n_used_mmu_pages values.  We need a global,
- * aggregate version in order to make the slab shrinker
- * faster
- */
-static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
-{
-	kvm->arch.n_used_mmu_pages += nr;
-	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
-}
-
 static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	kvm_mod_used_mmu_pages(kvm, +1);
+	kvm->arch.n_used_mmu_pages++;
 	kvm_account_pgtable_pages((void *)sp->spt, +1);
+
+	percpu_counter_dec(&kvm_total_unused_mmu_pages);
 }
 
 static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	kvm_mod_used_mmu_pages(kvm, -1);
+	kvm->arch.n_used_mmu_pages--;
 	kvm_account_pgtable_pages((void *)sp->spt, -1);
+
+	/*
+	 * KVM doesn't put freed pages back into the cache, thus freeing a page
+	 * doesn't affect the number of unused MMU pages.
+	 */
 }
 
 static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
@@ -5859,6 +5877,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
 
 	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+	spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
 	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
@@ -5994,11 +6013,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 		kvm_tdp_mmu_zap_invalidated_roots(kvm);
 }
 
-static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
-{
-	return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
-}
-
 static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
@@ -6583,69 +6597,58 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
 	}
 }
 
-static unsigned long
-mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+static unsigned long mmu_shrink_scan(struct shrinker *shrink,
+				     struct shrink_control *sc)
 {
+	struct kvm_mmu_memory_cache *cache;
+	struct kvm_vcpu *vcpu;
 	struct kvm *kvm;
-	int nr_to_scan = sc->nr_to_scan;
 	unsigned long freed = 0;
+	unsigned long i;
 
 	mutex_lock(&kvm_lock);
 
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		int idx;
-		LIST_HEAD(invalid_list);
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			cache = &vcpu->arch.mmu_shadow_page_cache;
 
-		/*
-		 * Never scan more than sc->nr_to_scan VM instances.
-		 * Will not hit this condition practically since we do not try
-		 * to shrink more than one VM and it is very unlikely to see
-		 * !n_used_mmu_pages so many times.
-		 */
-		if (!nr_to_scan--)
-			break;
-		/*
-		 * n_used_mmu_pages is accessed without holding kvm->mmu_lock
-		 * here. We may skip a VM instance errorneosly, but we do not
-		 * want to shrink a VM that only started to populate its MMU
-		 * anyway.
-		 */
-		if (!kvm->arch.n_used_mmu_pages &&
-		    !kvm_has_zapped_obsolete_pages(kvm))
-			continue;
+			if (!READ_ONCE(cache->nobjs))
+				continue;
 
-		idx = srcu_read_lock(&kvm->srcu);
-		write_lock(&kvm->mmu_lock);
-
-		if (kvm_has_zapped_obsolete_pages(kvm)) {
-			kvm_mmu_commit_zap_page(kvm,
-			      &kvm->arch.zapped_obsolete_pages);
-			goto unlock;
+			spin_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
+			while (cache->nobjs) {
+				free_page((unsigned long)cache->objects[--cache->nobjs]);
+				++freed;
+			}
+			spin_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
 		}
 
-		freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
-
-unlock:
-		write_unlock(&kvm->mmu_lock);
-		srcu_read_unlock(&kvm->srcu, idx);
-
 		/*
 		 * unfair on small ones
 		 * per-vm shrinkers cry out
 		 * sadness comes quickly
 		 */
 		list_move_tail(&kvm->vm_list, &vm_list);
-		break;
+
+		/*
+		 * Process all vCPUs before checking if enough pages have been
+		 * freed as a very naive way of providing fairness, e.g. to
+		 * avoid starving a single vCPU.
+		 */
+		if (freed >= sc->nr_to_scan)
+			break;
 	}
 
 	mutex_unlock(&kvm_lock);
+
+	sc->nr_scanned = freed;
 	return freed;
 }
 
-static unsigned long
-mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+static unsigned long mmu_shrink_count(struct shrinker *shrink,
+				      struct shrink_control *sc)
 {
-	return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
+	return percpu_counter_read_positive(&kvm_total_unused_mmu_pages);
 }
 
 static struct shrinker mmu_shrinker = {
@@ -6753,7 +6756,7 @@ int kvm_mmu_vendor_module_init(void)
 	if (!mmu_page_header_cache)
 		goto out;
 
-	if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
+	if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
 		goto out;
 
 	ret = register_shrinker(&mmu_shrinker, "x86-mmu");
@@ -6763,7 +6766,7 @@ int kvm_mmu_vendor_module_init(void)
 	return 0;
 
 out_shrinker:
-	percpu_counter_destroy(&kvm_total_used_mmu_pages);
+	percpu_counter_destroy(&kvm_total_unused_mmu_pages);
 out:
 	mmu_destroy_caches();
 	return ret;
@@ -6780,7 +6783,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 void kvm_mmu_vendor_module_exit(void)
 {
 	mmu_destroy_caches();
-	percpu_counter_destroy(&kvm_total_used_mmu_pages);
+	percpu_counter_destroy(&kvm_total_unused_mmu_pages);
 	unregister_shrinker(&mmu_shrinker);
 }
 

base-commit: d709db0a0c05a6a2973655ed88690d4d43128d60
-- 

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages
  2022-12-05 18:17   ` Ben Gardon
@ 2022-12-05 23:40     ` Vipin Sharma
  2022-12-06 18:17       ` Ben Gardon
  0 siblings, 1 reply; 17+ messages in thread
From: Vipin Sharma @ 2022-12-05 23:40 UTC (permalink / raw)
  To: Ben Gardon; +Cc: dmatlack, seanjc, pbonzini, kvm, linux-kernel

On Mon, Dec 5, 2022 at 10:17 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Thu, Dec 1, 2022 at 11:57 AM Vipin Sharma <vipinsh@google.com> wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1782c4555d94..4d59c9d48277 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -384,6 +384,11 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
> >         kvm_arch_guest_memory_reclaimed(kvm);
> >  }
> >
> > +void * __weak kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
> > +{
> > +               return (void *)__get_free_page(gfp_flags);
> > +}
> > +
>
> Rather than making this __weak, you could use #ifdef CONFIG_NUMA to
> just put all the code in the arch-neutral function.
>

I am not sure how that will work. Here, I am trying to keep this feature
only for x86. This function will be used by all architectures except
x86, where we have a different implementation in arch/x86/kvm/mmu/mmu.c.
So, even if CONFIG_NUMA is defined, we want to keep the same definition
on the other architectures.
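
To make that concrete, a minimal sketch of the intended linkage (the x86
body below is only an assumption about what the override does, using
kvm_mmu_get_free_page() from patch 1 as the NUMA-aware allocator):

/* virt/kvm/kvm_main.c: __weak default, used by every other architecture. */
void * __weak kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
{
	return (void *)__get_free_page(gfp_flags);
}

/* arch/x86/kvm/mmu/mmu.c: strong definition that overrides the __weak one. */
void *kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
{
	/* Hand off to the x86 NUMA-aware allocation from patch 1. */
	return kvm_mmu_get_free_page(nid, gfp_flags);
}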





> >  #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
> >  static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> >                                                gfp_t gfp_flags)
> > @@ -393,7 +398,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> >         if (mc->kmem_cache)
> >                 return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
> >         else
> > -               return (void *)__get_free_page(gfp_flags);
> > +               return kvm_arch_mmu_get_free_page(mc->node, gfp_flags);
> >  }
> >
> >  int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> > --
> > 2.39.0.rc0.267.gcb52ba06e7-goog
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages
  2022-12-05 23:40     ` Vipin Sharma
@ 2022-12-06 18:17       ` Ben Gardon
       [not found]         ` <CAHVum0f_6UQvcqWAJxDJyL_LN-6ryAXNuh9xY6nFtLxCOMtoXA@mail.gmail.com>
  0 siblings, 1 reply; 17+ messages in thread
From: Ben Gardon @ 2022-12-06 18:17 UTC (permalink / raw)
  To: Vipin Sharma; +Cc: dmatlack, seanjc, pbonzini, kvm, linux-kernel

On Mon, Dec 5, 2022 at 3:40 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> On Mon, Dec 5, 2022 at 10:17 AM Ben Gardon <bgardon@google.com> wrote:
> >
> > On Thu, Dec 1, 2022 at 11:57 AM Vipin Sharma <vipinsh@google.com> wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 1782c4555d94..4d59c9d48277 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -384,6 +384,11 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
> > >         kvm_arch_guest_memory_reclaimed(kvm);
> > >  }
> > >
> > > +void * __weak kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
> > > +{
> > > +               return (void *)__get_free_page(gfp_flags);
> > > +}
> > > +
> >
> > Rather than making this __weak, you could use #ifdef CONFIG_NUMA to
> > just put all the code in the arch-neutral function.
> >
>
> I am not sure how it will work. Here, I am trying to keep this feature
> only for x86. This function will be used for all architecture except
> in x86 where we have different implementation in arch/x86/mmu/mmu.c
> So, even if CONFIG_NUMA is defined, we want to keep the same
> definition on other architectures.
>
>

Something like:

+void * kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
+{
+       struct page *spt_page;
+       void *address = NULL;
+
+       #ifdef CONFIG_NUMA
+       if (nid != NUMA_NO_NODE) {
+               spt_page = alloc_pages_node(nid, gfp_flags, 0);
+               if (spt_page) {
+                       address = page_address(spt_page);
+                       return address;
+               }
+       }
+       #endif // CONFIG_NUMA
+       return (void *)__get_free_page(gfp_flags);
+}

>
>
>
> > >  #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
> > >  static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> > >                                                gfp_t gfp_flags)
> > > @@ -393,7 +398,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> > >         if (mc->kmem_cache)
> > >                 return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
> > >         else
> > > -               return (void *)__get_free_page(gfp_flags);
> > > +               return kvm_arch_mmu_get_free_page(mc->node, gfp_flags);
> > >  }
> > >
> > >  int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> > > --
> > > 2.39.0.rc0.267.gcb52ba06e7-goog
> > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages
       [not found]             ` <CAHVum0dkKSY9e90xgfBVBHUqntwJOmONK+TYBXFEwg6acvUrAw@mail.gmail.com>
@ 2022-12-07 19:05               ` Vipin Sharma
  2022-12-09  0:06                 ` David Matlack
  0 siblings, 1 reply; 17+ messages in thread
From: Vipin Sharma @ 2022-12-07 19:05 UTC (permalink / raw)
  To: Sean Christopherson, David Matlack, Paolo Bonzini, KVM,
	Linux Kernel Mailing List
  Cc: Ben Gardon

By mistake I started replying to just Ben and realized it after a few
exchanges. Adding the others. Sorry about that.

On Wed, Dec 7, 2022 at 10:58 AM Vipin Sharma <vipinsh@google.com> wrote:
>
> On Tue, Dec 6, 2022 at 11:57 AM Ben Gardon <bgardon@google.com> wrote:
> >
> > On Tue, Dec 6, 2022 at 11:18 AM Vipin Sharma <vipinsh@google.com> wrote:
> > >
> > > On Tue, Dec 6, 2022 at 10:17 AM Ben Gardon <bgardon@google.com> wrote:
> > > >
> > > > On Mon, Dec 5, 2022 at 3:40 PM Vipin Sharma <vipinsh@google.com> wrote:
> > > > >
> > > > > On Mon, Dec 5, 2022 at 10:17 AM Ben Gardon <bgardon@google.com> wrote:
> > > > > >
> > > > > > On Thu, Dec 1, 2022 at 11:57 AM Vipin Sharma <vipinsh@google.com> wrote:
> > > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > > > index 1782c4555d94..4d59c9d48277 100644
> > > > > > > --- a/virt/kvm/kvm_main.c
> > > > > > > +++ b/virt/kvm/kvm_main.c
> > > > > > > @@ -384,6 +384,11 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
> > > > > > >         kvm_arch_guest_memory_reclaimed(kvm);
> > > > > > >  }
> > > > > > >
> > > > > > > +void * __weak kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
> > > > > > > +{
> > > > > > > +               return (void *)__get_free_page(gfp_flags);
> > > > > > > +}
> > > > > > > +
> > > > > >
> > > > > > Rather than making this __weak, you could use #ifdef CONFIG_NUMA to
> > > > > > just put all the code in the arch-neutral function.
> > > > > >
> > > > >
> > > > > I am not sure how it will work. Here, I am trying to keep this feature
> > > > > only for x86. This function will be used for all architecture except
> > > > > in x86 where we have different implementation in arch/x86/mmu/mmu.c
> > > > > So, even if CONFIG_NUMA is defined, we want to keep the same
> > > > > definition on other architectures.
> > > > >
> > > > >
> > > >
> > > > Something like:
> > > >
> > > > +void * kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
> > > > +{
> > > > +       struct page *spt_page;
> > > > +       void *address = NULL;
> > > > +
> > > > +       #ifdef CONFIG_NUMA
> > > > +       if (nid != NUMA_NO_NODE) {
> > > > +               spt_page = alloc_pages_node(nid, gfp, 0);
> > > > +               if (spt_page) {
> > > > +                       address = page_address(spt_page);
> > > > +                       return address;
> > > > +               }
> > > > +       }
> > > > +       #endif // CONFIG_NUMA
> > > > +       return (void *)__get_free_page(gfp);
> > > > +}
> > > >
> > >
> > > 'nid' will be 0, not NUMA_NO_NODE, for other architectures. In x86, I am
> > > explicitly setting kvm_mmu_memory_cache->node to NUMA_NO_NODE or
> > > specific desired nodes. In other architectures it will be 0 as the struct
> > > will be zero-initialized. __weak avoids initializing nid to NUMA_NO_NODE
> > > in other architectures.
> >
> > ooh, I see. It might be worth setting it to NUMA_NO_NODE on other
> > archs as 0 could be kind of misleading.
>
> Discussed offline with Ben.
> The initialization code for the caches lives in the respective
> architectures. Using "__weak" avoids touching code in the other
> architectures.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node
  2022-12-05 21:07         ` Sean Christopherson
@ 2022-12-08 22:44           ` Vipin Sharma
  0 siblings, 0 replies; 17+ messages in thread
From: Vipin Sharma @ 2022-12-08 22:44 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Ben Gardon, dmatlack, pbonzini, kvm, linux-kernel

On Mon, Dec 5, 2022 at 1:07 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 05, 2022, Ben Gardon wrote:
> > On Mon, Dec 5, 2022 at 10:47 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > > +static bool __read_mostly numa_aware_pagetable = true;
> > > > > +module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644);
> > > > > +
> > > >
> > > > I'm usually all for having module params to control things, but in
> > > > this case I don't think it provides much value because whether this
> > > > NUMA optimization is useful or not is going to depend more on VM size
> > > > and workload than anything else. If we wanted to make this
> > > > configurable, a VM capability would probably be a better mechanism so
> > > > that userspace could leave it off when running small,
> > > > non-performance-sensitive VMs
> > >
> > > Would we actually want to turn it off in this case?  IIUC, @nid is just the
> > > preferred node, i.e. failure to allocate for the preferred @nid will result in
> > > falling back to other nodes, not outright failure.  So the pathological worst
> > > case scenario would be that for a system with VMs that don't care about performance,
> > > all of a nodes memory is allocated due to all VMs starting on that node.
> > >
> > > On the flip side, if a system had a mix of VM shapes, I think we'd want even the
> > > performance insensitive VMs to be NUMA aware so that they can be sequestered on
> > > their own node(s), i.e. don't "steal" memory from the VMs that are performance
> > > sensitive and have been affined to a single node.
> >
> > Yeah, the only reason to turn it off would be to save memory. As a
> > strawman, if you had 100 1-vCPU VMs on a 2 node system, you'd have
> > 4000 pages allocated in caches, doing nothing.
>

I will keep it as a module parameter for now since, as Sean noted, we
don't want the performance-insensitive VMs to impact the
performance-sensitive VMs. Also, as noted below, changing the shrinker
behaviour to free the caches will allow reclaiming the memory of unused
nodes.

If there is a better reason for using a capability, let me know and we
can discuss it more.

> The management of the caches really should be addressed separately, and this is
> the perfect excuse to finally get some traction on mmu_shrink_scan().
>
> From a while back[1]:
>
>  : I know we proposed dropping mmu_shrink_scan(), but the more I think about that,
>  : the more I think that an outright drop is wrong.  The real issue is that KVM as
>  : quite literally the dumbest possible "algorithm" for zapping possibly-in-use
>  : shadow pages, and doesn't target the zapping to fit the cgroup that's under
>  : pressure.
>  :
>  : So for this, IMO rather than assume that freeing the caches when any memslot
>  : disables dirty logging, I think it makes sense to initially keep the caches and
>  : only free them at VM destruction.  Then in follow-up patches, if we want, free
>  : the caches in the mmu_shrink_scan(), and/or add a function hook for toggling
>  : eager_page_split to topup/free the caches accordingly.  That gives userspace
>  : explicit control over when the caches are purged, and does the logical thing of
>  : freeing the caches when eager_page_split is disabled.
>
> The current mmu_shrink_scan() implementation is basically useless.  It's so naive
> that relying on it to react to memory pressure all but guarantees that guest
> performance will tank if the shrinker kicks in.  E.g. KVM zaps the oldest pages,
> which are likely upper level SPTEs and thus most likely to be reused.  And prior
> to the TDP MMU, a guest running nested VMs would more than likely zap L1's SPTEs
> even if the majority of shadow pages were allocated for L2.
>
> And as pointed out in the RFC to drop shrinker support entirely[1], for well over
> a decade KVM zapped a single page in each call from the shrinker, i.e. it was
> _really_ useless.  And although in [1] I said adding memcg would be easy, doing
> so in a performant way would be quite difficult as the per-cpu counter would need
> to be made memcg aware (didn't think about that before).
>
> Lastly, given that I completely broke the shrink logic for ~6 months and someone
> only noticed when n_max_mmu_pages kicked in (commit 8fc517267fb2 "KVM: x86: Zap
> the oldest MMU pages, not the newest"), I highly doubt anyone is relying on the
> current shrinker logic, i.e. KVM_SET_NR_MMU_PAGES is used for setups that actually
> need to limit the number of MMU pages beyond KVM's default.
>
> My vote is to do something like the below: repurpose KVM's shrinker integration
> to only purge the shadow page caches.  That can be done with almost no impact on
> guest performance, e.g. with a dedicated spinlock, reclaiming from the caches
> wouldn't even need to kick the vCPU.  Supporting reclaim of the caches would allow
> us to change the cache capacity and/or behavior without having to worry too much
> about exploding memory, and would make KVM's shrinker support actually usable.
>
> [1] https://lore.kernel.org/all/YqzRGj6ryBZfGBSl@google.com
> [2] https://lore.kernel.org/all/20220503221357.943536-1-vipinsh@google.com
>

As discussed offline:

1. I will create a patch in this series which changes the shrinker's
behaviour to free the caches.

2. One suggestion that came up was to fill the cache for the correct
node only in the topup stage, using the PFN returned by
kvm_faultin_pfn(). We decided against it because
mmu_topup_memory_cache() cannot be called after kvm_faultin_pfn(): the
topup would widen the window between reading mmu_invalidate_seq and
checking it in is_page_fault_stale(), which would increase the chances
of retrying the page fault and reduce guest performance (see the sketch
after this list).

3. I will reduce the cache size and see if there is any observable impact.
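
Roughly, the existing ordering in direct_page_fault() that matters here
is (a sketch, not the exact code):

	r = mmu_topup_memory_caches(vcpu, ...);        /* caches are filled before the pfn is known */
	mmu_seq = vcpu->kvm->mmu_invalidate_seq;       /* snapshot the invalidation sequence */
	smp_rmb();
	r = kvm_faultin_pfn(vcpu, fault, ...);         /* gfn -> pfn, may block on the primary MMU */
	...
	/* under mmu_lock */
	if (is_page_fault_stale(vcpu, fault, mmu_seq)) /* retry if the sequence moved */
		goto out_unlock;

Topping up after kvm_faultin_pfn() would sit inside the window between
the mmu_invalidate_seq read and the staleness check, widening it and
causing more retries.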

> ---
>  arch/x86/include/asm/kvm_host.h |   2 +
>  arch/x86/kvm/mmu/mmu.c          | 131 ++++++++++++++++----------------
>  2 files changed, 69 insertions(+), 64 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 283cbb83d6ae..f123bd985e1a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -786,6 +786,8 @@ struct kvm_vcpu_arch {
>         struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
>         struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> +       spinlock_t mmu_shadow_page_cache_lock;
> +
>         /*
>          * QEMU userspace and the guest each have their own FPU state.
>          * In vcpu_run, we switch between the user and guest FPU contexts.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4736d7849c60..fca74cb1f2ce 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -157,7 +157,12 @@ struct kvm_shadow_walk_iterator {
>
>  static struct kmem_cache *pte_list_desc_cache;
>  struct kmem_cache *mmu_page_header_cache;
> -static struct percpu_counter kvm_total_used_mmu_pages;
> +
> +/*
> + * The total number of allocated, unused MMU pages, i.e. the total number of
> + * unused pages sitting in kvm_mmu_memory_cache structures.
> + */
> +static struct percpu_counter kvm_total_unused_mmu_pages;
>
>  static void mmu_spte_set(u64 *sptep, u64 spte);
>
> @@ -643,6 +648,23 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>         }
>  }
>
> +static int kvm_mmu_topup_sp_cache(struct kvm_vcpu *vcpu)
> +{
> +       struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_shadow_page_cache;
> +       int orig_nobjs = cache->nobjs;
> +       int r;
> +
> +       spin_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> +       r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> +
> +       if (cache->nobjs != orig_nobjs)
> +               percpu_counter_add(&kvm_total_unused_mmu_pages,
> +                                  cache->nobjs - orig_nobjs);
> +       spin_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> +
> +       return r;
> +}
> +
>  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  {
>         int r;
> @@ -652,10 +674,11 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>                                        1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
>         if (r)
>                 return r;
> -       r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> -                                      PT64_ROOT_MAX_LEVEL);
> +
> +       r = kvm_mmu_topup_sp_cache(vcpu);
>         if (r)
>                 return r;
> +
>         if (maybe_indirect) {
>                 r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
>                                                PT64_ROOT_MAX_LEVEL);
> @@ -1681,28 +1704,23 @@ static int is_empty_shadow_page(u64 *spt)
>  }
>  #endif
>
> -/*
> - * This value is the sum of all of the kvm instances's
> - * kvm->arch.n_used_mmu_pages values.  We need a global,
> - * aggregate version in order to make the slab shrinker
> - * faster
> - */
> -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> -{
> -       kvm->arch.n_used_mmu_pages += nr;
> -       percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> -}
> -
>  static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
> -       kvm_mod_used_mmu_pages(kvm, +1);
> +       kvm->arch.n_used_mmu_pages++;
>         kvm_account_pgtable_pages((void *)sp->spt, +1);
> +
> +       percpu_counter_dec(&kvm_total_unused_mmu_pages);
>  }
>
>  static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
> -       kvm_mod_used_mmu_pages(kvm, -1);
> +       kvm->arch.n_used_mmu_pages--;
>         kvm_account_pgtable_pages((void *)sp->spt, -1);
> +
> +       /*
> +        * KVM doesn't put freed pages back into the cache, thus freeing a page
> +        * doesn't affect the number of unused MMU pages.
> +        */
>  }
>
>  static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
> @@ -5859,6 +5877,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>         vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
>         vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> +       spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
>         vcpu->arch.mmu = &vcpu->arch.root_mmu;
>         vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> @@ -5994,11 +6013,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>                 kvm_tdp_mmu_zap_invalidated_roots(kvm);
>  }
>
> -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> -{
> -       return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> -}
> -
>  static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>                         struct kvm_memory_slot *slot,
>                         struct kvm_page_track_notifier_node *node)
> @@ -6583,69 +6597,58 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
>         }
>  }
>
> -static unsigned long
> -mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> +static unsigned long mmu_shrink_scan(struct shrinker *shrink,
> +                                    struct shrink_control *sc)
>  {
> +       struct kvm_mmu_memory_cache *cache;
> +       struct kvm_vcpu *vcpu;
>         struct kvm *kvm;
> -       int nr_to_scan = sc->nr_to_scan;
>         unsigned long freed = 0;
> +       unsigned long i;
>
>         mutex_lock(&kvm_lock);
>
>         list_for_each_entry(kvm, &vm_list, vm_list) {
> -               int idx;
> -               LIST_HEAD(invalid_list);
> +               kvm_for_each_vcpu(i, vcpu, kvm) {
> +                       cache = &vcpu->arch.mmu_shadow_page_cache;
>
> -               /*
> -                * Never scan more than sc->nr_to_scan VM instances.
> -                * Will not hit this condition practically since we do not try
> -                * to shrink more than one VM and it is very unlikely to see
> -                * !n_used_mmu_pages so many times.
> -                */
> -               if (!nr_to_scan--)
> -                       break;
> -               /*
> -                * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> -                * here. We may skip a VM instance errorneosly, but we do not
> -                * want to shrink a VM that only started to populate its MMU
> -                * anyway.
> -                */
> -               if (!kvm->arch.n_used_mmu_pages &&
> -                   !kvm_has_zapped_obsolete_pages(kvm))
> -                       continue;
> +                       if (!READ_ONCE(cache->nobjs))
> +                               continue;
>
> -               idx = srcu_read_lock(&kvm->srcu);
> -               write_lock(&kvm->mmu_lock);
> -
> -               if (kvm_has_zapped_obsolete_pages(kvm)) {
> -                       kvm_mmu_commit_zap_page(kvm,
> -                             &kvm->arch.zapped_obsolete_pages);
> -                       goto unlock;
> +                       spin_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> +                       while (cache->nobjs) {
> +                               free_page((unsigned long)cache->objects[--cache->nobjs]);
> +                               ++freed;
> +                       }
> +                       spin_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
>                 }
>
> -               freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> -
> -unlock:
> -               write_unlock(&kvm->mmu_lock);
> -               srcu_read_unlock(&kvm->srcu, idx);
> -
>                 /*
>                  * unfair on small ones
>                  * per-vm shrinkers cry out
>                  * sadness comes quickly
>                  */
>                 list_move_tail(&kvm->vm_list, &vm_list);
> -               break;
> +
> +               /*
> +                * Process all vCPUs before checking if enough pages have been
> +                * freed as a very naive way of providing fairness, e.g. to
> +                * avoid starving a single vCPU.
> +                */
> +               if (freed >= sc->nr_to_scan)
> +                       break;
>         }
>
>         mutex_unlock(&kvm_lock);
> +
> +       sc->nr_scanned = freed;
>         return freed;
>  }
>
> -static unsigned long
> -mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> +static unsigned long mmu_shrink_count(struct shrinker *shrink,
> +                                     struct shrink_control *sc)
>  {
> -       return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> +       return percpu_counter_read_positive(&kvm_total_unused_mmu_pages);
>  }
>
>  static struct shrinker mmu_shrinker = {
> @@ -6753,7 +6756,7 @@ int kvm_mmu_vendor_module_init(void)
>         if (!mmu_page_header_cache)
>                 goto out;
>
> -       if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> +       if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
>                 goto out;
>
>         ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> @@ -6763,7 +6766,7 @@ int kvm_mmu_vendor_module_init(void)
>         return 0;
>
>  out_shrinker:
> -       percpu_counter_destroy(&kvm_total_used_mmu_pages);
> +       percpu_counter_destroy(&kvm_total_unused_mmu_pages);
>  out:
>         mmu_destroy_caches();
>         return ret;
> @@ -6780,7 +6783,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
>  void kvm_mmu_vendor_module_exit(void)
>  {
>         mmu_destroy_caches();
> -       percpu_counter_destroy(&kvm_total_used_mmu_pages);
> +       percpu_counter_destroy(&kvm_total_unused_mmu_pages);
>         unregister_shrinker(&mmu_shrinker);
>  }
>
>
> base-commit: d709db0a0c05a6a2973655ed88690d4d43128d60
> --

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages
  2022-12-07 19:05               ` Vipin Sharma
@ 2022-12-09  0:06                 ` David Matlack
  2022-12-09 18:47                   ` Vipin Sharma
  0 siblings, 1 reply; 17+ messages in thread
From: David Matlack @ 2022-12-09  0:06 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: Sean Christopherson, Paolo Bonzini, KVM,
	Linux Kernel Mailing List, Ben Gardon

On Wed, Dec 07, 2022 at 11:05:09AM -0800, Vipin Sharma wrote:
> By mistake I started replying to just Ben and realized it after few
> exchanges. Adding others. Sorry about that.
> 
> On Wed, Dec 7, 2022 at 10:58 AM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > On Tue, Dec 6, 2022 at 11:57 AM Ben Gardon <bgardon@google.com> wrote:
> > >
> > > On Tue, Dec 6, 2022 at 11:18 AM Vipin Sharma <vipinsh@google.com> wrote:
> > > >
> > > > On Tue, Dec 6, 2022 at 10:17 AM Ben Gardon <bgardon@google.com> wrote:
> > > > >
> > > > > On Mon, Dec 5, 2022 at 3:40 PM Vipin Sharma <vipinsh@google.com> wrote:
> > > > > >
> > > > > > On Mon, Dec 5, 2022 at 10:17 AM Ben Gardon <bgardon@google.com> wrote:
> > > > > > >
> > > > > > > On Thu, Dec 1, 2022 at 11:57 AM Vipin Sharma <vipinsh@google.com> wrote:
> > > > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > > > > index 1782c4555d94..4d59c9d48277 100644
> > > > > > > > --- a/virt/kvm/kvm_main.c
> > > > > > > > +++ b/virt/kvm/kvm_main.c
> > > > > > > > @@ -384,6 +384,11 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
> > > > > > > >         kvm_arch_guest_memory_reclaimed(kvm);
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +void * __weak kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
> > > > > > > > +{
> > > > > > > > +               return (void *)__get_free_page(gfp_flags);
> > > > > > > > +}
> > > > > > > > +
> > > > > > >
> > > > > > > Rather than making this __weak, you could use #ifdef CONFIG_NUMA to
> > > > > > > just put all the code in the arch-neutral function.
> > > > > > >
> > > > > >
> > > > > > I am not sure how it will work. Here, I am trying to keep this feature
> > > > > > only for x86. This function will be used for all architecture except
> > > > > > in x86 where we have different implementation in arch/x86/mmu/mmu.c
> > > > > > So, even if CONFIG_NUMA is defined, we want to keep the same
> > > > > > definition on other architectures.
> > > > > >
> > > > > >
> > > > >
> > > > > Something like:
> > > > >
> > > > > +void * kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
> > > > > +{
> > > > > +       struct page *spt_page;
> > > > > +       void *address = NULL;
> > > > > +
> > > > > +       #ifdef CONFIG_NUMA
> > > > > +       if (nid != NUMA_NO_NODE) {
> > > > > +               spt_page = alloc_pages_node(nid, gfp, 0);
> > > > > +               if (spt_page) {
> > > > > +                       address = page_address(spt_page);
> > > > > +                       return address;
> > > > > +               }
> > > > > +       }
> > > > > +       #endif // CONFIG_NUMA
> > > > > +       return (void *)__get_free_page(gfp);
> > > > > +}
> > > > >
> > > >
> > > > 'nid' will be 0 not NUMA_NO_NODE for other architectures. In x86, I am
> > > > explicitly setting kvm_mmu_memory_cache->node to NUMA_NO_NODE or
> > > > specific desired nodes. In others architectures it will be 0 as struct
> > > > will be 0 initialized. __weak avoids initializing nid to NUM_NO_NODE
> > > > in other architectures.
> > >
> > > ooh, I see. It might be worth setting it to NUMA_NO_NODE on other
> > > archs as 0 could be kind of misleading.
> >
> > Discussed offline with Ben.
> > Initialization code for cache is in the respective architectures.
> > Using "__weak" avoids touching code in other architectures.

But it's still a bit gross to have node=0 in struct
kvm_mmu_memory_cache for other architectures, even if it doesn't happen
to be misused in this series.

I would just bite the bullet and modify the other architectures. Do it
in a precursor patch where you just add node to struct
kvm_mmu_memory_cache and initialize it to NUMA_NO_NODE across all
architectures, probably with a common macro e.g.

#define INIT_KVM_MMU_MEMORY_CACHE(_cache) do { \
	(_cache)->node = NUMA_NO_NODE; \
} while (0)
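
For example (hypothetical call site; x86's kvm_mmu_create() already sets
gfp_zero this way per the diff earlier in the thread):

	INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache);
	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;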

Then, you can follow Ben's approach and avoid the __weak function.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v2 0/2] NUMA aware page table allocation
  2022-12-01 19:57 [Patch v2 0/2] NUMA aware page table allocation Vipin Sharma
  2022-12-01 19:57 ` [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node Vipin Sharma
  2022-12-01 19:57 ` [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages Vipin Sharma
@ 2022-12-09  0:21 ` David Matlack
  2 siblings, 0 replies; 17+ messages in thread
From: David Matlack @ 2022-12-09  0:21 UTC (permalink / raw)
  To: Vipin Sharma; +Cc: bgardon, seanjc, pbonzini, kvm, linux-kernel

On Thu, Dec 01, 2022 at 11:57:16AM -0800, Vipin Sharma wrote:
> Hi,
> 
> This series improves page table accesses by allocating page tables on
> the same NUMA node where underlying physical page is present.
> 
> Currently page tables are allocated during page faults and page splits.
> In both instances page table location will depend on the current thread
> mempolicy. This can create suboptimal placement of page tables on NUMA
> node, for example, thread doing eager page split is on different NUMA
> node compared to page it is splitting.
> 
> Reviewers please provide suggestion to the following:
> 
> 1. Module parameter is true by default, which means this feature will
>    be enabled by default. Is this okay or should I set it to false?
> 
> 2. I haven't reduced KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE considering that
>    it might not be too much of an impact as only online nodes are filled
>    during topup phase and in many cases some of these nodes will never
>    be refilled again.  Please let me know if you want this to be
>    reduced.
> 
> 3. I have tried to keep everything in x86/mmu except for some changes in
>    virt/kvm/kvm_main.c. I used __weak function so that only x86/mmu will
>    see the change, other arch nothing will change. I hope this is the
>    right approach.
> 
> 4. I am not sure what is the right way to split patch 2. If you think
>    this is too big for a patch please let me know what would you prefer.

I agree it's too big. The split_shadow_page_cache changes can easily be
split into a separate commit.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages
  2022-12-01 19:57 ` [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages Vipin Sharma
  2022-12-05 18:17   ` Ben Gardon
@ 2022-12-09  0:27   ` David Matlack
  2022-12-09 18:51     ` Vipin Sharma
  1 sibling, 1 reply; 17+ messages in thread
From: David Matlack @ 2022-12-09  0:27 UTC (permalink / raw)
  To: Vipin Sharma; +Cc: bgardon, seanjc, pbonzini, kvm, linux-kernel

On Thu, Dec 01, 2022 at 11:57:18AM -0800, Vipin Sharma wrote:
> Page table pages of a VM are currently allocated based on the current
> task's NUMA node or its mempolicy. This can cause suboptimal remote
> accesses by the vCPU if it is accessing physical pages local to its NUMA
> node but the page table pages mapping those physical pages were created
> by some other vCPU which was on a different NUMA node or had a different
> policy.
> 
> Allocate page table pages on the same NUMA node where the underlying
> physical page exists. Page tables at levels 5, 4, and 3 might not end up
> on the same NUMA node, as they can span multiple NUMA nodes.
> 
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
...
> @@ -6284,13 +6326,16 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
>  	gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
>  	level = huge_sp->role.level;
>  	spte = *huge_sptep;
> +	nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(spte));
> +	if (nid == NUMA_NO_NODE)
> +		nid = numa_mem_id();

What do you think about renaming kvm_pfn_to_refcounted_page_nid() to
kvm_pfn_to_page_table_nid() and having it return numa_mem_id() instead
of NUMA_NO_NODE (with a comment)? I think that will clean up this patch
quite a bit by getting rid of all the NUMA_NO_NODE checks.
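
I.e. something along these lines (a sketch of the suggestion, reusing
the helper from patch 1):

/*
 * Returns the NUMA node on which to allocate a page table page that maps
 * @pfn: the node of the backing page if the pfn is backed by a refcounted
 * page, otherwise the local node.
 */
static inline int kvm_pfn_to_page_table_nid(kvm_pfn_t pfn)
{
	struct page *page = kvm_pfn_to_refcounted_page(pfn);

	return page ? page_to_nid(page) : numa_mem_id();
}

That way callers such as shadow_mmu_try_split_huge_page() can use the
returned nid directly.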

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages
  2022-12-09  0:06                 ` David Matlack
@ 2022-12-09 18:47                   ` Vipin Sharma
  0 siblings, 0 replies; 17+ messages in thread
From: Vipin Sharma @ 2022-12-09 18:47 UTC (permalink / raw)
  To: David Matlack
  Cc: Sean Christopherson, Paolo Bonzini, KVM,
	Linux Kernel Mailing List, Ben Gardon

On Thu, Dec 8, 2022 at 4:06 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Dec 07, 2022 at 11:05:09AM -0800, Vipin Sharma wrote:
> > By mistake I started replying to just Ben and realized it after few
> > exchanges. Adding others. Sorry about that.
> >
> > On Wed, Dec 7, 2022 at 10:58 AM Vipin Sharma <vipinsh@google.com> wrote:
> > >
> > > On Tue, Dec 6, 2022 at 11:57 AM Ben Gardon <bgardon@google.com> wrote:
> > > >
> > > > On Tue, Dec 6, 2022 at 11:18 AM Vipin Sharma <vipinsh@google.com> wrote:
> > > > >
> > > > > On Tue, Dec 6, 2022 at 10:17 AM Ben Gardon <bgardon@google.com> wrote:
> > > > > >
> > > > > > On Mon, Dec 5, 2022 at 3:40 PM Vipin Sharma <vipinsh@google.com> wrote:
> > > > > > >
> > > > > > > On Mon, Dec 5, 2022 at 10:17 AM Ben Gardon <bgardon@google.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Dec 1, 2022 at 11:57 AM Vipin Sharma <vipinsh@google.com> wrote:
> > > > > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > > > > > index 1782c4555d94..4d59c9d48277 100644
> > > > > > > > > --- a/virt/kvm/kvm_main.c
> > > > > > > > > +++ b/virt/kvm/kvm_main.c
> > > > > > > > > @@ -384,6 +384,11 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
> > > > > > > > >         kvm_arch_guest_memory_reclaimed(kvm);
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +void * __weak kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
> > > > > > > > > +{
> > > > > > > > > +               return (void *)__get_free_page(gfp_flags);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > >
> > > > > > > > Rather than making this __weak, you could use #ifdef CONFIG_NUMA to
> > > > > > > > just put all the code in the arch-neutral function.
> > > > > > > >
> > > > > > >
> > > > > > > I am not sure how it will work. Here, I am trying to keep this feature
> > > > > > > only for x86. This function will be used for all architecture except
> > > > > > > in x86 where we have different implementation in arch/x86/mmu/mmu.c
> > > > > > > So, even if CONFIG_NUMA is defined, we want to keep the same
> > > > > > > definition on other architectures.
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > Something like:
> > > > > >
> > > > > > +void * kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
> > > > > > +{
> > > > > > +       struct page *spt_page;
> > > > > > +       void *address = NULL;
> > > > > > +
> > > > > > +       #ifdef CONFIG_NUMA
> > > > > > +       if (nid != NUMA_NO_NODE) {
> > > > > > +               spt_page = alloc_pages_node(nid, gfp, 0);
> > > > > > +               if (spt_page) {
> > > > > > +                       address = page_address(spt_page);
> > > > > > +                       return address;
> > > > > > +               }
> > > > > > +       }
> > > > > > +       #endif // CONFIG_NUMA
> > > > > > +       return (void *)__get_free_page(gfp);
> > > > > > +}
> > > > > >
> > > > >
> > > > > 'nid' will be 0 not NUMA_NO_NODE for other architectures. In x86, I am
> > > > > explicitly setting kvm_mmu_memory_cache->node to NUMA_NO_NODE or
> > > > > specific desired nodes. In others architectures it will be 0 as struct
> > > > > will be 0 initialized. __weak avoids initializing nid to NUM_NO_NODE
> > > > > in other architectures.
> > > >
> > > > ooh, I see. It might be worth setting it to NUMA_NO_NODE on other
> > > > archs as 0 could be kind of misleading.
> > >
> > > Discussed offline with Ben.
> > > Initialization code for cache is in the respective architectures.
> > > Using "__weak" avoids touching code in other architectures.
>
> But it's still a bit gross to have node=0 in struct
> kvm_mmu_memory_cache for other architectures, even if it doesn't happen
> to be misused in this series.
>
> I would just bite the bullet and modify the other architectures. Do it
> in a precusor patch where you just add node to struct
> kvm_mmu_memory_cache and initialize it to NUMA_NO_NODE across all
> architectures, probably with a common macro e.g.
>
> #define INIT_KVM_MMU_MEMORY_CACHE(_cache) do { \
>         (_cache)->node = NUMA_NO_NODE; \
> } while (0)
>
> Then, you can follow Ben's approach and avoid the __weak function.

Okay, 2 votes for NUMA_NO_NODE and 1 for __weak. I will remove the
__weak and modify the other architectures' code.
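
Putting Ben's and David's suggestions together, the arch-neutral path
would then look roughly like this (a sketch, not the next version of the
patch):

static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
					       gfp_t gfp_flags)
{
	gfp_flags |= mc->gfp_zero;

	if (mc->kmem_cache)
		return kmem_cache_alloc(mc->kmem_cache, gfp_flags);

#ifdef CONFIG_NUMA
	/* mc->node is NUMA_NO_NODE unless an architecture opted in. */
	if (mc->node != NUMA_NO_NODE) {
		struct page *page = alloc_pages_node(mc->node, gfp_flags, 0);

		if (page)
			return page_address(page);
	}
#endif
	return (void *)__get_free_page(gfp_flags);
}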

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages
  2022-12-09  0:27   ` David Matlack
@ 2022-12-09 18:51     ` Vipin Sharma
  0 siblings, 0 replies; 17+ messages in thread
From: Vipin Sharma @ 2022-12-09 18:51 UTC (permalink / raw)
  To: David Matlack; +Cc: bgardon, seanjc, pbonzini, kvm, linux-kernel

On Thu, Dec 8, 2022 at 4:28 PM David Matlack <dmatlack@google.com> wrote:
>
> On Thu, Dec 01, 2022 at 11:57:18AM -0800, Vipin Sharma wrote:
> > Page table pages of a VM are currently allocated based on the current
> > task's NUMA node or its mempolicy. This can cause suboptimal remote
> > accesses by the vCPU if it is accessing physical pages local to its NUMA
> > node but the page table pages mapping those physcal pages were created
> > by some other vCPU which was on different NUMA node or had different
> > policy.
> >
> > Allocate page table pages on the same NUMA node where underlying
> > physical page exists. Page table at level 5, 4, and 3 might not end up
> > on the same NUMA node as they can span multiple NUMA nodes.
> >
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> ...
> > @@ -6284,13 +6326,16 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
> >       gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
> >       level = huge_sp->role.level;
> >       spte = *huge_sptep;
> > +     nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(spte));
> > +     if (nid == NUMA_NO_NODE)
> > +             nid = numa_mem_id();
>
> What do you think about renaming kvm_pfn_to_refcounted_page_nid() to
> kvm_pfn_to_page_table_nid() and having it return numa_mem_id() instead
> of NUMA_NO_NODE (with a comment)? I think that will clean up this patch
> quite a bit by getting rid of all the NUMA_NO_NODE checks.

Yeah, this will clean up this patch. I will make this change in the
next version.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-12-09 18:51 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-01 19:57 [Patch v2 0/2] NUMA aware page table allocation Vipin Sharma
2022-12-01 19:57 ` [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node Vipin Sharma
2022-12-05 18:24   ` Ben Gardon
2022-12-05 18:47     ` Sean Christopherson
2022-12-05 18:51       ` Ben Gardon
2022-12-05 21:07         ` Sean Christopherson
2022-12-08 22:44           ` Vipin Sharma
2022-12-01 19:57 ` [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node of underlying pages Vipin Sharma
2022-12-05 18:17   ` Ben Gardon
2022-12-05 23:40     ` Vipin Sharma
2022-12-06 18:17       ` Ben Gardon
     [not found]         ` <CAHVum0f_6UQvcqWAJxDJyL_LN-6ryAXNuh9xY6nFtLxCOMtoXA@mail.gmail.com>
     [not found]           ` <CANgfPd-XkHPZyFsPe75WbUrufLpKtdr1Neri1JrrApQrjRLRJw@mail.gmail.com>
     [not found]             ` <CAHVum0dkKSY9e90xgfBVBHUqntwJOmONK+TYBXFEwg6acvUrAw@mail.gmail.com>
2022-12-07 19:05               ` Vipin Sharma
2022-12-09  0:06                 ` David Matlack
2022-12-09 18:47                   ` Vipin Sharma
2022-12-09  0:27   ` David Matlack
2022-12-09 18:51     ` Vipin Sharma
2022-12-09  0:21 ` [Patch v2 0/2] NUMA aware page table allocation David Matlack
