* [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
@ 2023-08-10  8:56 Yan Zhao
  2023-08-10  8:57 ` [RFC PATCH v2 1/5] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao
                   ` (6 more replies)
  0 siblings, 7 replies; 47+ messages in thread
From: Yan Zhao @ 2023-08-10  8:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel, kvm
  Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm,
	kevin.tian, david, Yan Zhao

This is an RFC series trying to fix the issue of unnecessary NUMA
protection and TLB-shootdowns found in VMs with assigned devices or VFIO
mediated devices during NUMA balancing.

For VMs with assigned devices or VFIO mediated devices, all or part of
guest memory is pinned long-term.

Auto NUMA balancing periodically selects VMAs of a process and changes
their protection to PROT_NONE, even though some or all pages in the
selected ranges are long-term pinned for DMA, which is the case for VMs
with assigned devices or VFIO mediated devices.

Though this does not cause a real functional problem, because NUMA
migration ultimately rejects migration of those kinds of pages and
restores the PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped
periodically, with identical SPTEs eventually faulted back in, wasting
CPU cycles and generating unnecessary TLB-shootdowns.

This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
to work with the mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
a subscriber of the mmu notifier (e.g. KVM) can know that an invalidation
event is sent specifically for NUMA migration.

Patch 2 skips setting PROT_NONE on long-term pinned pages in the primary
MMU, to avoid the page faults introduced by NUMA protection and the
subsequent restoration of the old huge PMDs/PTEs in the primary MMU.

Patch 3 introduces a new mmu notifier callback .numa_protect(), which
is called in patch 4 when a page is guaranteed to be PROT_NONE protected.

Then in patch 5, KVM recognizes that an .invalidate_range_start()
notification is specific to NUMA balancing and defers the page unmap in the
secondary MMU until the corresponding .numa_protect() arrives.
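
For illustration, below is a minimal sketch of how a secondary-MMU user
could consume these hooks (simplified pseudo-KVM code; the real handling is
in patch 5, and my_defer_unmap()/my_unmap() are just placeholders):

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	/* patch 1: ranges sent for NUMA balancing carry this flag */
	if (range->event == MMU_NOTIFY_PROTECTION_VMA &&
	    (range->flags & MMU_NOTIFIER_RANGE_NUMA)) {
		/* defer the unmap until .numa_protect() (placeholder) */
		my_defer_unmap(mn, range->start, range->end);
		return 0;
	}
	/* all other events: unmap as before (placeholder) */
	my_unmap(mn, range->start, range->end);
	return 0;
}

static void my_numa_protect(struct mmu_notifier *mn, struct mm_struct *mm,
			    unsigned long start, unsigned long end)
{
	/* patches 3-4: called only for pages actually set to PROT_NONE */
	my_unmap(mn, start, end);
}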


Changelog:
RFC v1 --> v2:
1. Added patches 3-4 to introduce a new callback .numa_protect().
2. Rather than have KVM duplicate the logic to check whether a page is
long-term pinned, let KVM depend on the new .numa_protect() callback to do
the page unmap in the secondary MMU for NUMA migration.

RFC v1:
https://lore.kernel.org/all/20230808071329.19995-1-yan.y.zhao@intel.com/ 

Yan Zhao (5):
  mm/mmu_notifier: introduce a new mmu notifier flag
    MMU_NOTIFIER_RANGE_NUMA
  mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate
    purpose
  mm/mmu_notifier: introduce a new callback .numa_protect
  mm/autonuma: call .numa_protect() when page is protected for NUMA
    migrate
  KVM: Unmap pages only when it's indeed protected for NUMA migration

 include/linux/mmu_notifier.h | 16 ++++++++++++++++
 mm/huge_memory.c             |  6 ++++++
 mm/mmu_notifier.c            | 18 ++++++++++++++++++
 mm/mprotect.c                | 10 +++++++++-
 virt/kvm/kvm_main.c          | 25 ++++++++++++++++++++++---
 5 files changed, 71 insertions(+), 4 deletions(-)

-- 
2.17.1



* [RFC PATCH v2 1/5] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA
  2023-08-10  8:56 [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao
@ 2023-08-10  8:57 ` Yan Zhao
  2023-08-10  8:58 ` [RFC PATCH v2 2/5] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose Yan Zhao
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 47+ messages in thread
From: Yan Zhao @ 2023-08-10  8:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel, kvm
  Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm,
	kevin.tian, david, Yan Zhao

Introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA to indicate that
a MMU_NOTIFY_PROTECTION_VMA notification is specifically for NUMA balancing.

With it, a subscriber of the mmu notifier, such as KVM, can recognize this
type of notification and perform NUMA-protection-specific operations in its
handler.
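
For example, a subscriber's .invalidate_range_start() handler can check the
flag as below (an illustrative snippet only; this mirrors what patch 5 does
for KVM):

	bool is_numa = range->event == MMU_NOTIFY_PROTECTION_VMA &&
		       (range->flags & MMU_NOTIFIER_RANGE_NUMA);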

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 include/linux/mmu_notifier.h | 1 +
 mm/mprotect.c                | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e051c3c4..a6dc829a4bce 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -60,6 +60,7 @@ enum mmu_notifier_event {
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
+#define MMU_NOTIFIER_RANGE_NUMA (1 << 1)
 
 struct mmu_notifier_ops {
 	/*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6f658d483704..cb99a7d66467 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -381,7 +381,9 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!range.start) {
 			mmu_notifier_range_init(&range,
-				MMU_NOTIFY_PROTECTION_VMA, 0,
+				MMU_NOTIFY_PROTECTION_VMA,
+				cp_flags & MM_CP_PROT_NUMA ?
+				MMU_NOTIFIER_RANGE_NUMA : 0,
 				vma->vm_mm, addr, end);
 			mmu_notifier_invalidate_range_start(&range);
 		}
-- 
2.17.1



* [RFC PATCH v2 2/5] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose
  2023-08-10  8:56 [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao
  2023-08-10  8:57 ` [RFC PATCH v2 1/5] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao
@ 2023-08-10  8:58 ` Yan Zhao
  2023-08-10  9:00 ` [RFC PATCH v2 3/5] mm/mmu_notifier: introduce a new callback .numa_protect Yan Zhao
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 47+ messages in thread
From: Yan Zhao @ 2023-08-10  8:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel, kvm
  Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm,
	kevin.tian, david, Yan Zhao

Don't set PROT_NONE on exclusive anonymous, maybe-dma-pinned pages for
NUMA migration purposes.

For exclusive anonymous pages that are page_maybe_dma_pinned(), NUMA
migration will eventually give up migrating them in try_to_migrate_one()
(i.e. after -EBUSY is returned by page_try_share_anon_rmap()).

So, skip setting PROT_NONE on those kinds of pages earlier, in the
change_protection_range() phase, to avoid the later futile page faults,
detection, and restoration of the original PTEs/PMDs.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 mm/huge_memory.c | 5 +++++
 mm/mprotect.c    | 5 +++++
 2 files changed, 10 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eb3678360b97..a71cf686e3b2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1875,6 +1875,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			goto unlock;
 
 		page = pmd_page(*pmd);
+
+		if (PageAnon(page) && PageAnonExclusive(page) &&
+		    page_maybe_dma_pinned(page))
+			goto unlock;
+
 		toptier = node_is_toptier(page_to_nid(page));
 		/*
 		 * Skip scanning top tier node if normal numa
diff --git a/mm/mprotect.c b/mm/mprotect.c
index cb99a7d66467..a1f63df34b86 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -146,6 +146,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 				nid = page_to_nid(page);
 				if (target_node == nid)
 					continue;
+
+				if (PageAnon(page) && PageAnonExclusive(page) &&
+				    page_maybe_dma_pinned(page))
+					continue;
+
 				toptier = node_is_toptier(nid);
 
 				/*
-- 
2.17.1



* [RFC PATCH v2 3/5] mm/mmu_notifier: introduce a new callback .numa_protect
  2023-08-10  8:56 [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao
  2023-08-10  8:57 ` [RFC PATCH v2 1/5] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao
  2023-08-10  8:58 ` [RFC PATCH v2 2/5] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose Yan Zhao
@ 2023-08-10  9:00 ` Yan Zhao
  2023-08-10  9:00 ` [RFC PATCH v2 4/5] mm/autonuma: call .numa_protect() when page is protected for NUMA migrate Yan Zhao
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 47+ messages in thread
From: Yan Zhao @ 2023-08-10  9:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel, kvm
  Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm,
	kevin.tian, david, Yan Zhao

The .numa_protect callback is called when PROT_NONE is set for sure on a
PTE or a huge PMD for NUMA migration purposes.

With this callback, a subscriber of the mmu notifier (e.g. KVM) can unmap
only the NUMA-migration-protected pages in its handler, rather than
unmapping a wider range containing pages that are obviously not
NUMA-migratable.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 include/linux/mmu_notifier.h | 15 +++++++++++++++
 mm/mmu_notifier.c            | 18 ++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index a6dc829a4bce..a173db83b071 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -132,6 +132,10 @@ struct mmu_notifier_ops {
 			   unsigned long address,
 			   pte_t pte);
 
+	void (*numa_protect)(struct mmu_notifier *subscription,
+			     struct mm_struct *mm,
+			     unsigned long start,
+			     unsigned long end);
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
 	 * paired and are called only when the mmap_lock and/or the
@@ -395,6 +399,9 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
+extern void __mmu_notifier_numa_protect(struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r,
 				  bool only_end);
@@ -448,6 +455,14 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 		__mmu_notifier_change_pte(mm, address, pte);
 }
 
+static inline void mmu_notifier_numa_protect(struct mm_struct *mm,
+					     unsigned long start,
+					     unsigned long end)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_numa_protect(mm, start, end);
+}
+
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 50c0dde1354f..fc96fbd46e1d 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -382,6 +382,24 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return young;
 }
 
+void __mmu_notifier_numa_protect(struct mm_struct *mm,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct mmu_notifier *subscription;
+	int id;
+
+	id = srcu_read_lock(&srcu);
+	hlist_for_each_entry_rcu(subscription,
+				 &mm->notifier_subscriptions->list, hlist,
+				 srcu_read_lock_held(&srcu)) {
+		if (subscription->ops->numa_protect)
+			subscription->ops->numa_protect(subscription, mm, start,
+							end);
+	}
+	srcu_read_unlock(&srcu, id);
+}
+
 int __mmu_notifier_clear_young(struct mm_struct *mm,
 			       unsigned long start,
 			       unsigned long end)
-- 
2.17.1



* [RFC PATCH v2 4/5] mm/autonuma: call .numa_protect() when page is protected for NUMA migrate
  2023-08-10  8:56 [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao
                   ` (2 preceding siblings ...)
  2023-08-10  9:00 ` [RFC PATCH v2 3/5] mm/mmu_notifier: introduce a new callback .numa_protect Yan Zhao
@ 2023-08-10  9:00 ` Yan Zhao
  2023-08-11 18:52   ` Nadav Amit
  2023-08-10  9:02 ` [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration Yan Zhao
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2023-08-10  9:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel, kvm
  Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm,
	kevin.tian, david, Yan Zhao

Call the mmu notifier callback .numa_protect() beneath change_pmd_range()
(i.e. in change_huge_pmd() and change_pte_range()) when a page is guaranteed
to be protected with PROT_NONE for NUMA migration purposes.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 mm/huge_memory.c | 1 +
 mm/mprotect.c    | 1 +
 2 files changed, 2 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a71cf686e3b2..8ae56507da12 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1892,6 +1892,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
 		    !toptier)
 			xchg_page_access_time(page, jiffies_to_msecs(jiffies));
+		mmu_notifier_numa_protect(vma->vm_mm, addr, addr + PMD_SIZE);
 	}
 	/*
 	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a1f63df34b86..c401814b2992 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -164,6 +164,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 				    !toptier)
 					xchg_page_access_time(page,
 						jiffies_to_msecs(jiffies));
+				mmu_notifier_numa_protect(vma->vm_mm, addr, addr + PAGE_SIZE);
 			}
 
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
-- 
2.17.1



* [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-10  8:56 [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao
                   ` (3 preceding siblings ...)
  2023-08-10  9:00 ` [RFC PATCH v2 4/5] mm/autonuma: call .numa_protect() when page is protected for NUMA migrate Yan Zhao
@ 2023-08-10  9:02 ` Yan Zhao
  2023-08-10 13:16   ` bibo mao
  2023-08-10  9:34 ` [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM David Hildenbrand
  2023-08-10 13:58 ` Chao Gao
  6 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2023-08-10  9:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel, kvm
  Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm,
	kevin.tian, david, Yan Zhao

Register a .numa_protect() callback with the mmu notifier so that KVM can
get accurate information about when a page is PROT_NONE protected in the
primary MMU, and unmap it in the secondary MMU accordingly.

In KVM's .invalidate_range_start() handler, if the event notifies that the
range may be protected with PROT_NONE for NUMA migration, don't do the
unmapping in the secondary MMU. Hold off until .numa_protect() arrives.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfbaafbe3a00..907444a1761b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -711,6 +711,20 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
 }
 
+static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
+					  struct mm_struct *mm,
+					  unsigned long start,
+					  unsigned long end)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+
+	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
+	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
+		return;
+
+	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
+}
+
 void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 			      unsigned long end)
 {
@@ -744,14 +758,18 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	bool is_numa = (range->event == MMU_NOTIFY_PROTECTION_VMA) &&
+		       (range->flags & MMU_NOTIFIER_RANGE_NUMA);
 	const struct kvm_hva_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
 		.pte		= __pte(0),
-		.handler	= kvm_unmap_gfn_range,
+		.handler	= !is_numa ? kvm_unmap_gfn_range :
+				  (void *)kvm_null_fn,
 		.on_lock	= kvm_mmu_invalidate_begin,
-		.on_unlock	= kvm_arch_guest_memory_reclaimed,
-		.flush_on_ret	= true,
+		.on_unlock	= !is_numa ? kvm_arch_guest_memory_reclaimed :
+				  (void *)kvm_null_fn,
+		.flush_on_ret	= !is_numa ? true : false,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
 
@@ -899,6 +917,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.clear_young		= kvm_mmu_notifier_clear_young,
 	.test_young		= kvm_mmu_notifier_test_young,
 	.change_pte		= kvm_mmu_notifier_change_pte,
+	.numa_protect		= kvm_mmu_notifier_numa_protect,
 	.release		= kvm_mmu_notifier_release,
 };
 
-- 
2.17.1



* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-10  8:56 [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao
                   ` (4 preceding siblings ...)
  2023-08-10  9:02 ` [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration Yan Zhao
@ 2023-08-10  9:34 ` David Hildenbrand
  2023-08-10  9:50   ` Yan Zhao
  2023-08-10 13:58 ` Chao Gao
  6 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2023-08-10  9:34 UTC (permalink / raw)
  To: Yan Zhao, linux-mm, linux-kernel, kvm
  Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian

On 10.08.23 10:56, Yan Zhao wrote:
> This is an RFC series trying to fix the issue of unnecessary NUMA
> protection and TLB-shootdowns found in VMs with assigned devices or VFIO
> mediated devices during NUMA balance.
> 
> For VMs with assigned devices or VFIO mediated devices, all or part of
> guest memory are pinned for long-term.
> 
> Auto NUMA balancing will periodically selects VMAs of a process and change
> protections to PROT_NONE even though some or all pages in the selected
> ranges are long-term pinned for DMAs, which is true for VMs with assigned
> devices or VFIO mediated devices.
> 
> Though this will not cause real problem because NUMA migration will
> ultimately reject migration of those kind of pages and restore those
> PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically
> with equal SPTEs finally faulted back, wasting CPU cycles and generating
> unnecessary TLB-shootdowns.
> 
> This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
> to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
> the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation
> event is sent for NUMA migration purpose in specific.
> 
> Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
> MMU to avoid NUMA protection introduced page faults and restoration of old
> huge PMDs/PTEs in primary MMU.
> 
> Patch 3 introduces a new mmu notifier callback .numa_protect(), which
> will be called in patch 4 when a page is ensured to be PROT_NONE protected.
> 
> Then in patch 5, KVM can recognize a .invalidate_range_start() notification
> is for NUMA balancing specific and do not do the page unmap in secondary
> MMU until .numa_protect() comes.
> 

Why do we need all that, when we should simply not be applying PROT_NONE 
to pinned pages?

In change_pte_range() we already have:

if (is_cow_mapping(vma->vm_flags) &&
     page_count(page) != 1)

Which includes both shared and pinned pages.

Staring at page #2, are we still missing something similar for THPs?

Why is that MMU notifier thingy and touching KVM code required?

-- 
Cheers,

David / dhildenb



* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-10  9:34 ` [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM David Hildenbrand
@ 2023-08-10  9:50   ` Yan Zhao
  2023-08-11 17:25     ` David Hildenbrand
  0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2023-08-10  9:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian

On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote:
> > This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
> > to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
> > the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation
> > event is sent for NUMA migration purpose in specific.
> > 
> > Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
> > MMU to avoid NUMA protection introduced page faults and restoration of old
> > huge PMDs/PTEs in primary MMU.
> > 
> > Patch 3 introduces a new mmu notifier callback .numa_protect(), which
> > will be called in patch 4 when a page is ensured to be PROT_NONE protected.
> > 
> > Then in patch 5, KVM can recognize a .invalidate_range_start() notification
> > is for NUMA balancing specific and do not do the page unmap in secondary
> > MMU until .numa_protect() comes.
> > 
> 
> Why do we need all that, when we should simply not be applying PROT_NONE to
> pinned pages?
> 
> In change_pte_range() we already have:
> 
> if (is_cow_mapping(vma->vm_flags) &&
>     page_count(page) != 1)
> 
> Which includes both, shared and pinned pages.
Ah, right. Currently, on my side, I don't see any pinned pages falling
outside of this condition.
But I have a question regarding is_cow_mapping(vma->vm_flags): do we
need to allow pinned pages in !is_cow_mapping(vma->vm_flags)?

> Staring at page #2, are we still missing something similar for THPs?
Yes.

> Why is that MMU notifier thingy and touching KVM code required?
Because the NUMA balancing code first sends .invalidate_range_start() with
event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
unconditionally, before it goes down into change_pte_range() and
change_huge_pmd() to check each page's count and apply PROT_NONE.

Then current KVM will unmap all notified pages from the secondary MMU
in .invalidate_range_start(), which could include pages that are ultimately
not set to PROT_NONE in the primary MMU.

For VMs with pass-through devices, though all guest pages are pinned,
KVM still periodically unmaps pages in response to the
.invalidate_range_start() notification from auto NUMA balancing, which
is a waste.

So, if a new callback is sent only when a page is actually set to PROT_NONE
for NUMA migration, KVM can unmap only those pages.
As KVM still needs to unmap pages for other types of events in its
.invalidate_range_start() handler (i.e. kvm_mmu_notifier_invalidate_range_start()),
and MMU_NOTIFY_PROTECTION_VMA also covers other reasons, patch 1 adds a
range flag to help KVM avoid doing a blind unmap in
.invalidate_range_start(), and do it in the new .numa_protect() handler
instead.
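
To summarize, the intended flow with this series applied is roughly as
below (an illustrative sketch, not actual code):

	change_pmd_range()
	  mmu_notifier_invalidate_range_start()  /* flagged MMU_NOTIFIER_RANGE_NUMA,
						    so KVM defers the unmap (patch 5) */
	  change_huge_pmd() / change_pte_range()
	    /* skip exclusive-anon maybe-dma-pinned pages (patch 2) */
	    mmu_notifier_numa_protect()          /* KVM unmaps just the protected
						    page (patches 3-4) */
	  mmu_notifier_invalidate_range_end()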

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 
> 


* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-10  9:02 ` [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration Yan Zhao
@ 2023-08-10 13:16   ` bibo mao
  2023-08-11  3:45     ` Yan Zhao
  0 siblings, 1 reply; 47+ messages in thread
From: bibo mao @ 2023-08-10 13:16 UTC (permalink / raw)
  To: Yan Zhao, linux-mm, linux-kernel, kvm
  Cc: pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm,
	kevin.tian, david



On 2023/8/10 17:02, Yan Zhao wrote:
> Register to .numa_protect() callback in mmu notifier so that KVM can get
> acurate information about when a page is PROT_NONE protected in primary
> MMU and unmap it in secondary MMU accordingly.
> 
> In KVM's .invalidate_range_start() handler, if the event is to notify that
> the range may be protected to PROT_NONE for NUMA migration purpose,
> don't do the unmapping in secondary MMU. Hold on until.numa_protect()
> comes.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dfbaafbe3a00..907444a1761b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -711,6 +711,20 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
>  }
>  
> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> +					  struct mm_struct *mm,
> +					  unsigned long start,
> +					  unsigned long end)
> +{
> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +
> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> +		return;
> +
> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> +}
NUMA balancing scans a wide memory range, and there will be a single IPI
notification via kvm_flush_remote_tlbs(). With page-level notification,
it may bring about lots of remote TLB flush IPI notifications.

However, for NUMA balancing notifications, the VM's PMD table may not need
to be freed in kvm_unmap_gfn_range().

Regards
Bibo Mao
> +
>  void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>  			      unsigned long end)
>  {
> @@ -744,14 +758,18 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  					const struct mmu_notifier_range *range)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +	bool is_numa = (range->event == MMU_NOTIFY_PROTECTION_VMA) &&
> +		       (range->flags & MMU_NOTIFIER_RANGE_NUMA);
>  	const struct kvm_hva_range hva_range = {
>  		.start		= range->start,
>  		.end		= range->end,
>  		.pte		= __pte(0),
> -		.handler	= kvm_unmap_gfn_range,
> +		.handler	= !is_numa ? kvm_unmap_gfn_range :
> +				  (void *)kvm_null_fn,
>  		.on_lock	= kvm_mmu_invalidate_begin,
> -		.on_unlock	= kvm_arch_guest_memory_reclaimed,
> -		.flush_on_ret	= true,
> +		.on_unlock	= !is_numa ? kvm_arch_guest_memory_reclaimed :
> +				  (void *)kvm_null_fn,
> +		.flush_on_ret	= !is_numa ? true : false,
>  		.may_block	= mmu_notifier_range_blockable(range),
>  	};
>  
> @@ -899,6 +917,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
>  	.clear_young		= kvm_mmu_notifier_clear_young,
>  	.test_young		= kvm_mmu_notifier_test_young,
>  	.change_pte		= kvm_mmu_notifier_change_pte,
> +	.numa_protect		= kvm_mmu_notifier_numa_protect,
>  	.release		= kvm_mmu_notifier_release,
>  };
>  



* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-10  8:56 [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao
                   ` (5 preceding siblings ...)
  2023-08-10  9:34 ` [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM David Hildenbrand
@ 2023-08-10 13:58 ` Chao Gao
  2023-08-11  5:22   ` Yan Zhao
  6 siblings, 1 reply; 47+ messages in thread
From: Chao Gao @ 2023-08-10 13:58 UTC (permalink / raw)
  To: Yan Zhao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, david

On Thu, Aug 10, 2023 at 04:56:36PM +0800, Yan Zhao wrote:
>This is an RFC series trying to fix the issue of unnecessary NUMA
>protection and TLB-shootdowns found in VMs with assigned devices or VFIO
>mediated devices during NUMA balance.
>
>For VMs with assigned devices or VFIO mediated devices, all or part of
>guest memory are pinned for long-term.
>
>Auto NUMA balancing will periodically selects VMAs of a process and change
>protections to PROT_NONE even though some or all pages in the selected
>ranges are long-term pinned for DMAs, which is true for VMs with assigned
>devices or VFIO mediated devices.
>
>Though this will not cause real problem because NUMA migration will
>ultimately reject migration of those kind of pages and restore those
>PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically
>with equal SPTEs finally faulted back, wasting CPU cycles and generating
>unnecessary TLB-shootdowns.

In my understanding, NUMA balancing also moves tasks closer to the memory
they are accessing. Can this still work with this series applied?

>
>This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
>to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
>the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation
>event is sent for NUMA migration purpose in specific.
>
>Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
>MMU to avoid NUMA protection introduced page faults and restoration of old
>huge PMDs/PTEs in primary MMU.
>
>Patch 3 introduces a new mmu notifier callback .numa_protect(), which
>will be called in patch 4 when a page is ensured to be PROT_NONE protected.
>
>Then in patch 5, KVM can recognize a .invalidate_range_start() notification
>is for NUMA balancing specific and do not do the page unmap in secondary
>MMU until .numa_protect() comes.
>
>
>Changelog:
>RFC v1 --> v2:
>1. added patch 3-4 to introduce a new callback .numa_protect()
>2. Rather than have KVM duplicate logic to check if a page is pinned for
>long-term, let KVM depend on the new callback .numa_protect() to do the
>page unmap in secondary MMU for NUMA migration purpose.
>
>RFC v1:
>https://lore.kernel.org/all/20230808071329.19995-1-yan.y.zhao@intel.com/ 
>
>Yan Zhao (5):
>  mm/mmu_notifier: introduce a new mmu notifier flag
>    MMU_NOTIFIER_RANGE_NUMA
>  mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate
>    purpose
>  mm/mmu_notifier: introduce a new callback .numa_protect
>  mm/autonuma: call .numa_protect() when page is protected for NUMA
>    migrate
>  KVM: Unmap pages only when it's indeed protected for NUMA migration
>
> include/linux/mmu_notifier.h | 16 ++++++++++++++++
> mm/huge_memory.c             |  6 ++++++
> mm/mmu_notifier.c            | 18 ++++++++++++++++++
> mm/mprotect.c                | 10 +++++++++-
> virt/kvm/kvm_main.c          | 25 ++++++++++++++++++++++---
> 5 files changed, 71 insertions(+), 4 deletions(-)
>
>-- 
>2.17.1
>


* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-10 13:16   ` bibo mao
@ 2023-08-11  3:45     ` Yan Zhao
  2023-08-11  7:40       ` bibo mao
  0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2023-08-11  3:45 UTC (permalink / raw)
  To: bibo mao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, david

> > +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > +					  struct mm_struct *mm,
> > +					  unsigned long start,
> > +					  unsigned long end)
> > +{
> > +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > +
> > +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > +		return;
> > +
> > +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > +}
> numa balance will scan wide memory range, and there will be one time
Though the scanned memory range is wide, .invalidate_range_start() is sent
for each 2M range.

> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> it may bring out lots of flush remote tlb ipi notification.

Hmm, for VMs with assigned devices, apparently, the remote TLB flush IPIs
will be reduced to 0 with this series.

For VMs without assigned devices or mdev devices, I was previously also
worried that there might be more IPIs.
But with the current test data, there are no more remote TLB IPIs on average.

The reason is below:

Before this series, kvm_unmap_gfn_range() is called once for a 2M
range.
After this series, kvm_unmap_gfn_range() is called once if the 2M range is
mapped to a huge page in the primary MMU, and at most 512 times if it is
mapped to 4K pages in the primary MMU.


Though kvm_unmap_gfn_range() is only called once before this series,
as the range is blockable, when there are contentions, remote TLB IPIs
can be sent page by page at 4K granularity (in tdp_mmu_iter_cond_resched())
if the pages are mapped as 4K in the secondary MMU.

With this series, on the other hand, .numa_protect() sets the range to be
non-blockable. So there could be fewer remote TLB IPIs when a 2M range is
mapped into small PTEs in the secondary MMU.
Besides, .numa_protect() is not sent for all pages in a given 2M range.

Below is my testing data on a VM without assigned devices.
The data is an average over 10 guest boot-ups.
                   
    data           | numa balancing caused  | numa balancing caused    
  on average       | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
-------------------|------------------------|--------------------------
before this series |         35             |     8625                 
after  this series |      10037             |     4610                 

For a single guest bootup,
                   | numa balancing caused  | numa balancing caused    
    best  data     | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
-------------------|------------------------|--------------------------
before this series |         28             |       13                  
after  this series |        406             |      195                  

                   | numa balancing caused  | numa balancing caused    
   worst  data     | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
-------------------|------------------------|--------------------------
before this series |         44             |    43920               
after  this series |      17352             |     8668                 


> 
> however numa balance notification, pmd table of vm maybe needs not be freed
> in kvm_unmap_gfn_range.
> 
 


* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-10 13:58 ` Chao Gao
@ 2023-08-11  5:22   ` Yan Zhao
  0 siblings, 0 replies; 47+ messages in thread
From: Yan Zhao @ 2023-08-11  5:22 UTC (permalink / raw)
  To: Chao Gao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, david

On Thu, Aug 10, 2023 at 09:58:43PM +0800, Chao Gao wrote:
> On Thu, Aug 10, 2023 at 04:56:36PM +0800, Yan Zhao wrote:
> >This is an RFC series trying to fix the issue of unnecessary NUMA
> >protection and TLB-shootdowns found in VMs with assigned devices or VFIO
> >mediated devices during NUMA balance.
> >
> >For VMs with assigned devices or VFIO mediated devices, all or part of
> >guest memory are pinned for long-term.
> >
> >Auto NUMA balancing will periodically selects VMAs of a process and change
> >protections to PROT_NONE even though some or all pages in the selected
> >ranges are long-term pinned for DMAs, which is true for VMs with assigned
> >devices or VFIO mediated devices.
> >
> >Though this will not cause real problem because NUMA migration will
> >ultimately reject migration of those kind of pages and restore those
> >PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically
> >with equal SPTEs finally faulted back, wasting CPU cycles and generating
> >unnecessary TLB-shootdowns.
> 
> In my understanding, NUMA balancing also moves tasks closer to the memory
> they are accessing. Can this still work with this series applied?
> 
For pages protected with PROT_NONE in the primary MMU during the scanning
phase, yes; for pages not set to PROT_NONE, no.
That is because task_numa_migrate() looks to be triggered only at the next
page fault, when a PROT_NONE but accessible VMA is found.


* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-11  3:45     ` Yan Zhao
@ 2023-08-11  7:40       ` bibo mao
  2023-08-11  8:01         ` Yan Zhao
  0 siblings, 1 reply; 47+ messages in thread
From: bibo mao @ 2023-08-11  7:40 UTC (permalink / raw)
  To: Yan Zhao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, david



On 2023/8/11 11:45, Yan Zhao wrote:
>>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
>>> +					  struct mm_struct *mm,
>>> +					  unsigned long start,
>>> +					  unsigned long end)
>>> +{
>>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>>> +
>>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
>>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
>>> +		return;
>>> +
>>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
>>> +}
>> numa balance will scan wide memory range, and there will be one time
> Though scanning memory range is wide, .invalidate_range_start() is sent
> for each 2M range.
Yes, the range is of huge page size when changing NUMA protection during NUMA scanning.

> 
>> ipi notification with kvm_flush_remote_tlbs. With page level notification,
>> it may bring out lots of flush remote tlb ipi notification.
> 
> Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> will be reduced to 0 with this series.
> 
> For VMs without assigned devices or mdev devices, I was previously also
> worried about that there might be more IPIs.
> But with current test data, there's no more remote tlb IPIs on average.
> 
> The reason is below:
> 
> Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> range.
> After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> mapped to a huge page in primary MMU, and called for at most 512 times
> if mapped to 4K pages in primary MMU.
> 
> 
> Though kvm_unmap_gfn_range() is only called once before this series,
> as the range is blockable, when there're contentions, remote tlb IPIs
> can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
I do not know much about x86. Does this always happen, or only when a
reschedule is needed from the code, so that there can be many TLB IPIs
within a single call to kvm_unmap_gfn_range()?

> if the pages are mapped in 4K in secondary MMU.
> 
> With this series, on the other hand, .numa_protect() sets range to be
> unblockable. So there could be less remote tlb IPIs when a 2M range is
> mapped into small PTEs in secondary MMU.
> Besides, .numa_protect() is not sent for all pages in a given 2M range.
No, .numa_protect() is not sent for all pages. It depends on the workload,
i.e. whether the page is accessed by different CPU threads across nodes.

> 
> Below is my testing data on a VM without assigned devices:
> The data is an average of 10 times guest boot-up.
>                    
>     data           | numa balancing caused  | numa balancing caused    
>   on average       | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
> -------------------|------------------------|--------------------------
> before this series |         35             |     8625                 
> after  this series |      10037             |     4610   
Just to be cautious: before the series, there are 8625/35 = 246 TLB flush
IPI operations during a single kvm_unmap_gfn_range() call. Is that
x86-specific or generic?

By the way, are the primary MMU and secondary MMU both using 4K small pages
"on average"?

Regards
Bibo Mao
              
> 
> For a single guest bootup,
>                    | numa balancing caused  | numa balancing caused    
>     best  data     | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
> -------------------|------------------------|--------------------------
> before this series |         28             |       13                  
> after  this series |        406             |      195                  
> 
>                    | numa balancing caused  | numa balancing caused    
>    worst  data     | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
> -------------------|------------------------|--------------------------
> before this series |         44             |    43920               
> after  this series |      17352             |     8668                 

> 
> 
>>
>> however numa balance notification, pmd table of vm maybe needs not be freed
>> in kvm_unmap_gfn_range.
>>
>  



* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-11  7:40       ` bibo mao
@ 2023-08-11  8:01         ` Yan Zhao
  2023-08-11 17:14           ` Sean Christopherson
  0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2023-08-11  8:01 UTC (permalink / raw)
  To: bibo mao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, david

On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> 
> 
> 在 2023/8/11 11:45, Yan Zhao 写道:
> >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> >>> +					  struct mm_struct *mm,
> >>> +					  unsigned long start,
> >>> +					  unsigned long end)
> >>> +{
> >>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >>> +
> >>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> >>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> >>> +		return;
> >>> +
> >>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> >>> +}
> >> numa balance will scan wide memory range, and there will be one time
> > Though scanning memory range is wide, .invalidate_range_start() is sent
> > for each 2M range.
> yes, range is huge page size when changing numa protection during numa scanning.
> 
> > 
> >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> >> it may bring out lots of flush remote tlb ipi notification.
> > 
> > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > will be reduced to 0 with this series.
> > 
> > For VMs without assigned devices or mdev devices, I was previously also
> > worried about that there might be more IPIs.
> > But with current test data, there's no more remote tlb IPIs on average.
> > 
> > The reason is below:
> > 
> > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > range.
> > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > mapped to a huge page in primary MMU, and called for at most 512 times
> > if mapped to 4K pages in primary MMU.
> > 
> > 
> > Though kvm_unmap_gfn_range() is only called once before this series,
> > as the range is blockable, when there're contentions, remote tlb IPIs
> > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> I do not know much about x86, does this happen always or only need reschedule
Ah, sorry, I missed platforms other than x86.
Maybe there will be a big difference on other platforms.

> from code?  so that there will be many times of tlb IPIs in only once function
Only when the MMU lock is contended. But that is not rare, because of the
contention from TDP page faults.
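
For reference, a trimmed sketch of the tdp_mmu_iter_cond_resched() behavior
I'm referring to (x86 TDP MMU, reproduced from memory, so illustrative
rather than exact):

	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
		/* flush whatever has been zapped so far before yielding */
		if (flush)
			kvm_flush_remote_tlbs(kvm);

		if (shared)
			cond_resched_rwlock_read(&kvm->mmu_lock);
		else
			cond_resched_rwlock_write(&kvm->mmu_lock);
		/* ...then restart the iterator after reacquiring the lock... */
	}

So with a blockable range, a single kvm_unmap_gfn_range() call can flush
(and send IPIs) many times if the lock keeps being contended.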

> call about kvm_unmap_gfn_range.
> 
> > if the pages are mapped in 4K in secondary MMU.
> > 
> > With this series, on the other hand, .numa_protect() sets range to be
> > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > mapped into small PTEs in secondary MMU.
> > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> No, .numa_protect() is not sent for all pages. It depends on the workload,
> whether the page is accessed for different cpu threads cross-nodes.
In patch 4, .numa_protect() is called only when PROT_NONE is set on the
page.

> 
> > 
> > Below is my testing data on a VM without assigned devices:
> > The data is an average of 10 times guest boot-up.
> >                    
> >     data           | numa balancing caused  | numa balancing caused    
> >   on average       | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
> > -------------------|------------------------|--------------------------
> > before this series |         35             |     8625                 
> > after  this series |      10037             |     4610   
> just be cautious, before the series there are  8625/35 = 246 IPI tlb flush ops
> during one time kvm_unmap_gfn_range, is that x86 specific or generic? 
Only on x86. Didn't test on other platforms.

> 
> By the way are primary mmu and secondary mmu both 4K small page size "on average"?
No. 4K and 2M combined in both.




* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-11  8:01         ` Yan Zhao
@ 2023-08-11 17:14           ` Sean Christopherson
  2023-08-11 17:18             ` Jason Gunthorpe
  2023-08-14  6:52             ` Yan Zhao
  0 siblings, 2 replies; 47+ messages in thread
From: Sean Christopherson @ 2023-08-11 17:14 UTC (permalink / raw)
  To: Yan Zhao
  Cc: bibo mao, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, david

On Fri, Aug 11, 2023, Yan Zhao wrote:
> On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> > 
> > On 2023/8/11 11:45, Yan Zhao wrote:
> > >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > >>> +					  struct mm_struct *mm,
> > >>> +					  unsigned long start,
> > >>> +					  unsigned long end)
> > >>> +{
> > >>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > >>> +
> > >>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > >>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > >>> +		return;
> > >>> +
> > >>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > >>> +}
> > >> numa balance will scan wide memory range, and there will be one time
> > > Though scanning memory range is wide, .invalidate_range_start() is sent
> > > for each 2M range.
> > yes, range is huge page size when changing numa protection during numa scanning.
> > 
> > > 
> > >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> > >> it may bring out lots of flush remote tlb ipi notification.
> > > 
> > > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > > will be reduced to 0 with this series.
> > > 
> > > For VMs without assigned devices or mdev devices, I was previously also
> > > worried about that there might be more IPIs.
> > > But with current test data, there's no more remote tlb IPIs on average.
> > > 
> > > The reason is below:
> > > 
> > > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > > range.

No, it's potentially called once per 1GiB range.  change_pmd_range() invokes the
mmu_notifier with addr+end, where "end" is the end of the range covered by the
PUD, not the end of the current PMD.  So the worst case scenario would be a
256k increase.  Of course, if you have to migrate an entire 1GiB chunk of memory
then you likely have bigger problems, but still.

> > > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > > mapped to a huge page in primary MMU, and called for at most 512 times
> > > if mapped to 4K pages in primary MMU.
> > > 
> > > 
> > > Though kvm_unmap_gfn_range() is only called once before this series,
> > > as the range is blockable, when there're contentions, remote tlb IPIs
> > > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> > I do not know much about x86, does this happen always or only need reschedule
> Ah, sorry, I missed platforms other than x86.
> Maybe there will be a big difference in other platforms.
> 
> > from code?  so that there will be many times of tlb IPIs in only once function
> Only when MMU lock is contended. But it's not seldom because of the contention in
> TDP page fault.

No?  I don't see how mmu_lock contention would affect the number of calls to 
kvm_flush_remote_tlbs().  If vCPUs are running and not generating faults, i.e.
not trying to access the range in question, then every zap will generate a remote
TLB flush and thus send IPIs to all running vCPUs.

> > call about kvm_unmap_gfn_range.
> > 
> > > if the pages are mapped in 4K in secondary MMU.
> > > 
> > > With this series, on the other hand, .numa_protect() sets range to be
> > > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > > mapped into small PTEs in secondary MMU.
> > > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> > No, .numa_protect() is not sent for all pages. It depends on the workload,
> > whether the page is accessed for different cpu threads cross-nodes.
> The .numa_protect() is called in patch 4 only when PROT_NONE is set to
> the page.

I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA.  It's too much of a one-off,
and losing the batching of invalidations makes me nervous.  As Bibo points out,
the behavior will vary based on the workload, VM configuration, etc.

There's also a *very* subtle change, in that the notification will be sent while
holding the PMD/PTE lock.  Taking KVM's mmu_lock under that is *probably* ok, but
I'm not exactly 100% confident on that.  And the only reason there isn't a more
obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
won't yield if there's mmu_lock contention.

However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
I think we can achieve what you want without losing batching, and without changing
secondary MMUs.

Rather than muck with the notification types and add a one-off for NUMA, just
defer the notification until a present PMD/PTE is actually going to be modified.
It's not the prettiest, but other than the locking, I don't think it has any chance
of regressing other workloads/configurations.

Note, I'm assuming secondary MMUs aren't allowed to map swap entries...

Compile tested only.

From: Sean Christopherson <seanjc@google.com>
Date: Fri, 11 Aug 2023 10:03:36 -0700
Subject: [PATCH] tmp

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/huge_mm.h |  4 +++-
 include/linux/mm.h      |  3 +++
 mm/huge_memory.c        |  5 ++++-
 mm/mprotect.c           | 47 +++++++++++++++++++++++++++++------------
 4 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 20284387b841..b88316adaaad 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -7,6 +7,8 @@
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 
+struct mmu_notifier_range;
+
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
@@ -38,7 +40,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		   unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
 int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
-		    unsigned long cp_flags);
+		    unsigned long cp_flags, struct mmu_notifier_range *range);
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..284f61cf9c37 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2478,6 +2478,9 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
 	return !!(vma->vm_flags & VM_WRITE);
 
 }
+
+void change_pmd_range_notify_secondary_mmus(unsigned long addr,
+					    struct mmu_notifier_range *range);
 bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 extern long change_protection(struct mmu_gather *tlb,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a71cf686e3b2..47c7712b163e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1808,7 +1808,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  */
 int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
-		    unsigned long cp_flags)
+		    unsigned long cp_flags, struct mmu_notifier_range *range)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
@@ -1893,6 +1893,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    !toptier)
 			xchg_page_access_time(page, jiffies_to_msecs(jiffies));
 	}
+
+	change_pmd_range_notify_secondary_mmus(addr, range);
+
 	/*
 	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
 	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
diff --git a/mm/mprotect.c b/mm/mprotect.c
index d1a809167f49..f5844adbe0cb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -82,7 +82,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 
 static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
-		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags,
+		struct mmu_notifier_range *range)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
@@ -164,8 +165,12 @@ static long change_pte_range(struct mmu_gather *tlb,
 				    !toptier)
 					xchg_page_access_time(page,
 						jiffies_to_msecs(jiffies));
+
+
 			}
 
+			change_pmd_range_notify_secondary_mmus(addr, range);
+
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
 			ptent = pte_modify(oldpte, newprot);
 
@@ -355,6 +360,17 @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
 		err;							\
 	})
 
+void change_pmd_range_notify_secondary_mmus(unsigned long addr,
+					    struct mmu_notifier_range *range)
+{
+	if (range->start)
+		return;
+
+	VM_WARN_ON(addr >= range->end);
+	range->start = addr;
+	mmu_notifier_invalidate_range_start_nonblock(range);
+}
+
 static inline long change_pmd_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -365,7 +381,14 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 	unsigned long nr_huge_updates = 0;
 	struct mmu_notifier_range range;
 
-	range.start = 0;
+	/*
+	 * Defer invalidation of secondary MMUs until a PMD/PTE change is
+	 * imminent, e.g. NUMA balancing in particular can "fail" for certain
+	 * types of mappings.  Initialize range.start to '0' and use it to
+	 * track whether or not the invalidation notification has been set.
+	 */
+	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
+				vma->vm_mm, 0, end);
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -383,18 +406,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_none(*pmd))
 			goto next;
 
-		/* invoke the mmu notifier if the pmd is populated */
-		if (!range.start) {
-			mmu_notifier_range_init(&range,
-				MMU_NOTIFY_PROTECTION_VMA, 0,
-				vma->vm_mm, addr, end);
-			mmu_notifier_invalidate_range_start(&range);
-		}
-
 		_pmd = pmdp_get_lockless(pmd);
 		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
 			    pgtable_split_needed(vma, cp_flags)) {
+				/*
+				 * FIXME: __split_huge_pmd() performs its own
+				 * mmu_notifier invalidation prior to clearing
+				 * the PMD, ideally all invalidations for the
+				 * range would be batched.
+				 */
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 				/*
 				 * For file-backed, the pmd could have been
@@ -407,8 +428,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 					break;
 				}
 			} else {
-				ret = change_huge_pmd(tlb, vma, pmd,
-						addr, newprot, cp_flags);
+				ret = change_huge_pmd(tlb, vma, pmd, addr,
+						      newprot, cp_flags, &range);
 				if (ret) {
 					if (ret == HPAGE_PMD_NR) {
 						pages += HPAGE_PMD_NR;
@@ -423,7 +444,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		}
 
 		ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
-				       cp_flags);
+				       cp_flags, &range);
 		if (ret < 0)
 			goto again;
 		pages += ret;

base-commit: 1f40f634009556c119974cafce4c7b2f9b8c58ad
-- 




* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-11 17:14           ` Sean Christopherson
@ 2023-08-11 17:18             ` Jason Gunthorpe
  2023-08-14  6:52             ` Yan Zhao
  1 sibling, 0 replies; 47+ messages in thread
From: Jason Gunthorpe @ 2023-08-11 17:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yan Zhao, bibo mao, linux-mm, linux-kernel, kvm, pbonzini,
	mike.kravetz, apopple, rppt, akpm, kevin.tian, david

> However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> I think we can achieve what you want without losing batching, and without changing
> secondary MMUs.

No, this is not legal.

> +void change_pmd_range_notify_secondary_mmus(unsigned long addr,
> +					    struct mmu_notifier_range *range)
> +{
> +	if (range->start)
> +		return;
> +
> +	VM_WARN_ON(addr >= range->end);
> +	range->start = addr;
> +	mmu_notifier_invalidate_range_start_nonblock(range);
> +}

The 'nonblock' entry point is advisory; if it fails, the caller must
not change the mm.
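
For illustration, the calling pattern this implies is roughly the following
(a sketch only, modeled on how the OOM reaper uses it, not a drop-in fix for
this patch):

	/*
	 * mmu_notifier_invalidate_range_start_nonblock() may fail if a
	 * subscriber cannot run without blocking; in that case the caller
	 * must back off without touching the mm at all.
	 */
	if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
		/* a notifier refused to run non-blocking; leave the mm alone */
		return -EAGAIN;
	}

	/* ... modify the mm ... */

	mmu_notifier_invalidate_range_end(&range);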

Jason

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-10  9:50   ` Yan Zhao
@ 2023-08-11 17:25     ` David Hildenbrand
  2023-08-11 18:20       ` John Hubbard
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2023-08-11 17:25 UTC (permalink / raw)
  To: Yan Zhao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, John Hubbard

On 10.08.23 11:50, Yan Zhao wrote:
> On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote:
>>> This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
>>> to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
>>> the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation
>>> event is sent for NUMA migration purpose in specific.
>>>
>>> Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
>>> MMU to avoid NUMA protection introduced page faults and restoration of old
>>> huge PMDs/PTEs in primary MMU.
>>>
>>> Patch 3 introduces a new mmu notifier callback .numa_protect(), which
>>> will be called in patch 4 when a page is ensured to be PROT_NONE protected.
>>>
>>> Then in patch 5, KVM can recognize a .invalidate_range_start() notification
>>> is for NUMA balancing specific and do not do the page unmap in secondary
>>> MMU until .numa_protect() comes.
>>>
>>
>> Why do we need all that, when we should simply not be applying PROT_NONE to
>> pinned pages?
>>
>> In change_pte_range() we already have:
>>
>> if (is_cow_mapping(vma->vm_flags) &&
>>      page_count(page) != 1)
>>
>> Which includes both, shared and pinned pages.
> Ah, right, currently on my side I don't see any pinned pages falling
> outside of this condition.
> But I have a question regarding is_cow_mapping(vma->vm_flags): do we
> need to allow for pinned pages in !is_cow_mapping(vma->vm_flags)?

One issue is that folio_maybe_pinned...() ... is unreliable as soon as 
your page is mapped more than 1024 times.

One might argue that we also want to exclude pages that are mapped that 
often. That might possibly work.
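
A minimal sketch of what such an exclusion could look like in
change_pte_range() -- purely illustrative, the mapcount cut-off is an
assumption rather than an agreed design:

	/*
	 * Hypothetical sketch: skip NUMA protection for folios that are
	 * likely DMA-pinned, or mapped so widely that
	 * folio_maybe_dma_pinned() can no longer tell (> 1024 mappings).
	 */
	struct folio *folio = page_folio(page);

	if (folio_maybe_dma_pinned(folio) || folio_mapcount(folio) > 1024)
		continue;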

> 
>> Staring at page #2, are we still missing something similar for THPs?
> Yes.
> 
>> Why is that MMU notifier thingy and touching KVM code required?
> Because NUMA balancing code will firstly send .invalidate_range_start() with
> event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
> unconditionally, before it goes down into change_pte_range() and
> change_huge_pmd() to check each page count and apply PROT_NONE.

Ah, okay I see, thanks. That's indeed unfortunate.

> 
> Then current KVM will unmap all notified pages from the secondary MMU
> in .invalidate_range_start(), which could include pages that are
> ultimately not set to PROT_NONE in the primary MMU.
> 
> For VMs with pass-through devices, though all guest pages are pinned,
> KVM still periodically unmaps pages in response to the
> .invalidate_range_start() notification from auto NUMA balancing, which
> is a waste.

Should we instead disable NUMA hinting for such VMAs (for example, from 
QEMU/the hypervisor, which knows that any NUMA hinting activity on these 
ranges would be a complete waste of time)? I recall that John H. once 
mentioned that there are similar issues with GPU memory: NUMA hinting is 
actually counter-productive and they end up disabling it.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-11 17:25     ` David Hildenbrand
@ 2023-08-11 18:20       ` John Hubbard
  2023-08-11 18:39         ` David Hildenbrand
  0 siblings, 1 reply; 47+ messages in thread
From: John Hubbard @ 2023-08-11 18:20 UTC (permalink / raw)
  To: David Hildenbrand, Yan Zhao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian

On 8/11/23 10:25, David Hildenbrand wrote:
...
> One issue is that folio_maybe_pinned...() ... is unreliable as soon as your page is mapped more than 1024 times.
> 
> One might argue that we also want to exclude pages that are mapped that often. That might possibly work.

Yes.
  
>>
>>> Staring at page #2, are we still missing something similar for THPs?
>> Yes.
>>
>>> Why is that MMU notifier thingy and touching KVM code required?
>> Because NUMA balancing code will firstly send .invalidate_range_start() with
>> event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
>> unconditionally, before it goes down into change_pte_range() and
>> change_huge_pmd() to check each page count and apply PROT_NONE.
> 
> Ah, okay I see, thanks. That's indeed unfortunate.

Sigh. All this difficulty reminds me that this mechanism was created in
the early days of NUMA. I wonder sometimes lately whether the cost, in
complexity and CPU time, is still worth it on today's hardware.

But of course I am deeply biased, so don't take that too seriously.
See below. :)

> 
>>
>> Then current KVM will unmap all notified pages from the secondary MMU
>> in .invalidate_range_start(), which could include pages that are
>> ultimately not set to PROT_NONE in the primary MMU.
>>
>> For VMs with pass-through devices, though all guest pages are pinned,
>> KVM still periodically unmaps pages in response to the
>> .invalidate_range_start() notification from auto NUMA balancing, which
>> is a waste.
> 
> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are 
similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
> 

Yes, NUMA balancing is incredibly harmful to performance, for GPU and
accelerators that map memory...and VMs as well, it seems. Basically,
anything that has its own processors and page tables needs to be left
strictly alone by NUMA balancing. Because the kernel is (still, even
today) unaware of what those processors are doing, and so it has no way
to do productive NUMA balancing.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-11 18:20       ` John Hubbard
@ 2023-08-11 18:39         ` David Hildenbrand
  2023-08-11 19:35           ` John Hubbard
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2023-08-11 18:39 UTC (permalink / raw)
  To: John Hubbard, Yan Zhao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman

>> Ah, okay I see, thanks. That's indeed unfortunate.
> 
> Sigh. All this difficulty reminds me that this mechanism was created in
> the early days of NUMA. I wonder sometimes lately whether the cost, in
> complexity and CPU time, is still worth it on today's hardware.
> 
> But of course I am deeply biased, so don't take that too seriously.
> See below. :)

:)

>>
>>>
>>> Then current KVM will unmap all notified pages from the secondary MMU
>>> in .invalidate_range_start(), which could include pages that are
>>> ultimately not set to PROT_NONE in the primary MMU.
>>>
>>> For VMs with pass-through devices, though all guest pages are pinned,
>>> KVM still periodically unmaps pages in response to the
>>> .invalidate_range_start() notification from auto NUMA balancing, which
>>> is a waste.
>>
>> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
> similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
>>
> 
> Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> accelerators that map memory...and VMs as well, it seems. Basically,
> anything that has its own processors and page tables needs to be left
> strictly alone by NUMA balancing. Because the kernel is (still, even
> today) unaware of what those processors are doing, and so it has no way
> to do productive NUMA balancing.

Is there any existing way we could handle that better on a per-VMA 
level, or on the process level? Any magic toggles?

MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might 
be better, but with things like io_uring it would still eventually be too restrictive.

I recall that setting a mempolicy could prevent auto-numa from getting 
active, but that might be undesired.

CCing Mel.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 4/5] mm/autonuma: call .numa_protect() when page is protected for NUMA migrate
  2023-08-10  9:00 ` [RFC PATCH v2 4/5] mm/autonuma: call .numa_protect() when page is protected for NUMA migrate Yan Zhao
@ 2023-08-11 18:52   ` Nadav Amit
  2023-08-14  7:52     ` Yan Zhao
  0 siblings, 1 reply; 47+ messages in thread
From: Nadav Amit @ 2023-08-11 18:52 UTC (permalink / raw)
  To: Yan Zhao
  Cc: linux-mm, Linux Kernel Mailing List, kvm, Paolo Bonzini,
	Sean Christopherson, Mike Kravetz, apopple, Jason Gunthorpe,
	Mike Rapoport, Andrew Morton, kevin.tian, david


> On Aug 10, 2023, at 2:00 AM, Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> Call mmu notifier's callback .numa_protect() in change_pmd_range() when
> a page is ensured to be protected by PROT_NONE for NUMA migration purpose.

Consider squashing this with the previous patch. It's better to see the user
(caller) together with the new functionality.

It would be useful to describe the expected course of action that the
numa_protect callback should take.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-11 18:39         ` David Hildenbrand
@ 2023-08-11 19:35           ` John Hubbard
  2023-08-14  9:09             ` Yan Zhao
  0 siblings, 1 reply; 47+ messages in thread
From: John Hubbard @ 2023-08-11 19:35 UTC (permalink / raw)
  To: David Hildenbrand, Yan Zhao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman

On 8/11/23 11:39, David Hildenbrand wrote:
...
>>> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
>> similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
>>>
>>
>> Yes, NUMA balancing is incredibly harmful to performance, for GPU and
>> accelerators that map memory...and VMs as well, it seems. Basically,
>> anything that has its own processors and page tables needs to be left
>> strictly alone by NUMA balancing. Because the kernel is (still, even
>> today) unaware of what those processors are doing, and so it has no way
>> to do productive NUMA balancing.
> 
> Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
> 
> MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually.
> 
> I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
> 
> CCing Mel.
> 

Let's discern between page-pinning situations and HMM-style situations.
Page pinning of CPU memory is unnecessary when setting that memory up for
use by modern GPUs or accelerators, because the latter can handle
replayable page faults. So for such cases, the pages are in use by a GPU
or accelerator, but unpinned.

The performance problem occurs because for those pages, the NUMA
balancing causes unmapping, which generates callbacks to the device
driver, which dutifully unmaps the pages from the GPU or accelerator,
even if the GPU might be busy using those pages. The device promptly
causes a device page fault, and the driver then re-establishes the
device page table mapping, which is good until the next round of
unmapping from the NUMA balancer.
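
As a rough sketch of that cycle from the driver's side (the driver-specific
names here are made up purely for illustration):

	/*
	 * A driver's interval-notifier callback, invoked on every
	 * invalidation -- including NUMA-hinting ones, which arrive as
	 * MMU_NOTIFY_PROTECTION_VMA -- so the device mapping is torn down
	 * even while the device is actively using the pages.
	 */
	static bool drv_notifier_invalidate(struct mmu_interval_notifier *mni,
					    const struct mmu_notifier_range *range,
					    unsigned long cur_seq)
	{
		struct drv_buffer *buf = container_of(mni, struct drv_buffer, notifier);

		if (!mmu_notifier_range_blockable(range))
			return false;

		mmu_interval_set_seq(mni, cur_seq);
		drv_unmap_from_device(buf, range->start, range->end); /* made-up helper */
		return true;
	}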

hmm_range_fault()-based memory management in particular might benefit
from having NUMA balancing disabled entirely for the memremap_pages()
region, come to think of it. That seems relatively easy and clean at
first glance anyway.

For other regions (allocated by the device driver), a per-VMA flag
seems about right: VM_NO_NUMA_BALANCING ?
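
A hypothetical sketch of how such a flag might be consumed
(VM_NO_NUMA_BALANCING does not exist upstream; this only illustrates the idea):

	/* in the task_numa_work() / change_prot_numa() path, hypothetically: */
	if (vma->vm_flags & VM_NO_NUMA_BALANCING)
		continue;	/* never apply PROT_NONE hinting to this VMA */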

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-11 17:14           ` Sean Christopherson
  2023-08-11 17:18             ` Jason Gunthorpe
@ 2023-08-14  6:52             ` Yan Zhao
  2023-08-14  7:44               ` Yan Zhao
  2023-08-14 16:40               ` Sean Christopherson
  1 sibling, 2 replies; 47+ messages in thread
From: Yan Zhao @ 2023-08-14  6:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: bibo mao, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, david

On Fri, Aug 11, 2023 at 10:14:45AM -0700, Sean Christopherson wrote:
> On Fri, Aug 11, 2023, Yan Zhao wrote:
> > On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> > > 
> > > 在 2023/8/11 11:45, Yan Zhao 写道:
> > > >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > > >>> +					  struct mm_struct *mm,
> > > >>> +					  unsigned long start,
> > > >>> +					  unsigned long end)
> > > >>> +{
> > > >>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > > >>> +
> > > >>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > > >>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > > >>> +		return;
> > > >>> +
> > > >>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > > >>> +}
> > > >> numa balance will scan wide memory range, and there will be one time
> > > > Though scanning memory range is wide, .invalidate_range_start() is sent
> > > > for each 2M range.
> > > yes, range is huge page size when changing numa protection during numa scanning.
> > > 
> > > > 
> > > >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> > > >> it may bring out lots of flush remote tlb ipi notification.
> > > > 
> > > > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > > > will be reduced to 0 with this series.
> > > > 
> > > > For VMs without assigned devices or mdev devices, I was previously also
> > > > worried about that there might be more IPIs.
> > > > But with current test data, there's no more remote tlb IPIs on average.
> > > > 
> > > > The reason is below:
> > > > 
> > > > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > > > range.
> 
> No, it's potentially called once per 1GiB range.  change_pmd_range() invokes the
> mmu_notifier with addr+end, where "end" is the end of the range covered by the
> PUD, not the the end of the current PMD.  So the worst case scenario would be a
> 256k increase.  Of course, if you have to migrate an entire 1GiB chunk of memory
> then you likely have bigger problems, but still.
Yes, thanks for pointing it out.
I realized it after re-reading the code and re-checking the log message.
This wider range also explains the worst data point collected:
44 kvm_unmap_gfn_range() calls caused 43920 kvm_flush_remote_tlbs() requests, i.e.
~998 remote TLB flush requests per kvm_unmap_gfn_range().

> 
> > > > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > > > mapped to a huge page in primary MMU, and called for at most 512 times
> > > > if mapped to 4K pages in primary MMU.
> > > > 
> > > > 
> > > > Though kvm_unmap_gfn_range() is only called once before this series,
> > > > as the range is blockable, when there're contentions, remote tlb IPIs
> > > > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> > > I do not know much about x86, does this happen always or only need reschedule
> > Ah, sorry, I missed platforms other than x86.
> > Maybe there will be a big difference in other platforms.
> > 
> > > from code?  so that there will be many times of tlb IPIs in only once function
> > Only when MMU lock is contended. But it's not seldom because of the contention in
> > TDP page fault.
> 
> No?  I don't see how mmu_lock contention would affect the number of calls to 
> kvm_flush_remote_tlbs().  If vCPUs are running and not generating faults, i.e.
> not trying to access the range in question, then ever zap will generate a remote
> TLB flush and thus send IPIs to all running vCPUs.
In tdp_mmu_zap_leafs() for kvm_unmap_gfn_range(), the flow is like this:

1. Check whether mmu_lock needs a reschedule
1.a -- if yes,
       1.a.1 do kvm_flush_remote_tlbs() if flush is true.
       1.a.2 reschedule (drop and reacquire) mmu_lock
       1.a.3 goto 2.
1.b -- if no, goto 2
2. Zap the present leaf entry, and set flush to true
3. Get the next gfn and go to 1 (a loose sketch of this loop follows)
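
A loose sketch of that loop (helper names are approximate, not the exact
upstream code):

	for_each_leaf_spte_in_range(iter, start, end) {		/* step 3 */
		if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
			/* step 1.a.1: the pending flush was issued before yielding */
			flush = false;
			continue;
		}
		zap_present_leaf_spte(kvm, &iter);		/* step 2 */
		flush = true;
	}
	/* when uncontended, a single batched flush is left for the caller */
	return flush;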

With a wide range passed to .invalidate_range_start()/end(), it's easy for
rwlock_needbreak(&kvm->mmu_lock) to be true (taking branches 1.a and 1.a.1),
and in tdp_mmu_zap_leafs() a TLB flush request is performed before mmu_lock
is rescheduled (1.a.1).


Take a real case for example:
NUMA balancing requests KVM to zap the GFN range 0xb4000 - 0xb9800.
When KVM zaps GFN 0xb807b, it finds that mmu_lock needs a break
because a vCPU is faulting on GFN 0xb804c.
The vCPUs then constantly retry the fault on 0xb804c, 3298 times, until
.invalidate_range_end().
So for this zap of the GFN range 0xb4000 - 0xb9800, the vCPUs retry faults on
GFN 0xb804c 3298 times,
GFN 0xb8074 3161 times, and
GFN 0xb81ce 15190 times,
and the accesses to those 3 GFNs cause 3209 kvm_flush_remote_tlbs() calls
for one kvm_unmap_gfn_range() because the flush requests are not batched
(this range is mapped with both 2M and 4K pages in the secondary MMU).

Without contention on mmu_lock, tdp_mmu_zap_leafs() just combines all the
flush requests and leaves a single kvm_flush_remote_tlbs() to be called
after it returns.


In my test scenario, which is VM boot-up, this kind of contention is
frequent.
Here's the data previously collected from 10 VM boot-ups:
       |      count of       |       count of          |
       | kvm_unmap_gfn_range | kvm_flush_remote_tlbs() |
-------|---------------------|-------------------------|
   1   |         38          |           14            |
   2   |         28          |         5191            |
   3   |         36          |        13398            |
   4   |         44          |        43920            |
   5   |         28          |           14            |
   6   |         36          |        10803            |
   7   |         38          |          892            |
   8   |         32          |         5905            |
   9   |         32          |           13            |
  10   |         38          |         6096            |
-------|---------------------|-------------------------|
average|         35          |         8625            |


I wonder if we could lower the frequency of rescheduling checks in
tdp_mmu_iter_cond_resched() when the zap range is wide, e.g.

if (iter->next_last_level_gfn ==
    iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
	return false; 

> 
> > > call about kvm_unmap_gfn_range.
> > > 
> > > > if the pages are mapped in 4K in secondary MMU.
> > > > 
> > > > With this series, on the other hand, .numa_protect() sets range to be
> > > > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > > > mapped into small PTEs in secondary MMU.
> > > > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> > > No, .numa_protect() is not sent for all pages. It depends on the workload,
> > > whether the page is accessed for different cpu threads cross-nodes.
> > The .numa_protect() is called in patch 4 only when PROT_NONE is set to
> > the page.
> 
> I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA.  It's too much of a one-off,
> and losing the batching of invalidations makes me nervous.  As Bibo points out,
> the behavior will vary based on the workload, VM configuration, etc.
> 
> There's also a *very* subtle change, in that the notification will be sent while
> holding the PMD/PTE lock.  Taking KVM's mmu_lock under that is *probably* ok, but
> I'm not exactly 100% confident on that.  And the only reason there isn't a more
KVM's mmu_lock is an rwlock, which is a variant of spinlock.
So, I think taking it within the PMD/PTE lock is ok.
Actually we have already done that for the .change_pte() notification:

kvm_mmu_notifier_change_pte() takes KVM's mmu_lock for write,
while the PTE lock is held when the set_pte_at_notify() --> .change_pte() handlers are invoked.


> obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> won't yield if there's mmu_lock contention.
Yes, KVM won't yield and reschedule mmu_lock, so it's fine.

> However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> I think we can achieve what you want without losing batching, and without changing
> secondary MMUs.
> 
> Rather than muck with the notification types and add a one-off for NUMA, just
> defer the notification until a present PMD/PTE is actually going to be modified.
> It's not the prettiest, but other than the locking, I don't think has any chance
> of regressing other workloads/configurations.
> 
> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> 
> Compile tested only.

I can't find a matching end for each
mmu_notifier_invalidate_range_start_nonblock().

> 
> From: Sean Christopherson <seanjc@google.com>
> Date: Fri, 11 Aug 2023 10:03:36 -0700
> Subject: [PATCH] tmp
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  include/linux/huge_mm.h |  4 +++-
>  include/linux/mm.h      |  3 +++
>  mm/huge_memory.c        |  5 ++++-
>  mm/mprotect.c           | 47 +++++++++++++++++++++++++++++------------
>  4 files changed, 44 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 20284387b841..b88316adaaad 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -7,6 +7,8 @@
>  
>  #include <linux/fs.h> /* only for vma_is_dax() */
>  
> +struct mmu_notifier_range;
> +
>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> @@ -38,7 +40,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  		   unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
>  int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> -		    unsigned long cp_flags);
> +		    unsigned long cp_flags, struct mmu_notifier_range *range);
>  
>  vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>  vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2dd73e4f3d8e..284f61cf9c37 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2478,6 +2478,9 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
>  	return !!(vma->vm_flags & VM_WRITE);
>  
>  }
> +
> +void change_pmd_range_notify_secondary_mmus(unsigned long addr,
> +					    struct mmu_notifier_range *range);
>  bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  			     pte_t pte);
>  extern long change_protection(struct mmu_gather *tlb,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a71cf686e3b2..47c7712b163e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1808,7 +1808,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>   */
>  int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> -		    unsigned long cp_flags)
> +		    unsigned long cp_flags, struct mmu_notifier_range *range)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	spinlock_t *ptl;
> @@ -1893,6 +1893,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    !toptier)
>  			xchg_page_access_time(page, jiffies_to_msecs(jiffies));
>  	}
> +
> +	change_pmd_range_notify_secondary_mmus(addr, range);
> +
>  	/*
>  	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
>  	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index d1a809167f49..f5844adbe0cb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -82,7 +82,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
> -		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> +		unsigned long end, pgprot_t newprot, unsigned long cp_flags,
> +		struct mmu_notifier_range *range)
>  {
>  	pte_t *pte, oldpte;
>  	spinlock_t *ptl;
> @@ -164,8 +165,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				    !toptier)
>  					xchg_page_access_time(page,
>  						jiffies_to_msecs(jiffies));
> +
> +
>  			}
>  
> +			change_pmd_range_notify_secondary_mmus(addr, range);
> +
>  			oldpte = ptep_modify_prot_start(vma, addr, pte);
>  			ptent = pte_modify(oldpte, newprot);
>  
> @@ -355,6 +360,17 @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
>  		err;							\
>  	})
>  
> +void change_pmd_range_notify_secondary_mmus(unsigned long addr,
> +					    struct mmu_notifier_range *range)
> +{
> +	if (range->start)
> +		return;
> +
> +	VM_WARN_ON(addr >= range->end);
> +	range->start = addr;
> +	mmu_notifier_invalidate_range_start_nonblock(range);
This will cause the range from addr to end to be invalidated, which may
include pinned ranges.

> +}
> +
>  static inline long change_pmd_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -365,7 +381,14 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>  	unsigned long nr_huge_updates = 0;
>  	struct mmu_notifier_range range;
>  
> -	range.start = 0;
> +	/*
> +	 * Defer invalidation of secondary MMUs until a PMD/PTE change is
> +	 * imminent, e.g. NUMA balancing in particular can "fail" for certain
> +	 * types of mappings.  Initialize range.start to '0' and use it to
> +	 * track whether or not the invalidation notification has been set.
> +	 */
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
> +				vma->vm_mm, 0, end);
>  
>  	pmd = pmd_offset(pud, addr);
>  	do {
> @@ -383,18 +406,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>  		if (pmd_none(*pmd))
>  			goto next;
>  
> -		/* invoke the mmu notifier if the pmd is populated */
> -		if (!range.start) {
> -			mmu_notifier_range_init(&range,
> -				MMU_NOTIFY_PROTECTION_VMA, 0,
> -				vma->vm_mm, addr, end);
> -			mmu_notifier_invalidate_range_start(&range);
> -		}
> -
>  		_pmd = pmdp_get_lockless(pmd);
>  		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
>  			if ((next - addr != HPAGE_PMD_SIZE) ||
>  			    pgtable_split_needed(vma, cp_flags)) {
> +				/*
> +				 * FIXME: __split_huge_pmd() performs its own
> +				 * mmu_notifier invalidation prior to clearing
> +				 * the PMD, ideally all invalidations for the
> +				 * range would be batched.
> +				 */
>  				__split_huge_pmd(vma, pmd, addr, false, NULL);
>  				/*
>  				 * For file-backed, the pmd could have been
> @@ -407,8 +428,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>  					break;
>  				}
>  			} else {
> -				ret = change_huge_pmd(tlb, vma, pmd,
> -						addr, newprot, cp_flags);
> +				ret = change_huge_pmd(tlb, vma, pmd, addr,
> +						      newprot, cp_flags, &range);
>  				if (ret) {
>  					if (ret == HPAGE_PMD_NR) {
>  						pages += HPAGE_PMD_NR;
> @@ -423,7 +444,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>  		}
>  
>  		ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
> -				       cp_flags);
> +				       cp_flags, &range);
>  		if (ret < 0)
>  			goto again;
>  		pages += ret;
> 
> base-commit: 1f40f634009556c119974cafce4c7b2f9b8c58ad
> -- 
> 
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-14  6:52             ` Yan Zhao
@ 2023-08-14  7:44               ` Yan Zhao
  2023-08-14 16:40               ` Sean Christopherson
  1 sibling, 0 replies; 47+ messages in thread
From: Yan Zhao @ 2023-08-14  7:44 UTC (permalink / raw)
  To: Sean Christopherson, bibo mao, linux-mm, linux-kernel, kvm,
	pbonzini, mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian,
	david

On Mon, Aug 14, 2023 at 02:52:07PM +0800, Yan Zhao wrote:
> I wonder if we could loose the frequency to check for rescheduling in
> tdp_mmu_iter_cond_resched() if the zap range is wide, e.g.
> 
> if (iter->next_last_level_gfn ==
>     iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> 	return false;
Correct:

@@ -712,7 +713,8 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
        WARN_ON(iter->yielded);

        /* Ensure forward progress has been made before yielding. */
-       if (iter->next_last_level_gfn == iter->yielded_gfn)
+       if (iter->next_last_level_gfn >= iter->yielded_gfn &&
+          iter->next_last_level_gfn < iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
                return false;

        if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {


This can greatly reduce the number of kvm_flush_remote_tlbs() calls within one kvm_unmap_gfn_range() in the KVM x86 TDP MMU.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 4/5] mm/autonuma: call .numa_protect() when page is protected for NUMA migrate
  2023-08-11 18:52   ` Nadav Amit
@ 2023-08-14  7:52     ` Yan Zhao
  0 siblings, 0 replies; 47+ messages in thread
From: Yan Zhao @ 2023-08-14  7:52 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, Linux Kernel Mailing List, kvm, Paolo Bonzini,
	Sean Christopherson, Mike Kravetz, apopple, Jason Gunthorpe,
	Mike Rapoport, Andrew Morton, kevin.tian, david

On Fri, Aug 11, 2023 at 11:52:53AM -0700, Nadav Amit wrote:
> 
> > On Aug 10, 2023, at 2:00 AM, Yan Zhao <yan.y.zhao@intel.com> wrote:
> > 
> > Call mmu notifier's callback .numa_protect() in change_pmd_range() when
> > a page is ensured to be protected by PROT_NONE for NUMA migration purpose.
> 
> Consider squashing with the previous patch. It’s better to see the user
> (caller) with the new functionality.
> 
> It would be useful to describe what the expected course of action that
> numa_protect callback should take.
Thanks! I'll do it this way when I prepare patches in the future :)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-11 19:35           ` John Hubbard
@ 2023-08-14  9:09             ` Yan Zhao
  2023-08-15  2:34               ` John Hubbard
  2023-08-15  2:36               ` Yuan Yao
  0 siblings, 2 replies; 47+ messages in thread
From: Yan Zhao @ 2023-08-14  9:09 UTC (permalink / raw)
  To: John Hubbard
  Cc: David Hildenbrand, linux-mm, linux-kernel, kvm, pbonzini, seanjc,
	mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman

On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote:
> On 8/11/23 11:39, David Hildenbrand wrote:
> ...
> > > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
> > > similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
> > > > 
> > > 
> > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> > > accelerators that map memory...and VMs as well, it seems. Basically,
> > > anything that has its own processors and page tables needs to be left
> > > strictly alone by NUMA balancing. Because the kernel is (still, even
> > > today) unaware of what those processors are doing, and so it has no way
> > > to do productive NUMA balancing.
> > 
> > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
> > 
> > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually.
> > 
> > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
> > 
> > CCing Mel.
> > 
> 
> Let's discern between page pinning situations, and HMM-style situations.
> Page pinning of CPU memory is unnecessary when setting up for using that
> memory by modern GPUs or accelerators, because the latter can handle
> replayable page faults. So for such cases, the pages are in use by a GPU
> or accelerator, but unpinned.
> 
> The performance problem occurs because for those pages, the NUMA
> balancing causes unmapping, which generates callbacks to the device
> driver, which dutifully unmaps the pages from the GPU or accelerator,
> even if the GPU might be busy using those pages. The device promptly
> causes a device page fault, and the driver then re-establishes the
> device page table mapping, which is good until the next round of
> unmapping from the NUMA balancer.
> 
> hmm_range_fault()-based memory management in particular might benefit
> from having NUMA balancing disabled entirely for the memremap_pages()
> region, come to think of it. That seems relatively easy and clean at
> first glance anyway.
> 
> For other regions (allocated by the device driver), a per-VMA flag
> seems about right: VM_NO_NUMA_BALANCING ?
> 
Thanks a lot for those good suggestions!
For VMs, when could a per-VMA flag be set?
It might be hard to do in mmap() in QEMU, because a VMA may not be used for DMA
until after it's mapped into VFIO.
Then, should VFIO set this flag after it maps a range?
And could this flag be unset after device hot-unplug?



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-14  6:52             ` Yan Zhao
  2023-08-14  7:44               ` Yan Zhao
@ 2023-08-14 16:40               ` Sean Christopherson
  2023-08-15  1:54                 ` Yan Zhao
  1 sibling, 1 reply; 47+ messages in thread
From: Sean Christopherson @ 2023-08-14 16:40 UTC (permalink / raw)
  To: Yan Zhao
  Cc: bibo mao, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, david

On Mon, Aug 14, 2023, Yan Zhao wrote:
> On Fri, Aug 11, 2023 at 10:14:45AM -0700, Sean Christopherson wrote:
> > > > > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > > > > mapped to a huge page in primary MMU, and called for at most 512 times
> > > > > if mapped to 4K pages in primary MMU.
> > > > > 
> > > > > 
> > > > > Though kvm_unmap_gfn_range() is only called once before this series,
> > > > > as the range is blockable, when there're contentions, remote tlb IPIs
> > > > > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> > > > I do not know much about x86, does this happen always or only need reschedule
> > > Ah, sorry, I missed platforms other than x86.
> > > Maybe there will be a big difference in other platforms.
> > > 
> > > > from code?  so that there will be many times of tlb IPIs in only once function
> > > Only when MMU lock is contended. But it's not seldom because of the contention in
> > > TDP page fault.
> > 
> > No?  I don't see how mmu_lock contention would affect the number of calls to 
> > kvm_flush_remote_tlbs().  If vCPUs are running and not generating faults, i.e.
> > not trying to access the range in question, then ever zap will generate a remote
> > TLB flush and thus send IPIs to all running vCPUs.
> In tdp_mmu_zap_leafs() for kvm_unmap_gfn_range(), the flow is like this:
> 
> 1. Check necessity of mmu_lock reschedule

Ah, you're running a preemptible kernel.

> 1.a -- if yes,
>        1.a.1 do kvm_flush_remote_tlbs() if flush is true.
>        1.a.2 reschedule of mmu_lock
>        1.a.3 goto 2.
> 1.b -- if no, goto 2
> 2. Zap present leaf entry, and set flush to be true
> 3. Get next gfn and go to to 1
> 
> With a wide range to .invalidate_range_start()/end(), it's easy to find
> rwlock_needbreak(&kvm->mmu_lock) to be true (goes to 1.a and 1.a.1)
> And in tdp_mmu_zap_leafs(), before rescheduling of mmu_lock, tlb flush
> request is performed. (1.a.1)
> 
> 
> Take a real case for example,
> NUMA balancing requests KVM to zap GFN range from 0xb4000 to 0xb9800.
> Then when KVM zaps GFN 0xb807b, it will finds mmu_lock needs break
> because vCPU is faulting GFN 0xb804c.
> And vCPUs fill constantly retry faulting 0xb804c for 3298 times until
> .invalidate_range_end().
> Then for this zap of GFN range from 0xb4000 - 0xb9800,
> vCPUs retry fault of
> GFN 0xb804c for 3298 times,
> GFN 0xb8074 for 3161 times,
> GFN 0xb81ce for 15190 times,
> and the accesses of the 3 GFNs cause 3209 times of kvm_flush_remote_tlbs()
> for one kvm_unmap_gfn_range() because flush requests are not batched.
> (this range is mapped in both 2M and 4K in secondary MMU).
> 
> Without the contending of mmu_lock, tdp_mmu_zap_leafs() just combines
> all flush requests and leaves 1 kvm_flush_remote_tlbs() to be called
> after it returns.
> 
> 
> In my test scenario, which is VM boot-up, this kind of contention is
> frequent.

Hmm, I think your test scenario is a bit of an outlier.  Running a preemptible
kernel on a multi-socket system (presumably your system is multi-socket if you
have NUMA) is a bit odd.

Not that that invalidates the test result, but I would be quite surprised if there
are production use cases for NUMA+KVM+preemptible.  Though maybe I'm underestimating
how much KVM is used on workstations?

> Here's the 10 times data for a VM boot-up collected previously.
>        |      count of       |       count of          |
>        | kvm_unmap_gfn_range | kvm_flush_remote_tlbs() |
> -------|---------------------|-------------------------|
>    1   |         38          |           14            |
>    2   |         28          |         5191            |
>    3   |         36          |        13398            |
>    4   |         44          |        43920            |
>    5   |         28          |           14            |
>    6   |         36          |        10803            |
>    7   |         38          |          892            |
>    8   |         32          |         5905            |
>    9   |         32          |           13            |
>   10   |         38          |         6096            |
> -------|---------------------|-------------------------|
> average|         35          |         8625            |
> 
> 
> I wonder if we could loose the frequency to check for rescheduling in
> tdp_mmu_iter_cond_resched() if the zap range is wide, e.g.
> 
> if (iter->next_last_level_gfn ==
>     iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> 	return false;

Hrm, no.  We'd want to repeat that logic for the shadow MMU, and it could regress
other scenarios, e.g. if a vCPU is trying to fault-in an address that isn't part
of the invalidation, then yielding asap is desirable.

Unless I'm missing something, the easiest solution is to check for an invalidation
*before* acquiring mmu_lock.  The check would be prone to false negatives and false
positives, so KVM would need to recheck after acquiring mmu_lock, but the check
itself is super cheap relative to the overall cost of the page fault.

Compile tested patch at the bottom.  I'll test and post formally (probably with
similar treatment for other architectures).

> > > > call about kvm_unmap_gfn_range.
> > > > 
> > > > > if the pages are mapped in 4K in secondary MMU.
> > > > > 
> > > > > With this series, on the other hand, .numa_protect() sets range to be
> > > > > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > > > > mapped into small PTEs in secondary MMU.
> > > > > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> > > > No, .numa_protect() is not sent for all pages. It depends on the workload,
> > > > whether the page is accessed for different cpu threads cross-nodes.
> > > The .numa_protect() is called in patch 4 only when PROT_NONE is set to
> > > the page.
> > 
> > I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA.  It's too much of a one-off,
> > and losing the batching of invalidations makes me nervous.  As Bibo points out,
> > the behavior will vary based on the workload, VM configuration, etc.
> > 
> > There's also a *very* subtle change, in that the notification will be sent while
> > holding the PMD/PTE lock.  Taking KVM's mmu_lock under that is *probably* ok, but
> > I'm not exactly 100% confident on that.  And the only reason there isn't a more
> MMU lock is a rwlock, which is a variant of spinlock.
> So, I think taking it within PMD/PTE lock is ok.
> Actually we have already done that during the .change_pte() notification, where
> 
> kvm_mmu_notifier_change_pte() takes KVM mmu_lock for write,
> while PTE lock is held while sending set_pte_at_notify() --> .change_pte() handlers

.change_pte() gets away with running under PMD/PTE lock because (a) it's not allowed
to fail and (b) KVM is the only secondary MMU that hooks .change_pte() and KVM
doesn't use a sleepable lock.

As Jason pointed out, mmu_notifier_invalidate_range_start_nonblock() can fail
because some secondary MMUs use mutexes or r/w semaphores to handle mmu_notifier
events.  For NUMA balancing, canceling the protection change might be ok, but for
everything else, failure is not an option.  So unfortunately, my idea won't work
as-is.

We might get away with deferring just the change_prot_numa() case, but I don't
think that's worth the mess/complexity.

I would much rather tell userspace to disable migrate-on-fault for KVM guest
mappings (mbind() should work?) for these types of setups, or to disable NUMA
balancing entirely.  If the user really cares about performance of their VM(s),
vCPUs should be affined to a single NUMA node (or core, or pinned 1:1), and if
the VM spans multiple nodes, a virtual NUMA topology should be given to the guest.
At that point, NUMA balancing is likely to do more harm than good.
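
For reference, the userspace side of that could look roughly like the
following (a sketch; 'guest_ram'/'ram_size'/'node' stand in for whatever the
VMM actually tracks, and whether mbind() alone is enough to keep NUMA
balancing away is exactly the open question above):

	#include <numaif.h>
	#include <stdio.h>

	/*
	 * Give the guest-RAM mapping an explicit policy so automatic NUMA
	 * balancing leaves it alone.
	 */
	static int bind_guest_ram(void *guest_ram, size_t ram_size, int node)
	{
		unsigned long nodemask = 1UL << node;

		if (mbind(guest_ram, ram_size, MPOL_BIND, &nodemask,
			  sizeof(nodemask) * 8, 0)) {
			perror("mbind");
			return -1;
		}
		return 0;
	}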

> > obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> > won't yield if there's mmu_lock contention.
> Yes, KVM won't yield and reschedule of KVM mmu_lock, so it's fine.
> 
> > However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> > I think we can achieve what you want without losing batching, and without changing
> > secondary MMUs.
> > 
> > Rather than muck with the notification types and add a one-off for NUMA, just
> > defer the notification until a present PMD/PTE is actually going to be modified.
> > It's not the prettiest, but other than the locking, I don't think has any chance
> > of regressing other workloads/configurations.
> > 
> > Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> > 
> > Compile tested only.
> 
> I don't find a matching end to each
> mmu_notifier_invalidate_range_start_nonblock().

It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():

	if (range.start)
		mmu_notifier_invalidate_range_end(&range);

--
From: Sean Christopherson <seanjc@google.com>
Date: Mon, 14 Aug 2023 08:59:12 -0700
Subject: [PATCH] KVM: x86/mmu: Retry fault before acquiring mmu_lock if
 mapping is changing

Retry page faults without acquiring mmu_lock if the resolve hva is covered
by an active invalidation.  Contending for mmu_lock is especially
problematic on preemptible kernels as the invalidation task will yield the
lock (see rwlock_needbreak()), delay the in-progress invalidation, and
ultimately increase the latency of resolving the page fault.  And in the
worst case scenario, yielding will be accompanied by a remote TLB flush,
e.g. if the invalidation covers a large range of memory and vCPUs are
accessing addresses that were already zapped.

Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.

Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c   | 3 +++
 include/linux/kvm_host.h | 8 +++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9e4cd8b4a202..f29718a16211 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	if (unlikely(!fault->slot))
 		return kvm_handle_noslot_fault(vcpu, fault, access);
 
+	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
+		return RET_PF_RETRY;
+
 	return RET_PF_CONTINUE;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cb86108c624d..f41d4fe61a06 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1960,7 +1960,6 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 					   unsigned long mmu_seq,
 					   unsigned long hva)
 {
-	lockdep_assert_held(&kvm->mmu_lock);
 	/*
 	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
 	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
@@ -1971,6 +1970,13 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 	    hva >= kvm->mmu_invalidate_range_start &&
 	    hva < kvm->mmu_invalidate_range_end)
 		return 1;
+
+	/*
+	 * Note the lack of a memory barrier!  The caller *must* hold mmu_lock
+	 * to avoid false negatives and/or false positives (less concerning).
+	 * Holding mmu_lock is not mandatory though, e.g. to allow pre-checking
+	 * for an in-progress invalidation to avoiding contending mmu_lock.
+	 */
 	if (kvm->mmu_invalidate_seq != mmu_seq)
 		return 1;
 	return 0;

base-commit: 5bc7f472423f95a3f5c73b0b56c47e57d8833efc
-- 


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-14 16:40               ` Sean Christopherson
@ 2023-08-15  1:54                 ` Yan Zhao
  2023-08-15 14:50                   ` Sean Christopherson
  0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2023-08-15  1:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: bibo mao, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, david

On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> > > I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA.  It's too much of a one-off,
> > > and losing the batching of invalidations makes me nervous.  As Bibo points out,
> > > the behavior will vary based on the workload, VM configuration, etc.
> > > 
> > > There's also a *very* subtle change, in that the notification will be sent while
> > > holding the PMD/PTE lock.  Taking KVM's mmu_lock under that is *probably* ok, but
> > > I'm not exactly 100% confident on that.  And the only reason there isn't a more
> > MMU lock is a rwlock, which is a variant of spinlock.
> > So, I think taking it within PMD/PTE lock is ok.
> > Actually we have already done that during the .change_pte() notification, where
> > 
> > kvm_mmu_notifier_change_pte() takes KVM mmu_lock for write,
> > while PTE lock is held while sending set_pte_at_notify() --> .change_pte() handlers
> 
> .change_pte() gets away with running under PMD/PTE lock because (a) it's not allowed
> to fail and (b) KVM is the only secondary MMU that hooks .change_pte() and KVM
> doesn't use a sleepable lock.
The .numa_protect() notification in patch 4 is also sent where it's not
allowed to fail, and there's no sleepable lock in KVM's handler :)


> As Jason pointed out, mmu_notifier_invalidate_range_start_nonblock() can fail
> because some secondary MMUs use mutexes or r/w semaphores to handle mmu_notifier
> events.  For NUMA balancing, canceling the protection change might be ok, but for
> everything else, failure is not an option.  So unfortunately, my idea won't work
> as-is.
> 
> We might get away with deferring just the change_prot_numa() case, but I don't
> think that's worth the mess/complexity.
Yes, I also think deferring mmu_notifier_invalidate_range_start() won't work.
One possible approach is to batch the successful .numa_protect() notifications,
but I admit that will introduce complexity.

> 
> I would much rather tell userspace to disable migrate-on-fault for KVM guest
> mappings (mbind() should work?) for these types of setups, or to disable NUMA
> balancing entirely.  If the user really cares about performance of their VM(s),
> vCPUs should be affined to a single NUMA node (or core, or pinned 1:1), and if
> the VM spans multiple nodes, a virtual NUMA topology should be given to the guest.
> At that point, NUMA balancing is likely to do more harm than good.
> 
> > > obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> > > won't yield if there's mmu_lock contention.
> > Yes, KVM won't yield and reschedule of KVM mmu_lock, so it's fine.
> > 
> > > However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> > > I think we can achieve what you want without losing batching, and without changing
> > > secondary MMUs.
> > > 
> > > Rather than muck with the notification types and add a one-off for NUMA, just
> > > defer the notification until a present PMD/PTE is actually going to be modified.
> > > It's not the prettiest, but other than the locking, I don't think has any chance
> > > of regressing other workloads/configurations.
> > > 
> > > Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> > > 
> > > Compile tested only.
> > 
> > I don't find a matching end to each
> > mmu_notifier_invalidate_range_start_nonblock().
> 
> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> 
> 	if (range.start)
> 		mmu_notifier_invalidate_range_end(&range);
No, it doesn't work for the mmu_notifier_invalidate_range_start() sent in
change_pte_range() if we only want the range to include pages that were
successfully set to PROT_NONE.

> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Mon, 14 Aug 2023 08:59:12 -0700
> Subject: [PATCH] KVM: x86/mmu: Retry fault before acquiring mmu_lock if
>  mapping is changing
> 
> Retry page faults without acquiring mmu_lock if the resolve hva is covered
> by an active invalidation.  Contending for mmu_lock is especially
> problematic on preemptible kernels as the invalidation task will yield the
> lock (see rwlock_needbreak()), delay the in-progress invalidation, and
> ultimately increase the latency of resolving the page fault.  And in the
> worst case scenario, yielding will be accompanied by a remote TLB flush,
> e.g. if the invalidation covers a large range of memory and vCPUs are
> accessing addresses that were already zapped.
> 
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
> 
> Reported-by: Yan Zhao <yan.y.zhao@intel.com>
> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 3 +++
>  include/linux/kvm_host.h | 8 +++++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9e4cd8b4a202..f29718a16211 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	if (unlikely(!fault->slot))
>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>  
> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> +		return RET_PF_RETRY;
> +
This can effectively reduce the remote flush IPIs a lot!
One nit: maybe rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start
and kvm->mmu_invalidate_range_end.
Otherwise, I'm somewhat worried about constant false positives and retries.
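
For instance, a sketch of the READ_ONCE() variant of that pre-check:

	/*
	 * Lockless pre-check: races are tolerated (mmu_lock is re-checked
	 * later), but don't let the compiler re-read or tear the bounds.
	 */
	if (unlikely(kvm->mmu_invalidate_in_progress) &&
	    hva >= READ_ONCE(kvm->mmu_invalidate_range_start) &&
	    hva < READ_ONCE(kvm->mmu_invalidate_range_end))
		return 1;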


>  	return RET_PF_CONTINUE;
>  }
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index cb86108c624d..f41d4fe61a06 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1960,7 +1960,6 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  					   unsigned long mmu_seq,
>  					   unsigned long hva)
>  {
> -	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
>  	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
>  	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
> @@ -1971,6 +1970,13 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  	    hva >= kvm->mmu_invalidate_range_start &&
>  	    hva < kvm->mmu_invalidate_range_end)
>  		return 1;
> +
> +	/*
> +	 * Note the lack of a memory barrier!  The caller *must* hold mmu_lock
> +	 * to avoid false negatives and/or false positives (less concerning).
> +	 * Holding mmu_lock is not mandatory though, e.g. to allow pre-checking
> +	 * for an in-progress invalidation to avoiding contending mmu_lock.
> +	 */
>  	if (kvm->mmu_invalidate_seq != mmu_seq)
>  		return 1;
>  	return 0;
> 
> base-commit: 5bc7f472423f95a3f5c73b0b56c47e57d8833efc
> -- 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-14  9:09             ` Yan Zhao
@ 2023-08-15  2:34               ` John Hubbard
  2023-08-16  7:43                 ` David Hildenbrand
  2023-08-15  2:36               ` Yuan Yao
  1 sibling, 1 reply; 47+ messages in thread
From: John Hubbard @ 2023-08-15  2:34 UTC (permalink / raw)
  To: Yan Zhao
  Cc: David Hildenbrand, linux-mm, linux-kernel, kvm, pbonzini, seanjc,
	mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman

On 8/14/23 02:09, Yan Zhao wrote:
...
>> hmm_range_fault()-based memory management in particular might benefit
>> from having NUMA balancing disabled entirely for the memremap_pages()
>> region, come to think of it. That seems relatively easy and clean at
>> first glance anyway.
>>
>> For other regions (allocated by the device driver), a per-VMA flag
>> seems about right: VM_NO_NUMA_BALANCING ?
>>
> Thanks a lot for those good suggestions!
> For VMs, when could a per-VMA flag be set?
> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
> after it's mapped into VFIO.
> Then, should VFIO set this flag on after it maps a range?
> Could this flag be unset after device hot-unplug?
> 

I'm hoping someone who thinks about VMs and VFIO often can chime in.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-14  9:09             ` Yan Zhao
  2023-08-15  2:34               ` John Hubbard
@ 2023-08-15  2:36               ` Yuan Yao
  2023-08-15  2:37                 ` Yan Zhao
  1 sibling, 1 reply; 47+ messages in thread
From: Yuan Yao @ 2023-08-15  2:36 UTC (permalink / raw)
  To: Yan Zhao
  Cc: John Hubbard, David Hildenbrand, linux-mm, linux-kernel, kvm,
	pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm,
	kevin.tian, Mel Gorman

On Mon, Aug 14, 2023 at 05:09:18PM +0800, Yan Zhao wrote:
> On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote:
> > On 8/11/23 11:39, David Hildenbrand wrote:
> > ...
> > > > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
> > > > similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
> > > > >
> > > >
> > > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> > > > accelerators that map memory...and VMs as well, it seems. Basically,
> > > > anything that has its own processors and page tables needs to be left
> > > > strictly alone by NUMA balancing. Because the kernel is (still, even
> > > > today) unaware of what those processors are doing, and so it has no way
> > > > to do productive NUMA balancing.
> > >
> > > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
> > >
> > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually.
> > >
> > > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
> > >
> > > CCing Mel.
> > >
> >
> > Let's discern between page pinning situations, and HMM-style situations.
> > Page pinning of CPU memory is unnecessary when setting up for using that
> > memory by modern GPUs or accelerators, because the latter can handle
> > replayable page faults. So for such cases, the pages are in use by a GPU
> > or accelerator, but unpinned.
> >
> > The performance problem occurs because for those pages, the NUMA
> > balancing causes unmapping, which generates callbacks to the device
> > driver, which dutifully unmaps the pages from the GPU or accelerator,
> > even if the GPU might be busy using those pages. The device promptly
> > causes a device page fault, and the driver then re-establishes the
> > device page table mapping, which is good until the next round of
> > unmapping from the NUMA balancer.
> >
> > hmm_range_fault()-based memory management in particular might benefit
> > from having NUMA balancing disabled entirely for the memremap_pages()
> > region, come to think of it. That seems relatively easy and clean at
> > first glance anyway.
> >
> > For other regions (allocated by the device driver), a per-VMA flag
> > seems about right: VM_NO_NUMA_BALANCING ?
> >
> Thanks a lot for those good suggestions!
> For VMs, when could a per-VMA flag be set?
> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
> after it's mapped into VFIO.
> Then, should VFIO set this flag on after it maps a range?
> Could this flag be unset after device hot-unplug?

Emm... the madvise() syscall comes to mind; it does things like changing flags
on a VMA, e.g. madvise(MADV_DONTFORK) adds VM_DONTCOPY to the VMA.
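
For illustration, the userspace side could look roughly like the sketch below.
MADV_NO_NUMA_BALANCING is a made-up advice value standing in for whatever the
kernel would expose for the proposed VM_NO_NUMA_BALANCING flag; it is not an
existing API.

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical advice value, not defined by any kernel today. */
#define MADV_NO_NUMA_BALANCING	26

/* Sketch: ask the kernel to skip NUMA hinting for one guest memory region. */
static int disable_numa_hinting(void *hva, size_t len)
{
	uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
	uintptr_t start = (uintptr_t)hva & ~(page - 1);

	/* madvise() requires a page-aligned start address. */
	return madvise((void *)start, len + ((uintptr_t)hva - start),
		       MADV_NO_NUMA_BALANCING);
}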

>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-15  2:36               ` Yuan Yao
@ 2023-08-15  2:37                 ` Yan Zhao
  0 siblings, 0 replies; 47+ messages in thread
From: Yan Zhao @ 2023-08-15  2:37 UTC (permalink / raw)
  To: Yuan Yao
  Cc: John Hubbard, David Hildenbrand, linux-mm, linux-kernel, kvm,
	pbonzini, seanjc, mike.kravetz, apopple, jgg, rppt, akpm,
	kevin.tian, Mel Gorman

On Tue, Aug 15, 2023 at 10:36:18AM +0800, Yuan Yao wrote:
> On Mon, Aug 14, 2023 at 05:09:18PM +0800, Yan Zhao wrote:
> > On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote:
> > > On 8/11/23 11:39, David Hildenbrand wrote:
> > > ...
> > > > > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
> > > > > similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
> > > > > >
> > > > >
> > > > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> > > > > accelerators that map memory...and VMs as well, it seems. Basically,
> > > > > anything that has its own processors and page tables needs to be left
> > > > > strictly alone by NUMA balancing. Because the kernel is (still, even
> > > > > today) unaware of what those processors are doing, and so it has no way
> > > > > to do productive NUMA balancing.
> > > >
> > > > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
> > > >
> > > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually.
> > > >
> > > > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
> > > >
> > > > CCing Mel.
> > > >
> > >
> > > Let's discern between page pinning situations, and HMM-style situations.
> > > Page pinning of CPU memory is unnecessary when setting up for using that
> > > memory by modern GPUs or accelerators, because the latter can handle
> > > replayable page faults. So for such cases, the pages are in use by a GPU
> > > or accelerator, but unpinned.
> > >
> > > The performance problem occurs because for those pages, the NUMA
> > > balancing causes unmapping, which generates callbacks to the device
> > > driver, which dutifully unmaps the pages from the GPU or accelerator,
> > > even if the GPU might be busy using those pages. The device promptly
> > > causes a device page fault, and the driver then re-establishes the
> > > device page table mapping, which is good until the next round of
> > > unmapping from the NUMA balancer.
> > >
> > > hmm_range_fault()-based memory management in particular might benefit
> > > from having NUMA balancing disabled entirely for the memremap_pages()
> > > region, come to think of it. That seems relatively easy and clean at
> > > first glance anyway.
> > >
> > > For other regions (allocated by the device driver), a per-VMA flag
> > > seems about right: VM_NO_NUMA_BALANCING ?
> > >
> > Thanks a lot for those good suggestions!
> > For VMs, when could a per-VMA flag be set?
> > Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
> > after it's mapped into VFIO.
> > Then, should VFIO set this flag on after it maps a range?
> > Could this flag be unset after device hot-unplug?
> 
> Emm... syscall madvise() in my mind, it does things like change flags
> on VMA, e.g madvise(MADV_DONTFORK) adds VM_DONTCOPY to the VMA.
Yes, madvise() might work.
And setting this flag might be an easy decision, while unsetting it might be hard
unless some counters are introduced.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-15  1:54                 ` Yan Zhao
@ 2023-08-15 14:50                   ` Sean Christopherson
  2023-08-16  2:43                     ` bibo mao
  0 siblings, 1 reply; 47+ messages in thread
From: Sean Christopherson @ 2023-08-15 14:50 UTC (permalink / raw)
  To: Yan Zhao
  Cc: bibo mao, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, david

On Tue, Aug 15, 2023, Yan Zhao wrote:
> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> > > > Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> > > > 
> > > > Compile tested only.
> > > 
> > > I don't find a matching end to each
> > > mmu_notifier_invalidate_range_start_nonblock().
> > 
> > It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> > 
> > 	if (range.start)
> > 		mmu_notifier_invalidate_range_end(&range);
> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
> if we only want the range to include pages successfully set to PROT_NONE.

Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
to defer invalidation until it was actually needed, but still perform only a
single notification so as to batch the TLB flushes, e.g. the start() call still
used the original @end.

The idea was to play nice with the scenario where nothing in a VMA could be migrated.
It was completely untested though, so it may not have actually done anything to reduce
the number of pointless invalidations.

> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 9e4cd8b4a202..f29718a16211 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  	if (unlikely(!fault->slot))
> >  		return kvm_handle_noslot_fault(vcpu, fault, access);
> >  
> > +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> > +		return RET_PF_RETRY;
> > +
> This can effectively reduce the remote flush IPIs a lot!
> One Nit is that, maybe rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start
> and kvm->mmu_invalidate_range_end.
> Otherwise, I'm somewhat worried about constant false positive and retry.

If anything, this needs a READ_ONCE() on mmu_invalidate_in_progress.  The ranges
aren't touched when mmu_invalidate_in_progress goes to zero, so ensuring they
are reloaded wouldn't do anything.  The key to making forward progress is seeing
that there is no in-progress invalidation.

I did consider adding said READ_ONCE(), but practically speaking, constant false
positives are impossible.  KVM will re-enter the guest when retrying, and there
is zero chance of the compiler avoiding reloads across VM-Enter+VM-Exit.

I suppose in theory we might someday differentiate between "retry because a different
vCPU may have fixed the fault" and "retry because there's an in-progress invalidation",
and not bother re-entering the guest for the latter, e.g. have it try to yield
instead.  

All that said, READ_ONCE() on mmu_invalidate_in_progress should effectively be a
nop, so it wouldn't hurt to be paranoid in this case.

Hmm, at that point, it probably makes sense to add a READ_ONCE() for mmu_invalidate_seq
too, e.g. so that a sufficiently clever compiler doesn't completely optimize away
the check.  Losing the check wouldn't be problematic (false negatives are fine,
especially on that particular check), but the generated code would *look* buggy.
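
Untested sketch of what that paranoid variant could look like, based on the
snippet quoted above (not a posted patch):

static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
					   unsigned long mmu_seq,
					   unsigned long hva)
{
	/*
	 * READ_ONCE() only keeps the lockless pre-check honest; callers that
	 * need a precise answer must still hold mmu_lock as documented above.
	 */
	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
	    hva >= kvm->mmu_invalidate_range_start &&
	    hva < kvm->mmu_invalidate_range_end)
		return 1;
	if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
		return 1;
	return 0;
}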

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-15 14:50                   ` Sean Christopherson
@ 2023-08-16  2:43                     ` bibo mao
  2023-08-16  3:44                       ` bibo mao
  0 siblings, 1 reply; 47+ messages in thread
From: bibo mao @ 2023-08-16  2:43 UTC (permalink / raw)
  To: Sean Christopherson, Yan Zhao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple,
	jgg, rppt, akpm, kevin.tian, david



On 2023/8/15 22:50, Sean Christopherson wrote:
> On Tue, Aug 15, 2023, Yan Zhao wrote:
>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>
>>>>> Compile tested only.
>>>>
>>>> I don't find a matching end to each
>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>
>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>
>>> 	if (range.start)
>>> 		mmu_notifier_invalidate_range_end(&range);
>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>> if we only want the range to include pages successfully set to PROT_NONE.
> 
> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
> to defer invalidation until it was actually needed, but still perform only a
> single notification so as to batch the TLB flushes, e.g. the start() call still
> used the original @end.
> 
> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
> It was complete untested though, so it may not have actually done anything to reduce
> the number of pointless invalidations.
For the numa-balance scenario, can the original page still be used by the application even
if the pte is changed to PROT_NONE?  If it can be used, maybe we can zap the shadow mmu and
flush the tlb in the mmu_notifier_invalidate_range_end() notification with a precise range;
the range can be the intersection of the mmu_gather range and the mmu_notifier_range.

Regards
Bibo Mao
> 
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 9e4cd8b4a202..f29718a16211 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>>>  	if (unlikely(!fault->slot))
>>>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>>>  
>>> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
>>> +		return RET_PF_RETRY;
>>> +
>> This can effectively reduce the remote flush IPIs a lot!
>> One Nit is that, maybe rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start
>> and kvm->mmu_invalidate_range_end.
>> Otherwise, I'm somewhat worried about constant false positive and retry.
> 
> If anything, this needs a READ_ONCE() on mmu_invalidate_in_progress.  The ranges
> aren't touched when when mmu_invalidate_in_progress goes to zero, so ensuring they
> are reloaded wouldn't do anything.  The key to making forward progress is seeing
> that there is no in-progress invalidation.
> 
> I did consider adding said READ_ONCE(), but practically speaking, constant false
> positives are impossible.  KVM will re-enter the guest when retrying, and there
> is zero chance of the compiler avoiding reloads across VM-Enter+VM-Exit.
> 
> I suppose in theory we might someday differentiate between "retry because a different
> vCPU may have fixed the fault" and "retry because there's an in-progress invalidation",
> and not bother re-entering the guest for the latter, e.g. have it try to yield
> instead.  
> 
> All that said, READ_ONCE() on mmu_invalidate_in_progress should effectively be a
> nop, so it wouldn't hurt to be paranoid in this case.
> 
> Hmm, at that point, it probably makes sense to add a READ_ONCE() for mmu_invalidate_seq
> too, e.g. so that a sufficiently clever compiler doesn't completely optimize away
> the check.  Losing the check wouldn't be problematic (false negatives are fine,
> especially on that particular check), but the generated code would *look* buggy.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-16  2:43                     ` bibo mao
@ 2023-08-16  3:44                       ` bibo mao
  2023-08-16  5:14                         ` Yan Zhao
  0 siblings, 1 reply; 47+ messages in thread
From: bibo mao @ 2023-08-16  3:44 UTC (permalink / raw)
  To: Sean Christopherson, Yan Zhao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz, apopple,
	jgg, rppt, akpm, kevin.tian, david



On 2023/8/16 10:43, bibo mao wrote:
> 
> 
> On 2023/8/15 22:50, Sean Christopherson wrote:
>> On Tue, Aug 15, 2023, Yan Zhao wrote:
>>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>>
>>>>>> Compile tested only.
>>>>>
>>>>> I don't find a matching end to each
>>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>>
>>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>>
>>>> 	if (range.start)
>>>> 		mmu_notifier_invalidate_range_end(&range);
>>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>>> if we only want the range to include pages successfully set to PROT_NONE.
>>
>> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
>> to defer invalidation until it was actually needed, but still perform only a
>> single notification so as to batch the TLB flushes, e.g. the start() call still
>> used the original @end.
>>
>> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
>> It was complete untested though, so it may not have actually done anything to reduce
>> the number of pointless invalidations.
> For numa-balance scenery, can original page still be used by application even if pte
> is changed with PROT_NONE?  If it can be used, maybe we can zap shadow mmu and flush tlb
Since there is a kvm_mmu_notifier_change_pte notification when the numa page is replaced with
a new page, what I meant is: can the original page still be used by the application after the
pte is changed to PROT_NONE but before it is replaced with the new page?

And for the primary mmu, the tlb is flushed after the pte is changed to PROT_NONE and
after the mmu_notifier_invalidate_range_end notification for the secondary mmu.

Regards
Bibo Mao
> in notification mmu_notifier_invalidate_range_end with precised range, the range can
> be cross-range between range mmu_gather and mmu_notifier_range.
> 
> Regards
> Bibo Mao
>>
>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>>> index 9e4cd8b4a202..f29718a16211 100644
>>>> --- a/arch/x86/kvm/mmu/mmu.c
>>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>>> @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>>>>  	if (unlikely(!fault->slot))
>>>>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>>>>  
>>>> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
>>>> +		return RET_PF_RETRY;
>>>> +
>>> This can effectively reduce the remote flush IPIs a lot!
>>> One Nit is that, maybe rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start
>>> and kvm->mmu_invalidate_range_end.
>>> Otherwise, I'm somewhat worried about constant false positive and retry.
>>
>> If anything, this needs a READ_ONCE() on mmu_invalidate_in_progress.  The ranges
>> aren't touched when when mmu_invalidate_in_progress goes to zero, so ensuring they
>> are reloaded wouldn't do anything.  The key to making forward progress is seeing
>> that there is no in-progress invalidation.
>>
>> I did consider adding said READ_ONCE(), but practically speaking, constant false
>> positives are impossible.  KVM will re-enter the guest when retrying, and there
>> is zero chance of the compiler avoiding reloads across VM-Enter+VM-Exit.
>>
>> I suppose in theory we might someday differentiate between "retry because a different
>> vCPU may have fixed the fault" and "retry because there's an in-progress invalidation",
>> and not bother re-entering the guest for the latter, e.g. have it try to yield
>> instead.  
>>
>> All that said, READ_ONCE() on mmu_invalidate_in_progress should effectively be a
>> nop, so it wouldn't hurt to be paranoid in this case.
>>
>> Hmm, at that point, it probably makes sense to add a READ_ONCE() for mmu_invalidate_seq
>> too, e.g. so that a sufficiently clever compiler doesn't completely optimize away
>> the check.  Losing the check wouldn't be problematic (false negatives are fine,
>> especially on that particular check), but the generated code would *look* buggy.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-16  3:44                       ` bibo mao
@ 2023-08-16  5:14                         ` Yan Zhao
  2023-08-16  7:29                           ` bibo mao
  0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2023-08-16  5:14 UTC (permalink / raw)
  To: bibo mao
  Cc: Sean Christopherson, linux-mm, linux-kernel, kvm, pbonzini,
	mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, david

On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
> 
> 
> On 2023/8/16 10:43, bibo mao wrote:
> > 
> > 
> > On 2023/8/15 22:50, Sean Christopherson wrote:
> >> On Tue, Aug 15, 2023, Yan Zhao wrote:
> >>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> >>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> >>>>>>
> >>>>>> Compile tested only.
> >>>>>
> >>>>> I don't find a matching end to each
> >>>>> mmu_notifier_invalidate_range_start_nonblock().
> >>>>
> >>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> >>>>
> >>>> 	if (range.start)
> >>>> 		mmu_notifier_invalidate_range_end(&range);
> >>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
> >>> if we only want the range to include pages successfully set to PROT_NONE.
> >>
> >> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
> >> to defer invalidation until it was actually needed, but still perform only a
> >> single notification so as to batch the TLB flushes, e.g. the start() call still
> >> used the original @end.
> >>
> >> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
> >> It was complete untested though, so it may not have actually done anything to reduce
> >> the number of pointless invalidations.
> > For numa-balance scenery, can original page still be used by application even if pte
> > is changed with PROT_NONE?  If it can be used, maybe we can zap shadow mmu and flush tlb
For GUP callers that do not honor FOLL_HONOR_NUMA_FAULT, yes.

See https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/

> Since there is kvm_mmu_notifier_change_pte notification when numa page is replaced with
> new page, my meaning that can original page still be used by application even if pte
> is changed with PROT_NONE and before replaced with new page?
It's not the .change_pte() notification, which is sent on COW.
do_numa_page()/do_huge_pmd_numa_page() will try to unmap the old page
protected with PROT_NONE, and if every check passes, a separate
.invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will be
sent.

So, I think KVM (though it honors FOLL_HONOR_NUMA_FAULT) can safely
keep mapping maybe-dma pages until MMU_NOTIFY_CLEAR is sent.
(this approach is implemented in RFC v1
https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/)
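
As a rough sketch of that idea (not the actual RFC v1 diff), the notifier side
can key off the range event plus the new flag from patch 1:

/*
 * Sketch only: MMU_NOTIFIER_RANGE_NUMA is the flag introduced in patch 1 of
 * this series, not an upstream define.
 */
static bool mmu_range_is_numa_protect(const struct mmu_notifier_range *range)
{
	return range->event == MMU_NOTIFY_PROTECTION_VMA &&
	       (range->flags & MMU_NOTIFIER_RANGE_NUMA);
}

kvm_mmu_notifier_invalidate_range_start() would then return early for such
ranges and defer the real unmap to the later MMU_NOTIFY_CLEAR event.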

> 
> And for primary mmu, tlb is flushed after pte is changed with PROT_NONE and 
> after mmu_notifier_invalidate_range_end notification for secondary mmu.
> Regards
> Bibo Mao

> >> in notification mmu_notifier_invalidate_range_end with precised range, the range can
But I don't think flushing the tlb for the secondary MMU only in
.invalidate_range_end() is a good idea.
The flush must be done before kvm->mmu_lock is unlocked, otherwise
confusion will be caused when multiple threads try to update the
secondary MMU.
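
i.e. the expected pattern on the KVM side is roughly (sketch, mirroring what
__kvm_handle_hva_range() does today):

	KVM_MMU_LOCK(kvm);
	flush = kvm_unmap_gfn_range(kvm, &gfn_range);	/* zap secondary MMU */
	if (flush)
		kvm_flush_remote_tlbs(kvm);		/* flush before unlock */
	KVM_MMU_UNLOCK(kvm);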

> >> be cross-range between range mmu_gather and mmu_notifier_range.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-16  7:29                           ` bibo mao
@ 2023-08-16  7:18                             ` Yan Zhao
  2023-08-16  7:53                               ` bibo mao
  0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2023-08-16  7:18 UTC (permalink / raw)
  To: bibo mao
  Cc: Sean Christopherson, linux-mm, linux-kernel, kvm, pbonzini,
	mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, david

On Wed, Aug 16, 2023 at 03:29:22PM +0800, bibo mao wrote:
> > Flush must be done before kvm->mmu_lock is unlocked, otherwise,
> > confusion will be caused when multiple threads trying to update the
> > secondary MMU.
> Since tlb flush is delayed after all pte entries are cleared, and currently
> there is no tlb flush range supported for secondary mmu. I do not know why there
> is confusion before or after kvm->mmu_lock.

Oh, do you mean only doing kvm_unmap_gfn_range() in .invalidate_range_end()?
Then checking whether PROT_NONE is set in the primary MMU before the unmap?
Looks like a good idea, I need to check whether it's feasible.
Thanks!




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-16  5:14                         ` Yan Zhao
@ 2023-08-16  7:29                           ` bibo mao
  2023-08-16  7:18                             ` Yan Zhao
  0 siblings, 1 reply; 47+ messages in thread
From: bibo mao @ 2023-08-16  7:29 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Sean Christopherson, linux-mm, linux-kernel, kvm, pbonzini,
	mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, david



On 2023/8/16 13:14, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
>>
>>
>> On 2023/8/16 10:43, bibo mao wrote:
>>>
>>>
>>> On 2023/8/15 22:50, Sean Christopherson wrote:
>>>> On Tue, Aug 15, 2023, Yan Zhao wrote:
>>>>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>>>>
>>>>>>>> Compile tested only.
>>>>>>>
>>>>>>> I don't find a matching end to each
>>>>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>>>>
>>>>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>>>>
>>>>>> 	if (range.start)
>>>>>> 		mmu_notifier_invalidate_range_end(&range);
>>>>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>>>>> if we only want the range to include pages successfully set to PROT_NONE.
>>>>
>>>> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
>>>> to defer invalidation until it was actually needed, but still perform only a
>>>> single notification so as to batch the TLB flushes, e.g. the start() call still
>>>> used the original @end.
>>>>
>>>> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
>>>> It was complete untested though, so it may not have actually done anything to reduce
>>>> the number of pointless invalidations.
>>> For numa-balance scenery, can original page still be used by application even if pte
>>> is changed with PROT_NONE?  If it can be used, maybe we can zap shadow mmu and flush tlb
> For GUPs that does not honor FOLL_HONOR_NUMA_FAULT, yes,
> 
> See https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/
> 
>> Since there is kvm_mmu_notifier_change_pte notification when numa page is replaced with
>> new page, my meaning that can original page still be used by application even if pte
>> is changed with PROT_NONE and before replaced with new page?
> It's not .change_pte() notification, which is sent when COW.
> The do_numa_page()/do_huge_pmd_numa_page() will try to unmap old page
> protected with PROT_NONE, and if every check passes, a separate
> .invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will be
> sent.
Yes, you are right. The change_pte() notification will be called
from migrate_vma_pages(); I mixed it up with numa page migration. However,
invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will also be
sent when the old page is replaced with the new one.
> 
> So, I think KVM (though it honors FOLL_HONOR_NUMA_FAULT), can safely
> keep mapping maybe-dma pages until MMU_NOTIFY_CLEAR is sent.
> (this approach is implemented in RFC v1
> https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/)
> 
>>
>> And for primary mmu, tlb is flushed after pte is changed with PROT_NONE and 
>> after mmu_notifier_invalidate_range_end notification for secondary mmu.
>> Regards
>> Bibo Mao
> 
>>>> in notification mmu_notifier_invalidate_range_end with precised range, the range can
> But I don't think flush tlb only in the .invalidate_range_end() in
> secondary MMU is a good idea.
I have no good idea, and it is beyond my ability to modify the kvm framework now :(

> Flush must be done before kvm->mmu_lock is unlocked, otherwise,
> confusion will be caused when multiple threads trying to update the
> secondary MMU.
Since the tlb flush is delayed until after all pte entries are cleared, and currently
there is no ranged tlb flush supported for the secondary mmu, I do not know why there
is confusion before or after kvm->mmu_lock.

Regards
Bibo Mao
> 
>>>> be cross-range between range mmu_gather and mmu_notifier_range.
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-15  2:34               ` John Hubbard
@ 2023-08-16  7:43                 ` David Hildenbrand
  2023-08-16  9:06                   ` Yan Zhao
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2023-08-16  7:43 UTC (permalink / raw)
  To: John Hubbard, Yan Zhao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman

On 15.08.23 04:34, John Hubbard wrote:
> On 8/14/23 02:09, Yan Zhao wrote:
> ...
>>> hmm_range_fault()-based memory management in particular might benefit
>>> from having NUMA balancing disabled entirely for the memremap_pages()
>>> region, come to think of it. That seems relatively easy and clean at
>>> first glance anyway.
>>>
>>> For other regions (allocated by the device driver), a per-VMA flag
>>> seems about right: VM_NO_NUMA_BALANCING ?
>>>
>> Thanks a lot for those good suggestions!
>> For VMs, when could a per-VMA flag be set?
>> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
>> after it's mapped into VFIO.
>> Then, should VFIO set this flag on after it maps a range?
>> Could this flag be unset after device hot-unplug?
>>
> 
> I'm hoping someone who thinks about VMs and VFIO often can chime in.

At least QEMU could just set it on the applicable VMAs (as said by Yuan 
Yao, using madvise).

BUT, I do wonder what value there would be for autonuma to still be 
active for the remainder of the hypervisor. If there is none, a prctl() 
would be better.

We already do have a mechanism in QEMU to get notified when 
longterm-pinning in the kernel might happen (and, therefore, 
MADV_DONTNEED must not be used):
* ram_block_discard_disable()
* ram_block_uncoordinated_discard_disable()


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-16  7:18                             ` Yan Zhao
@ 2023-08-16  7:53                               ` bibo mao
  2023-08-16 13:39                                 ` Sean Christopherson
  0 siblings, 1 reply; 47+ messages in thread
From: bibo mao @ 2023-08-16  7:53 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Sean Christopherson, linux-mm, linux-kernel, kvm, pbonzini,
	mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, david



On 2023/8/16 15:18, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 03:29:22PM +0800, bibo mao wrote:
>>> Flush must be done before kvm->mmu_lock is unlocked, otherwise,
>>> confusion will be caused when multiple threads trying to update the
>>> secondary MMU.
>> Since tlb flush is delayed after all pte entries are cleared, and currently
>> there is no tlb flush range supported for secondary mmu. I do not know why there
>> is confusion before or after kvm->mmu_lock.
> 
> Oh, do you mean only do kvm_unmap_gfn_range() in .invalidate_range_end()?
Yes, it is just a sketchy thought for the numa balance scenario:
do kvm_unmap_gfn_range() in invalidate_range_end rather than in
invalidate_range_start.
 
> Then check if PROT_NONE is set in primary MMU before unmap?
> Looks like a good idea, I need to check if it's feasible.
> Thanks!
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-16  7:43                 ` David Hildenbrand
@ 2023-08-16  9:06                   ` Yan Zhao
  2023-08-16  9:49                     ` David Hildenbrand
  0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2023-08-16  9:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: John Hubbard, linux-mm, linux-kernel, kvm, pbonzini, seanjc,
	mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman

On Wed, Aug 16, 2023 at 09:43:40AM +0200, David Hildenbrand wrote:
> On 15.08.23 04:34, John Hubbard wrote:
> > On 8/14/23 02:09, Yan Zhao wrote:
> > ...
> > > > hmm_range_fault()-based memory management in particular might benefit
> > > > from having NUMA balancing disabled entirely for the memremap_pages()
> > > > region, come to think of it. That seems relatively easy and clean at
> > > > first glance anyway.
> > > > 
> > > > For other regions (allocated by the device driver), a per-VMA flag
> > > > seems about right: VM_NO_NUMA_BALANCING ?
> > > > 
> > > Thanks a lot for those good suggestions!
> > > For VMs, when could a per-VMA flag be set?
> > > Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
> > > after it's mapped into VFIO.
> > > Then, should VFIO set this flag on after it maps a range?
> > > Could this flag be unset after device hot-unplug?
> > > 
> > 
> > I'm hoping someone who thinks about VMs and VFIO often can chime in.
> 
> At least QEMU could just set it on the applicable VMAs (as said by Yuan Yao,
> using madvise).
> 
> BUT, I do wonder what value there would be for autonuma to still be active
Currently the highest MADV_* value is 25
	#define MADV_COLLAPSE   25
and the madvise() behavior argument is of type "int", so adding a new advice value is ok.

But vma->vm_flags is of type "unsigned long", so it's already full, at least on 32bit platforms.

> for the remainder of the hypervisor. If there is none, a prctl() would be
> better.
Add a new field to "struct vma_numab_state" in the vma, and use prctl() to
update this field?

e.g.
struct vma_numab_state {
        unsigned long next_scan;
        unsigned long next_pid_reset;
        unsigned long access_pids[2];
	bool no_scan;
};
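
The scanner side could then be a tiny check along these lines (sketch only;
exact placement in the task_numa_work() VMA walk to be decided):

	/* Sketch: in the VMA loop of task_numa_work(), kernel/sched/fair.c */
	if (vma->numab_state && vma->numab_state->no_scan)
		continue;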

> 
> We already do have a mechanism in QEMU to get notified when longterm-pinning
> in the kernel might happen (and, therefore, MADV_DONTNEED must not be used):
> * ram_block_discard_disable()
> * ram_block_uncoordinated_discard_disable()
It looks like this ram_block_discard allow/disallow state is global rather than per-VMA
in QEMU.
So, do you mean that the kernel should provide a per-VMA allow/disallow mechanism, and
it's up to user space to choose between the per-VMA, more complex way and the
global, simpler way?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-16  9:06                   ` Yan Zhao
@ 2023-08-16  9:49                     ` David Hildenbrand
  2023-08-16 18:00                       ` John Hubbard
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2023-08-16  9:49 UTC (permalink / raw)
  To: Yan Zhao
  Cc: John Hubbard, linux-mm, linux-kernel, kvm, pbonzini, seanjc,
	mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman

On 16.08.23 11:06, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 09:43:40AM +0200, David Hildenbrand wrote:
>> On 15.08.23 04:34, John Hubbard wrote:
>>> On 8/14/23 02:09, Yan Zhao wrote:
>>> ...
>>>>> hmm_range_fault()-based memory management in particular might benefit
>>>>> from having NUMA balancing disabled entirely for the memremap_pages()
>>>>> region, come to think of it. That seems relatively easy and clean at
>>>>> first glance anyway.
>>>>>
>>>>> For other regions (allocated by the device driver), a per-VMA flag
>>>>> seems about right: VM_NO_NUMA_BALANCING ?
>>>>>
>>>> Thanks a lot for those good suggestions!
>>>> For VMs, when could a per-VMA flag be set?
>>>> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
>>>> after it's mapped into VFIO.
>>>> Then, should VFIO set this flag on after it maps a range?
>>>> Could this flag be unset after device hot-unplug?
>>>>
>>>
>>> I'm hoping someone who thinks about VMs and VFIO often can chime in.
>>
>> At least QEMU could just set it on the applicable VMAs (as said by Yuan Yao,
>> using madvise).
>>
>> BUT, I do wonder what value there would be for autonuma to still be active
> Currently MADV_* is up to 25
> 	#define MADV_COLLAPSE   25,
> while madvise behavior is of type "int". So it's ok.
> 
> But vma->vm_flags is of "unsigned long", so it's full at least on 32bit platform.

I remember there were discussions to increase it also for 32bit. If 
that's required, we might want to go down that path.

But do 32bit architectures even care about NUMA hinting? If not, just 
ignore them ...

> 
>> for the remainder of the hypervisor. If there is none, a prctl() would be
>> better.
> Add a new field in "struct vma_numab_state" in vma, and use prctl() to
> update this field?

Rather a global toggle per MM, no need to update individual VMAs -- if 
we go down that prctl() path.

No need to consume more memory for VMAs.

[...]

>> We already do have a mechanism in QEMU to get notified when longterm-pinning
>> in the kernel might happen (and, therefore, MADV_DONTNEED must not be used):
>> * ram_block_discard_disable()
>> * ram_block_uncoordinated_discard_disable()
> Looks this ram_block_discard allow/disallow state is global rather than per-VMA
> in QEMU.

Yes. Once you transition into "discard of any kind disabled", you can go 
over all guest memory VMAs (RAMBlock) and issue an madvise() for them. 
(or alternatively, do the prctl() once )

We'll also have to handle new guest memory being created afterwards, but 
that is easy.

Once we transition to "no discarding disabled", you can go over all 
guest memory VMAs (RAMBlock) and issue an madvise() for them again (or 
alternatively, do the prctl() once).
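
A very rough QEMU-side sketch of the madvise() variant (RAMBLOCK_FOREACH() and
block->host are existing QEMU internals; MADV_NO_NUMA_BALANCING is still a
hypothetical advice value, see earlier in the thread):

static void ram_block_disable_numa_hinting(void)
{
    RAMBlock *block;

    RCU_READ_LOCK_GUARD();
    RAMBLOCK_FOREACH(block) {
        if (block->host) {
            madvise(block->host, block->used_length,
                    MADV_NO_NUMA_BALANCING);
        }
    }
}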

> So, do you mean that let kernel provide a per-VMA allow/disallow mechanism, and
> it's up to the user space to choose between per-VMA and complex way or
> global and simpler way?

QEMU could do either way. The question would be whether a per-vma setting
makes sense for NUMA hinting.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
  2023-08-16  7:53                               ` bibo mao
@ 2023-08-16 13:39                                 ` Sean Christopherson
  0 siblings, 0 replies; 47+ messages in thread
From: Sean Christopherson @ 2023-08-16 13:39 UTC (permalink / raw)
  To: bibo mao
  Cc: Yan Zhao, linux-mm, linux-kernel, kvm, pbonzini, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, david

On Wed, Aug 16, 2023, bibo mao wrote:
> 
> 
> On 2023/8/16 15:18, Yan Zhao wrote:
> > On Wed, Aug 16, 2023 at 03:29:22PM +0800, bibo mao wrote:
> >>> Flush must be done before kvm->mmu_lock is unlocked, otherwise,
> >>> confusion will be caused when multiple threads trying to update the
> >>> secondary MMU.
> >> Since tlb flush is delayed after all pte entries are cleared, and currently
> >> there is no tlb flush range supported for secondary mmu. I do know why there
> >> is confusion before or after kvm->mmu_lock.
> > 
> > Oh, do you mean only do kvm_unmap_gfn_range() in .invalidate_range_end()?
> yes, it is just sketchy thought for numa balance scenery, 
> do kvm_unmap_gfn_range() in invalidate_range_end rather than
> invalidate_range_start.

That is not an option, it's a direct violation of the mmu_notifier contract.
Secondary MMUs must drop all references before returning from invalidate_range_start().

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-16  9:49                     ` David Hildenbrand
@ 2023-08-16 18:00                       ` John Hubbard
  2023-08-17  5:05                         ` Yan Zhao
  0 siblings, 1 reply; 47+ messages in thread
From: John Hubbard @ 2023-08-16 18:00 UTC (permalink / raw)
  To: David Hildenbrand, Yan Zhao
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman

On 8/16/23 02:49, David Hildenbrand wrote:
> But do 32bit architectures even care about NUMA hinting? If not, just 
> ignore them ...

Probably not!

...
>> So, do you mean that let kernel provide a per-VMA allow/disallow 
>> mechanism, and
>> it's up to the user space to choose between per-VMA and complex way or
>> global and simpler way?
> 
> QEMU could do either way. The question would be if a per-vma settings 
> makes sense for NUMA hinting.

From our experience with compute on GPUs, a per-mm setting would suffice.
No need to go all the way to VMA granularity.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-16 18:00                       ` John Hubbard
@ 2023-08-17  5:05                         ` Yan Zhao
  2023-08-17  7:38                           ` David Hildenbrand
  0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2023-08-17  5:05 UTC (permalink / raw)
  To: John Hubbard
  Cc: David Hildenbrand, linux-mm, linux-kernel, kvm, pbonzini, seanjc,
	mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman,
	alex.williamson

On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote:
> On 8/16/23 02:49, David Hildenbrand wrote:
> > But do 32bit architectures even care about NUMA hinting? If not, just
> > ignore them ...
> 
> Probably not!
> 
> ...
> > > So, do you mean that let kernel provide a per-VMA allow/disallow
> > > mechanism, and
> > > it's up to the user space to choose between per-VMA and complex way or
> > > global and simpler way?
> > 
> > QEMU could do either way. The question would be if a per-vma settings
> > makes sense for NUMA hinting.
> 
> From our experience with compute on GPUs, a per-mm setting would suffice.
> No need to go all the way to VMA granularity.
> 
After an offline internal discussion, we think a per-mm setting is also
enough for device passthrough in VMs.

BTW, if we want a per-VMA flag, compared to VM_NO_NUMA_BALANCING, do you
think there is any value in providing a flag like VM_MAYDMA instead?
Auto NUMA balancing or other components could then decide how to use it by
themselves.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-17  5:05                         ` Yan Zhao
@ 2023-08-17  7:38                           ` David Hildenbrand
  2023-08-18  0:13                             ` Yan Zhao
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2023-08-17  7:38 UTC (permalink / raw)
  To: Yan Zhao, John Hubbard
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman,
	alex.williamson

On 17.08.23 07:05, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote:
>> On 8/16/23 02:49, David Hildenbrand wrote:
>>> But do 32bit architectures even care about NUMA hinting? If not, just
>>> ignore them ...
>>
>> Probably not!
>>
>> ...
>>>> So, do you mean that let kernel provide a per-VMA allow/disallow
>>>> mechanism, and
>>>> it's up to the user space to choose between per-VMA and complex way or
>>>> global and simpler way?
>>>
>>> QEMU could do either way. The question would be if a per-vma settings
>>> makes sense for NUMA hinting.
>>
>>  From our experience with compute on GPUs, a per-mm setting would suffice.
>> No need to go all the way to VMA granularity.
>>
> After an offline internal discussion, we think a per-mm setting is also
> enough for device passthrough in VMs.
> 
> BTW, if we want a per-VMA flag, compared to VM_NO_NUMA_BALANCING, do you
> think it's of any value to providing a flag like VM_MAYDMA?
> Auto NUMA balancing or other components can decide how to use it by
> themselves.

Short-lived DMA is not really the problem. The problem is long-term pinning.

There was a discussion about letting user space similarly hint that 
long-term pinning might/will happen.

Because when long-term pinning a page we have to make sure to migrate it 
off of ZONE_MOVABLE / MIGRATE_CMA.

But the kernel prefers to place pages there.

So with vfio in QEMU, we might preallocate memory for the guest and 
place it on ZONE_MOVABLE/MIGRATE_CMA, just so long-term pinning has to 
migrate all these fresh pages out of these areas again.

So letting the kernel know about that in this context might also help.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-17  7:38                           ` David Hildenbrand
@ 2023-08-18  0:13                             ` Yan Zhao
  2023-08-18  2:29                               ` John Hubbard
  0 siblings, 1 reply; 47+ messages in thread
From: Yan Zhao @ 2023-08-18  0:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: John Hubbard, linux-mm, linux-kernel, kvm, pbonzini, seanjc,
	mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman,
	alex.williamson

On Thu, Aug 17, 2023 at 09:38:37AM +0200, David Hildenbrand wrote:
> On 17.08.23 07:05, Yan Zhao wrote:
> > On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote:
> > > On 8/16/23 02:49, David Hildenbrand wrote:
> > > > But do 32bit architectures even care about NUMA hinting? If not, just
> > > > ignore them ...
> > > 
> > > Probably not!
> > > 
> > > ...
> > > > > So, do you mean that let kernel provide a per-VMA allow/disallow
> > > > > mechanism, and
> > > > > it's up to the user space to choose between per-VMA and complex way or
> > > > > global and simpler way?
> > > > 
> > > > QEMU could do either way. The question would be if a per-vma settings
> > > > makes sense for NUMA hinting.
> > > 
> > >  From our experience with compute on GPUs, a per-mm setting would suffice.
> > > No need to go all the way to VMA granularity.
> > > 
> > After an offline internal discussion, we think a per-mm setting is also
> > enough for device passthrough in VMs.
> > 
> > BTW, if we want a per-VMA flag, compared to VM_NO_NUMA_BALANCING, do you
> > think it's of any value to providing a flag like VM_MAYDMA?
> > Auto NUMA balancing or other components can decide how to use it by
> > themselves.
> 
> Short-lived DMA is not really the problem. The problem is long-term pinning.
> 
> There was a discussion about letting user space similarly hint that
> long-term pinning might/will happen.
> 
> Because when long-term pinning a page we have to make sure to migrate it off
> of ZONE_MOVABLE / MIGRATE_CMA.
> 
> But the kernel prefers to place pages there.
> 
> So with vfio in QEMU, we might preallocate memory for the guest and place it
> on ZONE_MOVABLE/MIGRATE_CMA, just so long-term pinning has to migrate all
> these fresh pages out of these areas again.
> 
> So letting the kernel know about that in this context might also help.
>
Thanks! Glad to know that :)
But considering the GPU case John mentioned, since the memory is
not even pinned, maybe they still need a flag like VM_NO_NUMA_BALANCING?
For VMs, we would hint VM_NO_NUMA_BALANCING for passthrough devices supporting
IO page faults (so no need to pin), and VM_MAYLONGTERMDMA to avoid misplacement
and migration.

Is that good?
Or do you think just a per-mm flag like MMF_NO_NUMA is good enough for
now?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-18  0:13                             ` Yan Zhao
@ 2023-08-18  2:29                               ` John Hubbard
  2023-09-04  9:18                                 ` Yan Zhao
  0 siblings, 1 reply; 47+ messages in thread
From: John Hubbard @ 2023-08-18  2:29 UTC (permalink / raw)
  To: Yan Zhao, David Hildenbrand
  Cc: linux-mm, linux-kernel, kvm, pbonzini, seanjc, mike.kravetz,
	apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman,
	alex.williamson

On 8/17/23 17:13, Yan Zhao wrote:
...
> But consider for GPUs case as what John mentioned, since the memory is
> not even pinned, maybe they still need flag VM_NO_NUMA_BALANCING ?
> For VMs, we hint VM_NO_NUMA_BALANCING for passthrough devices supporting
> IO page fault (so no need to pin), and VM_MAYLONGTERMDMA to avoid misplace
> and migration.
> 
> Is that good?
> Or do you think just a per-mm flag like MMF_NO_NUMA is good enough for
> now?
> 

So far, a per-mm setting seems like it would suffice. However, it is
also true that new hardware is getting really creative and large, to
the point that it's not inconceivable that a process might actually
want to let NUMA balancing run in part of its mm, but turn it off
to allow fault-able device access to another part of the mm.

We aren't seeing that yet, but on the other hand, that may be
simply because there is no practical way to set that up and see
how well it works.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
  2023-08-18  2:29                               ` John Hubbard
@ 2023-09-04  9:18                                 ` Yan Zhao
  0 siblings, 0 replies; 47+ messages in thread
From: Yan Zhao @ 2023-09-04  9:18 UTC (permalink / raw)
  To: John Hubbard
  Cc: David Hildenbrand, linux-mm, linux-kernel, kvm, pbonzini, seanjc,
	mike.kravetz, apopple, jgg, rppt, akpm, kevin.tian, Mel Gorman,
	alex.williamson

On Thu, Aug 17, 2023 at 07:29:12PM -0700, John Hubbard wrote:
> On 8/17/23 17:13, Yan Zhao wrote:
> ...
> > But consider for GPUs case as what John mentioned, since the memory is
> > not even pinned, maybe they still need flag VM_NO_NUMA_BALANCING ?
> > For VMs, we hint VM_NO_NUMA_BALANCING for passthrough devices supporting
> > IO page fault (so no need to pin), and VM_MAYLONGTERMDMA to avoid misplace
> > and migration.
> > 
> > Is that good?
> > Or do you think just a per-mm flag like MMF_NO_NUMA is good enough for
> > now?
> > 
> 
> So far, a per-mm setting seems like it would suffice. However, it is
> also true that new hardware is getting really creative and large, to
> the point that it's not inconceivable that a process might actually
> want to let NUMA balancing run in part of its mm, but turn it off
> to allow fault-able device access to another part of the mm.
> 
> We aren't seeing that yet, but on the other hand, that may be
> simply because there is no practical way to set that up and see
> how well it works.
> 
>
Hi guys,
Thanks a lot for your review and suggestions!
I'll first try to add a per-mm flag to fix this problem later
(but maybe not very soon).

Thanks
Yan



^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2023-09-04  9:46 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
2023-08-10  8:56 [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao
2023-08-10  8:57 ` [RFC PATCH v2 1/5] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao
2023-08-10  8:58 ` [RFC PATCH v2 2/5] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose Yan Zhao
2023-08-10  9:00 ` [RFC PATCH v2 3/5] mm/mmu_notifier: introduce a new callback .numa_protect Yan Zhao
2023-08-10  9:00 ` [RFC PATCH v2 4/5] mm/autonuma: call .numa_protect() when page is protected for NUMA migrate Yan Zhao
2023-08-11 18:52   ` Nadav Amit
2023-08-14  7:52     ` Yan Zhao
2023-08-10  9:02 ` [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration Yan Zhao
2023-08-10 13:16   ` bibo mao
2023-08-11  3:45     ` Yan Zhao
2023-08-11  7:40       ` bibo mao
2023-08-11  8:01         ` Yan Zhao
2023-08-11 17:14           ` Sean Christopherson
2023-08-11 17:18             ` Jason Gunthorpe
2023-08-14  6:52             ` Yan Zhao
2023-08-14  7:44               ` Yan Zhao
2023-08-14 16:40               ` Sean Christopherson
2023-08-15  1:54                 ` Yan Zhao
2023-08-15 14:50                   ` Sean Christopherson
2023-08-16  2:43                     ` bibo mao
2023-08-16  3:44                       ` bibo mao
2023-08-16  5:14                         ` Yan Zhao
2023-08-16  7:29                           ` bibo mao
2023-08-16  7:18                             ` Yan Zhao
2023-08-16  7:53                               ` bibo mao
2023-08-16 13:39                                 ` Sean Christopherson
2023-08-10  9:34 ` [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM David Hildenbrand
2023-08-10  9:50   ` Yan Zhao
2023-08-11 17:25     ` David Hildenbrand
2023-08-11 18:20       ` John Hubbard
2023-08-11 18:39         ` David Hildenbrand
2023-08-11 19:35           ` John Hubbard
2023-08-14  9:09             ` Yan Zhao
2023-08-15  2:34               ` John Hubbard
2023-08-16  7:43                 ` David Hildenbrand
2023-08-16  9:06                   ` Yan Zhao
2023-08-16  9:49                     ` David Hildenbrand
2023-08-16 18:00                       ` John Hubbard
2023-08-17  5:05                         ` Yan Zhao
2023-08-17  7:38                           ` David Hildenbrand
2023-08-18  0:13                             ` Yan Zhao
2023-08-18  2:29                               ` John Hubbard
2023-09-04  9:18                                 ` Yan Zhao
2023-08-15  2:36               ` Yuan Yao
2023-08-15  2:37                 ` Yan Zhao
2023-08-10 13:58 ` Chao Gao
2023-08-11  5:22   ` Yan Zhao
