From: Yan Zhao <yan.y.zhao@intel.com>
To: Sean Christopherson <seanjc@google.com>
Cc: bibo mao <maobibo@loongson.cn>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<pbonzini@redhat.com>, <mike.kravetz@oracle.com>,
	<apopple@nvidia.com>, <jgg@nvidia.com>, <rppt@kernel.org>,
	<akpm@linux-foundation.org>, <kevin.tian@intel.com>,
	<david@redhat.com>
Subject: Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
Date: Mon, 14 Aug 2023 14:52:07 +0800	[thread overview]
Message-ID: <ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com> (raw)
In-Reply-To: <ZNZshVZI5bRq4mZQ@google.com>

On Fri, Aug 11, 2023 at 10:14:45AM -0700, Sean Christopherson wrote:
> On Fri, Aug 11, 2023, Yan Zhao wrote:
> > On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> > > 
> > > On 2023/8/11 11:45, Yan Zhao wrote:
> > > >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > > >>> +					  struct mm_struct *mm,
> > > >>> +					  unsigned long start,
> > > >>> +					  unsigned long end)
> > > >>> +{
> > > >>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > > >>> +
> > > >>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > > >>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > > >>> +		return;
> > > >>> +
> > > >>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > > >>> +}
> > > >> numa balance will scan wide memory range, and there will be one time
> > > > Though scanning memory range is wide, .invalidate_range_start() is sent
> > > > for each 2M range.
> > > yes, range is huge page size when changing numa protection during numa scanning.
> > > 
> > > > 
> > > >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> > > >> it may bring out lots of flush remote tlb ipi notification.
> > > > 
> > > > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > > > will be reduced to 0 with this series.
> > > > 
> > > > For VMs without assigned devices or mdev devices, I was previously also
> > > > worried about that there might be more IPIs.
> > > > But with current test data, there's no more remote tlb IPIs on average.
> > > > 
> > > > The reason is below:
> > > > 
> > > > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > > > range.
> 
> No, it's potentially called once per 1GiB range.  change_pmd_range() invokes the
> mmu_notifier with addr+end, where "end" is the end of the range covered by the
> PUD, not the end of the current PMD.  So the worst case scenario would be a
> 256k increase.  Of course, if you have to migrate an entire 1GiB chunk of memory
> then you likely have bigger problems, but still.
Yes, thanks for pointing it out.
I realized it after re-reading the code and re-checking the log message.
This wider range also explains the worst-case data collected:
44 kvm_unmap_gfn_range() calls caused 43920 kvm_flush_remote_tlbs() requests,
i.e. roughly 998 remote TLB flush requests per kvm_unmap_gfn_range() on average.

> 
> > > > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > > > mapped to a huge page in primary MMU, and called for at most 512 times
> > > > if mapped to 4K pages in primary MMU.
> > > > 
> > > > 
> > > > Though kvm_unmap_gfn_range() is only called once before this series,
> > > > as the range is blockable, when there're contentions, remote tlb IPIs
> > > > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> > > I do not know much about x86, does this happen always or only need reschedule
> > Ah, sorry, I missed platforms other than x86.
> > Maybe there will be a big difference in other platforms.
> > 
> > > from code?  so that there will be many times of tlb IPIs in only once function
> > Only when MMU lock is contended. But it's not seldom because of the contention in
> > TDP page fault.
> 
> No?  I don't see how mmu_lock contention would affect the number of calls to 
> kvm_flush_remote_tlbs().  If vCPUs are running and not generating faults, i.e.
> not trying to access the range in question, then every zap will generate a remote
> TLB flush and thus send IPIs to all running vCPUs.
In tdp_mmu_zap_leafs(), as called from kvm_unmap_gfn_range(), the flow is
roughly:

1. Check whether mmu_lock needs to be rescheduled
1.a -- if yes,
       1.a.1 call kvm_flush_remote_tlbs() if flush is true
       1.a.2 drop and re-take mmu_lock
       1.a.3 go to 2
1.b -- if no, go to 2
2. Zap the present leaf entry and set flush to true
3. Get the next gfn and go to 1

With a wide range passed to .invalidate_range_start()/end(), it's easy for
rwlock_needbreak(&kvm->mmu_lock) to become true (steps 1.a and 1.a.1),
so tdp_mmu_zap_leafs() performs a TLB flush request before each reschedule
of mmu_lock (step 1.a.1), as sketched below.
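
To make that concrete, here is a self-contained toy model. It is not KVM
code: zap_range() and lock_contended() are made-up names, and the
"contention every 8 GFNs" pattern is purely an assumption for illustration.
It only counts how many remote-TLB-flush requests the loop above would issue
with and without contention:

/* Toy model of the zap loop above: counts "remote TLB flush" requests. */
#include <stdbool.h>
#include <stdio.h>

/* Pretend a vCPU fault contends mmu_lock once every 8 GFNs. */
static bool lock_contended(unsigned long gfn)
{
	return (gfn % 8) == 0;
}

static unsigned long zap_range(unsigned long start, unsigned long end,
			       bool contended)
{
	unsigned long gfn, flushes = 0;
	bool flush = false;

	for (gfn = start; gfn < end; gfn++) {
		/* Step 1: does mmu_lock need to be dropped? */
		if (contended && lock_contended(gfn)) {
			/* Step 1.a.1: flush before yielding the lock. */
			if (flush) {
				flushes++;
				flush = false;
			}
			/* Step 1.a.2: drop/re-take mmu_lock (omitted). */
		}
		/* Step 2: zap the leaf entry and request a flush. */
		flush = true;
		/* Step 3: advance to the next GFN. */
	}
	/* The caller issues one final batched flush if needed. */
	if (flush)
		flushes++;
	return flushes;
}

int main(void)
{
	/* GFN range taken from the real case described below. */
	printf("contended:   %lu flushes\n", zap_range(0xb4000, 0xb9800, true));
	printf("uncontended: %lu flushes\n", zap_range(0xb4000, 0xb9800, false));
	return 0;
}

With that made-up contention pattern, the contended run reports a few
thousand flushes for a single range while the uncontended run reports
exactly one, which matches the shape of the real numbers below.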


Take a real case as an example:
NUMA balancing requests KVM to zap the GFN range from 0xb4000 to 0xb9800.
When KVM zaps GFN 0xb807b, it finds mmu_lock needs a break because a vCPU
is faulting on GFN 0xb804c,
and the vCPUs keep retrying the fault on 0xb804c (3298 times in total)
until .invalidate_range_end().
So for this one zap of the GFN range 0xb4000 - 0xb9800, the vCPUs retry
faults on
GFN 0xb804c 3298 times,
GFN 0xb8074 3161 times, and
GFN 0xb81ce 15190 times,
and the accesses to those 3 GFNs cause 3209 calls to kvm_flush_remote_tlbs()
for a single kvm_unmap_gfn_range(), because the flush requests are not batched
(the range is mapped with both 2M and 4K pages in the secondary MMU).

Without mmu_lock contention, tdp_mmu_zap_leafs() combines all the flush
requests and leaves a single kvm_flush_remote_tlbs() to be issued after it
returns.


In my test scenario, which is VM boot-up, this kind of contention is
frequent.
Here's the data from 10 VM boot-ups, collected previously:
  run  |      count of       |       count of          |
       | kvm_unmap_gfn_range | kvm_flush_remote_tlbs() |
-------|---------------------|-------------------------|
   1   |         38          |           14            |
   2   |         28          |         5191            |
   3   |         36          |        13398            |
   4   |         44          |        43920            |
   5   |         28          |           14            |
   6   |         36          |        10803            |
   7   |         38          |          892            |
   8   |         32          |         5905            |
   9   |         32          |           13            |
  10   |         38          |         6096            |
-------|---------------------|-------------------------|
average|         35          |         8625            |


I wonder if we could reduce the frequency of the rescheduling check in
tdp_mmu_iter_cond_resched() when the zap range is wide, i.e. don't yield
until the iterator has advanced at least 2M worth of GFNs past the last
yield point, e.g.

if (iter->next_last_level_gfn <
    iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
	return false;
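
(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) is 512 with 4K base pages, so this would
guarantee at least 2M of progress between yields; the exact threshold is of
course open for discussion.)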

> 
> > > call about kvm_unmap_gfn_range.
> > > 
> > > > if the pages are mapped in 4K in secondary MMU.
> > > > 
> > > > With this series, on the other hand, .numa_protect() sets range to be
> > > > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > > > mapped into small PTEs in secondary MMU.
> > > > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> > > No, .numa_protect() is not sent for all pages. It depends on the workload,
> > > whether the page is accessed for different cpu threads cross-nodes.
> > The .numa_protect() is called in patch 4 only when PROT_NONE is set to
> > the page.
> 
> I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA.  It's too much of a one-off,
> and losing the batching of invalidations makes me nervous.  As Bibo points out,
> the behavior will vary based on the workload, VM configuration, etc.
> 
> There's also a *very* subtle change, in that the notification will be sent while
> holding the PMD/PTE lock.  Taking KVM's mmu_lock under that is *probably* ok, but
> I'm not exactly 100% confident on that.  And the only reason there isn't a more
KVM's mmu_lock is an rwlock, which is a variant of spinlock.
So I think taking it while holding the PMD/PTE lock is ok.
Actually, we already do that for the .change_pte() notification:
kvm_mmu_notifier_change_pte() takes KVM's mmu_lock for write while the PTE
lock is held, because set_pte_at_notify() --> .change_pte() is sent under
the PTE lock.
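
A rough sketch of that existing nesting, from my reading of the code (paths
abbreviated, so treat it as an approximation rather than an exact call trace):

wp_page_copy()                        mm/memory.c, PTE lock (vmf->ptl) held
  set_pte_at_notify()
    mmu_notifier_change_pte()
      kvm_mmu_notifier_change_pte()   virt/kvm/kvm_main.c
        kvm_handle_hva_range()
          KVM_MMU_LOCK(kvm)           mmu_lock taken (for write on x86)
          kvm_change_spte_gfn() --> kvm_set_spte_gfn()
          KVM_MMU_UNLOCK(kvm)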


> obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> won't yield if there's mmu_lock contention.
Yes, KVM won't yield and reschedule mmu_lock, so it's fine.

> However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> I think we can achieve what you want without losing batching, and without changing
> secondary MMUs.
> 
> Rather than muck with the notification types and add a one-off for NUMA, just
> defer the notification until a present PMD/PTE is actually going to be modified.
> It's not the prettiest, but other than the locking, I don't think it has any chance
> of regressing other workloads/configurations.
> 
> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> 
> Compile tested only.

I can't find a matching mmu_notifier_invalidate_range_end() for each
mmu_notifier_invalidate_range_start_nonblock().

> 
> From: Sean Christopherson <seanjc@google.com>
> Date: Fri, 11 Aug 2023 10:03:36 -0700
> Subject: [PATCH] tmp
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  include/linux/huge_mm.h |  4 +++-
>  include/linux/mm.h      |  3 +++
>  mm/huge_memory.c        |  5 ++++-
>  mm/mprotect.c           | 47 +++++++++++++++++++++++++++++------------
>  4 files changed, 44 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 20284387b841..b88316adaaad 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -7,6 +7,8 @@
>  
>  #include <linux/fs.h> /* only for vma_is_dax() */
>  
> +struct mmu_notifier_range;
> +
>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> @@ -38,7 +40,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  		   unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
>  int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> -		    unsigned long cp_flags);
> +		    unsigned long cp_flags, struct mmu_notifier_range *range);
>  
>  vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>  vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2dd73e4f3d8e..284f61cf9c37 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2478,6 +2478,9 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
>  	return !!(vma->vm_flags & VM_WRITE);
>  
>  }
> +
> +void change_pmd_range_notify_secondary_mmus(unsigned long addr,
> +					    struct mmu_notifier_range *range);
>  bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  			     pte_t pte);
>  extern long change_protection(struct mmu_gather *tlb,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a71cf686e3b2..47c7712b163e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1808,7 +1808,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>   */
>  int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> -		    unsigned long cp_flags)
> +		    unsigned long cp_flags, struct mmu_notifier_range *range)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	spinlock_t *ptl;
> @@ -1893,6 +1893,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    !toptier)
>  			xchg_page_access_time(page, jiffies_to_msecs(jiffies));
>  	}
> +
> +	change_pmd_range_notify_secondary_mmus(addr, range);
> +
>  	/*
>  	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
>  	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index d1a809167f49..f5844adbe0cb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -82,7 +82,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
> -		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> +		unsigned long end, pgprot_t newprot, unsigned long cp_flags,
> +		struct mmu_notifier_range *range)
>  {
>  	pte_t *pte, oldpte;
>  	spinlock_t *ptl;
> @@ -164,8 +165,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				    !toptier)
>  					xchg_page_access_time(page,
>  						jiffies_to_msecs(jiffies));
> +
> +
>  			}
>  
> +			change_pmd_range_notify_secondary_mmus(addr, range);
> +
>  			oldpte = ptep_modify_prot_start(vma, addr, pte);
>  			ptent = pte_modify(oldpte, newprot);
>  
> @@ -355,6 +360,17 @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
>  		err;							\
>  	})
>  
> +void change_pmd_range_notify_secondary_mmus(unsigned long addr,
> +					    struct mmu_notifier_range *range)
> +{
> +	if (range->start)
> +		return;
> +
> +	VM_WARN_ON(addr >= range->end);
> +	range->start = addr;
> +	mmu_notifier_invalidate_range_start_nonblock(range);
This will cause the range from addr to range->end to be invalidated, which
may include pinned ranges.

> +}
> +
>  static inline long change_pmd_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -365,7 +381,14 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>  	unsigned long nr_huge_updates = 0;
>  	struct mmu_notifier_range range;
>  
> -	range.start = 0;
> +	/*
> +	 * Defer invalidation of secondary MMUs until a PMD/PTE change is
> +	 * imminent, e.g. NUMA balancing in particular can "fail" for certain
> +	 * types of mappings.  Initialize range.start to '0' and use it to
> +	 * track whether or not the invalidation notification has been set.
> +	 */
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
> +				vma->vm_mm, 0, end);
>  
>  	pmd = pmd_offset(pud, addr);
>  	do {
> @@ -383,18 +406,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>  		if (pmd_none(*pmd))
>  			goto next;
>  
> -		/* invoke the mmu notifier if the pmd is populated */
> -		if (!range.start) {
> -			mmu_notifier_range_init(&range,
> -				MMU_NOTIFY_PROTECTION_VMA, 0,
> -				vma->vm_mm, addr, end);
> -			mmu_notifier_invalidate_range_start(&range);
> -		}
> -
>  		_pmd = pmdp_get_lockless(pmd);
>  		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
>  			if ((next - addr != HPAGE_PMD_SIZE) ||
>  			    pgtable_split_needed(vma, cp_flags)) {
> +				/*
> +				 * FIXME: __split_huge_pmd() performs its own
> +				 * mmu_notifier invalidation prior to clearing
> +				 * the PMD, ideally all invalidations for the
> +				 * range would be batched.
> +				 */
>  				__split_huge_pmd(vma, pmd, addr, false, NULL);
>  				/*
>  				 * For file-backed, the pmd could have been
> @@ -407,8 +428,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>  					break;
>  				}
>  			} else {
> -				ret = change_huge_pmd(tlb, vma, pmd,
> -						addr, newprot, cp_flags);
> +				ret = change_huge_pmd(tlb, vma, pmd, addr,
> +						      newprot, cp_flags, &range);
>  				if (ret) {
>  					if (ret == HPAGE_PMD_NR) {
>  						pages += HPAGE_PMD_NR;
> @@ -423,7 +444,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>  		}
>  
>  		ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
> -				       cp_flags);
> +				       cp_flags, &range);
>  		if (ret < 0)
>  			goto again;
>  		pages += ret;
> 
> base-commit: 1f40f634009556c119974cafce4c7b2f9b8c58ad
> -- 
> 
> 
> 
