From: Yan Zhao <yan.y.zhao@intel.com>
To: Sean Christopherson <seanjc@google.com>
Cc: bibo mao <maobibo@loongson.cn>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<pbonzini@redhat.com>, <mike.kravetz@oracle.com>,
	<apopple@nvidia.com>, <jgg@nvidia.com>, <rppt@kernel.org>,
	<akpm@linux-foundation.org>, <kevin.tian@intel.com>,
	<david@redhat.com>
Subject: Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
Date: Tue, 15 Aug 2023 09:54:37 +0800	[thread overview]
Message-ID: <ZNra3eDNTaKVc7MT@yzhao56-desk.sh.intel.com> (raw)
In-Reply-To: <ZNpZDH9//vk8Rqvo@google.com>

On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> > > I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA.  It's too much of a one-off,
> > > and losing the batching of invalidations makes me nervous.  As Bibo points out,
> > > the behavior will vary based on the workload, VM configuration, etc.
> > > 
> > > There's also a *very* subtle change, in that the notification will be sent while
> > > holding the PMD/PTE lock.  Taking KVM's mmu_lock under that is *probably* ok, but
> > > I'm not exactly 100% confident on that.  And the only reason there isn't a more
> > KVM's mmu_lock is an rwlock, which is a variant of spinlock,
> > so I think taking it within the PMD/PTE lock is ok.
> > Actually we already do that for the .change_pte() notification:
> > kvm_mmu_notifier_change_pte() takes KVM's mmu_lock for write while the
> > PTE lock is held across set_pte_at_notify() --> .change_pte().
> 
> .change_pte() gets away with running under PMD/PTE lock because (a) it's not allowed
> to fail and (b) KVM is the only secondary MMU that hooks .change_pte() and KVM
> doesn't use a sleepable lock.
The .numa_protect() notification in patch 4 is also sent in a context where it's
not allowed to fail, and KVM's handler takes no sleepable lock :)


> As Jason pointed out, mmu_notifier_invalidate_range_start_nonblock() can fail
> because some secondary MMUs use mutexes or r/w semaphores to handle mmu_notifier
> events.  For NUMA balancing, canceling the protection change might be ok, but for
> everything else, failure is not an option.  So unfortunately, my idea won't work
> as-is.
> 
> We might get away with deferring just the change_prot_numa() case, but I don't
> think that's worth the mess/complexity.
Yes, I also think deferring mmu_notifier_invalidate_range_start() won't work.
One possible approach is to batch the successful .numa_protect() notifications
(a rough sketch below), though I admit that would introduce complexity.
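Something along these lines is what I mean -- purely a hypothetical fragment of
change_pte_range(), not code from this series; the batch bookkeeping, the
change_this_pte_to_protnone() placeholder and the mmu_notifier_numa_protect()
wrapper are all assumptions:

	/*
	 * Hypothetical batching sketch: remember the span of PTEs that were
	 * actually switched to PROT_NONE and send a single .numa_protect()
	 * notification for it, instead of notifying page by page.
	 */
	unsigned long batch_start = 0, batch_end = 0;

	for (; addr != end; pte++, addr += PAGE_SIZE) {
		bool protected;

		/* existing checks and the PROT_NONE switch elided */
		protected = change_this_pte_to_protnone(vma, addr, pte);
		if (!protected)
			continue;
		if (!batch_start)
			batch_start = addr;
		batch_end = addr + PAGE_SIZE;
	}

	/* one notification for the whole batch */
	if (batch_start)
		mmu_notifier_numa_protect(vma->vm_mm, batch_start, batch_end);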

> 
> I would much rather tell userspace to disable migrate-on-fault for KVM guest
> mappings (mbind() should work?) for these types of setups, or to disable NUMA
> balancing entirely.  If the user really cares about performance of their VM(s),
> vCPUs should be affined to a single NUMA node (or core, or pinned 1:1), and if
> the VM spans multiple nodes, a virtual NUMA topology should be given to the guest.
> At that point, NUMA balancing is likely to do more harm than good.
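FWIW, from userspace that could look roughly like below (a minimal sketch, not
part of this series; it assumes libnuma's <numaif.h> wrapper for the mbind()
syscall, and NUMA balancing can also be disabled system-wide via
/proc/sys/kernel/numa_balancing):

/*
 * Minimal userspace sketch: bind a (page-aligned) guest memory region to a
 * single NUMA node so NUMA balancing should have no reason to migrate it,
 * as suggested above.  Error handling is deliberately minimal.
 */
#include <numaif.h>
#include <stdio.h>

static int bind_guest_memory(void *hva, unsigned long len, int node)
{
	unsigned long nodemask = 1UL << node;

	if (mbind(hva, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return -1;
	}
	return 0;
}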
> 
> > > obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> > > won't yield if there's mmu_lock contention.
> > Yes, KVM won't yield and reschedule on mmu_lock contention there, so it's fine.
> > 
> > > However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> > > I think we can achieve what you want without losing batching, and without changing
> > > secondary MMUs.
> > > 
> > > Rather than muck with the notification types and add a one-off for NUMA, just
> > > defer the notification until a present PMD/PTE is actually going to be modified.
> > > It's not the prettiest, but other than the locking, I don't think has any chance
> > > of regressing other workloads/configurations.
> > > 
> > > Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> > > 
> > > Compile tested only.
> > 
> > I can't find a matching mmu_notifier_invalidate_range_end() for each
> > mmu_notifier_invalidate_range_start_nonblock().
> 
> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> 
> 	if (range.start)
> 		mmu_notifier_invalidate_range_end(&range);
No, it doesn't work for a mmu_notifier_invalidate_range_start() sent in change_pte_range()
if we only want the range to include pages that were successfully set to PROT_NONE: the
single mmu_notifier_invalidate_range_end() in change_pmd_range() covers the whole range
recorded at start time, not just the pages that were actually protected.

> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Mon, 14 Aug 2023 08:59:12 -0700
> Subject: [PATCH] KVM: x86/mmu: Retry fault before acquiring mmu_lock if
>  mapping is changing
> 
> Retry page faults without acquiring mmu_lock if the resolved hva is covered
> by an active invalidation.  Contending for mmu_lock is especially
> problematic on preemptible kernels as the invalidation task will yield the
> lock (see rwlock_needbreak()), delay the in-progress invalidation, and
> ultimately increase the latency of resolving the page fault.  And in the
> worst case scenario, yielding will be accompanied by a remote TLB flush,
> e.g. if the invalidation covers a large range of memory and vCPUs are
> accessing addresses that were already zapped.
> 
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
> 
> Reported-by: Yan Zhao <yan.y.zhao@intel.com>
> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 3 +++
>  include/linux/kvm_host.h | 8 +++++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9e4cd8b4a202..f29718a16211 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	if (unlikely(!fault->slot))
>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>  
> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> +		return RET_PF_RETRY;
> +
This can effectively reduce the remote TLB-flush IPIs a lot!
One nit: maybe an rmb() or READ_ONCE() is required when reading
kvm->mmu_invalidate_range_start and kvm->mmu_invalidate_range_end without holding mmu_lock.
Otherwise, I'm somewhat worried about constant false positives and retries.
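Something like the below is what I have in mind (just an untested sketch of the
nit, on top of your patch):

	/*
	 * Untested sketch: use READ_ONCE() so a lockless caller re-reads the
	 * in-progress range on every check instead of reusing a possibly
	 * stale value cached by the compiler.
	 */
	if (READ_ONCE(kvm->mmu_invalidate_in_progress) &&
	    hva >= READ_ONCE(kvm->mmu_invalidate_range_start) &&
	    hva < READ_ONCE(kvm->mmu_invalidate_range_end))
		return 1;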


>  	return RET_PF_CONTINUE;
>  }
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index cb86108c624d..f41d4fe61a06 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1960,7 +1960,6 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  					   unsigned long mmu_seq,
>  					   unsigned long hva)
>  {
> -	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
>  	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
>  	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
> @@ -1971,6 +1970,13 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  	    hva >= kvm->mmu_invalidate_range_start &&
>  	    hva < kvm->mmu_invalidate_range_end)
>  		return 1;
> +
> +	/*
> +	 * Note the lack of a memory barrier!  The caller *must* hold mmu_lock
> +	 * to avoid false negatives and/or false positives (less concerning).
> +	 * Holding mmu_lock is not mandatory though, e.g. to allow pre-checking
> +	 * for an in-progress invalidation to avoid contending mmu_lock.
> +	 */
>  	if (kvm->mmu_invalidate_seq != mmu_seq)
>  		return 1;
>  	return 0;
> 
> base-commit: 5bc7f472423f95a3f5c73b0b56c47e57d8833efc
> -- 
> 

Thread overview: 50+ messages
2023-08-10  8:56 [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao
2023-08-10  8:57 ` [RFC PATCH v2 1/5] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao
2023-08-10  8:58 ` [RFC PATCH v2 2/5] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose Yan Zhao
2023-08-10  9:00 ` [RFC PATCH v2 3/5] mm/mmu_notifier: introduce a new callback .numa_protect Yan Zhao
2023-08-10  9:00 ` [RFC PATCH v2 4/5] mm/autonuma: call .numa_protect() when page is protected for NUMA migrate Yan Zhao
2023-08-10 13:45   ` kernel test robot
2023-08-10 13:55   ` kernel test robot
2023-08-11 18:52   ` Nadav Amit
2023-08-14  7:52     ` Yan Zhao
2023-08-10  9:02 ` [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration Yan Zhao
2023-08-10 13:16   ` bibo mao
2023-08-11  3:45     ` Yan Zhao
2023-08-11  7:40       ` bibo mao
2023-08-11  8:01         ` Yan Zhao
2023-08-11 17:14           ` Sean Christopherson
2023-08-11 17:18             ` Jason Gunthorpe
2023-08-14  6:52             ` Yan Zhao
2023-08-14  7:44               ` Yan Zhao
2023-08-14 16:40               ` Sean Christopherson
2023-08-15  1:54                 ` Yan Zhao [this message]
2023-08-15 14:50                   ` Sean Christopherson
2023-08-16  2:43                     ` bibo mao
2023-08-16  3:44                       ` bibo mao
2023-08-16  5:14                         ` Yan Zhao
2023-08-16  7:29                           ` bibo mao
2023-08-16  7:18                             ` Yan Zhao
2023-08-16  7:53                               ` bibo mao
2023-08-16 13:39                                 ` Sean Christopherson
2023-08-10 15:19   ` kernel test robot
2023-08-10  9:34 ` [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM David Hildenbrand
2023-08-10  9:50   ` Yan Zhao
2023-08-11 17:25     ` David Hildenbrand
2023-08-11 18:20       ` John Hubbard
2023-08-11 18:39         ` David Hildenbrand
2023-08-11 19:35           ` John Hubbard
2023-08-14  9:09             ` Yan Zhao
2023-08-15  2:34               ` John Hubbard
2023-08-16  7:43                 ` David Hildenbrand
2023-08-16  9:06                   ` Yan Zhao
2023-08-16  9:49                     ` David Hildenbrand
2023-08-16 18:00                       ` John Hubbard
2023-08-17  5:05                         ` Yan Zhao
2023-08-17  7:38                           ` David Hildenbrand
2023-08-18  0:13                             ` Yan Zhao
2023-08-18  2:29                               ` John Hubbard
2023-09-04  9:18                                 ` Yan Zhao
2023-08-15  2:36               ` Yuan Yao
2023-08-15  2:37                 ` Yan Zhao
2023-08-10 13:58 ` Chao Gao
2023-08-11  5:22   ` Yan Zhao
