From: Yan Zhao <yan.y.zhao@intel.com>
To: Sean Christopherson <seanjc@google.com>
Cc: bibo mao <maobibo@loongson.cn>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<pbonzini@redhat.com>, <mike.kravetz@oracle.com>,
	<apopple@nvidia.com>, <jgg@nvidia.com>, <rppt@kernel.org>,
	<akpm@linux-foundation.org>, <kevin.tian@intel.com>,
	<david@redhat.com>
Subject: Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration
Date: Tue, 15 Aug 2023 09:54:37 +0800	[thread overview]
Message-ID: <ZNra3eDNTaKVc7MT@yzhao56-desk.sh.intel.com> (raw)
In-Reply-To: <ZNpZDH9//vk8Rqvo@google.com>

On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> > > I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA.  It's too much of a one-off,
> > > and losing the batching of invalidations makes me nervous.  As Bibo points out,
> > > the behavior will vary based on the workload, VM configuration, etc.
> > > 
> > > There's also a *very* subtle change, in that the notification will be sent while
> > > holding the PMD/PTE lock.  Taking KVM's mmu_lock under that is *probably* ok, but
> > > I'm not exactly 100% confident on that.  And the only reason there isn't a more
> > KVM's mmu_lock is an rwlock, which is a variant of spinlock,
> > so I think taking it within the PMD/PTE lock is ok.
> > Actually we already do that for the .change_pte() notification:
> > kvm_mmu_notifier_change_pte() takes KVM's mmu_lock for write while the
> > PTE lock is held across set_pte_at_notify() --> .change_pte().
> 
> .change_pte() gets away with running under PMD/PTE lock because (a) it's not allowed
> to fail and (b) KVM is the only secondary MMU that hooks .change_pte() and KVM
> doesn't use a sleepable lock.
The .numa_protect() notification in patch 4 is also sent in a context where it's
not allowed to fail, and KVM's handler takes no sleepable lock :)


> As Jason pointed out, mmu_notifier_invalidate_range_start_nonblock() can fail
> because some secondary MMUs use mutexes or r/w semaphores to handle mmu_notifier
> events.  For NUMA balancing, canceling the protection change might be ok, but for
> everything else, failure is not an option.  So unfortunately, my idea won't work
> as-is.
> 
> We might get away with deferring just the change_prot_numa() case, but I don't
> think that's worth the mess/complexity.
Yes, I also think deferring mmu_notifier_invalidate_range_start() won't work.
One possible approach is to batch the successful .numa_protect() notifications
(a rough sketch below), though I admit that would introduce complexity.
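Something along these lines is what I mean -- purely a hypothetical fragment of
change_pte_range(), not code from this series; the batch bookkeeping, the
change_this_pte_to_protnone() placeholder and the mmu_notifier_numa_protect()
wrapper are all assumptions:

	/*
	 * Hypothetical batching sketch: remember the span of PTEs that were
	 * actually switched to PROT_NONE and send a single .numa_protect()
	 * notification for it, instead of notifying page by page.
	 */
	unsigned long batch_start = 0, batch_end = 0;

	for (; addr != end; pte++, addr += PAGE_SIZE) {
		bool protected;

		/* existing checks and the PROT_NONE switch elided */
		protected = change_this_pte_to_protnone(vma, addr, pte);
		if (!protected)
			continue;
		if (!batch_start)
			batch_start = addr;
		batch_end = addr + PAGE_SIZE;
	}

	/* one notification for the whole batch */
	if (batch_start)
		mmu_notifier_numa_protect(vma->vm_mm, batch_start, batch_end);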

> 
> I would much rather tell userspace to disable migrate-on-fault for KVM guest
> mappings (mbind() should work?) for these types of setups, or to disable NUMA
> balancing entirely.  If the user really cares about performance of their VM(s),
> vCPUs should be affined to a single NUMA node (or core, or pinned 1:1), and if
> the VM spans multiple nodes, a virtual NUMA topology should be given to the guest.
> At that point, NUMA balancing is likely to do more harm than good.
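FWIW, from userspace that could look roughly like below (a minimal sketch, not
part of this series; it assumes libnuma's <numaif.h> wrapper for the mbind()
syscall, and NUMA balancing can also be disabled system-wide via
/proc/sys/kernel/numa_balancing):

/*
 * Minimal userspace sketch: bind a (page-aligned) guest memory region to a
 * single NUMA node so NUMA balancing should have no reason to migrate it,
 * as suggested above.  Error handling is deliberately minimal.
 */
#include <numaif.h>
#include <stdio.h>

static int bind_guest_memory(void *hva, unsigned long len, int node)
{
	unsigned long nodemask = 1UL << node;

	if (mbind(hva, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return -1;
	}
	return 0;
}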
> 
> > > obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> > > won't yield if there's mmu_lock contention.
> > Yes, KVM won't yield and reschedule on mmu_lock contention there, so it's fine.
> > 
> > > However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> > > I think we can achieve what you want without losing batching, and without changing
> > > secondary MMUs.
> > > 
> > > Rather than muck with the notification types and add a one-off for NUMA, just
> > > defer the notification until a present PMD/PTE is actually going to be modified.
> > > It's not the prettiest, but other than the locking, I don't think has any chance
> > > of regressing other workloads/configurations.
> > > 
> > > Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> > > 
> > > Compile tested only.
> > 
> > I can't find a matching mmu_notifier_invalidate_range_end() for each
> > mmu_notifier_invalidate_range_start_nonblock().
> 
> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> 
> 	if (range.start)
> 		mmu_notifier_invalidate_range_end(&range);
No, it doesn't work for a mmu_notifier_invalidate_range_start() sent in change_pte_range()
if we only want the range to include pages that were successfully set to PROT_NONE: the
single mmu_notifier_invalidate_range_end() in change_pmd_range() covers the whole range
recorded at start time, not just the pages that were actually protected.

> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Mon, 14 Aug 2023 08:59:12 -0700
> Subject: [PATCH] KVM: x86/mmu: Retry fault before acquiring mmu_lock if
>  mapping is changing
> 
> Retry page faults without acquiring mmu_lock if the resolved hva is covered
> by an active invalidation.  Contending for mmu_lock is especially
> problematic on preemptible kernels as the invalidation task will yield the
> lock (see rwlock_needbreak()), delay the in-progress invalidation, and
> ultimately increase the latency of resolving the page fault.  And in the
> worst case scenario, yielding will be accompanied by a remote TLB flush,
> e.g. if the invalidation covers a large range of memory and vCPUs are
> accessing addresses that were already zapped.
> 
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
> 
> Reported-by: Yan Zhao <yan.y.zhao@intel.com>
> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 3 +++
>  include/linux/kvm_host.h | 8 +++++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9e4cd8b4a202..f29718a16211 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	if (unlikely(!fault->slot))
>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>  
> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> +		return RET_PF_RETRY;
> +
This can effectively reduce the remote TLB-flush IPIs a lot!
One nit: maybe an rmb() or READ_ONCE() is required when reading
kvm->mmu_invalidate_range_start and kvm->mmu_invalidate_range_end without holding mmu_lock.
Otherwise, I'm somewhat worried about constant false positives and retries.
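Something like the below is what I have in mind (just an untested sketch of the
nit, on top of your patch):

	/*
	 * Untested sketch: use READ_ONCE() so a lockless caller re-reads the
	 * in-progress range on every check instead of reusing a possibly
	 * stale value cached by the compiler.
	 */
	if (READ_ONCE(kvm->mmu_invalidate_in_progress) &&
	    hva >= READ_ONCE(kvm->mmu_invalidate_range_start) &&
	    hva < READ_ONCE(kvm->mmu_invalidate_range_end))
		return 1;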


>  	return RET_PF_CONTINUE;
>  }
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index cb86108c624d..f41d4fe61a06 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1960,7 +1960,6 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  					   unsigned long mmu_seq,
>  					   unsigned long hva)
>  {
> -	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
>  	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
>  	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
> @@ -1971,6 +1970,13 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  	    hva >= kvm->mmu_invalidate_range_start &&
>  	    hva < kvm->mmu_invalidate_range_end)
>  		return 1;
> +
> +	/*
> +	 * Note the lack of a memory barrier!  The caller *must* hold mmu_lock
> +	 * to avoid false negatives and/or false positives (less concerning).
> +	 * Holding mmu_lock is not mandatory though, e.g. to allow pre-checking
> +	 * for an in-progress invalidation to avoid contending mmu_lock.
> +	 */
>  	if (kvm->mmu_invalidate_seq != mmu_seq)
>  		return 1;
>  	return 0;
> 
> base-commit: 5bc7f472423f95a3f5c73b0b56c47e57d8833efc
> -- 
> 

Thread overview: 50+ messages
2023-08-10  8:56 [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM Yan Zhao
2023-08-10  8:57 ` [RFC PATCH v2 1/5] mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA Yan Zhao
2023-08-10  8:58 ` [RFC PATCH v2 2/5] mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose Yan Zhao
2023-08-10  9:00 ` [RFC PATCH v2 3/5] mm/mmu_notifier: introduce a new callback .numa_protect Yan Zhao
2023-08-10  9:00 ` [RFC PATCH v2 4/5] mm/autonuma: call .numa_protect() when page is protected for NUMA migrate Yan Zhao
2023-08-10 13:45   ` kernel test robot
2023-08-10 13:55   ` kernel test robot
2023-08-11 18:52   ` Nadav Amit
2023-08-14  7:52     ` Yan Zhao
2023-08-10  9:02 ` [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration Yan Zhao
2023-08-10 13:16   ` bibo mao
2023-08-11  3:45     ` Yan Zhao
2023-08-11  7:40       ` bibo mao
2023-08-11  8:01         ` Yan Zhao
2023-08-11 17:14           ` Sean Christopherson
2023-08-11 17:18             ` Jason Gunthorpe
2023-08-14  6:52             ` Yan Zhao
2023-08-14  7:44               ` Yan Zhao
2023-08-14 16:40               ` Sean Christopherson
2023-08-15  1:54                 ` Yan Zhao [this message]
2023-08-15 14:50                   ` Sean Christopherson
2023-08-16  2:43                     ` bibo mao
2023-08-16  3:44                       ` bibo mao
2023-08-16  5:14                         ` Yan Zhao
2023-08-16  7:29                           ` bibo mao
2023-08-16  7:18                             ` Yan Zhao
2023-08-16  7:53                               ` bibo mao
2023-08-16 13:39                                 ` Sean Christopherson
2023-08-10 15:19   ` kernel test robot
2023-08-10  9:34 ` [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM David Hildenbrand
2023-08-10  9:50   ` Yan Zhao
2023-08-11 17:25     ` David Hildenbrand
2023-08-11 18:20       ` John Hubbard
2023-08-11 18:39         ` David Hildenbrand
2023-08-11 19:35           ` John Hubbard
2023-08-14  9:09             ` Yan Zhao
2023-08-15  2:34               ` John Hubbard
2023-08-16  7:43                 ` David Hildenbrand
2023-08-16  9:06                   ` Yan Zhao
2023-08-16  9:49                     ` David Hildenbrand
2023-08-16 18:00                       ` John Hubbard
2023-08-17  5:05                         ` Yan Zhao
2023-08-17  7:38                           ` David Hildenbrand
2023-08-18  0:13                             ` Yan Zhao
2023-08-18  2:29                               ` John Hubbard
2023-09-04  9:18                                 ` Yan Zhao
2023-08-15  2:36               ` Yuan Yao
2023-08-15  2:37                 ` Yan Zhao
2023-08-10 13:58 ` Chao Gao
2023-08-11  5:22   ` Yan Zhao
