From: Eric Wheeler <kvm@lists.ewheeler.net>
To: Sean Christopherson <seanjc@google.com>
Cc: Amaan Cheval <amaan.cheval@gmail.com>,
	brak@gameservers.com, kvm@vger.kernel.org
Subject: Re: Deadlock due to EPT_VIOLATION
Date: Wed, 9 Aug 2023 18:27:56 -0700 (PDT)
Message-ID: <e1d2a8c-ff48-bc69-693-9fe75138632b@ewheeler.net>
In-Reply-To: <e21d306a-bed6-36e1-be99-7cdab6b36d11@ewheeler.net>

On Wed, 9 Aug 2023, Eric Wheeler wrote:
> On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > On Tue, Aug 08, 2023, Amaan Cheval wrote:
> > > Hey Sean,
> > > 
> > > > If NUMA balancing is going nuclear and constantly zapping PTEs, the resulting
> > > > mmu_notifier events could theoretically stall a vCPU indefinitely.  The reason I
> > > > dislike NUMA balancing is that it's all too easy to end up with subtle bugs
> > > > and/or misconfigured setups where the NUMA balancing logic zaps PTEs/SPTEs without
> > > > actually being able to move the page in the end, i.e. it's (IMO) too easy for
> > > > NUMA balancing to get false positives when determining whether or not to try and
> > > > migrate a page.
> > > 
> > > What are some situations where it might not be able to move the page in the end?
> > 
> > There's a pretty big list, see the "failure" paths of do_numa_page() and
> > migrate_misplaced_page().
> > 
> > > > That said, it's definitely very unexpected that NUMA balancing would be zapping
> > > > SPTEs to the point where a vCPU can't make forward progress.   It's theoretically
> > > > possible that that's what's happening, but quite unlikely, especially since it
> > > > sounds like you're seeing issues even with NUMA balancing disabled.

Brak indicated that they've seen this as early as v5.19.  IIRC, Hunter
said that v5.15 is working fine, so I went through the KVM commits
between v5.15 and v5.19 that appear to be related to EPT.  Of course,
if the problem is outside of KVM then this is moot, but maybe these
are worth a second look.
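
(For reference, a survey like this can be generated with something along
the lines of "git log --oneline v5.15..v5.19 -- arch/x86/kvm", then
filtered for EPT/MMU-related subjects; the exact paths and filters are
approximate, not necessarily the ones used here.)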

Sean, could any of these commits cause or hint at the problem?


  54275f74c KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
	- this mentions !PRESENT faults being handled outside of mmu_lock.

  ec283cb1d KVM: x86/mmu: remove ept_ad field
	- looks like a simple patch, but could there be a reason that
	  this is somehow invalid in corner cases?  (See the first sketch
	  after this list.)  Here is the relevant diff snippet:

	+++ b/arch/x86/kvm/mmu/mmu.c
	@@ -5007,7 +5007,6 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
	 
			context->shadow_root_level = level;
	 
	-               context->ept_ad = accessed_dirty;

	+++ b/arch/x86/kvm/mmu/paging_tmpl.h
	-       #define PT_HAVE_ACCESSED_DIRTY(mmu) ((mmu)->ept_ad)
	+       #define PT_HAVE_ACCESSED_DIRTY(mmu) (!(mmu)->cpu_role.base.ad_disabled)

  ca2a7c22a KVM: x86/mmu: Derive EPT violation RWX bits from EPTE RWX bits
	- "No functional change intended" but it mentions EPT
	  violations.  Could something unintentional have happened here?

  4f4aa80e3 KVM: X86: Handle implicit supervisor access with SMAP
	- This is a small change, but maybe it would be worth a quick review.
	
  5b22bbe71 KVM: X86: Change the type of access u32 to u64
	- This is just a datatype change in 5.17-rc3, probably not it.
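
Two quick sketches of my reading of the two commits flagged above
(ec283cb1d and ca2a7c22a).  These are simplified illustrations, not the
actual kernel code, so treat the helper names and exact expressions as
assumptions.

ec283cb1d: the removed ept_ad field and the new ad_disabled-based macro
should be equivalent, since both derive from the same accessed_dirty
argument passed to kvm_init_shadow_ept_mmu():

	/*
	 * Sketch of the equivalence argument (not kernel code), assuming
	 * the EPT cpu_role sets ad_disabled = !accessed_dirty while the
	 * removed field was ept_ad = accessed_dirty:
	 *
	 *   old: PT_HAVE_ACCESSED_DIRTY(mmu) == mmu->ept_ad
	 *                                    == accessed_dirty
	 *   new: PT_HAVE_ACCESSED_DIRTY(mmu) == !mmu->cpu_role.base.ad_disabled
	 *                                    == !(!accessed_dirty)
	 *                                    == accessed_dirty
	 *
	 * A bug here would need ad_disabled to be stale or computed
	 * differently for the EPT case, not a plain logic error.
	 */

ca2a7c22a: my understanding is that the EPT violation exit qualification
reports, in bits 3:5, whether the EPTE allowed read/write/execute, and
the synthesized "present" error-code bit is derived from those bits.
Roughly (a sketch with assumed names and values):

	/* bits 3:5 of the EPT violation exit qualification: EPTE was R/W/X */
	#define EPT_VIOLATION_RWX_SHIFT	3
	#define EPT_VIOLATION_RWX_MASK	(0x7ull << EPT_VIOLATION_RWX_SHIFT)
	#define PFERR_PRESENT_MASK	(1ull << 0)	/* as in kvm_host.h */

	static unsigned long ept_rwx_to_pferr(unsigned long exit_qualification)
	{
		/* any of R/W/X set => the translation was "present" */
		return (exit_qualification & EPT_VIOLATION_RWX_MASK) ?
			PFERR_PRESENT_MASK : 0;
	}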

-Eric

> > > 
> > > Yep, we're definitely seeing the issue occur even with numa_balancing
> > > disabled, but the likelihood of it occurring has significantly dropped since
> > > we've disabled numa_balancing.
> > 
> > So the fact that this occurs with NUMA balancing disabled means the problem likely
> > isn't with NUMA balancing itself.  NUMA balancing probably exposes some underlying
> > issue due to it generating a near-constant stream of mmu_notifier invalidation.
> > 
> > > > More likely is that there is a bug somewhere that results in the mmu_notifier
> > > > event refcount staying incorrectly elevated, but that type of bug shouldn't follow
> > > > the VM across a live migration...
> > > 
> > > Good news! We managed to live migrate a guest and that did "fix it".
> 
> Does the VM make progress even if it is migrated to a kernel that 
> exhibits the bug?
> 
> What were the kernel versions being migrated from and to?
> 
> For example, was it from a >5.19 kernel to something earlier than 5.19?
> 
> And if the hung VM remains stuck after migrating to a >5.19 kernel but 
> _not_ to a <5.19 kernel, then maybe bisect is an option.
> 
> 
> --
> Eric Wheeler
> 
> 
> 
> 
> > ...
> > 
> > > A colleague also modified a host kernel with KFI (Kernel Function
> > > Instrumentation) and wrote a kernel module to intercept the vmexit handler,
> > > handle_ept_violation, and does an EPT walk for each pfn, compared against
> > > /proc/iomem.
> > > 
> > > Assuming the EPT walking code is correct, we see this surprising result of a
> > > PDPTE's pfn=0:
> > 
> > Not surprising.  The entire EPTE is zero, i.e. has been zapped by KVM.  This is
> > exactly what is expected.
> > 
> > > Does this seem to indicate an mmu_notifier refcount issue to you, given that
> > > migration did fix it? Any way to verify?
> > 
> > It doesn't move the needle either way, it just confirms what we already know: the
> > vCPU is repeatedly taking !PRESENT faults.  The unexpected part is that KVM never
> > "fixes" the fault and never outright fails.
> > 
> > > We haven't found any guests with `softlockup_panic=1` yet, and since we can't
> > > reproduce the issue on command ourselves yet, we might have to wait a bit - but
> > > I imagine that the fact that live migration "fixed" the locked up guest confirms
> > > that the other guests that didn't get "fixed" were likely softlocked from the
> > > CPU stalling?
> > 
> > Yes.
> > 
> > > If you have any suggestions on how modifying the host kernel (and then migrating
> > > a locked up guest to it) or eBPF programs that might help illuminate the issue
> > > further, let me know!
> > > 
> > > Thanks for all your help so far!
> > 
> > Since it sounds like you can test with a custom kernel, try running with this
> > patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.  The
> > below expands said tracepoint to capture information about mmu_notifiers and
> > memslots generation.  With luck, it will reveal a smoking gun.
> > 
> > ---
> >  arch/x86/kvm/mmu/mmu.c          | 10 ----------
> >  arch/x86/kvm/mmu/mmu_internal.h |  2 ++
> >  arch/x86/kvm/mmu/tdp_mmu.h      | 10 ++++++++++
> >  arch/x86/kvm/trace.h            | 28 ++++++++++++++++++++++++++--
> >  4 files changed, 38 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 9e4cd8b4a202..122bfc884293 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2006,16 +2006,6 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
> >  	return true;
> >  }
> >  
> > -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > -{
> > -	if (sp->role.invalid)
> > -		return true;
> > -
> > -	/* TDP MMU pages do not use the MMU generation. */
> > -	return !is_tdp_mmu_page(sp) &&
> > -	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> > -}
> > -
> >  struct mmu_page_path {
> >  	struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
> >  	unsigned int idx[PT64_ROOT_MAX_LEVEL];
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index f1ef670058e5..cf7ba0abaa8f 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -6,6 +6,8 @@
> >  #include <linux/kvm_host.h>
> >  #include <asm/kvm_host.h>
> >  
> > +#include "mmu.h"
> > +
> >  #ifdef CONFIG_KVM_PROVE_MMU
> >  #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
> >  #else
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index 0a63b1afabd3..a0d7c8acf78f 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -76,4 +76,14 @@ static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu
> >  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
> >  #endif
> >  
> > +static inline bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > +{
> > +	if (sp->role.invalid)
> > +		return true;
> > +
> > +	/* TDP MMU pages do not use the MMU generation. */
> > +	return !is_tdp_mmu_page(sp) &&
> > +	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> > +}
> > +
> >  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index 83843379813e..ff4a384ab03a 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -8,6 +8,8 @@
> >  #include <asm/clocksource.h>
> >  #include <asm/pvclock-abi.h>
> >  
> > +#include "mmu/tdp_mmu.h"
> > +
> >  #undef TRACE_SYSTEM
> >  #define TRACE_SYSTEM kvm
> >  
> > @@ -405,6 +407,13 @@ TRACE_EVENT(kvm_page_fault,
> >  		__field(	unsigned long,	guest_rip	)
> >  		__field(	u64,		fault_address	)
> >  		__field(	u64,		error_code	)
> > +		__field(	unsigned long,  mmu_invalidate_seq)
> > +		__field(	long,  mmu_invalidate_in_progress)
> > +		__field(	unsigned long,  mmu_invalidate_range_start)
> > +		__field(	unsigned long,  mmu_invalidate_range_end)
> > +		__field(	bool,		root_is_valid)
> > +		__field(	bool,		root_has_sp)
> > +		__field(	bool,		root_is_obsolete)
> >  	),
> >  
> >  	TP_fast_assign(
> > @@ -412,11 +421,26 @@ TRACE_EVENT(kvm_page_fault,
> >  		__entry->guest_rip	= kvm_rip_read(vcpu);
> >  		__entry->fault_address	= fault_address;
> >  		__entry->error_code	= error_code;
> > +		__entry->mmu_invalidate_seq		= vcpu->kvm->mmu_invalidate_seq;
> > +		__entry->mmu_invalidate_in_progress	= vcpu->kvm->mmu_invalidate_in_progress;
> > +		__entry->mmu_invalidate_range_start	= vcpu->kvm->mmu_invalidate_range_start;
> > +		__entry->mmu_invalidate_range_end	= vcpu->kvm->mmu_invalidate_range_end;
> > +		__entry->root_is_valid			= VALID_PAGE(vcpu->arch.mmu->root.hpa);
> > +		__entry->root_has_sp			= VALID_PAGE(vcpu->arch.mmu->root.hpa) &&
> > +							  to_shadow_page(vcpu->arch.mmu->root.hpa);
> > +		__entry->root_is_obsolete		= VALID_PAGE(vcpu->arch.mmu->root.hpa) &&
> > +							  to_shadow_page(vcpu->arch.mmu->root.hpa) &&
> > +							  is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root.hpa));
> >  	),
> >  
> > -	TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx",
> > +	TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx, seq = 0x%lx, in_prog = 0x%lx, start = 0x%lx, end = 0x%lx, root = %s",
> >  		  __entry->vcpu_id, __entry->guest_rip,
> > -		  __entry->fault_address, __entry->error_code)
> > +		  __entry->fault_address, __entry->error_code,
> > +		  __entry->mmu_invalidate_seq, __entry->mmu_invalidate_in_progress,
> > +		  __entry->mmu_invalidate_range_start, __entry->mmu_invalidate_range_end,
> > +		  !__entry->root_is_valid ? "invalid" :
> > +		  !__entry->root_has_sp ? "no shadow page" :
> > +		  __entry->root_is_obsolete ? "obsolete" : "fresh")
> >  );
> >  
> >  /*
> > 
> > base-commit: 240f736891887939571854bd6d734b6c9291f22e
> > -- 
> > 
> > 
> 

Thread overview: 48+ messages
2023-05-23 14:02 Deadlock due to EPT_VIOLATION Brian Rak
2023-05-23 16:22 ` Sean Christopherson
2023-05-24 13:39   ` Brian Rak
2023-05-26 16:59     ` Brian Rak
2023-05-26 21:02       ` Sean Christopherson
2023-05-30 17:35         ` Brian Rak
2023-05-30 18:36           ` Sean Christopherson
2023-05-31 17:40             ` Brian Rak
2023-07-21 14:34             ` Amaan Cheval
2023-07-21 17:37               ` Sean Christopherson
2023-07-24 12:08                 ` Amaan Cheval
2023-07-25 17:30                   ` Sean Christopherson
2023-08-02 14:21                     ` Amaan Cheval
2023-08-02 15:34                       ` Sean Christopherson
2023-08-02 16:45                         ` Amaan Cheval
2023-08-02 17:52                           ` Sean Christopherson
2023-08-08 15:34                             ` Amaan Cheval
2023-08-08 17:07                               ` Sean Christopherson
2023-08-10  0:48                                 ` Eric Wheeler
2023-08-10  1:27                                   ` Eric Wheeler [this message]
2023-08-10 23:58                                     ` Sean Christopherson
2023-08-11 12:37                                       ` Amaan Cheval
2023-08-11 18:02                                         ` Sean Christopherson
2023-08-12  0:50                                           ` Eric Wheeler
2023-08-14 17:29                                             ` Sean Christopherson
2023-08-15  0:30                                 ` Eric Wheeler
2023-08-15 16:10                                   ` Sean Christopherson
2023-08-16 23:54                                     ` Eric Wheeler
2023-08-17 18:21                                       ` Sean Christopherson
2023-08-18  0:55                                         ` Eric Wheeler
2023-08-18 14:33                                           ` Sean Christopherson
2023-08-18 23:06                                             ` Eric Wheeler
2023-08-21 20:27                                               ` Eric Wheeler
2023-08-21 23:51                                                 ` Sean Christopherson
2023-08-22  0:11                                                   ` Sean Christopherson
2023-08-22  1:10                                                   ` Eric Wheeler
2023-08-22 15:11                                                     ` Sean Christopherson
2023-08-22 21:23                                                       ` Eric Wheeler
2023-08-22 21:32                                                         ` Sean Christopherson
2023-08-23  0:39                                                       ` Eric Wheeler
2023-08-23 17:54                                                         ` Sean Christopherson
2023-08-23 19:44                                                           ` Eric Wheeler
2023-08-23 22:12                                                           ` Eric Wheeler
2023-08-23 22:32                                                             ` Eric Wheeler
2023-08-23 23:21                                                               ` Sean Christopherson
2023-08-24  0:30                                                                 ` Eric Wheeler
2023-08-24  0:52                                                                   ` Sean Christopherson
2023-08-24 23:51                                                                     ` Eric Wheeler
