From: Eric Wheeler <kvm@lists.ewheeler.net>
To: Sean Christopherson <seanjc@google.com>
Cc: Amaan Cheval <amaan.cheval@gmail.com>,
	brak@gameservers.com, kvm@vger.kernel.org
Subject: Re: Deadlock due to EPT_VIOLATION
Date: Fri, 11 Aug 2023 17:50:08 -0700 (PDT)
Message-ID: <68e7d342-bdeb-39bf-5233-ba1121f0afc@ewheeler.net>
In-Reply-To: <ZNZ3owRcRjGejWFn@google.com>

On Fri, 11 Aug 2023, Sean Christopherson wrote:
> On Fri, Aug 11, 2023, Amaan Cheval wrote:
> > > Since it sounds like you can test with a custom kernel, try running with this
> > > patch and then enable the kvm_page_fault tracepoint when a vCPU gets
> > > stuck.  The below expands said tracepoint to capture information about
> > > mmu_notifiers and memslots generation.  With luck, it will reveal a smoking
> > > gun.
> > 
> > Thanks for the patch there. We tried migrating a locked up guest to a host with
> > this modified kernel twice (logs below). The guest "fixed itself" post
> > migration, so the results may not have captured the "problematic" kind of
> > page-fault, but here they are.
> 
> The traces need to be captured from the host where a vCPU is stuck.
> 
> > Complete logs of kvm_page_fault tracepoint events, starting just before the
> > migration (with 0 guests before the migration, so the first logs should be of
> > the problematic guest) as it resolves the lockup:
> > 
> > 1. https://transfer.sh/QjB3MjeBqh/trace-kvm-kpf2.log
> > 2. https://transfer.sh/wEFQm4hLHs/trace-kvm-pf.log
> > 
> > Truncated logs of `trace-cmd record -e kvm -e kvmmmu` in case context helps:
> > 
> > 1. https://transfer.sh/FoFsNoFQCP/trace-kvm2.log
> > 2. https://transfer.sh/LBFJryOfu7/trace-kvm.log
> > 
> > Note that for migration #2 in both of the above (trace-kvm-pf.log and
> > trace-kvm.log respectively), we mistakenly didn't confirm that the guest was
> > locked up before the migration. It most likely was, but that is why trace #2
> > might not present the same symptoms.
> > 
> > Off an uneducated glance, it seems like `in_prog = 0x1` at least once for every
> > `seq` / kvm_page_fault that seems to be "looping" and staying unresolved -
> 
> This is completely expected.   The "in_prog" thing is just saying that a vCPU
> took a fault while there was an mmu_notifier event in-progress.
> 
> > indicating a lock contention, perhaps, in trying to invalidate/read/write the
> > same page range?
> 
> No, just a collision between the primary MMU invalidating something, e.g. to move
> a page or do KSM stuff, and a vCPU accessing the page in question.
> 
> > We do know this issue _occurs_ as late as 6.1.38 at least (i.e. hosts running
> > 6.1.38 have had guests lockup - we don't have hosts on more recent kernels, so
> > this isn't proof that it's been fixed since then, nor is migration proof of
> > that, IMO).
> 
> Note, if my hunch is correct, it's the act of migrating to a different *host* that
> resolves the problem, not the fact that the migration is to a different kernel.
> E.g. I would expect that migrating to the exact same kernel would still unstick
> the vCPU.
> 
> What I suspect is happening is that the in-progress count gets left high, e.g.
> because of a start() without a paired end(), and that causes KVM to refuse to
> install mappings for the affected range of guest memory.  Or possibly that the
> problematic host is generating an absolutely massive storm of invalidations and
> unintentionally DoS's the guest.


It would be great to write a microbenchmark of sorts that generates EPT 
page invalidation pressure, and run it on a test system inside a virtual 
machine to see if we can get it to fault.

Can you suggest the type(s) of memory operations that could be written in 
user space (or kernel space as a module) to find a test case that forces 
it to fail within a reasonable period of time?

We were thinking of creating lots of page-sized memory mappings from 
/dev/zero, randomly writing to and freeing them once tons of them are 
allocated, and doing this across multiple threads, while simultaneously 
using `taskset` (or `virsh vcpupin`) on the host to move the guest vCPUs 
across NUMA boundaries, and also with NUMA balancing turned on.  A rough 
sketch of the in-guest part is below.
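
Something along these lines is what I had in mind: a minimal sketch to be 
run inside the guest, with arbitrary (untuned) thread and mapping counts 
and placeholder names, while the host side shuffles the vCPUs with 
`taskset`/`virsh vcpupin` and toggles /proc/sys/kernel/numa_balancing:

	/* Sketch of the in-guest churn described above: many threads each
	 * map page-sized regions from /dev/zero, dirty them, and tear them
	 * down again to generate mmu_notifier/EPT invalidation pressure on
	 * the host.  NTHREADS/NMAPS are arbitrary placeholders, not tuned
	 * values. */
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define NTHREADS  16
	#define NMAPS     4096          /* live mappings per thread */

	static void *churn(void *arg)
	{
		int fd = open("/dev/zero", O_RDWR);
		long pagesz = sysconf(_SC_PAGESIZE);
		char **maps = calloc(NMAPS, sizeof(*maps));

		(void)arg;
		if (fd < 0 || !maps)
			return NULL;

		for (;;) {
			int i = rand() % NMAPS;

			if (maps[i]) {          /* free a random mapping */
				munmap(maps[i], pagesz);
				maps[i] = NULL;
			} else {                /* or create and dirty one */
				maps[i] = mmap(NULL, pagesz,
					       PROT_READ | PROT_WRITE,
					       MAP_PRIVATE, fd, 0);
				if (maps[i] == MAP_FAILED) {
					maps[i] = NULL;
					continue;
				}
				maps[i][0] = (char)i;   /* touch the page */
			}
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t[NTHREADS];

		for (int i = 0; i < NTHREADS; i++)
			pthread_create(&t[i], NULL, churn, NULL);
		pause();                /* run until interrupted */
		return 0;
	}

Build with something like `gcc -O2 -pthread` and run it in the guest while 
the host captures the kvm_page_fault traces from your debug patch.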

I have also considered passing a device like null_blk.ko into the guest, 
and then doing memory mappings against it in the same way, to put pressure 
on the direct IO path from KVM into the guest user space (see the second 
sketch below).
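
For the null_blk variant, the in-guest side might look something like 
this.  It is only a sketch: presumably the host would `modprobe null_blk` 
and attach /dev/nullb0 to the guest, /dev/vdb below is just a placeholder 
for whatever name the device gets inside the guest, and the 4096-page 
window count is arbitrary.

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long pagesz = sysconf(_SC_PAGESIZE);
		int fd = open("/dev/vdb", O_RDWR); /* placeholder device name */

		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* Map page-sized windows of the block device, dirty them,
		 * force writeback, and unmap, so block-backed mappings get
		 * mixed into the same churn as the anonymous ones. */
		for (unsigned long i = 0; ; i++) {
			off_t off = (off_t)(i % 4096) * pagesz;
			char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
				       MAP_SHARED, fd, off);

			if (p == MAP_FAILED)
				continue;
			p[0] = (char)i;             /* dirty the page */
			msync(p, pagesz, MS_SYNC);  /* write back to the device */
			munmap(p, pagesz);
		}
		return 0;
	}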

If you (or anyone else) have other suggestions, I would love to hear 
them.  Maybe we can build a reproducer for this.


--
Eric Wheeler


> 
> Either way, migrating the VM to a new host and thus a new KVM instance essentially
> resets all of that metadata and allows KVM to fault-in pages and establish mappings.
> 
> Actually, one thing you could try to unstick a VM would be to do an intra-host
> migration, i.e. migrate it to a new KVM instance on the same host.  If that "fixes"
> the guest, then the bug is likely an mmu_notifier counting bug and not an
> invalidation storm.
> 
> But the easiest thing would be to catch a host in the act, i.e. capture traces
> with my debug patch from a host with a stuck vCPU.
> 

Thread overview: 48+ messages
2023-05-23 14:02 Deadlock due to EPT_VIOLATION Brian Rak
2023-05-23 16:22 ` Sean Christopherson
2023-05-24 13:39   ` Brian Rak
2023-05-26 16:59     ` Brian Rak
2023-05-26 21:02       ` Sean Christopherson
2023-05-30 17:35         ` Brian Rak
2023-05-30 18:36           ` Sean Christopherson
2023-05-31 17:40             ` Brian Rak
2023-07-21 14:34             ` Amaan Cheval
2023-07-21 17:37               ` Sean Christopherson
2023-07-24 12:08                 ` Amaan Cheval
2023-07-25 17:30                   ` Sean Christopherson
2023-08-02 14:21                     ` Amaan Cheval
2023-08-02 15:34                       ` Sean Christopherson
2023-08-02 16:45                         ` Amaan Cheval
2023-08-02 17:52                           ` Sean Christopherson
2023-08-08 15:34                             ` Amaan Cheval
2023-08-08 17:07                               ` Sean Christopherson
2023-08-10  0:48                                 ` Eric Wheeler
2023-08-10  1:27                                   ` Eric Wheeler
2023-08-10 23:58                                     ` Sean Christopherson
2023-08-11 12:37                                       ` Amaan Cheval
2023-08-11 18:02                                         ` Sean Christopherson
2023-08-12  0:50                                           ` Eric Wheeler [this message]
2023-08-14 17:29                                             ` Sean Christopherson
2023-08-15  0:30                                 ` Eric Wheeler
2023-08-15 16:10                                   ` Sean Christopherson
2023-08-16 23:54                                     ` Eric Wheeler
2023-08-17 18:21                                       ` Sean Christopherson
2023-08-18  0:55                                         ` Eric Wheeler
2023-08-18 14:33                                           ` Sean Christopherson
2023-08-18 23:06                                             ` Eric Wheeler
2023-08-21 20:27                                               ` Eric Wheeler
2023-08-21 23:51                                                 ` Sean Christopherson
2023-08-22  0:11                                                   ` Sean Christopherson
2023-08-22  1:10                                                   ` Eric Wheeler
2023-08-22 15:11                                                     ` Sean Christopherson
2023-08-22 21:23                                                       ` Eric Wheeler
2023-08-22 21:32                                                         ` Sean Christopherson
2023-08-23  0:39                                                       ` Eric Wheeler
2023-08-23 17:54                                                         ` Sean Christopherson
2023-08-23 19:44                                                           ` Eric Wheeler
2023-08-23 22:12                                                           ` Eric Wheeler
2023-08-23 22:32                                                             ` Eric Wheeler
2023-08-23 23:21                                                               ` Sean Christopherson
2023-08-24  0:30                                                                 ` Eric Wheeler
2023-08-24  0:52                                                                   ` Sean Christopherson
2023-08-24 23:51                                                                     ` Eric Wheeler
