kvm.vger.kernel.org archive mirror
From: Sean Christopherson <seanjc@google.com>
To: Anish Moorthy <amoorthy@google.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>,
	Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oliver.upton@linux.dev>,
	jthoughton@google.com, kvm@vger.kernel.org
Subject: Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
Date: Tue, 21 Mar 2023 08:21:14 -0700	[thread overview]
Message-ID: <ZBnLaidtZEM20jMp@google.com> (raw)
In-Reply-To: <CAF7b7mrVQ6zP6SLHm4QBfQLgaxQuMtxjhqU5YKjjKGkoND4MLw@mail.gmail.com>

On Mon, Mar 20, 2023, Anish Moorthy wrote:
> On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Mar 17, 2023, Anish Moorthy wrote:
> > > On Fri, Mar 17, 2023 at 2:50 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > I wonder if we can get away with returning -EFAULT, but still filling vcpu->run
> > > > with KVM_EXIT_MEMORY_FAULT and all the other metadata.  That would likely simplify
> > > > the implementation greatly, and would let KVM fill vcpu->run unconditonally.  KVM
> > > > would still need a capability to advertise support to userspace, but userspace
> > > > wouldn't need to opt in.  I think this may have been my very original though, and
> > > > I just never actually wrote it down...
> > >
> > > Oh, good to know that's actually an option. I thought of that too, but
> > > assumed that returning a negative error code was a no-go for a proper
> > > vCPU exit. But if that's not true then I think it's the obvious
> > > solution because it precludes any uncaught behavior-change bugs.
> > >
> > > A couple of notes
> > > 1. Since we'll likely miss some -EFAULT returns, we'll need to make
> > > sure that the user can check for / doesn't see a stale
> > > kvm_run::memory_fault field when a missed -EFAULT makes it to
> > > userspace. It's a small and easy-to-fix detail, but I thought I'd
> > > point it out.
> >
> > Ya, this is the main concern for me as well.  I'm not as confident that it's
> > easy-to-fix/avoid though.
> >
> > > 2. I don't think this would simplify the series that much, since we
> > > still need to find the call sites returning -EFAULT to userspace and
> > > populate memory_fault only in those spots to avoid populating it for
> > > -EFAULTs which don't make it to userspace.
> >
> > Filling kvm_run::memory_fault even if KVM never exits to userspace is perfectly
> > ok.  It's not ideal, but it's ok.
> >
> > > We *could* relax that condition and just document that memory_fault should be
> > > ignored when KVM_RUN does not return -EFAULT... but I don't think that's a
> > > good solution from a coder/maintainer perspective.
> >
> > You've got things backward.  memory_fault _must_ be ignored if KVM doesn't return
> > the associated "magic combo", where the magic value is either "0+KVM_EXIT_MEMORY_FAULT"
> > or "-EFAULT+KVM_EXIT_MEMORY_FAULT".
> >
> > Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> > never sees the data, i.e. userspace is completely unaware.  This behavior is not
> > ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
> > exiting to userspace can lead to other bugs, e.g. effective corruption of the
> > kvm_run union, but at least from a uABI perspective, the behavior is acceptable.
> 
> Actually, I don't think the idea of filling in kvm_run.memory_fault
> for -EFAULTs which don't make it to userspace works at all. Consider
> the direct_map function, which bubbles its -EFAULT to
> kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called from both
> kvm_arch_async_page_ready (which ignores the return value), and by
> kvm_mmu_page_fault (where the return value does make it to userspace).
> Populating kvm_run.memory_fault anywhere in or under
> kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
> kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> already-set kvm_run.memory_fault / other kvm_run field.

This particular case is a non-issue.  kvm_check_async_pf_completion() is called
only when the current task has control of the vCPU, i.e. is the current "running"
vCPU.  That's not a coincidence either: invoking kvm_mmu_do_page_fault() without
having control of the vCPU would be fraught with races, e.g. the entire KVM MMU
context would be unstable.

That will hold true for all cases.  Using a vCPU that is not loaded (not the
current "running" vCPU in KVM's misleading terminology) to access guest memory is
simply not safe, as the vCPU state is non-deterministic.  There are paths where
KVM accesses, and even modifies, vCPU state asynchronously, e.g. for IRQ delivery
and making requests, but those are very controlled flows with dedicated machinery
to make them SMP safe.

That said, I agree that there's a risk that KVM could clobber vcpu->run by
hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
without the target vCPU being loaded:

	int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
	{
		/*
		 * Disable preemption so the kvm_running_vcpu per-CPU
		 * variable can't change out from under the check below.
		 */
		preempt_disable();
		if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
			goto out;

		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
		...
	out:
		preempt_enable();
		return -EFAULT;
	}

FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
that KVM "immediately" exits to userspace isn't ideal, but given the amount of
historical code that we need to deal with, it seems like the lesser of all evils.
Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
better failure mode than KVM not filling kvm_run when it should, i.e. false
positives are ok, false negatives are fatal.
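For the "missed -EFAULT" case, one hypothetical userspace-side guard (the sentinel
scheme, the names, and the KVM_EXIT_MEMORY_FAULT value below are illustrative
placeholders, not part of any proposed uAPI) is to stamp kvm_run::exit_reason with
a value KVM never uses before each KVM_RUN, so a bare -EFAULT can't be misread
against stale memory_fault data left over from an earlier, non-exiting fill:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical placeholder values, for illustration only. */
#define KVM_EXIT_MEMORY_FAULT	38
#define EXIT_REASON_SENTINEL	0xffffffffu

/* Minimal stand-in for the exit_reason field of struct kvm_run. */
struct kvm_run_stub {
	uint32_t exit_reason;
};

/*
 * Before each KVM_RUN, stamp exit_reason with the sentinel.  If KVM later
 * returns -EFAULT without having filled the run struct, userspace sees the
 * sentinel instead of stale memory_fault metadata.
 */
static void arm_stale_exit_detector(struct kvm_run_stub *run)
{
	run->exit_reason = EXIT_REASON_SENTINEL;
}

/* Trust memory_fault only if KVM actually wrote exit_reason this run. */
static bool exit_is_memory_fault(const struct kvm_run_stub *run)
{
	return run->exit_reason == KVM_EXIT_MEMORY_FAULT;
}
```

Whether writing exit_reason from userspace between runs is acceptable is part of
what would need to be settled; the sketch only shows that the stale-field problem
is detectable on the userspace side.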

> That in turn looks problematic for the
> memory-fault-exit-on-fast-gup-failure part of this series, because
> there are at least a couple of cases for which kvm_mmu_do_page_fault
> will -EFAULT. One is the early-efault-on-fast-gup-failure case which
> was the original purpose of this series. Another is a -EFAULT from
> FNAME(fetch) (passed up through FNAME(page_fault)). There might be
> other cases as well. But unless userspace can/should resolve *all*
> such -EFAULTs in the same manner, a kvm_run.memory_fault populated in
> "kvm_mmu_page_fault" wouldn't be actionable.

Killing the VM, which is what all VMMs do today in response to -EFAULT, is an
action.  As I've pointed out elsewhere in this thread, userspace needs to be able
to identify "faults" that it (userspace) can resolve without a hint from KVM.

In other words, KVM is still returning -EFAULT (or a variant thereof), the _only_
difference, for all intents and purposes, is that userspace is given a bit more
information about the source of the -EFAULT.
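A rough sketch of that model (the function, the enum, and the
KVM_EXIT_MEMORY_FAULT value are hypothetical placeholders, not the proposed
uAPI): the VMM either finds the annotation and tries to resolve the fault, or
falls back to today's behavior of killing the VM.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical placeholder; the real value would come from <linux/kvm.h>
 * once such a uAPI lands. */
#define KVM_EXIT_MEMORY_FAULT 38

enum vmm_action {
	VMM_CONTINUE,       /* non-error return: keep running the vCPU loop */
	VMM_RESOLVE_FAULT,  /* annotated -EFAULT: userspace can try to fix it */
	VMM_KILL_VM,        /* unannotated error: today's behavior */
};

/*
 * Classify the result of a KVM_RUN ioctl.  Only the "magic combo" of
 * -1/EFAULT plus exit_reason == KVM_EXIT_MEMORY_FAULT carries the extra
 * information; every other error path collapses to the kill-the-VM action
 * here purely for brevity.
 */
static enum vmm_action classify_kvm_run(int ret, int err, uint32_t exit_reason)
{
	if (ret >= 0)
		return VMM_CONTINUE;
	if (err == EFAULT && exit_reason == KVM_EXIT_MEMORY_FAULT)
		return VMM_RESOLVE_FAULT;
	return VMM_KILL_VM;
}
```

The point of the sketch is that nothing changes for unannotated -EFAULTs;
the annotation only adds an opportunity to resolve the fault.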

> At least, not without a whole lot of plumbing code to make it so.

Plumbing where?


Thread overview: 60+ messages
2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 01/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 02/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 03/14] KVM: Allow hva_pfn_fast to resolve read-only faults Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field Anish Moorthy
2023-03-17  0:02   ` Isaku Yamahata
2023-03-17 18:33     ` Anish Moorthy
2023-03-17 19:30       ` Oliver Upton
2023-03-17 21:50       ` Sean Christopherson
2023-03-17 22:44         ` Anish Moorthy
2023-03-20 15:53           ` Sean Christopherson
2023-03-20 18:19             ` Anish Moorthy
2023-03-20 22:11             ` Anish Moorthy
2023-03-21 15:21               ` Sean Christopherson [this message]
2023-03-21 18:01                 ` Anish Moorthy
2023-03-21 19:43                   ` Sean Christopherson
2023-03-22 21:06                     ` Anish Moorthy
2023-03-22 23:17                       ` Sean Christopherson
2023-03-28 22:19                     ` Anish Moorthy
2023-04-04 19:34                       ` Sean Christopherson
2023-04-04 20:40                         ` Anish Moorthy
2023-04-04 22:07                           ` Sean Christopherson
2023-04-05 20:21                             ` Anish Moorthy
2023-03-17 18:35   ` Oliver Upton
2023-03-15  2:17 ` [WIP Patch v2 05/14] KVM: x86: Implement memory fault exit for direct_map Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 06/14] KVM: x86: Implement memory fault exit for kvm_handle_page_fault Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 07/14] KVM: x86: Implement memory fault exit for setup_vmgexit_scratch Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 08/14] KVM: x86: Implement memory fault exit for FNAME(fetch) Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation Anish Moorthy
2023-03-17 18:59   ` Oliver Upton
2023-03-17 20:15     ` Anish Moorthy
2023-03-17 20:54       ` Sean Christopherson
2023-03-17 23:42         ` Anish Moorthy
2023-03-20 15:13           ` Sean Christopherson
2023-03-20 19:53             ` Anish Moorthy
2023-03-17 20:17     ` Sean Christopherson
2023-03-20 22:22       ` Oliver Upton
2023-03-21 14:50         ` Sean Christopherson
2023-03-21 20:23           ` Oliver Upton
2023-03-21 21:01             ` Sean Christopherson
2023-03-15  2:17 ` [WIP Patch v2 10/14] KVM: x86: Implement KVM_CAP_MEMORY_FAULT_NOWAIT Anish Moorthy
2023-03-17  0:32   ` Isaku Yamahata
2023-03-15  2:17 ` [WIP Patch v2 11/14] KVM: arm64: Allow user_mem_abort to return 0 to signal a 'normal' exit Anish Moorthy
2023-03-17 18:18   ` Oliver Upton
2023-03-15  2:17 ` [WIP Patch v2 12/14] KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT Anish Moorthy
2023-03-17 18:27   ` Oliver Upton
2023-03-17 19:00     ` Anish Moorthy
2023-03-17 19:03       ` Oliver Upton
2023-03-17 19:24       ` Sean Christopherson
2023-03-15  2:17 ` [WIP Patch v2 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
2023-03-17 17:43 ` [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Oliver Upton
2023-03-17 18:13   ` Sean Christopherson
2023-03-17 18:46     ` David Matlack
2023-03-17 18:54       ` Oliver Upton
2023-03-17 18:59         ` David Matlack
2023-03-17 19:53           ` Anish Moorthy
2023-03-17 22:03             ` Sean Christopherson
2023-03-20 15:56               ` Sean Christopherson
2023-03-17 20:35 ` Sean Christopherson
