Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO

From: Anish Moorthy <amoorthy@google.com>
To: Sean Christopherson <seanjc@google.com>
Cc: oliver.upton@linux.dev, kvm@vger.kernel.org,
	kvmarm@lists.linux.dev, pbonzini@redhat.com, maz@kernel.org,
	robert.hoo.linux@gmail.com, jthoughton@google.com,
	bgardon@google.com, dmatlack@google.com, ricarkol@google.com,
	axelrasmussen@google.com, peterx@redhat.com,
	nadav.amit@gmail.com, isaku.yamahata@gmail.com
Subject: Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
Date: Fri, 11 Aug 2023 15:12:18 -0700	[thread overview]
Message-ID: <CAF7b7mqfkLYtWBJ=u0MK7hhARHrahQXHza9VnaughyNz5_tNug@mail.gmail.com> (raw)
In-Reply-To: <ZIn6VQSebTRN1jtX@google.com>

On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
>
> Tagging a globally visible, non-static function as "inline" is odd, to say the
> least.

I think my eyes glaze over whenever I read the words "translation
unit" (my brain certainly does) so I'll have to take your word for it.
IIRC last time I tried to mark this function "static" the compiler
yelled at me, so removing the "inline" it is.

> > +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
>
> I strongly prefer to avoid "populate" and "efault".  Avoid "populate" because
> that verb will become stale the instance we do anything else in the helper.
> Avoid "efault" because it's imprecise, i.e. this isn't to be used for just any
> old -EFAULT scenario.  Something like kvm_handle_guest_uaccess_fault()? Definitely
> open to other names (especially less verbose names).

I've taken the kvm_handle_guest_uaccess_fault() name for now, though I
remember you saying something about "uaccess" names being bad because
they'll become inaccurate once GPM rolls around? I'll circle back on
the names before sending v5 out.

> > (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
>
> As I've stated multiple times, this can't WARN in "normal" builds because userspace
> can modify kvm_run fields at will.  I do want a WARN as it will allow fuzzers to
> find bugs for us, but it needs to be guarded with a Kconfig (or maybe a module
> param).  One idea would be to make the proposed CONFIG_KVM_PROVE_MMU[*] a generic
> Kconfig and use that.

For now I've added a KVM_WARN_MEMORY_FAULT_ANNOTATE_ALREADY_POPULATED
kconfig: open to suggestions on the name.

> I got a bit (ok, way more than a bit) lost in all of the (A) (B) (C) madness.  I
> think this what you intended for each case?
>
>   (A) if there are any existing paths in KVM_RUN which cause a vCPU
>       to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
>       access but ignore the failure and then (3) complete the exit to
>       userspace set up in (1), then the contents of the kvm_run struct written
>       in (1) will be corrupted.
>
>   (B) if KVM_RUN fails a guest memory access for which the EFAULT is annotated
>       but does not return the EFAULT to userspace, then later returns an *un*annotated
>       EFAULT to userspace, then userspace will receive incorrect information.
>
>   (C) an annotated EFAULT which is ignored/suppressed followed by one which is
>       propagated to userspace. Here the exit-reason-unset check will prevent the
>       second annotation from being written, so userspace sees an annotation with
>       bad contents, If we believe that case (A) is a weird sequence of events
>       that shouldn't be happening in the first place, then case (C) seems more
>       important to ensure correctness in. But I don't know anything about how often
>       (A) happens in KVM, which is why I want others' opinions.

Yeah, I got lost in the weeds: you've gotten the important points though

> (A) does sadly happen.  I wouldn't call it a "pattern" though, it's an unfortunate
> side effect of deficiencies in KVM's uAPI.
>
> (B) is the trickiest to defend against in the kernel, but as I mentioned in earlier
> versions of this series, userspace needs to guard against a vCPU getting stuck in
> an infinite fault anyways, so I'm not _that_ concerned with figuring out a way to
> address this in the kernel.  KVM's documentation should strongly encourage userspace
> to take action if KVM repeatedly exits with the same info over and over, but beyond
> that I think anything else is nice to have, not mandatory.
>
> (C) should simply not be possible.  (A) is very much a "this shouldn't happen,
> but it does" case.  KVM provides no meaningful guarantees if (A) does happen, so
> yes, prioritizing correctness for (C) is far more important.
>
> That said, prioritizing (C) doesn't mean we can't also do our best to play nice
> with (A).  None of the existing exits use anywhere near the exit info union's 256
> bytes, i.e. there is tons of space to play with.  So rather than put memory_fault
> in with all the others, what if we split the union in two, and place memory_fault
> in the high half (doesn't have to literally be half, but you get the idea).  It'd
> kinda be similar to x86's contributory vs. benign faults; exits that can't be
> "nested" or "speculative" go in the low half, and things like memory_fault go in
> the high half.
>
> That way, if (A) does occur, the original information will be preserved when KVM
> fills memory_fault.  And my suggestion to WARN-and-continue limits the problematic
> scenarios to just fields in the second union, i.e. just memory_fault for now.
> At the very least, not clobbering would likely make it easier for us to debug when
> things go sideways.
>
> And rather than use kvm_run.exit_reason as the canary, we should carve out a
> kernel-only, i.e. non-ABI, field from the union.  That would allow setting the
> canary in common KVM code, which can't be done for kvm_run.exit_reason because
> some architectures, e.g. s390 (and x86 IIRC), consume the exit_reason early on
> in KVM_RUN.

I think this is a good idea :D I was going to suggest something
similar a while back, but I thought it would be off the table- whoops.

My one concern is that if/when other features eventually also use the
"speculative" portion, then they're going to run into the same issues
as we're trying to avoid here. But fixing *that* (probably by
propagating these exits through return values/the call stack) would be
a really big refactor, and C doesn't really have the type system for
it in the first place :(