Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu

From: Sean Christopherson <seanjc@google.com>
To: Mingwei Zhang <mizhang@google.com>
Cc: David Matlack <dmatlack@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Nagareddy Reddy <nspreddy@google.com>,
	Jim Mattson <jmattson@google.com>
Subject: Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu
Date: Mon, 12 Dec 2022 16:45:59 +0000	[thread overview]
Message-ID: <Y5dax8XJV0F5adUw@google.com> (raw)
In-Reply-To: <Y5avm5VXpRt263wQ@google.com>

On Mon, Dec 12, 2022, Mingwei Zhang wrote:
> On Fri, Dec 09, 2022, David Matlack wrote:
> > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > > Deprecate BUG() in pte_list_remove() in shadow mmu to avoid crashing a
> > > physical machine. There are several reasons and motivations to do so:
> > > 
> > > MMU bug is difficult to discover due to various racing conditions and
> > > corner cases and thus it extremely hard to debug. The situation gets much
> > > worse when it triggers the shutdown of a host. Host machine crash might
> > > eliminates everything including the potential clues for debugging.
> > > 
> > > From cloud computing service perspective, BUG() or BUG_ON() is probably no
> > > longer appropriate as the host reliability is top priority. Crashing the
> > > physical machine is almost never a good option as it eliminates innocent
> > > VMs and cause service outage in a larger scope. Even worse, if attacker can
> > > reliably triggers this code by diverting the control flow or corrupting the
> > > memory, then this becomes vm-of-death attack. This is a huge attack vector
> > > to cloud providers, as the death of one single host machine is not the end
> > > of the story. Without manual interferences, a failed cloud job may be
> > > dispatched to other hosts and continue host crashes until all of them are
> > > dead.
> > 
> > My only concern with using KVM_BUG() is whether the machine can keep
> > running correctly after this warning has been hit. In other words, are
> > we sure the damage is contained to just this VM?

Hmm, good point.  The counter-argument is that KVM doesn't BUG() in get_mmio_spte()
when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple
splats where the reserved bits appear to be set by stack overflow, that has a much,
much higher probability of being a symptom of data corruption.

That said, that's more of a reason to change get_mmio_spte() than it is to ignore
potential data corruption in this case.  KVM arguably should kill the VM if
get_mmio_spte() fails too.

What about explicitly treating both get_mmio_spte() and this as potential data
corruption?  E.g. something like this, and then use the DATA_CORRUPTION variant
in pte_list_remove()?

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2055e04b8f89..1cb69c6d186b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
                        pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
                               sptes[level], level,
                               get_rsvd_bits(rsvd_check, sptes[level], level));
+               KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);
        }
 
        return reserved;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f16c4689322b..5c4a06f66f46 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
        unlikely(__ret);                                        \
 })
 
+#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)                  \
+({                                                             \
+       int __ret = (cond);                                     \
+                                                               \
+       if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))          \
+               BUG_ON(__ret);                                  \
+       else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))      \
+               kvm_vm_bugged(kvm);                             \
+       unlikely(__ret);                                        \
+})
+
 static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_PROVE_RCU


> > If, for example, the KVM_BUG() was triggered by a use-after-free, then
> > there might be corrupted memory floating around in the machine.
> > 
> 
> David,
> 
> Your concern is quite reasonable. But given that both rmap and spte are
> pointers/data structures managed by individual VMs, i.e., none of them
> are global pointers, use-after-free is unlikely happening on cross-VM
> cases.

Being per-VM allocations doesn't change the behavior/impact of use-after-free.
E.g. if there is no rmap found for a SPTE then there's a non-zero chance KVM has
previously zapped the SPTE and freed the memory the SPTE pointed at, and thus KVM
might be reading/writing memory that is now owned by something else in the kernel.

> Even if there is, then shuting down those corrupted VMs is feasible
> here, since pte_list_remove() basically does the checking.

But the damage may already be done.  And the KVM_REQ_VM_DEAD request wont't be
recognized until the next vcpu_enter_enter_guest(), e.g. it won't prevent vCPUs
(or even this vCPU) from processing more