* [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu
@ 2022-11-29 19:12 Mingwei Zhang
  2022-11-29 19:12 ` [RFC PATCH v4 1/2] KVM: x86/mmu: plumb struct kvm all the way to pte_list_remove() Mingwei Zhang
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Mingwei Zhang @ 2022-11-29 19:12 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: H. Peter Anvin, kvm, linux-kernel, Mingwei Zhang,
	Nagareddy Reddy, Jim Mattson, David Matlack

Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing
the physical machine. There are several reasons and motivations to do so:

MMU bugs are difficult to discover due to various race conditions and
corner cases, and are thus extremely hard to debug. The situation gets
much worse when a bug triggers the shutdown of the host. A host machine
crash may eliminate everything, including the potential clues for
debugging.

From a cloud computing service perspective, BUG() or BUG_ON() is probably
no longer appropriate, as host reliability is the top priority. Crashing
the physical machine is almost never a good option: it kills innocent VMs
and causes a service outage of a much larger scope. Even worse, if an
attacker can reliably trigger this code by diverting control flow or
corrupting memory, this becomes a VM-of-death attack. That is a huge
attack vector for cloud providers, as the death of a single host machine
is not the end of the story. Without manual intervention, a failed cloud
job may be dispatched to other hosts and keep crashing hosts until all of
them are dead.

For the above reasons, we propose replacing the BUG() in pte_list_remove()
with KVM_BUG() to crash just the VM itself.
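
For reference, KVM_BUG() is defined in include/linux/kvm_host.h roughly as
follows (paraphrased; the exact definition varies by kernel version). It
WARNs once, marks the VM as bugged so that every vCPU bails out at its next
guest entry, and evaluates to the condition so callers can return early:

#define KVM_BUG(cond, kvm, fmt...)				\
({								\
	int __ret = (cond);					\
								\
	if (WARN_ONCE(__ret && !(kvm)->vm_bugged, fmt))		\
		kvm_vm_bugged(kvm);				\
	unlikely(__ret);					\
})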

v3 -> v4:
 - update code to integrate messages into KVM_BUG() [seanjc].
 - update commit message [seanjc].

v2 -> v3:
 - plumb @kvm all the way to pte_list_remove() [seanjc, pbonzini]
 - https://lore.kernel.org/lkml/20221128002043.1555543-1-mizhang@google.com/

v1 -> v2:
 - compile test the code.
 - fill KVM_BUG() with kvm_get_running_vcpu()->kvm
 - https://lore.kernel.org/all/20221124003505.424617-1-mizhang@google.com/

rfc v1:
 - https://lore.kernel.org/all/20221123231206.274392-1-mizhang@google.com/


Mingwei Zhang (2):
  KVM: x86/mmu: plumb struct kvm all the way to pte_list_remove()
  KVM: x86/mmu: replace BUG() with KVM_BUG() in shadow mmu

 arch/x86/kvm/mmu/mmu.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

-- 
2.38.1.584.g0f3c55d4c2-goog



* [RFC PATCH v4 1/2] KVM: x86/mmu: plumb struct kvm all the way to pte_list_remove()
  2022-11-29 19:12 [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu Mingwei Zhang
@ 2022-11-29 19:12 ` Mingwei Zhang
  2022-11-29 19:12 ` [RFC PATCH v4 2/2] KVM: x86/mmu: replace BUG() with KVM_BUG() in shadow mmu Mingwei Zhang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Mingwei Zhang @ 2022-11-29 19:12 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: H. Peter Anvin, kvm, linux-kernel, Mingwei Zhang,
	Nagareddy Reddy, Jim Mattson, David Matlack

Plumb struct kvm all the way to pte_list_remove() to allow the usage of
KVM_BUG() and/or KVM_BUG_ON(). This is a preparation step to deprecate the
usage of BUG() in pte_list_remove() in the shadow MMU.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4736d7849c60..b5a44b8f5f7b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -947,7 +947,8 @@ pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
 	mmu_free_pte_list_desc(desc);
 }
 
-static void pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head)
+static void pte_list_remove(struct kvm *kvm, u64 *spte,
+			    struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
 	struct pte_list_desc *prev_desc;
@@ -987,7 +988,7 @@ static void kvm_zap_one_rmap_spte(struct kvm *kvm,
 				  struct kvm_rmap_head *rmap_head, u64 *sptep)
 {
 	mmu_spte_clear_track_bits(kvm, sptep);
-	pte_list_remove(sptep, rmap_head);
+	pte_list_remove(kvm, sptep, rmap_head);
 }
 
 /* Return true if at least one SPTE was zapped, false otherwise */
@@ -1077,7 +1078,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 	slot = __gfn_to_memslot(slots, gfn);
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
 
-	pte_list_remove(spte, rmap_head);
+	pte_list_remove(kvm, spte, rmap_head);
 }
 
 /*
@@ -1730,16 +1731,16 @@ static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
 	pte_list_add(cache, parent_pte, &sp->parent_ptes);
 }
 
-static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
+static void mmu_page_remove_parent_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 				       u64 *parent_pte)
 {
-	pte_list_remove(parent_pte, &sp->parent_ptes);
+	pte_list_remove(kvm, parent_pte, &sp->parent_ptes);
 }
 
-static void drop_parent_pte(struct kvm_mmu_page *sp,
+static void drop_parent_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 			    u64 *parent_pte)
 {
-	mmu_page_remove_parent_pte(sp, parent_pte);
+	mmu_page_remove_parent_pte(kvm, sp, parent_pte);
 	mmu_spte_clear_no_track(parent_pte);
 }
 
@@ -2382,7 +2383,7 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		if (child->role.access == direct_access)
 			return;
 
-		drop_parent_pte(child, sptep);
+		drop_parent_pte(vcpu->kvm, child, sptep);
 		kvm_flush_remote_tlbs_with_address(vcpu->kvm, child->gfn, 1);
 	}
 }
@@ -2400,7 +2401,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 			drop_spte(kvm, spte);
 		} else {
 			child = spte_to_child_sp(pte);
-			drop_parent_pte(child, spte);
+			drop_parent_pte(kvm, child, spte);
 
 			/*
 			 * Recursively zap nested TDP SPs, parentless SPs are
@@ -2431,13 +2432,13 @@ static int kvm_mmu_page_unlink_children(struct kvm *kvm,
 	return zapped;
 }
 
-static void kvm_mmu_unlink_parents(struct kvm_mmu_page *sp)
+static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
 
 	while ((sptep = rmap_get_first(&sp->parent_ptes, &iter)))
-		drop_parent_pte(sp, sptep);
+		drop_parent_pte(kvm, sp, sptep);
 }
 
 static int mmu_zap_unsync_children(struct kvm *kvm,
@@ -2475,7 +2476,7 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
 	++kvm->stat.mmu_shadow_zapped;
 	*nr_zapped = mmu_zap_unsync_children(kvm, sp, invalid_list);
 	*nr_zapped += kvm_mmu_page_unlink_children(kvm, sp, invalid_list);
-	kvm_mmu_unlink_parents(sp);
+	kvm_mmu_unlink_parents(kvm, sp);
 
 	/* Zapping children means active_mmu_pages has become unstable. */
 	list_unstable = *nr_zapped;
@@ -2839,7 +2840,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 			u64 pte = *sptep;
 
 			child = spte_to_child_sp(pte);
-			drop_parent_pte(child, sptep);
+			drop_parent_pte(vcpu->kvm, child, sptep);
 			flush = true;
 		} else if (pfn != spte_to_pfn(*sptep)) {
 			pgprintk("hfn old %llx new %llx\n",
-- 
2.38.1.584.g0f3c55d4c2-goog



* [RFC PATCH v4 2/2] KVM: x86/mmu: replace BUG() with KVM_BUG() in shadow mmu
  2022-11-29 19:12 [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu Mingwei Zhang
  2022-11-29 19:12 ` [RFC PATCH v4 1/2] KVM: x86/mmu: plumb struct kvm all the way to pte_list_remove() Mingwei Zhang
@ 2022-11-29 19:12 ` Mingwei Zhang
  2022-12-06 23:06 ` [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() " Mingwei Zhang
  2022-12-09 21:28 ` David Matlack
  3 siblings, 0 replies; 10+ messages in thread
From: Mingwei Zhang @ 2022-11-29 19:12 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: H. Peter Anvin, kvm, linux-kernel, Mingwei Zhang,
	Nagareddy Reddy, Jim Mattson, David Matlack

Replace the BUG() in pte_list_remove() with KVM_BUG() to avoid crashing
the host. MMU bugs are difficult to discover due to various race
conditions and corner cases, and are thus extremely hard to debug. The
situation gets much worse when a bug triggers the shutdown of the host. A
host machine crash eliminates everything, including the potential clues
for debugging.

BUG() or BUG_ON() is probably no longer appropriate, as host reliability
is the top priority in many business scenarios. Crashing the physical
machine is almost never a good option: it kills innocent VMs and causes a
service outage of a much larger scope. Even worse, if an attacker can
reliably trigger this code by diverting control flow, corrupting memory,
or leveraging a KVM bug, this becomes a VM-of-death attack. That is a huge
attack vector for cloud providers, as the death of a single host machine
is not the end of the story. Without manual intervention, a failed cloud
job may be dispatched to other hosts and keep crashing hosts until all of
them are dead.

For the above reasons, shrink the scope of the crash to the target VM
only.

Cc: Nagareddy Reddy <nspreddy@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b5a44b8f5f7b..12790ccb8731 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -954,15 +954,16 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	struct pte_list_desc *prev_desc;
 	int i;
 
-	if (!rmap_head->val) {
-		pr_err("%s: %p 0->BUG\n", __func__, spte);
-		BUG();
-	} else if (!(rmap_head->val & 1)) {
+	if (KVM_BUG(!rmap_head->val, kvm, "rmap for %p is empty", spte))
+		return;
+
+	if (!(rmap_head->val & 1)) {
 		rmap_printk("%p 1->0\n", spte);
-		if ((u64 *)rmap_head->val != spte) {
-			pr_err("%s:  %p 1->BUG\n", __func__, spte);
-			BUG();
-		}
+
+		if (KVM_BUG((u64 *)rmap_head->val != spte, kvm,
+			    "single rmap for %p doesn't match", spte))
+			return;
+
 		rmap_head->val = 0;
 	} else {
 		rmap_printk("%p many->many\n", spte);
@@ -979,8 +980,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 			prev_desc = desc;
 			desc = desc->more;
 		}
-		pr_err("%s: %p many->many\n", __func__, spte);
-		BUG();
+		KVM_BUG(true, kvm, "no rmap for %p (many->many)", spte);
 	}
 }
 
-- 
2.38.1.584.g0f3c55d4c2-goog



* Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu
  2022-11-29 19:12 [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu Mingwei Zhang
  2022-11-29 19:12 ` [RFC PATCH v4 1/2] KVM: x86/mmu: plumb struct kvm all the way to pte_list_remove() Mingwei Zhang
  2022-11-29 19:12 ` [RFC PATCH v4 2/2] KVM: x86/mmu: replace BUG() with KVM_BUG() in shadow mmu Mingwei Zhang
@ 2022-12-06 23:06 ` Mingwei Zhang
  2022-12-09 21:28 ` David Matlack
  3 siblings, 0 replies; 10+ messages in thread
From: Mingwei Zhang @ 2022-12-06 23:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: H. Peter Anvin, kvm, linux-kernel, Nagareddy Reddy, Jim Mattson,
	David Matlack

gentle ping on this one?

On Tue, Nov 29, 2022 at 11:12 AM Mingwei Zhang <mizhang@google.com> wrote:
>
> Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing
> the physical machine. There are several reasons and motivations to do so:
>
> MMU bugs are difficult to discover due to various race conditions and
> corner cases, and are thus extremely hard to debug. The situation gets
> much worse when a bug triggers the shutdown of the host. A host machine
> crash may eliminate everything, including the potential clues for
> debugging.
>
> From a cloud computing service perspective, BUG() or BUG_ON() is probably
> no longer appropriate, as host reliability is the top priority. Crashing
> the physical machine is almost never a good option: it kills innocent VMs
> and causes a service outage of a much larger scope. Even worse, if an
> attacker can reliably trigger this code by diverting control flow or
> corrupting memory, this becomes a VM-of-death attack. That is a huge
> attack vector for cloud providers, as the death of a single host machine
> is not the end of the story. Without manual intervention, a failed cloud
> job may be dispatched to other hosts and keep crashing hosts until all of
> them are dead.
>
> For the above reasons, we propose replacing the BUG() in pte_list_remove()
> with KVM_BUG() to crash just the VM itself.
>
> v3 - v4:
>  - update code to integrate messages into KVM_BUG() [seanjc].
>  - update commit message [seanjc].
>
> v2 -> v3:
>  - plumb @kvm all the way to pte_list_remove() [seanjc, pbonzini]
>  - https://lore.kernel.org/lkml/20221128002043.1555543-1-mizhang@google.com/
>
> v1 -> v2:
>  - compile test the code.
>  - fill KVM_BUG() with kvm_get_running_vcpu()->kvm
>  - https://lore.kernel.org/all/20221124003505.424617-1-mizhang@google.com/
>
> rfc v1:
>  - https://lore.kernel.org/all/20221123231206.274392-1-mizhang@google.com/
>
>
> Mingwei Zhang (2):
>   KVM: x86/mmu: plumb struct kvm all the way to pte_list_remove()
>   KVM: x86/mmu: replace BUG() with KVM_BUG() in shadow mmu
>
>  arch/x86/kvm/mmu/mmu.c | 33 +++++++++++++++++----------------
>  1 file changed, 17 insertions(+), 16 deletions(-)
>
> --
> 2.38.1.584.g0f3c55d4c2-goog
>


* Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu
  2022-11-29 19:12 [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu Mingwei Zhang
                   ` (2 preceding siblings ...)
  2022-12-06 23:06 ` [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() " Mingwei Zhang
@ 2022-12-09 21:28 ` David Matlack
  2022-12-12  4:35   ` Mingwei Zhang
  3 siblings, 1 reply; 10+ messages in thread
From: David Matlack @ 2022-12-09 21:28 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, H. Peter Anvin, kvm,
	linux-kernel, Nagareddy Reddy, Jim Mattson

On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing
> the physical machine. There are several reasons and motivations to do so:
>
> MMU bugs are difficult to discover due to various race conditions and
> corner cases, and are thus extremely hard to debug. The situation gets
> much worse when a bug triggers the shutdown of the host. A host machine
> crash may eliminate everything, including the potential clues for
> debugging.
>
> From a cloud computing service perspective, BUG() or BUG_ON() is probably
> no longer appropriate, as host reliability is the top priority. Crashing
> the physical machine is almost never a good option: it kills innocent VMs
> and causes a service outage of a much larger scope. Even worse, if an
> attacker can reliably trigger this code by diverting control flow or
> corrupting memory, this becomes a VM-of-death attack. That is a huge
> attack vector for cloud providers, as the death of a single host machine
> is not the end of the story. Without manual intervention, a failed cloud
> job may be dispatched to other hosts and keep crashing hosts until all of
> them are dead.

My only concern with using KVM_BUG() is whether the machine can keep
running correctly after this warning has been hit. In other words, are
we sure the damage is contained to just this VM?

If, for example, the KVM_BUG() was triggered by a use-after-free, then
there might be corrupted memory floating around in the machine.

What are some instances where we've seen these BUG_ON()s get triggered?
For those instances, would it actually be safe to just kill the current
VM and keep the rest of the machine running?

> 
> For the above reasons, we propose replacing the BUG() in pte_list_remove()
> with KVM_BUG() to crash just the VM itself.

How did you test this series?


* Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu
  2022-12-09 21:28 ` David Matlack
@ 2022-12-12  4:35   ` Mingwei Zhang
  2022-12-12 16:45     ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Mingwei Zhang @ 2022-12-12  4:35 UTC (permalink / raw)
  To: David Matlack
  Cc: Sean Christopherson, Paolo Bonzini, H. Peter Anvin, kvm,
	linux-kernel, Nagareddy Reddy, Jim Mattson

On Fri, Dec 09, 2022, David Matlack wrote:
> On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing
> > the physical machine. There are several reasons and motivations to do so:
> >
> > MMU bugs are difficult to discover due to various race conditions and
> > corner cases, and are thus extremely hard to debug. The situation gets
> > much worse when a bug triggers the shutdown of the host. A host machine
> > crash may eliminate everything, including the potential clues for
> > debugging.
> >
> > From a cloud computing service perspective, BUG() or BUG_ON() is probably
> > no longer appropriate, as host reliability is the top priority. Crashing
> > the physical machine is almost never a good option: it kills innocent VMs
> > and causes a service outage of a much larger scope. Even worse, if an
> > attacker can reliably trigger this code by diverting control flow or
> > corrupting memory, this becomes a VM-of-death attack. That is a huge
> > attack vector for cloud providers, as the death of a single host machine
> > is not the end of the story. Without manual intervention, a failed cloud
> > job may be dispatched to other hosts and keep crashing hosts until all of
> > them are dead.
> 
> My only concern with using KVM_BUG() is whether the machine can keep
> running correctly after this warning has been hit. In other words, are
> we sure the damage is contained to just this VM?
> 
> If, for example, the KVM_BUG() was triggered by a use-after-free, then
> there might be corrupted memory floating around in the machine.
> 

David,

Your concern is quite reasonable. But given that both the rmap and the
SPTE are pointers/data structures managed by individual VMs, i.e., none of
them are global pointers, a use-after-free is unlikely to cross VM
boundaries. Even if there is one, shutting down the corrupted VMs is
feasible here, since pte_list_remove() basically does the checking.

> What are some instances where we've seen these BUG_ON()s get triggered?
> For those instances, would it actually be safe to just kill the current
> VM and keep the rest of the machine running?
> 
> > 
> > For the above reasons, we propose replacing the BUG() in pte_list_remove()
> > with KVM_BUG() to crash just the VM itself.
> 
> How did you test this series?

I used a simple test case to test the series:

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0f6455072055..d4b993b26b96 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -701,7 +701,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (fault->nx_huge_page_workaround_enabled)
 			disallowed_hugepage_adjust(fault, *it.sptep, it.level);

-		base_gfn = fault->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
+		base_gfn = fault->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1) - 1;
 		if (it.level == fault->goal_level)
 			break;

On the testing machine, I launched an L1 VM and an L2 VM within it. The L2
triggers the above bug in the shadow MMU, and I got the error shown below
in the L0 kernel dmesg. L1 and L2 hang with high CPU usage for a while,
and after a couple of seconds the L1 VM dies properly. The machine stays
alive and subsequent VM operations (launch/kill) all work.

[ 1678.043378] ------------[ cut here ]------------
[ 1678.043381] gfn mismatch under direct page 1041bf (expected 10437e, got 1043be)
[ 1678.043386] WARNING: CPU: 4 PID: 23430 at arch/x86/kvm/mmu/mmu.c:737 kvm_mmu_page_set_translation+0x131/0x140
[ 1678.043395] Modules linked in: kvm_intel vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd sha3_generic gq(O)
[ 1678.043404] CPU: 4 PID: 23430 Comm: VCPU-7 Tainted: G S         O       6.1.0-smp-DEV #5
[ 1678.043406] Hardware name: Google LLC Indus/Indus_QC_02, BIOS 30.12.6 02/14/2022
[ 1678.043407] RIP: 0010:kvm_mmu_page_set_translation+0x131/0x140
[ 1678.043411] Code: 0f 44 e0 4c 8b 6b 28 48 89 df 44 89 f6 e8 b7 fb ff ff 48 c7 c7 1b 5a 2f 82 4c 89 e6 4c 89 ea 48 89 c1 4d 89 f8 e8 9f 39 0c 00 <0f> 0b eb ac 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
[ 1678.043413] RSP: 0018:ffff88811ba87918 EFLAGS: 00010246
[ 1678.043415] RAX: 1bdd851636664d00 RBX: ffff888118602e60 RCX: 0000000000000027
[ 1678.043416] RDX: 0000000000000002 RSI: c0000000ffff7fff RDI: ffff8897e0320488
[ 1678.043417] RBP: ffff88811ba87940 R08: 0000000000000000 R09: ffffffff82b2e6f0
[ 1678.043418] R10: 00000000ffff7fff R11: 0000000000000000 R12: ffffffff822e89da
[ 1678.043419] R13: 00000000001041bf R14: 00000000000001bf R15: 00000000001043be
[ 1678.043421] FS:  00007fee198ec700(0000) GS:ffff8897e0300000(0000) knlGS:0000000000000000
[ 1678.043422] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1678.043424] CR2: 0000000000000000 CR3: 0000001857c34005 CR4: 00000000003726e0
[ 1678.043425] Call Trace:
[ 1678.043426]  <TASK>
[ 1678.043428]  __rmap_add+0x8a/0x270
[ 1678.043432]  mmu_set_spte+0x250/0x340
[ 1678.043435]  ept_fetch+0x8ad/0xc00
[ 1678.043437]  ept_page_fault+0x265/0x2f0
[ 1678.043440]  kvm_mmu_page_fault+0xfa/0x2d0
[ 1678.043443]  handle_ept_violation+0x135/0x2e0 [kvm_intel]
[ 1678.043455]  ? handle_desc+0x20/0x20 [kvm_intel]
[ 1678.043462]  __vmx_handle_exit+0x1c3/0x480 [kvm_intel]
[ 1678.043468]  vmx_handle_exit+0x12/0x40 [kvm_intel]
[ 1678.043474]  vcpu_enter_guest+0xbb3/0xf80
[ 1678.043477]  ? complete_fast_pio_in+0xcc/0x160
[ 1678.043480]  kvm_arch_vcpu_ioctl_run+0x3b0/0x770
[ 1678.043481]  kvm_vcpu_ioctl+0x52d/0x610
[ 1678.043486]  ? kvm_on_user_return+0x46/0xd0
[ 1678.043489]  __se_sys_ioctl+0x77/0xc0
[ 1678.043492]  __x64_sys_ioctl+0x1d/0x20
[ 1678.043493]  do_syscall_64+0x3d/0x80
[ 1678.043497]  ? sysvec_apic_timer_interrupt+0x49/0x90
[ 1678.043499]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1678.043501] RIP: 0033:0x7fee3ebf0347
[ 1678.043503] Code: 5d c3 cc 48 8b 05 f9 2f 07 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 cc cc cc cc cc cc cc cc cc cc b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 2f 07 00 f7 d8 64 89 01 48
[ 1678.043505] RSP: 002b:00007fee198e8998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1678.043507] RAX: ffffffffffffffda RBX: 0000555308e7e4d0 RCX: 00007fee3ebf0347
[ 1678.043507] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 00000000000000b0
[ 1678.043508] RBP: 00007fee198e89c0 R08: 000055530943d920 R09: 00000000000003fa
[ 1678.043509] R10: 0000555307349b00 R11: 0000000000000246 R12: 00000000000000b0
[ 1678.043510] R13: 00005574c1a1de88 R14: 00007fee198e8a27 R15: 0000000000000000
[ 1678.043511]  </TASK>
[ 1678.043512] ---[ end trace 0000000000000000 ]---
[ 5313.657064] ------------[ cut here ]------------
[ 5313.657067] no rmap for 0000000071a2f138 (many->many)
[ 5313.657071] WARNING: CPU: 43 PID: 23398 at arch/x86/kvm/mmu/mmu.c:983 pte_list_remove+0x17a/0x190
[ 5313.657080] Modules linked in: kvm_intel vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd sha3_generic gq(O)
[ 5313.657088] CPU: 43 PID: 23398 Comm: kvm-nx-lpage-re Tainted: G S      W  O       6.1.0-smp-DEV #5
[ 5313.657090] Hardware name: Google LLC Indus/Indus_QC_02, BIOS 30.12.6 02/14/2022
[ 5313.657092] RIP: 0010:pte_list_remove+0x17a/0x190
[ 5313.657095] Code: cf e4 01 01 48 c7 c7 4d 3c 32 82 e8 70 5e 0c 00 0f 0b e9 0a ff ff ff c6 05 d4 cf e4 01 01 48 c7 c7 9e de 33 82 e8 56 5e 0c 00 <0f> 0b 84 db 75 c8 e9 ec fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
[ 5313.657097] RSP: 0018:ffff88986d5d3c30 EFLAGS: 00010246
[ 5313.657099] RAX: 1ebf71ba511d3100 RBX: 0000000000000000 RCX: 0000000000000027
[ 5313.657101] RDX: 0000000000000002 RSI: c0000000ffff7fff RDI: ffff88afdf3e0488
[ 5313.657102] RBP: ffff88986d5d3c40 R08: 0000000000000000 R09: ffffffff82b2e6f0
[ 5313.657104] R10: 00000000ffff7fff R11: 40000000ffff8a28 R12: 0000000000000000
[ 5313.657105] R13: ffff888118602000 R14: ffffc90020e1e000 R15: ffff88815df33030
[ 5313.657106] FS:  0000000000000000(0000) GS:ffff88afdf3c0000(0000) knlGS:0000000000000000
[ 5313.657107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5313.657109] CR2: 000017c92b50f1b8 CR3: 000000006f40a001 CR4: 00000000003726e0
[ 5313.657110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5313.657111] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5313.657112] Call Trace:
[ 5313.657113]  <TASK>
[ 5313.657114]  drop_spte+0x175/0x180
[ 5313.657117]  mmu_page_zap_pte+0xfd/0x130
[ 5313.657119]  __kvm_mmu_prepare_zap_page+0x290/0x6e0
[ 5313.657122]  ? newidle_balance+0x228/0x3b0
[ 5313.657126]  kvm_nx_huge_page_recovery_worker+0x266/0x360
[ 5313.657129]  kvm_vm_worker_thread+0x93/0x150
[ 5313.657134]  ? kvm_mmu_post_init_vm+0x40/0x40
[ 5313.657136]  ? kvm_vm_create_worker_thread+0x120/0x120
[ 5313.657139]  kthread+0x10d/0x120
[ 5313.657141]  ? kthread_blkcg+0x30/0x30
[ 5313.657142]  ret_from_fork+0x1f/0x30
[ 5313.657156]  </TASK>
[ 5313.657156] ---[ end trace 0000000000000000 ]---


* Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu
  2022-12-12  4:35   ` Mingwei Zhang
@ 2022-12-12 16:45     ` Sean Christopherson
  2022-12-13  0:09       ` David Matlack
  0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2022-12-12 16:45 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: David Matlack, Paolo Bonzini, H. Peter Anvin, kvm, linux-kernel,
	Nagareddy Reddy, Jim Mattson

On Mon, Dec 12, 2022, Mingwei Zhang wrote:
> On Fri, Dec 09, 2022, David Matlack wrote:
> > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > > Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing
> > > the physical machine. There are several reasons and motivations to do so:
> > >
> > > MMU bugs are difficult to discover due to various race conditions and
> > > corner cases, and are thus extremely hard to debug. The situation gets
> > > much worse when a bug triggers the shutdown of the host. A host machine
> > > crash may eliminate everything, including the potential clues for
> > > debugging.
> > >
> > > From a cloud computing service perspective, BUG() or BUG_ON() is probably
> > > no longer appropriate, as host reliability is the top priority. Crashing
> > > the physical machine is almost never a good option: it kills innocent VMs
> > > and causes a service outage of a much larger scope. Even worse, if an
> > > attacker can reliably trigger this code by diverting control flow or
> > > corrupting memory, this becomes a VM-of-death attack. That is a huge
> > > attack vector for cloud providers, as the death of a single host machine
> > > is not the end of the story. Without manual intervention, a failed cloud
> > > job may be dispatched to other hosts and keep crashing hosts until all of
> > > them are dead.
> > 
> > My only concern with using KVM_BUG() is whether the machine can keep
> > running correctly after this warning has been hit. In other words, are
> > we sure the damage is contained to just this VM?

Hmm, good point.  The counter-argument is that KVM doesn't BUG() in get_mmio_spte()
when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple
splats where the reserved bits appear to have been set by a stack overflow, that
scenario has a much, much higher probability of being a symptom of data corruption.

That said, that's more of a reason to change get_mmio_spte() than it is to ignore
potential data corruption in this case.  KVM arguably should kill the VM if
get_mmio_spte() fails too.

What about explicitly treating both get_mmio_spte() and this as potential data
corruption?  E.g. something like this, and then use the DATA_CORRUPTION variant
in pte_list_remove()?

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2055e04b8f89..1cb69c6d186b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
                        pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
                               sptes[level], level,
                               get_rsvd_bits(rsvd_check, sptes[level], level));
+               KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);
        }
 
        return reserved;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f16c4689322b..5c4a06f66f46 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
        unlikely(__ret);                                        \
 })
 
+#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)                  \
+({                                                             \
+       int __ret = (cond);                                     \
+                                                               \
+       if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))          \
+               BUG_ON(__ret);                                  \
+       else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))      \
+               kvm_vm_bugged(kvm);                             \
+       unlikely(__ret);                                        \
+})
+
 static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_PROVE_RCU


> > If, for example, the KVM_BUG() was triggered by a use-after-free, then
> > there might be corrupted memory floating around in the machine.
> > 
> 
> David,
> 
> Your concern is quite reasonable. But given that both the rmap and the
> SPTE are pointers/data structures managed by individual VMs, i.e., none of
> them are global pointers, a use-after-free is unlikely to cross VM
> boundaries.

Being per-VM allocations doesn't change the behavior/impact of use-after-free.
E.g. if there is no rmap found for a SPTE then there's a non-zero chance KVM has
previously zapped the SPTE and freed the memory the SPTE pointed at, and thus KVM
might be reading/writing memory that is now owned by something else in the kernel.

> Even if there is one, shutting down the corrupted VMs is feasible
> here, since pte_list_remove() basically does the checking.

But the damage may already be done.  And the KVM_REQ_VM_DEAD request won't be
recognized until the next vcpu_enter_guest(), e.g. it won't prevent vCPUs
(or even this vCPU) from processing more 
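
For reference, kvm_vm_bugged() is roughly the following (paraphrased from
include/linux/kvm_host.h; details vary by kernel version). Note that it only
requests that vCPUs stop; the request is serviced at the next guest entry,
not immediately:

static inline void kvm_vm_bugged(struct kvm *kvm)
{
	/* Flag the VM as bugged and ask every vCPU to bail out. */
	kvm->vm_bugged = true;
	kvm_make_all_cpus_request(kvm, KVM_REQ_VM_DEAD);
}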


* Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu
  2022-12-12 16:45     ` Sean Christopherson
@ 2022-12-13  0:09       ` David Matlack
  2022-12-13  1:39         ` Mingwei Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: David Matlack @ 2022-12-13  0:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Paolo Bonzini, H. Peter Anvin, kvm, linux-kernel,
	Nagareddy Reddy, Jim Mattson

On Mon, Dec 12, 2022 at 8:46 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 12, 2022, Mingwei Zhang wrote:
> > On Fri, Dec 09, 2022, David Matlack wrote:
> > > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > > > Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing
> > > > the physical machine. There are several reasons and motivations to do so:
> > > >
> > > > MMU bugs are difficult to discover due to various race conditions and
> > > > corner cases, and are thus extremely hard to debug. The situation gets
> > > > much worse when a bug triggers the shutdown of the host. A host machine
> > > > crash may eliminate everything, including the potential clues for
> > > > debugging.
> > > >
> > > > From a cloud computing service perspective, BUG() or BUG_ON() is probably
> > > > no longer appropriate, as host reliability is the top priority. Crashing
> > > > the physical machine is almost never a good option: it kills innocent VMs
> > > > and causes a service outage of a much larger scope. Even worse, if an
> > > > attacker can reliably trigger this code by diverting control flow or
> > > > corrupting memory, this becomes a VM-of-death attack. That is a huge
> > > > attack vector for cloud providers, as the death of a single host machine
> > > > is not the end of the story. Without manual intervention, a failed cloud
> > > > job may be dispatched to other hosts and keep crashing hosts until all of
> > > > them are dead.
> > >
> > > My only concern with using KVM_BUG() is whether the machine can keep
> > > running correctly after this warning has been hit. In other words, are
> > > we sure the damage is contained to just this VM?
>
> Hmm, good point.  The counter-argument is that KVM doesn't BUG() in get_mmio_spte()
> when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple
> splats where the reserved bits appear to have been set by a stack overflow, that
> scenario has a much, much higher probability of being a symptom of data corruption.
>
> That said, that's more of a reason to change get_mmio_spte() than it is to ignore
> potential data corruption in this case.  KVM arguably should kill the VM if
> get_mmio_spte() fails too.
>
> What about explicitly treating both get_mmio_spte() and this as potential data
> corruption?  E.g. something like this, and then use the DATA_CORRUPTION variant
> in pte_list_remove()?
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2055e04b8f89..1cb69c6d186b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
>                         pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
>                                sptes[level], level,
>                                get_rsvd_bits(rsvd_check, sptes[level], level));
> +               KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);
>         }
>
>         return reserved;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f16c4689322b..5c4a06f66f46 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
>         unlikely(__ret);                                        \
>  })
>
> +#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)                  \
> +({                                                             \
> +       int __ret = (cond);                                     \
> +                                                               \
> +       if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))          \
> +               BUG_ON(__ret);                                  \
> +       else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))      \
> +               kvm_vm_bugged(kvm);                             \
> +       unlikely(__ret);                                        \
> +})
> +
>  static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
>  {
>  #ifdef CONFIG_PROVE_RCU

That sounds reasonable to me.


* Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu
  2022-12-13  0:09       ` David Matlack
@ 2022-12-13  1:39         ` Mingwei Zhang
  2022-12-13  4:12           ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Mingwei Zhang @ 2022-12-13  1:39 UTC (permalink / raw)
  To: David Matlack
  Cc: Sean Christopherson, Paolo Bonzini, H. Peter Anvin, kvm,
	linux-kernel, Nagareddy Reddy, Jim Mattson

On Mon, Dec 12, 2022, David Matlack wrote:
> On Mon, Dec 12, 2022 at 8:46 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Dec 12, 2022, Mingwei Zhang wrote:
> > > On Fri, Dec 09, 2022, David Matlack wrote:
> > > > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > > > > Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing
> > > > > the physical machine. There are several reasons and motivations to do so:
> > > > >
> > > > > MMU bugs are difficult to discover due to various race conditions and
> > > > > corner cases, and are thus extremely hard to debug. The situation gets
> > > > > much worse when a bug triggers the shutdown of the host. A host machine
> > > > > crash may eliminate everything, including the potential clues for
> > > > > debugging.
> > > > >
> > > > > From a cloud computing service perspective, BUG() or BUG_ON() is probably
> > > > > no longer appropriate, as host reliability is the top priority. Crashing
> > > > > the physical machine is almost never a good option: it kills innocent VMs
> > > > > and causes a service outage of a much larger scope. Even worse, if an
> > > > > attacker can reliably trigger this code by diverting control flow or
> > > > > corrupting memory, this becomes a VM-of-death attack. That is a huge
> > > > > attack vector for cloud providers, as the death of a single host machine
> > > > > is not the end of the story. Without manual intervention, a failed cloud
> > > > > job may be dispatched to other hosts and keep crashing hosts until all of
> > > > > them are dead.
> > > >
> > > > My only concern with using KVM_BUG() is whether the machine can keep
> > > > running correctly after this warning has been hit. In other words, are
> > > > we sure the damage is contained to just this VM?
> >
> > Hmm, good point.  The counter-argument is that KVM doesn't BUG() in get_mmio_spte()
> > when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple
> > splats where the reserved bits appear to have been set by a stack overflow, that
> > scenario has a much, much higher probability of being a symptom of data corruption.
> >
> > That said, that's more of a reason to change get_mmio_spte() than it is to ignore
> > potential data corruption in this case.  KVM arguably should kill the VM if
> > get_mmio_spte() fails too.
> >
> > What about explicitly treating both get_mmio_spte() and this as potential data
> > corruption?  E.g. something like this, and then use the DATA_CORRUPTION variant
> > in pte_list_remove()?
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2055e04b8f89..1cb69c6d186b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
> >                         pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
> >                                sptes[level], level,
> >                                get_rsvd_bits(rsvd_check, sptes[level], level));
> > +               KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);
> >         }
> >
> >         return reserved;
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index f16c4689322b..5c4a06f66f46 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
> >         unlikely(__ret);                                        \
> >  })
> >
> > +#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)                  \
> > +({                                                             \
> > +       int __ret = (cond);                                     \
> > +                                                               \
> > +       if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))          \
> > +               BUG_ON(__ret);                                  \
> > +       else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))      \
> > +               kvm_vm_bugged(kvm);                             \
> > +       unlikely(__ret);                                        \
> > +})
> > +
> >  static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
> >  {
> >  #ifdef CONFIG_PROVE_RCU
> 
> That sounds reasonable to me.

Actually, I disagree after thinking about it for a while. Since
Google turns on CONFIG_BUG_ON_DATA_CORRUPTION by default, this
KVM_BUG_ON_DATA_CORRUPTION() is literally BUG_ON() for us, so there is no
point. The purpose of the series is to save the host from crashing.

If we apply the above change to get_mmio_spte() as well, we are
introducing more BUG()s, i.e., more chances to crash the host machine. So
I cannot stand behind this idea.

In fact, there could be many reasons why RMAPs and SPTEs get messed up.
Often, as illustrated by my example, they are not data corruption. They
are just errors due to code refactoring or some miscalculation in a
tricky corner case, e.g., an off-by-one. So treating them blindly as data
corruption is overkill. Compare this to the list traversal bug, which
clearly indicates data corruption.

So, alternatively, I think it would be reasonable to shut down the KVM
instance if we see things messed up in MMIO SPTEs as well, but please do
not crash the host for that.




* Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu
  2022-12-13  1:39         ` Mingwei Zhang
@ 2022-12-13  4:12           ` Sean Christopherson
  0 siblings, 0 replies; 10+ messages in thread
From: Sean Christopherson @ 2022-12-13  4:12 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: David Matlack, Paolo Bonzini, H. Peter Anvin, kvm, linux-kernel,
	Nagareddy Reddy, Jim Mattson

On Tue, Dec 13, 2022, Mingwei Zhang wrote:
> On Mon, Dec 12, 2022, David Matlack wrote:
> > > What about explicitly treating both get_mmio_spte() and this as potential data
> > > corruption?  E.g. something like this, and then use the DATA_CORRUPTION variant
> > > in pte_list_remove()?
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 2055e04b8f89..1cb69c6d186b 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
> > >                         pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
> > >                                sptes[level], level,
> > >                                get_rsvd_bits(rsvd_check, sptes[level], level));
> > > +               KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);

Random aside, this would be better:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 99c40617d325..d9c46f2a6748 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4090,7 +4090,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
                return RET_PF_EMULATE;
 
        reserved = get_mmio_spte(vcpu, addr, &spte);
-       if (WARN_ON(reserved))
+       if (KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm))
                return -EINVAL;
 
        if (is_mmio_spte(spte)) {

> > >         }
> > >
> > >         return reserved;
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index f16c4689322b..5c4a06f66f46 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
> > >         unlikely(__ret);                                        \
> > >  })
> > >
> > > +#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)                  \
> > > +({                                                             \
> > > +       int __ret = (cond);                                     \
> > > +                                                               \
> > > +       if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))          \
> > > +               BUG_ON(__ret);                                  \
> > > +       else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))      \
> > > +               kvm_vm_bugged(kvm);                             \
> > > +       unlikely(__ret);                                        \
> > > +})
> > > +
> > >  static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
> > >  {
> > >  #ifdef CONFIG_PROVE_RCU
> > 
> > That sounds reasonable to me.
> 
> Actually, I disagree after thinking about it for a while. Since
> Google turns on CONFIG_BUG_ON_DATA_CORRUPTION by default, this
> KVM_BUG_ON_DATA_CORRUPTION() is literally BUG_ON() for us, so there is no
> point. The purpose of the series is to save the host from crashing.

Sure, but Google is not the only consumer of KVM.  E.g. I'm guessing many distros
use CONFIG_BUG_ON_DATA_CORRUPTION=n.  Any user that doesn't have the infrastructure
to capture crash dumps is likely better served overall by the WARN-and-continue
behavior.

> If we apply the above change to get_mmio_spte() as well, we are
> introducing more BUG()s, i.e., more chances to crash the host machine. So
> I cannot stand behind this idea.

For get_mmio_spte() specifically, I am quite confident that in production kernels,
the benefits of a BUG() far outweigh the risk of unnecessarily panicking the host.
There have been recent-ish bugs that escaped into kernel releases, but the first
one (wrong root level) escaped only because it affected an esoteric, rarely used
configuration (shadow paging with a 32-bit host kernel), and the second one was
a purely theoretical bug fix that was also limited to 32-bit kernels.

  39b4d43e6003 ("KVM: x86/mmu: Get root level from walkers when retrieving MMIO SPTE")
  2aa078932ff6 ("KVM: x86/mmu: Use -1 to flag an undefined spte in get_mmio_spte()")

> In fact, there could be many reasons why RMAPs and SPTEs get messed up.
> Often, as illustrated by my example, they are not data corruption. They
> are just errors due to code refactoring or some miscalculation in a
> tricky corner case, e.g., an off-by-one. So treating them blindly as data
> corruption is overkill.

For SPTEs in prod, not really.  Something has gone really wrong if KVM messes up a
SPTE to the point where _hardware_ unexpectedly observes a reserved bit in a
production build.  We've had bugs where we've botched the software sanity checks,
e.g. see commit

  6c6ab524cfae ("KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT")

but except for rarely used setups (see above), unexpected reserved SPTE faults
in production pretty much won't happen unless there's memory corruption or a CPU
error.

For RMAPs, I agree it's less clear cut, but at the same time the status quo is to
always BUG(), so from an upstream perspective allowing the admin to avoid the
BUG() is an unqualified improvement.

> Compare this to the list traversal bug, which clearly indicates data corruption.

That depends on what is considered "data corruption".  In this case,
CONFIG_BUG_ON_DATA_CORRUPTION is about kernel memory structures being
corrupted; it's not referring to generic corruption of user data.

  Select this option if the kernel should BUG when it encounters
  data corruption in kernel memory structures when they get checked
  for validity.

And IMO that definition fits KVM's behavior perfectly.  get_mmio_spte() detects
corrupted SPTEs, pte_list_remove() detects corrupted RMAP lists.
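
For reference, CHECK_DATA_CORRUPTION() in include/linux/bug.h is roughly the
following (paraphrased; details vary by kernel version), and it is the
kernel-wide pattern that the proposed KVM_BUG_ON_DATA_CORRUPTION() mirrors:

#define CHECK_DATA_CORRUPTION(condition, fmt, ...)			 \
	({								 \
		bool corruption = unlikely(condition);			 \
		if (corruption) {					 \
			if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION)) { \
				/* Crash the host on corruption. */	 \
				pr_err(fmt, ##__VA_ARGS__);		 \
				BUG();					 \
			} else						 \
				/* Warn and let the caller recover. */	 \
				WARN(1, fmt, ##__VA_ARGS__);		 \
		}							 \
		corruption;						 \
	})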

If "data corruption" is interpeted to be corruption of user data than the list
behavior is wildly overstepping.  E.g. if a subsystem has a bug where an object
is added to two different lists, list_add() will yell about the issue but because
all writes to the list_head are aborted with CONFIG_DEBUG_LIST=y, it's entirely
possible for the bug to be benign or limited to the subsystem.

A somewhat contrived example would be if a (complex) allocator and its caller both
add the object to the same list; the second list_add() is problematic and would
trigger BUG() with CONFIG_BUG_ON_DATA_CORRUPTION=y, but with CONFIG_DEBUG_LIST=y
the bug would be completely benign.
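
To make the list_add() behavior concrete, lib/list_debug.c does roughly the
following (paraphrased; exact messages differ), and list_add() skips the
write entirely when validation fails:

bool __list_add_valid(struct list_head *new, struct list_head *prev,
		      struct list_head *next)
{
	/* Each check BUG()s or WARNs per CHECK_DATA_CORRUPTION above. */
	if (CHECK_DATA_CORRUPTION(next->prev != prev,
				  "list_add corruption: next->prev != prev\n") ||
	    CHECK_DATA_CORRUPTION(prev->next != next,
				  "list_add corruption: prev->next != next\n") ||
	    CHECK_DATA_CORRUPTION(new == prev || new == next,
				  "list_add double add\n"))
		return false;	/* caller aborts the list_add() */

	return true;
}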

> So, alternatively, I think it would be reasonable to shut down the KVM
> instance if we see things messed up in MMIO SPTEs as well, but please do
> not crash the host for that.

Again, this doesn't unconditionally bring down the host.  Google is _choosing_
to BUG() on _potential_ data corruption.  I'm open to not tying KVM's behavior to
CONFIG_BUG_ON_DATA_CORRUPTION if there's a good argument for doing so, but if we
(Google) plan on turning on the BUG() behavior for KVM then I don't see the point.

In other words, we should discuss internally whether or not we want/need to push
for a separate control, but from an upstream perspective I think the proposed
behavior is entirely reasonable.

