linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2
@ 2022-12-13  3:30 Sean Christopherson
  2022-12-13  3:30 ` [PATCH 1/5] KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozen Sean Christopherson
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Sean Christopherson @ 2022-12-13  3:30 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Robert Hoo, Greg Thelen, David Matlack,
	Ben Gardon, Mingwei Zhang

Fix three fatal TDP MMU bugs introduced in 6.2, harden related code,
and clean up kvm_tdp_mmu_map() to eliminate the need for gotos.

Sean Christopherson (5):
  KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is
    frozen
  KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached
  KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is
    disallowed
  KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level
  KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its caller

 arch/x86/kvm/mmu/mmu.c          |  9 +++++++-
 arch/x86/kvm/mmu/mmu_internal.h |  1 -
 arch/x86/kvm/mmu/tdp_mmu.c      | 39 +++++++++++++++------------------
 3 files changed, 26 insertions(+), 23 deletions(-)


base-commit: 51229fd7872f82af07498aef5c79ad51baf81ea0
-- 
2.39.0.rc1.256.g54fd8350bd-goog



* [PATCH 1/5] KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozen
  2022-12-13  3:30 [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2 Sean Christopherson
@ 2022-12-13  3:30 ` Sean Christopherson
  2022-12-14 11:57   ` Robert Hoo
  2022-12-13  3:30 ` [PATCH 2/5] KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached Sean Christopherson
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2022-12-13  3:30 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Robert Hoo, Greg Thelen, David Matlack,
	Ben Gardon, Mingwei Zhang

Hoist the is_removed_spte() check above the "level == goal_level" check
when walking SPTEs during a TDP MMU page fault to avoid attempting to map
a leaf entry if said entry is frozen by a different task/vCPU.

  ------------[ cut here ]------------
  WARNING: CPU: 3 PID: 939 at arch/x86/kvm/mmu/tdp_mmu.c:653 kvm_tdp_mmu_map+0x269/0x4b0
  Modules linked in: kvm_intel
  CPU: 3 PID: 939 Comm: nx_huge_pages_t Not tainted 6.1.0-rc4+ #67
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_tdp_mmu_map+0x269/0x4b0
  RSP: 0018:ffffc9000068fba8 EFLAGS: 00010246
  RAX: 00000000000005a0 RBX: ffffc9000068fcc0 RCX: 0000000000000005
  RDX: ffff88810741f000 RSI: ffff888107f04600 RDI: ffffc900006a3000
  RBP: 060000010b000bf3 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 000ffffffffff000 R12: 0000000000000005
  R13: ffff888113670000 R14: ffff888107464958 R15: 0000000000000000
  FS:  00007f01c942c740(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000000117013006 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   kvm_tdp_page_fault+0x10c/0x130
   kvm_mmu_page_fault+0x103/0x680
   vmx_handle_exit+0x132/0x5a0 [kvm_intel]
   vcpu_enter_guest+0x60c/0x16f0
   kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
   kvm_vcpu_ioctl+0x271/0x660
   __x64_sys_ioctl+0x80/0xb0
   do_syscall_64+0x2b/0x50
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
   </TASK>
  ---[ end trace 0000000000000000 ]---

Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
Cc: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 764f7c87286f..b740f38fedcc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1162,9 +1162,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (fault->nx_huge_page_workaround_enabled)
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
-		if (iter.level == fault->goal_level)
-			break;
-
 		/*
 		 * If SPTE has been frozen by another thread, just give up and
 		 * retry, avoiding unnecessary page table allocation and free.
@@ -1172,6 +1169,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (is_removed_spte(iter.old_spte))
 			goto retry;
 
+		if (iter.level == fault->goal_level)
+			break;
+
 		/* Step down into the lower level page table if it exists. */
 		if (is_shadow_present_pte(iter.old_spte) &&
 		    !is_large_pte(iter.old_spte))
-- 
2.39.0.rc1.256.g54fd8350bd-goog



* [PATCH 2/5] KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached
  2022-12-13  3:30 [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2 Sean Christopherson
  2022-12-13  3:30 ` [PATCH 1/5] KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozen Sean Christopherson
@ 2022-12-13  3:30 ` Sean Christopherson
  2022-12-13  3:30 ` [PATCH 3/5] KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed Sean Christopherson
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2022-12-13  3:30 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Robert Hoo, Greg Thelen, David Matlack,
	Ben Gardon, Mingwei Zhang

Map the leaf SPTE when handling a TDP MMU page fault if and only if the
target level is reached.  A recent commit reworked the retry logic and
incorrectly assumed that walking SPTEs would never "fail", as the loop
either bails (retries) or installs parent SPs.  However, the iterator
itself will bail early if it detects a frozen (REMOVED) SPTE when
stepping down.   The TDP iterator also rereads the current SPTE before
stepping down specifically to avoid walking into a part of the tree that
is being removed, which means it's possible to terminate the loop without
the guts of the loop observing the frozen SPTE, e.g. if a different task
zaps a parent SPTE between the initial read and try_step_down()'s refresh.

Mapping a leaf SPTE at the wrong level results in all kinds of badness as
page table walkers interpret the SPTE as a page table, not a leaf, and
walk into the weeds.
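
To illustrate the pattern this patch enforces, below is a minimal standalone
userspace sketch (hypothetical names, not KVM or kernel code): a walk loop can
terminate without reaching its goal, so the code after the loop must confirm
the target level was actually reached before installing anything, and
otherwise retry.

#include <stdio.h>

enum { RET_RETRY, RET_FIXED };

#define FROZEN	(-1)	/* stand-in for a REMOVED/frozen entry */
#define GOAL	1	/* stand-in for fault->goal_level */

/* Walk a sequence of levels; a leaf may only be installed once GOAL is
 * actually reached.  Falling out of the loop (the analogue of the
 * iterator bailing early) must result in a retry, never a map. */
static int walk_and_map(const int *levels, int nr)
{
	int i, level = 0;

	for (i = 0; i < nr; i++) {
		level = levels[i];

		if (level == FROZEN)
			return RET_RETRY;	/* explicit bail */

		if (level == GOAL)
			goto map_target_level;	/* the only path that maps */
	}

	/* Fell out of the loop: the target level was never reached. */
	return RET_RETRY;

map_target_level:
	printf("mapping leaf at level %d\n", level);
	return RET_FIXED;
}

int main(void)
{
	const int full_walk[]  = { 4, 3, 2, 1 };
	const int short_walk[] = { 4, 3 };	/* ends before reaching GOAL */

	printf("full walk:  %s\n",
	       walk_and_map(full_walk, 4) == RET_FIXED ? "fixed" : "retry");
	printf("short walk: %s\n",
	       walk_and_map(short_walk, 2) == RET_FIXED ? "fixed" : "retry");
	return 0;
}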

  ------------[ cut here ]------------
  WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510
  Modules linked in: kvm_intel
  CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_tdp_mmu_map+0x481/0x510
  RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286
  RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027
  RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
  RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48
  R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0
  R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90
  FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   kvm_tdp_page_fault+0x10c/0x130
   kvm_mmu_page_fault+0x103/0x680
   vmx_handle_exit+0x132/0x5a0 [kvm_intel]
   vcpu_enter_guest+0x60c/0x16f0
   kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
   kvm_vcpu_ioctl+0x271/0x660
   __x64_sys_ioctl+0x80/0xb0
   do_syscall_64+0x2b/0x50
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
   </TASK>
  ---[ end trace 0000000000000000 ]---
  Invalid SPTE change: cannot replace a present leaf
  SPTE with another present leaf SPTE mapping a
  different PFN!
  as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2
  ------------[ cut here ]------------
  kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
  invalid opcode: 0000 [#1] SMP
  CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
  RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
  RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
  RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
  RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
  R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
  R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
  FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   kvm_tdp_mmu_map+0x3b0/0x510
   kvm_tdp_page_fault+0x10c/0x130
   kvm_mmu_page_fault+0x103/0x680
   vmx_handle_exit+0x132/0x5a0 [kvm_intel]
   vcpu_enter_guest+0x60c/0x16f0
   kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
   kvm_vcpu_ioctl+0x271/0x660
   __x64_sys_ioctl+0x80/0xb0
   do_syscall_64+0x2b/0x50
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
   </TASK>
  Modules linked in: kvm_intel
  ---[ end trace 0000000000000000 ]---

Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
Cc: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b740f38fedcc..e2e197d41780 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1170,7 +1170,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			goto retry;
 
 		if (iter.level == fault->goal_level)
-			break;
+			goto map_target_level;
 
 		/* Step down into the lower level page table if it exists. */
 		if (is_shadow_present_pte(iter.old_spte) &&
@@ -1192,8 +1192,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
 
 		/*
-		 * Also force the guest to retry the access if the upper level SPTEs
-		 * aren't in place.
+		 * Force the guest to retry if installing an upper level SPTE
+		 * failed, e.g. because a different task modified the SPTE.
 		 */
 		if (r) {
 			tdp_mmu_free_sp(sp);
@@ -1208,6 +1208,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		}
 	}
 
+	/*
+	 * The walk aborted before reaching the target level, e.g. because the
+	 * iterator detected an upper level SPTE was frozen during traversal.
+	 */
+	WARN_ON_ONCE(iter.level == fault->goal_level);
+	goto retry;
+
+map_target_level:
 	ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
 
 retry:
-- 
2.39.0.rc1.256.g54fd8350bd-goog



* [PATCH 3/5] KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed
  2022-12-13  3:30 [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2 Sean Christopherson
  2022-12-13  3:30 ` [PATCH 1/5] KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozen Sean Christopherson
  2022-12-13  3:30 ` [PATCH 2/5] KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached Sean Christopherson
@ 2022-12-13  3:30 ` Sean Christopherson
  2022-12-14 11:58   ` Robert Hoo
  2022-12-13  3:30 ` [PATCH 4/5] KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level Sean Christopherson
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2022-12-13  3:30 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Robert Hoo, Greg Thelen, David Matlack,
	Ben Gardon, Mingwei Zhang

Re-check sp->nx_huge_page_disallowed under the tdp_mmu_pages_lock spinlock
when adding a new shadow page in the TDP MMU.  To ensure the NX reclaim
kthread can't see a not-yet-linked shadow page, the page fault path links
the new page table prior to adding the page to possible_nx_huge_pages.

If the page is zapped by a different task, e.g. because dirty logging is
disabled, between linking the page and adding it to the list, KVM can end
up triggering use-after-free by adding the zapped SP to the aforementioned
list, as the zapped SP's memory is scheduled for removal via RCU callback.
The bug is detected by the sanity checks guarded by CONFIG_DEBUG_LIST=y,
i.e. the below splat is just one possible signature.
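
The fix is the classic re-check-under-lock pattern.  As a rough standalone
userspace sketch of that pattern (hypothetical names, a pthread mutex standing
in for tdp_mmu_pages_lock, not KVM code):

#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>

struct object {
	bool needs_tracking;	/* cleared by the "zap" side */
	struct object *next;	/* link into the shared list */
};

static pthread_mutex_t pages_lock = PTHREAD_MUTEX_INITIALIZER;
static struct object *tracked_list;

/* Zap side: invalidate the object under the lock. */
static void zap(struct object *obj)
{
	pthread_mutex_lock(&pages_lock);
	obj->needs_tracking = false;
	pthread_mutex_unlock(&pages_lock);
}

/* Fault path: re-check the flag *after* taking the lock, so an object
 * that was zapped between the earlier, unlocked decision and this point
 * is never added to the list. */
static void track(struct object *obj)
{
	pthread_mutex_lock(&pages_lock);
	if (obj->needs_tracking) {
		obj->next = tracked_list;
		tracked_list = obj;
	}
	pthread_mutex_unlock(&pages_lock);
}

int main(void)
{
	struct object a = { .needs_tracking = true };

	zap(&a);	/* in the real code this races with track() */
	track(&a);	/* the re-check sees the zap; nothing is added */

	printf("list is %s\n", tracked_list ? "non-empty" : "empty");
	return 0;
}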

  ------------[ cut here ]------------
  list_add corruption. prev->next should be next (ffffc9000071fa70), but was ffff88811125ee38. (prev=ffff88811125ee38).
  WARNING: CPU: 1 PID: 953 at lib/list_debug.c:30 __list_add_valid+0x79/0xa0
  Modules linked in: kvm_intel
  CPU: 1 PID: 953 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #71
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:__list_add_valid+0x79/0xa0
  RSP: 0018:ffffc900006efb68 EFLAGS: 00010286
  RAX: 0000000000000000 RBX: ffff888116cae8a0 RCX: 0000000000000027
  RDX: 0000000000000027 RSI: 0000000100001872 RDI: ffff888277c5b4c8
  RBP: ffffc90000717000 R08: ffff888277c5b4c0 R09: ffffc900006efa08
  R10: 0000000000199998 R11: 0000000000199a20 R12: ffff888116cae930
  R13: ffff88811125ee38 R14: ffffc9000071fa70 R15: ffff88810b794f90
  FS:  00007fc0415d2740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000000115201006 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   track_possible_nx_huge_page+0x53/0x80
   kvm_tdp_mmu_map+0x242/0x2c0
   kvm_tdp_page_fault+0x10c/0x130
   kvm_mmu_page_fault+0x103/0x680
   vmx_handle_exit+0x132/0x5a0 [kvm_intel]
   vcpu_enter_guest+0x60c/0x16f0
   kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
   kvm_vcpu_ioctl+0x271/0x660
   __x64_sys_ioctl+0x80/0xb0
   do_syscall_64+0x2b/0x50
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
   </TASK>
  ---[ end trace 0000000000000000 ]---

Fixes: 61f94478547b ("KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE")
Reported-by: Greg Thelen <gthelen@google.com>
Analyzed-by: David Matlack <dmatlack@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e2e197d41780..fd4ae99790d7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1203,7 +1203,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (fault->huge_page_disallowed &&
 		    fault->req_level >= iter.level) {
 			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
-			track_possible_nx_huge_page(kvm, sp);
+			if (sp->nx_huge_page_disallowed)
+				track_possible_nx_huge_page(kvm, sp);
 			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 		}
 	}
-- 
2.39.0.rc1.256.g54fd8350bd-goog



* [PATCH 4/5] KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level
  2022-12-13  3:30 [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2 Sean Christopherson
                   ` (2 preceding siblings ...)
  2022-12-13  3:30 ` [PATCH 3/5] KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed Sean Christopherson
@ 2022-12-13  3:30 ` Sean Christopherson
  2022-12-13 17:59   ` David Matlack
  2022-12-13  3:30 ` [PATCH 5/5] KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its caller Sean Christopherson
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2022-12-13  3:30 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Robert Hoo, Greg Thelen, David Matlack,
	Ben Gardon, Mingwei Zhang

Don't install a leaf TDP MMU SPTE if the parent page's level doesn't
match the target level of the fault, and instead have the vCPU retry the
faulting instruction after warning.  Continuing on is completely
unnecessary as the absolute worst case scenario of retrying is DoSing
the vCPU, whereas continuing on all but guarantees bigger explosions, e.g.

  ------------[ cut here ]------------
  kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
  invalid opcode: 0000 [#1] SMP
  CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
  RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
  RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
  RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
  RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
  R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
  R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
  FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   kvm_tdp_mmu_map+0x3b0/0x510
   kvm_tdp_page_fault+0x10c/0x130
   kvm_mmu_page_fault+0x103/0x680
   vmx_handle_exit+0x132/0x5a0 [kvm_intel]
   vcpu_enter_guest+0x60c/0x16f0
   kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
   kvm_vcpu_ioctl+0x271/0x660
   __x64_sys_ioctl+0x80/0xb0
   do_syscall_64+0x2b/0x50
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
   </TASK>
  Modules linked in: kvm_intel
  ---[ end trace 0000000000000000 ]---

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index fd4ae99790d7..cc1fb9a65620 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1063,7 +1063,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	int ret = RET_PF_FIXED;
 	bool wrprot = false;
 
-	WARN_ON(sp->role.level != fault->goal_level);
+	if (WARN_ON_ONCE(sp->role.level != fault->goal_level))
+		return RET_PF_RETRY;
+
 	if (unlikely(!fault->slot))
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
 	else
-- 
2.39.0.rc1.256.g54fd8350bd-goog



* [PATCH 5/5] KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its caller
  2022-12-13  3:30 [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2 Sean Christopherson
                   ` (3 preceding siblings ...)
  2022-12-13  3:30 ` [PATCH 4/5] KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level Sean Christopherson
@ 2022-12-13  3:30 ` Sean Christopherson
  2022-12-20 17:53   ` David Matlack
  2022-12-14 12:01 ` [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2 Robert Hoo
  2022-12-23 17:32 ` Paolo Bonzini
  6 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2022-12-13  3:30 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Robert Hoo, Greg Thelen, David Matlack,
	Ben Gardon, Mingwei Zhang

Move the hugepage adjust, tracepoint, and RCU (un)lock logic out of
kvm_tdp_mmu_map() and into its sole caller, kvm_tdp_mmu_page_fault(), to
eliminate the gotos used to bounce through rcu_read_unlock() when bailing
from the walk.
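
As a rough sketch of the resulting shape (a pthread mutex standing in for
rcu_read_lock()/rcu_read_unlock(), names hypothetical, not KVM code): the
caller brackets the call with the lock, so the callee can bail with plain
returns instead of "goto retry" paths that exist only to reach the unlock.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t walk_lock = PTHREAD_MUTEX_INITIALIZER;

static int do_map(int frozen)
{
	if (frozen)
		return -1;	/* direct return, no unwind label needed */
	return 0;
}

static int page_fault(int frozen)
{
	int r;

	pthread_mutex_lock(&walk_lock);		/* prolog lives in the caller ... */
	r = do_map(frozen);
	pthread_mutex_unlock(&walk_lock);	/* ... and so does the epilog */
	return r;
}

int main(void)
{
	printf("normal: %d, frozen: %d\n", page_fault(0), page_fault(1));
	return 0;
}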

Opportunistically mark kvm_mmu_hugepage_adjust() as static as
kvm_tdp_mmu_map() was the only external user.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c          |  9 ++++++++-
 arch/x86/kvm/mmu/mmu_internal.h |  1 -
 arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++------------------
 3 files changed, 12 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 254bc46234e0..99c40617d325 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3085,7 +3085,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 	return min(host_level, max_level);
 }
 
-void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu,
+				    struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	kvm_pfn_t mask;
@@ -4405,7 +4406,13 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 	if (is_page_fault_stale(vcpu, fault))
 		goto out_unlock;
 
+	kvm_mmu_hugepage_adjust(vcpu, fault);
+
+	trace_kvm_mmu_spte_requested(fault);
+
+	rcu_read_lock();
 	r = kvm_tdp_mmu_map(vcpu, fault);
+	rcu_read_unlock();
 
 out_unlock:
 	read_unlock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index ac00bfbf32f6..66c294d67641 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -317,7 +317,6 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			      const struct kvm_memory_slot *slot, gfn_t gfn,
 			      int max_level);
-void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
 
 void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cc1fb9a65620..78f47eb74544 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1150,13 +1150,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
-	int ret = RET_PF_RETRY;
-
-	kvm_mmu_hugepage_adjust(vcpu, fault);
-
-	trace_kvm_mmu_spte_requested(fault);
-
-	rcu_read_lock();
 
 	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
 		int r;
@@ -1169,10 +1162,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * retry, avoiding unnecessary page table allocation and free.
 		 */
 		if (is_removed_spte(iter.old_spte))
-			goto retry;
+			return RET_PF_RETRY;
 
 		if (iter.level == fault->goal_level)
-			goto map_target_level;
+			return tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
 
 		/* Step down into the lower level page table if it exists. */
 		if (is_shadow_present_pte(iter.old_spte) &&
@@ -1199,7 +1192,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 */
 		if (r) {
 			tdp_mmu_free_sp(sp);
-			goto retry;
+			return RET_PF_RETRY;
 		}
 
 		if (fault->huge_page_disallowed &&
@@ -1216,14 +1209,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	 * iterator detected an upper level SPTE was frozen during traversal.
 	 */
 	WARN_ON_ONCE(iter.level == fault->goal_level);
-	goto retry;
-
-map_target_level:
-	ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
-
-retry:
-	rcu_read_unlock();
-	return ret;
+	return RET_PF_RETRY;
 }
 
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
-- 
2.39.0.rc1.256.g54fd8350bd-goog



* Re: [PATCH 4/5] KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level
  2022-12-13  3:30 ` [PATCH 4/5] KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level Sean Christopherson
@ 2022-12-13 17:59   ` David Matlack
  2022-12-13 18:15     ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: David Matlack @ 2022-12-13 17:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Robert Hoo, Greg Thelen,
	Ben Gardon, Mingwei Zhang

On Mon, Dec 12, 2022 at 7:30 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Don't install a leaf TDP MMU SPTE if the parent page's level doesn't
> match the target level of the fault, and instead have the vCPU retry the
> faulting instruction after warning.  Continuing on is completely
> unnecessary as the absolute worst case scenario of retrying is DoSing
> the vCPU, whereas continuing on all but guarantees bigger explosions, e.g.

Would it make sense to kill the VM instead via KVM_BUG()?


* Re: [PATCH 4/5] KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level
  2022-12-13 17:59   ` David Matlack
@ 2022-12-13 18:15     ` Sean Christopherson
  2022-12-20 18:24       ` David Matlack
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2022-12-13 18:15 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, linux-kernel, Robert Hoo, Greg Thelen,
	Ben Gardon, Mingwei Zhang

On Tue, Dec 13, 2022, David Matlack wrote:
> On Mon, Dec 12, 2022 at 7:30 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Don't install a leaf TDP MMU SPTE if the parent page's level doesn't
> > match the target level of the fault, and instead have the vCPU retry the
> > faulting instruction after warning.  Continuing on is completely
> > unnecessary as the absolute worst case scenario of retrying is DoSing
> > the vCPU, whereas continuing on all but guarantees bigger explosions, e.g.
> 
> Would it make sense to kill the VM instead via KVM_BUG()?

No, because if a bug that hits this escapes to a release, odds are quite high that
retrying will succeed.  E.g. the fix earlier in this series is for a rare corner
case that I was able to hit consistently only by hacking KVM to effectively
synchronize the page fault and zap.  Other than an extra page fault, no harm has
been done to the guest, e.g. there's no need to kill the VM to protect it from
data corruption.


* Re: [PATCH 1/5] KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozen
  2022-12-13  3:30 ` [PATCH 1/5] KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozen Sean Christopherson
@ 2022-12-14 11:57   ` Robert Hoo
  0 siblings, 0 replies; 21+ messages in thread
From: Robert Hoo @ 2022-12-14 11:57 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Greg Thelen, David Matlack, Ben Gardon, Mingwei Zhang

On Tue, 2022-12-13 at 03:30 +0000, Sean Christopherson wrote:
> Hoist the is_removed_spte() check above the "level == goal_level"
> check
> when walking SPTEs during a TDP MMU page fault to avoid attempting to
> map
> a leaf entry if said entry is frozen by a different task/vCPU.
> 
>   ------------[ cut here ]------------
>   WARNING: CPU: 3 PID: 939 at arch/x86/kvm/mmu/tdp_mmu.c:653
> kvm_tdp_mmu_map+0x269/0x4b0
>   Modules linked in: kvm_intel
>   CPU: 3 PID: 939 Comm: nx_huge_pages_t Not tainted 6.1.0-rc4+ #67
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0
> 02/06/2015
>   RIP: 0010:kvm_tdp_mmu_map+0x269/0x4b0
>   RSP: 0018:ffffc9000068fba8 EFLAGS: 00010246
>   RAX: 00000000000005a0 RBX: ffffc9000068fcc0 RCX: 0000000000000005
>   RDX: ffff88810741f000 RSI: ffff888107f04600 RDI: ffffc900006a3000
>   RBP: 060000010b000bf3 R08: 0000000000000000 R09: 0000000000000000
>   R10: 0000000000000000 R11: 000ffffffffff000 R12: 0000000000000005
>   R13: ffff888113670000 R14: ffff888107464958 R15: 0000000000000000
>   FS:  00007f01c942c740(0000) GS:ffff888277cc0000(0000)
> knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 0000000000000000 CR3: 0000000117013006 CR4: 0000000000172ea0
>   Call Trace:
>    <TASK>
>    kvm_tdp_page_fault+0x10c/0x130
>    kvm_mmu_page_fault+0x103/0x680
>    vmx_handle_exit+0x132/0x5a0 [kvm_intel]
>    vcpu_enter_guest+0x60c/0x16f0
>    kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
>    kvm_vcpu_ioctl+0x271/0x660
>    __x64_sys_ioctl+0x80/0xb0
>    do_syscall_64+0x2b/0x50
>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>    </TASK>
>   ---[ end trace 0000000000000000 ]---
> 
> Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow
> when guest has to retry")
> Cc: Robert Hoo <robert.hu@linux.intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 764f7c87286f..b740f38fedcc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1162,9 +1162,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
>  		if (fault->nx_huge_page_workaround_enabled)
>  			disallowed_hugepage_adjust(fault,
> iter.old_spte, iter.level);
>  
> -		if (iter.level == fault->goal_level)
> -			break;
> -
>  		/*
>  		 * If SPTE has been frozen by another thread, just give
> up and
>  		 * retry, avoiding unnecessary page table allocation
> and free.
> @@ -1172,6 +1169,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
>  		if (is_removed_spte(iter.old_spte))
>  			goto retry;
>  
> +		if (iter.level == fault->goal_level)
> +			break;
> +
>  		/* Step down into the lower level page table if it
> exists. */
>  		if (is_shadow_present_pte(iter.old_spte) &&
>  		    !is_large_pte(iter.old_spte))

Reviewed-by: Robert Hoo <robert.hu@linux.intel.com>



* Re: [PATCH 3/5] KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed
  2022-12-13  3:30 ` [PATCH 3/5] KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed Sean Christopherson
@ 2022-12-14 11:58   ` Robert Hoo
  2022-12-15  0:11     ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Robert Hoo @ 2022-12-14 11:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Greg Thelen, David Matlack, Ben Gardon, Mingwei Zhang

On Tue, 2022-12-13 at 03:30 +0000, Sean Christopherson wrote:
> Re-check sp->nx_huge_page_disallowed under the tdp_mmu_pages_lock
> spinlock
> when adding a new shadow page in the TDP MMU.  To ensure the NX
> reclaim
> kthread can't see a not-yet-linked shadow page, the page fault path
> links
> the new page table prior to adding the page to
> possible_nx_huge_pages.
> 
> If the page is zapped by a different task, e.g. because dirty logging
> is
> disabled, between linking the page and adding it to the list, KVM can
> end
> up triggering use-after-free by adding the zapped SP to the
> aforementioned
> list, as the zapped SP's memory is scheduled for removal via RCU
> callback.
> The bug is detected by the sanity checks guarded by
> CONFIG_DEBUG_LIST=y,
> i.e. the below splat is just one possible signature.
> 
>   ------------[ cut here ]------------
>   list_add corruption. prev->next should be next (ffffc9000071fa70),
> but was ffff88811125ee38. (prev=ffff88811125ee38).
>   WARNING: CPU: 1 PID: 953 at lib/list_debug.c:30
> __list_add_valid+0x79/0xa0
>   Modules linked in: kvm_intel
>   CPU: 1 PID: 953 Comm: nx_huge_pages_t Tainted:
> G        W          6.1.0-rc4+ #71
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0
> 02/06/2015
>   RIP: 0010:__list_add_valid+0x79/0xa0
>   RSP: 0018:ffffc900006efb68 EFLAGS: 00010286
>   RAX: 0000000000000000 RBX: ffff888116cae8a0 RCX: 0000000000000027
>   RDX: 0000000000000027 RSI: 0000000100001872 RDI: ffff888277c5b4c8
>   RBP: ffffc90000717000 R08: ffff888277c5b4c0 R09: ffffc900006efa08
>   R10: 0000000000199998 R11: 0000000000199a20 R12: ffff888116cae930
>   R13: ffff88811125ee38 R14: ffffc9000071fa70 R15: ffff88810b794f90
>   FS:  00007fc0415d2740(0000) GS:ffff888277c40000(0000)
> knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 0000000000000000 CR3: 0000000115201006 CR4: 0000000000172ea0
>   Call Trace:
>    <TASK>
>    track_possible_nx_huge_page+0x53/0x80
>    kvm_tdp_mmu_map+0x242/0x2c0
>    kvm_tdp_page_fault+0x10c/0x130
>    kvm_mmu_page_fault+0x103/0x680
>    vmx_handle_exit+0x132/0x5a0 [kvm_intel]
>    vcpu_enter_guest+0x60c/0x16f0
>    kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
>    kvm_vcpu_ioctl+0x271/0x660
>    __x64_sys_ioctl+0x80/0xb0
>    do_syscall_64+0x2b/0x50
>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>    </TASK>
>   ---[ end trace 0000000000000000 ]---
> 
> Fixes: 61f94478547b ("KVM: x86/mmu: Set disallowed_nx_huge_page in
> TDP MMU before setting SPTE")
> Reported-by: Greg Thelen <gthelen@google.com>
> Analyzed-by: David Matlack <dmatlack@google.com>
> Cc: David Matlack <dmatlack@google.com>
> Cc: Ben Gardon <bgardon@google.com>
> Cc: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index e2e197d41780..fd4ae99790d7 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1203,7 +1203,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
>  		if (fault->huge_page_disallowed &&
>  		    fault->req_level >= iter.level) {
>  			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> -			track_possible_nx_huge_page(kvm, sp);
> +			if (sp->nx_huge_page_disallowed)
> +				track_possible_nx_huge_page(kvm, sp);
>  			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>  		}
>  	}

Is this possible?
The aforementioned situation happened, i.e. before the above hunk calls
track_possible_nx_huge_page(), the sp is zapped by some other task,
tdp_mmu_unlink_sp() --> untrack_possible_nx_huge_page(kvm, sp):

--kvm->stat.nx_lpage_splits;

But it looks like the stat for this sp hasn't been increased yet.



* Re: [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2
  2022-12-13  3:30 [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2 Sean Christopherson
                   ` (4 preceding siblings ...)
  2022-12-13  3:30 ` [PATCH 5/5] KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its caller Sean Christopherson
@ 2022-12-14 12:01 ` Robert Hoo
  2022-12-14 15:48   ` Sean Christopherson
  2022-12-23 17:32 ` Paolo Bonzini
  6 siblings, 1 reply; 21+ messages in thread
From: Robert Hoo @ 2022-12-14 12:01 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Greg Thelen, David Matlack, Ben Gardon, Mingwei Zhang

On Tue, 2022-12-13 at 03:30 +0000, Sean Christopherson wrote:
> Fix three fatal TDP MMU bugs introduced in 6.2,

introduced in 6.1? or earlier?

>  harden related code,
> and clean up kvm_tdp_mmu_map() to eliminate the need for gotos.
> 
> Sean Christopherson (5):
>   KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is
>     frozen
>   KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached
>   KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is
>     disallowed
>   KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level
>   KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its
> caller
> 
>  arch/x86/kvm/mmu/mmu.c          |  9 +++++++-
>  arch/x86/kvm/mmu/mmu_internal.h |  1 -
>  arch/x86/kvm/mmu/tdp_mmu.c      | 39 +++++++++++++++--------------
> ----
>  3 files changed, 26 insertions(+), 23 deletions(-)
> 
> 
> base-commit: 51229fd7872f82af07498aef5c79ad51baf81ea0

I cannot find this base commit in my tree, where I just pulled to
the latest queue yesterday. But I find this series can be applied to this
latest queue as well.

commit 9d75a3251adfbcf444681474511b58042a364863 (origin/queue, queue)
Author: Sean Christopherson <seanjc@google.com>
Date:   Wed Dec 7 00:09:59 2022 +0000

    KVM: x86: Add proper ReST tables for userspace MSR exits/flags



* Re: [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2
  2022-12-14 12:01 ` [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2 Robert Hoo
@ 2022-12-14 15:48   ` Sean Christopherson
  0 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2022-12-14 15:48 UTC (permalink / raw)
  To: Robert Hoo
  Cc: Paolo Bonzini, kvm, linux-kernel, Greg Thelen, David Matlack,
	Ben Gardon, Mingwei Zhang

On Wed, Dec 14, 2022, Robert Hoo wrote:
> On Tue, 2022-12-13 at 03:30 +0000, Sean Christopherson wrote:
> > Fix three fatal TDP MMU bugs introduced in 6.2,
> 
> introduced in 6.1? or earlier?

6.2, or more precisely, code sitting in kvm/next that will hopefully become part
of 6.2-rc1.

> >  harden related code,
> > and clean up kvm_tdp_mmu_map() to eliminate the need for gotos.
> > 
> > Sean Christopherson (5):
> >   KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is
> >     frozen
> >   KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached
> >   KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is
> >     disallowed
> >   KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level
> >   KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its
> > caller
> > 
> >  arch/x86/kvm/mmu/mmu.c          |  9 +++++++-
> >  arch/x86/kvm/mmu/mmu_internal.h |  1 -
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 39 +++++++++++++++--------------
> > ----
> >  3 files changed, 26 insertions(+), 23 deletions(-)
> > 
> > 
> > base-commit: 51229fd7872f82af07498aef5c79ad51baf81ea0
> 
> I cannot find this base commit in my tree, where I just pulled to
> the latest queue yesterday. But I find this series can be applied to this
> latest queue as well.

Ya, I have an extra commit in my local repo sitting on top of kvm/queue so that
my standard builds don't fail.

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index cc3e8c7d0850..2c7f2a26421e 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -898,6 +898,7 @@ bool kvm_hv_assist_page_enabled(struct kvm_vcpu *vcpu)
                return false;
        return vcpu->arch.pv_eoi.msr_val & KVM_MSR_ENABLED;
 }
+EXPORT_SYMBOL_GPL(kvm_hv_assist_page_enabled);
 
 int kvm_hv_get_assist_page(struct kvm_vcpu *vcpu)
 {


* Re: [PATCH 3/5] KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed
  2022-12-14 11:58   ` Robert Hoo
@ 2022-12-15  0:11     ` Sean Christopherson
  2022-12-15  6:26       ` Robert Hoo
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2022-12-15  0:11 UTC (permalink / raw)
  To: Robert Hoo
  Cc: Paolo Bonzini, kvm, linux-kernel, Greg Thelen, David Matlack,
	Ben Gardon, Mingwei Zhang

On Wed, Dec 14, 2022, Robert Hoo wrote:
> On Tue, 2022-12-13 at 03:30 +0000, Sean Christopherson wrote:
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index e2e197d41780..fd4ae99790d7 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1203,7 +1203,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu,
> > struct kvm_page_fault *fault)
> >  		if (fault->huge_page_disallowed &&
> >  		    fault->req_level >= iter.level) {
> >  			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > -			track_possible_nx_huge_page(kvm, sp);
> > +			if (sp->nx_huge_page_disallowed)
> > +				track_possible_nx_huge_page(kvm, sp);
> >  			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> >  		}
> >  	}
> 
> Is this possible?
> The aforementioned situation happened, i.e. before the above hunk calls
> track_possible_nx_huge_page(), the sp is zapped by some other task,
> tdp_mmu_unlink_sp() --> untrack_possible_nx_huge_page(kvm, sp):

It's possible for untrack_possible_nx_huge_page() to be called before the above
snippet, but the stat won't be decremented in that case since the page won't be on
the list of possible NX huge pages, i.e. list_empty() will be true.

  void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
  {
	if (list_empty(&sp->possible_nx_huge_page_link))
		return;

	--kvm->stat.nx_lpage_splits;

And by not calling track_possible_nx_huge_page() (this bug fix), nx_lpage_splits
won't be incorrectly incremented.
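
For reference, the same invariant can be seen with a minimal userspace
reimplementation of the list helpers (not <linux/list.h> itself, just the same
self-pointing-node invariant): list_empty() on the SP's link field is false
only while the SP is actually on the list.

#include <stdio.h>
#include <stdbool.h>

struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

int main(void)
{
	struct list_head possible_nx_huge_pages, link;

	INIT_LIST_HEAD(&possible_nx_huge_pages);
	INIT_LIST_HEAD(&link);

	printf("before track:  %d\n", list_empty(&link));  /* 1 => untrack bails */
	list_add(&link, &possible_nx_huge_pages);
	printf("while tracked: %d\n", list_empty(&link));  /* 0 => untrack decrements */
	list_del_init(&link);
	printf("after untrack: %d\n", list_empty(&link));  /* 1 again */
	return 0;
}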

> 
> --kvm->stat.nx_lpage_splits;
> 
> But it looks like the stat for this sp hasn't been increased yet.
> 


* Re: [PATCH 3/5] KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed
  2022-12-15  0:11     ` Sean Christopherson
@ 2022-12-15  6:26       ` Robert Hoo
  0 siblings, 0 replies; 21+ messages in thread
From: Robert Hoo @ 2022-12-15  6:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Greg Thelen, David Matlack,
	Ben Gardon, Mingwei Zhang

On Thu, 2022-12-15 at 00:11 +0000, Sean Christopherson wrote:
> On Wed, Dec 14, 2022, Robert Hoo wrote:
> > On Tue, 2022-12-13 at 03:30 +0000, Sean Christopherson wrote:
> > > ---
> > >  arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c
> > > b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index e2e197d41780..fd4ae99790d7 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1203,7 +1203,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu,
> > > struct kvm_page_fault *fault)
> > >  		if (fault->huge_page_disallowed &&
> > >  		    fault->req_level >= iter.level) {
> > >  			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > > -			track_possible_nx_huge_page(kvm, sp);
> > > +			if (sp->nx_huge_page_disallowed)
> > > +				track_possible_nx_huge_page(kvm, sp);
> > >  			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > >  		}
> > >  	}
> > 
> > Is this possible?
> > The aforementioned situation happened, i.e. before above hunk
> > track_possible_nx_huge_page(), the sp is zapped by some other task,
> > tdp_mmu_unlink_sp() --> untrack_possible_nx_huge_page(kvm, sp):
> 
> It's possible for untrack_possible_nx_huge_page() to be called before
> the above
> snippet, but the stat won't be decremented in that case since the
> page won't be on
> the list of possible NX huge pages, i.e. list_empty() will be true.

Right, I was fooled by the name of list_empty(); here it effectively acts
as a "list_node_empty()" check. Thanks for explaining.
> 
>   void untrack_possible_nx_huge_page(struct kvm *kvm, struct
> kvm_mmu_page *sp)
>   {
> 	if (list_empty(&sp->possible_nx_huge_page_link))
> 		return;
> 
> 	--kvm->stat.nx_lpage_splits;
> 
> And by not calling track_possible_nx_huge_page() (this bug fix),
> nx_lpage_splits
> won't be incorrectly incremented.
> 
> > 
> > --kvm->stat.nx_lpage_splits;
> > 
> > But it looks like the stat for this sp hasn't been increased yet.
> > 



* Re: [PATCH 5/5] KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its caller
  2022-12-13  3:30 ` [PATCH 5/5] KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its caller Sean Christopherson
@ 2022-12-20 17:53   ` David Matlack
  2022-12-21 18:32     ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: David Matlack @ 2022-12-20 17:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Robert Hoo, Greg Thelen,
	Ben Gardon, Mingwei Zhang

On Tue, Dec 13, 2022 at 03:30:30AM +0000, Sean Christopherson wrote:
> Move the hugepage adjust, tracepoint, and RCU (un)lock logic out of
> kvm_tdp_mmu_map() and into its sole caller, kvm_tdp_mmu_page_fault(), to
> eliminate the gotos used to bounce through rcu_read_unlock() when bailing
> from the walk.
> 
> Opportunistically mark kvm_mmu_hugepage_adjust() as static as
> kvm_tdp_mmu_map() was the only external user.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          |  9 ++++++++-
>  arch/x86/kvm/mmu/mmu_internal.h |  1 -
>  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++------------------
>  3 files changed, 12 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 254bc46234e0..99c40617d325 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3085,7 +3085,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  	return min(host_level, max_level);
>  }
>  
> -void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> +static void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu,
> +				    struct kvm_page_fault *fault)
>  {
>  	struct kvm_memory_slot *slot = fault->slot;
>  	kvm_pfn_t mask;
> @@ -4405,7 +4406,13 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>  	if (is_page_fault_stale(vcpu, fault))
>  		goto out_unlock;
>  
> +	kvm_mmu_hugepage_adjust(vcpu, fault);

Can you also move the call to kvm_mmu_hugepage_adjust() from
direct_map() to direct_page_fault()? I do think it's worth the
maintenance burden to keep those functions consistent.

> +
> +	trace_kvm_mmu_spte_requested(fault);
> +
> +	rcu_read_lock();
>  	r = kvm_tdp_mmu_map(vcpu, fault);
> +	rcu_read_unlock();

I would prefer to keep these in tdp_mmu.c, to reduce the amount of TDP
MMU details that bleed into mmu.c (RCU) and for consistency with other
TDP MMU APIs that don't require the caller to acquire RCU.  This will
also be helpful for the Common MMU, as the tracepoint and RCU will be
common.

e.g.

static int __kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	...
}

int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	int r;

	trace_kvm_mmu_spte_requested(fault);

	rcu_read_lock();
	r = __kvm_tdp_mmu_map(vcpu, fault);
	rcu_read_unlock();

	return r;
}


* Re: [PATCH 4/5] KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level
  2022-12-13 18:15     ` Sean Christopherson
@ 2022-12-20 18:24       ` David Matlack
  0 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2022-12-20 18:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Robert Hoo, Greg Thelen,
	Ben Gardon, Mingwei Zhang

On Tue, Dec 13, 2022 at 06:15:56PM +0000, Sean Christopherson wrote:
> On Tue, Dec 13, 2022, David Matlack wrote:
> > On Mon, Dec 12, 2022 at 7:30 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Don't install a leaf TDP MMU SPTE if the parent page's level doesn't
> > > match the target level of the fault, and instead have the vCPU retry the
> > > faulting instruction after warning.  Continuing on is completely
> > > unnecessary as the absolute worst case scenario of retrying is DoSing
> > > the vCPU, whereas continuing on all but guarantees bigger explosions, e.g.
> > 
> > Would it make sense to kill the VM instead via KVM_BUG()?
> 
> No, because if a bug that hits this escapes to a release, odds are quite high that
> retrying will succeed.  E.g. the fix earlier in this series is for a rare corner
> case that I was able to hit consistently only by hacking KVM to effectively
> synchronize the page fault and zap.  Other than an extra page fault, no harm has
> been done to the guest, e.g. there's no need to kill the VM to protect it from
> data corruption.

Good points, agreed!


* Re: [PATCH 5/5] KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its caller
  2022-12-20 17:53   ` David Matlack
@ 2022-12-21 18:32     ` Sean Christopherson
  2022-12-29 19:51       ` David Matlack
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2022-12-21 18:32 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, linux-kernel, Robert Hoo, Greg Thelen,
	Ben Gardon, Mingwei Zhang

On Tue, Dec 20, 2022, David Matlack wrote:
> On Tue, Dec 13, 2022 at 03:30:30AM +0000, Sean Christopherson wrote:
> > Move the hugepage adjust, tracepoint, and RCU (un)lock logic out of
> > kvm_tdp_mmu_map() and into its sole caller, kvm_tdp_mmu_page_fault(), to
> > eliminate the gotos used to bounce through rcu_read_unlock() when bailing
> > from the walk.
> > 
> > Opportunistically mark kvm_mmu_hugepage_adjust() as static as
> > kvm_tdp_mmu_map() was the only external user.
> > 
> > No functional change intended.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          |  9 ++++++++-
> >  arch/x86/kvm/mmu/mmu_internal.h |  1 -
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++------------------
> >  3 files changed, 12 insertions(+), 20 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 254bc46234e0..99c40617d325 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3085,7 +3085,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >  	return min(host_level, max_level);
> >  }
> >  
> > -void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > +static void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu,
> > +				    struct kvm_page_fault *fault)
> >  {
> >  	struct kvm_memory_slot *slot = fault->slot;
> >  	kvm_pfn_t mask;
> > @@ -4405,7 +4406,13 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> >  	if (is_page_fault_stale(vcpu, fault))
> >  		goto out_unlock;
> >  
> > +	kvm_mmu_hugepage_adjust(vcpu, fault);
> 
> Can you also move the call to kvm_mmu_hugepage_adjust() from
> direct_map() to direct_page_fault()? I do think it's worth the
> maintenance burden to keep those functions consistent.

Sure.

> > +	trace_kvm_mmu_spte_requested(fault);
> > +
> > +	rcu_read_lock();
> >  	r = kvm_tdp_mmu_map(vcpu, fault);
> > +	rcu_read_unlock();
> 
> I would prefer to keep these in tdp_mmu.c, to reduce the amount of TDP
> MMU details that bleed into mmu.c (RCU) and for consistency with other
> TDP MMU APIs that don't require the caller to acquire RCU.  This will
> also be helpful for the Common MMU, as the tracepoint and RCU will be
> common.
> 
> e.g.
> 
> static int __kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> 	...
> }
> 
> int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> 	int r;
> 
> 	trace_kvm_mmu_spte_requested(fault);
> 
> 	rcu_read_lock();
> 	r = __kvm_tdp_mmu_map(vcpu, fault);
> 	rcu_read_unlock();
> 
> 	return r;
> }

I did that originally, but it felt really silly to have the trivial wrapper, especially
because mmu.c already has TDP MMU details, e.g. kvm_tdp_mmu_page_fault() takes mmu_lock
for read and other flows acquire rcu_read_lock() to protect the TDP MMU.

What about the below (split into multiple patches) instead?  kvm_tdp_mmu_page_fault()
really should live in tdp_mmu.c, the only reason it's in mmu.c is to get at various
helpers, e.g. fast_page_fault() and kvm_faultin_pfn().

Or is that doomed to fail because the TDP MMU will want to add code before
kvm_faultin_pfn() (I can't remember what motivated splitting out kvm_tdp_mmu_page_fault()
in the first place).

---
 arch/x86/kvm/mmu/mmu.c          | 132 ++++++++------------------------
 arch/x86/kvm/mmu/mmu_internal.h |  50 ++++++++++++
 arch/x86/kvm/mmu/spte.h         |   7 --
 arch/x86/kvm/mmu/tdp_mmu.c      |  41 ++++++----
 arch/x86/kvm/mmu/tdp_mmu.h      |   8 +-
 5 files changed, 108 insertions(+), 130 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 254bc46234e0..8203b1dd2753 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1927,16 +1927,6 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
 	return true;
 }
 
-static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
-	if (sp->role.invalid)
-		return true;
-
-	/* TDP MMU pages do not use the MMU generation. */
-	return !is_tdp_mmu_page(sp) &&
-	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
-}
-
 struct mmu_page_path {
 	struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
 	unsigned int idx[PT64_ROOT_MAX_LEVEL];
@@ -3148,9 +3138,6 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	int ret;
 	gfn_t base_gfn = fault->gfn;
 
-	kvm_mmu_hugepage_adjust(vcpu, fault);
-
-	trace_kvm_mmu_spte_requested(fault);
 	for_each_shadow_entry(vcpu, fault->addr, it) {
 		/*
 		 * We cannot overwrite existing page tables with an NX
@@ -4270,54 +4257,10 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	return RET_PF_CONTINUE;
 }
 
-/*
- * Returns true if the page fault is stale and needs to be retried, i.e. if the
- * root was invalidated by a memslot update or a relevant mmu_notifier fired.
- */
-static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
-				struct kvm_page_fault *fault)
+static int __direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
-	struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
+	int r = RET_PF_RETRY;
 
-	/* Special roots, e.g. pae_root, are not backed by shadow pages. */
-	if (sp && is_obsolete_sp(vcpu->kvm, sp))
-		return true;
-
-	/*
-	 * Roots without an associated shadow page are considered invalid if
-	 * there is a pending request to free obsolete roots.  The request is
-	 * only a hint that the current root _may_ be obsolete and needs to be
-	 * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
-	 * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
-	 * to reload even if no vCPU is actively using the root.
-	 */
-	if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
-		return true;
-
-	return fault->slot &&
-	       mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
-}
-
-static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
-{
-	int r;
-
-	if (page_fault_handle_page_track(vcpu, fault))
-		return RET_PF_EMULATE;
-
-	r = fast_page_fault(vcpu, fault);
-	if (r != RET_PF_INVALID)
-		return r;
-
-	r = mmu_topup_memory_caches(vcpu, false);
-	if (r)
-		return r;
-
-	r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
-	if (r != RET_PF_CONTINUE)
-		return r;
-
-	r = RET_PF_RETRY;
 	write_lock(&vcpu->kvm->mmu_lock);
 
 	if (is_page_fault_stale(vcpu, fault))
@@ -4327,6 +4270,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	if (r)
 		goto out_unlock;
 
+	kvm_mmu_hugepage_adjust(vcpu, fault);
+
+	trace_kvm_mmu_spte_requested(fault);
+
 	r = direct_map(vcpu, fault);
 
 out_unlock:
@@ -4335,6 +4282,32 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	return r;
 }
 
+static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+	int r;
+
+	if (page_fault_handle_page_track(vcpu, fault))
+		return RET_PF_EMULATE;
+
+	r = fast_page_fault(vcpu, fault);
+	if (r != RET_PF_INVALID)
+		return r;
+
+	r = mmu_topup_memory_caches(vcpu, false);
+	if (r)
+		return r;
+
+	r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
+	if (r != RET_PF_CONTINUE)
+		return r;
+
+#ifdef CONFIG_X86_64
+	if (tdp_mmu_enabled)
+		return kvm_tdp_mmu_page_fault(vcpu, fault);
+#endif
+	return __direct_page_fault(vcpu, fault);
+}
+
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu,
 				struct kvm_page_fault *fault)
 {
@@ -4378,42 +4351,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 }
 EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
 
-#ifdef CONFIG_X86_64
-static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
-				  struct kvm_page_fault *fault)
-{
-	int r;
-
-	if (page_fault_handle_page_track(vcpu, fault))
-		return RET_PF_EMULATE;
-
-	r = fast_page_fault(vcpu, fault);
-	if (r != RET_PF_INVALID)
-		return r;
-
-	r = mmu_topup_memory_caches(vcpu, false);
-	if (r)
-		return r;
-
-	r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
-	if (r != RET_PF_CONTINUE)
-		return r;
-
-	r = RET_PF_RETRY;
-	read_lock(&vcpu->kvm->mmu_lock);
-
-	if (is_page_fault_stale(vcpu, fault))
-		goto out_unlock;
-
-	r = kvm_tdp_mmu_map(vcpu, fault);
-
-out_unlock:
-	read_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
-	return r;
-}
-#endif
-
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	/*
@@ -4438,11 +4375,6 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		}
 	}
 
-#ifdef CONFIG_X86_64
-	if (tdp_mmu_enabled)
-		return kvm_tdp_mmu_page_fault(vcpu, fault);
-#endif
-
 	return direct_page_fault(vcpu, fault);
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index ac00bfbf32f6..2c7c2b49f719 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -133,6 +133,28 @@ struct kvm_mmu_page {
 
 extern struct kmem_cache *mmu_page_header_cache;
 
+static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
+{
+	struct page *page = pfn_to_page((shadow_page) >> PAGE_SHIFT);
+
+	return (struct kvm_mmu_page *)page_private(page);
+}
+
+static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp)
+{
+	return IS_ENABLED(CONFIG_X86_64) && sp->tdp_mmu_page;
+}
+
+static inline bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	if (sp->role.invalid)
+		return true;
+
+	/* TDP MMU pages do not use the MMU generation. */
+	return !is_tdp_mmu_page(sp) &&
+	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
+}
+
 static inline int kvm_mmu_role_as_id(union kvm_mmu_page_role role)
 {
 	return role.smm ? 1 : 0;
@@ -314,6 +336,34 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	return r;
 }
 
+/*
+ * Returns true if the page fault is stale and needs to be retried, i.e. if the
+ * root was invalidated by a memslot update or a relevant mmu_notifier fired.
+ */
+static inline bool is_page_fault_stale(struct kvm_vcpu *vcpu,
+				       struct kvm_page_fault *fault)
+{
+	struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
+
+	/* Special roots, e.g. pae_root, are not backed by shadow pages. */
+	if (sp && is_obsolete_sp(vcpu->kvm, sp))
+		return true;
+
+	/*
+	 * Roots without an associated shadow page are considered invalid if
+	 * there is a pending request to free obsolete roots.  The request is
+	 * only a hint that the current root _may_ be obsolete and needs to be
+	 * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
+	 * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
+	 * to reload even if no vCPU is actively using the root.
+	 */
+	if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
+		return true;
+
+	return fault->slot &&
+	       mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
+}
+
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			      const struct kvm_memory_slot *slot, gfn_t gfn,
 			      int max_level);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1f03701b943a..23e8f8c152b5 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -219,13 +219,6 @@ static inline int spte_index(u64 *sptep)
  */
 extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
 
-static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
-{
-	struct page *page = pfn_to_page((shadow_page) >> PAGE_SHIFT);
-
-	return (struct kvm_mmu_page *)page_private(page);
-}
-
 static inline struct kvm_mmu_page *spte_to_child_sp(u64 spte)
 {
 	return to_shadow_page(spte & SPTE_BASE_ADDR_MASK);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cc1fb9a65620..4bb58c0f465b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1144,19 +1144,12 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
  */
-int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
-	int ret = RET_PF_RETRY;
-
-	kvm_mmu_hugepage_adjust(vcpu, fault);
-
-	trace_kvm_mmu_spte_requested(fault);
-
-	rcu_read_lock();
 
 	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
 		int r;
@@ -1169,10 +1162,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * retry, avoiding unnecessary page table allocation and free.
 		 */
 		if (is_removed_spte(iter.old_spte))
-			goto retry;
+			return RET_PF_RETRY;
 
 		if (iter.level == fault->goal_level)
-			goto map_target_level;
+			return tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
 
 		/* Step down into the lower level page table if it exists. */
 		if (is_shadow_present_pte(iter.old_spte) &&
@@ -1199,7 +1192,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 */
 		if (r) {
 			tdp_mmu_free_sp(sp);
-			goto retry;
+			return RET_PF_RETRY;
 		}
 
 		if (fault->huge_page_disallowed &&
@@ -1216,14 +1209,30 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	 * iterator detected an upper level SPTE was frozen during traversal.
 	 */
 	WARN_ON_ONCE(iter.level == fault->goal_level);
-	goto retry;
+	return RET_PF_RETRY;
+}
 
-map_target_level:
-	ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
+int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+	int r = RET_PF_RETRY;
 
-retry:
+	read_lock(&vcpu->kvm->mmu_lock);
+
+	if (is_page_fault_stale(vcpu, fault))
+		goto out_unlock;
+
+	kvm_mmu_hugepage_adjust(vcpu, fault);
+
+	trace_kvm_mmu_spte_requested(fault);
+
+	rcu_read_lock();
+	r = kvm_tdp_mmu_map(vcpu, fault);
 	rcu_read_unlock();
-	return ret;
+
+out_unlock:
+	read_unlock(&vcpu->kvm->mmu_lock);
+	kvm_release_pfn_clean(fault->pfn);
+	return r;
 }
 
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 0a63b1afabd3..697dca948d0a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -27,7 +27,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
 
-int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 				 bool flush);
@@ -70,10 +70,4 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
 					u64 *spte);
 
-#ifdef CONFIG_X86_64
-static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; }
-#else
-static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
-#endif
-
 #endif /* __KVM_X86_MMU_TDP_MMU_H */

base-commit: ffc4e70d9855921b740aee9bcbc8503cc753aee2
-- 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2
  2022-12-13  3:30 [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2 Sean Christopherson
                   ` (5 preceding siblings ...)
  2022-12-14 12:01 ` [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2 Robert Hoo
@ 2022-12-23 17:32 ` Paolo Bonzini
  6 siblings, 0 replies; 21+ messages in thread
From: Paolo Bonzini @ 2022-12-23 17:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, Robert Hoo, Greg Thelen, David Matlack,
	Ben Gardon, Mingwei Zhang

Queued, thanks.

Paolo



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its caller
  2022-12-21 18:32     ` Sean Christopherson
@ 2022-12-29 19:51       ` David Matlack
  2022-12-29 21:06         ` Paolo Bonzini
  0 siblings, 1 reply; 21+ messages in thread
From: David Matlack @ 2022-12-29 19:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Robert Hoo, Greg Thelen,
	Ben Gardon, Mingwei Zhang

On Wed, Dec 21, 2022 at 06:32:05PM +0000, Sean Christopherson wrote:
> On Tue, Dec 20, 2022, David Matlack wrote:
> > On Tue, Dec 13, 2022 at 03:30:30AM +0000, Sean Christopherson wrote:
> > > Move the hugepage adjust, tracepoint, and RCU (un)lock logic out of
> > > kvm_tdp_mmu_map() and into its sole caller, kvm_tdp_mmu_page_fault(), to
> > > eliminate the gotos used to bounce through rcu_read_unlock() when bailing
> > > from the walk.
> > > 
> > > Opportunistically mark kvm_mmu_hugepage_adjust() as static as
> > > kvm_tdp_mmu_map() was the only external user.
> > > 
> > > No functional change intended.
> > > 
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c          |  9 ++++++++-
> > >  arch/x86/kvm/mmu/mmu_internal.h |  1 -
> > >  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++++------------------
> > >  3 files changed, 12 insertions(+), 20 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 254bc46234e0..99c40617d325 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3085,7 +3085,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > >  	return min(host_level, max_level);
> > >  }
> > >  
> > > -void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > +static void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu,
> > > +				    struct kvm_page_fault *fault)
> > >  {
> > >  	struct kvm_memory_slot *slot = fault->slot;
> > >  	kvm_pfn_t mask;
> > > @@ -4405,7 +4406,13 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> > >  	if (is_page_fault_stale(vcpu, fault))
> > >  		goto out_unlock;
> > >  
> > > +	kvm_mmu_hugepage_adjust(vcpu, fault);
> > 
> > Can you also move the call to kvm_mmu_hugepage_adjust() from
> > direct_map() to direct_page_fault()? I do think it's worth the
> > maintenance burden to keep those functions consistent.
> 
> Sure.
> 
> > > +	trace_kvm_mmu_spte_requested(fault);
> > > +
> > > +	rcu_read_lock();
> > >  	r = kvm_tdp_mmu_map(vcpu, fault);
> > > +	rcu_read_unlock();
> > 
> > I would prefer to keep these in tdp_mmu.c, to reduce the amount of TDP
> > MMU details that bleed into mmu.c (RCU) and for consistency with other
> > TDP MMU APIs that don't require the caller to acquire RCU.  This will
> > also be helpful for the Common MMU, as the tracepoint and RCU will be
> > common.
> > 
> > e.g.
> > 
> > static int __kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > {
> > 	...
> > }
> > 
> > int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > {
> > 	int r;
> > 
> > 	trace_kvm_mmu_spte_requested(fault);
> > 
> > 	rcu_read_lock();
> > 	r = __kvm_tdp_mmu_map(vcpu, fault);
> > 	rcu_read_unlock();
> > 
> > 	return r;
> > }
> 
> I did that originally, but it felt really silly to have the trivial wrapper, especially
> because mmu.c already has TDP MMU details, e.g. kvm_tdp_mmu_page_fault() takes mmu_lock
> for read and other flows acquire rcu_read_lock() to protect the TDP MMU.

A trivial wrapper is useful in this case. While mmu.c does already have
some TDP MMU RCU details, I'd like to decrease that, not increase it.

> 
> What about the below (split into multiple patches) instead?  kvm_tdp_mmu_page_fault()
> really should live in tdp_mmu.c, the only reason it's in mmu.c is to get at various
> helpers, e.g. fast_page_fault() and kvm_faultin_pfn().

Maybe, I'm not sure. The page fault handling routines have more to do
with mmu.c than tdp_mmu.c, i.e. it's more about integrating with the
rest of KVM/x86 (page fault tracking, MMU notifiers, etc.). We only go
into tdp_mmu.c to manipulate page tables.

> 
> Or is that doomed to fail because the TDP MMU will want to add code before
> kvm_faultin_pfn() (I can't remember what motivated splitting out kvm_tdp_mmu_page_fault()
> in the first place)?

To improve readability (fewer conditionals: if (tdp_mmu_enabled)) and
prepare for more divergence.

Your proposal (below) to split out the "lower half" of the page fault
handling routine works now because that's where all the divergence is.
But with the common MMU there's also going to be divergence in the fast
page fault handler. So I prefer to just keep the routines separate to
avoid thrashing down the road.
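
As a rough illustration of the divergence I'm anticipating (a hypothetical
sketch, not code from this series; kvm_tdp_mmu_fast_page_fault() is a
made-up name), each handler could end up owning its own fast path:

static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	int r = fast_page_fault(vcpu, fault);	/* shadow MMU fast path */

	if (r != RET_PF_INVALID)
		return r;

	/* ... shadow MMU slow path, mmu_lock held for write ... */
	return RET_PF_RETRY;
}

static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	int r = kvm_tdp_mmu_fast_page_fault(vcpu, fault);	/* hypothetical */

	if (r != RET_PF_INVALID)
		return r;

	/* ... TDP MMU slow path, mmu_lock held for read + RCU ... */
	return RET_PF_RETRY;
}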

> 
> ---
>  arch/x86/kvm/mmu/mmu.c          | 132 ++++++++------------------------
>  arch/x86/kvm/mmu/mmu_internal.h |  50 ++++++++++++
>  arch/x86/kvm/mmu/spte.h         |   7 --
>  arch/x86/kvm/mmu/tdp_mmu.c      |  41 ++++++----
>  arch/x86/kvm/mmu/tdp_mmu.h      |   8 +-
>  5 files changed, 108 insertions(+), 130 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 254bc46234e0..8203b1dd2753 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1927,16 +1927,6 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
>  	return true;
>  }
>  
> -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> -{
> -	if (sp->role.invalid)
> -		return true;
> -
> -	/* TDP MMU pages do not use the MMU generation. */
> -	return !is_tdp_mmu_page(sp) &&
> -	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> -}
> -
>  struct mmu_page_path {
>  	struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
>  	unsigned int idx[PT64_ROOT_MAX_LEVEL];
> @@ -3148,9 +3138,6 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	int ret;
>  	gfn_t base_gfn = fault->gfn;
>  
> -	kvm_mmu_hugepage_adjust(vcpu, fault);
> -
> -	trace_kvm_mmu_spte_requested(fault);
>  	for_each_shadow_entry(vcpu, fault->addr, it) {
>  		/*
>  		 * We cannot overwrite existing page tables with an NX
> @@ -4270,54 +4257,10 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	return RET_PF_CONTINUE;
>  }
>  
> -/*
> - * Returns true if the page fault is stale and needs to be retried, i.e. if the
> - * root was invalidated by a memslot update or a relevant mmu_notifier fired.
> - */
> -static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> -				struct kvm_page_fault *fault)
> +static int __direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
> -	struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
> +	int r = RET_PF_RETRY;
>  
> -	/* Special roots, e.g. pae_root, are not backed by shadow pages. */
> -	if (sp && is_obsolete_sp(vcpu->kvm, sp))
> -		return true;
> -
> -	/*
> -	 * Roots without an associated shadow page are considered invalid if
> -	 * there is a pending request to free obsolete roots.  The request is
> -	 * only a hint that the current root _may_ be obsolete and needs to be
> -	 * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
> -	 * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
> -	 * to reload even if no vCPU is actively using the root.
> -	 */
> -	if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> -		return true;
> -
> -	return fault->slot &&
> -	       mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
> -}
> -
> -static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> -{
> -	int r;
> -
> -	if (page_fault_handle_page_track(vcpu, fault))
> -		return RET_PF_EMULATE;
> -
> -	r = fast_page_fault(vcpu, fault);
> -	if (r != RET_PF_INVALID)
> -		return r;
> -
> -	r = mmu_topup_memory_caches(vcpu, false);
> -	if (r)
> -		return r;
> -
> -	r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
> -	if (r != RET_PF_CONTINUE)
> -		return r;
> -
> -	r = RET_PF_RETRY;
>  	write_lock(&vcpu->kvm->mmu_lock);
>  
>  	if (is_page_fault_stale(vcpu, fault))
> @@ -4327,6 +4270,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	if (r)
>  		goto out_unlock;
>  
> +	kvm_mmu_hugepage_adjust(vcpu, fault);
> +
> +	trace_kvm_mmu_spte_requested(fault);
> +
>  	r = direct_map(vcpu, fault);
>  
>  out_unlock:
> @@ -4335,6 +4282,32 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	return r;
>  }
>  
> +static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> +{
> +	int r;
> +
> +	if (page_fault_handle_page_track(vcpu, fault))
> +		return RET_PF_EMULATE;
> +
> +	r = fast_page_fault(vcpu, fault);
> +	if (r != RET_PF_INVALID)
> +		return r;
> +
> +	r = mmu_topup_memory_caches(vcpu, false);
> +	if (r)
> +		return r;
> +
> +	r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
> +	if (r != RET_PF_CONTINUE)
> +		return r;
> +
> +#ifdef CONFIG_X86_64
> +	if (tdp_mmu_enabled)
> +		return kvm_tdp_mmu_page_fault(vcpu, fault);
> +#endif
> +	return __direct_page_fault(vcpu, fault);
> +}
> +
>  static int nonpaging_page_fault(struct kvm_vcpu *vcpu,
>  				struct kvm_page_fault *fault)
>  {
> @@ -4378,42 +4351,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  }
>  EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
>  
> -#ifdef CONFIG_X86_64
> -static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> -				  struct kvm_page_fault *fault)
> -{
> -	int r;
> -
> -	if (page_fault_handle_page_track(vcpu, fault))
> -		return RET_PF_EMULATE;
> -
> -	r = fast_page_fault(vcpu, fault);
> -	if (r != RET_PF_INVALID)
> -		return r;
> -
> -	r = mmu_topup_memory_caches(vcpu, false);
> -	if (r)
> -		return r;
> -
> -	r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
> -	if (r != RET_PF_CONTINUE)
> -		return r;
> -
> -	r = RET_PF_RETRY;
> -	read_lock(&vcpu->kvm->mmu_lock);
> -
> -	if (is_page_fault_stale(vcpu, fault))
> -		goto out_unlock;
> -
> -	r = kvm_tdp_mmu_map(vcpu, fault);
> -
> -out_unlock:
> -	read_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);
> -	return r;
> -}
> -#endif
> -
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	/*
> @@ -4438,11 +4375,6 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		}
>  	}
>  
> -#ifdef CONFIG_X86_64
> -	if (tdp_mmu_enabled)
> -		return kvm_tdp_mmu_page_fault(vcpu, fault);
> -#endif
> -
>  	return direct_page_fault(vcpu, fault);
>  }
>  
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index ac00bfbf32f6..2c7c2b49f719 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -133,6 +133,28 @@ struct kvm_mmu_page {
>  
>  extern struct kmem_cache *mmu_page_header_cache;
>  
> +static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
> +{
> +	struct page *page = pfn_to_page((shadow_page) >> PAGE_SHIFT);
> +
> +	return (struct kvm_mmu_page *)page_private(page);
> +}
> +
> +static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp)
> +{
> +	return IS_ENABLED(CONFIG_X86_64) && sp->tdp_mmu_page;
> +}
> +
> +static inline bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> +	if (sp->role.invalid)
> +		return true;
> +
> +	/* TDP MMU pages do not use the MMU generation. */
> +	return !is_tdp_mmu_page(sp) &&
> +	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> +}
> +
>  static inline int kvm_mmu_role_as_id(union kvm_mmu_page_role role)
>  {
>  	return role.smm ? 1 : 0;
> @@ -314,6 +336,34 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	return r;
>  }
>  
> +/*
> + * Returns true if the page fault is stale and needs to be retried, i.e. if the
> + * root was invalidated by a memslot update or a relevant mmu_notifier fired.
> + */
> +static inline bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> +				       struct kvm_page_fault *fault)
> +{
> +	struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
> +
> +	/* Special roots, e.g. pae_root, are not backed by shadow pages. */
> +	if (sp && is_obsolete_sp(vcpu->kvm, sp))
> +		return true;
> +
> +	/*
> +	 * Roots without an associated shadow page are considered invalid if
> +	 * there is a pending request to free obsolete roots.  The request is
> +	 * only a hint that the current root _may_ be obsolete and needs to be
> +	 * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
> +	 * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
> +	 * to reload even if no vCPU is actively using the root.
> +	 */
> +	if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> +		return true;
> +
> +	return fault->slot &&
> +	       mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
> +}

is_page_fault_stale() is overkill for the TDP MMU and is
KVM/x86-specific. If we do go with your way of splitting things, I'd
prefer to have the TDP MMU not call is_page_fault_stale() so that more
code can be shared across architectures when the TDP MMU is common.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 99c40617d325..68db6805072a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3085,8 +3085,7 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 	return min(host_level, max_level);
 }
 
-static void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu,
-				    struct kvm_page_fault *fault)
+void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	kvm_pfn_t mask;
@@ -4295,8 +4294,28 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 	if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
 		return true;
 
-	return fault->slot &&
-	       mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
+	return mmu_invalidate_retry_fault(vcpu, fault);
+}
+
+static int __direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+	int r = RET_PF_RETRY;
+
+	write_lock(&vcpu->kvm->mmu_lock);
+
+	if (is_page_fault_stale(vcpu, fault))
+		goto out_unlock;
+
+	r = make_mmu_pages_available(vcpu);
+	if (r)
+		goto out_unlock;
+
+	r = direct_map(vcpu, fault);
+
+out_unlock:
+	write_unlock(&vcpu->kvm->mmu_lock);
+	kvm_release_pfn_clean(fault->pfn);
+	return r;
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -4318,22 +4337,13 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	if (r != RET_PF_CONTINUE)
 		return r;
 
-	r = RET_PF_RETRY;
-	write_lock(&vcpu->kvm->mmu_lock);
-
-	if (is_page_fault_stale(vcpu, fault))
-		goto out_unlock;
-
-	r = make_mmu_pages_available(vcpu);
-	if (r)
-		goto out_unlock;
+#ifdef CONFIG_X86_64
+	if (tdp_mmu_enabled)
+		return kvm_tdp_mmu_page_fault(vcpu, fault);
+#endif
 
-	r = direct_map(vcpu, fault);
+	return __direct_page_fault(vcpu, fault);
 
-out_unlock:
-	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
-	return r;
 }
 
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu,
@@ -4379,48 +4389,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 }
 EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
 
-#ifdef CONFIG_X86_64
-static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
-				  struct kvm_page_fault *fault)
-{
-	int r;
-
-	if (page_fault_handle_page_track(vcpu, fault))
-		return RET_PF_EMULATE;
-
-	r = fast_page_fault(vcpu, fault);
-	if (r != RET_PF_INVALID)
-		return r;
-
-	r = mmu_topup_memory_caches(vcpu, false);
-	if (r)
-		return r;
-
-	r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
-	if (r != RET_PF_CONTINUE)
-		return r;
-
-	r = RET_PF_RETRY;
-	read_lock(&vcpu->kvm->mmu_lock);
-
-	if (is_page_fault_stale(vcpu, fault))
-		goto out_unlock;
-
-	kvm_mmu_hugepage_adjust(vcpu, fault);
-
-	trace_kvm_mmu_spte_requested(fault);
-
-	rcu_read_lock();
-	r = kvm_tdp_mmu_map(vcpu, fault);
-	rcu_read_unlock();
-
-out_unlock:
-	read_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
-	return r;
-}
-#endif
-
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	/*
@@ -4445,11 +4413,6 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		}
 	}
 
-#ifdef CONFIG_X86_64
-	if (tdp_mmu_enabled)
-		return kvm_tdp_mmu_page_fault(vcpu, fault);
-#endif
-
 	return direct_page_fault(vcpu, fault);
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 66c294d67641..776b0ad4e58a 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -323,5 +323,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 
 void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+
+static inline bool mmu_invalidate_retry_fault(struct kvm_vcpu *vcpu,
+					      struct kvm_page_fault *fault)
+{
+	return fault->slot &&
+	       mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
+}
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 78f47eb74544..d4736cb91c9f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1144,7 +1144,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
  */
-int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	struct kvm *kvm = vcpu->kvm;
@@ -1212,6 +1212,33 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return RET_PF_RETRY;
 }
 
+int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+	struct kvm_mmu_page *root = to_shadow_page(vcpu->arch.mmu->root.hpa);
+	int r = RET_PF_RETRY;
+
+	read_lock(&vcpu->kvm->mmu_lock);
+
+	if (root->role.invalid)
+		goto out;
+
+	if (mmu_invalidate_retry_fault(vcpu, fault))
+		goto out;
+
+	kvm_mmu_hugepage_adjust(vcpu, fault);
+
+	trace_kvm_mmu_spte_requested(fault);
+
+	rcu_read_lock();
+	r = kvm_tdp_mmu_map(vcpu, fault);
+	rcu_read_unlock();
+
+out:
+	read_unlock(&vcpu->kvm->mmu_lock);
+	kvm_release_pfn_clean(fault->pfn);
+	return r;
+}
+
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 				 bool flush)
 {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 0a63b1afabd3..849e5886e73b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -27,7 +27,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
 
-int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 				 bool flush);
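
For reference, a sketch of the call flow the above would leave us with (my
reading of the end state, assuming the hunks are applied on top of this
series; not code from the posting):

/*
 * kvm_tdp_page_fault()
 *   direct_page_fault()          - common prolog: page-track check,
 *                                  fast_page_fault(), cache topup,
 *                                  kvm_faultin_pfn()
 *     kvm_tdp_mmu_page_fault()   - TDP MMU, now in tdp_mmu.c: mmu_lock for
 *                                  read, root->role.invalid and
 *                                  mmu_invalidate_retry_fault() checks,
 *                                  kvm_mmu_hugepage_adjust(), RCU around
 *                                  kvm_tdp_mmu_map()
 *     __direct_page_fault()      - shadow MMU, in mmu.c: mmu_lock for write,
 *                                  is_page_fault_stale(),
 *                                  make_mmu_pages_available(), direct_map()
 */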

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its caller
  2022-12-29 19:51       ` David Matlack
@ 2022-12-29 21:06         ` Paolo Bonzini
  2023-01-03 22:21           ` David Matlack
  0 siblings, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2022-12-29 21:06 UTC (permalink / raw)
  To: David Matlack, Sean Christopherson
  Cc: kvm, linux-kernel, Robert Hoo, Greg Thelen, Ben Gardon, Mingwei Zhang

On 12/29/22 20:51, David Matlack wrote:
> Your proposal (below) to split out the "lower half" of the page fault
> handling routine works now because that's where all the divergence is.
> But with the common MMU there's also going to be divergence in the fast
> page fault handler. So I prefer to just keep the routines separate to
> avoid thrashing down the road.

Can you put the changes at the beginning of the common MMU series? 
Large parts of the whole common MMU refactoring can be merged piece by 
piece, so they can be taken as soon as they're ready.

Paolo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its caller
  2022-12-29 21:06         ` Paolo Bonzini
@ 2023-01-03 22:21           ` David Matlack
  0 siblings, 0 replies; 21+ messages in thread
From: David Matlack @ 2023-01-03 22:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, kvm, linux-kernel, Robert Hoo, Greg Thelen,
	Ben Gardon, Mingwei Zhang

On Thu, Dec 29, 2022 at 1:06 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 12/29/22 20:51, David Matlack wrote:
> > Your proposal (below) to split out the "lower half" of the page fault
> > handling routine works now because that's where all the divergence is.
> > But with the common MMU there's also going to be divergence in the fast
> > page fault handler. So I prefer to just keep the routines separate to
> > avoid thrashing down the road.
>
> Can you put the changes at the beginning of the common MMU series?

Can do. By "the changes" I assume you mean the yet-to-be-written
changes to split out a fast_page_fault() handler for the TDP MMU? Or
do you mean Sean's changes (this series)?

> Large parts of the whole common MMU refactoring can be merged piece by
> piece, so they can be taken as soon as they're ready.

Ack.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-01-03 22:25 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-13  3:30 [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2 Sean Christopherson
2022-12-13  3:30 ` [PATCH 1/5] KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozen Sean Christopherson
2022-12-14 11:57   ` Robert Hoo
2022-12-13  3:30 ` [PATCH 2/5] KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached Sean Christopherson
2022-12-13  3:30 ` [PATCH 3/5] KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed Sean Christopherson
2022-12-14 11:58   ` Robert Hoo
2022-12-15  0:11     ` Sean Christopherson
2022-12-15  6:26       ` Robert Hoo
2022-12-13  3:30 ` [PATCH 4/5] KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level Sean Christopherson
2022-12-13 17:59   ` David Matlack
2022-12-13 18:15     ` Sean Christopherson
2022-12-20 18:24       ` David Matlack
2022-12-13  3:30 ` [PATCH 5/5] KVM: x86/mmu: Move kvm_tdp_mmu_map()'s prolog and epilog to its caller Sean Christopherson
2022-12-20 17:53   ` David Matlack
2022-12-21 18:32     ` Sean Christopherson
2022-12-29 19:51       ` David Matlack
2022-12-29 21:06         ` Paolo Bonzini
2023-01-03 22:21           ` David Matlack
2022-12-14 12:01 ` [PATCH 0/5] KVM: x86/mmu: TDP MMU fixes for 6.2 Robert Hoo
2022-12-14 15:48   ` Sean Christopherson
2022-12-23 17:32 ` Paolo Bonzini
