kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] KVM: x86/mmu: Don't advance iterator after restart due to yielding
@ 2021-12-14  3:35 Sean Christopherson
  2021-12-14 11:50 ` Ignat Korchagin
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Sean Christopherson @ 2021-12-14  3:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ignat Korchagin

After dropping mmu_lock in the TDP MMU, restart the iterator during
tdp_iter_next() and do not advance the iterator.  Advancing the iterator
results in skipping the top-level SPTE and all its children, which is
fatal if any of the skipped SPTEs were not visited before yielding.

When zapping all SPTEs, i.e. when min_level == root_level, restarting the
iter and then invoking tdp_iter_next() is always fatal if the current gfn
has as a valid SPTE, as advancing the iterator results in try_step_side()
skipping the current gfn, which wasn't visited before yielding.

Sprinkle WARNs on iter->yielded being true in various helpers that are
often used in conjunction with yielding, and tag the helper with
__must_check to reduce the probabily of improper usage.

Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
the shadow page will be leaked and KVM will WARN accordingly.

  WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
  RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
  Call Trace:
   <TASK>
   kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
   kvm_destroy_vm+0x162/0x2a0 [kvm]
   kvm_vcpu_release+0x34/0x60 [kvm]
   __fput+0x82/0x240
   task_work_run+0x5c/0x90
   do_exit+0x364/0xa10
   ? futex_unqueue+0x38/0x60
   do_group_exit+0x33/0xa0
   get_signal+0x155/0x850
   arch_do_signal_or_restart+0xed/0x750
   exit_to_user_mode_prepare+0xc5/0x120
   syscall_exit_to_user_mode+0x1d/0x40
   do_syscall_64+0x48/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
marking a struct page as dirty/accessed after it has been put back on the
free list.  This directly triggers a WARN due to encountering a page with
page_count() == 0, but it can also lead to data corruption and additional
errors in the kernel.

  WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
  RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
  Call Trace:
   <TASK>
   kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
   __handle_changed_spte+0x92e/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   zap_gfn_range+0x549/0x620 [kvm]
   kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
   mmu_free_root_page+0x219/0x2c0 [kvm]
   kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
   kvm_mmu_unload+0x1c/0xa0 [kvm]
   kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
   kvm_put_kvm+0x3b1/0x8b0 [kvm]
   kvm_vcpu_release+0x4e/0x70 [kvm]
   __fput+0x1f7/0x8c0
   task_work_run+0xf8/0x1a0
   do_exit+0x97b/0x2230
   do_group_exit+0xda/0x2a0
   get_signal+0x3be/0x1e50
   arch_do_signal_or_restart+0x244/0x17f0
   exit_to_user_mode_prepare+0xcb/0x120
   syscall_exit_to_user_mode+0x1d/0x40
   do_syscall_64+0x4d/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM:
x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
incorrectly advance past a top-level entry when yielding on a lower-level
entry.  But with respect to leaking shadow pages, the bug was introduced
by yielding before processing the current gfn.

Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
callers could jump to their "retry" label.  The downside of that approach
is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
in the loop, and there's no easy way to enfornce that requirement.

Ideally, KVM would handling the cond_resched() fully within the iterator
macro (the code is actually quite clean) and avoid this entire class of
bugs, but that is extremely difficult do while also supporting yielding
after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
may block operations on the SPTE for a significant amount of time.

Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_iter.c |  6 ++++++
 arch/x86/kvm/mmu/tdp_iter.h |  6 ++++++
 arch/x86/kvm/mmu/tdp_mmu.c  | 29 ++++++++++++++++-------------
 3 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index b3ed302c1a35..caa96c270b95 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -26,6 +26,7 @@ static gfn_t round_gfn_for_level(gfn_t gfn, int level)
  */
 void tdp_iter_restart(struct tdp_iter *iter)
 {
+	iter->yielded = false;
 	iter->yielded_gfn = iter->next_last_level_gfn;
 	iter->level = iter->root_level;
 
@@ -160,6 +161,11 @@ static bool try_step_up(struct tdp_iter *iter)
  */
 void tdp_iter_next(struct tdp_iter *iter)
 {
+	if (iter->yielded) {
+		tdp_iter_restart(iter);
+		return;
+	}
+
 	if (try_step_down(iter))
 		return;
 
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index b1748b988d3a..e19cabbcb65c 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -45,6 +45,12 @@ struct tdp_iter {
 	 * iterator walks off the end of the paging structure.
 	 */
 	bool valid;
+	/*
+	 * True if KVM dropped mmu_lock and yielded in the middle of a walk, in
+	 * which case tdp_iter_next() needs to restart the walk at the root
+	 * level instead of advancing to the next entry.
+	 */
+	bool yielded;
 };
 
 /*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b69e47e68307..7b1bc816b7c3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -502,6 +502,8 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 					   struct tdp_iter *iter,
 					   u64 new_spte)
 {
+	WARN_ON_ONCE(iter->yielded);
+
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
 	/*
@@ -575,6 +577,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 				      u64 new_spte, bool record_acc_track,
 				      bool record_dirty_log)
 {
+	WARN_ON_ONCE(iter->yielded);
+
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	/*
@@ -640,18 +644,19 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
  * If this function should yield and flush is set, it will perform a remote
  * TLB flush before yielding.
  *
- * If this function yields, it will also reset the tdp_iter's walk over the
- * paging structure and the calling function should skip to the next
- * iteration to allow the iterator to continue its traversal from the
- * paging structure root.
+ * If this function yields, iter->yielded is set and the caller must skip to
+ * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
+ * over the paging structures to allow the iterator to continue its traversal
+ * from the paging structure root.
  *
- * Return true if this function yielded and the iterator's traversal was reset.
- * Return false if a yield was not needed.
+ * Returns true if this function yielded.
  */
-static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
-					     struct tdp_iter *iter, bool flush,
-					     bool shared)
+static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
+							  struct tdp_iter *iter,
+							  bool flush, bool shared)
 {
+	WARN_ON(iter->yielded);
+
 	/* Ensure forward progress has been made before yielding. */
 	if (iter->next_last_level_gfn == iter->yielded_gfn)
 		return false;
@@ -671,12 +676,10 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
 
 		WARN_ON(iter->gfn > iter->next_last_level_gfn);
 
-		tdp_iter_restart(iter);
-
-		return true;
+		iter->yielded = true;
 	}
 
-	return false;
+	return iter->yielded;
 }
 
 /*
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Don't advance iterator after restart due to yielding
  2021-12-14  3:35 [PATCH] KVM: x86/mmu: Don't advance iterator after restart due to yielding Sean Christopherson
@ 2021-12-14 11:50 ` Ignat Korchagin
  2021-12-15  0:37 ` David Matlack
  2021-12-15 17:13 ` Paolo Bonzini
  2 siblings, 0 replies; 4+ messages in thread
From: Ignat Korchagin @ 2021-12-14 11:50 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel

On Tue, Dec 14, 2021 at 3:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> After dropping mmu_lock in the TDP MMU, restart the iterator during
> tdp_iter_next() and do not advance the iterator.  Advancing the iterator
> results in skipping the top-level SPTE and all its children, which is
> fatal if any of the skipped SPTEs were not visited before yielding.
>
> When zapping all SPTEs, i.e. when min_level == root_level, restarting the
> iter and then invoking tdp_iter_next() is always fatal if the current gfn
> has as a valid SPTE, as advancing the iterator results in try_step_side()
> skipping the current gfn, which wasn't visited before yielding.
>
> Sprinkle WARNs on iter->yielded being true in various helpers that are
> often used in conjunction with yielding, and tag the helper with
> __must_check to reduce the probabily of improper usage.
>
> Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
> SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
> the shadow page will be leaked and KVM will WARN accordingly.
>
>   WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
>   RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
>    kvm_destroy_vm+0x162/0x2a0 [kvm]
>    kvm_vcpu_release+0x34/0x60 [kvm]
>    __fput+0x82/0x240
>    task_work_run+0x5c/0x90
>    do_exit+0x364/0xa10
>    ? futex_unqueue+0x38/0x60
>    do_group_exit+0x33/0xa0
>    get_signal+0x155/0x850
>    arch_do_signal_or_restart+0xed/0x750
>    exit_to_user_mode_prepare+0xc5/0x120
>    syscall_exit_to_user_mode+0x1d/0x40
>    do_syscall_64+0x48/0xc0
>    entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
> kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
> marking a struct page as dirty/accessed after it has been put back on the
> free list.  This directly triggers a WARN due to encountering a page with
> page_count() == 0, but it can also lead to data corruption and additional
> errors in the kernel.
>
>   WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
>   RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
>    __handle_changed_spte+0x92e/0xca0 [kvm]
>    __handle_changed_spte+0x63c/0xca0 [kvm]
>    __handle_changed_spte+0x63c/0xca0 [kvm]
>    __handle_changed_spte+0x63c/0xca0 [kvm]
>    zap_gfn_range+0x549/0x620 [kvm]
>    kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
>    mmu_free_root_page+0x219/0x2c0 [kvm]
>    kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
>    kvm_mmu_unload+0x1c/0xa0 [kvm]
>    kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
>    kvm_put_kvm+0x3b1/0x8b0 [kvm]
>    kvm_vcpu_release+0x4e/0x70 [kvm]
>    __fput+0x1f7/0x8c0
>    task_work_run+0xf8/0x1a0
>    do_exit+0x97b/0x2230
>    do_group_exit+0xda/0x2a0
>    get_signal+0x3be/0x1e50
>    arch_do_signal_or_restart+0x244/0x17f0
>    exit_to_user_mode_prepare+0xcb/0x120
>    syscall_exit_to_user_mode+0x1d/0x40
>    do_syscall_64+0x4d/0x90
>    entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM:
> x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
> tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
> incorrectly advance past a top-level entry when yielding on a lower-level
> entry.  But with respect to leaking shadow pages, the bug was introduced
> by yielding before processing the current gfn.
>
> Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
> callers could jump to their "retry" label.  The downside of that approach
> is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
> in the loop, and there's no easy way to enfornce that requirement.
>
> Ideally, KVM would handling the cond_resched() fully within the iterator
> macro (the code is actually quite clean) and avoid this entire class of
> bugs, but that is extremely difficult do while also supporting yielding
> after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
> SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
> bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
> may block operations on the SPTE for a significant amount of time.
>
> Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
> Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
> Reported-by: Ignat Korchagin <ignat@cloudflare.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_iter.c |  6 ++++++
>  arch/x86/kvm/mmu/tdp_iter.h |  6 ++++++
>  arch/x86/kvm/mmu/tdp_mmu.c  | 29 ++++++++++++++++-------------
>  3 files changed, 28 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index b3ed302c1a35..caa96c270b95 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -26,6 +26,7 @@ static gfn_t round_gfn_for_level(gfn_t gfn, int level)
>   */
>  void tdp_iter_restart(struct tdp_iter *iter)
>  {
> +       iter->yielded = false;
>         iter->yielded_gfn = iter->next_last_level_gfn;
>         iter->level = iter->root_level;
>
> @@ -160,6 +161,11 @@ static bool try_step_up(struct tdp_iter *iter)
>   */
>  void tdp_iter_next(struct tdp_iter *iter)
>  {
> +       if (iter->yielded) {
> +               tdp_iter_restart(iter);
> +               return;
> +       }
> +
>         if (try_step_down(iter))
>                 return;
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index b1748b988d3a..e19cabbcb65c 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -45,6 +45,12 @@ struct tdp_iter {
>          * iterator walks off the end of the paging structure.
>          */
>         bool valid;
> +       /*
> +        * True if KVM dropped mmu_lock and yielded in the middle of a walk, in
> +        * which case tdp_iter_next() needs to restart the walk at the root
> +        * level instead of advancing to the next entry.
> +        */
> +       bool yielded;
>  };
>
>  /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b69e47e68307..7b1bc816b7c3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -502,6 +502,8 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>                                            struct tdp_iter *iter,
>                                            u64 new_spte)
>  {
> +       WARN_ON_ONCE(iter->yielded);
> +
>         lockdep_assert_held_read(&kvm->mmu_lock);
>
>         /*
> @@ -575,6 +577,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>                                       u64 new_spte, bool record_acc_track,
>                                       bool record_dirty_log)
>  {
> +       WARN_ON_ONCE(iter->yielded);
> +
>         lockdep_assert_held_write(&kvm->mmu_lock);
>
>         /*
> @@ -640,18 +644,19 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>   * If this function should yield and flush is set, it will perform a remote
>   * TLB flush before yielding.
>   *
> - * If this function yields, it will also reset the tdp_iter's walk over the
> - * paging structure and the calling function should skip to the next
> - * iteration to allow the iterator to continue its traversal from the
> - * paging structure root.
> + * If this function yields, iter->yielded is set and the caller must skip to
> + * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
> + * over the paging structures to allow the iterator to continue its traversal
> + * from the paging structure root.
>   *
> - * Return true if this function yielded and the iterator's traversal was reset.
> - * Return false if a yield was not needed.
> + * Returns true if this function yielded.
>   */
> -static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> -                                            struct tdp_iter *iter, bool flush,
> -                                            bool shared)
> +static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
> +                                                         struct tdp_iter *iter,
> +                                                         bool flush, bool shared)
>  {
> +       WARN_ON(iter->yielded);
> +
>         /* Ensure forward progress has been made before yielding. */
>         if (iter->next_last_level_gfn == iter->yielded_gfn)
>                 return false;
> @@ -671,12 +676,10 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
>
>                 WARN_ON(iter->gfn > iter->next_last_level_gfn);
>
> -               tdp_iter_restart(iter);
> -
> -               return true;
> +               iter->yielded = true;
>         }
>
> -       return false;
> +       return iter->yielded;
>  }
>
>  /*
> --
> 2.34.1.173.g76aa8bc2d0-goog
>

Tested-by: Ignat Korchagin <ignat@cloudflare.com>

Can confirm this patch resolves the issue in my test environment.
Thanks Sean!

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Don't advance iterator after restart due to yielding
  2021-12-14  3:35 [PATCH] KVM: x86/mmu: Don't advance iterator after restart due to yielding Sean Christopherson
  2021-12-14 11:50 ` Ignat Korchagin
@ 2021-12-15  0:37 ` David Matlack
  2021-12-15 17:13 ` Paolo Bonzini
  2 siblings, 0 replies; 4+ messages in thread
From: David Matlack @ 2021-12-15  0:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ignat Korchagin

On Mon, Dec 13, 2021 at 7:37 PM Sean Christopherson <seanjc@google.com> wrote:
>
> After dropping mmu_lock in the TDP MMU, restart the iterator during
> tdp_iter_next() and do not advance the iterator.  Advancing the iterator
> results in skipping the top-level SPTE and all its children, which is
> fatal if any of the skipped SPTEs were not visited before yielding.
>
> When zapping all SPTEs, i.e. when min_level == root_level, restarting the
> iter and then invoking tdp_iter_next() is always fatal if the current gfn
> has as a valid SPTE, as advancing the iterator results in try_step_side()
> skipping the current gfn, which wasn't visited before yielding.
>
> Sprinkle WARNs on iter->yielded being true in various helpers that are
> often used in conjunction with yielding, and tag the helper with
> __must_check to reduce the probabily of improper usage.
>
> Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
> SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
> the shadow page will be leaked and KVM will WARN accordingly.
>
>   WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
>   RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
>    kvm_destroy_vm+0x162/0x2a0 [kvm]
>    kvm_vcpu_release+0x34/0x60 [kvm]
>    __fput+0x82/0x240
>    task_work_run+0x5c/0x90
>    do_exit+0x364/0xa10
>    ? futex_unqueue+0x38/0x60
>    do_group_exit+0x33/0xa0
>    get_signal+0x155/0x850
>    arch_do_signal_or_restart+0xed/0x750
>    exit_to_user_mode_prepare+0xc5/0x120
>    syscall_exit_to_user_mode+0x1d/0x40
>    do_syscall_64+0x48/0xc0
>    entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
> kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
> marking a struct page as dirty/accessed after it has been put back on the
> free list.  This directly triggers a WARN due to encountering a page with
> page_count() == 0, but it can also lead to data corruption and additional
> errors in the kernel.
>
>   WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
>   RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
>    __handle_changed_spte+0x92e/0xca0 [kvm]
>    __handle_changed_spte+0x63c/0xca0 [kvm]
>    __handle_changed_spte+0x63c/0xca0 [kvm]
>    __handle_changed_spte+0x63c/0xca0 [kvm]
>    zap_gfn_range+0x549/0x620 [kvm]
>    kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
>    mmu_free_root_page+0x219/0x2c0 [kvm]
>    kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
>    kvm_mmu_unload+0x1c/0xa0 [kvm]
>    kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
>    kvm_put_kvm+0x3b1/0x8b0 [kvm]
>    kvm_vcpu_release+0x4e/0x70 [kvm]
>    __fput+0x1f7/0x8c0
>    task_work_run+0xf8/0x1a0
>    do_exit+0x97b/0x2230
>    do_group_exit+0xda/0x2a0
>    get_signal+0x3be/0x1e50
>    arch_do_signal_or_restart+0x244/0x17f0
>    exit_to_user_mode_prepare+0xcb/0x120
>    syscall_exit_to_user_mode+0x1d/0x40
>    do_syscall_64+0x4d/0x90
>    entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM:
> x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
> tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
> incorrectly advance past a top-level entry when yielding on a lower-level
> entry.  But with respect to leaking shadow pages, the bug was introduced
> by yielding before processing the current gfn.
>
> Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
> callers could jump to their "retry" label.  The downside of that approach
> is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
> in the loop, and there's no easy way to enfornce that requirement.
>
> Ideally, KVM would handling the cond_resched() fully within the iterator
> macro (the code is actually quite clean) and avoid this entire class of
> bugs, but that is extremely difficult do while also supporting yielding
> after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
> SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
> bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
> may block operations on the SPTE for a significant amount of time.
>
> Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
> Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
> Reported-by: Ignat Korchagin <ignat@cloudflare.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Matlack <dmatlack@google.com>


> ---
>  arch/x86/kvm/mmu/tdp_iter.c |  6 ++++++
>  arch/x86/kvm/mmu/tdp_iter.h |  6 ++++++
>  arch/x86/kvm/mmu/tdp_mmu.c  | 29 ++++++++++++++++-------------
>  3 files changed, 28 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index b3ed302c1a35..caa96c270b95 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -26,6 +26,7 @@ static gfn_t round_gfn_for_level(gfn_t gfn, int level)
>   */
>  void tdp_iter_restart(struct tdp_iter *iter)
>  {
> +       iter->yielded = false;
>         iter->yielded_gfn = iter->next_last_level_gfn;
>         iter->level = iter->root_level;
>
> @@ -160,6 +161,11 @@ static bool try_step_up(struct tdp_iter *iter)
>   */
>  void tdp_iter_next(struct tdp_iter *iter)
>  {
> +       if (iter->yielded) {
> +               tdp_iter_restart(iter);
> +               return;
> +       }
> +
>         if (try_step_down(iter))
>                 return;
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index b1748b988d3a..e19cabbcb65c 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -45,6 +45,12 @@ struct tdp_iter {
>          * iterator walks off the end of the paging structure.
>          */
>         bool valid;
> +       /*
> +        * True if KVM dropped mmu_lock and yielded in the middle of a walk, in
> +        * which case tdp_iter_next() needs to restart the walk at the root
> +        * level instead of advancing to the next entry.
> +        */
> +       bool yielded;
>  };
>
>  /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b69e47e68307..7b1bc816b7c3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -502,6 +502,8 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>                                            struct tdp_iter *iter,
>                                            u64 new_spte)
>  {
> +       WARN_ON_ONCE(iter->yielded);
> +
>         lockdep_assert_held_read(&kvm->mmu_lock);
>
>         /*
> @@ -575,6 +577,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>                                       u64 new_spte, bool record_acc_track,
>                                       bool record_dirty_log)
>  {
> +       WARN_ON_ONCE(iter->yielded);
> +
>         lockdep_assert_held_write(&kvm->mmu_lock);
>
>         /*
> @@ -640,18 +644,19 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>   * If this function should yield and flush is set, it will perform a remote
>   * TLB flush before yielding.
>   *
> - * If this function yields, it will also reset the tdp_iter's walk over the
> - * paging structure and the calling function should skip to the next
> - * iteration to allow the iterator to continue its traversal from the
> - * paging structure root.
> + * If this function yields, iter->yielded is set and the caller must skip to
> + * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
> + * over the paging structures to allow the iterator to continue its traversal
> + * from the paging structure root.
>   *
> - * Return true if this function yielded and the iterator's traversal was reset.
> - * Return false if a yield was not needed.
> + * Returns true if this function yielded.
>   */
> -static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> -                                            struct tdp_iter *iter, bool flush,
> -                                            bool shared)
> +static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
> +                                                         struct tdp_iter *iter,
> +                                                         bool flush, bool shared)
>  {
> +       WARN_ON(iter->yielded);
> +
>         /* Ensure forward progress has been made before yielding. */
>         if (iter->next_last_level_gfn == iter->yielded_gfn)
>                 return false;
> @@ -671,12 +676,10 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
>
>                 WARN_ON(iter->gfn > iter->next_last_level_gfn);
>
> -               tdp_iter_restart(iter);
> -
> -               return true;
> +               iter->yielded = true;
>         }
>
> -       return false;
> +       return iter->yielded;
>  }
>
>  /*
> --
> 2.34.1.173.g76aa8bc2d0-goog
>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Don't advance iterator after restart due to yielding
  2021-12-14  3:35 [PATCH] KVM: x86/mmu: Don't advance iterator after restart due to yielding Sean Christopherson
  2021-12-14 11:50 ` Ignat Korchagin
  2021-12-15  0:37 ` David Matlack
@ 2021-12-15 17:13 ` Paolo Bonzini
  2 siblings, 0 replies; 4+ messages in thread
From: Paolo Bonzini @ 2021-12-15 17:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Ignat Korchagin

On 12/14/21 04:35, Sean Christopherson wrote:
> After dropping mmu_lock in the TDP MMU, restart the iterator during
> tdp_iter_next() and do not advance the iterator.  Advancing the iterator
> results in skipping the top-level SPTE and all its children, which is
> fatal if any of the skipped SPTEs were not visited before yielding.
> 
> When zapping all SPTEs, i.e. when min_level == root_level, restarting the
> iter and then invoking tdp_iter_next() is always fatal if the current gfn
> has as a valid SPTE, as advancing the iterator results in try_step_side()
> skipping the current gfn, which wasn't visited before yielding.
> 
> Sprinkle WARNs on iter->yielded being true in various helpers that are
> often used in conjunction with yielding, and tag the helper with
> __must_check to reduce the probabily of improper usage.
> 
> Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
> SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
> the shadow page will be leaked and KVM will WARN accordingly.
> 
>    WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
>    RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
>    Call Trace:
>     <TASK>
>     kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
>     kvm_destroy_vm+0x162/0x2a0 [kvm]
>     kvm_vcpu_release+0x34/0x60 [kvm]
>     __fput+0x82/0x240
>     task_work_run+0x5c/0x90
>     do_exit+0x364/0xa10
>     ? futex_unqueue+0x38/0x60
>     do_group_exit+0x33/0xa0
>     get_signal+0x155/0x850
>     arch_do_signal_or_restart+0xed/0x750
>     exit_to_user_mode_prepare+0xc5/0x120
>     syscall_exit_to_user_mode+0x1d/0x40
>     do_syscall_64+0x48/0xc0
>     entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
> kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
> marking a struct page as dirty/accessed after it has been put back on the
> free list.  This directly triggers a WARN due to encountering a page with
> page_count() == 0, but it can also lead to data corruption and additional
> errors in the kernel.
> 
>    WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
>    RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
>    Call Trace:
>     <TASK>
>     kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
>     __handle_changed_spte+0x92e/0xca0 [kvm]
>     __handle_changed_spte+0x63c/0xca0 [kvm]
>     __handle_changed_spte+0x63c/0xca0 [kvm]
>     __handle_changed_spte+0x63c/0xca0 [kvm]
>     zap_gfn_range+0x549/0x620 [kvm]
>     kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
>     mmu_free_root_page+0x219/0x2c0 [kvm]
>     kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
>     kvm_mmu_unload+0x1c/0xa0 [kvm]
>     kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
>     kvm_put_kvm+0x3b1/0x8b0 [kvm]
>     kvm_vcpu_release+0x4e/0x70 [kvm]
>     __fput+0x1f7/0x8c0
>     task_work_run+0xf8/0x1a0
>     do_exit+0x97b/0x2230
>     do_group_exit+0xda/0x2a0
>     get_signal+0x3be/0x1e50
>     arch_do_signal_or_restart+0x244/0x17f0
>     exit_to_user_mode_prepare+0xcb/0x120
>     syscall_exit_to_user_mode+0x1d/0x40
>     do_syscall_64+0x4d/0x90
>     entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM:
> x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
> tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
> incorrectly advance past a top-level entry when yielding on a lower-level
> entry.  But with respect to leaking shadow pages, the bug was introduced
> by yielding before processing the current gfn.
> 
> Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
> callers could jump to their "retry" label.  The downside of that approach
> is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
> in the loop, and there's no easy way to enfornce that requirement.
> 
> Ideally, KVM would handling the cond_resched() fully within the iterator
> macro (the code is actually quite clean) and avoid this entire class of
> bugs, but that is extremely difficult do while also supporting yielding
> after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
> SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
> bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
> may block operations on the SPTE for a significant amount of time.
> 
> Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
> Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
> Reported-by: Ignat Korchagin <ignat@cloudflare.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/mmu/tdp_iter.c |  6 ++++++
>   arch/x86/kvm/mmu/tdp_iter.h |  6 ++++++
>   arch/x86/kvm/mmu/tdp_mmu.c  | 29 ++++++++++++++++-------------
>   3 files changed, 28 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index b3ed302c1a35..caa96c270b95 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -26,6 +26,7 @@ static gfn_t round_gfn_for_level(gfn_t gfn, int level)
>    */
>   void tdp_iter_restart(struct tdp_iter *iter)
>   {
> +	iter->yielded = false;
>   	iter->yielded_gfn = iter->next_last_level_gfn;
>   	iter->level = iter->root_level;
>   
> @@ -160,6 +161,11 @@ static bool try_step_up(struct tdp_iter *iter)
>    */
>   void tdp_iter_next(struct tdp_iter *iter)
>   {
> +	if (iter->yielded) {
> +		tdp_iter_restart(iter);
> +		return;
> +	}
> +
>   	if (try_step_down(iter))
>   		return;
>   
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index b1748b988d3a..e19cabbcb65c 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -45,6 +45,12 @@ struct tdp_iter {
>   	 * iterator walks off the end of the paging structure.
>   	 */
>   	bool valid;
> +	/*
> +	 * True if KVM dropped mmu_lock and yielded in the middle of a walk, in
> +	 * which case tdp_iter_next() needs to restart the walk at the root
> +	 * level instead of advancing to the next entry.
> +	 */
> +	bool yielded;
>   };
>   
>   /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b69e47e68307..7b1bc816b7c3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -502,6 +502,8 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>   					   struct tdp_iter *iter,
>   					   u64 new_spte)
>   {
> +	WARN_ON_ONCE(iter->yielded);
> +
>   	lockdep_assert_held_read(&kvm->mmu_lock);
>   
>   	/*
> @@ -575,6 +577,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>   				      u64 new_spte, bool record_acc_track,
>   				      bool record_dirty_log)
>   {
> +	WARN_ON_ONCE(iter->yielded);
> +
>   	lockdep_assert_held_write(&kvm->mmu_lock);
>   
>   	/*
> @@ -640,18 +644,19 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>    * If this function should yield and flush is set, it will perform a remote
>    * TLB flush before yielding.
>    *
> - * If this function yields, it will also reset the tdp_iter's walk over the
> - * paging structure and the calling function should skip to the next
> - * iteration to allow the iterator to continue its traversal from the
> - * paging structure root.
> + * If this function yields, iter->yielded is set and the caller must skip to
> + * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
> + * over the paging structures to allow the iterator to continue its traversal
> + * from the paging structure root.
>    *
> - * Return true if this function yielded and the iterator's traversal was reset.
> - * Return false if a yield was not needed.
> + * Returns true if this function yielded.
>    */
> -static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> -					     struct tdp_iter *iter, bool flush,
> -					     bool shared)
> +static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
> +							  struct tdp_iter *iter,
> +							  bool flush, bool shared)
>   {
> +	WARN_ON(iter->yielded);
> +
>   	/* Ensure forward progress has been made before yielding. */
>   	if (iter->next_last_level_gfn == iter->yielded_gfn)
>   		return false;
> @@ -671,12 +676,10 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
>   
>   		WARN_ON(iter->gfn > iter->next_last_level_gfn);
>   
> -		tdp_iter_restart(iter);
> -
> -		return true;
> +		iter->yielded = true;
>   	}
>   
> -	return false;
> +	return iter->yielded;
>   }
>   
>   /*
> 

Queued, thnaks.

Paolo

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-12-15 17:14 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-14  3:35 [PATCH] KVM: x86/mmu: Don't advance iterator after restart due to yielding Sean Christopherson
2021-12-14 11:50 ` Ignat Korchagin
2021-12-15  0:37 ` David Matlack
2021-12-15 17:13 ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).