[22/24] kvm: x86/mmu: Flush TLBs after zap in TDP MMU PF handler
diff mbox series

Message ID 20210112181041.356734-23-bgardon@google.com
State New, archived
Headers show
Series
  • Allow parallel page faults with TDP MMU
Related show

Commit Message

Ben Gardon Jan. 12, 2021, 6:10 p.m. UTC
When the TDP MMU is allowed to handle page faults in parallel there is
the possiblity of a race where an SPTE is cleared and then imediately
replaced with a present SPTE pointing to a different PFN, before the
TLBs can be flushed. This race would violate architectural specs. Ensure
that the TLBs are flushed properly before other threads are allowed to
install any present value for the SPTE.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/spte.h    | 16 +++++++++-
 arch/x86/kvm/mmu/tdp_mmu.c | 62 ++++++++++++++++++++++++++++++++------
 2 files changed, 68 insertions(+), 10 deletions(-)

Comments

Sean Christopherson Jan. 21, 2021, 12:05 a.m. UTC | #1
On Tue, Jan 12, 2021, Ben Gardon wrote:
> When the TDP MMU is allowed to handle page faults in parallel there is
> the possiblity of a race where an SPTE is cleared and then imediately
> replaced with a present SPTE pointing to a different PFN, before the
> TLBs can be flushed. This race would violate architectural specs. Ensure
> that the TLBs are flushed properly before other threads are allowed to
> install any present value for the SPTE.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/spte.h    | 16 +++++++++-
>  arch/x86/kvm/mmu/tdp_mmu.c | 62 ++++++++++++++++++++++++++++++++------
>  2 files changed, 68 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 2b3a30bd38b0..ecd9bfbccef4 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -130,6 +130,20 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
>  					  PT64_EPT_EXECUTABLE_MASK)
>  #define SHADOW_ACC_TRACK_SAVED_BITS_SHIFT PT64_SECOND_AVAIL_BITS_SHIFT
>  
> +/*
> + * If a thread running without exclusive control of the MMU lock must perform a
> + * multi-part operation on an SPTE, it can set the SPTE to FROZEN_SPTE as a
> + * non-present intermediate value. This will guarantee that other threads will
> + * not modify the spte.
> + *
> + * This constant works because it is considered non-present on both AMD and
> + * Intel CPUs and does not create a L1TF vulnerability because the pfn section
> + * is zeroed out.
> + *
> + * Only used by the TDP MMU.
> + */
> +#define FROZEN_SPTE (1ull << 59)

I dislike FROZEN, for similar reasons that I disliked "disconnected".  The SPTE
isn't frozen in the sense that it's temporarily immutable, rather it's been
removed but hasn't been flushed and so can't yet be reused.  Given that
FROZEN_SPTEs are treated as not-preset SPTEs, there's zero chance that this can
be extended in the future to be a generic temporarily freeze mechanism.

Mabye REMOVED_SPTE to match earlier feedback?

> +
>  /*
>   * In some cases, we need to preserve the GFN of a non-present or reserved
>   * SPTE when we usurp the upper five bits of the physical address space to
> @@ -187,7 +201,7 @@ static inline bool is_access_track_spte(u64 spte)
>  
>  static inline int is_shadow_present_pte(u64 pte)

Waaaay off topic, I'm going to send a patch to have this, and any other pte
helpers that return an int, return a bool.  While futzing around with ideas I
managed to turn this into a nop by doing

	return pte & SPTE_PRESENT;

which is guaranteed to be 0 if SPTE_PRESENT is a bit > 31.  I'm sure others will
point out that I'm a heathen for not doing !!(pte & SPTE_PRESENT), but still...

>  {
> -	return (pte != 0) && !is_mmio_spte(pte);
> +	return (pte != 0) && !is_mmio_spte(pte) && (pte != FROZEN_SPTE);

For all other checks, I'd strongly prefer to add a helper, e.g. is_removed_spte()
or whatever.  That way changing the implementation won't be as painful, and we
can add assertions and whatnot if we break things.  Especially since FROZEN_SPTE
is a single bit, which makes it look like a flag even though it's used as a full
64-bit constant.

For this, I worry that is_shadow_present_pte() is getting bloated.  It's also a
bit unfortunate that it's bloated for the old MMU, without any benefit. That
being said, most that bloat is from the existing MMIO checks.  Looking
elsewhere, TDX's SEPT also has a similar concept that may or may not need to
hook is_shadow_present_pte().

Rather than bundle MMIO SPTEs into the access-tracking flags and have a bunch of
special cases for not-present SPTEs, what if we add an explicit flag to mark
SPTEs as present (or not-present)?  Defining SPTE_PRESENT instead of
SPTE_NOT_PRESENT might require a few more changes, but it would be the most
optimal for is_shadow_present_pte().

I'm thinking something like this (completely untested):

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index c51ad544f25b..86f6c84569c4 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -38,7 +38,7 @@ static u64 generation_mmio_spte_mask(u64 gen)
        u64 mask;

        WARN_ON(gen & ~MMIO_SPTE_GEN_MASK);
-       BUILD_BUG_ON((MMIO_SPTE_GEN_HIGH_MASK | MMIO_SPTE_GEN_LOW_MASK) & SPTE_SPECIAL_MASK);
+       BUILD_BUG_ON((MMIO_SPTE_GEN_HIGH_MASK | MMIO_SPTE_GEN_LOW_MASK) & SPTE_MMIO);

        mask = (gen << MMIO_SPTE_GEN_LOW_SHIFT) & MMIO_SPTE_GEN_LOW_MASK;
        mask |= (gen << MMIO_SPTE_GEN_HIGH_SHIFT) & MMIO_SPTE_GEN_HIGH_MASK;
@@ -86,7 +86,7 @@ int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
                     bool can_unsync, bool host_writable, bool ad_disabled,
                     u64 *new_spte)
 {
-       u64 spte = 0;
+       u64 spte = SPTE_PRESENT;
        int ret = 0;

        if (ad_disabled)
@@ -247,7 +247,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 access_mask)
        BUG_ON((u64)(unsigned)access_mask != access_mask);
        WARN_ON(mmio_value & (shadow_nonpresent_or_rsvd_mask << SHADOW_NONPRESENT_OR_RSVD_MASK_LEN));
        WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
-       shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
+       shadow_mmio_value = mmio_value | SPTE_MMIO;
        shadow_mmio_access_mask = access_mask;
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index ecd9bfbccef4..465e43d34034 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -5,18 +5,15 @@

 #include "mmu_internal.h"

+/* Software available bits for present SPTEs. */
 #define PT_FIRST_AVAIL_BITS_SHIFT 10
 #define PT64_SECOND_AVAIL_BITS_SHIFT 54

-/*
- * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
- * Access Tracking SPTEs.
- */
+/* The mask used to denote Access Tracking SPTEs.  Note, val=3 is available. */
 #define SPTE_SPECIAL_MASK (3ULL << 52)
 #define SPTE_AD_ENABLED_MASK (0ULL << 52)
 #define SPTE_AD_DISABLED_MASK (1ULL << 52)
 #define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
-#define SPTE_MMIO_MASK (3ULL << 52)

 #ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
 #define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
@@ -55,12 +52,16 @@
 #define SPTE_HOST_WRITEABLE    (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
 #define SPTE_MMU_WRITEABLE     (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))

+#define SPTE_REMOVED           BIT_ULL(60)
+#define SPTE_MMIO              BIT_ULL(61)
+#define SPTE_PRESENT           BIT_ULL(62)
+
 /*
  * Due to limited space in PTEs, the MMIO generation is a 18 bit subset of
  * the memslots generation and is derived as follows:
  *
  * Bits 0-8 of the MMIO generation are propagated to spte bits 3-11
- * Bits 9-17 of the MMIO generation are propagated to spte bits 54-62
+ * Bits 9-17 of the MMIO generation are propagated to spte bits 52-60
  *
  * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
  * the MMIO generation number, as doing so would require stealing a bit from
@@ -73,8 +74,8 @@
 #define MMIO_SPTE_GEN_LOW_START                3
 #define MMIO_SPTE_GEN_LOW_END          11

-#define MMIO_SPTE_GEN_HIGH_START       PT64_SECOND_AVAIL_BITS_SHIFT
-#define MMIO_SPTE_GEN_HIGH_END         62
+#define MMIO_SPTE_GEN_HIGH_START       52
+#define MMIO_SPTE_GEN_HIGH_END         60

 #define MMIO_SPTE_GEN_LOW_MASK         GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
                                                    MMIO_SPTE_GEN_LOW_START)
@@ -162,7 +163,7 @@ extern u8 __read_mostly shadow_phys_bits;

 static inline bool is_mmio_spte(u64 spte)
 {
-       return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
+       return spte & SPTE_MMIO;
 }

 static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
@@ -199,9 +200,9 @@ static inline bool is_access_track_spte(u64 spte)
        return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
 }

-static inline int is_shadow_present_pte(u64 pte)
+static inline bool is_shadow_present_pte(u64 pte)
 {
-       return (pte != 0) && !is_mmio_spte(pte) && (pte != FROZEN_SPTE);
+       return pte & SPTE_PRESENT;
 }

 static inline int is_large_pte(u64 pte)


>  }
>  
>  static inline int is_large_pte(u64 pte)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7b12a87a4124..5c9d053000ad 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -429,15 +429,19 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  	 */
>  	if (!was_present && !is_present) {
>  		/*
> -		 * If this change does not involve a MMIO SPTE, it is
> -		 * unexpected. Log the change, though it should not impact the
> -		 * guest since both the former and current SPTEs are nonpresent.
> +		 * If this change does not involve a MMIO SPTE or FROZEN_SPTE,

For comments and error message, I think we should avoid using the exact constant
name, and instead call them "removed SPTE", similar to MMIO SPTE.  That will
help reduce thrash and/or stale comments if the name changes.

> +		 * it is unexpected. Log the change, though it should not
> +		 * impact the guest since both the former and current SPTEs
> +		 * are nonpresent.
>  		 */
> -		if (WARN_ON(!is_mmio_spte(old_spte) && !is_mmio_spte(new_spte)))
> +		if (WARN_ON(!is_mmio_spte(old_spte) &&
> +			    !is_mmio_spte(new_spte) &&
> +			    new_spte != FROZEN_SPTE))
>  			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
>  			       "should not be replaced with another,\n"
>  			       "different nonpresent SPTE, unless one or both\n"
> -			       "are MMIO SPTEs.\n"
> +			       "are MMIO SPTEs, or the new SPTE is\n"
> +			       "FROZEN_SPTE.\n"
>  			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
>  			       as_id, gfn, old_spte, new_spte, level);
>  		return;

Patch
diff mbox series

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 2b3a30bd38b0..ecd9bfbccef4 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -130,6 +130,20 @@  extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
 					  PT64_EPT_EXECUTABLE_MASK)
 #define SHADOW_ACC_TRACK_SAVED_BITS_SHIFT PT64_SECOND_AVAIL_BITS_SHIFT
 
+/*
+ * If a thread running without exclusive control of the MMU lock must perform a
+ * multi-part operation on an SPTE, it can set the SPTE to FROZEN_SPTE as a
+ * non-present intermediate value. This will guarantee that other threads will
+ * not modify the spte.
+ *
+ * This constant works because it is considered non-present on both AMD and
+ * Intel CPUs and does not create a L1TF vulnerability because the pfn section
+ * is zeroed out.
+ *
+ * Only used by the TDP MMU.
+ */
+#define FROZEN_SPTE (1ull << 59)
+
 /*
  * In some cases, we need to preserve the GFN of a non-present or reserved
  * SPTE when we usurp the upper five bits of the physical address space to
@@ -187,7 +201,7 @@  static inline bool is_access_track_spte(u64 spte)
 
 static inline int is_shadow_present_pte(u64 pte)
 {
-	return (pte != 0) && !is_mmio_spte(pte);
+	return (pte != 0) && !is_mmio_spte(pte) && (pte != FROZEN_SPTE);
 }
 
 static inline int is_large_pte(u64 pte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7b12a87a4124..5c9d053000ad 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -429,15 +429,19 @@  static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	 */
 	if (!was_present && !is_present) {
 		/*
-		 * If this change does not involve a MMIO SPTE, it is
-		 * unexpected. Log the change, though it should not impact the
-		 * guest since both the former and current SPTEs are nonpresent.
+		 * If this change does not involve a MMIO SPTE or FROZEN_SPTE,
+		 * it is unexpected. Log the change, though it should not
+		 * impact the guest since both the former and current SPTEs
+		 * are nonpresent.
 		 */
-		if (WARN_ON(!is_mmio_spte(old_spte) && !is_mmio_spte(new_spte)))
+		if (WARN_ON(!is_mmio_spte(old_spte) &&
+			    !is_mmio_spte(new_spte) &&
+			    new_spte != FROZEN_SPTE))
 			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
 			       "should not be replaced with another,\n"
 			       "different nonpresent SPTE, unless one or both\n"
-			       "are MMIO SPTEs.\n"
+			       "are MMIO SPTEs, or the new SPTE is\n"
+			       "FROZEN_SPTE.\n"
 			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
 			       as_id, gfn, old_spte, new_spte, level);
 		return;
@@ -488,6 +492,13 @@  static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 
 	kvm_mmu_lock_assert_held_shared(kvm);
 
+	/*
+	 * Do not change FROZEN_SPTEs. Only the thread that froze the SPTE
+	 * may modify it.
+	 */
+	if (iter->old_spte == FROZEN_SPTE)
+		return false;
+
 	if (cmpxchg64(iter->sptep, iter->old_spte, new_spte) != iter->old_spte)
 		return false;
 
@@ -497,6 +508,34 @@  static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	return true;
 }
 
+static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
+					   struct tdp_iter *iter)
+{
+	/*
+	 * Freeze the SPTE by setting it to a special,
+	 * non-present value. This will stop other threads from
+	 * immediately installing a present entry in its place
+	 * before the TLBs are flushed.
+	 */
+	if (!tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE))
+		return false;
+
+	kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
+					   KVM_PAGES_PER_HPAGE(iter->level));
+
+	/*
+	 * No other thread can overwrite the frozen SPTE as they
+	 * must either wait on the MMU lock or use
+	 * tdp_mmu_set_spte_atomic which will not overrite the
+	 * special frozen SPTE value. No bookkeeping is needed
+	 * here since the SPTE is going from non-present
+	 * to non-present.
+	 */
+	WRITE_ONCE(*iter->sptep, 0);
+
+	return true;
+}
+
 
 /*
  * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
@@ -524,6 +563,14 @@  static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 
 	kvm_mmu_lock_assert_held_exclusive(kvm);
 
+	/*
+	 * No thread should be using this function to set SPTEs to FROZEN_SPTE.
+	 * If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
+	 * should be used. If operating under the MMU lock in write mode, the
+	 * use of FROZEN_SPTE should not be necessary.
+	 */
+	WARN_ON(iter->old_spte == FROZEN_SPTE);
+
 	WRITE_ONCE(*iter->sptep, new_spte);
 
 	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
@@ -801,12 +848,9 @@  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		 */
 		if (is_shadow_present_pte(iter.old_spte) &&
 		    is_large_pte(iter.old_spte)) {
-			if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, 0))
+			if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
 				break;
 
-			kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
-					KVM_PAGES_PER_HPAGE(iter.level));
-
 			/*
 			 * The iter must explicitly re-read the spte here
 			 * because the new value informs the !present