From: Sean Christopherson <seanjc@google.com>
To: Ben Gardon <bgardon@google.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
Paolo Bonzini <pbonzini@redhat.com>, Peter Xu <peterx@redhat.com>,
Peter Shier <pshier@google.com>,
Peter Feiner <pfeiner@google.com>,
Junaid Shahid <junaids@google.com>,
Jim Mattson <jmattson@google.com>,
Yulei Zhang <yulei.kernel@gmail.com>,
Wanpeng Li <kernellwp@gmail.com>,
Vitaly Kuznetsov <vkuznets@redhat.com>,
Xiao Guangrong <xiaoguangrong.eric@gmail.com>
Subject: Re: [PATCH 22/24] kvm: x86/mmu: Flush TLBs after zap in TDP MMU PF handler
Date: Wed, 20 Jan 2021 16:05:26 -0800
Message-ID: <YAjFRoCPB9anInnj@google.com>
In-Reply-To: <20210112181041.356734-23-bgardon@google.com>
On Tue, Jan 12, 2021, Ben Gardon wrote:
> When the TDP MMU is allowed to handle page faults in parallel there is
> the possibility of a race where an SPTE is cleared and then immediately
> replaced with a present SPTE pointing to a different PFN, before the
> TLBs can be flushed. This race would violate architectural specs. Ensure
> that the TLBs are flushed properly before other threads are allowed to
> install any present value for the SPTE.
>
> Reviewed-by: Peter Feiner <pfeiner@google.com>
>
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
> arch/x86/kvm/mmu/spte.h | 16 +++++++++-
> arch/x86/kvm/mmu/tdp_mmu.c | 62 ++++++++++++++++++++++++++++++++------
> 2 files changed, 68 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 2b3a30bd38b0..ecd9bfbccef4 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -130,6 +130,20 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> PT64_EPT_EXECUTABLE_MASK)
> #define SHADOW_ACC_TRACK_SAVED_BITS_SHIFT PT64_SECOND_AVAIL_BITS_SHIFT
>
> +/*
> + * If a thread running without exclusive control of the MMU lock must perform a
> + * multi-part operation on an SPTE, it can set the SPTE to FROZEN_SPTE as a
> + * non-present intermediate value. This will guarantee that other threads will
> + * not modify the spte.
> + *
> + * This constant works because it is considered non-present on both AMD and
> + * Intel CPUs and does not create a L1TF vulnerability because the pfn section
> + * is zeroed out.
> + *
> + * Only used by the TDP MMU.
> + */
> +#define FROZEN_SPTE (1ull << 59)
I dislike FROZEN, for similar reasons that I disliked "disconnected". The SPTE
isn't frozen in the sense that it's temporarily immutable, rather it's been
removed but hasn't been flushed and so can't yet be reused. Given that
FROZEN_SPTEs are treated as not-present SPTEs, there's zero chance that this can
be extended in the future to be a generic temporary freeze mechanism.
Maybe REMOVED_SPTE to match earlier feedback?
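To make the "removed but not yet flushed" ordering concrete, here is a minimal user-space sketch built on C11 atomics. It is inferred from the changelog and the comment above, not taken from the patch: try_set_spte(), zap_spte() and flush_tlbs_stub() are hypothetical names, and the flush is a stub.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value the patch uses for FROZEN_SPTE; the name is only a suggestion. */
#define REMOVED_SPTE (1ULL << 59)

/* Hypothetical stand-in for a remote TLB flush. */
static void flush_tlbs_stub(void)
{
}

/*
 * Install new_spte only if the slot still holds old_spte. A slot that is
 * mid-removal is never touched, so no other thread can install a present
 * value before the remover has flushed.
 */
static bool try_set_spte(_Atomic uint64_t *sptep, uint64_t old_spte,
			 uint64_t new_spte)
{
	if (old_spte == REMOVED_SPTE)
		return false;

	return atomic_compare_exchange_strong(sptep, &old_spte, new_spte);
}

/* Zap: mark the slot removed, flush, and only then make it reusable. */
static bool zap_spte(_Atomic uint64_t *sptep, uint64_t old_spte)
{
	if (!try_set_spte(sptep, old_spte, REMOVED_SPTE))
		return false;

	flush_tlbs_stub();

	/* Only the thread that won the cmpxchg reaches this store. */
	atomic_store(sptep, 0);
	return true;
}
```

The point of the sketch is that the intermediate value marks a window in which the entry is gone but possibly still cached, which is why "removed" describes it better than "frozen".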
> +
> /*
> * In some cases, we need to preserve the GFN of a non-present or reserved
> * SPTE when we usurp the upper five bits of the physical address space to
> @@ -187,7 +201,7 @@ static inline bool is_access_track_spte(u64 spte)
>
> static inline int is_shadow_present_pte(u64 pte)
Waaaay off topic, I'm going to send a patch to have this, and any other pte
helpers that return an int, return a bool. While futzing around with ideas I
managed to turn this into a nop by doing
return pte & SPTE_PRESENT;
which is guaranteed to be 0 if SPTE_PRESENT is a bit > 31. I'm sure others will
point out that I'm a heathen for not doing !!(pte & SPTE_PRESENT), but still...
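A small user-space sketch of that truncation, assuming a compiler such as gcc or clang, where converting an out-of-range u64 to int keeps only the low 32 bits (the C standard leaves the result implementation-defined). The helper names are made up for illustration; SPTE_PRESENT uses bit 62 as in the diff further down.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: a present flag that lives above bit 31. */
#define SPTE_PRESENT (1ULL << 62)

/* int return: the 64-bit result is converted to int, losing bit 62. */
static inline int is_present_int(uint64_t pte)
{
	return pte & SPTE_PRESENT;
}

/* bool return: any non-zero value converts to true. */
static inline bool is_present_bool(uint64_t pte)
{
	return pte & SPTE_PRESENT;
}

/* int return with explicit normalization to 0 or 1. */
static inline int is_present_int_normalized(uint64_t pte)
{
	return !!(pte & SPTE_PRESENT);
}
```

With the bool return type the conversion is done by the language, so the !! is only needed when the helper has to keep returning an int.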
> {
> - return (pte != 0) && !is_mmio_spte(pte);
> + return (pte != 0) && !is_mmio_spte(pte) && (pte != FROZEN_SPTE);
For all other checks, I'd strongly prefer to add a helper, e.g. is_removed_spte()
or whatever. That way changing the implementation won't be as painful, and we
can add assertions and whatnot if we break things. Especially since FROZEN_SPTE
is a single bit, which makes it look like a flag even though it's used as a full
64-bit constant.
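Something along these lines, as a rough user-space sketch. REMOVED_SPTE and is_removed_spte() are placeholder names, and the value is simply the one the patch uses for FROZEN_SPTE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder name; same value as FROZEN_SPTE in the patch. */
#define REMOVED_SPTE (1ULL << 59)

/*
 * Compare the whole 64-bit value rather than testing bit 59, since the
 * constant is a complete SPTE value and not a flag. Callers do not
 * open-code the comparison, so the encoding can change in one place.
 */
static inline bool is_removed_spte(uint64_t spte)
{
	return spte == REMOVED_SPTE;
}
```

An SPTE that merely happens to have bit 59 set alongside other bits is not reported as removed, which a flag-style test would get wrong.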
For this, I worry that is_shadow_present_pte() is getting bloated. It's also a
bit unfortunate that it's bloated for the old MMU, without any benefit. That
being said, most of that bloat is from the existing MMIO checks. Looking
elsewhere, TDX's SEPT also has a similar concept that may or may not need to
hook is_shadow_present_pte().
Rather than bundle MMIO SPTEs into the access-tracking flags and have a bunch of
special cases for not-present SPTEs, what if we add an explicit flag to mark
SPTEs as present (or not-present)? Defining SPTE_PRESENT instead of
SPTE_NOT_PRESENT might require a few more changes, but it would be the most
optimal for is_shadow_present_pte().
I'm thinking something like this (completely untested):
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index c51ad544f25b..86f6c84569c4 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -38,7 +38,7 @@ static u64 generation_mmio_spte_mask(u64 gen)
u64 mask;
WARN_ON(gen & ~MMIO_SPTE_GEN_MASK);
- BUILD_BUG_ON((MMIO_SPTE_GEN_HIGH_MASK | MMIO_SPTE_GEN_LOW_MASK) & SPTE_SPECIAL_MASK);
+ BUILD_BUG_ON((MMIO_SPTE_GEN_HIGH_MASK | MMIO_SPTE_GEN_LOW_MASK) & SPTE_MMIO);
mask = (gen << MMIO_SPTE_GEN_LOW_SHIFT) & MMIO_SPTE_GEN_LOW_MASK;
mask |= (gen << MMIO_SPTE_GEN_HIGH_SHIFT) & MMIO_SPTE_GEN_HIGH_MASK;
@@ -86,7 +86,7 @@ int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
bool can_unsync, bool host_writable, bool ad_disabled,
u64 *new_spte)
{
- u64 spte = 0;
+ u64 spte = SPTE_PRESENT;
int ret = 0;
if (ad_disabled)
@@ -247,7 +247,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 access_mask)
BUG_ON((u64)(unsigned)access_mask != access_mask);
WARN_ON(mmio_value & (shadow_nonpresent_or_rsvd_mask << SHADOW_NONPRESENT_OR_RSVD_MASK_LEN));
WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
- shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
+ shadow_mmio_value = mmio_value | SPTE_MMIO;
shadow_mmio_access_mask = access_mask;
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index ecd9bfbccef4..465e43d34034 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -5,18 +5,15 @@
#include "mmu_internal.h"
+/* Software available bits for present SPTEs. */
#define PT_FIRST_AVAIL_BITS_SHIFT 10
#define PT64_SECOND_AVAIL_BITS_SHIFT 54
-/*
- * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
- * Access Tracking SPTEs.
- */
+/* The mask used to denote Access Tracking SPTEs. Note, val=3 is available. */
#define SPTE_SPECIAL_MASK (3ULL << 52)
#define SPTE_AD_ENABLED_MASK (0ULL << 52)
#define SPTE_AD_DISABLED_MASK (1ULL << 52)
#define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
-#define SPTE_MMIO_MASK (3ULL << 52)
#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
@@ -55,12 +52,16 @@
#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
#define SPTE_MMU_WRITEABLE (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
+#define SPTE_REMOVED BIT_ULL(60)
+#define SPTE_MMIO BIT_ULL(61)
+#define SPTE_PRESENT BIT_ULL(62)
+
/*
* Due to limited space in PTEs, the MMIO generation is a 18 bit subset of
* the memslots generation and is derived as follows:
*
* Bits 0-8 of the MMIO generation are propagated to spte bits 3-11
- * Bits 9-17 of the MMIO generation are propagated to spte bits 54-62
+ * Bits 9-17 of the MMIO generation are propagated to spte bits 52-60
*
* The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
* the MMIO generation number, as doing so would require stealing a bit from
@@ -73,8 +74,8 @@
#define MMIO_SPTE_GEN_LOW_START 3
#define MMIO_SPTE_GEN_LOW_END 11
-#define MMIO_SPTE_GEN_HIGH_START PT64_SECOND_AVAIL_BITS_SHIFT
-#define MMIO_SPTE_GEN_HIGH_END 62
+#define MMIO_SPTE_GEN_HIGH_START 52
+#define MMIO_SPTE_GEN_HIGH_END 60
#define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
MMIO_SPTE_GEN_LOW_START)
@@ -162,7 +163,7 @@ extern u8 __read_mostly shadow_phys_bits;
static inline bool is_mmio_spte(u64 spte)
{
- return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
+ return spte & SPTE_MMIO;
}
static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
@@ -199,9 +200,9 @@ static inline bool is_access_track_spte(u64 spte)
return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
}
-static inline int is_shadow_present_pte(u64 pte)
+static inline bool is_shadow_present_pte(u64 pte)
{
- return (pte != 0) && !is_mmio_spte(pte) && (pte != FROZEN_SPTE);
+ return pte & SPTE_PRESENT;
}
static inline int is_large_pte(u64 pte)
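As a sanity check on the idea, here is a stand-alone user-space model of the flag layout from the diff above. It is just as untested against real KVM as the diff itself, and make_present_spte() is a hypothetical stand-in for the `u64 spte = SPTE_PRESENT` initialization in make_spte().

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;

/* Bit positions copied from the sketch above. */
#define SPTE_REMOVED (1ULL << 60)
#define SPTE_MMIO    (1ULL << 61)
#define SPTE_PRESENT (1ULL << 62)

static inline bool is_mmio_spte(u64 spte)
{
	return spte & SPTE_MMIO;
}

/* One bit test replaces the chain of not-present special cases. */
static inline bool is_shadow_present_pte(u64 pte)
{
	return pte & SPTE_PRESENT;
}

/* Hypothetical helper: every present SPTE starts with SPTE_PRESENT set. */
static inline u64 make_present_spte(u64 bits)
{
	return SPTE_PRESENT | bits;
}
```

Zero, MMIO and removed values are all not-present without being named in the check, so a new not-present type would not have to touch is_shadow_present_pte().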
> }
>
> static inline int is_large_pte(u64 pte)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7b12a87a4124..5c9d053000ad 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -429,15 +429,19 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> */
> if (!was_present && !is_present) {
> /*
> - * If this change does not involve a MMIO SPTE, it is
> - * unexpected. Log the change, though it should not impact the
> - * guest since both the former and current SPTEs are nonpresent.
> + * If this change does not involve a MMIO SPTE or FROZEN_SPTE,
For comments and error messages, I think we should avoid using the exact constant
name, and instead call them "removed SPTE", similar to MMIO SPTE. That will
help reduce thrash and/or stale comments if the name changes.
> + * it is unexpected. Log the change, though it should not
> + * impact the guest since both the former and current SPTEs
> + * are nonpresent.
> */
> - if (WARN_ON(!is_mmio_spte(old_spte) && !is_mmio_spte(new_spte)))
> + if (WARN_ON(!is_mmio_spte(old_spte) &&
> + !is_mmio_spte(new_spte) &&
> + new_spte != FROZEN_SPTE))
> pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
> "should not be replaced with another,\n"
> "different nonpresent SPTE, unless one or both\n"
> - "are MMIO SPTEs.\n"
> + "are MMIO SPTEs, or the new SPTE is\n"
> + "FROZEN_SPTE.\n"
> "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
> as_id, gfn, old_spte, new_spte, level);
> return;