All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Junaid Shahid <junaids@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH 5.19 13/38] kvm: x86: mmu: Always flush TLBs when enabling dirty logging
Date: Fri, 16 Sep 2022 12:08:47 +0200	[thread overview]
Message-ID: <20220916100449.030118113@linuxfoundation.org> (raw)
In-Reply-To: <20220916100448.431016349@linuxfoundation.org>

From: Junaid Shahid <junaids@google.com>

[ Upstream commit b64d740ea7ddc929d97b28de4c0665f7d5db9e2a ]

When A/D bits are not available, KVM uses a software access tracking
mechanism, which involves making the SPTEs inaccessible. However,
the clear_young() MMU notifier does not flush TLBs. So it is possible
that there may still be stale, potentially writable, TLB entries.
This is usually fine, but can be problematic when enabling dirty
logging, because it currently only does a TLB flush if any SPTEs were
modified. But if all SPTEs are in access-tracked state, then there
won't be a TLB flush, which means that the guest could still possibly
write to memory and not have it reflected in the dirty bitmap.

So just unconditionally flush the TLBs when enabling dirty logging.
As an alternative, KVM could explicitly check the MMU-Writable bit when
write-protecting SPTEs to decide if a flush is needed (instead of
checking the Writable bit), but given that a flush almost always happens
anyway, so just making it unconditional seems simpler.

Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20220810224939.2611160-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/kvm/mmu/mmu.c  | 45 +++++++----------------------------------
 arch/x86/kvm/mmu/spte.h | 14 +++++++++----
 arch/x86/kvm/x86.c      | 44 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 356226c7ebbdc..aa1ba803659cd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5907,47 +5907,18 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      const struct kvm_memory_slot *memslot,
 				      int start_level)
 {
-	bool flush = false;
-
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
-					  start_level, KVM_MAX_HUGEPAGE_LEVEL,
-					  false);
+		slot_handle_level(kvm, memslot, slot_rmap_write_protect,
+				  start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
 		write_unlock(&kvm->mmu_lock);
 	}
 
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
-		flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
+		kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
 		read_unlock(&kvm->mmu_lock);
 	}
-
-	/*
-	 * Flush TLBs if any SPTEs had to be write-protected to ensure that
-	 * guest writes are reflected in the dirty bitmap before the memslot
-	 * update completes, i.e. before enabling dirty logging is visible to
-	 * userspace.
-	 *
-	 * Perform the TLB flush outside the mmu_lock to reduce the amount of
-	 * time the lock is held. However, this does mean that another CPU can
-	 * now grab mmu_lock and encounter a write-protected SPTE while CPUs
-	 * still have a writable mapping for the associated GFN in their TLB.
-	 *
-	 * This is safe but requires KVM to be careful when making decisions
-	 * based on the write-protection status of an SPTE. Specifically, KVM
-	 * also write-protects SPTEs to monitor changes to guest page tables
-	 * during shadow paging, and must guarantee no CPUs can write to those
-	 * page before the lock is dropped. As mentioned in the previous
-	 * paragraph, a write-protected SPTE is no guarantee that CPU cannot
-	 * perform writes. So to determine if a TLB flush is truly required, KVM
-	 * will clear a separate software-only bit (MMU-writable) and skip the
-	 * flush if-and-only-if this bit was already clear.
-	 *
-	 * See is_writable_pte() for more details.
-	 */
-	if (flush)
-		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
 /* Must be called with the mmu_lock held in write-mode. */
@@ -6070,32 +6041,30 @@ void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot)
 {
-	bool flush = false;
-
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		/*
 		 * Clear dirty bits only on 4k SPTEs since the legacy MMU only
 		 * support dirty logging at a 4k granularity.
 		 */
-		flush = slot_handle_level_4k(kvm, memslot, __rmap_clear_dirty, false);
+		slot_handle_level_4k(kvm, memslot, __rmap_clear_dirty, false);
 		write_unlock(&kvm->mmu_lock);
 	}
 
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
-		flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
+		kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
 		read_unlock(&kvm->mmu_lock);
 	}
 
 	/*
+	 * The caller will flush the TLBs after this function returns.
+	 *
 	 * It's also safe to flush TLBs out of mmu lock here as currently this
 	 * function is only used for dirty logging, in which case flushing TLB
 	 * out of mmu lock also guarantees no dirty pages will be lost in
 	 * dirty_bitmap.
 	 */
-	if (flush)
-		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
 void kvm_mmu_zap_all(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index f80dbb628df57..e09bdcf1e47c5 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -326,7 +326,7 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
 }
 
 /*
- * An shadow-present leaf SPTE may be non-writable for 3 possible reasons:
+ * A shadow-present leaf SPTE may be non-writable for 4 possible reasons:
  *
  *  1. To intercept writes for dirty logging. KVM write-protects huge pages
  *     so that they can be split be split down into the dirty logging
@@ -344,8 +344,13 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
  *     read-only memslot or guest memory backed by a read-only VMA. Writes to
  *     such pages are disallowed entirely.
  *
- * To keep track of why a given SPTE is write-protected, KVM uses 2
- * software-only bits in the SPTE:
+ *  4. To emulate the Accessed bit for SPTEs without A/D bits.  Note, in this
+ *     case, the SPTE is access-protected, not just write-protected!
+ *
+ * For cases #1 and #4, KVM can safely make such SPTEs writable without taking
+ * mmu_lock as capturing the Accessed/Dirty state doesn't require taking it.
+ * To differentiate #1 and #4 from #2 and #3, KVM uses two software-only bits
+ * in the SPTE:
  *
  *  shadow_mmu_writable_mask, aka MMU-writable -
  *    Cleared on SPTEs that KVM is currently write-protecting for shadow paging
@@ -374,7 +379,8 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
  * shadow page tables between vCPUs. Write-protecting an SPTE for dirty logging
  * (which does not clear the MMU-writable bit), does not flush TLBs before
  * dropping the lock, as it only needs to synchronize guest writes with the
- * dirty bitmap.
+ * dirty bitmap. Similarly, making the SPTE inaccessible (and non-writable) for
+ * access-tracking via the clear_young() MMU notifier also does not flush TLBs.
  *
  * So, there is the problem: clearing the MMU-writable bit can encounter a
  * write-protected SPTE while CPUs still have writable mappings for that SPTE
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 55de0d1981e52..5b36866528568 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12265,6 +12265,50 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		} else {
 			kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_4K);
 		}
+
+		/*
+		 * Unconditionally flush the TLBs after enabling dirty logging.
+		 * A flush is almost always going to be necessary (see below),
+		 * and unconditionally flushing allows the helpers to omit
+		 * the subtly complex checks when removing write access.
+		 *
+		 * Do the flush outside of mmu_lock to reduce the amount of
+		 * time mmu_lock is held.  Flushing after dropping mmu_lock is
+		 * safe as KVM only needs to guarantee the slot is fully
+		 * write-protected before returning to userspace, i.e. before
+		 * userspace can consume the dirty status.
+		 *
+		 * Flushing outside of mmu_lock requires KVM to be careful when
+		 * making decisions based on writable status of an SPTE, e.g. a
+		 * !writable SPTE doesn't guarantee a CPU can't perform writes.
+		 *
+		 * Specifically, KVM also write-protects guest page tables to
+		 * monitor changes when using shadow paging, and must guarantee
+		 * no CPUs can write to those page before mmu_lock is dropped.
+		 * Because CPUs may have stale TLB entries at this point, a
+		 * !writable SPTE doesn't guarantee CPUs can't perform writes.
+		 *
+		 * KVM also allows making SPTES writable outside of mmu_lock,
+		 * e.g. to allow dirty logging without taking mmu_lock.
+		 *
+		 * To handle these scenarios, KVM uses a separate software-only
+		 * bit (MMU-writable) to track if a SPTE is !writable due to
+		 * a guest page table being write-protected (KVM clears the
+		 * MMU-writable flag when write-protecting for shadow paging).
+		 *
+		 * The use of MMU-writable is also the primary motivation for
+		 * the unconditional flush.  Because KVM must guarantee that a
+		 * CPU doesn't contain stale, writable TLB entries for a
+		 * !MMU-writable SPTE, KVM must flush if it encounters any
+		 * MMU-writable SPTE regardless of whether the actual hardware
+		 * writable bit was set.  I.e. KVM is almost guaranteed to need
+		 * to flush, while unconditionally flushing allows the "remove
+		 * write access" helpers to ignore MMU-writable entirely.
+		 *
+		 * See is_writable_pte() for more details (the case involving
+		 * access-tracked SPTEs is particularly relevant).
+		 */
+		kvm_arch_flush_remote_tlbs_memslot(kvm, new);
 	}
 }
 
-- 
2.35.1




  parent reply	other threads:[~2022-09-16 10:22 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-16 10:08 [PATCH 5.19 00/38] 5.19.10-rc1 review Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 01/38] iommu/vt-d: Fix kdump kernels boot failure with scalable mode Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 02/38] net/mlx5: Introduce ifc bits for using software vhca id Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 03/38] net/mlx5: Use software VHCA id when its supported Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 04/38] RDMA/mlx5: Rely on RoCE fw cap instead of devlink when setting profile Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 05/38] RDMA/mlx5: Add a umr recovery flow Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 06/38] RDMA/mlx5: Fix UMR cleanup on error flow of driver init Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 07/38] ACPI: resource: skip IRQ override on AMD Zen platforms Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 08/38] Input: goodix - add support for GT1158 Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 09/38] platform/surface: aggregator_registry: Add support for Surface Laptop Go 2 Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 10/38] drm/msm/rd: Fix FIFO-full deadlock Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 11/38] peci: cpu: Fix use-after-free in adev_release() Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 12/38] hwmon: (pmbus) Use dev_err_probe() to filter -EPROBE_DEFER error messages Greg Kroah-Hartman
2022-09-16 10:08 ` Greg Kroah-Hartman [this message]
2022-09-16 10:08 ` [PATCH 5.19 14/38] dt-bindings: iio: gyroscope: bosch,bmg160: correct number of pins Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 15/38] HID: ishtp-hid-clientHID: ishtp-hid-client: Fix comment typo Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 16/38] hid: intel-ish-hid: ishtp: Fix ishtp client sending disordered message Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 17/38] Bluetooth: MGMT: Fix Get Device Flags Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 18/38] tg3: Disable tg3 device on system reboot to avoid triggering AER Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 19/38] r8152: add PID for the Lenovo OneLink+ Dock Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 20/38] gpio: mockup: remove gpio debugfs when remove device Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 21/38] ieee802154: cc2520: add rc code in cc2520_tx() Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 22/38] Input: iforce - add support for Boeder Force Feedback Wheel Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 23/38] drm/amdgpu: disable FRU access on special SIENNA CICHLID card Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 24/38] drm/amd/pm: use vbios carried pptable for all SMU13.0.7 SKUs Greg Kroah-Hartman
2022-09-16 10:08 ` [PATCH 5.19 25/38] nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM610 Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 26/38] nvmet-tcp: fix unhandled tcp states in nvmet_tcp_state_change() Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 27/38] drm/amd/amdgpu: skip ucode loading if ucode_size == 0 Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 28/38] net: dsa: hellcreek: Print warning only once Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 29/38] perf/arm_pmu_platform: fix tests for platform_get_irq() failure Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 30/38] platform/x86: acer-wmi: Acer Aspire One AOD270/Packard Bell Dot keymap fixes Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 31/38] usb: storage: Add ASUS <0x0b05:0x1932> to IGNORE_UAS Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 32/38] platform/x86: asus-wmi: Increase FAN_CURVE_BUF_LEN to 32 Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 33/38] LoongArch: Fix section mismatch due to acpi_os_ioremap() Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 34/38] LoongArch: Fix arch_remove_memory() undefined build error Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 35/38] gpio: 104-dio-48e: Make irq_chip immutable Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 36/38] gpio: 104-idio-16: " Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 37/38] RDMA/irdma: Use s/g array in post send only when its valid Greg Kroah-Hartman
2022-09-16 10:09 ` [PATCH 5.19 38/38] Input: goodix - add compatible string for GT1158 Greg Kroah-Hartman
2022-09-16 21:49 ` [PATCH 5.19 00/38] 5.19.10-rc1 review Guenter Roeck
2022-09-16 22:35 ` Ron Economos
2022-09-17  7:26 ` Fenil Jain
2022-09-17  8:16 ` Bagas Sanjaya
2022-09-17 14:29 ` Sudip Mukherjee (Codethink)
2022-09-17 15:03 ` Naresh Kamboju
2022-09-19  0:12 ` Justin Forbes
2022-09-19  1:06 ` Florian Fainelli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220916100449.030118113@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=junaids@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.