kvm.vger.kernel.org archive mirror
* [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case
@ 2019-09-26 23:17 Ben Gardon
  2019-09-26 23:17 ` [RFC PATCH 01/28] kvm: mmu: Separate generating and setting mmio ptes Ben Gardon
                   ` (29 more replies)
  0 siblings, 30 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Over the years, the needs for KVM's x86 MMU have grown from running small
guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
we previously depended upon shadow paging to run all guests, we now have
the use of two dimensional paging (TDP). This RFC proposes and
demonstrates two major changes to the MMU. First, it introduces an
iterator abstraction that simplifies traversal of TDP paging structures
when running an L1 guest. This abstraction takes advantage of the
relative simplicity of TDP to streamline the implementation of MMU
functions. Second, this RFC changes
the synchronization model to enable more parallelism than the monolithic
MMU lock. This "direct mode" MMU is currently in use at Google and has
given us the performance necessary to live migrate our 416 vCPU, 12TiB
m2-ultramem-416 VMs.

The primary motivation for this work was to handle page faults in
parallel. When VMs have hundreds of vCPUs and terabytes of memory, KVM's
MMU lock suffers from extreme contention, resulting in soft-lockups and
jitter in the guest. To demonstrate this, I have also written, and will
submit, a demand paging test for KVM selftests. The test creates N vCPUs, which
each touch disjoint regions of memory. Page faults are picked up by N
user fault FD handlers, one for each vCPU. Over a 1 second profile of
the demand paging test, with 416 vCPUs and 4G per vCPU, 98% of the
execution time was spent waiting for the MMU lock! With this patch
series the total execution time for the test was reduced by 89% and the
execution was dominated by get_user_pages and the user fault FD ioctl.
As a secondary benefit, the iterator-based implementation does not use
the rmap or struct kvm_mmu_page, saving ~0.2% of guest memory in KVM
overheads.

The goal of this RFC is to demonstrate and gather feedback on the
iterator pattern, the memory savings it enables for the "direct case",
and the changes to the synchronization model. Though they are interwoven
in this series, I will separate the iterator from the synchronization
changes in a future series. I recognize that some feature work will be
needed to make this patch set ready for merging. That work is detailed
at the end of this cover letter.

The overall purpose of the KVM MMU is to program paging structures
(CR3/EPT/NPT) to encode the mapping of guest addresses to host physical
addresses (HPA), and to provide utilities for other KVM features, for
example dirty logging. The definition of the L1 guest physical address
(GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to HVA,
and the kernel MM / x86 host page tables map HVA to HPA. Without TDP, the
MMU must program the x86 page tables to encode the full translation of
guest virtual addresses (GVA) to HPA. This requires "shadowing" the
guest's page tables to create a composite x86 paging structure. This
solution is complicated, requires separate paging structures for each
guest CR3, and requires emulating guest page table changes. The TDP case
is much simpler. In this case, KVM lets the guest control CR3 and
programs the EPT/NPT paging structures with the GPA -> HPA mapping. The
guest has no way to change this mapping and only one version of the
paging structure is needed per L1 address space (normal execution or
system management mode, on x86).
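
To make the two halves of that mapping concrete, below is a minimal,
illustrative sketch of the memslot GPA -> HVA half. The struct and
helper are simplified stand-ins rather than KVM's actual definitions,
and the HVA -> HPA half is left to the host page tables (e.g. resolved
via get_user_pages()).

struct example_memslot {
	gfn_t base_gfn;			/* first guest frame covered */
	unsigned long npages;		/* guest pages in the slot */
	unsigned long userspace_addr;	/* HVA backing base_gfn */
};

static unsigned long example_gfn_to_hva(struct example_memslot *slot,
					gfn_t gfn)
{
	/* Return 0 if the slot does not cover this guest frame. */
	if (gfn < slot->base_gfn || gfn - slot->base_gfn >= slot->npages)
		return 0;

	/* Memslots map a contiguous GFN range to a contiguous HVA range. */
	return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
}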

This RFC implements a "direct MMU" through alternative implementations
of MMU functions for running L1 guests with TDP. The direct MMU gets its
name from the direct role bit in struct kvm_mmu_page in the existing MMU
implementation, which indicates that the PTEs in a page table (and their
children) map a linear range of L1 GPAs. Though the direct MMU does not
currently use struct kvm_mmu_page, all of its pages would implicitly
have that bit set. The direct MMU falls back to the existing shadow
paging implementation when TDP is not available, and interoperates with
the existing shadow paging implementation for nesting. 

In order to handle page faults in parallel, the MMU needs to allow a
variety of changes to PTEs concurrently. The first step in this series
is to replace the MMU lock with a read/write lock to enable multiple
threads to perform operations at the same time and interoperate with
functions that still need the monolithic lock. With threads handling
page faults in parallel, the functions operating on the page table
need to: a) ensure PTE modifications are atomic, and b) ensure that page
table memory is freed and accessed safely. Conveniently, the iterator
pattern introduced in this series handles both concerns.
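
As a rough illustration of requirement (a), a PTE update in this model
is an atomic compare / exchange rather than a plain store. The helper
below is only a sketch of the pattern, not the function the series
actually adds:

static bool example_cmpxchg_pte(u64 *ptep, u64 old_pte, u64 new_pte)
{
	/*
	 * Succeeds only if *ptep still holds old_pte. A failure means
	 * another thread changed the PTE first, and the caller must
	 * re-read the PTE and decide whether to retry.
	 */
	return cmpxchg64(ptep, old_pte, new_pte) == old_pte;
}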

The direct walk iterator implements a pre-order traversal of the TDP
paging structures. Threads are able to read and write page table memory
safely in this traversal through the use of RCU, and page table memory is
freed in RCU callbacks as part of a three-step process. (More on that
below.) To ensure that PTEs are updated atomically, the iterator
provides a function for updating the current pte. If the update
succeeds, the iterator handles bookkeeping based on the current and
previous value of the PTE. If it fails, some other thread will have
succeeded, and the iterator repeats that PTE on the next iteration,
transparently retrying the operation. The iterator also handles yielding
and reacquiring the appropriate MMU lock, and flushing the TLB or
queuing work to be done on the next flush.
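
The resulting usage pattern looks roughly like the sketch below. The
iterator type and helper names are placeholders for the interfaces
added later in the series; the point is the shape of the loop and the
retry-on-failed-exchange behavior, not the exact API:

static void example_walk(struct kvm *kvm, int as_id, gfn_t start, gfn_t end)
{
	struct direct_walk_iterator iter;

	direct_walk_iterator_setup_walk(&iter, kvm, as_id, start, end);
	while (direct_walk_iterator_next_pte(&iter)) {
		u64 new_pte = example_compute_new_pte(iter.old_pte);

		/*
		 * If the atomic exchange fails, another thread changed
		 * this PTE first; the iterator revisits it on the next
		 * iteration, so no explicit retry loop is needed here.
		 */
		direct_walk_iterator_set_pte(&iter, new_pte);
	}
}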

In order to minimize TLB flushes, we expand the tlbs_dirty count to
track unflushed changes made through the iterator, so that other threads
know that the in-memory page tables they traverse might not be what the
guest is using to access memory. Page table pages that have been
disconnected from the paging structure root are freed in a three-step
process. First, the pages are filled with special, nonpresent PTEs so
that guest accesses to them through the paging structure caches result
in TDP page faults. Second, the pages are added to a disconnected list;
a snapshot of that list is transferred to a free list after each TLB
flush.
The TLB flush clears the paging structure caches, so the guest will no
longer use the disconnected pages. Lastly, the free list is processed
asynchronously to queue RCU callbacks which free the memory. The RCU
grace period ensures no kernel threads are using the disconnected pages.
This allows the MMU to leave the guest in an inconsistent, but safe,
state with respect to the in-memory paging structure. When functions
need to guarantee that the guest will use the in-memory state after a
traversal, they can either flush the TLBs unconditionally or, if using
the MMU lock in write mode, flush the TLBs under the lock only if the
tlbs_dirty count is elevated.
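
Sketched in code, the three-step teardown looks roughly like the
function below. Aside from DISCONNECTED_PTE, which is introduced later
in the series, the field and helper names here are hypothetical and
only illustrate the sequencing:

static void example_disconnect_pt(struct kvm *kvm, struct page *pt_page)
{
	u64 *pt = page_address(pt_page);
	int i;

	/*
	 * Step 1: poison every entry so that guest accesses through the
	 * paging structure caches take TDP page faults.
	 */
	for (i = 0; i < 512; i++)
		xchg(&pt[i], DISCONNECTED_PTE);

	/*
	 * Step 2: park the page on a disconnected list. A snapshot of
	 * this list is moved to a free list after each TLB flush, which
	 * clears the paging structure caches.
	 */
	spin_lock(&kvm->arch.example_disconnected_lock);
	list_add_tail(&pt_page->lru, &kvm->arch.example_disconnected_pts);
	spin_unlock(&kvm->arch.example_disconnected_lock);

	/*
	 * Step 3 happens asynchronously: the free list is drained by
	 * queuing RCU callbacks that free each page once the grace
	 * period guarantees no kernel thread still holds a pointer to
	 * it.
	 */
}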

The use of the direct MMU can be controlled by a module parameter which
is snapshotted on VM creation and follows the life of the VM. This
snapshot is used in many functions to decide whether or not to use
direct MMU handlers for a given operation. This is a maintenance burden,
and in future versions of this series I will address it by removing
some of the code the direct MMU replaces. I am especially interested in
feedback from the community as to how this series can best be merged. I
see two broad approaches: replacement and integration, or modularization.

Replacement and integration would require amending the existing shadow
paging implementation to use a similar iterator pattern. This would mean
expanding the iterator to work with an rmap to support shadow paging and
reconciling the synchronization changes made to the direct case with the
complexities of shadow paging and nesting.

The modularization approach would require factoring out the "direct MMU"
or "TDP MMU" and "shadow MMU(s)." The function pointers in the MMU
struct would need to be expanded to fully encompass the interface of the
MMU and multiple, simpler, implementations of those functions would be
needed. As it is, use of the module parameter snapshot gives us a rough
outline of the previously undocumented shape of the MMU interface, which
could facilitate modularization. Modularization could allow for the
separation of the shadow paging implementations for running guests
without TDP, and running nested guests with TDP, and the breakup of
paging_tmpl.h.
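
For illustration, a modularized interface could be shaped roughly like
the ops table below; the struct and its members are hypothetical and
meant only to show the idea, not to propose a concrete API:

struct example_kvm_mmu_ops {
	int (*page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
			  bool prefault);
	void (*zap_gfn_range)(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
	void (*slot_remove_write_access)(struct kvm *kvm,
					 struct kvm_memory_slot *memslot);
};

/*
 * A VM would pick one implementation at creation time, e.g. a direct /
 * TDP MMU when TDP is available and a shadow MMU otherwise, replacing
 * the module parameter snapshot checks scattered through the code.
 */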

In addition to the integration question, below are some of the work
items I plan to address before sending the series out again:

Disentangle the iterator pattern from the synchronization changes
	Currently the direct_walk_iterator is very closely tied to the use
	of atomic operations, RCU, and a rwlock for MMU operations. This
	does not need to be the case: instead I would like to see those
	synchronization changes built on top of this iterator pattern.

Support 5 level paging and PAE
	Currently the direct walk iterator only supports 4-level paging
	on 64-bit architectures.

Support MMU memory reclaim
	Currently this patch series does not respect memory limits applied
	through kvm_vm_ioctl_set_nr_mmu_pages.

Support nonpaging guests
	Guests that are not using virtual addresses can be direct mapped,
	even without TDP.

Implement fast invalidation of all PTEs
	This series was prepared between when the fast invalidate_all
	mechanism was removed and when it was re-added. Currently, there
	is no fast path for invalidating all direct MMU PTEs.

Move more operations to execute concurrently
	In this patch series, only page faults are able to execute
	concurrently, however several other functions can also execute
	concurrently, simply by changing the write lock acquisition to a
	read lock.

This series can also be viewed in Gerrit here:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/1416 (Thanks to
Dmitry Vyukov <dvyukov@google.com> for setting up the Gerrit instance)

Ben Gardon (28):
  kvm: mmu: Separate generating and setting mmio ptes
  kvm: mmu: Separate pte generation from set_spte
  kvm: mmu: Zero page cache memory at allocation time
  kvm: mmu: Update the lpages stat atomically
  sched: Add cond_resched_rwlock
  kvm: mmu: Replace mmu_lock with a read/write lock
  kvm: mmu: Add functions for handling changed PTEs
  kvm: mmu: Init / Uninit the direct MMU
  kvm: mmu: Free direct MMU page table memory in an RCU callback
  kvm: mmu: Flush TLBs before freeing direct MMU page table memory
  kvm: mmu: Optimize for freeing direct MMU PTs on teardown
  kvm: mmu: Set tlbs_dirty atomically
  kvm: mmu: Add an iterator for concurrent paging structure walks
  kvm: mmu: Batch updates to the direct mmu disconnected list
  kvm: mmu: Support invalidate_zap_all_pages
  kvm: mmu: Add direct MMU page fault handler
  kvm: mmu: Add direct MMU fast page fault handler
  kvm: mmu: Add an hva range iterator for memslot GFNs
  kvm: mmu: Make address space ID a property of memslots
  kvm: mmu: Implement the invalidation MMU notifiers for the direct MMU
  kvm: mmu: Integrate the direct mmu with the changed pte notifier
  kvm: mmu: Implement access tracking for the direct MMU
  kvm: mmu: Make mark_page_dirty_in_slot usable from outside kvm_main
  kvm: mmu: Support dirty logging in the direct MMU
  kvm: mmu: Support kvm_zap_gfn_range in the direct MMU
  kvm: mmu: Integrate direct MMU with nesting
  kvm: mmu: Lazily allocate rmap when direct MMU is enabled
  kvm: mmu: Support MMIO in the direct MMU

 arch/x86/include/asm/kvm_host.h |   66 +-
 arch/x86/kvm/Kconfig            |    1 +
 arch/x86/kvm/mmu.c              | 2578 ++++++++++++++++++++++++++-----
 arch/x86/kvm/mmu.h              |    2 +
 arch/x86/kvm/mmutrace.h         |   50 +
 arch/x86/kvm/page_track.c       |    8 +-
 arch/x86/kvm/paging_tmpl.h      |   37 +-
 arch/x86/kvm/vmx/vmx.c          |   10 +-
 arch/x86/kvm/x86.c              |   96 +-
 arch/x86/kvm/x86.h              |    2 +
 include/linux/kvm_host.h        |    6 +-
 include/linux/sched.h           |   11 +
 kernel/sched/core.c             |   23 +
 virt/kvm/kvm_main.c             |   57 +-
 14 files changed, 2503 insertions(+), 444 deletions(-)

-- 
2.23.0.444.g18eeb5a265-goog



* [RFC PATCH 01/28] kvm: mmu: Separate generating and setting mmio ptes
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
@ 2019-09-26 23:17 ` Ben Gardon
  2019-11-27 18:15   ` Sean Christopherson
  2019-09-26 23:17 ` [RFC PATCH 02/28] kvm: mmu: Separate pte generation from set_spte Ben Gardon
                   ` (28 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Separate the functions for generating MMIO page table entries from the
function that inserts them into the paging structure. This refactoring
will allow changes to the MMU synchronization model to use atomic
compare / exchanges (which are not guaranteed to succeed) instead of a
monolithic MMU lock.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5269aa057dfa6..781c2ca7455e3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -390,8 +390,7 @@ static u64 get_mmio_spte_generation(u64 spte)
 	return gen;
 }
 
-static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
-			   unsigned access)
+static u64 generate_mmio_pte(struct kvm_vcpu *vcpu, u64 gfn, unsigned access)
 {
 	u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
 	u64 mask = generation_mmio_spte_mask(gen);
@@ -403,6 +402,17 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
 	mask |= (gpa & shadow_nonpresent_or_rsvd_mask)
 		<< shadow_nonpresent_or_rsvd_mask_len;
 
+	return mask;
+}
+
+static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
+			   unsigned access)
+{
+	u64 mask = generate_mmio_pte(vcpu, gfn, access);
+	unsigned int gen = get_mmio_spte_generation(mask);
+
+	access = mask & ACC_ALL;
+
 	trace_mark_mmio_spte(sptep, gfn, access, gen);
 	mmu_spte_set(sptep, mask);
 }
-- 
2.23.0.444.g18eeb5a265-goog



* [RFC PATCH 02/28] kvm: mmu: Separate pte generation from set_spte
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
  2019-09-26 23:17 ` [RFC PATCH 01/28] kvm: mmu: Separate generating and setting mmio ptes Ben Gardon
@ 2019-09-26 23:17 ` Ben Gardon
  2019-11-27 18:25   ` Sean Christopherson
  2019-09-26 23:17 ` [RFC PATCH 03/28] kvm: mmu: Zero page cache memory at allocation time Ben Gardon
                   ` (27 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Separate the functions for generating leaf page table entries from the
function that inserts them into the paging structure. This refactoring
will allow changes to the MMU synchronization model to use atomic
compare / exchanges (which are not guaranteed to succeed) instead of a
monolithic MMU lock.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 93 ++++++++++++++++++++++++++++------------------
 1 file changed, 57 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 781c2ca7455e3..7e5ab9c6e2b09 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2964,21 +2964,15 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 #define SET_SPTE_WRITE_PROTECTED_PT	BIT(0)
 #define SET_SPTE_NEED_REMOTE_TLB_FLUSH	BIT(1)
 
-static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
-		    unsigned pte_access, int level,
-		    gfn_t gfn, kvm_pfn_t pfn, bool speculative,
-		    bool can_unsync, bool host_writable)
+static int generate_pte(struct kvm_vcpu *vcpu, unsigned pte_access, int level,
+		    gfn_t gfn, kvm_pfn_t pfn, u64 old_pte, bool speculative,
+		    bool can_unsync, bool host_writable, bool ad_disabled,
+		    u64 *ptep)
 {
-	u64 spte = 0;
+	u64 pte;
 	int ret = 0;
-	struct kvm_mmu_page *sp;
-
-	if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access))
-		return 0;
 
-	sp = page_header(__pa(sptep));
-	if (sp_ad_disabled(sp))
-		spte |= shadow_acc_track_value;
+	*ptep = 0;
 
 	/*
 	 * For the EPT case, shadow_present_mask is 0 if hardware
@@ -2986,36 +2980,39 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	 * ACC_USER_MASK and shadow_user_mask are used to represent
 	 * read access.  See FNAME(gpte_access) in paging_tmpl.h.
 	 */
-	spte |= shadow_present_mask;
+	pte = shadow_present_mask;
+
+	if (ad_disabled)
+		pte |= shadow_acc_track_value;
+
 	if (!speculative)
-		spte |= spte_shadow_accessed_mask(spte);
+		pte |= spte_shadow_accessed_mask(pte);
 
 	if (pte_access & ACC_EXEC_MASK)
-		spte |= shadow_x_mask;
+		pte |= shadow_x_mask;
 	else
-		spte |= shadow_nx_mask;
+		pte |= shadow_nx_mask;
 
 	if (pte_access & ACC_USER_MASK)
-		spte |= shadow_user_mask;
+		pte |= shadow_user_mask;
 
 	if (level > PT_PAGE_TABLE_LEVEL)
-		spte |= PT_PAGE_SIZE_MASK;
+		pte |= PT_PAGE_SIZE_MASK;
 	if (tdp_enabled)
-		spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
+		pte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
 			kvm_is_mmio_pfn(pfn));
 
 	if (host_writable)
-		spte |= SPTE_HOST_WRITEABLE;
+		pte |= SPTE_HOST_WRITEABLE;
 	else
 		pte_access &= ~ACC_WRITE_MASK;
 
 	if (!kvm_is_mmio_pfn(pfn))
-		spte |= shadow_me_mask;
+		pte |= shadow_me_mask;
 
-	spte |= (u64)pfn << PAGE_SHIFT;
+	pte |= (u64)pfn << PAGE_SHIFT;
 
 	if (pte_access & ACC_WRITE_MASK) {
-
 		/*
 		 * Other vcpu creates new sp in the window between
 		 * mapping_level() and acquiring mmu-lock. We can
@@ -3024,9 +3021,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		 */
 		if (level > PT_PAGE_TABLE_LEVEL &&
 		    mmu_gfn_lpage_is_disallowed(vcpu, gfn, level))
-			goto done;
+			return 0;
 
-		spte |= PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE;
+		pte |= PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE;
 
 		/*
 		 * Optimization: for pte sync, if spte was writable the hash
@@ -3034,30 +3031,54 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		 * is responsibility of mmu_get_page / kvm_sync_page.
 		 * Same reasoning can be applied to dirty page accounting.
 		 */
-		if (!can_unsync && is_writable_pte(*sptep))
-			goto set_pte;
+		if (!can_unsync && is_writable_pte(old_pte)) {
+			*ptep = pte;
+			return 0;
+		}
 
 		if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
 			pgprintk("%s: found shadow page for %llx, marking ro\n",
 				 __func__, gfn);
-			ret |= SET_SPTE_WRITE_PROTECTED_PT;
+			ret = SET_SPTE_WRITE_PROTECTED_PT;
 			pte_access &= ~ACC_WRITE_MASK;
-			spte &= ~(PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE);
+			pte &= ~(PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE);
 		}
 	}
 
-	if (pte_access & ACC_WRITE_MASK) {
-		kvm_vcpu_mark_page_dirty(vcpu, gfn);
-		spte |= spte_shadow_dirty_mask(spte);
-	}
+	if (pte_access & ACC_WRITE_MASK)
+		pte |= spte_shadow_dirty_mask(pte);
 
 	if (speculative)
-		spte = mark_spte_for_access_track(spte);
+		pte = mark_spte_for_access_track(pte);
+
+	*ptep = pte;
+	return ret;
+}
+
+static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access,
+		    int level, gfn_t gfn, kvm_pfn_t pfn, bool speculative,
+		    bool can_unsync, bool host_writable)
+{
+	u64 spte;
+	int ret;
+	struct kvm_mmu_page *sp;
+
+	if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access))
+		return 0;
+
+	sp = page_header(__pa(sptep));
+
+	ret = generate_pte(vcpu, pte_access, level, gfn, pfn, *sptep,
+			   speculative, can_unsync, host_writable,
+			   sp_ad_disabled(sp), &spte);
+	if (!spte)
+		return 0;
+
+	if (spte & PT_WRITABLE_MASK)
+		kvm_vcpu_mark_page_dirty(vcpu, gfn);
 
-set_pte:
 	if (mmu_spte_update(sptep, spte))
 		ret |= SET_SPTE_NEED_REMOTE_TLB_FLUSH;
-done:
 	return ret;
 }
 
-- 
2.23.0.444.g18eeb5a265-goog



* [RFC PATCH 03/28] kvm: mmu: Zero page cache memory at allocation time
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
  2019-09-26 23:17 ` [RFC PATCH 01/28] kvm: mmu: Separate generating and setting mmio ptes Ben Gardon
  2019-09-26 23:17 ` [RFC PATCH 02/28] kvm: mmu: Separate pte generation from set_spte Ben Gardon
@ 2019-09-26 23:17 ` Ben Gardon
  2019-11-27 18:32   ` Sean Christopherson
  2019-09-26 23:18 ` [RFC PATCH 04/28] kvm: mmu: Update the lpages stat atomically Ben Gardon
                   ` (26 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Simplify use of the MMU page cache by allocating pages pre-zeroed. This
ensures that future code does not accidentally add non-zeroed memory to
the paging structure and moves the work of zeroing pages out from
under the MMU lock.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7e5ab9c6e2b09..1ecd6d51c0ee0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1037,7 +1037,7 @@ static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
 	if (cache->nobjs >= min)
 		return 0;
 	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
+		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 		if (!page)
 			return cache->nobjs >= min ? 0 : -ENOMEM;
 		cache->objects[cache->nobjs++] = page;
@@ -2548,7 +2548,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		if (level > PT_PAGE_TABLE_LEVEL && need_sync)
 			flush |= kvm_sync_pages(vcpu, gfn, &invalid_list);
 	}
-	clear_page(sp->spt);
 	trace_kvm_mmu_get_page(sp, true);
 
 	kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
-- 
2.23.0.444.g18eeb5a265-goog



* [RFC PATCH 04/28] kvm: mmu: Update the lpages stat atomically
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (2 preceding siblings ...)
  2019-09-26 23:17 ` [RFC PATCH 03/28] kvm: mmu: Zero page cache memory at allocation time Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-11-27 18:39   ` Sean Christopherson
  2019-09-26 23:18 ` [RFC PATCH 05/28] sched: Add cond_resched_rwlock Ben Gardon
                   ` (25 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

In order to pave the way for more concurrent MMU operations, updates to
VM-global stats need to be done atomically. Change updates to the lpages
stat to be atomic in preparation for the introduction of parallel page
fault handling.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1ecd6d51c0ee0..56587655aecb9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1532,7 +1532,7 @@ static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
 		WARN_ON(page_header(__pa(sptep))->role.level ==
 			PT_PAGE_TABLE_LEVEL);
 		drop_spte(kvm, sptep);
-		--kvm->stat.lpages;
+		xadd(&kvm->stat.lpages, -1);
 		return true;
 	}
 
@@ -2676,7 +2676,7 @@ static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 		if (is_last_spte(pte, sp->role.level)) {
 			drop_spte(kvm, spte);
 			if (is_large_pte(pte))
-				--kvm->stat.lpages;
+				xadd(&kvm->stat.lpages, -1);
 		} else {
 			child = page_header(pte & PT64_BASE_ADDR_MASK);
 			drop_parent_pte(child, spte);
@@ -3134,7 +3134,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access,
 	pgprintk("%s: setting spte %llx\n", __func__, *sptep);
 	trace_kvm_mmu_set_spte(level, gfn, sptep);
 	if (!was_rmapped && is_large_pte(*sptep))
-		++vcpu->kvm->stat.lpages;
+		xadd(&vcpu->kvm->stat.lpages, 1);
 
 	if (is_shadow_present_pte(*sptep)) {
 		if (!was_rmapped) {
-- 
2.23.0.444.g18eeb5a265-goog



* [RFC PATCH 05/28] sched: Add cond_resched_rwlock
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (3 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 04/28] kvm: mmu: Update the lpages stat atomically Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-11-27 18:42   ` Sean Christopherson
  2019-09-26 23:18 ` [RFC PATCH 06/28] kvm: mmu: Replace mmu_lock with a read/write lock Ben Gardon
                   ` (24 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Rescheduling while holding a spin lock is essential for keeping
long-running kernel operations running smoothly. Add the facility to
cond_resched while holding read/write spin locks.

RFC_NOTE: The current implementation of this patch set uses a read/write
lock to replace the existing MMU spin lock. See the next patch in this
series for more on why a read/write lock was chosen, and possible
alternatives.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 include/linux/sched.h | 11 +++++++++++
 kernel/sched/core.c   | 23 +++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 70db597d6fd4f..4d1fd96693d9b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1767,12 +1767,23 @@ static inline int _cond_resched(void) { return 0; }
 })
 
 extern int __cond_resched_lock(spinlock_t *lock);
+extern int __cond_resched_rwlock(rwlock_t *lock, bool write_lock);
 
 #define cond_resched_lock(lock) ({				\
 	___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\
 	__cond_resched_lock(lock);				\
 })
 
+#define cond_resched_rwlock_read(lock) ({			\
+	__might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);	\
+	__cond_resched_rwlock(lock, false);			\
+})
+
+#define cond_resched_rwlock_write(lock) ({			\
+	__might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);	\
+	__cond_resched_rwlock(lock, true);			\
+})
+
 static inline void cond_resched_rcu(void)
 {
 #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9a1346a5fa95..ba7ed4bed5036 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5663,6 +5663,29 @@ int __cond_resched_lock(spinlock_t *lock)
 }
 EXPORT_SYMBOL(__cond_resched_lock);
 
+int __cond_resched_rwlock(rwlock_t *lock, bool write_lock)
+{
+	int ret = 0;
+
+	lockdep_assert_held(lock);
+	if (should_resched(PREEMPT_LOCK_OFFSET)) {
+		if (write_lock) {
+			write_unlock(lock);
+			preempt_schedule_common();
+			write_lock(lock);
+		} else {
+			read_unlock(lock);
+			preempt_schedule_common();
+			read_lock(lock);
+		}
+
+		ret = 1;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(__cond_resched_rwlock);
+
 /**
  * yield - yield the current processor to other threads.
  *
-- 
2.23.0.444.g18eeb5a265-goog



* [RFC PATCH 06/28] kvm: mmu: Replace mmu_lock with a read/write lock
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (4 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 05/28] sched: Add cond_resched_rwlock Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-11-27 18:47   ` Sean Christopherson
  2019-09-26 23:18 ` [RFC PATCH 07/28] kvm: mmu: Add functions for handling changed PTEs Ben Gardon
                   ` (23 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Replace the KVM MMU spinlock with a read/write lock so that some parts of
the MMU can be made more concurrent in future commits by switching some
write mode acquisitions to read mode. A read/write lock was chosen over
other synchronization options because it has minimal initial impact: this
change simply converts all uses of the MMU spin lock to an MMU read/write
lock, in write mode. This change has no effect on the logic of the code
and imposes only a small performance penalty.

Other, more invasive options were considered for synchronizing access to
the paging structures. Sharding the MMU lock to protect 2MB chunks of
addresses, as the main MM does, would also work, however it makes
acquiring locks for operations on large regions of memory expensive.
Further, the parallel page fault handling algorithm introduced later in
this series does not require exclusive access to the region of memory
for which it is handling a fault.

There are several disadvantages to the read/write lock approach:
1. The reader/writer terminology does not apply well to MMU operations.
2. Many operations require exclusive access to a region of memory
(often a memslot), but not all of memory. The read/write lock does not
facilitate this.
3. Contention between readers and writers can still create problems in
the face of long running MMU operations.

Despite these issues, the use of a read/write lock facilitates
substantial improvements over the monolithic locking scheme.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c         | 106 +++++++++++++++++++------------------
 arch/x86/kvm/page_track.c  |   8 +--
 arch/x86/kvm/paging_tmpl.h |   8 +--
 arch/x86/kvm/x86.c         |   4 +-
 include/linux/kvm_host.h   |   3 +-
 virt/kvm/kvm_main.c        |  34 ++++++------
 6 files changed, 83 insertions(+), 80 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 56587655aecb9..0311d18d9a995 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2446,9 +2446,9 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 			flush |= kvm_sync_page(vcpu, sp, &invalid_list);
 			mmu_pages_clear_parents(&parents);
 		}
-		if (need_resched() || spin_needbreak(&vcpu->kvm->mmu_lock)) {
+		if (need_resched()) {
 			kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
-			cond_resched_lock(&vcpu->kvm->mmu_lock);
+			cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
 			flush = false;
 		}
 	}
@@ -2829,7 +2829,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
 {
 	LIST_HEAD(invalid_list);
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 
 	if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) {
 		/* Need to free some mmu pages to achieve the goal. */
@@ -2843,7 +2843,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
 
 	kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages;
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 
 int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
@@ -2854,7 +2854,7 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
 
 	pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
 	r = 0;
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
 		pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
 			 sp->role.word);
@@ -2862,7 +2862,7 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
 		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 	}
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	return r;
 }
@@ -3578,7 +3578,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
 		return r;
 
 	r = RET_PF_RETRY;
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
 		goto out_unlock;
 	if (make_mmu_pages_available(vcpu) < 0)
@@ -3586,8 +3586,9 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
 	if (likely(!force_pt_level))
 		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
 	r = __direct_map(vcpu, v, write, map_writable, level, pfn, prefault);
+
 out_unlock:
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(pfn);
 	return r;
 }
@@ -3629,7 +3630,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			return;
 	}
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i))
@@ -3653,7 +3654,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	}
 
 	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_free_roots);
 
@@ -3675,31 +3676,31 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	unsigned i;
 
 	if (vcpu->arch.mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
-		spin_lock(&vcpu->kvm->mmu_lock);
+		write_lock(&vcpu->kvm->mmu_lock);
 		if(make_mmu_pages_available(vcpu) < 0) {
-			spin_unlock(&vcpu->kvm->mmu_lock);
+			write_unlock(&vcpu->kvm->mmu_lock);
 			return -ENOSPC;
 		}
 		sp = kvm_mmu_get_page(vcpu, 0, 0,
 				vcpu->arch.mmu->shadow_root_level, 1, ACC_ALL);
 		++sp->root_count;
-		spin_unlock(&vcpu->kvm->mmu_lock);
+		write_unlock(&vcpu->kvm->mmu_lock);
 		vcpu->arch.mmu->root_hpa = __pa(sp->spt);
 	} else if (vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL) {
 		for (i = 0; i < 4; ++i) {
 			hpa_t root = vcpu->arch.mmu->pae_root[i];
 
 			MMU_WARN_ON(VALID_PAGE(root));
-			spin_lock(&vcpu->kvm->mmu_lock);
+			write_lock(&vcpu->kvm->mmu_lock);
 			if (make_mmu_pages_available(vcpu) < 0) {
-				spin_unlock(&vcpu->kvm->mmu_lock);
+				write_unlock(&vcpu->kvm->mmu_lock);
 				return -ENOSPC;
 			}
 			sp = kvm_mmu_get_page(vcpu, i << (30 - PAGE_SHIFT),
 					i << 30, PT32_ROOT_LEVEL, 1, ACC_ALL);
 			root = __pa(sp->spt);
 			++sp->root_count;
-			spin_unlock(&vcpu->kvm->mmu_lock);
+			write_unlock(&vcpu->kvm->mmu_lock);
 			vcpu->arch.mmu->pae_root[i] = root | PT_PRESENT_MASK;
 		}
 		vcpu->arch.mmu->root_hpa = __pa(vcpu->arch.mmu->pae_root);
@@ -3732,16 +3733,16 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 
 		MMU_WARN_ON(VALID_PAGE(root));
 
-		spin_lock(&vcpu->kvm->mmu_lock);
+		write_lock(&vcpu->kvm->mmu_lock);
 		if (make_mmu_pages_available(vcpu) < 0) {
-			spin_unlock(&vcpu->kvm->mmu_lock);
+			write_unlock(&vcpu->kvm->mmu_lock);
 			return -ENOSPC;
 		}
 		sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
 				vcpu->arch.mmu->shadow_root_level, 0, ACC_ALL);
 		root = __pa(sp->spt);
 		++sp->root_count;
-		spin_unlock(&vcpu->kvm->mmu_lock);
+		write_unlock(&vcpu->kvm->mmu_lock);
 		vcpu->arch.mmu->root_hpa = root;
 		goto set_root_cr3;
 	}
@@ -3769,16 +3770,16 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 			if (mmu_check_root(vcpu, root_gfn))
 				return 1;
 		}
-		spin_lock(&vcpu->kvm->mmu_lock);
+		write_lock(&vcpu->kvm->mmu_lock);
 		if (make_mmu_pages_available(vcpu) < 0) {
-			spin_unlock(&vcpu->kvm->mmu_lock);
+			write_unlock(&vcpu->kvm->mmu_lock);
 			return -ENOSPC;
 		}
 		sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30, PT32_ROOT_LEVEL,
 				      0, ACC_ALL);
 		root = __pa(sp->spt);
 		++sp->root_count;
-		spin_unlock(&vcpu->kvm->mmu_lock);
+		write_unlock(&vcpu->kvm->mmu_lock);
 
 		vcpu->arch.mmu->pae_root[i] = root | pm_mask;
 	}
@@ -3854,17 +3855,17 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 		    !smp_load_acquire(&sp->unsync_children))
 			return;
 
-		spin_lock(&vcpu->kvm->mmu_lock);
+		write_lock(&vcpu->kvm->mmu_lock);
 		kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
 
 		mmu_sync_children(vcpu, sp);
 
 		kvm_mmu_audit(vcpu, AUDIT_POST_SYNC);
-		spin_unlock(&vcpu->kvm->mmu_lock);
+		write_unlock(&vcpu->kvm->mmu_lock);
 		return;
 	}
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 	kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
 
 	for (i = 0; i < 4; ++i) {
@@ -3878,7 +3879,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 	}
 
 	kvm_mmu_audit(vcpu, AUDIT_POST_SYNC);
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_sync_roots);
 
@@ -4204,7 +4205,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 		return r;
 
 	r = RET_PF_RETRY;
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
 		goto out_unlock;
 	if (make_mmu_pages_available(vcpu) < 0)
@@ -4212,8 +4213,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	if (likely(!force_pt_level))
 		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
 	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn, prefault);
+
 out_unlock:
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(pfn);
 	return r;
 }
@@ -5338,7 +5340,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	 */
 	mmu_topup_memory_caches(vcpu);
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 
 	gentry = mmu_pte_write_fetch_gpte(vcpu, &gpa, &bytes);
 
@@ -5374,7 +5376,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	}
 	kvm_mmu_flush_or_zap(vcpu, &invalid_list, remote_flush, local_flush);
 	kvm_mmu_audit(vcpu, AUDIT_POST_PTE_WRITE);
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 }
 
 int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
@@ -5581,14 +5583,14 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		if (iterator.rmap)
 			flush |= fn(kvm, iterator.rmap);
 
-		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		if (need_resched()) {
 			if (flush && lock_flush_tlb) {
 				kvm_flush_remote_tlbs_with_address(kvm,
 						start_gfn,
 						iterator.gfn - start_gfn + 1);
 				flush = false;
 			}
-			cond_resched_lock(&kvm->mmu_lock);
+			cond_resched_rwlock_write(&kvm->mmu_lock);
 		}
 	}
 
@@ -5738,7 +5740,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
 		 * be in active use by the guest.
 		 */
 		if (batch >= BATCH_ZAP_PAGES &&
-		    cond_resched_lock(&kvm->mmu_lock)) {
+		    cond_resched_rwlock_write(&kvm->mmu_lock)) {
 			batch = 0;
 			goto restart;
 		}
@@ -5771,7 +5773,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 {
 	lockdep_assert_held(&kvm->slots_lock);
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	trace_kvm_mmu_zap_all_fast(kvm);
 
 	/*
@@ -5794,7 +5796,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	kvm_reload_remote_mmus(kvm);
 
 	kvm_zap_obsolete_pages(kvm);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 
 static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
@@ -5831,7 +5833,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	struct kvm_memory_slot *memslot;
 	int i;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		slots = __kvm_memslots(kvm, i);
 		kvm_for_each_memslot(memslot, slots) {
@@ -5848,7 +5850,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 		}
 	}
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 
 static bool slot_rmap_write_protect(struct kvm *kvm,
@@ -5862,10 +5864,10 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 {
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	flush = slot_handle_all_level(kvm, memslot, slot_rmap_write_protect,
 				      false);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	/*
 	 * kvm_mmu_slot_remove_write_access() and kvm_vm_ioctl_get_dirty_log()
@@ -5933,10 +5935,10 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot)
 {
 	/* FIXME: const-ify all uses of struct kvm_memory_slot.  */
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
 			 kvm_mmu_zap_collapsible_spte, true);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
@@ -5944,9 +5946,9 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 {
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	lockdep_assert_held(&kvm->slots_lock);
 
@@ -5967,10 +5969,10 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
 {
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	flush = slot_handle_large_level(kvm, memslot, slot_rmap_write_protect,
 					false);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	/* see kvm_mmu_slot_remove_write_access */
 	lockdep_assert_held(&kvm->slots_lock);
@@ -5986,9 +5988,9 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm,
 {
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	lockdep_assert_held(&kvm->slots_lock);
 
@@ -6005,19 +6007,19 @@ void kvm_mmu_zap_all(struct kvm *kvm)
 	LIST_HEAD(invalid_list);
 	int ign;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 restart:
 	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
 		if (sp->role.invalid && sp->root_count)
 			continue;
 		if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
 			goto restart;
-		if (cond_resched_lock(&kvm->mmu_lock))
+		if (cond_resched_rwlock_write(&kvm->mmu_lock))
 			goto restart;
 	}
 
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
@@ -6077,7 +6079,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 			continue;
 
 		idx = srcu_read_lock(&kvm->srcu);
-		spin_lock(&kvm->mmu_lock);
+		write_lock(&kvm->mmu_lock);
 
 		if (kvm_has_zapped_obsolete_pages(kvm)) {
 			kvm_mmu_commit_zap_page(kvm,
@@ -6090,7 +6092,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
 unlock:
-		spin_unlock(&kvm->mmu_lock);
+		write_unlock(&kvm->mmu_lock);
 		srcu_read_unlock(&kvm->srcu, idx);
 
 		/*
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 3521e2d176f2f..a43f4fa020db2 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -188,9 +188,9 @@ kvm_page_track_register_notifier(struct kvm *kvm,
 
 	head = &kvm->arch.track_notifier_head;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	hlist_add_head_rcu(&n->node, &head->track_notifier_list);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 EXPORT_SYMBOL_GPL(kvm_page_track_register_notifier);
 
@@ -206,9 +206,9 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
 
 	head = &kvm->arch.track_notifier_head;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	hlist_del_rcu(&n->node);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 	synchronize_srcu(&head->track_srcu);
 }
 EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7d5cdb3af5943..97903c8dcad16 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -841,7 +841,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	}
 
 	r = RET_PF_RETRY;
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
 		goto out_unlock;
 
@@ -855,7 +855,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 
 out_unlock:
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(pfn);
 	return r;
 }
@@ -892,7 +892,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
 		return;
 	}
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 	for_each_shadow_entry_using_root(vcpu, root_hpa, gva, iterator) {
 		level = iterator.level;
 		sptep = iterator.sptep;
@@ -925,7 +925,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
 		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
 			break;
 	}
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 }
 
 static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0ed07d8d2caa0..9ecf83da396c9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6376,9 +6376,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
 	if (vcpu->arch.mmu->direct_map) {
 		unsigned int indirect_shadow_pages;
 
-		spin_lock(&vcpu->kvm->mmu_lock);
+		write_lock(&vcpu->kvm->mmu_lock);
 		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
-		spin_unlock(&vcpu->kvm->mmu_lock);
+		write_unlock(&vcpu->kvm->mmu_lock);
 
 		if (indirect_shadow_pages)
 			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fcb46b3374c60..baed80f8a7f00 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -441,7 +441,8 @@ struct kvm_memslots {
 };
 
 struct kvm {
-	spinlock_t mmu_lock;
+	rwlock_t mmu_lock;
+
 	struct mutex slots_lock;
 	struct mm_struct *mm; /* userspace tied to this vm */
 	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e6de3159e682f..9ce067b6882b7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -356,13 +356,13 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	int idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	kvm->mmu_notifier_seq++;
 
 	if (kvm_set_spte_hva(kvm, address, pte))
 		kvm_flush_remote_tlbs(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
@@ -374,7 +374,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	int ret;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	/*
 	 * The count increase must become visible at unlock time as no
 	 * spte can be established without taking the mmu_lock and
@@ -387,7 +387,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	if (need_tlb_flush)
 		kvm_flush_remote_tlbs(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, range->start,
 					range->end,
@@ -403,7 +403,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	/*
 	 * This sequence increase will notify the kvm page fault that
 	 * the page that is going to be mapped in the spte could have
@@ -417,7 +417,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	 * in conjunction with the smp_rmb in mmu_notifier_retry().
 	 */
 	kvm->mmu_notifier_count--;
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	BUG_ON(kvm->mmu_notifier_count < 0);
 }
@@ -431,13 +431,13 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 	int young, idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 
 	young = kvm_age_hva(kvm, start, end);
 	if (young)
 		kvm_flush_remote_tlbs(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
@@ -452,7 +452,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	int young, idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	/*
 	 * Even though we do not flush TLB, this will still adversely
 	 * affect performance on pre-Haswell Intel EPT, where there is
@@ -467,7 +467,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * more sophisticated heuristic later.
 	 */
 	young = kvm_age_hva(kvm, start, end);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
@@ -481,9 +481,9 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 	int young, idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	young = kvm_test_age_hva(kvm, address);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
@@ -632,7 +632,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	if (!kvm)
 		return ERR_PTR(-ENOMEM);
 
-	spin_lock_init(&kvm->mmu_lock);
+	rwlock_init(&kvm->mmu_lock);
 	mmgrab(current->mm);
 	kvm->mm = current->mm;
 	kvm_eventfd_init(kvm);
@@ -1193,7 +1193,7 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
 		dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
 		memset(dirty_bitmap_buffer, 0, n);
 
-		spin_lock(&kvm->mmu_lock);
+		write_lock(&kvm->mmu_lock);
 		for (i = 0; i < n / sizeof(long); i++) {
 			unsigned long mask;
 			gfn_t offset;
@@ -1209,7 +1209,7 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
 			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
 								offset, mask);
 		}
-		spin_unlock(&kvm->mmu_lock);
+		write_unlock(&kvm->mmu_lock);
 	}
 
 	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
@@ -1263,7 +1263,7 @@ int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
 		return -EFAULT;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	for (offset = log->first_page, i = offset / BITS_PER_LONG,
 		 n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
 	     i++, offset += BITS_PER_LONG) {
@@ -1286,7 +1286,7 @@ int kvm_clear_dirty_log_protect(struct kvm *kvm,
 								offset, mask);
 		}
 	}
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	return 0;
 }
-- 
2.23.0.444.g18eeb5a265-goog



* [RFC PATCH 07/28] kvm: mmu: Add functions for handling changed PTEs
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (5 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 06/28] kvm: mmu: Replace mmu_lock with a read/write lock Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-11-27 19:04   ` Sean Christopherson
  2019-09-26 23:18 ` [RFC PATCH 08/28] kvm: mmu: Init / Uninit the direct MMU Ben Gardon
                   ` (22 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

The existing bookkeeping done by KVM when a PTE is changed is
spread around several functions. This makes it difficult to remember all
the stats, bitmaps, and other subsystems that need to be updated whenever
a PTE is modified. When a non-leaf PTE is marked non-present or becomes
a leaf PTE, page table memory must also be freed. Further, most of the
bookkeeping is done before the PTE is actually set. This works well with
a monolithic MMU lock; however, if changes use atomic compare/exchanges,
the bookkeeping cannot be done before the change is made. In either
case, there is a short window in which some statistics (e.g. the dirty
bitmap) will be inconsistent; however, consistency is still restored
before the MMU lock is released. To simplify the MMU and facilitate the
use of atomic operations on PTEs, create functions to handle some of the
bookkeeping required as a result of the change.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 145 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0311d18d9a995..50413f17c7cd0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -143,6 +143,18 @@ module_param(dbg, bool, 0644);
 #define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
 #define SPTE_MMU_WRITEABLE	(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
 
+/*
+ * PTEs in a disconnected page table can be set to DISCONNECTED_PTE to indicate
+ * to other threads that the page table in which the pte resides is no longer
+ * connected to the root of a paging structure.
+ *
+ * This constant works because it is considered non-present on both AMD and
+ * Intel CPUs and does not create a L1TF vulnerability because the pfn section
+ * is zeroed out. PTE bit 57 is available to software, per vol 3, figure 28-1
+ * of the Intel SDM and vol 2, figures 5-18 to 5-21 of the AMD APM.
+ */
+#define DISCONNECTED_PTE (1ull << 57)
+
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
 /* make pte_list_desc fit well in cache line */
@@ -555,6 +567,16 @@ static int is_shadow_present_pte(u64 pte)
 	return (pte != 0) && !is_mmio_spte(pte);
 }
 
+static inline int is_disconnected_pte(u64 pte)
+{
+	return pte == DISCONNECTED_PTE;
+}
+
+static int is_present_direct_pte(u64 pte)
+{
+	return is_shadow_present_pte(pte) && !is_disconnected_pte(pte);
+}
+
 static int is_large_pte(u64 pte)
 {
 	return pte & PT_PAGE_SIZE_MASK;
@@ -1659,6 +1681,129 @@ static bool __rmap_set_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
 	return flush;
 }
 
+static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
+			       u64 old_pte, u64 new_pte, int level);
+
+/**
+ * mark_pte_disconnected - Mark a PTE as part of a disconnected PT
+ * @kvm: kvm instance
+ * @as_id: the address space of the paging structure the PTE was a part of
+ * @gfn: the base GFN that was mapped by the PTE
+ * @ptep: a pointer to the PTE to be marked disconnected
+ * @level: the level of the PT this PTE was a part of, when it was part of the
+ *	paging structure
+ */
+static void mark_pte_disconnected(struct kvm *kvm, int as_id, gfn_t gfn,
+				  u64 *ptep, int level)
+{
+	u64 old_pte;
+
+	old_pte = xchg(ptep, DISCONNECTED_PTE);
+	BUG_ON(old_pte == DISCONNECTED_PTE);
+
+	handle_changed_pte(kvm, as_id, gfn, old_pte, DISCONNECTED_PTE, level);
+}
+
+/**
+ * handle_disconnected_pt - Mark a PT as disconnected and handle associated
+ * bookkeeping and freeing
+ * @kvm: kvm instance
+ * @as_id: the address space of the paging structure the PT was a part of
+ * @pt_base_gfn: the base GFN that was mapped by the first PTE in the PT
+ * @pfn: The physical frame number of the disconnected PT page
+ * @level: the level of the PT, when it was part of the paging structure
+ *
+ * Given a pointer to a page table that has been removed from the paging
+ * structure and its level, recursively free child page tables and mark their
+ * entries as disconnected.
+ */
+static void handle_disconnected_pt(struct kvm *kvm, int as_id,
+				   gfn_t pt_base_gfn, kvm_pfn_t pfn, int level)
+{
+	int i;
+	gfn_t gfn = pt_base_gfn;
+	u64 *pt = pfn_to_kaddr(pfn);
+
+	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
+		/*
+		 * Mark the PTE as disconnected so that no other thread will
+		 * try to map in an entry there or try to free any child page
+		 * table the entry might have pointed to.
+		 */
+		mark_pte_disconnected(kvm, as_id, gfn, &pt[i], level);
+
+		gfn += KVM_PAGES_PER_HPAGE(level);
+	}
+
+	free_page((unsigned long)pt);
+}
+
+/**
+ * handle_changed_pte - handle bookkeeping associated with a PTE change
+ * @kvm: kvm instance
+ * @as_id: the address space of the paging structure the PTE was a part of
+ * @gfn: the base GFN that was mapped by the PTE
+ * @old_pte: The value of the PTE before the atomic compare / exchange
+ * @new_pte: The value of the PTE after the atomic compare / exchange
+ * @level: the level of the PT the PTE is part of in the paging structure
+ *
+ * Handle bookkeeping that might result from the modification of a PTE.
+ * This function should be called in the same RCU read critical section as the
+ * atomic cmpxchg on the pte. This function must be called for all direct pte
+ * modifications except those which strictly emulate hardware, for example
+ * setting the dirty bit on a pte.
+ */
+static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
+			       u64 old_pte, u64 new_pte, int level)
+{
+	bool was_present = is_present_direct_pte(old_pte);
+	bool is_present = is_present_direct_pte(new_pte);
+	bool was_leaf = was_present && is_last_spte(old_pte, level);
+	bool pfn_changed = spte_to_pfn(old_pte) != spte_to_pfn(new_pte);
+	int child_level;
+
+	BUG_ON(level > PT64_ROOT_MAX_LEVEL);
+	BUG_ON(level < PT_PAGE_TABLE_LEVEL);
+	BUG_ON(gfn % KVM_PAGES_PER_HPAGE(level));
+
+	/*
+	 * The only cases in which a pte should change from one non-present
+	 * state to another are when an entry in an unlinked page table is
+	 * marked as a disconnected PTE as part of freeing the page table,
+	 * or when an MMIO entry is installed/modified. In these cases there
+	 * is nothing to do.
+	 */
+	if (!was_present && !is_present) {
+		/*
+		 * If this change is not on an MMIO PTE and not setting a PTE
+		 * as disconnected, then it is unexpected. Log the change,
+		 * though it should not impact the guest since both the former
+		 * and current PTEs are nonpresent.
+		 */
+		WARN_ON((new_pte != DISCONNECTED_PTE) &&
+			!is_mmio_spte(new_pte));
+		return;
+	}
+
+	if (was_present && !was_leaf && (pfn_changed || !is_present)) {
+		/*
+		 * The level of the page table being freed is one level lower
+		 * than the level at which it is mapped.
+		 */
+		child_level = level - 1;
+
+		/*
+		 * If there was a present non-leaf entry before, and now the
+		 * entry points elsewhere, the lpage stats and dirty logging /
+		 * access tracking status for all the entries the old pte
+		 * pointed to must be updated and the page table pages it
+		 * pointed to must be freed.
+		 */
+		handle_disconnected_pt(kvm, as_id, gfn, spte_to_pfn(old_pte),
+				       child_level);
+	}
+}
+
 /**
  * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
  * @kvm: kvm instance
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 08/28] kvm: mmu: Init / Uninit the direct MMU
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (6 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 07/28] kvm: mmu: Add functions for handling changed PTEs Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-12-02 23:40   ` Sean Christopherson
  2019-09-26 23:18 ` [RFC PATCH 09/28] kvm: mmu: Free direct MMU page table memory in an RCU callback Ben Gardon
                   ` (21 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

The direct MMU introduces several new fields that need to be initialized
and torn down. Add functions to do that initialization / cleanup.
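
In condensed form, the initialization added here looks like the sketch
below (simplified from the diff that follows; the _sketch suffix marks it
as illustrative, and the unwinding of partially allocated roots is elided):

static int kvm_mmu_init_direct_mmu_sketch(struct kvm *kvm)
{
	struct page *page;
	int i;

	if (!is_direct_mmu_enabled())
		return 0;

	/* One zeroed root page per address space, kept for the VM's life. */
	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
		page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
		if (!page)
			return -ENOMEM; /* the real code unwinds prior roots */
		kvm->arch.direct_root_hpa[i] = page_to_phys(page);
	}

	kvm->arch.direct_mmu_enabled = true;
	kvm->arch.pure_direct_mmu = true;
	return 0;
}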

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/include/asm/kvm_host.h |  51 ++++++++----
 arch/x86/kvm/mmu.c              | 132 +++++++++++++++++++++++++++++---
 arch/x86/kvm/x86.c              |  16 +++-
 3 files changed, 169 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 23edf56cf577c..1f8164c577d50 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -236,6 +236,22 @@ enum {
  */
 #define KVM_APIC_PV_EOI_PENDING	1
 
+#define HF_GIF_MASK		(1 << 0)
+#define HF_HIF_MASK		(1 << 1)
+#define HF_VINTR_MASK		(1 << 2)
+#define HF_NMI_MASK		(1 << 3)
+#define HF_IRET_MASK		(1 << 4)
+#define HF_GUEST_MASK		(1 << 5) /* VCPU is in guest-mode */
+#define HF_SMM_MASK		(1 << 6)
+#define HF_SMM_INSIDE_NMI_MASK	(1 << 7)
+
+#define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
+#define KVM_ADDRESS_SPACE_NUM 2
+
+#define kvm_arch_vcpu_memslots_id(vcpu) \
+		((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
+#define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
+
 struct kvm_kernel_irq_routing_entry;
 
 /*
@@ -940,6 +956,24 @@ struct kvm_arch {
 	bool exception_payload_enabled;
 
 	struct kvm_pmu_event_filter *pmu_event_filter;
+
+	/*
+	 * Whether the direct MMU is enabled for this VM. This contains a
+	 * snapshot of the direct MMU module parameter from when the VM was
+	 * created and remains unchanged for the life of the VM. If this is
+	 * true, direct MMU handler functions will run for various MMU
+	 * operations.
+	 */
+	bool direct_mmu_enabled;
+	/*
+	 * Indicates that the paging structure built by the direct MMU is
+	 * currently the only one in use. If nesting is used, prompting the
+	 * creation of shadow page tables for L2, this will be set to false.
+	 * While this is true, only direct MMU handlers will be run for many
+	 * MMU functions. Ignored if !direct_mmu_enabled.
+	 */
+	bool pure_direct_mmu;
+	hpa_t direct_root_hpa[KVM_ADDRESS_SPACE_NUM];
 };
 
 struct kvm_vm_stat {
@@ -1255,7 +1289,7 @@ void kvm_mmu_module_exit(void);
 
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_mmu_create(struct kvm_vcpu *vcpu);
-void kvm_mmu_init_vm(struct kvm *kvm);
+int kvm_mmu_init_vm(struct kvm *kvm);
 void kvm_mmu_uninit_vm(struct kvm *kvm);
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask,
@@ -1519,21 +1553,6 @@ enum {
 	TASK_SWITCH_GATE = 3,
 };
 
-#define HF_GIF_MASK		(1 << 0)
-#define HF_HIF_MASK		(1 << 1)
-#define HF_VINTR_MASK		(1 << 2)
-#define HF_NMI_MASK		(1 << 3)
-#define HF_IRET_MASK		(1 << 4)
-#define HF_GUEST_MASK		(1 << 5) /* VCPU is in guest-mode */
-#define HF_SMM_MASK		(1 << 6)
-#define HF_SMM_INSIDE_NMI_MASK	(1 << 7)
-
-#define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
-#define KVM_ADDRESS_SPACE_NUM 2
-
-#define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
-#define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
-
 asmlinkage void kvm_spurious_fault(void);
 
 /*
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 50413f17c7cd0..788edbda02f69 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -47,6 +47,10 @@
 #include <asm/kvm_page_track.h>
 #include "trace.h"
 
+static bool __read_mostly direct_mmu_enabled;
+module_param_named(enable_direct_mmu, direct_mmu_enabled, bool,
+		   S_IRUGO | S_IWUSR);
+
 /*
  * When setting this variable to true it enables Two-Dimensional-Paging
  * where the hardware walks 2 page tables:
@@ -3754,27 +3758,56 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
 	*root_hpa = INVALID_PAGE;
 }
 
+static bool is_direct_mmu_root(struct kvm *kvm, hpa_t root)
+{
+	int as_id;
+
+	for (as_id = 0; as_id < KVM_ADDRESS_SPACE_NUM; as_id++)
+		if (root == kvm->arch.direct_root_hpa[as_id])
+			return true;
+
+	return false;
+}
+
 /* roots_to_free must be some combination of the KVM_MMU_ROOT_* flags */
 void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			ulong roots_to_free)
 {
 	int i;
 	LIST_HEAD(invalid_list);
-	bool free_active_root = roots_to_free & KVM_MMU_ROOT_CURRENT;
 
 	BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
 
-	/* Before acquiring the MMU lock, see if we need to do any real work. */
-	if (!(free_active_root && VALID_PAGE(mmu->root_hpa))) {
-		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
-			if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
-			    VALID_PAGE(mmu->prev_roots[i].hpa))
-				break;
+	/*
+	 * Direct MMU paging structures follow the life of the VM, so instead of
+	 * destroying direct MMU paging structure root, simply mark the root
+	 * HPA pointing to it as invalid.
+	 */
+	if (vcpu->kvm->arch.direct_mmu_enabled &&
+	    roots_to_free & KVM_MMU_ROOT_CURRENT &&
+	    is_direct_mmu_root(vcpu->kvm, mmu->root_hpa))
+		mmu->root_hpa = INVALID_PAGE;
 
-		if (i == KVM_MMU_NUM_PREV_ROOTS)
-			return;
+	if (!VALID_PAGE(mmu->root_hpa))
+		roots_to_free &= ~KVM_MMU_ROOT_CURRENT;
+
+	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+		if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) {
+			if (is_direct_mmu_root(vcpu->kvm,
+					       mmu->prev_roots[i].hpa))
+				mmu->prev_roots[i].hpa = INVALID_PAGE;
+			if (!VALID_PAGE(mmu->prev_roots[i].hpa))
+				roots_to_free &= ~KVM_MMU_ROOT_PREVIOUS(i);
+		}
 	}
 
+	/*
+	 * If there are no valid roots that need freeing at this point, avoid
+	 * acquiring the MMU lock and return.
+	 */
+	if (!roots_to_free)
+		return;
+
 	write_lock(&vcpu->kvm->mmu_lock);
 
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
@@ -3782,7 +3815,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			mmu_free_root_page(vcpu->kvm, &mmu->prev_roots[i].hpa,
 					   &invalid_list);
 
-	if (free_active_root) {
+	if (roots_to_free & KVM_MMU_ROOT_CURRENT) {
 		if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
 		    (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
 			mmu_free_root_page(vcpu->kvm, &mmu->root_hpa,
@@ -3820,7 +3853,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	struct kvm_mmu_page *sp;
 	unsigned i;
 
-	if (vcpu->arch.mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
+	if (vcpu->kvm->arch.direct_mmu_enabled) {
+		// TODO: Support 5 level paging in the direct MMU
+		BUG_ON(vcpu->arch.mmu->shadow_root_level > PT64_ROOT_4LEVEL);
+		vcpu->arch.mmu->root_hpa = vcpu->kvm->arch.direct_root_hpa[
+			kvm_arch_vcpu_memslots_id(vcpu)];
+	} else if (vcpu->arch.mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
 		write_lock(&vcpu->kvm->mmu_lock);
 		if(make_mmu_pages_available(vcpu) < 0) {
 			write_unlock(&vcpu->kvm->mmu_lock);
@@ -3863,6 +3901,10 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	gfn_t root_gfn, root_cr3;
 	int i;
 
+	write_lock(&vcpu->kvm->mmu_lock);
+	vcpu->kvm->arch.pure_direct_mmu = false;
+	write_unlock(&vcpu->kvm->mmu_lock);
+
 	root_cr3 = vcpu->arch.mmu->get_cr3(vcpu);
 	root_gfn = root_cr3 >> PAGE_SHIFT;
 
@@ -5710,6 +5752,64 @@ void kvm_disable_tdp(void)
 }
 EXPORT_SYMBOL_GPL(kvm_disable_tdp);
 
+static bool is_direct_mmu_enabled(void)
+{
+	if (!READ_ONCE(direct_mmu_enabled))
+		return false;
+
+	if (WARN_ONCE(!tdp_enabled,
+		      "Creating a VM with direct MMU enabled requires TDP."))
+		return false;
+
+	return true;
+}
+
+static int kvm_mmu_init_direct_mmu(struct kvm *kvm)
+{
+	struct page *page;
+	int i;
+
+	if (!is_direct_mmu_enabled())
+		return 0;
+
+	/*
+	 * Allocate the direct MMU root pages. These pages follow the life of
+	 * the VM.
+	 */
+	for (i = 0; i < ARRAY_SIZE(kvm->arch.direct_root_hpa); i++) {
+		page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+		if (!page)
+			goto err;
+		kvm->arch.direct_root_hpa[i] = page_to_phys(page);
+	}
+
+	/* This should not be changed for the lifetime of the VM. */
+	kvm->arch.direct_mmu_enabled = true;
+
+	kvm->arch.pure_direct_mmu = true;
+	return 0;
+err:
+	for (i = 0; i < ARRAY_SIZE(kvm->arch.direct_root_hpa); i++) {
+		if (kvm->arch.direct_root_hpa[i] &&
+		    VALID_PAGE(kvm->arch.direct_root_hpa[i]))
+			free_page((unsigned long)__va(kvm->arch.direct_root_hpa[i]));
+		kvm->arch.direct_root_hpa[i] = INVALID_PAGE;
+	}
+	return -ENOMEM;
+}
+
+static void kvm_mmu_uninit_direct_mmu(struct kvm *kvm)
+{
+	int i;
+
+	if (!kvm->arch.direct_mmu_enabled)
+		return;
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
+		handle_disconnected_pt(kvm, i, 0,
+			(kvm_pfn_t)(kvm->arch.direct_root_hpa[i] >> PAGE_SHIFT),
+			PT64_ROOT_4LEVEL);
+}
 
 /* The return value indicates if tlb flush on all vcpus is needed. */
 typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head *rmap_head);
@@ -5956,13 +6056,19 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 	kvm_mmu_zap_all_fast(kvm);
 }
 
-void kvm_mmu_init_vm(struct kvm *kvm)
+int kvm_mmu_init_vm(struct kvm *kvm)
 {
 	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
+	int r;
+
+	r = kvm_mmu_init_direct_mmu(kvm);
+	if (r)
+		return r;
 
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
+	return 0;
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
@@ -5970,6 +6076,8 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
 	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
 
 	kvm_page_track_unregister_notifier(kvm, node);
+
+	kvm_mmu_uninit_direct_mmu(kvm);
 }
 
 void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9ecf83da396c9..2972b6c6029fb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9421,6 +9421,8 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
+	int err;
+
 	if (type)
 		return -EINVAL;
 
@@ -9450,9 +9452,19 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 	kvm_hv_init_vm(kvm);
 	kvm_page_track_init(kvm);
-	kvm_mmu_init_vm(kvm);
+	err = kvm_mmu_init_vm(kvm);
+	if (err)
+		return err;
+
+	err = kvm_x86_ops->vm_init(kvm);
+	if (err)
+		goto error;
+
+	return 0;
 
-	return kvm_x86_ops->vm_init(kvm);
+error:
+	kvm_mmu_uninit_vm(kvm);
+	return err;
 }
 
 static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 09/28] kvm: mmu: Free direct MMU page table memory in an RCU callback
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (7 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 08/28] kvm: mmu: Init / Uninit the direct MMU Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 10/28] kvm: mmu: Flush TLBs before freeing direct MMU page table memory Ben Gardon
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

The direct walk iterator, introduced in a later commit in this series,
uses RCU to ensure that its concurrent access to paging structure memory
is safe. This requires that page table memory not be freed until an RCU
grace period has elapsed. To avoid blocking the threads that remove page
table memory from the paging structure, free the disconnected page table
memory in an RCU callback.
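
The deferred-free pattern hinges on the struct page rcu_head. A minimal
sketch: the callback matches the one added below, while defer_free_pt() is
only an illustrative wrapper, not a helper introduced by this patch:

static void free_pt_rcu_callback(struct rcu_head *rp)
{
	struct page *page = container_of(rp, struct page, rcu_head);

	/* All RCU readers that could see this page table have finished. */
	free_page((unsigned long)page_address(page));
}

static void defer_free_pt(struct page *page)
{
	/* Readers walk page table memory only under rcu_read_lock(). */
	call_rcu(&page->rcu_head, free_pt_rcu_callback);
}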

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 788edbda02f69..9fe57ef7baa29 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1685,6 +1685,21 @@ static bool __rmap_set_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
 	return flush;
 }
 
+/*
+ * This function is called through call_rcu in order to free direct page table
+ * memory safely, with respect to other KVM MMU threads that might be operating
+ * on it. Because direct page table memory is only accessed in an RCU read
+ * critical section and freed after a grace period, lockless readers will
+ * never use the memory after it is freed.
+ */
+static void free_pt_rcu_callback(struct rcu_head *rp)
+{
+	struct page *req = container_of(rp, struct page, rcu_head);
+	u64 *disconnected_pt = page_address(req);
+
+	free_page((unsigned long)disconnected_pt);
+}
+
 static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
 			       u64 old_pte, u64 new_pte, int level);
 
@@ -1720,6 +1735,11 @@ static void mark_pte_disconnected(struct kvm *kvm, int as_id, gfn_t gfn,
  * Given a pointer to a page table that has been removed from the paging
  * structure and its level, recursively free child page tables and mark their
  * entries as disconnected.
+ *
+ * RCU dereferences are not necessary to protect access to the disconnected
+ * page table or its children because it has been atomically removed from the
+ * root of the paging structure, so no other thread will be trying to free the
+ * memory.
  */
 static void handle_disconnected_pt(struct kvm *kvm, int as_id,
 				   gfn_t pt_base_gfn, kvm_pfn_t pfn, int level)
@@ -1727,6 +1747,7 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
 	int i;
 	gfn_t gfn = pt_base_gfn;
 	u64 *pt = pfn_to_kaddr(pfn);
+	struct page *page;
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
 		/*
@@ -1739,7 +1760,12 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
 		gfn += KVM_PAGES_PER_HPAGE(level);
 	}
 
-	free_page((unsigned long)pt);
+	/*
+	 * Free the pt page in an RCU callback, once it's safe to do
+	 * so.
+	 */
+	page = pfn_to_page(pfn);
+	call_rcu(&page->rcu_head, free_pt_rcu_callback);
 }
 
 /**
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 10/28] kvm: mmu: Flush TLBs before freeing direct MMU page table memory
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (8 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 09/28] kvm: mmu: Free direct MMU page table memory in an RCU callback Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-12-02 23:46   ` Sean Christopherson
  2019-09-26 23:18 ` [RFC PATCH 11/28] kvm: mmu: Optimize for freeing direct MMU PTs on teardown Ben Gardon
                   ` (19 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

If page table memory is freed before a TLB flush, it can result in
improper guest access to memory through paging structure caches.
Specifically, until a TLB flush, memory that was part of the paging
structure could be used by the hardware for address translation if a
partial walk leading to it is stored in the paging structure cache. Ensure
that there is a TLB flush before page table memory is freed by
transferring disconnected pages to a disconnected list and, on a flush,
transferring a snapshot of the disconnected list to a free list. The free
list is processed asynchronously to avoid slowing TLB flushes.
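
The resulting flow in kvm_flush_remote_tlbs() is roughly the following
(condensed from the diff below; the direct_mmu_enabled checks are omitted
for brevity):

void kvm_flush_remote_tlbs(struct kvm *kvm)
{
	LIST_HEAD(disconnected_snapshot);

	/* Only pages disconnected before this flush may be freed later. */
	direct_mmu_cut_disconnected_pt_list(kvm, &disconnected_snapshot);

	/* Paging structure caches are invalidated here. */
	__kvm_flush_remote_tlbs(kvm);

	/* Hand the snapshot to the free list; a worker frees it via RCU. */
	direct_mmu_process_free_list_async(kvm, &disconnected_snapshot);
}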

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/include/asm/kvm_host.h |   5 ++
 arch/x86/kvm/Kconfig            |   1 +
 arch/x86/kvm/mmu.c              | 127 ++++++++++++++++++++++++++++++--
 include/linux/kvm_host.h        |   1 +
 virt/kvm/kvm_main.c             |   9 ++-
 5 files changed, 136 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1f8164c577d50..9bf149dce146d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -974,6 +974,11 @@ struct kvm_arch {
 	 */
 	bool pure_direct_mmu;
 	hpa_t direct_root_hpa[KVM_ADDRESS_SPACE_NUM];
+	spinlock_t direct_mmu_disconnected_pts_lock;
+	struct list_head direct_mmu_disconnected_pts;
+	spinlock_t direct_mmu_pt_free_list_lock;
+	struct list_head direct_mmu_pt_free_list;
+	struct work_struct direct_mmu_free_work;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 840e12583b85b..7c615f3cebf8f 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -45,6 +45,7 @@ config KVM
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
 	select SRCU
+	select HAVE_KVM_ARCH_TLB_FLUSH_ALL
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9fe57ef7baa29..317e9238f17b2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1700,6 +1700,100 @@ static void free_pt_rcu_callback(struct rcu_head *rp)
 	free_page((unsigned long)disconnected_pt);
 }
 
+/*
+ * Takes a snapshot of, and clears, the direct MMU disconnected pt list. Once
+ * TLBs have been flushed, this snapshot can be transferred to the direct MMU
+ * PT free list to be freed.
+ */
+static void direct_mmu_cut_disconnected_pt_list(struct kvm *kvm,
+						struct list_head *snapshot)
+{
+	spin_lock(&kvm->arch.direct_mmu_disconnected_pts_lock);
+	list_splice_tail_init(&kvm->arch.direct_mmu_disconnected_pts, snapshot);
+	spin_unlock(&kvm->arch.direct_mmu_disconnected_pts_lock);
+}
+
+/*
+ * Takes a snapshot of, and clears, the direct MMU PT free list and then sets
+ * each page in the snapshot to be freed after an RCU grace period.
+ */
+static void direct_mmu_process_pt_free_list(struct kvm *kvm)
+{
+	LIST_HEAD(free_list);
+	struct page *page;
+	struct page *next;
+
+	spin_lock(&kvm->arch.direct_mmu_pt_free_list_lock);
+	list_splice_tail_init(&kvm->arch.direct_mmu_pt_free_list, &free_list);
+	spin_unlock(&kvm->arch.direct_mmu_pt_free_list_lock);
+
+	list_for_each_entry_safe(page, next, &free_list, lru) {
+		list_del(&page->lru);
+		/*
+		 * Free the pt page in an RCU callback, once it's safe to do
+		 * so.
+		 */
+		call_rcu(&page->rcu_head, free_pt_rcu_callback);
+	}
+}
+
+static void direct_mmu_free_work_fn(struct work_struct *work)
+{
+	struct kvm *kvm = container_of(work, struct kvm,
+				       arch.direct_mmu_free_work);
+
+	direct_mmu_process_pt_free_list(kvm);
+}
+
+/*
+ * Propagate a snapshot of the direct MMU disconnected pt list to the direct MMU
+ * PT free list, after TLBs have been flushed. Schedule work to free the pages
+ * in the direct MMU PT free list.
+ */
+static void direct_mmu_process_free_list_async(struct kvm *kvm,
+					       struct list_head *snapshot)
+{
+	spin_lock(&kvm->arch.direct_mmu_pt_free_list_lock);
+	list_splice_tail_init(snapshot, &kvm->arch.direct_mmu_pt_free_list);
+	spin_unlock(&kvm->arch.direct_mmu_pt_free_list_lock);
+
+	schedule_work(&kvm->arch.direct_mmu_free_work);
+}
+
+/*
+ * To be used during teardown once all VCPUs are paused. Ensures that the
+ * direct MMU disconnected PT and PT free lists are emptied and outstanding
+ * page table memory freed.
+ */
+static void direct_mmu_process_pt_free_list_sync(struct kvm *kvm)
+{
+	LIST_HEAD(snapshot);
+
+	cancel_work_sync(&kvm->arch.direct_mmu_free_work);
+	direct_mmu_cut_disconnected_pt_list(kvm, &snapshot);
+
+	spin_lock(&kvm->arch.direct_mmu_pt_free_list_lock);
+	list_splice_tail_init(&snapshot, &kvm->arch.direct_mmu_pt_free_list);
+	spin_unlock(&kvm->arch.direct_mmu_pt_free_list_lock);
+
+	direct_mmu_process_pt_free_list(kvm);
+}
+
+/*
+ * Add a page of memory that has been disconnected from the paging structure to
+ * a queue to be freed. This is a two step process: after a page has been
+ * disconnected, the TLBs must be flushed, and an RCU grace period must elapse
+ * before the memory can be freed.
+ */
+static void direct_mmu_disconnected_pt_list_add(struct kvm *kvm,
+						struct page *page)
+{
+	spin_lock(&kvm->arch.direct_mmu_disconnected_pts_lock);
+	list_add_tail(&page->lru, &kvm->arch.direct_mmu_disconnected_pts);
+	spin_unlock(&kvm->arch.direct_mmu_disconnected_pts_lock);
+}
+
+
 static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
 			       u64 old_pte, u64 new_pte, int level);
 
@@ -1760,12 +1854,8 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
 		gfn += KVM_PAGES_PER_HPAGE(level);
 	}
 
-	/*
-	 * Free the pt page in an RCU callback, once it's safe to do
-	 * so.
-	 */
 	page = pfn_to_page(pfn);
-	call_rcu(&page->rcu_head, free_pt_rcu_callback);
+	direct_mmu_disconnected_pt_list_add(kvm, page);
 }
 
 /**
@@ -5813,6 +5903,12 @@ static int kvm_mmu_init_direct_mmu(struct kvm *kvm)
 	kvm->arch.direct_mmu_enabled = true;
 
 	kvm->arch.pure_direct_mmu = true;
+	spin_lock_init(&kvm->arch.direct_mmu_disconnected_pts_lock);
+	INIT_LIST_HEAD(&kvm->arch.direct_mmu_disconnected_pts);
+	spin_lock_init(&kvm->arch.direct_mmu_pt_free_list_lock);
+	INIT_LIST_HEAD(&kvm->arch.direct_mmu_pt_free_list);
+	INIT_WORK(&kvm->arch.direct_mmu_free_work, direct_mmu_free_work_fn);
+
 	return 0;
 err:
 	for (i = 0; i < ARRAY_SIZE(kvm->arch.direct_root_hpa); i++) {
@@ -5831,6 +5927,8 @@ static void kvm_mmu_uninit_direct_mmu(struct kvm *kvm)
 	if (!kvm->arch.direct_mmu_enabled)
 		return;
 
+	direct_mmu_process_pt_free_list_sync(kvm);
+
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
 		handle_disconnected_pt(kvm, i, 0,
 			(kvm_pfn_t)(kvm->arch.direct_root_hpa[i] >> PAGE_SHIFT),
@@ -6516,3 +6614,22 @@ void kvm_mmu_module_exit(void)
 	unregister_shrinker(&mmu_shrinker);
 	mmu_audit_disable();
 }
+
+void kvm_flush_remote_tlbs(struct kvm *kvm)
+{
+	LIST_HEAD(disconnected_snapshot);
+
+	if (kvm->arch.direct_mmu_enabled)
+		direct_mmu_cut_disconnected_pt_list(kvm,
+						    &disconnected_snapshot);
+
+	/*
+	 * Synchronously flush the TLBs before processing the direct MMU free
+	 * list.
+	 */
+	__kvm_flush_remote_tlbs(kvm);
+
+	if (kvm->arch.direct_mmu_enabled)
+		direct_mmu_process_free_list_async(kvm, &disconnected_snapshot);
+}
+EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index baed80f8a7f00..350a3b79cc8d1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -786,6 +786,7 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
 int kvm_vcpu_yield_to(struct kvm_vcpu *target);
 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
 
+void __kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_reload_remote_mmus(struct kvm *kvm);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9ce067b6882b7..c8559a86625ce 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -255,8 +255,7 @@ bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req)
 	return called;
 }
 
-#ifndef CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL
-void kvm_flush_remote_tlbs(struct kvm *kvm)
+void __kvm_flush_remote_tlbs(struct kvm *kvm)
 {
 	/*
 	 * Read tlbs_dirty before setting KVM_REQ_TLB_FLUSH in
@@ -280,6 +279,12 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
 		++kvm->stat.remote_tlb_flush;
 	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
 }
+
+#ifndef CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL
+void kvm_flush_remote_tlbs(struct kvm *kvm)
+{
+	__kvm_flush_remote_tlbs(kvm);
+}
 EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);
 #endif
 
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 11/28] kvm: mmu: Optimize for freeing direct MMU PTs on teardown
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (9 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 10/28] kvm: mmu: Flush TLBs before freeing direct MMU page table memory Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-12-02 23:54   ` Sean Christopherson
  2019-09-26 23:18 ` [RFC PATCH 12/28] kvm: mmu: Set tlbs_dirty atomically Ben Gardon
                   ` (18 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Waiting for a TLB flush and an RCU grace period before freeing page table
memory provides safety during steady-state operation; however, these
protections are not always necessary. On VM teardown, only one thread is
operating on the paging structures and no vCPUs are running. As a result,
a fast path can be added to the disconnected page table handler that
frees the memory immediately. Add the fast path and use it when tearing
down VMs.
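
Inside handle_disconnected_pt(), the fast path amounts to the fragment
below (simplified from the diff that follows):

	if (vm_teardown) {
		/*
		 * No vCPUs can be running, so no TLB flush or RCU grace
		 * period is needed before the memory is reused.
		 */
		BUG_ON(atomic_read(&kvm->online_vcpus) != 0);
		cond_resched();	/* the structure may be huge; don't hog the CPU */
		free_page((unsigned long)pt);
	} else {
		/* Steady state: defer freeing until after a TLB flush. */
		direct_mmu_disconnected_pt_list_add(kvm, pfn_to_page(pfn));
	}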

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 44 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 317e9238f17b2..263718d49f730 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1795,7 +1795,8 @@ static void direct_mmu_disconnected_pt_list_add(struct kvm *kvm,
 
 
 static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
-			       u64 old_pte, u64 new_pte, int level);
+			       u64 old_pte, u64 new_pte, int level,
+			       bool vm_teardown);
 
 /**
  * mark_pte_disconnected - Mark a PTE as part of a disconnected PT
@@ -1805,16 +1806,19 @@ static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
  * @ptep: a pointer to the PTE to be marked disconnected
  * @level: the level of the PT this PTE was a part of, when it was part of the
  *	paging structure
+ * @vm_teardown: all vCPUs are paused and the VM is being torn down. Yield and
+ *	free child page table memory immediately.
  */
 static void mark_pte_disconnected(struct kvm *kvm, int as_id, gfn_t gfn,
-				  u64 *ptep, int level)
+				  u64 *ptep, int level, bool vm_teardown)
 {
 	u64 old_pte;
 
 	old_pte = xchg(ptep, DISCONNECTED_PTE);
 	BUG_ON(old_pte == DISCONNECTED_PTE);
 
-	handle_changed_pte(kvm, as_id, gfn, old_pte, DISCONNECTED_PTE, level);
+	handle_changed_pte(kvm, as_id, gfn, old_pte, DISCONNECTED_PTE, level,
+			   vm_teardown);
 }
 
 /**
@@ -1825,6 +1829,8 @@ static void mark_pte_disconnected(struct kvm *kvm, int as_id, gfn_t gfn,
  * @pt_base_gfn: the base GFN that was mapped by the first PTE in the PT
  * @pfn: The physical frame number of the disconnected PT page
  * @level: the level of the PT, when it was part of the paging structure
+ * @vm_teardown: all vCPUs are paused and the VM is being torn down. Yield and
+ *	free child page table memory immediately.
  *
  * Given a pointer to a page table that has been removed from the paging
  * structure and its level, recursively free child page tables and mark their
@@ -1834,9 +1840,17 @@ static void mark_pte_disconnected(struct kvm *kvm, int as_id, gfn_t gfn,
  * page table or its children because it has been atomically removed from the
  * root of the paging structure, so no other thread will be trying to free the
  * memory.
+ *
+ * If vm_teardown=true, this function will yield while handling the
+ * disconnected page tables and will free memory immediately. This option
+ * should only be used during VM teardown when no other CPUs are accessing the
+ * direct paging structures. Yielding is necessary because the paging structure
+ * could be quite large, and freeing it without yielding would induce
+ * soft-lockups or scheduler warnings.
  */
 static void handle_disconnected_pt(struct kvm *kvm, int as_id,
-				   gfn_t pt_base_gfn, kvm_pfn_t pfn, int level)
+				   gfn_t pt_base_gfn, kvm_pfn_t pfn, int level,
+				   bool vm_teardown)
 {
 	int i;
 	gfn_t gfn = pt_base_gfn;
@@ -1849,13 +1863,20 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
 		 * try to map in an entry there or try to free any child page
 		 * table the entry might have pointed to.
 		 */
-		mark_pte_disconnected(kvm, as_id, gfn, &pt[i], level);
+		mark_pte_disconnected(kvm, as_id, gfn, &pt[i], level,
+				      vm_teardown);
 
 		gfn += KVM_PAGES_PER_HPAGE(level);
 	}
 
-	page = pfn_to_page(pfn);
-	direct_mmu_disconnected_pt_list_add(kvm, page);
+	if (vm_teardown) {
+		BUG_ON(atomic_read(&kvm->online_vcpus) != 0);
+		cond_resched();
+		free_page((unsigned long)pt);
+	} else {
+		page = pfn_to_page(pfn);
+		direct_mmu_disconnected_pt_list_add(kvm, page);
+	}
 }
 
 /**
@@ -1866,6 +1887,8 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
  * @old_pte: The value of the PTE before the atomic compare / exchange
  * @new_pte: The value of the PTE after the atomic compare / exchange
  * @level: the level of the PT the PTE is part of in the paging structure
+ * @vm_teardown: all vCPUs are paused and the VM is being torn down. Yield and
+ *	free child page table memory immediately.
  *
  * Handle bookkeeping that might result from the modification of a PTE.
  * This function should be called in the same RCU read critical section as the
@@ -1874,7 +1897,8 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
  * setting the dirty bit on a pte.
  */
 static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
-			       u64 old_pte, u64 new_pte, int level)
+			       u64 old_pte, u64 new_pte, int level,
+			       bool vm_teardown)
 {
 	bool was_present = is_present_direct_pte(old_pte);
 	bool is_present = is_present_direct_pte(new_pte);
@@ -1920,7 +1944,7 @@ static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
 		 * pointed to must be freed.
 		 */
 		handle_disconnected_pt(kvm, as_id, gfn, spte_to_pfn(old_pte),
-				       child_level);
+				       child_level, vm_teardown);
 	}
 }
 
@@ -5932,7 +5956,7 @@ static void kvm_mmu_uninit_direct_mmu(struct kvm *kvm)
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
 		handle_disconnected_pt(kvm, i, 0,
 			(kvm_pfn_t)(kvm->arch.direct_root_hpa[i] >> PAGE_SHIFT),
-			PT64_ROOT_4LEVEL);
+			PT64_ROOT_4LEVEL, true);
 }
 
 /* The return value indicates if tlb flush on all vcpus is needed. */
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 12/28] kvm: mmu: Set tlbs_dirty atomically
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (10 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 11/28] kvm: mmu: Optimize for freeing direct MMU PTs on teardown Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-12-03  0:13   ` Sean Christopherson
  2019-09-26 23:18 ` [RFC PATCH 13/28] kvm: mmu: Add an iterator for concurrent paging structure walks Ben Gardon
                   ` (17 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

The tlbs_dirty mechanism for deferring flushes can be expanded beyond
its current use case. This allows MMU operations which do not
themselves require TLB flushes to notify other threads that there are
unflushed modifications to the paging structure. In order to use this
mechanism concurrently, the updates to the global tlbs_dirty must be
made atomically.
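
A minimal sketch of the resulting shape of FNAME(sync_page)
(sync_page_sketch is an illustrative stand-in; the per-spte work is elided
and only the counter handling is shown):

static int sync_page_sketch(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
	int nr_present = 0;
	int tlbs_dirty = 0;

	/*
	 * Per-spte work elided: each zapped or dropped spte does a local
	 * tlbs_dirty++ instead of bumping the shared counter directly.
	 */

	/* One atomic add replaces many racy vcpu->kvm->tlbs_dirty++. */
	xadd(&vcpu->kvm->tlbs_dirty, tlbs_dirty);
	return nr_present;
}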

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/paging_tmpl.h | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 97903c8dcad16..cc3630c8bd3ea 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -986,6 +986,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 	bool host_writable;
 	gpa_t first_pte_gpa;
 	int set_spte_ret = 0;
+	int ret;
+	int tlbs_dirty = 0;
 
 	/* direct kvm_mmu_page can not be unsync. */
 	BUG_ON(sp->role.direct);
@@ -1004,17 +1006,13 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
 
 		if (kvm_vcpu_read_guest_atomic(vcpu, pte_gpa, &gpte,
-					       sizeof(pt_element_t)))
-			return 0;
+					       sizeof(pt_element_t))) {
+			ret = 0;
+			goto out;
+		}
 
 		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
-			/*
-			 * Update spte before increasing tlbs_dirty to make
-			 * sure no tlb flush is lost after spte is zapped; see
-			 * the comments in kvm_flush_remote_tlbs().
-			 */
-			smp_wmb();
-			vcpu->kvm->tlbs_dirty++;
+			tlbs_dirty++;
 			continue;
 		}
 
@@ -1029,12 +1027,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 
 		if (gfn != sp->gfns[i]) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
-			/*
-			 * The same as above where we are doing
-			 * prefetch_invalid_gpte().
-			 */
-			smp_wmb();
-			vcpu->kvm->tlbs_dirty++;
+			tlbs_dirty++;
 			continue;
 		}
 
@@ -1051,7 +1044,11 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 	if (set_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH)
 		kvm_flush_remote_tlbs(vcpu->kvm);
 
-	return nr_present;
+	ret = nr_present;
+
+out:
+	xadd(&vcpu->kvm->tlbs_dirty, tlbs_dirty);
+	return ret;
 }
 
 #undef pt_element_t
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 13/28] kvm: mmu: Add an iterator for concurrent paging structure walks
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (11 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 12/28] kvm: mmu: Set tlbs_dirty atomically Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-12-03  2:15   ` Sean Christopherson
  2019-09-26 23:18 ` [RFC PATCH 14/28] kvm: mmu: Batch updates to the direct mmu disconnected list Ben Gardon
                   ` (16 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Add a utility for concurrent paging structure traversals. This iterator
uses several mechanisms to ensure that its accesses to paging structure
memory are safe, and that memory can be freed safely in the face of
lockless access. The purpose of the iterator is to create a unified
pattern for concurrent paging structure traversals and simplify the
implementation of other MMU functions.

This iterator implements a pre-order traversal of PTEs for a given GFN
range within a given address space. The iterator abstracts away
bookkeeping on successful changes to PTEs, retrying on failed PTE
modifications, TLB flushing, and yielding during long operations.
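
For illustration, a caller of the iterator is expected to look roughly like
the sketch below. zap_direct_gfn_range() is a hypothetical example, not a
function added in this patch, and it assumes kvm->mmu_lock is already held
for write:

static void zap_direct_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
				 gfn_t end)
{
	struct direct_walk_iterator iter;

	direct_walk_iterator_setup_walk(&iter, kvm, as_id, start, end,
					MMU_WRITE_LOCK | MMU_LOCK_MAY_RESCHED);
	while (direct_walk_iterator_next_present_pte(&iter)) {
		/*
		 * On a lost cmpxchg race the iterator rereads the same pte on
		 * the next step, so the loop retries automatically.
		 */
		direct_walk_iterator_set_pte(&iter, 0);
	}

	/* end_traversal reports whether deferred changes require a flush. */
	if (direct_walk_iterator_end_traversal(&iter))
		kvm_flush_remote_tlbs(kvm);
}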

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c      | 455 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmutrace.h |  50 +++++
 2 files changed, 505 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 263718d49f730..59d1866398c42 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1948,6 +1948,461 @@ static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
 	}
 }
 
+/*
+ * Given a page table entry and its level, returns the host virtual address of
+ * the child page table referenced by that entry, or NULL if the entry is
+ * non-present or a leaf.
+ */
+static u64 *pte_to_child_pt(u64 pte, int level)
+{
+	u64 *pt;
+	/* There's no child entry if this entry isn't present */
+	if (!is_present_direct_pte(pte))
+		return NULL;
+
+	/* There is no child page table if this is a leaf entry. */
+	if (is_last_spte(pte, level))
+		return NULL;
+
+	pt = (u64 *)__va(pte & PT64_BASE_ADDR_MASK);
+	return pt;
+}
+
+enum mmu_lock_mode {
+	MMU_NO_LOCK = 0,
+	MMU_READ_LOCK = 1,
+	MMU_WRITE_LOCK = 2,
+	MMU_LOCK_MAY_RESCHED = 4
+};
+
+/*
+ * A direct walk iterator encapsulates a walk through a direct paging structure.
+ * It ensures that the walk uses RCU to safely access page table
+ * memory.
+ */
+struct direct_walk_iterator {
+	/* Internal */
+	gfn_t walk_start;
+	gfn_t walk_end;
+	gfn_t target_gfn;
+	long tlbs_dirty;
+
+	/* the address space id. */
+	int as_id;
+	u64 *pt_path[PT64_ROOT_4LEVEL];
+	bool walk_in_progress;
+
+	/*
+	 * If set, the next call to direct_walk_iterator_next_pte_raw will
+	 * simply reread the current pte and return. This is useful in cases
+	 * where a thread misses a race to set a pte and wants to retry. This
+	 * should be set with a call to direct_walk_iterator_retry_pte.
+	 */
+	bool retry_pte;
+
+	/*
+	 * If set, the next call to direct_walk_iterator_next_pte_raw will not
+	 * step down to a lower level on its next step, even if it is at a
+	 * present, non-leaf pte. This is useful when, for example, splitting
+	 * pages, since we know that the entries below the now split page don't
+	 * need to be handled again.
+	 */
+	bool skip_step_down;
+
+	enum mmu_lock_mode lock_mode;
+	struct kvm *kvm;
+
+	/* Output */
+
+	/* The iterator's current level within the paging structure */
+	int level;
+	/* A pointer to the current PTE */
+	u64 *ptep;
+	/* A snapshot of the PTE pointed to by ptep */
+	u64 old_pte;
+	/* The lowest GFN mapped by the current PTE */
+	gfn_t pte_gfn_start;
+	/* The highest GFN mapped by the current PTE, + 1 */
+	gfn_t pte_gfn_end;
+};
+
+static void direct_walk_iterator_start_traversal(
+		struct direct_walk_iterator *iter)
+{
+	int level;
+
+	/*
+	 * Only clear the levels below the root. The root level page table is
+	 * allocated at VM creation time and will never change for the life of
+	 * the VM.
+	 */
+	for (level = PT_PAGE_TABLE_LEVEL; level < PT64_ROOT_4LEVEL; level++)
+		iter->pt_path[level - 1] = NULL;
+	iter->level = 0;
+	iter->ptep = NULL;
+	iter->old_pte = 0;
+	iter->pte_gfn_start = 0;
+	iter->pte_gfn_end = 0;
+	iter->walk_in_progress = false;
+	iter->retry_pte = false;
+	iter->skip_step_down = false;
+}
+
+static bool direct_walk_iterator_flush_needed(struct direct_walk_iterator *iter)
+{
+	long tlbs_dirty;
+
+	if (iter->tlbs_dirty) {
+		tlbs_dirty = xadd(&iter->kvm->tlbs_dirty, iter->tlbs_dirty) +
+				iter->tlbs_dirty;
+		iter->tlbs_dirty = 0;
+	} else {
+		tlbs_dirty = READ_ONCE(iter->kvm->tlbs_dirty);
+	}
+
+	return (iter->lock_mode & MMU_WRITE_LOCK) && tlbs_dirty;
+}
+
+static bool direct_walk_iterator_end_traversal(
+		struct direct_walk_iterator *iter)
+{
+	if (iter->walk_in_progress)
+		rcu_read_unlock();
+	return direct_walk_iterator_flush_needed(iter);
+}
+
+/*
+ * Resets a direct walk iterator to the root of the paging structure and drops
+ * the RCU read lock. After calling this function, the traversal can be retried.
+ */
+static void direct_walk_iterator_reset_traversal(
+		struct direct_walk_iterator *iter)
+{
+	/*
+	 * It's okay to ignore the return value indicating whether a TLB flush
+	 * is needed here, because we are ending and then restarting the
+	 * traversal without releasing the MMU lock. At this point the
+	 * iterator's tlbs_dirty will have been flushed to the kvm tlbs_dirty,
+	 * so the next end_traversal will report that a flush is needed, unless
+	 * there is an intervening flush for some other reason.
+	 */
+	direct_walk_iterator_end_traversal(iter);
+	direct_walk_iterator_start_traversal(iter);
+}
+
+/*
+ * Sets a direct walk iterator to seek the gfn range [start, end).
+ * If end is greater than the maximum possible GFN, it will be changed to the
+ * maximum possible gfn + 1. (Note that [start, end) is an inclusive/exclusive
+ * range, so in that case the last gfn to be iterated over would be the
+ * largest possible GFN.)
+ */
+__attribute__((unused))
+static void direct_walk_iterator_setup_walk(struct direct_walk_iterator *iter,
+	struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
+	enum mmu_lock_mode lock_mode)
+{
+	BUG_ON(!kvm->arch.direct_mmu_enabled);
+	BUG_ON((lock_mode & MMU_WRITE_LOCK) && (lock_mode & MMU_READ_LOCK));
+	BUG_ON(as_id < 0);
+	BUG_ON(as_id >= KVM_ADDRESS_SPACE_NUM);
+	BUG_ON(!VALID_PAGE(kvm->arch.direct_root_hpa[as_id]));
+
+	/* End cannot be greater than the maximum possible gfn. */
+	end = min(end, 1ULL << (PT64_ROOT_4LEVEL * PT64_PT_BITS));
+
+	iter->as_id = as_id;
+	iter->pt_path[PT64_ROOT_4LEVEL - 1] =
+			(u64 *)__va(kvm->arch.direct_root_hpa[as_id]);
+
+	iter->walk_start = start;
+	iter->walk_end = end;
+	iter->target_gfn = start;
+
+	iter->lock_mode = lock_mode;
+	iter->kvm = kvm;
+	iter->tlbs_dirty = 0;
+
+	direct_walk_iterator_start_traversal(iter);
+}
+
+__attribute__((unused))
+static void direct_walk_iterator_retry_pte(struct direct_walk_iterator *iter)
+{
+	BUG_ON(!iter->walk_in_progress);
+	iter->retry_pte = true;
+}
+
+__attribute__((unused))
+static void direct_walk_iterator_skip_step_down(
+		struct direct_walk_iterator *iter)
+{
+	BUG_ON(!iter->walk_in_progress);
+	iter->skip_step_down = true;
+}
+
+/*
+ * Steps down one level in the paging structure towards the previously set
+ * target gfn. Returns true if the iterator was able to step down a level,
+ * false otherwise.
+ */
+static bool direct_walk_iterator_try_step_down(
+		struct direct_walk_iterator *iter)
+{
+	u64 *child_pt;
+
+	/*
+	 * Reread the pte before stepping down to avoid traversing into page
+	 * tables that are no longer linked from this entry. This is not
+	 * needed for correctness - just a small optimization.
+	 */
+	iter->old_pte = READ_ONCE(*iter->ptep);
+
+	child_pt = pte_to_child_pt(iter->old_pte, iter->level);
+	if (child_pt == NULL)
+		return false;
+	child_pt = rcu_dereference(child_pt);
+
+	iter->level--;
+	iter->pt_path[iter->level - 1] = child_pt;
+	return true;
+}
+
+/*
+ * Steps to the next entry in the current page table, at the current page table
+ * level. The next entry could map a page of guest memory, another page table,
+ * or it could be non-present or invalid. Returns true if the iterator was able
+ * to step to the next entry in the page table, false otherwise.
+ */
+static bool direct_walk_iterator_try_step_side(
+		struct direct_walk_iterator *iter)
+{
+	/*
+	 * If the current gfn maps past the target gfn range, the next entry in
+	 * the current page table will be outside the target range.
+	 */
+	if (iter->pte_gfn_end >= iter->walk_end)
+		return false;
+
+	/*
+	 * Check if the iterator is already at the end of the current page
+	 * table.
+	 */
+	if (!(iter->pte_gfn_end % KVM_PAGES_PER_HPAGE(iter->level + 1)))
+		return false;
+
+	iter->target_gfn = iter->pte_gfn_end;
+	return true;
+}
+
+/*
+ * Tries to back up a level in the paging structure so that the walk can
+ * continue from the next entry in the parent page table. Returns true on a
+ * successful step up, false otherwise.
+ */
+static bool direct_walk_iterator_try_step_up(struct direct_walk_iterator *iter)
+{
+	if (iter->level == PT64_ROOT_4LEVEL)
+		return false;
+
+	iter->level++;
+	return true;
+}
+
+/*
+ * Step to the next pte in a pre-order traversal of the target gfn range.
+ * To get to the next pte, the iterator either steps down towards the current
+ * target gfn, if at a present, non-leaf pte, or over to a pte mapping a
+ * higher gfn, if there's room in the gfn range. If there is no step within
+ * the target gfn range, returns false.
+ */
+static bool direct_walk_iterator_next_pte_raw(struct direct_walk_iterator *iter)
+{
+	bool retry_pte = iter->retry_pte;
+	bool skip_step_down = iter->skip_step_down;
+
+	iter->retry_pte = false;
+	iter->skip_step_down = false;
+
+	if (iter->target_gfn >= iter->walk_end)
+		return false;
+
+	/* If the walk is just starting, set up initial values. */
+	if (!iter->walk_in_progress) {
+		rcu_read_lock();
+
+		iter->level = PT64_ROOT_4LEVEL;
+		iter->walk_in_progress = true;
+		return true;
+	}
+
+	if (retry_pte)
+		return true;
+
+	if (!skip_step_down && direct_walk_iterator_try_step_down(iter))
+		return true;
+
+	while (!direct_walk_iterator_try_step_side(iter))
+		if (!direct_walk_iterator_try_step_up(iter))
+			return false;
+	return true;
+}
+
+static void direct_walk_iterator_recalculate_output_fields(
+		struct direct_walk_iterator *iter)
+{
+	iter->ptep = iter->pt_path[iter->level - 1] +
+			PT64_INDEX(iter->target_gfn << PAGE_SHIFT, iter->level);
+	iter->old_pte = READ_ONCE(*iter->ptep);
+	iter->pte_gfn_start = ALIGN_DOWN(iter->target_gfn,
+			KVM_PAGES_PER_HPAGE(iter->level));
+	iter->pte_gfn_end = iter->pte_gfn_start +
+			KVM_PAGES_PER_HPAGE(iter->level);
+}
+
+static void direct_walk_iterator_prepare_cond_resched(
+		struct direct_walk_iterator *iter)
+{
+	if (direct_walk_iterator_end_traversal(iter))
+		kvm_flush_remote_tlbs(iter->kvm);
+
+	if (iter->lock_mode & MMU_WRITE_LOCK)
+		write_unlock(&iter->kvm->mmu_lock);
+	else if (iter->lock_mode & MMU_READ_LOCK)
+		read_unlock(&iter->kvm->mmu_lock);
+
+}
+
+static void direct_walk_iterator_finish_cond_resched(
+		struct direct_walk_iterator *iter)
+{
+	if (iter->lock_mode & MMU_WRITE_LOCK)
+		write_lock(&iter->kvm->mmu_lock);
+	else if (iter->lock_mode & MMU_READ_LOCK)
+		read_lock(&iter->kvm->mmu_lock);
+
+	direct_walk_iterator_start_traversal(iter);
+}
+
+static void direct_walk_iterator_cond_resched(struct direct_walk_iterator *iter)
+{
+	if (!(iter->lock_mode & MMU_LOCK_MAY_RESCHED) || !need_resched())
+		return;
+
+	direct_walk_iterator_prepare_cond_resched(iter);
+	cond_resched();
+	direct_walk_iterator_finish_cond_resched(iter);
+}
+
+static bool direct_walk_iterator_next_pte(struct direct_walk_iterator *iter)
+{
+	/*
+	 * This iterator could be iterating over a large number of PTEs, such
+	 * that if this thread did not yield, it would cause scheduler
+	 * problems. To avoid this, yield if needed. Note the check on
+	 * MMU_LOCK_MAY_RESCHED in direct_walk_iterator_cond_resched. This
+	 * iterator will not yield unless that flag is set in its lock_mode.
+	 */
+	direct_walk_iterator_cond_resched(iter);
+
+	while (true) {
+		if (!direct_walk_iterator_next_pte_raw(iter))
+			return false;
+
+		direct_walk_iterator_recalculate_output_fields(iter);
+		if (iter->old_pte != DISCONNECTED_PTE)
+			break;
+
+		/*
+		 * The iterator has encountered a disconnected pte, so it is in
+		 * a page that has been disconnected from the root. Restart the
+		 * traversal from the root in this case.
+		 */
+		direct_walk_iterator_reset_traversal(iter);
+	}
+
+	trace_kvm_mmu_direct_walk_iterator_step(iter->walk_start,
+			iter->walk_end, iter->pte_gfn_start,
+			iter->level, iter->old_pte);
+
+	return true;
+}
+
+/*
+ * As direct_walk_iterator_next_pte but skips over non-present ptes.
+ * (i.e. ptes that are 0 or invalidated.)
+ */
+static bool direct_walk_iterator_next_present_pte(
+		struct direct_walk_iterator *iter)
+{
+	while (direct_walk_iterator_next_pte(iter))
+		if (is_present_direct_pte(iter->old_pte))
+			return true;
+
+	return false;
+}
+
+/*
+ * As direct_walk_iterator_next_present_pte but skips over non-leaf ptes.
+ */
+__attribute__((unused))
+static bool direct_walk_iterator_next_present_leaf_pte(
+		struct direct_walk_iterator *iter)
+{
+	while (direct_walk_iterator_next_present_pte(iter))
+		if (is_last_spte(iter->old_pte, iter->level))
+			return true;
+
+	return false;
+}
+
+/*
+ * Performs an atomic compare / exchange of ptes.
+ * Returns true if the pte was successfully set to the new value, false if
+ * there was a race and the compare / exchange needs to be retried.
+ */
+static bool cmpxchg_pte(u64 *ptep, u64 old_pte, u64 new_pte, int level, u64 gfn)
+{
+	u64 r;
+
+	r = cmpxchg64(ptep, old_pte, new_pte);
+	if (r == old_pte)
+		trace_kvm_mmu_set_pte_atomic(gfn, level, old_pte, new_pte);
+
+	return r == old_pte;
+}
+
+__attribute__((unused))
+static bool direct_walk_iterator_set_pte(struct direct_walk_iterator *iter,
+					 u64 new_pte)
+{
+	bool r;
+
+	if (!(iter->lock_mode & (MMU_READ_LOCK | MMU_WRITE_LOCK))) {
+		BUG_ON(is_present_direct_pte(iter->old_pte) !=
+				is_present_direct_pte(new_pte));
+		BUG_ON(spte_to_pfn(iter->old_pte) != spte_to_pfn(new_pte));
+		BUG_ON(is_last_spte(iter->old_pte, iter->level) !=
+				is_last_spte(new_pte, iter->level));
+	}
+
+	if (iter->old_pte == new_pte)
+		return true;
+
+	r = cmpxchg_pte(iter->ptep, iter->old_pte, new_pte, iter->level,
+			iter->pte_gfn_start);
+	if (r) {
+		handle_changed_pte(iter->kvm, iter->as_id, iter->pte_gfn_start,
+				   iter->old_pte, new_pte, iter->level, false);
+
+		if (iter->lock_mode & (MMU_WRITE_LOCK | MMU_READ_LOCK))
+			iter->tlbs_dirty++;
+	} else
+		direct_walk_iterator_retry_pte(iter);
+
+	return r;
+}
+
 /**
  * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
  * @kvm: kvm instance
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 7ca8831c7d1a2..530723038296a 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -166,6 +166,56 @@ TRACE_EVENT(
 		  __entry->created ? "new" : "existing")
 );
 
+TRACE_EVENT(
+	kvm_mmu_direct_walk_iterator_step,
+	TP_PROTO(u64 walk_start, u64 walk_end, u64 base_gfn, int level,
+		u64 pte),
+	TP_ARGS(walk_start, walk_end, base_gfn, level, pte),
+
+	TP_STRUCT__entry(
+		__field(u64, walk_start)
+		__field(u64, walk_end)
+		__field(u64, base_gfn)
+		__field(int, level)
+		__field(u64, pte)
+		),
+
+	TP_fast_assign(
+		__entry->walk_start = walk_start;
+		__entry->walk_end = walk_end;
+		__entry->base_gfn = base_gfn;
+		__entry->level = level;
+		__entry->pte = pte;
+		),
+
+	TP_printk("walk_start=%llx walk_end=%llx base_gfn=%llx lvl=%d pte=%llx",
+		__entry->walk_start, __entry->walk_end, __entry->base_gfn,
+		__entry->level, __entry->pte)
+);
+
+TRACE_EVENT(
+	kvm_mmu_set_pte_atomic,
+	TP_PROTO(u64 gfn, int level, u64 old_pte, u64 new_pte),
+	TP_ARGS(gfn, level, old_pte, new_pte),
+
+	TP_STRUCT__entry(
+		__field(u64, gfn)
+		__field(int, level)
+		__field(u64, old_pte)
+		__field(u64, new_pte)
+		),
+
+	TP_fast_assign(
+		__entry->gfn = gfn;
+		__entry->level = level;
+		__entry->old_pte = old_pte;
+		__entry->new_pte = new_pte;
+		),
+
+	TP_printk("gfn=%llx level=%d old_pte=%llx new_pte=%llx", __entry->gfn,
+		  __entry->level, __entry->old_pte, __entry->new_pte)
+);
+
 DECLARE_EVENT_CLASS(kvm_mmu_page_class,
 
 	TP_PROTO(struct kvm_mmu_page *sp),
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 14/28] kvm: mmu: Batch updates to the direct mmu disconnected list
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (12 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 13/28] kvm: mmu: Add an iterator for concurrent paging structure walks Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 15/28] kvm: mmu: Support invalidate_zap_all_pages Ben Gardon
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

When many threads are removing pages of page table memory from the
paging structures, the number of list operations on the disconnected
page table list can be quite high. Since a spin lock protects the
disconnected list, the high rate of list additions can lead to contention.
Instead, queue disconnected pages in the paging structure walk iterator
and add them to the global list when updating tlbs_dirty, right before
releasing the MMU lock.
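
The batching boils down to splicing a per-walk local list onto the shared
list once per MMU lock critical section, rather than taking the spin lock
for every page. A sketch mirroring the reworked helper in the diff below:

static void direct_mmu_disconnected_pt_list_add(struct kvm *kvm,
						struct list_head *list)
{
	/* Nothing was queued during this walk; skip the shared lock. */
	if (list_empty(list))
		return;

	spin_lock(&kvm->arch.direct_mmu_disconnected_pts_lock);
	list_splice_tail_init(list, &kvm->arch.direct_mmu_disconnected_pts);
	spin_unlock(&kvm->arch.direct_mmu_disconnected_pts_lock);
}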

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 54 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 40 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 59d1866398c42..234db5f4246a4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1780,23 +1780,30 @@ static void direct_mmu_process_pt_free_list_sync(struct kvm *kvm)
 }
 
 /*
- * Add a page of memory that has been disconnected from the paging structure to
+ * Add pages of memory that have been disconnected from the paging structure to
  * a queue to be freed. This is a two step process: after a page has been
  * disconnected, the TLBs must be flushed, and an RCU grace period must elapse
  * before the memory can be freed.
  */
 static void direct_mmu_disconnected_pt_list_add(struct kvm *kvm,
-						struct page *page)
+						struct list_head *list)
 {
+	/*
+	 * No need to acquire the disconnected pts lock if we're adding an
+	 * empty list.
+	 */
+	if (list_empty(list))
+		return;
+
 	spin_lock(&kvm->arch.direct_mmu_disconnected_pts_lock);
-	list_add_tail(&page->lru, &kvm->arch.direct_mmu_disconnected_pts);
+	list_splice_tail_init(list, &kvm->arch.direct_mmu_disconnected_pts);
 	spin_unlock(&kvm->arch.direct_mmu_disconnected_pts_lock);
 }
 
-
 static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
 			       u64 old_pte, u64 new_pte, int level,
-			       bool vm_teardown);
+			       bool vm_teardown,
+			       struct list_head *disconnected_pts);
 
 /**
  * mark_pte_disconnected - Mark a PTE as part of a disconnected PT
@@ -1808,9 +1815,12 @@ static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
  *	paging structure
  * @vm_teardown: all vCPUs are paused and the VM is being torn down. Yield and
  *	free child page table memory immediately.
+ * @disconnected_pts: a local list of page table pages that need to be freed.
+ *	Used to batch updates to the disconnected pts list.
  */
 static void mark_pte_disconnected(struct kvm *kvm, int as_id, gfn_t gfn,
-				  u64 *ptep, int level, bool vm_teardown)
+				  u64 *ptep, int level, bool vm_teardown,
+				  struct list_head *disconnected_pts)
 {
 	u64 old_pte;
 
@@ -1818,7 +1828,7 @@ static void mark_pte_disconnected(struct kvm *kvm, int as_id, gfn_t gfn,
 	BUG_ON(old_pte == DISCONNECTED_PTE);
 
 	handle_changed_pte(kvm, as_id, gfn, old_pte, DISCONNECTED_PTE, level,
-			   vm_teardown);
+			   vm_teardown, disconnected_pts);
 }
 
 /**
@@ -1831,6 +1841,8 @@ static void mark_pte_disconnected(struct kvm *kvm, int as_id, gfn_t gfn,
  * @level: the level of the PT, when it was part of the paging structure
  * @vm_teardown: all vCPUs are paused and the VM is being torn down. Yield and
  *	free child page table memory immediately.
+ * @disconnected_pts: a local list of page table pages that need to be freed.
+ *	Used to batch updates to the disconnected pts list.
  *
  * Given a pointer to a page table that has been removed from the paging
  * structure and its level, recursively free child page tables and mark their
@@ -1850,7 +1862,8 @@ static void mark_pte_disconnected(struct kvm *kvm, int as_id, gfn_t gfn,
  */
 static void handle_disconnected_pt(struct kvm *kvm, int as_id,
 				   gfn_t pt_base_gfn, kvm_pfn_t pfn, int level,
-				   bool vm_teardown)
+				   bool vm_teardown,
+				   struct list_head *disconnected_pts)
 {
 	int i;
 	gfn_t gfn = pt_base_gfn;
@@ -1864,7 +1877,7 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
 		 * table the entry might have pointed to.
 		 */
 		mark_pte_disconnected(kvm, as_id, gfn, &pt[i], level,
-				      vm_teardown);
+				      vm_teardown, disconnected_pts);
 
 		gfn += KVM_PAGES_PER_HPAGE(level);
 	}
@@ -1875,7 +1888,8 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
 		free_page((unsigned long)pt);
 	} else {
 		page = pfn_to_page(pfn);
-		direct_mmu_disconnected_pt_list_add(kvm, page);
+		BUG_ON(!disconnected_pts);
+		list_add_tail(&page->lru, disconnected_pts);
 	}
 }
 
@@ -1889,6 +1903,8 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
  * @level: the level of the PT the PTE is part of in the paging structure
  * @vm_teardown: all vCPUs are paused and the VM is being torn down. Yield and
  *	free child page table memory immediately.
+ * @disconnected_pts: a local list of page table pages that need to be freed.
+ *	Used to batch updates to the disconnected pts list.
  *
  * Handle bookkeeping that might result from the modification of a PTE.
  * This function should be called in the same RCU read critical section as the
@@ -1898,7 +1914,8 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
  */
 static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
 			       u64 old_pte, u64 new_pte, int level,
-			       bool vm_teardown)
+			       bool vm_teardown,
+			       struct list_head *disconnected_pts)
 {
 	bool was_present = is_present_direct_pte(old_pte);
 	bool is_present = is_present_direct_pte(new_pte);
@@ -1944,7 +1961,8 @@ static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
 		 * pointed to must be freed.
 		 */
 		handle_disconnected_pt(kvm, as_id, gfn, spte_to_pfn(old_pte),
-				       child_level, vm_teardown);
+				       child_level, vm_teardown,
+				       disconnected_pts);
 	}
 }
 
@@ -1987,6 +2005,8 @@ struct direct_walk_iterator {
 	gfn_t target_gfn;
 	long tlbs_dirty;
 
+	struct list_head disconnected_pts;
+
 	/* the address space id. */
 	int as_id;
 	u64 *pt_path[PT64_ROOT_4LEVEL];
@@ -2056,6 +2076,9 @@ static bool direct_walk_iterator_flush_needed(struct direct_walk_iterator *iter)
 		tlbs_dirty = xadd(&iter->kvm->tlbs_dirty, iter->tlbs_dirty) +
 				iter->tlbs_dirty;
 		iter->tlbs_dirty = 0;
+
+		direct_mmu_disconnected_pt_list_add(iter->kvm,
+						    &iter->disconnected_pts);
 	} else {
 		tlbs_dirty = READ_ONCE(iter->kvm->tlbs_dirty);
 	}
@@ -2115,6 +2138,8 @@ static void direct_walk_iterator_setup_walk(struct direct_walk_iterator *iter,
 	iter->pt_path[PT64_ROOT_4LEVEL - 1] =
 			(u64 *)__va(kvm->arch.direct_root_hpa[as_id]);
 
+	INIT_LIST_HEAD(&iter->disconnected_pts);
+
 	iter->walk_start = start;
 	iter->walk_end = end;
 	iter->target_gfn = start;
@@ -2393,7 +2418,8 @@ static bool direct_walk_iterator_set_pte(struct direct_walk_iterator *iter,
 			iter->pte_gfn_start);
 	if (r) {
 		handle_changed_pte(iter->kvm, iter->as_id, iter->pte_gfn_start,
-				   iter->old_pte, new_pte, iter->level, false);
+				   iter->old_pte, new_pte, iter->level, false,
+				   &iter->disconnected_pts);
 
 		if (iter->lock_mode & (MMU_WRITE_LOCK | MMU_READ_LOCK))
 			iter->tlbs_dirty++;
@@ -6411,7 +6437,7 @@ static void kvm_mmu_uninit_direct_mmu(struct kvm *kvm)
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
 		handle_disconnected_pt(kvm, i, 0,
 			(kvm_pfn_t)(kvm->arch.direct_root_hpa[i] >> PAGE_SHIFT),
-			PT64_ROOT_4LEVEL, true);
+			PT64_ROOT_4LEVEL, true, NULL);
 }
 
 /* The return value indicates if tlb flush on all vcpus is needed. */
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 15/28] kvm: mmu: Support invalidate_zap_all_pages
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (13 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 14/28] kvm: mmu: Batch updates to the direct mmu disconnected list Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 16/28] kvm: mmu: Add direct MMU page fault handler Ben Gardon
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Adds a function, built on the paging structure iterator, for zapping
ranges of GFNs in an address space, and uses it to support
invalidate_zap_all_pages for the direct MMU.
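
The intended caller pattern is a full-range zap of each address space when
everything must be invalidated; roughly (a sketch of how this patch uses
the new function, with the MMU lock already held for write):

/*
 * Sketch: zap every direct-MMU mapping in every address space.
 * Called with the MMU lock held for write.
 */
static void sketch_zap_all_direct(struct kvm *kvm)
{
	int i;

	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
		zap_direct_gfn_range(kvm, i, 0, ~0ULL,
				     MMU_WRITE_LOCK | MMU_LOCK_MAY_RESCHED);
	kvm_flush_remote_tlbs(kvm);
}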

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 66 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 234db5f4246a4..f0696658b527c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2120,7 +2120,6 @@ static void direct_walk_iterator_reset_traversal(
  * range, so the last gfn to be iterated over would be the largest possible
  * GFN, in this scenario.)
  */
-__attribute__((unused))
 static void direct_walk_iterator_setup_walk(struct direct_walk_iterator *iter,
 	struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
 	enum mmu_lock_mode lock_mode)
@@ -2151,7 +2150,6 @@ static void direct_walk_iterator_setup_walk(struct direct_walk_iterator *iter,
 	direct_walk_iterator_start_traversal(iter);
 }
 
-__attribute__((unused))
 static void direct_walk_iterator_retry_pte(struct direct_walk_iterator *iter)
 {
 	BUG_ON(!iter->walk_in_progress);
@@ -2397,7 +2395,6 @@ static bool cmpxchg_pte(u64 *ptep, u64 old_pte, u64 new_pte, int level, u64 gfn)
 	return r == old_pte;
 }
 
-__attribute__((unused))
 static bool direct_walk_iterator_set_pte(struct direct_walk_iterator *iter,
 					 u64 new_pte)
 {
@@ -2725,6 +2722,44 @@ static int kvm_handle_hva_range(struct kvm *kvm,
 	return ret;
 }
 
+/*
+ * Marks the range of gfns, [start, end), non-present.
+ */
+static bool zap_direct_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
+				gfn_t end, enum mmu_lock_mode lock_mode)
+{
+	struct direct_walk_iterator iter;
+
+	direct_walk_iterator_setup_walk(&iter, kvm, as_id, start, end,
+					lock_mode);
+	while (direct_walk_iterator_next_present_pte(&iter)) {
+		/*
+		 * The gfn range should be handled at the largest granularity
+		 * possible, however since the functions which handle changed
+		 * PTEs (and freeing child PTs) will not yield, zapping an
+		 * entry with too many child PTEs can lead to scheduler
+		 * problems. In order to avoid scheduler problems, only zap
+		 * PTEs at PDPE level and lower. The root level entries will be
+		 * zapped and the high level page table pages freed on VM
+		 * teardown.
+		 */
+		if ((iter.pte_gfn_start < start ||
+		     iter.pte_gfn_end > end ||
+		     iter.level > PT_PDPE_LEVEL) &&
+		    !is_last_spte(iter.old_pte, iter.level))
+			continue;
+
+		/*
+		 * If the compare / exchange succeeds, then we will continue on
+		 * to the next pte. If it fails, the next iteration will repeat
+		 * the current pte. We'll handle both cases in the same way, so
+		 * we don't need to check the result here.
+		 */
+		direct_walk_iterator_set_pte(&iter, 0);
+	}
+	return direct_walk_iterator_end_traversal(&iter);
+}
+
 static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
 			  unsigned long data,
 			  int (*handler)(struct kvm *kvm,
@@ -6645,11 +6680,26 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
  */
 static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 {
+	int i;
+
 	lockdep_assert_held(&kvm->slots_lock);
 
 	write_lock(&kvm->mmu_lock);
 	trace_kvm_mmu_zap_all_fast(kvm);
 
+	/* Zap all direct MMU PTEs slowly */
+	if (kvm->arch.direct_mmu_enabled) {
+		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
+			zap_direct_gfn_range(kvm, i, 0, ~0ULL,
+					MMU_WRITE_LOCK | MMU_LOCK_MAY_RESCHED);
+	}
+
+	if (kvm->arch.pure_direct_mmu) {
+		kvm_flush_remote_tlbs(kvm);
+		write_unlock(&kvm->mmu_lock);
+		return;
+	}
+
 	/*
 	 * Toggle mmu_valid_gen between '0' and '1'.  Because slots_lock is
 	 * held for the entire duration of zapping obsolete pages, it's
@@ -6888,8 +6938,21 @@ void kvm_mmu_zap_all(struct kvm *kvm)
 	struct kvm_mmu_page *sp, *node;
 	LIST_HEAD(invalid_list);
 	int ign;
+	int i;
 
 	write_lock(&kvm->mmu_lock);
+	if (kvm->arch.direct_mmu_enabled) {
+		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
+			zap_direct_gfn_range(kvm, i, 0, ~0ULL,
+					MMU_WRITE_LOCK | MMU_LOCK_MAY_RESCHED);
+		kvm_flush_remote_tlbs(kvm);
+	}
+
+	if (kvm->arch.pure_direct_mmu) {
+		write_unlock(&kvm->mmu_lock);
+		return;
+	}
+
 restart:
 	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
 		if (sp->role.invalid && sp->root_count)
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 16/28] kvm: mmu: Add direct MMU page fault handler
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (14 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 15/28] kvm: mmu: Support invalidate_zap_all_pages Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2020-01-08 17:20   ` Peter Xu
  2019-09-26 23:18 ` [RFC PATCH 17/28] kvm: mmu: Add direct MMU fast " Ben Gardon
                   ` (13 subsequent siblings)
  29 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Adds handler functions to replace __direct_map for handling direct page
faults. Unlike __direct_map, these functions can handle page faults on
multiple vCPUs simultaneously.
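
The property that makes concurrent faults safe is that every PTE update
goes through an atomic compare/exchange, so two vCPUs racing on the same
entry cannot both win; the loser simply retries or lets the guest refault.
A minimal sketch of that core step (the wrapper name is illustrative; the
series performs the equivalent operation through the walk iterator):

/*
 * Try to install new_pte, assuming the entry still holds old_pte.
 * Returns true if this thread won the race.
 */
static bool sketch_try_set_pte(u64 *ptep, u64 old_pte, u64 new_pte)
{
	return cmpxchg64(ptep, old_pte, new_pte) == old_pte;
}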

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 192 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 179 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f0696658b527c..f3a26a32c8174 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1117,6 +1117,24 @@ static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
 	return mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
 }
 
+/*
+ * Return an unused object to the specified cache. The object's memory should
+ * be zeroed before being returned if that memory was modified after allocation
+ * from the cache.
+ */
+static void mmu_memory_cache_return(struct kvm_mmu_memory_cache *mc,
+				     void *obj)
+{
+	/*
+	 * Since this object was allocated from the cache, the cache should
+	 * have at least one free slot in which to put the object back.
+	 */
+	BUG_ON(mc->nobjs >= ARRAY_SIZE(mc->objects));
+
+	mc->objects[mc->nobjs] = obj;
+	mc->nobjs++;
+}
+
 static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 {
 	kmem_cache_free(pte_list_desc_cache, pte_list_desc);
@@ -2426,6 +2444,21 @@ static bool direct_walk_iterator_set_pte(struct direct_walk_iterator *iter,
 	return r;
 }
 
+static u64 generate_nonleaf_pte(u64 *child_pt, bool ad_disabled)
+{
+	u64 pte;
+
+	pte = __pa(child_pt) | shadow_present_mask | PT_WRITABLE_MASK |
+	       shadow_user_mask | shadow_x_mask | shadow_me_mask;
+
+	if (ad_disabled)
+		pte |= shadow_acc_track_value;
+	else
+		pte |= shadow_accessed_mask;
+
+	return pte;
+}
+
 /**
  * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
  * @kvm: kvm instance
@@ -3432,13 +3465,7 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 
 	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
 
-	spte = __pa(sp->spt) | shadow_present_mask | PT_WRITABLE_MASK |
-	       shadow_user_mask | shadow_x_mask | shadow_me_mask;
-
-	if (sp_ad_disabled(sp))
-		spte |= shadow_acc_track_value;
-	else
-		spte |= shadow_accessed_mask;
+	spte = generate_nonleaf_pte(sp->spt, sp_ad_disabled(sp));
 
 	mmu_spte_set(sptep, spte);
 
@@ -4071,6 +4098,126 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
 	return ret;
 }
 
+static int direct_page_fault_handle_target_level(struct kvm_vcpu *vcpu,
+		int write, int map_writable, struct direct_walk_iterator *iter,
+		kvm_pfn_t pfn, bool prefault)
+{
+	u64 new_pte;
+	int ret = 0;
+	int generate_pte_ret = 0;
+
+	if (unlikely(is_noslot_pfn(pfn)))
+		new_pte = generate_mmio_pte(vcpu, iter->pte_gfn_start, ACC_ALL);
+	else {
+		generate_pte_ret = generate_pte(vcpu, ACC_ALL, iter->level,
+						iter->pte_gfn_start, pfn,
+						iter->old_pte, prefault, false,
+						map_writable, false, &new_pte);
+		/* Failed to construct a PTE. Retry the page fault. */
+		if (!new_pte)
+			return RET_PF_RETRY;
+	}
+
+	/*
+	 * If the page fault was caused by a write but the page is write
+	 * protected, emulation is needed. If the emulation was skipped,
+	 * the vcpu would have the same fault again.
+	 */
+	if ((generate_pte_ret & SET_SPTE_WRITE_PROTECTED_PT) && write)
+		ret = RET_PF_EMULATE;
+
+	/* If an MMIO PTE was installed, the MMIO will need to be emulated. */
+	if (unlikely(is_mmio_spte(new_pte)))
+		ret = RET_PF_EMULATE;
+
+	/*
+	 * If this would not change the PTE then some other thread must have
+	 * already fixed the page fault and there's no need to proceed.
+	 */
+	if (iter->old_pte == new_pte)
+		return ret;
+
+	/*
+	 * If this warning were to trigger, it would indicate that there was a
+	 * missing MMU notifier or this thread raced with some notifier
+	 * handler. The page fault handler should never change a present, leaf
+	 * PTE to point to a different PFN. A notifier handler should have
+	 * zapped the PTE before the main MM's page table was changed.
+	 */
+	WARN_ON(is_present_direct_pte(iter->old_pte) &&
+		is_present_direct_pte(new_pte) &&
+		is_last_spte(iter->old_pte, iter->level) &&
+		is_last_spte(new_pte, iter->level) &&
+		spte_to_pfn(iter->old_pte) != spte_to_pfn(new_pte));
+
+	/*
+	 * If the page fault handler lost the race to set the PTE, retry the
+	 * page fault.
+	 */
+	if (!direct_walk_iterator_set_pte(iter, new_pte))
+		return RET_PF_RETRY;
+
+	/*
+	 * Update some stats for this page fault, if the page
+	 * fault was not speculative.
+	 */
+	if (!prefault)
+		vcpu->stat.pf_fixed++;
+
+	return ret;
+
+}
+
+static int handle_direct_page_fault(struct kvm_vcpu *vcpu,
+		unsigned long mmu_seq, int write, int map_writable, int level,
+		gpa_t gpa, gfn_t gfn, kvm_pfn_t pfn, bool prefault)
+{
+	struct direct_walk_iterator iter;
+	struct kvm_mmu_memory_cache *pf_pt_cache = &vcpu->arch.mmu_page_cache;
+	u64 *child_pt;
+	u64 new_pte;
+	int ret = RET_PF_RETRY;
+
+	direct_walk_iterator_setup_walk(&iter, vcpu->kvm,
+			kvm_arch_vcpu_memslots_id(vcpu), gpa >> PAGE_SHIFT,
+			(gpa >> PAGE_SHIFT) + 1, MMU_READ_LOCK);
+	while (direct_walk_iterator_next_pte(&iter)) {
+		if (iter.level == level) {
+			ret = direct_page_fault_handle_target_level(vcpu,
+					write, map_writable, &iter, pfn,
+					prefault);
+
+			break;
+		} else if (!is_present_direct_pte(iter.old_pte) ||
+			   is_large_pte(iter.old_pte)) {
+			/*
+			 * The leaf PTE for this fault must be mapped at a
+			 * lower level, so a non-leaf PTE must be inserted into
+			 * the paging structure. If the assignment below
+			 * succeeds, it will add the non-leaf PTE and a new
+			 * page of page table memory. Then the iterator can
+			 * traverse into that new page. If the atomic compare/
+			 * exchange fails, the iterator will repeat the current
+			 * PTE, so the only thing this function must do
+			 * differently is return the page table memory to the
+			 * vCPU's fault cache.
+			 */
+			child_pt = mmu_memory_cache_alloc(pf_pt_cache);
+			new_pte = generate_nonleaf_pte(child_pt, false);
+
+			if (!direct_walk_iterator_set_pte(&iter, new_pte))
+				mmu_memory_cache_return(pf_pt_cache, child_pt);
+		}
+	}
+	direct_walk_iterator_end_traversal(&iter);
+
+	/* If emulating, flush this vcpu's TLB. */
+	if (ret == RET_PF_EMULATE)
+		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
+
+	return ret;
+}
+
 static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk)
 {
 	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, PAGE_SHIFT, tsk);
@@ -5014,7 +5161,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	unsigned long mmu_seq;
 	int write = error_code & PFERR_WRITE_MASK;
-	bool map_writable;
+	bool map_writable = false;
 
 	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
 
@@ -5035,8 +5182,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
 	}
 
-	if (fast_page_fault(vcpu, gpa, level, error_code))
-		return RET_PF_RETRY;
+	if (!vcpu->kvm->arch.direct_mmu_enabled)
+		if (fast_page_fault(vcpu, gpa, level, error_code))
+			return RET_PF_RETRY;
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
@@ -5048,17 +5196,31 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 		return r;
 
 	r = RET_PF_RETRY;
-	write_lock(&vcpu->kvm->mmu_lock);
+	if (vcpu->kvm->arch.direct_mmu_enabled)
+		read_lock(&vcpu->kvm->mmu_lock);
+	else
+		write_lock(&vcpu->kvm->mmu_lock);
+
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
 		goto out_unlock;
 	if (make_mmu_pages_available(vcpu) < 0)
 		goto out_unlock;
 	if (likely(!force_pt_level))
 		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
-	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn, prefault);
+
+	if (vcpu->kvm->arch.direct_mmu_enabled)
+		r = handle_direct_page_fault(vcpu, mmu_seq, write, map_writable,
+				level, gpa, gfn, pfn, prefault);
+	else
+		r = __direct_map(vcpu, gpa, write, map_writable, level, pfn,
+				 prefault);
 
 out_unlock:
-	write_unlock(&vcpu->kvm->mmu_lock);
+	if (vcpu->kvm->arch.direct_mmu_enabled)
+		read_unlock(&vcpu->kvm->mmu_lock);
+	else
+		write_unlock(&vcpu->kvm->mmu_lock);
+
 	kvm_release_pfn_clean(pfn);
 	return r;
 }
@@ -6242,6 +6404,10 @@ static int make_mmu_pages_available(struct kvm_vcpu *vcpu)
 {
 	LIST_HEAD(invalid_list);
 
+	if (vcpu->arch.mmu->direct_map && vcpu->kvm->arch.direct_mmu_enabled)
+		/* Reclaim is a todo. */
+		return true;
+
 	if (likely(kvm_mmu_available_pages(vcpu->kvm) >= KVM_MIN_FREE_MMU_PAGES))
 		return 0;
 
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 17/28] kvm: mmu: Add direct MMU fast page fault handler
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (15 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 16/28] kvm: mmu: Add direct MMU page fault handler Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 18/28] kvm: mmu: Add an hva range iterator for memslot GFNs Ben Gardon
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

While the direct MMU can handle page faults much faster than the
existing implementation, it cannot handle faults caused by write
protection or access tracking as quickly. Add a fast path, similar to the
existing one, that handles these cases without taking the MMU read lock or
calling get_user_pages.
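
Condensed, the fixup the fast path applies once a fault has been identified
as purely an access-tracking or dirty-logging write-protection fault looks
like this (a sketch using the existing spte helpers in mmu.c; the function
name and fault_is_write parameter are illustrative):

/* Build the fixed-up PTE; the caller installs it with cmpxchg and retries. */
static u64 sketch_fast_fault_fixup(u64 old_pte, bool fault_is_write)
{
	u64 new_pte = old_pte;

	if (is_access_track_spte(old_pte))
		new_pte = restore_acc_track_spte(old_pte);

	if (fault_is_write && spte_can_locklessly_be_made_writable(old_pte))
		new_pte |= PT_WRITABLE_MASK;

	return new_pte;
}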

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 93 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 92 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f3a26a32c8174..3d4a78f2461a9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4490,6 +4490,93 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	return fault_handled;
 }
 
+/*
+ * Attempt to handle a page fault without the use of get_user_pages, or
+ * acquiring the MMU lock. This function can handle page faults resulting from
+ * missing permissions on a PTE, set up by KVM for dirty logging or access
+ * tracking.
+ *
+ * Return value:
+ * - true: The page fault may have been fixed by this function. Let the vCPU
+ *	   access on the same address again.
+ *	   access the same address again.
+ *	    path fix it.
+ */
+static bool fast_direct_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, int level,
+				   u32 error_code)
+{
+	struct direct_walk_iterator iter;
+	bool fault_handled = false;
+	bool remove_write_prot;
+	bool remove_acc_track;
+	u64 new_pte;
+
+	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
+		return false;
+
+	if (!page_fault_can_be_fast(error_code))
+		return false;
+
+	direct_walk_iterator_setup_walk(&iter, vcpu->kvm,
+			kvm_arch_vcpu_memslots_id(vcpu), gpa >> PAGE_SHIFT,
+			(gpa >> PAGE_SHIFT) + 1, MMU_NO_LOCK);
+	while (direct_walk_iterator_next_present_leaf_pte(&iter)) {
+		remove_write_prot = (error_code & PFERR_WRITE_MASK);
+		remove_write_prot &= !(iter.old_pte & PT_WRITABLE_MASK);
+		remove_write_prot &= spte_can_locklessly_be_made_writable(
+				iter.old_pte);
+
+		remove_acc_track = is_access_track_spte(iter.old_pte);
+
+		/* Verify that the fault can be handled in the fast path */
+		if (!remove_acc_track && !remove_write_prot)
+			break;
+
+		/*
+		 * If dirty logging is enabled:
+		 *
+		 * Do not fix write-permission on the large spte since we only
+		 * dirty the first page into the dirty-bitmap in
+		 * fast_pf_fix_direct_spte() that means other pages are missed
+		 * if its slot is dirty-logged.
+		 *
+		 * Instead, we let the slow page fault path create a normal spte
+		 * to fix the access.
+		 *
+		 * See the comments in kvm_arch_commit_memory_region().
+		 */
+		if (remove_write_prot &&
+		    iter.level > PT_PAGE_TABLE_LEVEL)
+			break;
+
+		new_pte = iter.old_pte;
+		if (remove_acc_track)
+			new_pte = restore_acc_track_spte(iter.old_pte);
+		if (remove_write_prot)
+			new_pte |= PT_WRITABLE_MASK;
+
+		if (new_pte == iter.old_pte) {
+			fault_handled = true;
+			break;
+		}
+
+		if (!direct_walk_iterator_set_pte(&iter, new_pte))
+			continue;
+
+		if (remove_write_prot)
+			kvm_vcpu_mark_page_dirty(vcpu, iter.pte_gfn_start);
+
+		fault_handled = true;
+		break;
+	}
+	direct_walk_iterator_end_traversal(&iter);
+
+	trace_fast_page_fault(vcpu, gpa, error_code, iter.ptep,
+			      iter.old_pte, fault_handled);
+
+	return fault_handled;
+}
+
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 			 gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable);
 static int make_mmu_pages_available(struct kvm_vcpu *vcpu);
@@ -5182,9 +5269,13 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
 	}
 
-	if (!vcpu->kvm->arch.direct_mmu_enabled)
+	if (vcpu->kvm->arch.direct_mmu_enabled) {
+		if (fast_direct_page_fault(vcpu, gpa, level, error_code))
+			return RET_PF_RETRY;
+	} else {
 		if (fast_page_fault(vcpu, gpa, level, error_code))
 			return RET_PF_RETRY;
+	}
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 18/28] kvm: mmu: Add an hva range iterator for memslot GFNs
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (16 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 17/28] kvm: mmu: Add direct MMU fast " Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 19/28] kvm: mmu: Make address space ID a property of memslots Ben Gardon
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Factors out of kvm_handle_hva_range a utility for iterating over host
virtual address ranges and getting the GFN ranges they map. This turns
the rmap-reliant HVA iterator used for shadow paging into a wrapper
around an HVA-range-to-GFN-range iterator. Since the direct MMU maps
each GFN to only one physical address and does not use the rmap, it can
use the GFN ranges directly.
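
A direct-MMU consumer then only needs to supply a GFN-range callback, for
example (the handler below is a placeholder; patch 20 adds the real zap
handler in exactly this shape):

static int sketch_gfn_range_handler(struct kvm *kvm,
				    struct kvm_memory_slot *memslot,
				    gfn_t gfn_start, gfn_t gfn_end,
				    unsigned long data)
{
	/* Operate directly on [gfn_start, gfn_end); no rmap needed. */
	return 0;
}

static int sketch_handle_hva_range(struct kvm *kvm, unsigned long start,
				   unsigned long end)
{
	return kvm_handle_direct_hva_range(kvm, start, end, 0,
					   sketch_gfn_range_handler);
}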

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 96 +++++++++++++++++++++++++++++++---------------
 1 file changed, 66 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3d4a78f2461a9..32426536723c6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2701,27 +2701,14 @@ static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
 	rmap_walk_init_level(iterator, iterator->level);
 }
 
-#define for_each_slot_rmap_range(_slot_, _start_level_, _end_level_,	\
-	   _start_gfn, _end_gfn, _iter_)				\
-	for (slot_rmap_walk_init(_iter_, _slot_, _start_level_,		\
-				 _end_level_, _start_gfn, _end_gfn);	\
-	     slot_rmap_walk_okay(_iter_);				\
-	     slot_rmap_walk_next(_iter_))
-
-static int kvm_handle_hva_range(struct kvm *kvm,
-				unsigned long start,
-				unsigned long end,
-				unsigned long data,
-				int (*handler)(struct kvm *kvm,
-					       struct kvm_rmap_head *rmap_head,
-					       struct kvm_memory_slot *slot,
-					       gfn_t gfn,
-					       int level,
-					       unsigned long data))
+static int kvm_handle_direct_hva_range(struct kvm *kvm, unsigned long start,
+		unsigned long end, unsigned long data,
+		int (*handler)(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			       gfn_t gfn_start, gfn_t gfn_end,
+			       unsigned long data))
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
-	struct slot_rmap_walk_iterator iterator;
 	int ret = 0;
 	int i;
 
@@ -2736,25 +2723,74 @@ static int kvm_handle_hva_range(struct kvm *kvm,
 				      (memslot->npages << PAGE_SHIFT));
 			if (hva_start >= hva_end)
 				continue;
-			/*
-			 * {gfn(page) | page intersects with [hva_start, hva_end)} =
-			 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
-			 */
 			gfn_start = hva_to_gfn_memslot(hva_start, memslot);
-			gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
-
-			for_each_slot_rmap_range(memslot, PT_PAGE_TABLE_LEVEL,
-						 PT_MAX_HUGEPAGE_LEVEL,
-						 gfn_start, gfn_end - 1,
-						 &iterator)
-				ret |= handler(kvm, iterator.rmap, memslot,
-					       iterator.gfn, iterator.level, data);
+			gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1,
+						     memslot);
+
+			ret |= handler(kvm, memslot, gfn_start, gfn_end, data);
 		}
 	}
 
 	return ret;
 }
 
+#define for_each_slot_rmap_range(_slot_, _start_level_, _end_level_,	\
+	   _start_gfn, _end_gfn, _iter_)				\
+	for (slot_rmap_walk_init(_iter_, _slot_, _start_level_,		\
+				 _end_level_, _start_gfn, _end_gfn);	\
+	     slot_rmap_walk_okay(_iter_);				\
+	     slot_rmap_walk_next(_iter_))
+
+
+struct handle_hva_range_shadow_data {
+	unsigned long data;
+	int (*handler)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+		       struct kvm_memory_slot *slot, gfn_t gfn, int level,
+		       unsigned long data);
+};
+
+static int handle_hva_range_shadow_handler(struct kvm *kvm,
+					   struct kvm_memory_slot *memslot,
+					   gfn_t gfn_start, gfn_t gfn_end,
+					   unsigned long data)
+{
+	int ret = 0;
+	struct slot_rmap_walk_iterator iterator;
+	struct handle_hva_range_shadow_data *shadow_data =
+		(struct handle_hva_range_shadow_data *)data;
+
+	for_each_slot_rmap_range(memslot, PT_PAGE_TABLE_LEVEL,
+				 PT_MAX_HUGEPAGE_LEVEL,
+				 gfn_start, gfn_end - 1, &iterator) {
+		BUG_ON(!iterator.rmap);
+		ret |= shadow_data->handler(kvm, iterator.rmap, memslot,
+			       iterator.gfn, iterator.level, shadow_data->data);
+	}
+
+	return ret;
+}
+
+static int kvm_handle_hva_range(struct kvm *kvm,
+				unsigned long start,
+				unsigned long end,
+				unsigned long data,
+				int (*handler)(struct kvm *kvm,
+					       struct kvm_rmap_head *rmap_head,
+					       struct kvm_memory_slot *slot,
+					       gfn_t gfn,
+					       int level,
+					       unsigned long data))
+{
+	struct handle_hva_range_shadow_data shadow_data;
+
+	shadow_data.data = data;
+	shadow_data.handler = handler;
+
+	return kvm_handle_direct_hva_range(kvm, start, end,
+					   (unsigned long)&shadow_data,
+					   handle_hva_range_shadow_handler);
+}
+
 /*
  * Marks the range of gfns, [start, end), non-present.
  */
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 19/28] kvm: mmu: Make address space ID a property of memslots
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (17 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 18/28] kvm: mmu: Add an hva range iterator for memslot GFNs Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 20/28] kvm: mmu: Implement the invalidation MMU notifiers for the direct MMU Ben Gardon
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Save address space ID as a field in each memslot so that functions that
do not use rmaps (which implicitly encode the address space id) can
handle multiple address spaces correctly.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 350a3b79cc8d1..ce6b22fcb90f3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -347,6 +347,7 @@ struct kvm_memory_slot {
 	struct kvm_arch_memory_slot arch;
 	unsigned long userspace_addr;
 	u32 flags;
+	int as_id;
 	short id;
 };
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c8559a86625ce..d494044104270 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -969,6 +969,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new.base_gfn = base_gfn;
 	new.npages = npages;
 	new.flags = mem->flags;
+	new.as_id = as_id;
 
 	if (npages) {
 		if (!old.npages)
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 20/28] kvm: mmu: Implement the invalidation MMU notifiers for the direct MMU
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (18 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 19/28] kvm: mmu: Make address space ID a property of memslots Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 21/28] kvm: mmu: Integrate the direct mmu with the changed pte notifier Ben Gardon
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Implements arch-specific handler functions for the invalidation MMU
notifiers, using a paging structure iterator. These handlers are
responsible for zapping paging structure entries so that the main MM
can safely remap memory that was used to back guest memory.
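
The dispatch shape used here is repeated for the other notifiers later in
the series: run the direct-MMU handler when the direct MMU is enabled, run
the rmap-based handler unless the VM is purely direct mapped, and OR the
flush-needed results together. Sketched:

static int sketch_unmap_hva_range(struct kvm *kvm, unsigned long start,
				  unsigned long end)
{
	int need_flush = 0;

	if (kvm->arch.direct_mmu_enabled)
		need_flush |= zap_direct_hva_range(kvm, start, end);
	if (!kvm->arch.pure_direct_mmu)
		need_flush |= kvm_handle_hva_range(kvm, start, end, 0,
						   kvm_unmap_rmapp);
	return need_flush;
}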

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 32426536723c6..ca9b3f574f401 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2829,6 +2829,22 @@ static bool zap_direct_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
 	return direct_walk_iterator_end_traversal(&iter);
 }
 
+static int zap_direct_gfn_range_handler(struct kvm *kvm,
+					struct kvm_memory_slot *slot,
+					gfn_t start, gfn_t end,
+					unsigned long data)
+{
+	return zap_direct_gfn_range(kvm, slot->as_id, start, end,
+				    MMU_WRITE_LOCK);
+}
+
+static bool zap_direct_hva_range(struct kvm *kvm, unsigned long start,
+		unsigned long end)
+{
+	return kvm_handle_direct_hva_range(kvm, start, end, 0,
+					   zap_direct_gfn_range_handler);
+}
+
 static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
 			  unsigned long data,
 			  int (*handler)(struct kvm *kvm,
@@ -2842,7 +2858,13 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
 
 int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
 {
-	return kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
+	int r = 0;
+
+	if (kvm->arch.direct_mmu_enabled)
+		r |= zap_direct_hva_range(kvm, start, end);
+	if (!kvm->arch.pure_direct_mmu)
+		r |= kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
+	return r;
 }
 
 int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 21/28] kvm: mmu: Integrate the direct mmu with the changed pte notifier
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (19 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 20/28] kvm: mmu: Implement the invalidation MMU notifiers for the direct MMU Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 22/28] kvm: mmu: Implement access tracking for the direct MMU Ben Gardon
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Implements an arch-specific handler function for the changed-PTE MMU
notifier. The handler uses the paging structure walk iterator and is
needed to allow the main MM to safely update page permissions on pages
backing guest memory.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 53 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ca9b3f574f401..b144c803c36d2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2386,7 +2386,6 @@ static bool direct_walk_iterator_next_present_pte(
 /*
  * As direct_walk_iterator_next_present_pte but skips over non-leaf ptes.
  */
-__attribute__((unused))
 static bool direct_walk_iterator_next_present_leaf_pte(
 		struct direct_walk_iterator *iter)
 {
@@ -2867,9 +2866,59 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
 	return r;
 }
 
+static int set_direct_pte_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+			      gfn_t start, gfn_t end, unsigned long pte)
+{
+	struct direct_walk_iterator iter;
+	pte_t host_pte;
+	kvm_pfn_t new_pfn;
+	u64 new_pte;
+
+	host_pte.pte = pte;
+	new_pfn = pte_pfn(host_pte);
+
+	direct_walk_iterator_setup_walk(&iter, kvm, slot->as_id, start, end,
+					MMU_WRITE_LOCK);
+	while (direct_walk_iterator_next_present_leaf_pte(&iter)) {
+		BUG_ON(iter.level != PT_PAGE_TABLE_LEVEL);
+
+		if (pte_write(host_pte))
+			new_pte = 0;
+		else {
+			new_pte = iter.old_pte & ~PT64_BASE_ADDR_MASK;
+			new_pte |= new_pfn << PAGE_SHIFT;
+			new_pte &= ~PT_WRITABLE_MASK;
+			new_pte &= ~SPTE_HOST_WRITEABLE;
+			new_pte &= ~shadow_dirty_mask;
+			new_pte &= ~shadow_accessed_mask;
+			new_pte = mark_spte_for_access_track(new_pte);
+		}
+
+		if (!direct_walk_iterator_set_pte(&iter, new_pte))
+			continue;
+	}
+	return direct_walk_iterator_end_traversal(&iter);
+}
+
+static int set_direct_pte_hva(struct kvm *kvm, unsigned long address,
+			    pte_t host_pte)
+{
+	return kvm_handle_direct_hva_range(kvm, address, address + 1,
+					   host_pte.pte, set_direct_pte_gfn);
+}
+
 int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 {
-	return kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
+	int need_flush = 0;
+
+	WARN_ON(pte_huge(pte));
+
+	if (kvm->arch.direct_mmu_enabled)
+		need_flush |= set_direct_pte_hva(kvm, hva, pte);
+	if (!kvm->arch.pure_direct_mmu)
+		need_flush |= kvm_handle_hva(kvm, hva, (unsigned long)&pte,
+					     kvm_set_pte_rmapp);
+	return need_flush;
 }
 
 static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 22/28] kvm: mmu: Implement access tracking for the direct MMU
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (20 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 21/28] kvm: mmu: Integrate the direct mmu with the changed pte notifier Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 23/28] kvm: mmu: Make mark_page_dirty_in_slot usable from outside kvm_main Ben Gardon
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Adds functions, usable with the direct MMU, for dealing with the
accessed state of PTEs.
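
The central case split is whether the hardware provides an accessed bit:
with EPT A/D bits the bit is simply cleared, without them the PTE's
permissions are saved and the entry is converted into an access-tracking
PTE so the next access faults and re-marks it. A sketch of that helper
shape (the mask names and the save helper are the ones used in this patch):

static u64 sketch_clear_accessed(u64 pte)
{
	if (shadow_accessed_mask)
		return pte & ~shadow_accessed_mask;

	return save_pte_permissions_for_access_track(pte) |
	       shadow_acc_track_value;
}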

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c  | 153 +++++++++++++++++++++++++++++++++++++++++---
 virt/kvm/kvm_main.c |   7 +-
 2 files changed, 150 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b144c803c36d2..cc81ba5ee46d6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -779,6 +779,17 @@ static bool spte_has_volatile_bits(u64 spte)
 	return false;
 }
 
+static bool is_accessed_direct_pte(u64 pte, int level)
+{
+	if (!is_last_spte(pte, level))
+		return false;
+
+	if (shadow_accessed_mask)
+		return pte & shadow_accessed_mask;
+
+	return pte & shadow_acc_track_mask;
+}
+
 static bool is_accessed_spte(u64 spte)
 {
 	u64 accessed_mask = spte_shadow_accessed_mask(spte);
@@ -929,6 +940,14 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
 	return __get_spte_lockless(sptep);
 }
 
+static u64 save_pte_permissions_for_access_track(u64 pte)
+{
+	pte |= (pte & shadow_acc_track_saved_bits_mask) <<
+		shadow_acc_track_saved_bits_shift;
+	pte &= ~shadow_acc_track_mask;
+	return pte;
+}
+
 static u64 mark_spte_for_access_track(u64 spte)
 {
 	if (spte_ad_enabled(spte))
@@ -944,16 +963,13 @@ static u64 mark_spte_for_access_track(u64 spte)
 	 */
 	WARN_ONCE((spte & PT_WRITABLE_MASK) &&
 		  !spte_can_locklessly_be_made_writable(spte),
-		  "kvm: Writable SPTE is not locklessly dirty-trackable\n");
+		  "kvm: Writable PTE is not locklessly dirty-trackable\n");
 
 	WARN_ONCE(spte & (shadow_acc_track_saved_bits_mask <<
 			  shadow_acc_track_saved_bits_shift),
 		  "kvm: Access Tracking saved bit locations are not zero\n");
 
-	spte |= (spte & shadow_acc_track_saved_bits_mask) <<
-		shadow_acc_track_saved_bits_shift;
-	spte &= ~shadow_acc_track_mask;
-
+	spte = save_pte_permissions_for_access_track(spte);
 	return spte;
 }
 
@@ -1718,6 +1734,15 @@ static void free_pt_rcu_callback(struct rcu_head *rp)
 	free_page((unsigned long)disconnected_pt);
 }
 
+static void handle_changed_pte_acc_track(u64 old_pte, u64 new_pte, int level)
+{
+	bool pfn_changed = spte_to_pfn(old_pte) != spte_to_pfn(new_pte);
+
+	if (is_accessed_direct_pte(old_pte, level) &&
+	    (!is_accessed_direct_pte(new_pte, level) || pfn_changed))
+		kvm_set_pfn_accessed(spte_to_pfn(old_pte));
+}
+
 /*
  * Takes a snapshot of, and clears, the direct MMU disconnected pt list. Once
  * TLBs have been flushed, this snapshot can be transferred to the direct MMU
@@ -1847,6 +1872,7 @@ static void mark_pte_disconnected(struct kvm *kvm, int as_id, gfn_t gfn,
 
 	handle_changed_pte(kvm, as_id, gfn, old_pte, DISCONNECTED_PTE, level,
 			   vm_teardown, disconnected_pts);
+	handle_changed_pte_acc_track(old_pte, DISCONNECTED_PTE, level);
 }
 
 /**
@@ -2412,8 +2438,8 @@ static bool cmpxchg_pte(u64 *ptep, u64 old_pte, u64 new_pte, int level, u64 gfn)
 	return r == old_pte;
 }
 
-static bool direct_walk_iterator_set_pte(struct direct_walk_iterator *iter,
-					 u64 new_pte)
+static bool direct_walk_iterator_set_pte_raw(struct direct_walk_iterator *iter,
+					 u64 new_pte, bool handle_acc_track)
 {
 	bool r;
 
@@ -2435,6 +2461,10 @@ static bool direct_walk_iterator_set_pte(struct direct_walk_iterator *iter,
 				   iter->old_pte, new_pte, iter->level, false,
 				   &iter->disconnected_pts);
 
+		if (handle_acc_track)
+			handle_changed_pte_acc_track(iter->old_pte, new_pte,
+						     iter->level);
+
 		if (iter->lock_mode & (MMU_WRITE_LOCK | MMU_READ_LOCK))
 			iter->tlbs_dirty++;
 	} else
@@ -2443,6 +2473,18 @@ static bool direct_walk_iterator_set_pte(struct direct_walk_iterator *iter,
 	return r;
 }
 
+static bool direct_walk_iterator_set_pte_no_acc_track(
+		struct direct_walk_iterator *iter, u64 new_pte)
+{
+	return direct_walk_iterator_set_pte_raw(iter, new_pte, false);
+}
+
+static bool direct_walk_iterator_set_pte(struct direct_walk_iterator *iter,
+					 u64 new_pte)
+{
+	return direct_walk_iterator_set_pte_raw(iter, new_pte, true);
+}
+
 static u64 generate_nonleaf_pte(u64 *child_pt, bool ad_disabled)
 {
 	u64 pte;
@@ -2965,14 +3007,107 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 			KVM_PAGES_PER_HPAGE(sp->role.level));
 }
 
+static int age_direct_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
+				 gfn_t start, gfn_t end, unsigned long ignored)
+{
+	struct direct_walk_iterator iter;
+	int young = 0;
+	u64 new_pte = 0;
+
+	direct_walk_iterator_setup_walk(&iter, kvm, slot->as_id, start, end,
+					MMU_WRITE_LOCK);
+	while (direct_walk_iterator_next_present_leaf_pte(&iter)) {
+		/*
+		 * If we have a non-accessed entry we don't need to change the
+		 * pte.
+		 */
+		if (!is_accessed_direct_pte(iter.old_pte, iter.level))
+			continue;
+
+		if (shadow_accessed_mask)
+			new_pte = iter.old_pte & ~shadow_accessed_mask;
+		else {
+			new_pte = save_pte_permissions_for_access_track(
+					iter.old_pte);
+			new_pte |= shadow_acc_track_value;
+		}
+
+		/*
+		 * We've created a new pte with the accessed state cleared.
+		 * Warn if we're about to put in a pte that still looks
+		 * accessed.
+		 */
+		WARN_ON(is_accessed_direct_pte(new_pte, iter.level));
+
+		if (!direct_walk_iterator_set_pte_no_acc_track(&iter, new_pte))
+			continue;
+
+		young = true;
+
+		if (shadow_accessed_mask)
+			trace_kvm_age_page(iter.pte_gfn_start, iter.level, slot,
+					   young);
+	}
+	direct_walk_iterator_end_traversal(&iter);
+
+	return young;
+}
+
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 {
-	return kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
+	int young = 0;
+
+	if (kvm->arch.direct_mmu_enabled)
+		young |= kvm_handle_direct_hva_range(kvm, start, end, 0,
+						     age_direct_gfn_range);
+
+	if (!kvm->arch.pure_direct_mmu)
+		young |= kvm_handle_hva_range(kvm, start, end, 0,
+					      kvm_age_rmapp);
+	return young;
+}
+
+static int test_age_direct_gfn_range(struct kvm *kvm,
+				     struct kvm_memory_slot *slot,
+				     gfn_t start, gfn_t end,
+				     unsigned long ignored)
+{
+	struct direct_walk_iterator iter;
+	int young = 0;
+
+	direct_walk_iterator_setup_walk(&iter, kvm, slot->as_id, start, end,
+					MMU_WRITE_LOCK);
+	while (direct_walk_iterator_next_present_leaf_pte(&iter)) {
+		if (is_accessed_direct_pte(iter.old_pte, iter.level)) {
+			young = true;
+			break;
+		}
+	}
+	direct_walk_iterator_end_traversal(&iter);
+
+	return young;
 }
 
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
 {
-	return kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
+	int young = 0;
+
+	/*
+	 * If there's no access bit in the secondary pte set by the
+	 * hardware it's up to gup-fast/gup to set the access bit in
+	 * the primary pte or in the page structure.
+	 */
+	if (!shadow_accessed_mask)
+		return young;
+
+	if (kvm->arch.direct_mmu_enabled)
+		young |= kvm_handle_direct_hva_range(kvm, hva, hva + 1, 0,
+						     test_age_direct_gfn_range);
+
+	if (!kvm->arch.pure_direct_mmu)
+		young |= kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
+
+	return young;
 }
 
 #ifdef MMU_DEBUG
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d494044104270..771e159d6bea9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -439,7 +439,12 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 	write_lock(&kvm->mmu_lock);
 
 	young = kvm_age_hva(kvm, start, end);
-	if (young)
+
+	/*
+	 * If there was an accessed page in the provided range, or there are
+	 * un-flushed paging structure changes, flush the TLBs.
+	 */
+	if (young || kvm->tlbs_dirty)
 		kvm_flush_remote_tlbs(kvm);
 
 	write_unlock(&kvm->mmu_lock);
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 23/28] kvm: mmu: Make mark_page_dirty_in_slot usable from outside kvm_main
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (21 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 22/28] kvm: mmu: Implement access tracking for the direct MMU Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 24/28] kvm: mmu: Support dirty logging in the direct MMU Ben Gardon
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

When operating on PTEs within a memslot, the dirty status of the page
must be recorded for dirty logging. Currently the only mechanism for
marking pages dirty in mmu.c is mark_page_dirty, which assumes address
space 0. This means that dirty pages in other address spaces will be lost.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 6 ++----
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ce6b22fcb90f3..1212d5c8a3f6d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -753,6 +753,7 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
 struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 771e159d6bea9..ffc6951f2bc93 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -130,8 +130,6 @@ static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
-
 __visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
 
@@ -2214,8 +2212,7 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
-				    gfn_t gfn)
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn)
 {
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
@@ -2223,6 +2220,7 @@ static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
 		set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
+EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);
 
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 {
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 24/28] kvm: mmu: Support dirty logging in the direct MMU
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (22 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 23/28] kvm: mmu: Make mark_page_dirty_in_slot usable from outside kvm_main Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 25/28] kvm: mmu: Support kvm_zap_gfn_range " Ben Gardon
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Adds functions, built on the paging structure iterator, for handling
changes to the dirty state of PTEs and for enabling / resetting dirty
logging.
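
The two dirty-logging modes reduce to two questions about a leaf PTE and
to the corresponding way of clearing that state: in write-protection mode
a PTE is "dirty" if it is writable, in PML / D-bit mode if its dirty bit
is set. Sketched (the mode constants are the ones this patch adds; the
real helpers also check that the PTE is present):

static bool sketch_pte_dirty(u64 pte, int dlog_mode)
{
	if (dlog_mode == KVM_DIRTY_LOG_MODE_WRPROT)
		return pte & PT_WRITABLE_MASK;
	return pte & shadow_dirty_mask;
}

static u64 sketch_pte_clear_dirty(u64 pte, int dlog_mode)
{
	if (dlog_mode == KVM_DIRTY_LOG_MODE_WRPROT)
		return pte & ~PT_WRITABLE_MASK;
	return pte & ~shadow_dirty_mask;
}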

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/include/asm/kvm_host.h |  10 ++
 arch/x86/kvm/mmu.c              | 259 ++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.c          |  10 +-
 arch/x86/kvm/x86.c              |   4 +-
 4 files changed, 269 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9bf149dce146d..b6a3380e66d44 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1305,6 +1305,16 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      struct kvm_memory_slot *memslot);
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
+
+#define KVM_DIRTY_LOG_MODE_WRPROT	1
+#define KVM_DIRTY_LOG_MODE_PML		2
+
+void kvm_mmu_zap_collapsible_direct_ptes(struct kvm *kvm,
+					 const struct kvm_memory_slot *memslot);
+void reset_direct_mmu_dirty_logging(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    int dirty_log_mode,
+				    bool record_dirty_pages);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cc81ba5ee46d6..ca58b27a17c52 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -790,6 +790,18 @@ static bool is_accessed_direct_pte(u64 pte, int level)
 	return pte & shadow_acc_track_mask;
 }
 
+static bool is_dirty_direct_pte(u64 pte, int dlog_mode)
+{
+	/* If the pte is non-present, the entry cannot have been dirtied. */
+	if (!is_present_direct_pte(pte))
+		return false;
+
+	if (dlog_mode == KVM_DIRTY_LOG_MODE_WRPROT)
+		return pte & PT_WRITABLE_MASK;
+
+	return pte & shadow_dirty_mask;
+}
+
 static bool is_accessed_spte(u64 spte)
 {
 	u64 accessed_mask = spte_shadow_accessed_mask(spte);
@@ -1743,6 +1755,38 @@ static void handle_changed_pte_acc_track(u64 old_pte, u64 new_pte, int level)
 		kvm_set_pfn_accessed(spte_to_pfn(old_pte));
 }
 
+static void handle_changed_pte_dlog(struct kvm *kvm, int as_id, gfn_t gfn,
+				    u64 old_pte, u64 new_pte, int level)
+{
+	bool pfn_changed = spte_to_pfn(old_pte) != spte_to_pfn(new_pte);
+	bool was_wrprot_dirty = is_dirty_direct_pte(old_pte,
+						    KVM_DIRTY_LOG_MODE_WRPROT);
+	bool is_wrprot_dirty = is_dirty_direct_pte(new_pte,
+						   KVM_DIRTY_LOG_MODE_WRPROT);
+	bool wrprot_dirty = (!was_wrprot_dirty || pfn_changed) &&
+			    is_wrprot_dirty;
+	struct kvm_memory_slot *slot;
+
+	if (level > PT_PAGE_TABLE_LEVEL)
+		return;
+
+	/*
+	 * Only mark pages dirty if they are becoming writable or no longer have
+	 * the dbit set and dbit dirty logging is enabled.
+	 * If pages are marked dirty when unsetting the dbit when dbit
+	 * dirty logging isn't on, it can cause spurious dirty pages, e.g. from
+	 * zapping PTEs during VM teardown.
+	 * If, on the other hand, pages were only marked dirty when becoming
+	 * writable when in wrprot dirty logging, that would also cause problems
+	 * because dirty pages could be lost when switching from dbit to wrprot
+	 * dirty logging.
+	 */
+	if (wrprot_dirty) {
+		slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn);
+		mark_page_dirty_in_slot(slot, gfn);
+	}
+}
+
 /*
  * Takes a snapshot of, and clears, the direct MMU disconnected pt list. Once
  * TLBs have been flushed, this snapshot can be transferred to the direct MMU
@@ -1873,6 +1917,8 @@ static void mark_pte_disconnected(struct kvm *kvm, int as_id, gfn_t gfn,
 	handle_changed_pte(kvm, as_id, gfn, old_pte, DISCONNECTED_PTE, level,
 			   vm_teardown, disconnected_pts);
 	handle_changed_pte_acc_track(old_pte, DISCONNECTED_PTE, level);
+	handle_changed_pte_dlog(kvm, as_id, gfn, old_pte, DISCONNECTED_PTE,
+				level);
 }
 
 /**
@@ -1964,6 +2010,14 @@ static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
 	bool was_present = is_present_direct_pte(old_pte);
 	bool is_present = is_present_direct_pte(new_pte);
 	bool was_leaf = was_present && is_last_spte(old_pte, level);
+	bool was_dirty = is_dirty_direct_pte(old_pte,
+				KVM_DIRTY_LOG_MODE_WRPROT) ||
+			 is_dirty_direct_pte(old_pte,
+				KVM_DIRTY_LOG_MODE_PML);
+	bool is_dirty = is_dirty_direct_pte(new_pte,
+				KVM_DIRTY_LOG_MODE_WRPROT) ||
+			 is_dirty_direct_pte(new_pte,
+				KVM_DIRTY_LOG_MODE_PML);
 	bool pfn_changed = spte_to_pfn(old_pte) != spte_to_pfn(new_pte);
 	int child_level;
 
@@ -1990,6 +2044,9 @@ static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
 		return;
 	}
 
+	if (((was_dirty && !is_dirty) || pfn_changed) && was_leaf)
+		kvm_set_pfn_dirty(spte_to_pfn(old_pte));
+
 	if (was_present && !was_leaf && (pfn_changed || !is_present)) {
 		/*
 		 * The level of the page table being freed is one level lower
@@ -2439,7 +2496,8 @@ static bool cmpxchg_pte(u64 *ptep, u64 old_pte, u64 new_pte, int level, u64 gfn)
 }
 
 static bool direct_walk_iterator_set_pte_raw(struct direct_walk_iterator *iter,
-					 u64 new_pte, bool handle_acc_track)
+					     u64 new_pte, bool handle_acc_track,
+					     bool handle_dlog)
 {
 	bool r;
 
@@ -2464,6 +2522,11 @@ static bool direct_walk_iterator_set_pte_raw(struct direct_walk_iterator *iter,
 		if (handle_acc_track)
 			handle_changed_pte_acc_track(iter->old_pte, new_pte,
 						     iter->level);
+		if (handle_dlog)
+			handle_changed_pte_dlog(iter->kvm, iter->as_id,
+						iter->pte_gfn_start,
+						iter->old_pte, new_pte,
+						iter->level);
 
 		if (iter->lock_mode & (MMU_WRITE_LOCK | MMU_READ_LOCK))
 			iter->tlbs_dirty++;
@@ -2476,13 +2539,19 @@ static bool direct_walk_iterator_set_pte_raw(struct direct_walk_iterator *iter,
 static bool direct_walk_iterator_set_pte_no_acc_track(
 		struct direct_walk_iterator *iter, u64 new_pte)
 {
-	return direct_walk_iterator_set_pte_raw(iter, new_pte, false);
+	return direct_walk_iterator_set_pte_raw(iter, new_pte, false, true);
+}
+
+static bool direct_walk_iterator_set_pte_no_dlog(
+		struct direct_walk_iterator *iter, u64 new_pte)
+{
+	return direct_walk_iterator_set_pte_raw(iter, new_pte, true, false);
 }
 
 static bool direct_walk_iterator_set_pte(struct direct_walk_iterator *iter,
 					 u64 new_pte)
 {
-	return direct_walk_iterator_set_pte_raw(iter, new_pte, true);
+	return direct_walk_iterator_set_pte_raw(iter, new_pte, true, true);
 }
 
 static u64 generate_nonleaf_pte(u64 *child_pt, bool ad_disabled)
@@ -2500,6 +2569,83 @@ static u64 generate_nonleaf_pte(u64 *child_pt, bool ad_disabled)
 	return pte;
 }
 
+static u64 mark_direct_pte_for_dirty_track(u64 pte, int dlog_mode)
+{
+	if (dlog_mode == KVM_DIRTY_LOG_MODE_WRPROT)
+		pte &= ~PT_WRITABLE_MASK;
+	else
+		pte &= ~shadow_dirty_mask;
+
+	return pte;
+}
+
+void reset_direct_mmu_dirty_logging(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    int dirty_log_mode, bool record_dirty_pages)
+{
+	struct direct_walk_iterator iter;
+	u64 new_pte;
+	bool pte_set;
+
+	write_lock(&kvm->mmu_lock);
+
+	direct_walk_iterator_setup_walk(&iter, kvm, slot->as_id, slot->base_gfn,
+			slot->base_gfn + slot->npages,
+			MMU_WRITE_LOCK);
+	while (direct_walk_iterator_next_present_leaf_pte(&iter)) {
+		if (iter.level == PT_PAGE_TABLE_LEVEL &&
+		    !is_dirty_direct_pte(iter.old_pte, dirty_log_mode))
+			continue;
+
+		new_pte = mark_direct_pte_for_dirty_track(iter.old_pte,
+							  dirty_log_mode);
+
+		if (record_dirty_pages)
+			pte_set = direct_walk_iterator_set_pte(&iter, new_pte);
+		else
+			pte_set = direct_walk_iterator_set_pte_no_dlog(&iter,
+								       new_pte);
+		if (!pte_set)
+			continue;
+	}
+	if (direct_walk_iterator_end_traversal(&iter))
+		kvm_flush_remote_tlbs(kvm);
+	write_unlock(&kvm->mmu_lock);
+}
+EXPORT_SYMBOL_GPL(reset_direct_mmu_dirty_logging);
+
+static bool clear_direct_dirty_log_gfn_masked(struct kvm *kvm,
+		struct kvm_memory_slot *slot, gfn_t gfn, unsigned long mask,
+		int dirty_log_mode, enum mmu_lock_mode lock_mode)
+{
+	struct direct_walk_iterator iter;
+	u64 new_pte;
+
+	direct_walk_iterator_setup_walk(&iter, kvm, slot->as_id,
+			gfn + __ffs(mask), gfn + BITS_PER_LONG, lock_mode);
+	while (mask && direct_walk_iterator_next_present_leaf_pte(&iter)) {
+		if (iter.level > PT_PAGE_TABLE_LEVEL) {
+			BUG_ON(iter.old_pte & PT_WRITABLE_MASK);
+			continue;
+		}
+
+		if (!is_dirty_direct_pte(iter.old_pte, dirty_log_mode))
+			continue;
+
+		if (!(mask & (1UL << (iter.pte_gfn_start - gfn))))
+			continue;
+
+		new_pte = mark_direct_pte_for_dirty_track(iter.old_pte,
+							  dirty_log_mode);
+
+		if (!direct_walk_iterator_set_pte_no_dlog(&iter, new_pte))
+			continue;
+
+		mask &= ~(1UL << (iter.pte_gfn_start - gfn));
+	}
+	return direct_walk_iterator_end_traversal(&iter);
+}
+
 /**
  * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
  * @kvm: kvm instance
@@ -2509,12 +2655,24 @@ static u64 generate_nonleaf_pte(u64 *child_pt, bool ad_disabled)
  *
  * Used when we do not need to care about huge page mappings: e.g. during dirty
  * logging we do not have any such mappings.
+ *
+ * We don't need to worry about flushing tlbs here as they are flushed
+ * unconditionally at a higher level. See the comments on
+ * kvm_vm_ioctl_get_dirty_log and kvm_mmu_slot_remove_write_access.
  */
 static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 				     struct kvm_memory_slot *slot,
 				     gfn_t gfn_offset, unsigned long mask)
 {
 	struct kvm_rmap_head *rmap_head;
+	gfn_t gfn = slot->base_gfn + gfn_offset;
+
+	if (kvm->arch.direct_mmu_enabled)
+		clear_direct_dirty_log_gfn_masked(kvm, slot, gfn, mask,
+						  KVM_DIRTY_LOG_MODE_WRPROT,
+						  MMU_WRITE_LOCK);
+	if (kvm->arch.pure_direct_mmu)
+		return;
 
 	while (mask) {
 		rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
@@ -2541,6 +2699,16 @@ void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 				     gfn_t gfn_offset, unsigned long mask)
 {
 	struct kvm_rmap_head *rmap_head;
+	gfn_t gfn = slot->base_gfn + gfn_offset;
+
+	if (!mask)
+		return;
+
+	if (kvm->arch.direct_mmu_enabled)
+		clear_direct_dirty_log_gfn_masked(kvm, slot, gfn, mask,
+				KVM_DIRTY_LOG_MODE_PML, MMU_WRITE_LOCK);
+	if (kvm->arch.pure_direct_mmu)
+		return;
 
 	while (mask) {
 		rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
@@ -3031,6 +3199,7 @@ static int age_direct_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
 					iter.old_pte);
 			new_pte |= shadow_acc_track_value;
 		}
+		new_pte &= ~shadow_dirty_mask;
 
 		/*
 		 * We've created a new pte with the accessed state cleared.
@@ -7293,11 +7462,17 @@ static bool slot_rmap_write_protect(struct kvm *kvm,
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      struct kvm_memory_slot *memslot)
 {
-	bool flush;
+	bool flush = false;
+
+	if (kvm->arch.direct_mmu_enabled)
+		reset_direct_mmu_dirty_logging(kvm, memslot,
+				KVM_DIRTY_LOG_MODE_WRPROT, false);
 
 	write_lock(&kvm->mmu_lock);
-	flush = slot_handle_all_level(kvm, memslot, slot_rmap_write_protect,
-				      false);
+	if (!kvm->arch.pure_direct_mmu)
+		flush = slot_handle_all_level(kvm, memslot,
+					      slot_rmap_write_protect,
+					      false);
 	write_unlock(&kvm->mmu_lock);
 
 	/*
@@ -7367,8 +7542,42 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 {
 	/* FIXME: const-ify all uses of struct kvm_memory_slot.  */
 	write_lock(&kvm->mmu_lock);
-	slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
-			 kvm_mmu_zap_collapsible_spte, true);
+	if (!kvm->arch.pure_direct_mmu)
+		slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
+				 kvm_mmu_zap_collapsible_spte, true);
+	write_unlock(&kvm->mmu_lock);
+}
+
+void kvm_mmu_zap_collapsible_direct_ptes(struct kvm *kvm,
+					 const struct kvm_memory_slot *memslot)
+{
+	struct direct_walk_iterator iter;
+	kvm_pfn_t pfn;
+
+	if (!kvm->arch.direct_mmu_enabled)
+		return;
+
+	write_lock(&kvm->mmu_lock);
+
+	direct_walk_iterator_setup_walk(&iter, kvm, memslot->as_id,
+					memslot->base_gfn,
+					memslot->base_gfn + memslot->npages,
+					MMU_READ_LOCK | MMU_LOCK_MAY_RESCHED);
+	while (direct_walk_iterator_next_present_leaf_pte(&iter)) {
+		pfn = spte_to_pfn(iter.old_pte);
+		if (kvm_is_reserved_pfn(pfn) ||
+		    !PageTransCompoundMap(pfn_to_page(pfn)))
+			continue;
+		/*
+		 * If the compare / exchange succeeds, then we will continue on
+		 * to the next pte. If it fails, the next iteration will repeat
+		 * the current pte. We'll handle both cases in the same way, so
+		 * we don't need to check the result here.
+		 */
+		direct_walk_iterator_set_pte(&iter, 0);
+	}
+	direct_walk_iterator_end_traversal(&iter);
+
 	write_unlock(&kvm->mmu_lock);
 }
 
@@ -7414,18 +7623,46 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_slot_largepage_remove_write_access);
 
+static bool slot_set_dirty_direct(struct kvm *kvm,
+			    struct kvm_memory_slot *memslot)
+{
+	struct direct_walk_iterator iter;
+	u64 new_pte;
+
+	direct_walk_iterator_setup_walk(&iter, kvm, memslot->as_id,
+			memslot->base_gfn, memslot->base_gfn + memslot->npages,
+			MMU_WRITE_LOCK | MMU_LOCK_MAY_RESCHED);
+	while (direct_walk_iterator_next_present_pte(&iter)) {
+		new_pte = iter.old_pte | shadow_dirty_mask;
+
+		if (!direct_walk_iterator_set_pte(&iter, new_pte))
+			continue;
+	}
+	return direct_walk_iterator_end_traversal(&iter);
+}
+
 void kvm_mmu_slot_set_dirty(struct kvm *kvm,
 			    struct kvm_memory_slot *memslot)
 {
-	bool flush;
+	bool flush = false;
 
 	write_lock(&kvm->mmu_lock);
-	flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false);
+	if (kvm->arch.direct_mmu_enabled)
+		flush |= slot_set_dirty_direct(kvm, memslot);
+
+	if (!kvm->arch.pure_direct_mmu)
+		flush |= slot_handle_all_level(kvm, memslot, __rmap_set_dirty,
+					       false);
 	write_unlock(&kvm->mmu_lock);
 
 	lockdep_assert_held(&kvm->slots_lock);
 
-	/* see kvm_mmu_slot_leaf_clear_dirty */
+	/*
+	 * It's also safe to flush TLBs out of mmu lock here as currently this
+	 * function is only used for dirty logging, in which case flushing TLB
+	 * out of mmu lock also guarantees no dirty pages will be lost in
+	 * dirty_bitmap.
+	 */
 	if (flush)
 		kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn,
 				memslot->npages);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d4575ffb3cec7..aab8f3ab456ec 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7221,8 +7221,14 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 static void vmx_slot_enable_log_dirty(struct kvm *kvm,
 				     struct kvm_memory_slot *slot)
 {
-	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
-	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
+	if (kvm->arch.direct_mmu_enabled)
+		reset_direct_mmu_dirty_logging(kvm, slot,
+					       KVM_DIRTY_LOG_MODE_PML, false);
+
+	if (!kvm->arch.pure_direct_mmu) {
+		kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
+		kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
+	}
 }
 
 static void vmx_slot_disable_log_dirty(struct kvm *kvm,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2972b6c6029fb..edd7d7bece2fe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9776,8 +9776,10 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	 */
 	if (change == KVM_MR_FLAGS_ONLY &&
 		(old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
-		!(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
+		!(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) {
 		kvm_mmu_zap_collapsible_sptes(kvm, new);
+		kvm_mmu_zap_collapsible_direct_ptes(kvm, new);
+	}
 
 	/*
 	 * Set up write protection and/or dirty logging for the new slot.
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 25/28] kvm: mmu: Support kvm_zap_gfn_range in the direct MMU
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (23 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 24/28] kvm: mmu: Support dirty logging in the direct MMU Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 26/28] kvm: mmu: Integrate direct MMU with nesting Ben Gardon
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Add a function for zapping ranges of GFNs in a memslot to support
kvm_zap_gfn_range for the direct MMU.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 27 +++++++++++++++++++++------
 arch/x86/kvm/mmu.h |  2 ++
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ca58b27a17c52..a0c5271ae2381 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -7427,13 +7427,32 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
 	kvm_mmu_uninit_direct_mmu(kvm);
 }
 
+void kvm_zap_slot_gfn_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			    gfn_t start, gfn_t end)
+{
+	write_lock(&kvm->mmu_lock);
+	if (kvm->arch.direct_mmu_enabled) {
+		zap_direct_gfn_range(kvm, memslot->as_id, start, end,
+				     MMU_READ_LOCK);
+	}
+
+	if (kvm->arch.pure_direct_mmu) {
+		write_unlock(&kvm->mmu_lock);
+		return;
+	}
+
+	slot_handle_level_range(kvm, memslot, kvm_zap_rmapp,
+				PT_PAGE_TABLE_LEVEL, PT_MAX_HUGEPAGE_LEVEL,
+				start, end - 1, true);
+	write_unlock(&kvm->mmu_lock);
+}
+
 void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
 	int i;
 
-	write_lock(&kvm->mmu_lock);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		slots = __kvm_memslots(kvm, i);
 		kvm_for_each_memslot(memslot, slots) {
@@ -7444,13 +7463,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 			if (start >= end)
 				continue;
 
-			slot_handle_level_range(kvm, memslot, kvm_zap_rmapp,
-						PT_PAGE_TABLE_LEVEL, PT_MAX_HUGEPAGE_LEVEL,
-						start, end - 1, true);
+			kvm_zap_slot_gfn_range(kvm, memslot, start, end);
 		}
 	}
-
-	write_unlock(&kvm->mmu_lock);
 }
 
 static bool slot_rmap_write_protect(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 11f8ec89433b6..4ea8a72c8868d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -204,6 +204,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 }
 
 void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+void kvm_zap_slot_gfn_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			    gfn_t start, gfn_t end);
 
 void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 26/28] kvm: mmu: Integrate direct MMU with nesting
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (24 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 25/28] kvm: mmu: Support kvm_zap_gfn_range " Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 27/28] kvm: mmu: Lazily allocate rmap when direct MMU is enabled Ben Gardon
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Allow the existing nesting implementation to interoperate with the
direct MMU.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 45 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a0c5271ae2381..e0f35da0d1027 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2742,6 +2742,29 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
+static bool rmap_write_protect_direct_gfn(struct kvm *kvm,
+					  struct kvm_memory_slot *slot,
+					  gfn_t gfn)
+{
+	struct direct_walk_iterator iter;
+	u64 new_pte;
+
+	direct_walk_iterator_setup_walk(&iter, kvm, slot->as_id, gfn, gfn + 1,
+					MMU_WRITE_LOCK);
+	while (direct_walk_iterator_next_present_leaf_pte(&iter)) {
+		if (!is_writable_pte(iter.old_pte) &&
+		    !spte_can_locklessly_be_made_writable(iter.old_pte))
+			break;
+
+		new_pte = iter.old_pte &
+			~(PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE);
+
+		if (!direct_walk_iterator_set_pte(&iter, new_pte))
+			continue;
+	}
+	return direct_walk_iterator_end_traversal(&iter);
+}
+
 /**
  * kvm_arch_write_log_dirty - emulate dirty page logging
  * @vcpu: Guest mode vcpu
@@ -2764,6 +2787,10 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	int i;
 	bool write_protected = false;
 
+	if (kvm->arch.direct_mmu_enabled)
+		write_protected |= rmap_write_protect_direct_gfn(kvm, slot,
+								 gfn);
+
 	for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
 		rmap_head = __gfn_to_rmap(gfn, i, slot);
 		write_protected |= __rmap_write_protect(kvm, rmap_head, true);
@@ -5755,6 +5782,8 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_cr3,
 	uint i;
 	struct kvm_mmu_root_info root;
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
+	bool direct_mmu_root = (vcpu->kvm->arch.direct_mmu_enabled &&
+				new_role.direct);
 
 	root.cr3 = mmu->root_cr3;
 	root.hpa = mmu->root_hpa;
@@ -5762,10 +5791,14 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_cr3,
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
 		swap(root, mmu->prev_roots[i]);
 
-		if (new_cr3 == root.cr3 && VALID_PAGE(root.hpa) &&
-		    page_header(root.hpa) != NULL &&
-		    new_role.word == page_header(root.hpa)->role.word)
-			break;
+		if (new_cr3 == root.cr3 && VALID_PAGE(root.hpa)) {
+			BUG_ON(direct_mmu_root &&
+				!is_direct_mmu_root(vcpu->kvm, root.hpa));
+
+			if (direct_mmu_root || (page_header(root.hpa) != NULL &&
+			    new_role.word == page_header(root.hpa)->role.word))
+				break;
+		}
 	}
 
 	mmu->root_hpa = root.hpa;
@@ -5813,8 +5846,14 @@ static bool fast_cr3_switch(struct kvm_vcpu *vcpu, gpa_t new_cr3,
 			 */
 			vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 
-			__clear_sp_write_flooding_count(
-				page_header(mmu->root_hpa));
+			/*
+			 * If this is a direct MMU root page, it doesn't have a
+			 * write flooding count.
+			 */
+			if (!(vcpu->kvm->arch.direct_mmu_enabled &&
+			      new_role.direct))
+				__clear_sp_write_flooding_count(
+						page_header(mmu->root_hpa));
 
 			return true;
 		}
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 27/28] kvm: mmu: Lazily allocate rmap when direct MMU is enabled
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (25 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 26/28] kvm: mmu: Integrate direct MMU with nesting Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-09-26 23:18 ` [RFC PATCH 28/28] kvm: mmu: Support MMIO in the direct MMU Ben Gardon
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

When the MMU is in pure direct mode, it uses a paging structure walk
iterator and does not require the rmap. The rmap requires 8 bytes for
every PTE that could be used to map guest memory. It is an expensive data
strucutre at ~0.2% of the size of guest memory. Delay allocating the rmap
until the MMU is no longer in pure direct mode. This could be caused,
for example, by the guest launching a nested, L2 VM.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 15 ++++++++++
 arch/x86/kvm/x86.c | 72 ++++++++++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/x86.h |  2 ++
 3 files changed, 83 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e0f35da0d1027..72c2289132c43 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5228,8 +5228,23 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	u64 pdptr, pm_mask;
 	gfn_t root_gfn, root_cr3;
 	int i;
+	int r;
 
 	write_lock(&vcpu->kvm->mmu_lock);
+	if (vcpu->kvm->arch.pure_direct_mmu) {
+		write_unlock(&vcpu->kvm->mmu_lock);
+		/*
+		 * If this is the first time a VCPU has allocated shadow roots
+		 * and the direct MMU is enabled on this VM, it will need to
+		 * allocate rmaps for all its memslots. If the rmaps are already
+		 * allocated, this call will have no effect.
+		 */
+		r = kvm_allocate_rmaps(vcpu->kvm);
+		if (r < 0)
+			return r;
+		write_lock(&vcpu->kvm->mmu_lock);
+	}
+
 	vcpu->kvm->arch.pure_direct_mmu = false;
 	write_unlock(&vcpu->kvm->mmu_lock);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index edd7d7bece2fe..566521f956425 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9615,14 +9615,21 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
 	kvm_page_track_free_memslot(free, dont);
 }
 
-int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
-			    unsigned long npages)
+static int allocate_memslot_rmap(struct kvm *kvm,
+				   struct kvm_memory_slot *slot,
+				   unsigned long npages)
 {
 	int i;
 
+	/*
+	 * rmaps are allocated all-or-nothing under the slots
+	 * lock, so we only need to check that the first rmap
+	 * has been allocated.
+	 */
+	if (slot->arch.rmap[0])
+		return 0;
+
 	for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
-		struct kvm_lpage_info *linfo;
-		unsigned long ugfn;
 		int lpages;
 		int level = i + 1;
 
@@ -9634,8 +9641,61 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 				 GFP_KERNEL_ACCOUNT);
 		if (!slot->arch.rmap[i])
 			goto out_free;
-		if (i == 0)
-			continue;
+	}
+	return 0;
+
+out_free:
+	for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
+		kvfree(slot->arch.rmap[i]);
+		slot->arch.rmap[i] = NULL;
+	}
+	return -ENOMEM;
+}
+
+int kvm_allocate_rmaps(struct kvm *kvm)
+{
+	struct kvm_memslots *slots;
+	struct kvm_memory_slot *slot;
+	int r = 0;
+	int i;
+
+	mutex_lock(&kvm->slots_lock);
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot(slot, slots) {
+			r = allocate_memslot_rmap(kvm, slot, slot->npages);
+			if (r < 0)
+				break;
+		}
+	}
+	mutex_unlock(&kvm->slots_lock);
+	return r;
+}
+
+int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
+			    unsigned long npages)
+{
+	int i;
+	int r;
+
+	/* Set the rmap pointer for each level to NULL */
+	memset(slot->arch.rmap, 0,
+	       ARRAY_SIZE(slot->arch.rmap) * sizeof(*slot->arch.rmap));
+
+	if (!kvm->arch.pure_direct_mmu) {
+		r = allocate_memslot_rmap(kvm, slot, npages);
+		if (r < 0)
+			return r;
+	}
+
+	for (i = 1; i < KVM_NR_PAGE_SIZES; ++i) {
+		struct kvm_lpage_info *linfo;
+		unsigned long ugfn;
+		int lpages;
+		int level = i + 1;
+
+		lpages = gfn_to_index(slot->base_gfn + npages - 1,
+				      slot->base_gfn, level) + 1;
 
 		linfo = kvcalloc(lpages, sizeof(*linfo), GFP_KERNEL_ACCOUNT);
 		if (!linfo)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index dbf7442a822b6..91bfbfd2c58d4 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -369,4 +369,6 @@ static inline bool kvm_pat_valid(u64 data)
 void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu);
 void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu);
 
+int kvm_allocate_rmaps(struct kvm *kvm);
+
 #endif
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 28/28] kvm: mmu: Support MMIO in the direct MMU
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (26 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 27/28] kvm: mmu: Lazily allocate rmap when direct MMU is enabled Ben Gardon
@ 2019-09-26 23:18 ` Ben Gardon
  2019-10-17 18:50 ` [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Sean Christopherson
  2019-11-27 19:09 ` Sean Christopherson
  29 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-09-26 23:18 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson, Ben Gardon

Add direct MMU handlers to the functions required to support MMIO.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu.c | 91 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 68 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 72c2289132c43..0a23daea0df50 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5464,49 +5464,94 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 	return vcpu_match_mmio_gva(vcpu, addr);
 }
 
-/* return true if reserved bit is detected on spte. */
-static bool
-walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
+/*
+ * Return the level of the lowest level pte added to ptes.
+ * That pte may be non-present.
+ */
+static int direct_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *ptes)
 {
-	struct kvm_shadow_walk_iterator iterator;
-	u64 sptes[PT64_ROOT_MAX_LEVEL], spte = 0ull;
-	int root, leaf;
-	bool reserved = false;
+	struct direct_walk_iterator iter;
+	int leaf = vcpu->arch.mmu->root_level;
 
-	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
-		goto exit;
+	direct_walk_iterator_setup_walk(&iter, vcpu->kvm,
+			kvm_arch_vcpu_memslots_id(vcpu), addr >> PAGE_SHIFT,
+			(addr >> PAGE_SHIFT) + 1, MMU_NO_LOCK);
+	while (direct_walk_iterator_next_pte(&iter)) {
+		leaf = iter.level;
+		ptes[leaf - 1] = iter.old_pte;
+		if (!is_shadow_present_pte(iter.old_pte))
+			break;
+	}
+	direct_walk_iterator_end_traversal(&iter);
+
+	return leaf;
+}
+
+/*
+ * Return the level of the lowest level spte added to sptes.
+ * That spte may be non-present.
+ */
+static int shadow_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes)
+{
+	struct kvm_shadow_walk_iterator iterator;
+	int leaf = vcpu->arch.mmu->root_level;
+	u64 spte;
 
 	walk_shadow_page_lockless_begin(vcpu);
 
-	for (shadow_walk_init(&iterator, vcpu, addr),
-		 leaf = root = iterator.level;
+	for (shadow_walk_init(&iterator, vcpu, addr);
 	     shadow_walk_okay(&iterator);
 	     __shadow_walk_next(&iterator, spte)) {
+		leaf = iterator.level;
 		spte = mmu_spte_get_lockless(iterator.sptep);
-
 		sptes[leaf - 1] = spte;
-		leaf--;
 
 		if (!is_shadow_present_pte(spte))
 			break;
-
-		reserved |= is_shadow_zero_bits_set(vcpu->arch.mmu, spte,
-						    iterator.level);
 	}
 
 	walk_shadow_page_lockless_end(vcpu);
 
+	return leaf;
+}
+
+/* return true if reserved bit is detected on spte. */
+static bool get_mmio_pte(struct kvm_vcpu *vcpu, u64 addr, bool direct,
+			 u64 *ptep)
+{
+	u64 ptes[PT64_ROOT_MAX_LEVEL];
+	int root = vcpu->arch.mmu->root_level;
+	int leaf;
+	int level;
+	bool reserved = false;
+
+
+	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa)) {
+		*ptep = 0ull;
+		return reserved;
+	}
+
+	if (direct && vcpu->kvm->arch.direct_mmu_enabled)
+		leaf = direct_mmu_get_walk(vcpu, addr, ptes);
+	else
+		leaf = shadow_mmu_get_walk(vcpu, addr, ptes);
+
+	for (level = root; level >= leaf; level--) {
+		if (!is_shadow_present_pte(ptes[level - 1]))
+			break;
+		reserved |= is_shadow_zero_bits_set(vcpu->arch.mmu,
+				ptes[level - 1], level);
+	}
+
 	if (reserved) {
 		pr_err("%s: detect reserved bits on spte, addr 0x%llx, dump hierarchy:\n",
 		       __func__, addr);
-		while (root > leaf) {
+		for (level = root; level >= leaf; level--)
 			pr_err("------ spte 0x%llx level %d.\n",
-			       sptes[root - 1], root);
-			root--;
-		}
+			       ptes[level - 1], level);
 	}
-exit:
-	*sptep = spte;
+
+	*ptep = ptes[leaf - 1];
 	return reserved;
 }
 
@@ -5518,7 +5563,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 	if (mmio_info_in_cache(vcpu, addr, direct))
 		return RET_PF_EMULATE;
 
-	reserved = walk_shadow_page_get_mmio_spte(vcpu, addr, &spte);
+	reserved = get_mmio_pte(vcpu, addr, direct, &spte);
 	if (WARN_ON(reserved))
 		return -EINVAL;
 
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (27 preceding siblings ...)
  2019-09-26 23:18 ` [RFC PATCH 28/28] kvm: mmu: Support MMIO in the direct MMU Ben Gardon
@ 2019-10-17 18:50 ` Sean Christopherson
  2019-10-18 13:42   ` Paolo Bonzini
  2019-11-27 19:09 ` Sean Christopherson
  29 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2019-10-17 18:50 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:17:56PM -0700, Ben Gardon wrote:
> Over the years, the needs for KVM's x86 MMU have grown from running small
> guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
> we previously depended upon shadow paging to run all guests, we now have
> the use of two dimensional paging (TDP). This RFC proposes and
> demonstrates two major changes to the MMU. First, an iterator abstraction 
> that simplifies traversal of TDP paging structures when running an L1
> guest. This abstraction takes advantage of the relative simplicity of TDP
> to simplify the implementation of MMU functions. Second, this RFC changes
> the synchronization model to enable more parallelism than the monolithic
> MMU lock. This "direct mode" MMU is currently in use at Google and has
> given us the performance necessary to live migrate our 416 vCPU, 12TiB
> m2-ultramem-416 VMs.
> 
> The primary motivation for this work was to handle page faults in
> parallel. When VMs have hundreds of vCPUs and terabytes of memory, KVM's
> MMU lock suffers from extreme contention, resulting in soft-lockups and
> jitter in the guest. To demonstrate this I also written, and will submit
> a demand paging test to KVM selftests. The test creates N vCPUs, which
> each touch disjoint regions of memory. Page faults are picked up by N
> user fault FD handlers, one for each vCPU. Over a 1 second profile of
> the demand paging test, with 416 vCPUs and 4G per vCPU, 98% of the
> execution time was spent waiting for the MMU lock! With this patch
> series the total execution time for the test was reduced by 89% and the
> execution was dominated by get_user_pages and the user fault FD ioctl.
> As a secondary benefit, the iterator-based implementation does not use
> the rmap or struct kvm_mmu_pages, saving ~0.2% of guest memory in KVM
> overheads.
> 
> The goal of this  RFC is to demonstrate and gather feedback on the
> iterator pattern, the memory savings it enables for the "direct case"
> and the changes to the synchronization model. Though they are interwoven
> in this series, I will separate the iterator from the synchronization
> changes in a future series. I recognize that some feature work will be
> needed to make this patch set ready for merging. That work is detailed
> at the end of this cover letter.

Diving into this series is on my todo list, but realistically that's not
going to happen until after KVM forum.  Sorry I can't provide timely
feedback.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case
  2019-10-17 18:50 ` [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Sean Christopherson
@ 2019-10-18 13:42   ` Paolo Bonzini
  0 siblings, 0 replies; 57+ messages in thread
From: Paolo Bonzini @ 2019-10-18 13:42 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: kvm, Peter Feiner, Peter Shier, Junaid Shahid, Jim Mattson

On 17/10/19 20:50, Sean Christopherson wrote:
> On Thu, Sep 26, 2019 at 04:17:56PM -0700, Ben Gardon wrote:
>> Over the years, the needs for KVM's x86 MMU have grown from running small
>> guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
>> we previously depended upon shadow paging to run all guests, we now have
>> the use of two dimensional paging (TDP). This RFC proposes and
>> demonstrates two major changes to the MMU. First, an iterator abstraction 
>> that simplifies traversal of TDP paging structures when running an L1
>> guest. This abstraction takes advantage of the relative simplicity of TDP
>> to simplify the implementation of MMU functions. Second, this RFC changes
>> the synchronization model to enable more parallelism than the monolithic
>> MMU lock. This "direct mode" MMU is currently in use at Google and has
>> given us the performance necessary to live migrate our 416 vCPU, 12TiB
>> m2-ultramem-416 VMs.
>>
>> The primary motivation for this work was to handle page faults in
>> parallel. When VMs have hundreds of vCPUs and terabytes of memory, KVM's
>> MMU lock suffers from extreme contention, resulting in soft-lockups and
>> jitter in the guest. To demonstrate this I also written, and will submit
>> a demand paging test to KVM selftests. The test creates N vCPUs, which
>> each touch disjoint regions of memory. Page faults are picked up by N
>> user fault FD handlers, one for each vCPU. Over a 1 second profile of
>> the demand paging test, with 416 vCPUs and 4G per vCPU, 98% of the
>> execution time was spent waiting for the MMU lock! With this patch
>> series the total execution time for the test was reduced by 89% and the
>> execution was dominated by get_user_pages and the user fault FD ioctl.
>> As a secondary benefit, the iterator-based implementation does not use
>> the rmap or struct kvm_mmu_pages, saving ~0.2% of guest memory in KVM
>> overheads.
>>
>> The goal of this  RFC is to demonstrate and gather feedback on the
>> iterator pattern, the memory savings it enables for the "direct case"
>> and the changes to the synchronization model. Though they are interwoven
>> in this series, I will separate the iterator from the synchronization
>> changes in a future series. I recognize that some feature work will be
>> needed to make this patch set ready for merging. That work is detailed
>> at the end of this cover letter.
> 
> Diving into this series is on my todo list, but realistically that's not
> going to happen until after KVM forum.  Sorry I can't provide timely
> feedback.

Same here.  I was very lazily waiting to get the big picture from Ben's
talk.

Paolo


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 01/28] kvm: mmu: Separate generating and setting mmio ptes
  2019-09-26 23:17 ` [RFC PATCH 01/28] kvm: mmu: Separate generating and setting mmio ptes Ben Gardon
@ 2019-11-27 18:15   ` Sean Christopherson
  0 siblings, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2019-11-27 18:15 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:17:57PM -0700, Ben Gardon wrote:
> Separate the functions for generating MMIO page table entries from the
> function that inserts them into the paging structure. This refactoring
> will allow changes to the MMU synchronization model to use atomic
> compare / exchanges (which are not guaranteed to succeed) instead of a
> monolithic MMU lock.
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 5269aa057dfa6..781c2ca7455e3 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -390,8 +390,7 @@ static u64 get_mmio_spte_generation(u64 spte)
>  	return gen;
>  }
>  
> -static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
> -			   unsigned access)
> +static u64 generate_mmio_pte(struct kvm_vcpu *vcpu, u64 gfn, unsigned access)

Maybe get_mmio_spte_value()?  I see "generate" and all I can think of is
the generation number and nothing else.

>  {
>  	u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
>  	u64 mask = generation_mmio_spte_mask(gen);
> @@ -403,6 +402,17 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
>  	mask |= (gpa & shadow_nonpresent_or_rsvd_mask)
>  		<< shadow_nonpresent_or_rsvd_mask_len;
>  
> +	return mask;
> +}
> +
> +static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
> +			   unsigned access)
> +{
> +	u64 mask = generate_mmio_pte(vcpu, gfn, access);
> +	unsigned int gen = get_mmio_spte_generation(mask);
> +
> +	access = mask & ACC_ALL;
> +
>  	trace_mark_mmio_spte(sptep, gfn, access, gen);
>  	mmu_spte_set(sptep, mask);
>  }
> -- 
> 2.23.0.444.g18eeb5a265-goog
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 02/28] kvm: mmu: Separate pte generation from set_spte
  2019-09-26 23:17 ` [RFC PATCH 02/28] kvm: mmu: Separate pte generation from set_spte Ben Gardon
@ 2019-11-27 18:25   ` Sean Christopherson
  0 siblings, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2019-11-27 18:25 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:17:58PM -0700, Ben Gardon wrote:
> Separate the functions for generating leaf page table entries from the
> function that inserts them into the paging structure. This refactoring
> will allow changes to the MMU synchronization model to use atomic
> compare / exchanges (which are not guaranteed to succeed) instead of a
> monolithic MMU lock.
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu.c | 93 ++++++++++++++++++++++++++++------------------
>  1 file changed, 57 insertions(+), 36 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 781c2ca7455e3..7e5ab9c6e2b09 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2964,21 +2964,15 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
>  #define SET_SPTE_WRITE_PROTECTED_PT	BIT(0)
>  #define SET_SPTE_NEED_REMOTE_TLB_FLUSH	BIT(1)
>  
> -static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> -		    unsigned pte_access, int level,
> -		    gfn_t gfn, kvm_pfn_t pfn, bool speculative,
> -		    bool can_unsync, bool host_writable)
> +static int generate_pte(struct kvm_vcpu *vcpu, unsigned pte_access, int level,

Similar comment on "generate".  Note, I don't necessarily like the names
get_mmio_spte_value() or get_spte_value() either as they could be
misinterpreted as reading the value from memory.  Maybe
calc_{mmio_}spte_value()?

> +		    gfn_t gfn, kvm_pfn_t pfn, u64 old_pte, bool speculative,
> +		    bool can_unsync, bool host_writable, bool ad_disabled,
> +		    u64 *ptep)
>  {
> -	u64 spte = 0;
> +	u64 pte;

Renames and unrelated refactoring (leaving the variable uninitialized and
setting it directly to shadow_present_mask) belong in separate patches.
The renames especially make this patch much more difficult to review.  And,
I disagree with the rename, I think it's important to keep the "spte"
nomenclature, even though it's a bit of a misnomer for TDP entries, so that
it is easy to differentiate data that is coming from the host PTEs versus
data that is for KVM's MMU.

>  	int ret = 0;
> -	struct kvm_mmu_page *sp;
> -
> -	if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access))
> -		return 0;
>  
> -	sp = page_header(__pa(sptep));
> -	if (sp_ad_disabled(sp))
> -		spte |= shadow_acc_track_value;
> +	*ptep = 0;
>  
>  	/*
>  	 * For the EPT case, shadow_present_mask is 0 if hardware
> @@ -2986,36 +2980,39 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  	 * ACC_USER_MASK and shadow_user_mask are used to represent
>  	 * read access.  See FNAME(gpte_access) in paging_tmpl.h.
>  	 */
> -	spte |= shadow_present_mask;
> +	pte = shadow_present_mask;
> +
> +	if (ad_disabled)
> +		pte |= shadow_acc_track_value;
> +
>  	if (!speculative)
> -		spte |= spte_shadow_accessed_mask(spte);
> +		pte |= spte_shadow_accessed_mask(pte);
>  
>  	if (pte_access & ACC_EXEC_MASK)
> -		spte |= shadow_x_mask;
> +		pte |= shadow_x_mask;
>  	else
> -		spte |= shadow_nx_mask;
> +		pte |= shadow_nx_mask;
>  
>  	if (pte_access & ACC_USER_MASK)
> -		spte |= shadow_user_mask;
> +		pte |= shadow_user_mask;
>  
>  	if (level > PT_PAGE_TABLE_LEVEL)
> -		spte |= PT_PAGE_SIZE_MASK;
> +		pte |= PT_PAGE_SIZE_MASK;
>  	if (tdp_enabled)
> -		spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
> +		pte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
>  			kvm_is_mmio_pfn(pfn));
>  
>  	if (host_writable)
> -		spte |= SPTE_HOST_WRITEABLE;
> +		pte |= SPTE_HOST_WRITEABLE;
>  	else
>  		pte_access &= ~ACC_WRITE_MASK;
>  
>  	if (!kvm_is_mmio_pfn(pfn))
> -		spte |= shadow_me_mask;
> +		pte |= shadow_me_mask;
>  
> -	spte |= (u64)pfn << PAGE_SHIFT;
> +	pte |= (u64)pfn << PAGE_SHIFT;
>  
>  	if (pte_access & ACC_WRITE_MASK) {
> -
>  		/*
>  		 * Other vcpu creates new sp in the window between
>  		 * mapping_level() and acquiring mmu-lock. We can
> @@ -3024,9 +3021,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  		 */
>  		if (level > PT_PAGE_TABLE_LEVEL &&
>  		    mmu_gfn_lpage_is_disallowed(vcpu, gfn, level))
> -			goto done;
> +			return 0;
>  
> -		spte |= PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE;
> +		pte |= PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE;
>  
>  		/*
>  		 * Optimization: for pte sync, if spte was writable the hash
> @@ -3034,30 +3031,54 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  		 * is responsibility of mmu_get_page / kvm_sync_page.
>  		 * Same reasoning can be applied to dirty page accounting.
>  		 */
> -		if (!can_unsync && is_writable_pte(*sptep))
> -			goto set_pte;
> +		if (!can_unsync && is_writable_pte(old_pte)) {
> +			*ptep = pte;
> +			return 0;
> +		}
>  
>  		if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
>  			pgprintk("%s: found shadow page for %llx, marking ro\n",
>  				 __func__, gfn);
> -			ret |= SET_SPTE_WRITE_PROTECTED_PT;
> +			ret = SET_SPTE_WRITE_PROTECTED_PT;

More unnecessary refactoring.

>  			pte_access &= ~ACC_WRITE_MASK;
> -			spte &= ~(PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE);
> +			pte &= ~(PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE);
>  		}
>  	}
>  
> -	if (pte_access & ACC_WRITE_MASK) {
> -		kvm_vcpu_mark_page_dirty(vcpu, gfn);
> -		spte |= spte_shadow_dirty_mask(spte);
> -	}
> +	if (pte_access & ACC_WRITE_MASK)
> +		pte |= spte_shadow_dirty_mask(pte);
>  
>  	if (speculative)
> -		spte = mark_spte_for_access_track(spte);
> +		pte = mark_spte_for_access_track(pte);
> +
> +	*ptep = pte;
> +	return ret;
> +}
> +
> +static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access,
> +		    int level, gfn_t gfn, kvm_pfn_t pfn, bool speculative,
> +		    bool can_unsync, bool host_writable)
> +{
> +	u64 spte;
> +	int ret;
> +	struct kvm_mmu_page *sp;
> +
> +	if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access))
> +		return 0;
> +
> +	sp = page_header(__pa(sptep));
> +
> +	ret = generate_pte(vcpu, pte_access, level, gfn, pfn, *sptep,
> +			   speculative, can_unsync, host_writable,
> +			   sp_ad_disabled(sp), &spte);

Yowsers, that's a big prototype.  This is something that came up in an
unrelated internal discussion.  I wonder if it would make sense to define
a struct to hold all of the data needed to insert an spte and pass that
on the stack instead of having a bajillion parameters.  Just spitballing,
no idea if it's feasible and/or reasonable.
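
Something like the sketch below, purely for illustration (the struct and
field names here are hypothetical, not an existing KVM type):

	struct spte_gen_ctx {
		unsigned int pte_access;
		int level;
		gfn_t gfn;
		kvm_pfn_t pfn;
		u64 old_pte;
		bool speculative;
		bool can_unsync;
		bool host_writable;
		bool ad_disabled;
	};

	static int generate_pte(struct kvm_vcpu *vcpu,
				const struct spte_gen_ctx *ctx, u64 *ptep);

A side benefit would be making it harder to accidentally swap two of the
many bool arguments at a call site.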

> +	if (!spte)
> +		return 0;
> +
> +	if (spte & PT_WRITABLE_MASK)
> +		kvm_vcpu_mark_page_dirty(vcpu, gfn);
>  
> -set_pte:
>  	if (mmu_spte_update(sptep, spte))
>  		ret |= SET_SPTE_NEED_REMOTE_TLB_FLUSH;
> -done:
>  	return ret;
>  }
>  
> -- 
> 2.23.0.444.g18eeb5a265-goog
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 03/28] kvm: mmu: Zero page cache memory at allocation time
  2019-09-26 23:17 ` [RFC PATCH 03/28] kvm: mmu: Zero page cache memory at allocation time Ben Gardon
@ 2019-11-27 18:32   ` Sean Christopherson
  0 siblings, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2019-11-27 18:32 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:17:59PM -0700, Ben Gardon wrote:
> Simplify use of the MMU page cache by allocating pages pre-zeroed. This
> ensures that future code does not accidentally add non-zeroed memory to
> the paging structure and moves the work of zeroing pages out from
> under the MMU lock.

Ha, this *just* came up in a different series[*].  Unless there is a hard
dependency on the rest of this series, it'd be nice to tackle this
separately so that we can fully understand the tradeoffs.  And it could be
merged early/independently as well.

[*] https://patchwork.kernel.org/patch/11228487/#23025353

> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 7e5ab9c6e2b09..1ecd6d51c0ee0 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1037,7 +1037,7 @@ static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
>  	if (cache->nobjs >= min)
>  		return 0;
>  	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
> -		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
> +		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>  		if (!page)
>  			return cache->nobjs >= min ? 0 : -ENOMEM;
>  		cache->objects[cache->nobjs++] = page;
> @@ -2548,7 +2548,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  		if (level > PT_PAGE_TABLE_LEVEL && need_sync)
>  			flush |= kvm_sync_pages(vcpu, gfn, &invalid_list);
>  	}
> -	clear_page(sp->spt);
>  	trace_kvm_mmu_get_page(sp, true);
>  
>  	kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
> -- 
> 2.23.0.444.g18eeb5a265-goog
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 04/28] kvm: mmu: Update the lpages stat atomically
  2019-09-26 23:18 ` [RFC PATCH 04/28] kvm: mmu: Update the lpages stat atomically Ben Gardon
@ 2019-11-27 18:39   ` Sean Christopherson
  2019-12-06 20:10     ` Ben Gardon
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2019-11-27 18:39 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:18:00PM -0700, Ben Gardon wrote:
> In order to pave the way for more concurrent MMU operations, updates to
> VM-global stats need to be done atomically. Change updates to the lpages
> stat to be atomic in preparation for the introduction of parallel page
> fault handling.
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 1ecd6d51c0ee0..56587655aecb9 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1532,7 +1532,7 @@ static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
>  		WARN_ON(page_header(__pa(sptep))->role.level ==
>  			PT_PAGE_TABLE_LEVEL);
>  		drop_spte(kvm, sptep);
> -		--kvm->stat.lpages;
> +		xadd(&kvm->stat.lpages, -1);

Manually doing atomic operations without converting the variable itself to
an atomic feels like a hack, e.g. it lacks the compile-time checks provided
by the atomics framework.

Tangentially related, should the members of struct kvm_vm_stat be forced
to 64-bit variables to avoid theoretical wrapping on 32-bit KVM?
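
Purely as a sketch of the converted form (assuming the stat were moved to
the atomics framework rather than updated with a bare xadd):

	atomic64_t lpages;			/* hypothetical type change in kvm_vm_stat */

	atomic64_inc(&kvm->stat.lpages);	/* instead of xadd(&kvm->stat.lpages, 1) */
	atomic64_dec(&kvm->stat.lpages);	/* instead of xadd(&kvm->stat.lpages, -1) */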

>  		return true;
>  	}
>  
> @@ -2676,7 +2676,7 @@ static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
>  		if (is_last_spte(pte, sp->role.level)) {
>  			drop_spte(kvm, spte);
>  			if (is_large_pte(pte))
> -				--kvm->stat.lpages;
> +				xadd(&kvm->stat.lpages, -1);
>  		} else {
>  			child = page_header(pte & PT64_BASE_ADDR_MASK);
>  			drop_parent_pte(child, spte);
> @@ -3134,7 +3134,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access,
>  	pgprintk("%s: setting spte %llx\n", __func__, *sptep);
>  	trace_kvm_mmu_set_spte(level, gfn, sptep);
>  	if (!was_rmapped && is_large_pte(*sptep))
> -		++vcpu->kvm->stat.lpages;
> +		xadd(&vcpu->kvm->stat.lpages, 1);
>  
>  	if (is_shadow_present_pte(*sptep)) {
>  		if (!was_rmapped) {
> -- 
> 2.23.0.444.g18eeb5a265-goog
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 05/28] sched: Add cond_resched_rwlock
  2019-09-26 23:18 ` [RFC PATCH 05/28] sched: Add cond_resched_rwlock Ben Gardon
@ 2019-11-27 18:42   ` Sean Christopherson
  2019-12-06 20:12     ` Ben Gardon
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2019-11-27 18:42 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:18:01PM -0700, Ben Gardon wrote:
> Rescheduling while holding a spin lock is essential for keeping long
> running kernel operations running smoothly. Add the facility to
> cond_resched read/write spin locks.
> 
> RFC_NOTE: The current implementation of this patch set uses a read/write
> lock to replace the existing MMU spin lock. See the next patch in this
> series for more on why a read/write lock was chosen, and possible
> alternatives.

This definitely needs to be run by the sched/locking folks sooner rather
than later.

> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  include/linux/sched.h | 11 +++++++++++
>  kernel/sched/core.c   | 23 +++++++++++++++++++++++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 70db597d6fd4f..4d1fd96693d9b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1767,12 +1767,23 @@ static inline int _cond_resched(void) { return 0; }
>  })
>  
>  extern int __cond_resched_lock(spinlock_t *lock);
> +extern int __cond_resched_rwlock(rwlock_t *lock, bool write_lock);
>  
>  #define cond_resched_lock(lock) ({				\
>  	___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\
>  	__cond_resched_lock(lock);				\
>  })
>  
> +#define cond_resched_rwlock_read(lock) ({			\
> +	__might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);	\
> +	__cond_resched_rwlock(lock, false);			\
> +})
> +
> +#define cond_resched_rwlock_write(lock) ({			\
> +	__might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);	\
> +	__cond_resched_rwlock(lock, true);			\
> +})
> +
>  static inline void cond_resched_rcu(void)
>  {
>  #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f9a1346a5fa95..ba7ed4bed5036 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5663,6 +5663,29 @@ int __cond_resched_lock(spinlock_t *lock)
>  }
>  EXPORT_SYMBOL(__cond_resched_lock);
>  
> +int __cond_resched_rwlock(rwlock_t *lock, bool write_lock)
> +{
> +	int ret = 0;
> +
> +	lockdep_assert_held(lock);
> +	if (should_resched(PREEMPT_LOCK_OFFSET)) {
> +		if (write_lock) {

The existing __cond_resched_lock() checks for resched *or* lock
contention.  Is lock contention not something that needs to (or can't) be
considered?

> +			write_unlock(lock);
> +			preempt_schedule_common();
> +			write_lock(lock);
> +		} else {
> +			read_unlock(lock);
> +			preempt_schedule_common();
> +			read_lock(lock);
> +		}
> +
> +		ret = 1;
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(__cond_resched_rwlock);
> +
>  /**
>   * yield - yield the current processor to other threads.
>   *
> -- 
> 2.23.0.444.g18eeb5a265-goog
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 06/28] kvm: mmu: Replace mmu_lock with a read/write lock
  2019-09-26 23:18 ` [RFC PATCH 06/28] kvm: mmu: Replace mmu_lock with a read/write lock Ben Gardon
@ 2019-11-27 18:47   ` Sean Christopherson
  2019-12-02 22:45     ` Sean Christopherson
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2019-11-27 18:47 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:18:02PM -0700, Ben Gardon wrote:
> Replace the KVM MMU spinlock with a read/write lock so that some parts of
> the MMU can be made more concurrent in future commits by switching some
> write mode acquisitions to read mode. A read/write lock was chosen over
> other synchronization options because it has minimal initial impact: this
> change simply changes all uses of the MMU spin lock to an MMU read/write
> lock, in write mode. This change has no effect on the logic of the code
> and only a small performance penalty.
> 
> Other, more invasive options were considered for synchronizing access to
> the paging structures. Sharding the MMU lock to protect 2MB chunks of
> addresses, as the main MM does, would also work, however it makes
> acquiring locks for operations on large regions of memory expensive.
> Further, the parallel page fault handling algorithm introduced later in
> this series does not require exclusive access to the region of memory
> for which it is handling a fault.
> 
> There are several disadvantages to the read/write lock approach:
> 1. The reader/writer terminology does not apply well to MMU operations.
> 2. Many operations require exclusive access to a region of memory
> (often a memslot), but not all of memory. The read/write lock does not
> facilitate this.
> 3. Contention between readers and writers can still create problems in
> the face of long running MMU operations.
> 
> Despite these issues, the use of a read/write lock facilitates
> substantial improvements over the monolithic locking scheme.
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu.c         | 106 +++++++++++++++++++------------------
>  arch/x86/kvm/page_track.c  |   8 +--
>  arch/x86/kvm/paging_tmpl.h |   8 +--
>  arch/x86/kvm/x86.c         |   4 +-
>  include/linux/kvm_host.h   |   3 +-
>  virt/kvm/kvm_main.c        |  34 ++++++------
>  6 files changed, 83 insertions(+), 80 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 56587655aecb9..0311d18d9a995 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2446,9 +2446,9 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
>  			flush |= kvm_sync_page(vcpu, sp, &invalid_list);
>  			mmu_pages_clear_parents(&parents);
>  		}
> -		if (need_resched() || spin_needbreak(&vcpu->kvm->mmu_lock)) {

I gather there is no equivalent to spin_needbreak() for r/w locks?  Is it
something that can be added?  Losing spinlock contention detection will
negatively impact other flows, e.g. fast zapping all pages will no longer
drop the lock to allow insertion of SPTEs into the new generation of MMU.
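
For illustration, a helper mirroring spin_needbreak() might look like the
sketch below; whether a suitable contention primitive exists for rwlocks
(rwlock_is_contended() here is hypothetical) is exactly the open question:

	/* Hypothetical; assumes an rwlock contention check is available. */
	static inline int rwlock_needbreak(rwlock_t *lock)
	{
	#ifdef CONFIG_PREEMPTION
		return rwlock_is_contended(lock);
	#else
		return 0;
	#endif
	}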

> +		if (need_resched()) {
>  			kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
> -			cond_resched_lock(&vcpu->kvm->mmu_lock);
> +			cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
>  			flush = false;
>  		}
>  	}

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 07/28] kvm: mmu: Add functions for handling changed PTEs
  2019-09-26 23:18 ` [RFC PATCH 07/28] kvm: mmu: Add functions for handling changed PTEs Ben Gardon
@ 2019-11-27 19:04   ` Sean Christopherson
  0 siblings, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2019-11-27 19:04 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:18:03PM -0700, Ben Gardon wrote:
> The existing bookkeeping done by KVM when a PTE is changed is
> spread around several functions. This makes it difficult to remember all
> the stats, bitmaps, and other subsystems that need to be updated whenever
> a PTE is modified. When a non-leaf PTE is marked non-present or becomes
> a leaf PTE, page table memory must also be freed. Further, most of the
> bookkeeping is done before the PTE is actually set. This works well with
> a monolithic MMU lock, however if changes use atomic compare/exchanges,
> the bookkeeping cannot be done before the change is made. In either
> case, there is a short window in which some statistics, e.g. the dirty
> bitmap will be inconsistent, however consistency is still restored
> before the MMU lock is released. To simplify the MMU and facilitate the
> use of atomic operations on PTEs, create functions to handle some of the
> bookkeeping required as a result of the change.

This is one case where splitting into multiple patches is probably not the
best option.  It's difficult to review this patch without seeing how
disconnected PTEs are used.  And, the patch is untestable for all intents
and purposes since there is no external caller, i.e. all of the calls are
self-referential within the new code.

> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu.c | 145 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 145 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 0311d18d9a995..50413f17c7cd0 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -143,6 +143,18 @@ module_param(dbg, bool, 0644);
>  #define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
>  #define SPTE_MMU_WRITEABLE	(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
>  
> +/*
> + * PTEs in a disconnected page table can be set to DISCONNECTED_PTE to indicate
> + * to other threads that the page table in which the pte resides is no longer
> + * connected to the root of a paging structure.
> + *
> + * This constant works because it is considered non-present on both AMD and
> + * Intel CPUs and does not create a L1TF vulnerability because the pfn section
> + * is zeroed out. PTE bit 57 is available to software, per vol 3, figure 28-1
> + * of the Intel SDM and vol 2, figures 5-18 to 5-21 of the AMD APM.
> + */
> +#define DISCONNECTED_PTE (1ull << 57)

Use BIT_ULL, ignore the bad examples in mmu.c :-)
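
i.e. the same value written with the helper:

	#define DISCONNECTED_PTE	BIT_ULL(57)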

> +
>  #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
>  
>  /* make pte_list_desc fit well in cache line */
> @@ -555,6 +567,16 @@ static int is_shadow_present_pte(u64 pte)
>  	return (pte != 0) && !is_mmio_spte(pte);
>  }
>  
> +static inline int is_disconnected_pte(u64 pte)
> +{
> +	return pte == DISCONNECTED_PTE;
> +}

An explicit comparison scares me a bit, but that's just my off the cuff
reaction.  I'll come back to the meat of this series after turkey day.

> +
> +static int is_present_direct_pte(u64 pte)
> +{
> +	return is_shadow_present_pte(pte) && !is_disconnected_pte(pte);
> +}
> +
>  static int is_large_pte(u64 pte)
>  {
>  	return pte & PT_PAGE_SIZE_MASK;
> @@ -1659,6 +1681,129 @@ static bool __rmap_set_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
>  	return flush;
>  }
>  
> +static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
> +			       u64 old_pte, u64 new_pte, int level);
> +
> +/**
> + * mark_pte_disconnected - Mark a PTE as part of a disconnected PT
> + * @kvm: kvm instance
> + * @as_id: the address space of the paging structure the PTE was a part of
> + * @gfn: the base GFN that was mapped by the PTE
> + * @ptep: a pointer to the PTE to be marked disconnected
> + * @level: the level of the PT this PTE was a part of, when it was part of the
> + *	paging structure
> + */
> +static void mark_pte_disconnected(struct kvm *kvm, int as_id, gfn_t gfn,
> +				  u64 *ptep, int level)
> +{
> +	u64 old_pte;
> +
> +	old_pte = xchg(ptep, DISCONNECTED_PTE);
> +	BUG_ON(old_pte == DISCONNECTED_PTE);
> +
> +	handle_changed_pte(kvm, as_id, gfn, old_pte, DISCONNECTED_PTE, level);
> +}
> +
> +/**
> + * handle_disconnected_pt - Mark a PT as disconnected and handle associated
> + * bookkeeping and freeing
> + * @kvm: kvm instance
> + * @as_id: the address space of the paging structure the PT was a part of
> + * @pt_base_gfn: the base GFN that was mapped by the first PTE in the PT
> + * @pfn: The physical frame number of the disconnected PT page
> + * @level: the level of the PT, when it was part of the paging structure
> + *
> + * Given a pointer to a page table that has been removed from the paging
> + * structure and its level, recursively free child page tables and mark their
> + * entries as disconnected.
> + */
> +static void handle_disconnected_pt(struct kvm *kvm, int as_id,
> +				   gfn_t pt_base_gfn, kvm_pfn_t pfn, int level)
> +{
> +	int i;
> +	gfn_t gfn = pt_base_gfn;
> +	u64 *pt = pfn_to_kaddr(pfn);
> +
> +	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> +		/*
> +		 * Mark the PTE as disconnected so that no other thread will
> +		 * try to map in an entry there or try to free any child page
> +		 * table the entry might have pointed to.
> +		 */
> +		mark_pte_disconnected(kvm, as_id, gfn, &pt[i], level);
> +
> +		gfn += KVM_PAGES_PER_HPAGE(level);
> +	}
> +
> +	free_page((unsigned long)pt);
> +}
> +
> +/**
> + * handle_changed_pte - handle bookkeeping associated with a PTE change
> + * @kvm: kvm instance
> + * @as_id: the address space of the paging structure the PTE was a part of
> + * @gfn: the base GFN that was mapped by the PTE
> + * @old_pte: The value of the PTE before the atomic compare / exchange
> + * @new_pte: The value of the PTE after the atomic compare / exchange
> + * @level: the level of the PT the PTE is part of in the paging structure
> + *
> + * Handle bookkeeping that might result from the modification of a PTE.
> + * This function should be called in the same RCU read critical section as the
> + * atomic cmpxchg on the pte. This function must be called for all direct pte
> + * modifications except those which strictly emulate hardware, for example
> + * setting the dirty bit on a pte.
> + */
> +static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
> +			       u64 old_pte, u64 new_pte, int level)
> +{
> +	bool was_present = is_present_direct_pte(old_pte);
> +	bool is_present = is_present_direct_pte(new_pte);
> +	bool was_leaf = was_present && is_last_spte(old_pte, level);
> +	bool pfn_changed = spte_to_pfn(old_pte) != spte_to_pfn(new_pte);
> +	int child_level;
> +
> +	BUG_ON(level > PT64_ROOT_MAX_LEVEL);
> +	BUG_ON(level < PT_PAGE_TABLE_LEVEL);
> +	BUG_ON(gfn % KVM_PAGES_PER_HPAGE(level));
> +
> +	/*
> +	 * The only times a pte should be changed from a non-present to
> +	 * non-present state is when an entry in an unlinked page table is
> +	 * marked as a disconnected PTE as part of freeing the page table,
> +	 * or an MMIO entry is installed/modified. In these cases there is
> +	 * nothing to do.
> +	 */
> +	if (!was_present && !is_present) {
> +		/*
> +		 * If this change is not on an MMIO PTE and not setting a PTE
> +		 * as disconnected, then it is unexpected. Log the change,
> +		 * though it should not impact the guest since both the former
> +		 * and current PTEs are nonpresent.
> +		 */
> +		WARN_ON((new_pte != DISCONNECTED_PTE) &&
> +			!is_mmio_spte(new_pte));
> +		return;
> +	}
> +
> +	if (was_present && !was_leaf && (pfn_changed || !is_present)) {
> +		/*
> +		 * The level of the page table being freed is one level lower
> +		 * than the level at which it is mapped.
> +		 */
> +		child_level = level - 1;
> +
> +		/*
> +		 * If there was a present non-leaf entry before, and now the
> +		 * entry points elsewhere, the lpage stats and dirty logging /
> +		 * access tracking status for all the entries the old pte
> +		 * pointed to must be updated and the page table pages it
> +		 * pointed to must be freed.
> +		 */
> +		handle_disconnected_pt(kvm, as_id, gfn, spte_to_pfn(old_pte),
> +				       child_level);
> +	}
> +}
> +
>  /**
>   * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
>   * @kvm: kvm instance
> -- 
> 2.23.0.444.g18eeb5a265-goog
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case
  2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
                   ` (28 preceding siblings ...)
  2019-10-17 18:50 ` [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Sean Christopherson
@ 2019-11-27 19:09 ` Sean Christopherson
  2019-12-06 19:55   ` Ben Gardon
  29 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2019-11-27 19:09 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:17:56PM -0700, Ben Gardon wrote:
> The goal of this  RFC is to demonstrate and gather feedback on the
> iterator pattern, the memory savings it enables for the "direct case"
> and the changes to the synchronization model. Though they are interwoven
> in this series, I will separate the iterator from the synchronization
> changes in a future series. I recognize that some feature work will be
> needed to make this patch set ready for merging. That work is detailed
> at the end of this cover letter.

How difficult would it be to send the synchronization changes as a separate
series in the not-too-distant future?  At a brief glance, those changes
appear to be tiny relative to the direct iterator changes.  From a stability
perspective, it would be nice if the locking changes can get upstreamed and
tested in the wild for a few kernel versions before the iterator code is
introduced.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 06/28] kvm: mmu: Replace mmu_lock with a read/write lock
  2019-11-27 18:47   ` Sean Christopherson
@ 2019-12-02 22:45     ` Sean Christopherson
  0 siblings, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2019-12-02 22:45 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Wed, Nov 27, 2019 at 10:47:36AM -0800, Sean Christopherson wrote:
> On Thu, Sep 26, 2019 at 04:18:02PM -0700, Ben Gardon wrote:
> > Replace the KVM MMU spinlock with a read/write lock so that some parts of
> > the MMU can be made more concurrent in future commits by switching some
> > write mode acquisitions to read mode. A read/write lock was chosen over
> > other synchronization options because it has minimal initial impact: this
> > change simply changes all uses of the MMU spin lock to an MMU read/write
> > lock, in write mode. This change has no effect on the logic of the code
> > and only a small performance penalty.
> > 
> > Other, more invasive options were considered for synchronizing access to
> > the paging structures. Sharding the MMU lock to protect 2MB chunks of
> > addresses, as the main MM does, would also work, however it makes
> > acquiring locks for operations on large regions of memory expensive.
> > Further, the parallel page fault handling algorithm introduced later in
> > this series does not require exclusive access to the region of memory
> > for which it is handling a fault.
> > 
> > There are several disadvantages to the read/write lock approach:
> > 1. The reader/writer terminology does not apply well to MMU operations.
> > 2. Many operations require exclusive access to a region of memory
> > (often a memslot), but not all of memory. The read/write lock does not
> > facilitate this.
> > 3. Contention between readers and writers can still create problems in
> > the face of long running MMU operations.
> > 
> > Despite these issues, the use of a read/write lock facilitates
> > substantial improvements over the monolithic locking scheme.
> > 
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu.c         | 106 +++++++++++++++++++------------------
> >  arch/x86/kvm/page_track.c  |   8 +--
> >  arch/x86/kvm/paging_tmpl.h |   8 +--
> >  arch/x86/kvm/x86.c         |   4 +-
> >  include/linux/kvm_host.h   |   3 +-
> >  virt/kvm/kvm_main.c        |  34 ++++++------
> >  6 files changed, 83 insertions(+), 80 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 56587655aecb9..0311d18d9a995 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -2446,9 +2446,9 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
> >  			flush |= kvm_sync_page(vcpu, sp, &invalid_list);
> >  			mmu_pages_clear_parents(&parents);
> >  		}
> > -		if (need_resched() || spin_needbreak(&vcpu->kvm->mmu_lock)) {
> 
> I gather there is no equivalent to spin_needbreak() for r/w locks?  Is it
> something that can be added?  Losing spinlock contention detection will
> negatively impact other flows, e.g. fast zapping all pages will no longer
> drop the lock to allow insertion of SPTEs into the new generation of MMU.

Just saw that fast zap is explicitly noted in the cover letter.  Is there
anything beyond a spin_needbreak() implementation that's needed to support
fast zap?
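
For illustration, a rough sketch of what an rwlock counterpart might look
like, mirroring spin_needbreak() (the name rwlock_needbreak() and the
contention helper it calls are hypothetical here, not an existing API):

static inline int rwlock_needbreak(rwlock_t *lock)
{
#ifdef CONFIG_PREEMPT
	/* rwlock_is_contended() is assumed, analogous to spin_is_contended(). */
	return rwlock_is_contended(lock);
#else
	return 0;
#endif
}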

> > +		if (need_resched()) {
> >  			kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
> > -			cond_resched_lock(&vcpu->kvm->mmu_lock);
> > +			cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
> >  			flush = false;
> >  		}
> >  	}

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 08/28] kvm: mmu: Init / Uninit the direct MMU
  2019-09-26 23:18 ` [RFC PATCH 08/28] kvm: mmu: Init / Uninit the direct MMU Ben Gardon
@ 2019-12-02 23:40   ` Sean Christopherson
  2019-12-06 20:25     ` Ben Gardon
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2019-12-02 23:40 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:18:04PM -0700, Ben Gardon wrote:
> The direct MMU introduces several new fields that need to be initialized
> and torn down. Add functions to do that initialization / cleanup.

Can you briefly explain the basic concepts of the direct MMU?  The cover
letter explains the goals of the direct MMU and the mechanics of how KVM
moves between a shadow MMU and direct MMU, but I didn't see anything that
describes how the direct MMU fundamentally differs from the shadow MMU.

I'm something like 3-4 patches ahead of this one and still don't have a
good idea of the core tenets of the direct MMU.  I might eventually get
there on my own, but a jump start would be appreciated.


On a different topic, have you thrown around any other names besides
"direct MMU"?  I don't necessarily dislike the name, but I don't like it
either, e.g. the @direct flag is also set when IA32 paging is disabled in
the guest.

> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  51 ++++++++----
>  arch/x86/kvm/mmu.c              | 132 +++++++++++++++++++++++++++++---
>  arch/x86/kvm/x86.c              |  16 +++-
>  3 files changed, 169 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 23edf56cf577c..1f8164c577d50 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -236,6 +236,22 @@ enum {
>   */
>  #define KVM_APIC_PV_EOI_PENDING	1
>  
> +#define HF_GIF_MASK		(1 << 0)
> +#define HF_HIF_MASK		(1 << 1)
> +#define HF_VINTR_MASK		(1 << 2)
> +#define HF_NMI_MASK		(1 << 3)
> +#define HF_IRET_MASK		(1 << 4)
> +#define HF_GUEST_MASK		(1 << 5) /* VCPU is in guest-mode */
> +#define HF_SMM_MASK		(1 << 6)
> +#define HF_SMM_INSIDE_NMI_MASK	(1 << 7)
> +
> +#define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
> +#define KVM_ADDRESS_SPACE_NUM 2
> +
> +#define kvm_arch_vcpu_memslots_id(vcpu) \
> +		((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
> +#define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
> +
>  struct kvm_kernel_irq_routing_entry;
>  
>  /*
> @@ -940,6 +956,24 @@ struct kvm_arch {
>  	bool exception_payload_enabled;
>  
>  	struct kvm_pmu_event_filter *pmu_event_filter;
> +
> +	/*
> +	 * Whether the direct MMU is enabled for this VM. This contains a
> +	 * snapshot of the direct MMU module parameter from when the VM was
> +	 * created and remains unchanged for the life of the VM. If this is
> +	 * true, direct MMU handler functions will run for various MMU
> +	 * operations.
> +	 */
> +	bool direct_mmu_enabled;

What's the reasoning behind allowing the module param to be changed after
KVM is loaded?  I haven't looked through all future patches, but I assume
there are optimizations and/or simplifications that can be made if all VMs
are guaranteed to have the same setting?

> +	/*
> +	 * Indicates that the paging structure built by the direct MMU is
> +	 * currently the only one in use. If nesting is used, prompting the
> +	 * creation of shadow page tables for L2, this will be set to false.
> +	 * While this is true, only direct MMU handlers will be run for many
> +	 * MMU functions. Ignored if !direct_mmu_enabled.
> +	 */
> +	bool pure_direct_mmu;

This should be introduced in the same patch that first uses the flag,
without the usage it's impossible to properly review.  E.g. is a dedicated
flag necessary or is it only used in slow paths and so could check for
vmxon?  Is the flag intended to be sticky?  Why is it per-VM and not
per-vCPU?  And so on and so forth.

> +	hpa_t direct_root_hpa[KVM_ADDRESS_SPACE_NUM];
>  };
>  
>  struct kvm_vm_stat {
> @@ -1255,7 +1289,7 @@ void kvm_mmu_module_exit(void);
>  
>  void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
>  int kvm_mmu_create(struct kvm_vcpu *vcpu);
> -void kvm_mmu_init_vm(struct kvm *kvm);
> +int kvm_mmu_init_vm(struct kvm *kvm);
>  void kvm_mmu_uninit_vm(struct kvm *kvm);
>  void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>  		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask,
> @@ -1519,21 +1553,6 @@ enum {
>  	TASK_SWITCH_GATE = 3,
>  };
>  
> -#define HF_GIF_MASK		(1 << 0)
> -#define HF_HIF_MASK		(1 << 1)
> -#define HF_VINTR_MASK		(1 << 2)
> -#define HF_NMI_MASK		(1 << 3)
> -#define HF_IRET_MASK		(1 << 4)
> -#define HF_GUEST_MASK		(1 << 5) /* VCPU is in guest-mode */
> -#define HF_SMM_MASK		(1 << 6)
> -#define HF_SMM_INSIDE_NMI_MASK	(1 << 7)
> -
> -#define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
> -#define KVM_ADDRESS_SPACE_NUM 2
> -
> -#define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
> -#define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
> -
>  asmlinkage void kvm_spurious_fault(void);
>  
>  /*
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 50413f17c7cd0..788edbda02f69 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -47,6 +47,10 @@
>  #include <asm/kvm_page_track.h>
>  #include "trace.h"
>  
> +static bool __read_mostly direct_mmu_enabled;
> +module_param_named(enable_direct_mmu, direct_mmu_enabled, bool,

To match other x86 module params, use "direct_mmu" for the param name and
"enable_direct_mmu" for the variable.

> +		   S_IRUGO | S_IWUSR);

I'd prefer octal perms here.  I'm pretty sure checkpatch complains about
this, and I personally find 0444 and 0644 much more readable.
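
I.e. something like this (sketch only, folding in both nits):

static bool __read_mostly enable_direct_mmu;
module_param_named(direct_mmu, enable_direct_mmu, bool, 0644);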

> +
>  /*
>   * When setting this variable to true it enables Two-Dimensional-Paging
>   * where the hardware walks 2 page tables:
> @@ -3754,27 +3758,56 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
>  	*root_hpa = INVALID_PAGE;
>  }
>  
> +static bool is_direct_mmu_root(struct kvm *kvm, hpa_t root)
> +{
> +	int as_id;
> +
> +	for (as_id = 0; as_id < KVM_ADDRESS_SPACE_NUM; as_id++)
> +		if (root == kvm->arch.direct_root_hpa[as_id])
> +			return true;
> +
> +	return false;
> +}
> +
>  /* roots_to_free must be some combination of the KVM_MMU_ROOT_* flags */
>  void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  			ulong roots_to_free)
>  {
>  	int i;
>  	LIST_HEAD(invalid_list);
> -	bool free_active_root = roots_to_free & KVM_MMU_ROOT_CURRENT;
>  
>  	BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
>  
> -	/* Before acquiring the MMU lock, see if we need to do any real work. */
> -	if (!(free_active_root && VALID_PAGE(mmu->root_hpa))) {
> -		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> -			if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
> -			    VALID_PAGE(mmu->prev_roots[i].hpa))
> -				break;
> +	/*
> +	 * Direct MMU paging structures follow the life of the VM, so instead of
> +	 * destroying direct MMU paging structure root, simply mark the root
> +	 * HPA pointing to it as invalid.
> +	 */
> +	if (vcpu->kvm->arch.direct_mmu_enabled &&
> +	    roots_to_free & KVM_MMU_ROOT_CURRENT &&
> +	    is_direct_mmu_root(vcpu->kvm, mmu->root_hpa))
> +		mmu->root_hpa = INVALID_PAGE;
>  
> -		if (i == KVM_MMU_NUM_PREV_ROOTS)
> -			return;
> +	if (!VALID_PAGE(mmu->root_hpa))
> +		roots_to_free &= ~KVM_MMU_ROOT_CURRENT;
> +
> +	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
> +		if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) {
> +			if (is_direct_mmu_root(vcpu->kvm,
> +					       mmu->prev_roots[i].hpa))
> +				mmu->prev_roots[i].hpa = INVALID_PAGE;
> +			if (!VALID_PAGE(mmu->prev_roots[i].hpa))
> +				roots_to_free &= ~KVM_MMU_ROOT_PREVIOUS(i);
> +		}
>  	}
>  
> +	/*
> +	 * If there are no valid roots that need freeing at this point, avoid
> +	 * acquiring the MMU lock and return.
> +	 */
> +	if (!roots_to_free)
> +		return;
> +
>  	write_lock(&vcpu->kvm->mmu_lock);
>  
>  	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> @@ -3782,7 +3815,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  			mmu_free_root_page(vcpu->kvm, &mmu->prev_roots[i].hpa,
>  					   &invalid_list);
>  
> -	if (free_active_root) {
> +	if (roots_to_free & KVM_MMU_ROOT_CURRENT) {
>  		if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
>  		    (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
>  			mmu_free_root_page(vcpu->kvm, &mmu->root_hpa,
> @@ -3820,7 +3853,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  	struct kvm_mmu_page *sp;
>  	unsigned i;
>  
> -	if (vcpu->arch.mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
> +	if (vcpu->kvm->arch.direct_mmu_enabled) {
> +		// TODO: Support 5 level paging in the direct MMU
> +		BUG_ON(vcpu->arch.mmu->shadow_root_level > PT64_ROOT_4LEVEL);
> +		vcpu->arch.mmu->root_hpa = vcpu->kvm->arch.direct_root_hpa[
> +			kvm_arch_vcpu_memslots_id(vcpu)];
> +	} else if (vcpu->arch.mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
>  		write_lock(&vcpu->kvm->mmu_lock);
>  		if(make_mmu_pages_available(vcpu) < 0) {
>  			write_unlock(&vcpu->kvm->mmu_lock);
> @@ -3863,6 +3901,10 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	gfn_t root_gfn, root_cr3;
>  	int i;
>  
> +	write_lock(&vcpu->kvm->mmu_lock);
> +	vcpu->kvm->arch.pure_direct_mmu = false;
> +	write_unlock(&vcpu->kvm->mmu_lock);
> +
>  	root_cr3 = vcpu->arch.mmu->get_cr3(vcpu);
>  	root_gfn = root_cr3 >> PAGE_SHIFT;
>  
> @@ -5710,6 +5752,64 @@ void kvm_disable_tdp(void)
>  }
>  EXPORT_SYMBOL_GPL(kvm_disable_tdp);
>  
> +static bool is_direct_mmu_enabled(void)
> +{
> +	if (!READ_ONCE(direct_mmu_enabled))
> +		return false;
> +
> +	if (WARN_ONCE(!tdp_enabled,
> +		      "Creating a VM with direct MMU enabled requires TDP."))
> +		return false;

User-induced WARNs are bad; direct_mmu_enabled must be forced to zero in
kvm_disable_tdp().  Unless there's a good reason for direct_mmu_enabled to
remain writable at runtime, making it read-only will eliminate that case.
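
E.g. a minimal sketch, assuming kvm_disable_tdp() keeps its current shape
and the param stays writable:

void kvm_disable_tdp(void)
{
	tdp_enabled = false;

	/* The direct MMU requires TDP, so force it off as well. */
	direct_mmu_enabled = false;
}
EXPORT_SYMBOL_GPL(kvm_disable_tdp);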

> +	return true;
> +}
> +
> +static int kvm_mmu_init_direct_mmu(struct kvm *kvm)
> +{
> +	struct page *page;
> +	int i;
> +
> +	if (!is_direct_mmu_enabled())
> +		return 0;
> +
> +	/*
> +	 * Allocate the direct MMU root pages. These pages follow the life of
> +	 * the VM.
> +	 */
> +	for (i = 0; i < ARRAY_SIZE(kvm->arch.direct_root_hpa); i++) {
> +		page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> +		if (!page)
> +			goto err;
> +		kvm->arch.direct_root_hpa[i] = page_to_phys(page);
> +	}
> +
> +	/* This should not be changed for the lifetime of the VM. */
> +	kvm->arch.direct_mmu_enabled = true;
> +
> +	kvm->arch.pure_direct_mmu = true;
> +	return 0;
> +err:
> +	for (i = 0; i < ARRAY_SIZE(kvm->arch.direct_root_hpa); i++) {
> +		if (kvm->arch.direct_root_hpa[i] &&
> +		    VALID_PAGE(kvm->arch.direct_root_hpa[i]))
> +			free_page((unsigned long)kvm->arch.direct_root_hpa[i]);
> +		kvm->arch.direct_root_hpa[i] = INVALID_PAGE;
> +	}
> +	return -ENOMEM;
> +}
> +

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 10/28] kvm: mmu: Flush TLBs before freeing direct MMU page table memory
  2019-09-26 23:18 ` [RFC PATCH 10/28] kvm: mmu: Flush TLBs before freeing direct MMU page table memory Ben Gardon
@ 2019-12-02 23:46   ` Sean Christopherson
  2019-12-06 20:31     ` Ben Gardon
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2019-12-02 23:46 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:18:06PM -0700, Ben Gardon wrote:
> If page table memory is freed before a TLB flush, it can result in
> improper guest access to memory through paging structure caches.
> Specifically, until a TLB flush, memory that was part of the paging
> structure could be used by the hardware for address translation if a
> partial walk leading to it is stored in the paging structure cache. Ensure
> that there is a TLB flush before page table memory is freed by
> transferring disconnected pages to a disconnected list, and on a flush
> transferring a snapshot of the disconnected list to a free list. The free
> list is processed asynchronously to avoid slowing TLB flushes.

Tangentially related to TLB flushing, what generations of CPUs have you
tested this on?  I don't have any specific concerns, but ideally it'd be
nice to get testing cycles on older hardware before merging.  Thankfully
TDP-only eliminates ridiculously old hardware :-)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 11/28] kvm: mmu: Optimize for freeing direct MMU PTs on teardown
  2019-09-26 23:18 ` [RFC PATCH 11/28] kvm: mmu: Optimize for freeing direct MMU PTs on teardown Ben Gardon
@ 2019-12-02 23:54   ` Sean Christopherson
  0 siblings, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2019-12-02 23:54 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:18:07PM -0700, Ben Gardon wrote:
> Waiting for a TLB flush and an RCU grace period before freeing page table
> memory grants safety in steady-state operation; however, these
> protections are not always necessary. On VM teardown, only one thread is
> operating on the paging structures and no vCPUs are running. As a result
> a fast path can be added to the disconnected page table handler which
> frees the memory immediately. Add the fast path and use it when tearing
> down VMs.
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---

...

> @@ -1849,13 +1863,20 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
>  		 * try to map in an entry there or try to free any child page
>  		 * table the entry might have pointed to.
>  		 */
> -		mark_pte_disconnected(kvm, as_id, gfn, &pt[i], level);
> +		mark_pte_disconnected(kvm, as_id, gfn, &pt[i], level,
> +				      vm_teardown);
>  
>  		gfn += KVM_PAGES_PER_HPAGE(level);
>  	}
>  
> -	page = pfn_to_page(pfn);
> -	direct_mmu_disconnected_pt_list_add(kvm, page);
> +	if (vm_teardown) {
> +		BUG_ON(atomic_read(&kvm->online_vcpus) != 0);

BUG() isn't justified here, e.g.

	if (vm_teardown && !WARN_ON_ONCE(atomic_read(&kvm->online_vcpus)))

> +		cond_resched();
> +		free_page((unsigned long)pt);
> +	} else {
> +		page = pfn_to_page(pfn);
> +		direct_mmu_disconnected_pt_list_add(kvm, page);
> +	}
>  }
>  
>  /**
> @@ -1866,6 +1887,8 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
>   * @old_pte: The value of the PTE before the atomic compare / exchange
>   * @new_pte: The value of the PTE after the atomic compare / exchange
>   * @level: the level of the PT the PTE is part of in the paging structure
> + * @vm_teardown: all vCPUs are paused and the VM is being torn down. Yield and
> + *	free child page table memory immediately.
>   *
>   * Handle bookkeeping that might result from the modification of a PTE.
>   * This function should be called in the same RCU read critical section as the
> @@ -1874,7 +1897,8 @@ static void handle_disconnected_pt(struct kvm *kvm, int as_id,
>   * setting the dirty bit on a pte.
>   */
>  static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
> -			       u64 old_pte, u64 new_pte, int level)
> +			       u64 old_pte, u64 new_pte, int level,
> +			       bool vm_teardown)
>  {
>  	bool was_present = is_present_direct_pte(old_pte);
>  	bool is_present = is_present_direct_pte(new_pte);
> @@ -1920,7 +1944,7 @@ static void handle_changed_pte(struct kvm *kvm, int as_id, gfn_t gfn,
>  		 * pointed to must be freed.
>  		 */
>  		handle_disconnected_pt(kvm, as_id, gfn, spte_to_pfn(old_pte),
> -				       child_level);
> +				       child_level, vm_teardown);
>  	}
>  }
>  
> @@ -5932,7 +5956,7 @@ static void kvm_mmu_uninit_direct_mmu(struct kvm *kvm)
>  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
>  		handle_disconnected_pt(kvm, i, 0,
>  			(kvm_pfn_t)(kvm->arch.direct_root_hpa[i] >> PAGE_SHIFT),
> -			PT64_ROOT_4LEVEL);
> +			PT64_ROOT_4LEVEL, true);
>  }
>  
>  /* The return value indicates if tlb flush on all vcpus is needed. */
> -- 
> 2.23.0.444.g18eeb5a265-goog
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 12/28] kvm: mmu: Set tlbs_dirty atomically
  2019-09-26 23:18 ` [RFC PATCH 12/28] kvm: mmu: Set tlbs_dirty atomically Ben Gardon
@ 2019-12-03  0:13   ` Sean Christopherson
  0 siblings, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2019-12-03  0:13 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:18:08PM -0700, Ben Gardon wrote:
> The tlbs_dirty mechanism for deferring flushes can be expanded beyond
> its current use case. This allows MMU operations which do not
> themselves require TLB flushes to notify other threads that there are
> unflushed modifications to the paging structure. In order to use this
> mechanism concurrently, the updates to the global tlbs_dirty must be
> made atomically.

If there is a hard requirement that tlbs_dirty must be updated atomically
then it needs to be an actual atomic so that the requirement is enforced.
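
E.g. a rough sketch of the enforced version, assuming tlbs_dirty in struct
kvm is converted to atomic_long_t (which is not what this patch does):

	/* include/linux/kvm_host.h */
	atomic_long_t tlbs_dirty;

	/* FNAME(sync_page)() */
	atomic_long_add(tlbs_dirty, &vcpu->kvm->tlbs_dirty);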
 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/paging_tmpl.h | 29 +++++++++++++----------------
>  1 file changed, 13 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 97903c8dcad16..cc3630c8bd3ea 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -986,6 +986,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  	bool host_writable;
>  	gpa_t first_pte_gpa;
>  	int set_spte_ret = 0;
> +	int ret;
> +	int tlbs_dirty = 0;
>  
>  	/* direct kvm_mmu_page can not be unsync. */
>  	BUG_ON(sp->role.direct);
> @@ -1004,17 +1006,13 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
>  
>  		if (kvm_vcpu_read_guest_atomic(vcpu, pte_gpa, &gpte,
> -					       sizeof(pt_element_t)))
> -			return 0;
> +					       sizeof(pt_element_t))) {
> +			ret = 0;
> +			goto out;
> +		}
>  
>  		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
> -			/*
> -			 * Update spte before increasing tlbs_dirty to make
> -			 * sure no tlb flush is lost after spte is zapped; see
> -			 * the comments in kvm_flush_remote_tlbs().
> -			 */
> -			smp_wmb();
> -			vcpu->kvm->tlbs_dirty++;
> +			tlbs_dirty++;
>  			continue;
>  		}
>  
> @@ -1029,12 +1027,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  
>  		if (gfn != sp->gfns[i]) {
>  			drop_spte(vcpu->kvm, &sp->spt[i]);
> -			/*
> -			 * The same as above where we are doing
> -			 * prefetch_invalid_gpte().
> -			 */
> -			smp_wmb();
> -			vcpu->kvm->tlbs_dirty++;
> +			tlbs_dirty++;
>  			continue;
>  		}
>  
> @@ -1051,7 +1044,11 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  	if (set_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH)
>  		kvm_flush_remote_tlbs(vcpu->kvm);
>  
> -	return nr_present;
> +	ret = nr_present;
> +
> +out:
> +	xadd(&vcpu->kvm->tlbs_dirty, tlbs_dirty);

Collecting and applying vcpu->kvm->tlbs_dirty updates at the end versus
updating on the fly is a functional change beyond updating tlbs_dirty
atomically.  At a glance, I have no idea whether or not it affects anything
and if so, whether it's correct, i.e. there needs to be an explanation of
why it's safe to combine things into a single update.

> +	return ret;
>  }
>  
>  #undef pt_element_t
> -- 
> 2.23.0.444.g18eeb5a265-goog
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 13/28] kvm: mmu: Add an iterator for concurrent paging structure walks
  2019-09-26 23:18 ` [RFC PATCH 13/28] kvm: mmu: Add an iterator for concurrent paging structure walks Ben Gardon
@ 2019-12-03  2:15   ` Sean Christopherson
  2019-12-18 18:25     ` Ben Gardon
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2019-12-03  2:15 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:18:09PM -0700, Ben Gardon wrote:
> Add a utility for concurrent paging structure traversals. This iterator
> uses several mechanisms to ensure that its accesses to paging structure
> memory are safe, and that memory can be freed safely in the face of
> lockless access. The purpose of the iterator is to create a unified
> pattern for concurrent paging structure traversals and simplify the
> implementation of other MMU functions.
> 
> This iterator implements a pre-order traversal of PTEs for a given GFN
> range within a given address space. The iterator abstracts away
> bookkeeping on successful changes to PTEs, retrying on failed PTE
> modifications, TLB flushing, and yielding during long operations.
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu.c      | 455 ++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmutrace.h |  50 +++++
>  2 files changed, 505 insertions(+)

...

> +/*
> + * Sets a direct walk iterator to seek the gfn range [start, end).
> + * If end is greater than the maximum possible GFN, it will be changed to the
> + * maximum possible gfn + 1. (Note that start/end is an inclusive/exclusive
> + * range, so the last gfn to be iterated over would be the largest possible
> + * GFN, in this scenario.)
> + */
> +__attribute__((unused))
> +static void direct_walk_iterator_setup_walk(struct direct_walk_iterator *iter,
> +	struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
> +	enum mmu_lock_mode lock_mode)

Echoing earlier patches, please introduce variables/flags/functions along
with their users.  I have a feeling you're adding some of the unused
functions so that all flags/variables in struct direct_walk_iterator can
be in place from the get-go, but that actually makes everything much harder
to review.

> +{
> +	BUG_ON(!kvm->arch.direct_mmu_enabled);
> +	BUG_ON((lock_mode & MMU_WRITE_LOCK) && (lock_mode & MMU_READ_LOCK));
> +	BUG_ON(as_id < 0);
> +	BUG_ON(as_id >= KVM_ADDRESS_SPACE_NUM);
> +	BUG_ON(!VALID_PAGE(kvm->arch.direct_root_hpa[as_id]));
> +
> +	/* End cannot be greater than the maximum possible gfn. */
> +	end = min(end, 1ULL << (PT64_ROOT_4LEVEL * PT64_PT_BITS));
> +
> +	iter->as_id = as_id;
> +	iter->pt_path[PT64_ROOT_4LEVEL - 1] =
> +			(u64 *)__va(kvm->arch.direct_root_hpa[as_id]);
> +
> +	iter->walk_start = start;
> +	iter->walk_end = end;
> +	iter->target_gfn = start;
> +
> +	iter->lock_mode = lock_mode;
> +	iter->kvm = kvm;
> +	iter->tlbs_dirty = 0;
> +
> +	direct_walk_iterator_start_traversal(iter);
> +}

...

> +static void direct_walk_iterator_cond_resched(struct direct_walk_iterator *iter)
> +{
> +	if (!(iter->lock_mode & MMU_LOCK_MAY_RESCHED) || !need_resched())
> +		return;
> +
> +	direct_walk_iterator_prepare_cond_resched(iter);
> +	cond_resched();
> +	direct_walk_iterator_finish_cond_resched(iter);
> +}
> +
> +static bool direct_walk_iterator_next_pte(struct direct_walk_iterator *iter)
> +{
> +	/*
> +	 * This iterator could be iterating over a large number of PTEs, such
> +	 * that if this thread did not yield, it would cause scheduler
> +	 * problems. To avoid this, yield if needed. Note the check on
> +	 * MMU_LOCK_MAY_RESCHED in direct_walk_iterator_cond_resched. This
> +	 * iterator will not yield unless that flag is set in its lock_mode.
> +	 */
> +	direct_walk_iterator_cond_resched(iter);

This looks very fragile, e.g. one of the future patches even has to avoid
problems with this code by limiting the number of PTEs it processes.

> +
> +	while (true) {
> +		if (!direct_walk_iterator_next_pte_raw(iter))

Implicitly initializing the iterator during next_pte_raw() is asking for
problems, e.g. @walk_in_progress should not exist.  The standard kernel
pattern for fancy iterators is to wrap the initialization, deref, and
advancement operators in a macro, e.g. something like:

	for_each_direct_pte(...) {

	}

That might require additional control flow logic in the users of the
iterator, but if so that's probably a good thing in terms of readability
and robustness.  E.g. verifying that rcu_read_unlock() is guaranteed to
be called is extremely difficult as rcu_read_lock() is buried in this
low level helper but the iterator relies on the top-level caller to
terminate traversal.

See mem_cgroup_iter_break() for one example of handling an iter walk
where an action needs to be taken when the walk terminates early.

> +			return false;
> +
> +		direct_walk_iterator_recalculate_output_fields(iter);
> +		if (iter->old_pte != DISCONNECTED_PTE)
> +			break;
> +
> +		/*
> +		 * The iterator has encountered a disconnected pte, so it is in
> +		 * a page that has been disconnected from the root. Restart the
> +		 * traversal from the root in this case.
> +		 */
> +		direct_walk_iterator_reset_traversal(iter);

I understand wanting to hide details to eliminate copy-paste, but this
goes too far and makes it too difficult to understand the flow of the
top-level walks.  Ditto for burying retry_pte() in set_pte().  I'd say it
also applies to skip_step_down(), but AFAICT that's dead code.

Off-topic for a second, the super long direct_walk_iterator_... names
make me want to simply call this new MMU the "tdp MMU" and just live with
the discrepancy until the old shadow-based TDP MMU can be nuked.  Then we
could have tdp_iter_blah_blah_blah(), for_each_tdp_present_pte(), etc...

Back to the iterator, I think it can be massaged into a standard for loop
approach without polluting the top level walkers much.  The below code is
the basic idea, e.g. the macros won't compile, probably doesn't terminate
the walk correctly, rescheduling is missing, etc...

Note, open coding the down/sideways/up helpers is 50% personal preference,
50% because gfn_start and gfn_end are now local variables, and 50% because
it was the easiest way to learn the code.  I wouldn't argue too much about
having one or more of the helpers.


static void tdp_iter_break(struct tdp_iter *iter)
{
	/* TLB flush, RCU unlock, etc... */
}

static void tdp_iter_next(struct tdp_iter *iter, bool *retry)
{
	gfn_t gfn_start, gfn_end;
	u64 *child_pt;

	if (*retry) {
		*retry = false;
		return;
	}

	/*
	 * Reread the pte before stepping down to avoid traversing into page
	 * tables that are no longer linked from this entry. This is not
	 * needed for correctness - just a small optimization.
	 */
	iter->old_pte = READ_ONCE(*iter->ptep);

	/* Try to step down. */
	child_pt = pte_to_child_pt(iter->old_pte, iter->level);
	if (child_pt) {
		child_pt = rcu_dereference(child_pt);
		iter->level--;
		iter->pt_path[iter->level - 1] = child_pt;
		return;
	}

step_sideways:
	/* Try to step sideways. */
	gfn_start = ALIGN_DOWN(iter->target_gfn,
			       KVM_PAGES_PER_HPAGE(iter->level));
	gfn_end = gfn_start + KVM_PAGES_PER_HPAGE(iter->level);

	/*
	 * If the current gfn maps past the target gfn range, the next entry in
	 * the current page table will be outside the target range.
	 */
	if (gfn_end >= iter->walk_end ||
	    !(gfn_end % KVM_PAGES_PER_HPAGE(iter->level + 1))) {
		/* Try to step up. */
		iter->level++;

		if (iter->level > PT64_ROOT_4LEVEL) {
			/* This is ugly, there's probably a better solution. */
			tdp_iter_break(iter);
			return;
		}
		goto step_sideways;
	}

	iter->target_gfn = gfn_end;
	iter->ptep = iter->pt_path[iter->level - 1] +
			PT64_INDEX(iter->target_gfn << PAGE_SHIFT, iter->level);
	iter->old_pte = READ_ONCE(*iter->ptep);
}

#define for_each_tdp_pte(iter, start, end, retry)
	for (tdp_iter_start(&iter, start, end);
	     iter->level <= PT64_ROOT_4LEVEL;
	     tdp_iter_next(&iter, &retry))

#define for_each_tdp_present_pte(iter, start, end, retry)
	for_each_tdp_pte(iter, start, end, retry)
		if (!is_present_direct_pte(iter->old_pte)) {

		} else

#define for_each_tdp_present_leaf_pte(iter, start, end, retry)
	for_each_tdp_pte(iter, start, end, retry)
		if (!is_present_direct_pte(iter->old_pte) ||
		    !is_last_spte(iter->old_pte, iter->level))
		{

		} else

/*
 * Marks the range of gfns, [start, end), non-present.
 */
static bool zap_direct_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
				 gfn_t end, enum mmu_lock_mode lock_mode)
{
	struct direct_walk_iterator iter;
	bool retry;

	tdp_iter_init(&iter, kvm, as_id, lock_mode);

restart:
	retry = false;
	for_each_tdp_present_pte(iter, start, end, retry) {
		if (tdp_iter_set_pte(&iter, 0))
			retry = true;

		if (tdp_iter_disconnected(&iter)) {
			tdp_iter_break(&iter);
			goto restart;
		}
	}
}


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case
  2019-11-27 19:09 ` Sean Christopherson
@ 2019-12-06 19:55   ` Ben Gardon
  2019-12-06 19:57     ` Sean Christopherson
  0 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-12-06 19:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

I'm finally back in the office. Sorry for not getting back to you sooner.
I don't think it would be easy to send the synchronization changes
first. The reason they seem so small is that they're all handled by
the iterator. If we tried to put the synchronization changes in
without the iterator we'd have to 1.) deal with struct kvm_mmu_pages,
2.) deal with the rmap, and 3.) change a huge amount of code to insert
the synchronization changes into the existing framework. The changes
wouldn't be mechanical or easy to insert either since a lot of
bookkeeping is currently done before PTEs are updated, with no
facility for rolling back the bookkeeping on PTE cmpxchg failure. We
could start with the iterator changes and then do the synchronization
changes, but the other way around would be very difficult.


On Wed, Nov 27, 2019 at 11:09 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, Sep 26, 2019 at 04:17:56PM -0700, Ben Gardon wrote:
> > The goal of this  RFC is to demonstrate and gather feedback on the
> > iterator pattern, the memory savings it enables for the "direct case"
> > and the changes to the synchronization model. Though they are interwoven
> > in this series, I will separate the iterator from the synchronization
> > changes in a future series. I recognize that some feature work will be
> > needed to make this patch set ready for merging. That work is detailed
> > at the end of this cover letter.
>
> How difficult would it be to send the synchronization changes as a separate
> series in the not-too-distant future?  At a brief glance, those changes
> appear to be tiny relative to the direct iterator changes.  From a stability
> perspective, it would be nice if the locking changes can get upstreamed and
> tested in the wild for a few kernel versions before the iterator code is
> introduced.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case
  2019-12-06 19:55   ` Ben Gardon
@ 2019-12-06 19:57     ` Sean Christopherson
  2019-12-06 20:42       ` Ben Gardon
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2019-12-06 19:57 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Fri, Dec 06, 2019 at 11:55:42AM -0800, Ben Gardon wrote:
> I'm finally back in the office. Sorry for not getting back to you sooner.
> I don't think it would be easy to send the synchronization changes
> first. The reason they seem so small is that they're all handled by
> the iterator. If we tried to put the synchronization changes in
> without the iterator we'd have to 1.) deal with struct kvm_mmu_pages,
> 2.) deal with the rmap, and 3.) change a huge amount of code to insert
> the synchronization changes into the existing framework. The changes
> wouldn't be mechanical or easy to insert either since a lot of
> bookkeeping is currently done before PTEs are updated, with no
> facility for rolling back the bookkeeping on PTE cmpxchg failure. We
> could start with the iterator changes and then do the synchronization
> changes, but the other way around would be very difficult.

By synchronization changes, I meant switching to a r/w lock instead of a
straight spinlock.  Is that doable in a smallish series?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 04/28] kvm: mmu: Update the lpages stat atomically
  2019-11-27 18:39   ` Sean Christopherson
@ 2019-12-06 20:10     ` Ben Gardon
  0 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-12-06 20:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

I would definitely support changing all the entries in KVM stat to be
64 bit and making some of them atomic64_t. I agree that doing atomic
operations on int64s is fragile.
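
E.g. the lpages updates in this patch would then become (sketch, assuming
lpages is switched to atomic64_t in struct kvm_vm_stat):

	/* __drop_large_spte() / mmu_page_zap_pte() */
	atomic64_dec(&kvm->stat.lpages);

	/* mmu_set_spte() */
	atomic64_inc(&vcpu->kvm->stat.lpages);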

On Wed, Nov 27, 2019 at 10:39 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, Sep 26, 2019 at 04:18:00PM -0700, Ben Gardon wrote:
> > In order to pave the way for more concurrent MMU operations, updates to
> > VM-global stats need to be done atomically. Change updates to the lpages
> > stat to be atomic in preparation for the introduction of parallel page
> > fault handling.
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 1ecd6d51c0ee0..56587655aecb9 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -1532,7 +1532,7 @@ static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
> >               WARN_ON(page_header(__pa(sptep))->role.level ==
> >                       PT_PAGE_TABLE_LEVEL);
> >               drop_spte(kvm, sptep);
> > -             --kvm->stat.lpages;
> > +             xadd(&kvm->stat.lpages, -1);
>
> Manually doing atomic operations without converting the variable itself to
> an atomic feels like a hack, e.g. lacks the compile time checks provided
> by the atomics framework.
>
> Tangentially related, should the members of struct kvm_vm_stat be forced
> to 64-bit variables to avoid theoretical wrapping on 32-bit KVM?
>
> >               return true;
> >       }
> >
> > @@ -2676,7 +2676,7 @@ static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
> >               if (is_last_spte(pte, sp->role.level)) {
> >                       drop_spte(kvm, spte);
> >                       if (is_large_pte(pte))
> > -                             --kvm->stat.lpages;
> > +                             xadd(&kvm->stat.lpages, -1);
> >               } else {
> >                       child = page_header(pte & PT64_BASE_ADDR_MASK);
> >                       drop_parent_pte(child, spte);
> > @@ -3134,7 +3134,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access,
> >       pgprintk("%s: setting spte %llx\n", __func__, *sptep);
> >       trace_kvm_mmu_set_spte(level, gfn, sptep);
> >       if (!was_rmapped && is_large_pte(*sptep))
> > -             ++vcpu->kvm->stat.lpages;
> > +             xadd(&vcpu->kvm->stat.lpages, 1);
> >
> >       if (is_shadow_present_pte(*sptep)) {
> >               if (!was_rmapped) {
> > --
> > 2.23.0.444.g18eeb5a265-goog
> >

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 05/28] sched: Add cond_resched_rwlock
  2019-11-27 18:42   ` Sean Christopherson
@ 2019-12-06 20:12     ` Ben Gardon
  0 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-12-06 20:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

Lock contention should definitely be considered. It was an oversight
on my part to not have a check for that.
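
Something along these lines for the next revision, maybe (sketch, split
into a write-only variant to keep it short; rwlock_needbreak() is the
hypothetical counterpart to spin_needbreak() discussed earlier in the
thread):

int __cond_resched_rwlock_write(rwlock_t *lock)
{
	int ret = 0;

	lockdep_assert_held(lock);
	if (should_resched(PREEMPT_LOCK_OFFSET) || rwlock_needbreak(lock)) {
		write_unlock(lock);
		preempt_schedule_common();
		write_lock(lock);
		ret = 1;
	}

	return ret;
}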

On Wed, Nov 27, 2019 at 10:42 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, Sep 26, 2019 at 04:18:01PM -0700, Ben Gardon wrote:
> > Rescheduling while holding a spin lock is essential for keeping long
> > running kernel operations running smoothly. Add the facility to
> > cond_resched read/write spin locks.
> >
> > RFC_NOTE: The current implementation of this patch set uses a read/write
> > lock to replace the existing MMU spin lock. See the next patch in this
> > series for more on why a read/write lock was chosen, and possible
> > alternatives.
>
> This definitely needs to be run by the sched/locking folks sooner rather
> than later.
>
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  include/linux/sched.h | 11 +++++++++++
> >  kernel/sched/core.c   | 23 +++++++++++++++++++++++
> >  2 files changed, 34 insertions(+)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 70db597d6fd4f..4d1fd96693d9b 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1767,12 +1767,23 @@ static inline int _cond_resched(void) { return 0; }
> >  })
> >
> >  extern int __cond_resched_lock(spinlock_t *lock);
> > +extern int __cond_resched_rwlock(rwlock_t *lock, bool write_lock);
> >
> >  #define cond_resched_lock(lock) ({                           \
> >       ___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\
> >       __cond_resched_lock(lock);                              \
> >  })
> >
> > +#define cond_resched_rwlock_read(lock) ({                    \
> > +     __might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET); \
> > +     __cond_resched_rwlock(lock, false);                     \
> > +})
> > +
> > +#define cond_resched_rwlock_write(lock) ({                   \
> > +     __might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET); \
> > +     __cond_resched_rwlock(lock, true);                      \
> > +})
> > +
> >  static inline void cond_resched_rcu(void)
> >  {
> >  #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f9a1346a5fa95..ba7ed4bed5036 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5663,6 +5663,29 @@ int __cond_resched_lock(spinlock_t *lock)
> >  }
> >  EXPORT_SYMBOL(__cond_resched_lock);
> >
> > +int __cond_resched_rwlock(rwlock_t *lock, bool write_lock)
> > +{
> > +     int ret = 0;
> > +
> > +     lockdep_assert_held(lock);
> > +     if (should_resched(PREEMPT_LOCK_OFFSET)) {
> > +             if (write_lock) {
>
> The existing __cond_resched_lock() checks for resched *or* lock
> contention.  Is lock contention not something that needs (or can't?) be
> considered?
>
> > +                     write_unlock(lock);
> > +                     preempt_schedule_common();
> > +                     write_lock(lock);
> > +             } else {
> > +                     read_unlock(lock);
> > +                     preempt_schedule_common();
> > +                     read_lock(lock);
> > +             }
> > +
> > +             ret = 1;
> > +     }
> > +
> > +     return ret;
> > +}
> > +EXPORT_SYMBOL(__cond_resched_rwlock);
> > +
> >  /**
> >   * yield - yield the current processor to other threads.
> >   *
> > --
> > 2.23.0.444.g18eeb5a265-goog
> >

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 08/28] kvm: mmu: Init / Uninit the direct MMU
  2019-12-02 23:40   ` Sean Christopherson
@ 2019-12-06 20:25     ` Ben Gardon
  0 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-12-06 20:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

>On a different topic, have you thrown around any other names besides
>"direct MMU"?
I think direct MMU is a bad name. It's always been intended as a
temporary name, since we planned to generalize the "direct MMU" to work
for nested TDP as well at some point. I'd prefer to eventually call it
the TDP MMU, but right now I guess it would be most correct to call it
an L1 TDP MMU or a TDP MMU for running L1 guests.

> What's the reasoning behind allowing the module param to be changed after
> KVM is loaded?  I haven't looked through all future patches, but I assume
> there are optimizations and/or simplifications that can be made if all VMs
> are guaranteed to have the same setting?
There are no optimizations if all VMs have the same setting. The
module parameter just exists for debugging and as a way to turn off
the "direct MMU" without a reboot, if it started causing problems. I
don't expect the module parameter to stick around in the version of
this code that's ultimately merged.



On Mon, Dec 2, 2019 at 3:41 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, Sep 26, 2019 at 04:18:04PM -0700, Ben Gardon wrote:
> > The direct MMU introduces several new fields that need to be initialized
> > and torn down. Add functions to do that initialization / cleanup.
>
> Can you briefly explain the basic concepts of the direct MMU?  The cover
> letter explains the goals of the direct MMU and the mechanics of how KVM
> moves between a shadow MMU and direct MMU, but I didn't see anything that
> describes how the direct MMU fundamentally differs from the shadow MMU.
>
> I'm something like 3-4 patches ahead of this one and still don't have a
> good idea of the core tenets of the direct MMU.  I might eventually get
> there on my own, but a jump start would be appreciated.
>
>
> On a different topic, have you thrown around any other names besides
> "direct MMU"?  I don't necessarily dislike the name, but I don't like it
> either, e.g. the @direct flag is also set when IA32 paging is disabled in
> the guest.
>
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  51 ++++++++----
> >  arch/x86/kvm/mmu.c              | 132 +++++++++++++++++++++++++++++---
> >  arch/x86/kvm/x86.c              |  16 +++-
> >  3 files changed, 169 insertions(+), 30 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 23edf56cf577c..1f8164c577d50 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -236,6 +236,22 @@ enum {
> >   */
> >  #define KVM_APIC_PV_EOI_PENDING      1
> >
> > +#define HF_GIF_MASK          (1 << 0)
> > +#define HF_HIF_MASK          (1 << 1)
> > +#define HF_VINTR_MASK                (1 << 2)
> > +#define HF_NMI_MASK          (1 << 3)
> > +#define HF_IRET_MASK         (1 << 4)
> > +#define HF_GUEST_MASK                (1 << 5) /* VCPU is in guest-mode */
> > +#define HF_SMM_MASK          (1 << 6)
> > +#define HF_SMM_INSIDE_NMI_MASK       (1 << 7)
> > +
> > +#define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
> > +#define KVM_ADDRESS_SPACE_NUM 2
> > +
> > +#define kvm_arch_vcpu_memslots_id(vcpu) \
> > +             ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
> > +#define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
> > +
> >  struct kvm_kernel_irq_routing_entry;
> >
> >  /*
> > @@ -940,6 +956,24 @@ struct kvm_arch {
> >       bool exception_payload_enabled;
> >
> >       struct kvm_pmu_event_filter *pmu_event_filter;
> > +
> > +     /*
> > +      * Whether the direct MMU is enabled for this VM. This contains a
> > +      * snapshot of the direct MMU module parameter from when the VM was
> > +      * created and remains unchanged for the life of the VM. If this is
> > +      * true, direct MMU handler functions will run for various MMU
> > +      * operations.
> > +      */
> > +     bool direct_mmu_enabled;
>
> What's the reasoning behind allowing the module param to be changed after
> KVM is loaded?  I haven't looked through all future patches, but I assume
> there are optimizations and/or simplifications that can be made if all VMs
> are guaranteed to have the same setting?
>
> > +     /*
> > +      * Indicates that the paging structure built by the direct MMU is
> > +      * currently the only one in use. If nesting is used, prompting the
> > +      * creation of shadow page tables for L2, this will be set to false.
> > +      * While this is true, only direct MMU handlers will be run for many
> > +      * MMU functions. Ignored if !direct_mmu_enabled.
> > +      */
> > +     bool pure_direct_mmu;
>
> This should be introduced in the same patch that first uses the flag,
> without the usage it's impossible to properly review.  E.g. is a dedicated
> flag necessary or is it only used in slow paths and so could check for
> vmxon?  Is the flag intended to be sticky?  Why is it per-VM and not
> per-vCPU?  And so on and so forth.
>
> > +     hpa_t direct_root_hpa[KVM_ADDRESS_SPACE_NUM];
> >  };
> >
> >  struct kvm_vm_stat {
> > @@ -1255,7 +1289,7 @@ void kvm_mmu_module_exit(void);
> >
> >  void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
> >  int kvm_mmu_create(struct kvm_vcpu *vcpu);
> > -void kvm_mmu_init_vm(struct kvm *kvm);
> > +int kvm_mmu_init_vm(struct kvm *kvm);
> >  void kvm_mmu_uninit_vm(struct kvm *kvm);
> >  void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
> >               u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask,
> > @@ -1519,21 +1553,6 @@ enum {
> >       TASK_SWITCH_GATE = 3,
> >  };
> >
> > -#define HF_GIF_MASK          (1 << 0)
> > -#define HF_HIF_MASK          (1 << 1)
> > -#define HF_VINTR_MASK                (1 << 2)
> > -#define HF_NMI_MASK          (1 << 3)
> > -#define HF_IRET_MASK         (1 << 4)
> > -#define HF_GUEST_MASK                (1 << 5) /* VCPU is in guest-mode */
> > -#define HF_SMM_MASK          (1 << 6)
> > -#define HF_SMM_INSIDE_NMI_MASK       (1 << 7)
> > -
> > -#define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
> > -#define KVM_ADDRESS_SPACE_NUM 2
> > -
> > -#define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
> > -#define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
> > -
> >  asmlinkage void kvm_spurious_fault(void);
> >
> >  /*
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 50413f17c7cd0..788edbda02f69 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -47,6 +47,10 @@
> >  #include <asm/kvm_page_track.h>
> >  #include "trace.h"
> >
> > +static bool __read_mostly direct_mmu_enabled;
> > +module_param_named(enable_direct_mmu, direct_mmu_enabled, bool,
>
> To match other x86 module params, use "direct_mmu" for the param name and
> "enable_direct_mmu" for the variable.
>
> > +                S_IRUGO | S_IWUSR);
>
> I'd prefer octal perms here.  I'm pretty sure checkpatch complains about
> this, and I personally find 0444 and 0644 much more readable.
>
> > +
> >  /*
> >   * When setting this variable to true it enables Two-Dimensional-Paging
> >   * where the hardware walks 2 page tables:
> > @@ -3754,27 +3758,56 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
> >       *root_hpa = INVALID_PAGE;
> >  }
> >
> > +static bool is_direct_mmu_root(struct kvm *kvm, hpa_t root)
> > +{
> > +     int as_id;
> > +
> > +     for (as_id = 0; as_id < KVM_ADDRESS_SPACE_NUM; as_id++)
> > +             if (root == kvm->arch.direct_root_hpa[as_id])
> > +                     return true;
> > +
> > +     return false;
> > +}
> > +
> >  /* roots_to_free must be some combination of the KVM_MMU_ROOT_* flags */
> >  void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >                       ulong roots_to_free)
> >  {
> >       int i;
> >       LIST_HEAD(invalid_list);
> > -     bool free_active_root = roots_to_free & KVM_MMU_ROOT_CURRENT;
> >
> >       BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
> >
> > -     /* Before acquiring the MMU lock, see if we need to do any real work. */
> > -     if (!(free_active_root && VALID_PAGE(mmu->root_hpa))) {
> > -             for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> > -                     if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
> > -                         VALID_PAGE(mmu->prev_roots[i].hpa))
> > -                             break;
> > +     /*
> > +      * Direct MMU paging structures follow the life of the VM, so instead of
> > +      * destroying the direct MMU paging structure root, simply mark the root
> > +      * HPA pointing to it as invalid.
> > +      */
> > +     if (vcpu->kvm->arch.direct_mmu_enabled &&
> > +         roots_to_free & KVM_MMU_ROOT_CURRENT &&
> > +         is_direct_mmu_root(vcpu->kvm, mmu->root_hpa))
> > +             mmu->root_hpa = INVALID_PAGE;
> >
> > -             if (i == KVM_MMU_NUM_PREV_ROOTS)
> > -                     return;
> > +     if (!VALID_PAGE(mmu->root_hpa))
> > +             roots_to_free &= ~KVM_MMU_ROOT_CURRENT;
> > +
> > +     for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
> > +             if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) {
> > +                     if (is_direct_mmu_root(vcpu->kvm,
> > +                                            mmu->prev_roots[i].hpa))
> > +                             mmu->prev_roots[i].hpa = INVALID_PAGE;
> > +                     if (!VALID_PAGE(mmu->prev_roots[i].hpa))
> > +                             roots_to_free &= ~KVM_MMU_ROOT_PREVIOUS(i);
> > +             }
> >       }
> >
> > +     /*
> > +      * If there are no valid roots that need freeing at this point, avoid
> > +      * acquiring the MMU lock and return.
> > +      */
> > +     if (!roots_to_free)
> > +             return;
> > +
> >       write_lock(&vcpu->kvm->mmu_lock);
> >
> >       for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> > @@ -3782,7 +3815,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >                       mmu_free_root_page(vcpu->kvm, &mmu->prev_roots[i].hpa,
> >                                          &invalid_list);
> >
> > -     if (free_active_root) {
> > +     if (roots_to_free & KVM_MMU_ROOT_CURRENT) {
> >               if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> >                   (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> >                       mmu_free_root_page(vcpu->kvm, &mmu->root_hpa,
> > @@ -3820,7 +3853,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> >       struct kvm_mmu_page *sp;
> >       unsigned i;
> >
> > -     if (vcpu->arch.mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
> > +     if (vcpu->kvm->arch.direct_mmu_enabled) {
> > +             // TODO: Support 5 level paging in the direct MMU
> > +             BUG_ON(vcpu->arch.mmu->shadow_root_level > PT64_ROOT_4LEVEL);
> > +             vcpu->arch.mmu->root_hpa = vcpu->kvm->arch.direct_root_hpa[
> > +                     kvm_arch_vcpu_memslots_id(vcpu)];
> > +     } else if (vcpu->arch.mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
> >               write_lock(&vcpu->kvm->mmu_lock);
> >               if(make_mmu_pages_available(vcpu) < 0) {
> >                       write_unlock(&vcpu->kvm->mmu_lock);
> > @@ -3863,6 +3901,10 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> >       gfn_t root_gfn, root_cr3;
> >       int i;
> >
> > +     write_lock(&vcpu->kvm->mmu_lock);
> > +     vcpu->kvm->arch.pure_direct_mmu = false;
> > +     write_unlock(&vcpu->kvm->mmu_lock);
> > +
> >       root_cr3 = vcpu->arch.mmu->get_cr3(vcpu);
> >       root_gfn = root_cr3 >> PAGE_SHIFT;
> >
> > @@ -5710,6 +5752,64 @@ void kvm_disable_tdp(void)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_disable_tdp);
> >
> > +static bool is_direct_mmu_enabled(void)
> > +{
> > +     if (!READ_ONCE(direct_mmu_enabled))
> > +             return false;
> > +
> > +     if (WARN_ONCE(!tdp_enabled,
> > +                   "Creating a VM with direct MMU enabled requires TDP."))
> > +             return false;
>
> User-induced WARNs are bad, direct_mmu_enabled must be forced to zero in
> kvm_disable_tdp().  Unless there's a good reason for direct_mmu_enabled to
> remain writable at runtime, making it read-only will eliminate that case.
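
If it does need to stay writable, the forced-clear option is a one-liner; a
minimal sketch (the existing body of kvm_disable_tdp() is assumed here, not
taken from this patch):

void kvm_disable_tdp(void)
{
        tdp_enabled = false;
        /* Keep the knob coherent with TDP so the WARN above can never fire. */
        direct_mmu_enabled = false;
}
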
>
> > +     return true;
> > +}
> > +
> > +static int kvm_mmu_init_direct_mmu(struct kvm *kvm)
> > +{
> > +     struct page *page;
> > +     int i;
> > +
> > +     if (!is_direct_mmu_enabled())
> > +             return 0;
> > +
> > +     /*
> > +      * Allocate the direct MMU root pages. These pages follow the life of
> > +      * the VM.
> > +      */
> > +     for (i = 0; i < ARRAY_SIZE(kvm->arch.direct_root_hpa); i++) {
> > +             page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> > +             if (!page)
> > +                     goto err;
> > +             kvm->arch.direct_root_hpa[i] = page_to_phys(page);
> > +     }
> > +
> > +     /* This should not be changed for the lifetime of the VM. */
> > +     kvm->arch.direct_mmu_enabled = true;
> > +
> > +     kvm->arch.pure_direct_mmu = true;
> > +     return 0;
> > +err:
> > +     for (i = 0; i < ARRAY_SIZE(kvm->arch.direct_root_hpa); i++) {
> > +             if (kvm->arch.direct_root_hpa[i] &&
> > +                 VALID_PAGE(kvm->arch.direct_root_hpa[i]))
> > +                     free_page((unsigned long)kvm->arch.direct_root_hpa[i]);
> > +             kvm->arch.direct_root_hpa[i] = INVALID_PAGE;
> > +     }
> > +     return -ENOMEM;
> > +}
> > +


* Re: [RFC PATCH 10/28] kvm: mmu: Flush TLBs before freeing direct MMU page table memory
  2019-12-02 23:46   ` Sean Christopherson
@ 2019-12-06 20:31     ` Ben Gardon
  0 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-12-06 20:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

We've tested this on Skylake, Broadwell, Haswell, Ivybridge,
Sandybridge, and probably some newer platforms. I haven't gone digging
for any super old hardware to test on.

On Mon, Dec 2, 2019 at 3:46 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, Sep 26, 2019 at 04:18:06PM -0700, Ben Gardon wrote:
> > If page table memory is freed before a TLB flush, it can result in
> > improper guest access to memory through paging structure caches.
> > Specifically, until a TLB flush, memory that was part of the paging
> > structure could be used by the hardware for address translation if a
> > partial walk leading to it is stored in the paging structure cache. Ensure
> > that there is a TLB flush before page table memory is freed by
> > transferring disconnected pages to a disconnected list, and on a flush
> > transferring a snapshot of the disconnected list to a free list. The free
> > list is processed asynchronously to avoid slowing TLB flushes.
>
> Tangentially realted to TLB flushing, what generations of CPUs have you
> tested this on?  I don't have any specific concerns, but ideally it'd be
> nice to get testing cycles on older hardware before merging.  Thankfully
> TDP-only eliminates ridiculously old hardware :-)
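
For reference, the disconnected-list handoff described in the commit message
above would look roughly like the sketch below. The field, lock, and work item
names here are hypothetical, not the ones used in the patch:

/*
 * On a TLB flush, snapshot the pages disconnected so far. Only after the
 * flush is it safe to free them, and the actual freeing is deferred to a
 * worker so the flush path stays fast.
 */
static void direct_mmu_process_tlb_flush(struct kvm *kvm)
{
        LIST_HEAD(snapshot);

        spin_lock(&kvm->arch.disconnected_pts_lock);
        list_splice_tail_init(&kvm->arch.disconnected_pts, &snapshot);
        spin_unlock(&kvm->arch.disconnected_pts_lock);

        spin_lock(&kvm->arch.free_pts_lock);
        list_splice_tail(&snapshot, &kvm->arch.free_pts);
        spin_unlock(&kvm->arch.free_pts_lock);

        schedule_work(&kvm->arch.free_pts_work);
}
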


* Re: [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case
  2019-12-06 19:57     ` Sean Christopherson
@ 2019-12-06 20:42       ` Ben Gardon
  0 siblings, 0 replies; 57+ messages in thread
From: Ben Gardon @ 2019-12-06 20:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

Switching to a RW lock is easy, but nothing would be able to use the
read lock because it's not safe to make most kinds of changes to PTEs
in parallel in the existing code. If we sharded the spinlock based on
GFN it might be easier, but that would also take a lot of
re-engineering.
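
In outline, sharding would mean something like the following (purely
illustrative, nothing like this exists in the series; the locks would be
initialized with spin_lock_init() at setup):

#define MMU_LOCK_SHARDS 64      /* arbitrary power of two */

static spinlock_t mmu_shard_locks[MMU_LOCK_SHARDS];

static spinlock_t *mmu_lock_for_gfn(gfn_t gfn)
{
        /* Hash so that contiguous GFN ranges spread across shards. */
        return &mmu_shard_locks[hash_64(gfn, ilog2(MMU_LOCK_SHARDS))];
}

The re-engineering comes from every operation that spans multiple GFNs (zaps,
huge page splits, memslot-wide walks) needing a consistent lock ordering
across shards.
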

On Fri, Dec 6, 2019 at 11:57 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Dec 06, 2019 at 11:55:42AM -0800, Ben Gardon wrote:
> > I'm finally back in the office. Sorry for not getting back to you sooner.
> > I don't think it would be easy to send the synchronization changes
> > first. The reason they seem so small is that they're all handled by
> > the iterator. If we tried to put the synchronization changes in
> > without the iterator we'd have to 1.) deal with struct kvm_mmu_pages,
> > 2.) deal with the rmap, and 3.) change a huge amount of code to insert
> > the synchronization changes into the existing framework. The changes
> > wouldn't be mechanical or easy to insert either since a lot of
> > bookkeeping is currently done before PTEs are updated, with no
> > facility for rolling back the bookkeeping on PTE cmpxchg failure. We
> > could start with the iterator changes and then do the synchronization
> > changes, but the other way around would be very difficult.
>
> By synchronization changes, I meant switching to a r/w lock instead of a
> straight spinlock.  Is that doable in a smallish series?


* Re: [RFC PATCH 13/28] kvm: mmu: Add an iterator for concurrent paging structure walks
  2019-12-03  2:15   ` Sean Christopherson
@ 2019-12-18 18:25     ` Ben Gardon
  2019-12-18 19:14       ` Sean Christopherson
  0 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2019-12-18 18:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Mon, Dec 2, 2019 at 6:15 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, Sep 26, 2019 at 04:18:09PM -0700, Ben Gardon wrote:
> > Add a utility for concurrent paging structure traversals. This iterator
> > uses several mechanisms to ensure that its accesses to paging structure
> > memory are safe, and that memory can be freed safely in the face of
> > lockless access. The purpose of the iterator is to create a unified
> > pattern for concurrent paging structure traversals and simplify the
> > implementation of other MMU functions.
> >
> > This iterator implements a pre-order traversal of PTEs for a given GFN
> > range within a given address space. The iterator abstracts away
> > bookkeeping on successful changes to PTEs, retrying on failed PTE
> > modifications, TLB flushing, and yielding during long operations.
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu.c      | 455 ++++++++++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/mmutrace.h |  50 +++++
> >  2 files changed, 505 insertions(+)
>
> ...
>
> > +/*
> > + * Sets a direct walk iterator to seek the gfn range [start, end).
> > + * If end is greater than the maximum possible GFN, it will be changed to the
> > + * maximum possible gfn + 1. (Note that start/end is and inclusive/exclusive
> > + * range, so the last gfn to be interated over would be the largest possible
> > + * GFN, in this scenario.)
> > + */
> > +__attribute__((unused))
> > +static void direct_walk_iterator_setup_walk(struct direct_walk_iterator *iter,
> > +     struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
> > +     enum mmu_lock_mode lock_mode)
>
> Echoing earlier patches, please introduce variables/flags/functions along
> with their users.  I have a feeling you're adding some of the unused
> functions so that all flags/variables in struct direct_walk_iterator can
> be in place from the get-go, but that actually makes everything much harder
> to review.
>
> > +{
> > +     BUG_ON(!kvm->arch.direct_mmu_enabled);
> > +     BUG_ON((lock_mode & MMU_WRITE_LOCK) && (lock_mode & MMU_READ_LOCK));
> > +     BUG_ON(as_id < 0);
> > +     BUG_ON(as_id >= KVM_ADDRESS_SPACE_NUM);
> > +     BUG_ON(!VALID_PAGE(kvm->arch.direct_root_hpa[as_id]));
> > +
> > +     /* End cannot be greater than the maximum possible gfn. */
> > +     end = min(end, 1ULL << (PT64_ROOT_4LEVEL * PT64_PT_BITS));
> > +
> > +     iter->as_id = as_id;
> > +     iter->pt_path[PT64_ROOT_4LEVEL - 1] =
> > +                     (u64 *)__va(kvm->arch.direct_root_hpa[as_id]);
> > +
> > +     iter->walk_start = start;
> > +     iter->walk_end = end;
> > +     iter->target_gfn = start;
> > +
> > +     iter->lock_mode = lock_mode;
> > +     iter->kvm = kvm;
> > +     iter->tlbs_dirty = 0;
> > +
> > +     direct_walk_iterator_start_traversal(iter);
> > +}
>
> ...
>
> > +static void direct_walk_iterator_cond_resched(struct direct_walk_iterator *iter)
> > +{
> > +     if (!(iter->lock_mode & MMU_LOCK_MAY_RESCHED) || !need_resched())
> > +             return;
> > +
> > +     direct_walk_iterator_prepare_cond_resched(iter);
> > +     cond_resched();
> > +     direct_walk_iterator_finish_cond_resched(iter);
> > +}
> > +
> > +static bool direct_walk_iterator_next_pte(struct direct_walk_iterator *iter)
> > +{
> > +     /*
> > +      * This iterator could be iterating over a large number of PTEs, such
> > +      * that if this thread did not yield, it would cause scheduler
> > +      * problems. To avoid this, yield if needed. Note the check on
> > +      * MMU_LOCK_MAY_RESCHED in direct_walk_iterator_cond_resched. This
> > +      * iterator will not yield unless that flag is set in its lock_mode.
> > +      */
> > +     direct_walk_iterator_cond_resched(iter);
>
> This looks very fragile, e.g. one of the future patches even has to avoid
> problems with this code by limiting the number of PTEs it processes.
With this, functions either need to limit the number of PTEs they
process or pass the MMU_LOCK_MAY_RESCHED to the iterator. It would
probably be safer to invert the flag and make it
MMU_LOCK_MAY_NOT_RESCHED for functions that can self-regulate the
number of PTEs they process or have weird synchronization
requirements. For example, the page fault handler can't reschedule and
we know it won't process many entries, so we could pass
MMU_LOCK_MAY_NOT_RESCHED in there.
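
Inverting the flag would only change the early-out in the helper quoted above,
roughly (MMU_LOCK_MAY_NOT_RESCHED being the hypothetical new flag):

static void direct_walk_iterator_cond_resched(struct direct_walk_iterator *iter)
{
        /* Yield by default; callers with short or special walks opt out. */
        if ((iter->lock_mode & MMU_LOCK_MAY_NOT_RESCHED) || !need_resched())
                return;

        direct_walk_iterator_prepare_cond_resched(iter);
        cond_resched();
        direct_walk_iterator_finish_cond_resched(iter);
}
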


>
> > +
> > +     while (true) {
> > +             if (!direct_walk_iterator_next_pte_raw(iter))
>
> Implicitly initializing the iterator during next_pte_raw() is asking for
> problems, e.g. @walk_in_progress should not exist.  The standard kernel
> pattern for fancy iterators is to wrap the initialization, deref, and
> advancement operators in a macro, e.g. something like:
>
>         for_each_direct_pte(...) {
>
>         }
>
> That might require additional control flow logic in the users of the
> iterator, but if so that's probably a good thing in terms of readability
> and robustness.  E.g. verifying that rcu_read_unlock() is guaranteed to
> be called is extremely difficult as rcu_read_lock() is buried in this
> low level helper but the iterator relies on the top-level caller to
> terminate traversal.
>
> See mem_cgroup_iter_break() for one example of handling an iter walk
> where an action needs to be taken when the walk terminates early.
>
> > +                     return false;
> > +
> > +             direct_walk_iterator_recalculate_output_fields(iter);
> > +             if (iter->old_pte != DISCONNECTED_PTE)
> > +                     break;
> > +
> > +             /*
> > +              * The iterator has encountered a disconnected pte, so it is in
> > +              * a page that has been disconnected from the root. Restart the
> > +              * traversal from the root in this case.
> > +              */
> > +             direct_walk_iterator_reset_traversal(iter);
>
> I understand wanting to hide details to eliminate copy-paste, but this
> goes too far and makes it too difficult to understand the flow of the
> top-level walks.  Ditto for burying retry_pte() in set_pte().  I'd say it
> also applies to skip_step_down(), but AFAICT that's dead code.
>
> Off-topic for a second, the super long direct_walk_iterator_... names
> make me want to simply call this new MMU the "tdp MMU" and just live with
> the discrepancy until the old shadow-based TDP MMU can be nuked.  Then we
> could have tdp_iter_blah_blah_blah(), for_each_tdp_present_pte(), etc...
>
> Back to the iterator, I think it can be massaged into a standard for loop
> approach without polluting the top level walkers much.  The below code is
> the basic idea, e.g. the macros won't compile, probably doesn't terminate
> the walk correctly, rescheduling is missing, etc...
>
> Note, open coding the down/sideways/up helpers is 50% personal preference,
> 50% because gfn_start and gfn_end are now local variables, and 50% because
> it was the easiest way to learn the code.  I wouldn't argue too much about
> having one or more of the helpers.
>
>
> static void tdp_iter_break(struct tdp_iter *iter)
> {
>         /* TLB flush, RCU unlock, etc...)
> }
>
> static void tdp_iter_next(struct tdp_iter *iter, bool *retry)
> {
>         gfn_t gfn_start, gfn_end;
>         u64 *child_pt;
>
>         if (*retry) {
>                 *retry = false;
>                 return;
>         }
>
>         /*
>          * Reread the pte before stepping down to avoid traversing into page
>          * tables that are no longer linked from this entry. This is not
>          * needed for correctness - just a small optimization.
>          */
>         iter->old_pte = READ_ONCE(*iter->ptep);
>
>         /* Try to step down. */
>         child_pt = pte_to_child_pt(iter->old_pte, iter->level);
>         if (child_pt) {
>                 child_pt = rcu_dereference(child_pt);
>                 iter->level--;
>                 iter->pt_path[iter->level - 1] = child_pt;
>                 return;
>         }
>
> step_sideways:
>         /* Try to step sideways. */
>         gfn_start = ALIGN_DOWN(iter->target_gfn,
>                                KVM_PAGES_PER_HPAGE(iter->level));
>         gfn_end = gfn_start + KVM_PAGES_PER_HPAGE(iter->level);
>
>         /*
>          * If the current gfn maps past the target gfn range, the next entry in
>          * the current page table will be outside the target range.
>          */
>         if (gfn_end >= iter->walk_end ||
>             !(gfn_end % KVM_PAGES_PER_HPAGE(iter->level + 1))) {
>                 /* Try to step up. */
>                 iter->level++;
>
>                 if (iter->level > PT64_ROOT_4LEVEL) {
>                         /* This is ugly, there's probably a better solution. */
>                         tdp_iter_break(iter);
>                         return;
>                 }
>                 goto step_sideways;
>         }
>
>         iter->target_gfn = gfn_end;
>         iter->ptep = iter->pt_path[iter->level - 1] +
>                         PT64_INDEX(iter->target_gfn << PAGE_SHIFT, iter->level);
>         iter->old_pte = READ_ONCE(*iter->ptep);
> }
>
> #define for_each_tdp_pte(iter, start, end, retry)
>         for (tdp_iter_start(&iter, start, end);
>              iter->level <= PT64_ROOT_4LEVEL;
>              tdp_iter_next(&iter, &retry))
>
> #define for_each_tdp_present_pte(iter, start, end, retry)
>         for_each_tdp_pte(iter, start, end, retry)
>                 if (!is_present_direct_pte(iter->old_pte)) {
>
>                 } else
>
> #define for_each_tdp_present_leaf_pte(iter, start, end, retry)
>         for_each_tdp_pte(iter, start, end, retry)
>                 if (!is_present_direct_pte(iter->old_pte) ||
>                     !is_last_spte(iter->old_pte, iter->level))
>                 {
>
>                 } else
>
> /*
>  * Marks the range of gfns, [start, end), non-present.
>  */
> static bool zap_direct_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
>                                  gfn_t end, enum mmu_lock_mode lock_mode)
> {
>         struct direct_walk_iterator iter;
>         bool retry;
>
>         tdp_iter_init(&iter, kvm, as_id, lock_mode);
>
> restart:
>         retry = false;
>         for_each_tdp_present_pte(iter, start, end, retry) {
>                 if (tdp_iter_set_pte(&iter, 0))
>                         retry = true;
>
>                 if (tdp_iter_disconnected(&iter)) {
>                         tdp_iter_break(&iter);
>                         goto restart;
>                 }
>         }
> }
>


* Re: [RFC PATCH 13/28] kvm: mmu: Add an iterator for concurrent paging structure walks
  2019-12-18 18:25     ` Ben Gardon
@ 2019-12-18 19:14       ` Sean Christopherson
  0 siblings, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2019-12-18 19:14 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Wed, Dec 18, 2019 at 10:25:45AM -0800, Ben Gardon wrote:
> On Mon, Dec 2, 2019 at 6:15 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > > +static bool direct_walk_iterator_next_pte(struct direct_walk_iterator *iter)
> > > +{
> > > +     /*
> > > +      * This iterator could be iterating over a large number of PTEs, such
> > > +      * that if this thread did not yield, it would cause scheduler
> > > +      * problems. To avoid this, yield if needed. Note the check on
> > > +      * MMU_LOCK_MAY_RESCHED in direct_walk_iterator_cond_resched. This
> > > +      * iterator will not yield unless that flag is set in its lock_mode.
> > > +      */
> > > +     direct_walk_iterator_cond_resched(iter);
> >
> > This looks very fragile, e.g. one of the future patches even has to avoid
> > problems with this code by limiting the number of PTEs it processes.
>
> With this, functions either need to limit the number of PTEs they
> process or pass the MMU_LOCK_MAY_RESCHED to the iterator. It would
> probably be safer to invert the flag and make it
> MMU_LOCK_MAY_NOT_RESCHED for functions that can self-regulate the
> number of PTEs they process or have weird synchronization
> requirements. For example, the page fault handler can't reschedule and
> we know it won't process many entries, so we could pass
> MMU_LOCK_MAY_NOT_RESCHED in there.

That doesn't address the underlying fragility of the iterator, i.e. relying
on callers to self-regulate.  Especially since the threshold is completely
arbitrary, e.g. in zap_direct_gfn_range(), what's to say PDPE and lower is
always safe, e.g. if should_resched() becomes true at the very start of the
walk?

The direct comparison to zap_direct_gfn_range() is slot_handle_level_range(),
which supports rescheduling regardless of what function is being invoked.
What prevents the TDP iterator from doing the same?  E.g. what's the worst
case scenario if a reschedule pops up at an inopportune time?


* Re: [RFC PATCH 16/28] kvm: mmu: Add direct MMU page fault handler
  2019-09-26 23:18 ` [RFC PATCH 16/28] kvm: mmu: Add direct MMU page fault handler Ben Gardon
@ 2020-01-08 17:20   ` Peter Xu
  2020-01-08 18:15     ` Ben Gardon
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Xu @ 2020-01-08 17:20 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Thu, Sep 26, 2019 at 04:18:12PM -0700, Ben Gardon wrote:

[...]

> +static int handle_direct_page_fault(struct kvm_vcpu *vcpu,
> +		unsigned long mmu_seq, int write, int map_writable, int level,
> +		gpa_t gpa, gfn_t gfn, kvm_pfn_t pfn, bool prefault)
> +{
> +	struct direct_walk_iterator iter;
> +	struct kvm_mmu_memory_cache *pf_pt_cache = &vcpu->arch.mmu_page_cache;
> +	u64 *child_pt;
> +	u64 new_pte;
> +	int ret = RET_PF_RETRY;
> +
> +	direct_walk_iterator_setup_walk(&iter, vcpu->kvm,
> +			kvm_arch_vcpu_memslots_id(vcpu), gpa >> PAGE_SHIFT,
> +			(gpa >> PAGE_SHIFT) + 1, MMU_READ_LOCK);
> +	while (direct_walk_iterator_next_pte(&iter)) {
> +		if (iter.level == level) {
> +			ret = direct_page_fault_handle_target_level(vcpu,
> +					write, map_writable, &iter, pfn,
> +					prefault);
> +
> +			break;
> +		} else if (!is_present_direct_pte(iter.old_pte) ||
> +			   is_large_pte(iter.old_pte)) {
> +			/*
> +			 * The leaf PTE for this fault must be mapped at a
> +			 * lower level, so a non-leaf PTE must be inserted into
> +			 * the paging structure. If the assignment below
> +			 * succeeds, it will add the non-leaf PTE and a new
> +			 * page of page table memory. Then the iterator can
> +			 * traverse into that new page. If the atomic compare/
> +			 * exchange fails, the iterator will repeat the current
> +			 * PTE, so the only thing this function must do
> +			 * differently is return the page table memory to the
> +			 * vCPU's fault cache.
> +			 */
> +			child_pt = mmu_memory_cache_alloc(pf_pt_cache);
> +			new_pte = generate_nonleaf_pte(child_pt, false);
> +
> +			if (!direct_walk_iterator_set_pte(&iter, new_pte))
> +				mmu_memory_cache_return(pf_pt_cache, child_pt);
> +		}
> +	}

I have a question on how this will guarantee safe concurrency...

As you mentioned previously somewhere, the design somehow mimics how
the core mm works with process page tables, and IIUC here the rwlock
works really like the mmap_sem that we have for the process mm.  So
with the series now we can have multiple page fault happening with
read lock held of the mmu_lock to reach here.

Then I'm imagining a case where both vcpu threads faulted on the same
address range while wanting to do different things, like: (1)
vcpu1 thread wanted to map this as a 2M huge page, while (2) vcpu2
thread wanted to map this as a 4K page.  Then is it possible that
vcpu2 is faster so it first sets up the pmd as a page table page (via
direct_walk_iterator_set_pte above), then vcpu1 quickly overwrites it
as a huge page (via direct_page_fault_handle_target_level, level=2),
then I feel like the previous page table page that was set up by vcpu2 can
be lost unnoticed.

I think the general process page table does not have this issue because
it has per pmd lock so anyone who changes the pmd or beneath it will
need to take that.  However here we don't have it, instead we only
depend on the atomic ops, which seems to be not enough for this?

Thanks,

> +	direct_walk_iterator_end_traversal(&iter);
> +
> +	/* If emulating, flush this vcpu's TLB. */
> +	if (ret == RET_PF_EMULATE)
> +		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> +
> +	return ret;
> +}

-- 
Peter Xu



* Re: [RFC PATCH 16/28] kvm: mmu: Add direct MMU page fault handler
  2020-01-08 17:20   ` Peter Xu
@ 2020-01-08 18:15     ` Ben Gardon
  2020-01-08 19:00       ` Peter Xu
  0 siblings, 1 reply; 57+ messages in thread
From: Ben Gardon @ 2020-01-08 18:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Wed, Jan 8, 2020 at 9:20 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 26, 2019 at 04:18:12PM -0700, Ben Gardon wrote:
>
> [...]
>
> > +static int handle_direct_page_fault(struct kvm_vcpu *vcpu,
> > +             unsigned long mmu_seq, int write, int map_writable, int level,
> > +             gpa_t gpa, gfn_t gfn, kvm_pfn_t pfn, bool prefault)
> > +{
> > +     struct direct_walk_iterator iter;
> > +     struct kvm_mmu_memory_cache *pf_pt_cache = &vcpu->arch.mmu_page_cache;
> > +     u64 *child_pt;
> > +     u64 new_pte;
> > +     int ret = RET_PF_RETRY;
> > +
> > +     direct_walk_iterator_setup_walk(&iter, vcpu->kvm,
> > +                     kvm_arch_vcpu_memslots_id(vcpu), gpa >> PAGE_SHIFT,
> > +                     (gpa >> PAGE_SHIFT) + 1, MMU_READ_LOCK);
> > +     while (direct_walk_iterator_next_pte(&iter)) {
> > +             if (iter.level == level) {
> > +                     ret = direct_page_fault_handle_target_level(vcpu,
> > +                                     write, map_writable, &iter, pfn,
> > +                                     prefault);
> > +
> > +                     break;
> > +             } else if (!is_present_direct_pte(iter.old_pte) ||
> > +                        is_large_pte(iter.old_pte)) {
> > +                     /*
> > +                      * The leaf PTE for this fault must be mapped at a
> > +                      * lower level, so a non-leaf PTE must be inserted into
> > +                      * the paging structure. If the assignment below
> > +                      * succeeds, it will add the non-leaf PTE and a new
> > +                      * page of page table memory. Then the iterator can
> > +                      * traverse into that new page. If the atomic compare/
> > +                      * exchange fails, the iterator will repeat the current
> > +                      * PTE, so the only thing this function must do
> > +                      * differently is return the page table memory to the
> > +                      * vCPU's fault cache.
> > +                      */
> > +                     child_pt = mmu_memory_cache_alloc(pf_pt_cache);
> > +                     new_pte = generate_nonleaf_pte(child_pt, false);
> > +
> > +                     if (!direct_walk_iterator_set_pte(&iter, new_pte))
> > +                             mmu_memory_cache_return(pf_pt_cache, child_pt);
> > +             }
> > +     }
>
> I have a question on how this will guarantee safe concurrency...
>
> As you mentioned previously somewhere, the design somehow mimics how
> the core mm works with process page tables, and IIUC here the rwlock
> works really like the mmap_sem that we have for the process mm.  So
> with the series now we can have multiple page faults happening with
> the read lock of the mmu_lock held to reach here.

Ah, I'm sorry if I put that down somewhere. I think that comparing the
MMU rwlock in this series to the core mm mmap_sem was a mistake. I do
not understand the ways in which the core mm uses the mmap_sem enough
to make such a comparison. You're correct that with two faulting vCPUs
we could have page faults on the same address range happening in
parallel. I'll try to elaborate more on why that's safe.

> Then I'm imagining a case where both vcpu threads faulted on the same
> address range while wanting to do different things, like: (1)
> vcpu1 thread wanted to map this as a 2M huge page, while (2) vcpu2
> thread wanted to map this as a 4K page.

By vcpu thread, do you mean the page fault / EPT violation handler
wants to map memory at different levels? As far as I understand,
vCPUs do not have any intent to map a page at a certain level when
they take an EPT violation. The page fault handlers could certainly
want to map the memory at different levels. For example, if guest
memory was backed with 2M hugepages and one vCPU tried to do an
instruction fetch on an unmapped page while another tried to read it,
that should result in the page fault handler for the first vCPU trying
to map at 4K and the other trying to map at 2M, as in your example.

> Then is it possible that
> vcpu2 is faster so it first sets up the pmd as a page table page (via
> direct_walk_iterator_set_pte above),

This is definitely possible

> then vcpu1 quickly overwrites it
> as a huge page (via direct_page_fault_handle_target_level, level=2),
> then I feel like the previous page table page that was set up by vcpu2 can
> be lost unnoticed.

There are two possibilities here. 1.) vCPU2 saw vCPU1's modification
to the PTE during its walk. In this case, vCPU2 should not map the
memory at 2M. (I realize that in this example there is a discrepancy
as there's no NX hugepage support in this RFC. I need to add that in
the next version. In this case, vCPU1 would set a bit in the non-leaf
PTE to indicate it was split to allow X on a constituent 4K entry.)
2.) If vCPU2 did not see vCPU1's modification during its walk, it will
indeed try to map the memory at 2M. However in this case the atomic
cmpxchg on the PTE will fail because vCPU2 did not have the current
value of the PTE. In this case the PTE will be re-read and the walk
will continue or the page fault will be retried. When threads using
the direct walk iterator change PTEs with an atomic cmpxchg, they are
guaranteed to know what the value of the PTE was before the cmpxchg
and so that thread is then responsible for any cleanup associated with
the PTE modification - e.g. freeing pages of page table memory.
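
The primitive underneath direct_walk_iterator_set_pte() is just a
compare-and-exchange on the PTE; stripped down to the essentials it is
something like the following sketch (not the actual helper):

static bool try_set_pte(u64 *ptep, u64 old_pte, u64 new_pte)
{
        /*
         * Succeed only if the PTE still holds the value this thread saw
         * during its walk. The winner then owns any cleanup for the old
         * value (e.g. freeing replaced page table pages); losers re-read
         * the PTE and retry.
         */
        return cmpxchg64(ptep, old_pte, new_pte) == old_pte;
}
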

> I think the general process page table does not have this issue because
> it has per pmd lock so anyone who changes the pmd or beneath it will
> need to take that.  However here we don't have it, instead we only
> depend on the atomic ops, which seems to be not enough for this?

I think that atomic ops (plus rcu to ensure no use-after-free) are
enough in this case, but I could definitely be wrong. If your concern
about the race requires the NX hugepages stuff, I need to get on top
of sending out those patches. If you can think of a race that doesn't
require that, I'd be very excited to hear it.

> Thanks,
>
> > +     direct_walk_iterator_end_traversal(&iter);
> > +
> > +     /* If emulating, flush this vcpu's TLB. */
> > +     if (ret == RET_PF_EMULATE)
> > +             kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> > +
> > +     return ret;
> > +}
>
> --
> Peter Xu
>


* Re: [RFC PATCH 16/28] kvm: mmu: Add direct MMU page fault handler
  2020-01-08 18:15     ` Ben Gardon
@ 2020-01-08 19:00       ` Peter Xu
  0 siblings, 0 replies; 57+ messages in thread
From: Peter Xu @ 2020-01-08 19:00 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Paolo Bonzini, Peter Feiner, Peter Shier, Junaid Shahid,
	Jim Mattson

On Wed, Jan 08, 2020 at 10:15:41AM -0800, Ben Gardon wrote:
> On Wed, Jan 8, 2020 at 9:20 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 26, 2019 at 04:18:12PM -0700, Ben Gardon wrote:
> >
> > [...]
> >
> > > +static int handle_direct_page_fault(struct kvm_vcpu *vcpu,
> > > +             unsigned long mmu_seq, int write, int map_writable, int level,
> > > +             gpa_t gpa, gfn_t gfn, kvm_pfn_t pfn, bool prefault)
> > > +{
> > > +     struct direct_walk_iterator iter;
> > > +     struct kvm_mmu_memory_cache *pf_pt_cache = &vcpu->arch.mmu_page_cache;
> > > +     u64 *child_pt;
> > > +     u64 new_pte;
> > > +     int ret = RET_PF_RETRY;
> > > +
> > > +     direct_walk_iterator_setup_walk(&iter, vcpu->kvm,
> > > +                     kvm_arch_vcpu_memslots_id(vcpu), gpa >> PAGE_SHIFT,
> > > +                     (gpa >> PAGE_SHIFT) + 1, MMU_READ_LOCK);
> > > +     while (direct_walk_iterator_next_pte(&iter)) {
> > > +             if (iter.level == level) {
> > > +                     ret = direct_page_fault_handle_target_level(vcpu,
> > > +                                     write, map_writable, &iter, pfn,
> > > +                                     prefault);
> > > +
> > > +                     break;
> > > +             } else if (!is_present_direct_pte(iter.old_pte) ||
> > > +                        is_large_pte(iter.old_pte)) {
> > > +                     /*
> > > +                      * The leaf PTE for this fault must be mapped at a
> > > +                      * lower level, so a non-leaf PTE must be inserted into
> > > +                      * the paging structure. If the assignment below
> > > +                      * succeeds, it will add the non-leaf PTE and a new
> > > +                      * page of page table memory. Then the iterator can
> > > +                      * traverse into that new page. If the atomic compare/
> > > +                      * exchange fails, the iterator will repeat the current
> > > +                      * PTE, so the only thing this function must do
> > > +                      * differently is return the page table memory to the
> > > +                      * vCPU's fault cache.
> > > +                      */
> > > +                     child_pt = mmu_memory_cache_alloc(pf_pt_cache);
> > > +                     new_pte = generate_nonleaf_pte(child_pt, false);
> > > +
> > > +                     if (!direct_walk_iterator_set_pte(&iter, new_pte))
> > > +                             mmu_memory_cache_return(pf_pt_cache, child_pt);
> > > +             }
> > > +     }
> >
> > I have a question on how this will guarantee safe concurrency...
> >
> > As you mentioned previously somewhere, the design somehow mimics how
> > the core mm works with process page tables, and IIUC here the rwlock
> > works really like the mmap_sem that we have for the process mm.  So
> > with the series now we can have multiple page faults happening with
> > the read lock of the mmu_lock held to reach here.
> 
> Ah, I'm sorry if I put that down somewhere. I think that comparing the
> MMU rwlock in this series to the core mm mmap_sem was a mistake. I do
> not understand the ways in which the core mm uses the mmap_sem enough
> to make such a comparison. You're correct that with two faulting vCPUs
> we could have page faults on the same address range happening in
> parallel. I'll try to elaborate more on why that's safe.
> 
> > Then I'm imagining a case where both vcpu threads faulted on the same
> > address range while wanting to do different things, like: (1)
> > vcpu1 thread wanted to map this as a 2M huge page, while (2) vcpu2
> > thread wanted to map this as a 4K page.
> 
> By vcpu thread, do you mean the page fault / EPT violation handler
> wants to map memory at different levels? As far as I understand,
> vCPUs do not have any intent to map a page at a certain level when
> they take an EPT violation. The page fault handlers could certainly
> want to map the memory at different levels. For example, if guest
> memory was backed with 2M hugepages and one vCPU tried to do an
> instruction fetch on an unmapped page while another tried to read it,
> that should result in the page fault handler for the first vCPU trying
> to map at 4K and the other trying to map at 2M, as in your example.
> 
> > Then is it possible that
> > vcpu2 is faster so it first sets up the pmd as a page table page (via
> > direct_walk_iterator_set_pte above),
> 
> This is definitely possible
> 
> > then vcpu1 quickly overwrites it
> > as a huge page (via direct_page_fault_handle_target_level, level=2),
> > then I feel like the previous page table page that was set up by vcpu2 can
> > be lost unnoticed.
> 
> There are two possibilities here. 1.) vCPU2 saw vCPU1's modification
> to the PTE during its walk. In this case, vCPU2 should not map the
> memory at 2M. (I realize that in this example there is a discrepancy
> as there's no NX hugepage support in this RFC. I need to add that in
> the next version. In this case, vCPU1 would set a bit in the non-leaf
> PTE to indicate it was split to allow X on a constituent 4K entry.)
> 2.) If vCPU2 did not see vCPU1's modification during its walk, it will
> indeed try to map the memory at 2M. However in this case the atomic
> cmpxchg on the PTE will fail because vCPU2 did not have the current
> value of the PTE. In this case the PTE will be re-read and the walk
> will continue or the page fault will be retried. When threads using
> the direct walk iterator change PTEs with an atomic cmpxchg, they are
> guaranteed to know what the value of the PTE was before the cmpxchg
> and so that thread is then responsible for any cleanup associated with
> the PTE modification - e.g. freeing pages of page table memory.
> 
> > I think the general process page table does not have this issue because
> > it has per pmd lock so anyone who changes the pmd or beneath it will
> > need to take that.  However here we don't have it, instead we only
> > depend on the atomic ops, which seems to be not enough for this?
> 
> I think that atomic ops (plus rcu to ensure no use-after-free) are
> enough in this case, but I could definitely be wrong. If your concern
> about the race requires the NX hugepages stuff, I need to get on top
> of sending out those patches. If you can think of a race that doesn't
> require that, I'd be very excited to hear it.

Actually nx_huge_pages is exactly the thing I thought about for this
case when I was trying to find a real scenario (because in most cases
even if vcpu1 & vcpu2 trap at the same address, they still seem to
map the pages in the same way).  But yes if you're even prepared for
that (so IIUC the 2M mapping will respect the 4K mappings in that
case) then it looks reasonable.

And I think I was wrong above in that the page should not be leaked
anyway, since I just noticed handle_changed_pte() should take care of
that, iiuc:

	if (was_present && !was_leaf && (pfn_changed || !is_present)) {
		/*
		 * The level of the page table being freed is one level lower
		 * than the level at which it is mapped.
		 */
		child_level = level - 1;

		/*
		 * If there was a present non-leaf entry before, and now the
		 * entry points elsewhere, the lpage stats and dirty logging /
		 * access tracking status for all the entries the old pte
		 * pointed to must be updated and the page table pages it
		 * pointed to must be freed.
		 */
		handle_disconnected_pt(kvm, as_id, gfn, spte_to_pfn(old_pte),
				       child_level, vm_teardown,
				       disconnected_pts);
	}

With that, I don't have any other concerns so far.  Will wait for your
next version.

Thanks,

-- 
Peter Xu




Thread overview: 57+ messages
2019-09-26 23:17 [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Ben Gardon
2019-09-26 23:17 ` [RFC PATCH 01/28] kvm: mmu: Separate generating and setting mmio ptes Ben Gardon
2019-11-27 18:15   ` Sean Christopherson
2019-09-26 23:17 ` [RFC PATCH 02/28] kvm: mmu: Separate pte generation from set_spte Ben Gardon
2019-11-27 18:25   ` Sean Christopherson
2019-09-26 23:17 ` [RFC PATCH 03/28] kvm: mmu: Zero page cache memory at allocation time Ben Gardon
2019-11-27 18:32   ` Sean Christopherson
2019-09-26 23:18 ` [RFC PATCH 04/28] kvm: mmu: Update the lpages stat atomically Ben Gardon
2019-11-27 18:39   ` Sean Christopherson
2019-12-06 20:10     ` Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 05/28] sched: Add cond_resched_rwlock Ben Gardon
2019-11-27 18:42   ` Sean Christopherson
2019-12-06 20:12     ` Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 06/28] kvm: mmu: Replace mmu_lock with a read/write lock Ben Gardon
2019-11-27 18:47   ` Sean Christopherson
2019-12-02 22:45     ` Sean Christopherson
2019-09-26 23:18 ` [RFC PATCH 07/28] kvm: mmu: Add functions for handling changed PTEs Ben Gardon
2019-11-27 19:04   ` Sean Christopherson
2019-09-26 23:18 ` [RFC PATCH 08/28] kvm: mmu: Init / Uninit the direct MMU Ben Gardon
2019-12-02 23:40   ` Sean Christopherson
2019-12-06 20:25     ` Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 09/28] kvm: mmu: Free direct MMU page table memory in an RCU callback Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 10/28] kvm: mmu: Flush TLBs before freeing direct MMU page table memory Ben Gardon
2019-12-02 23:46   ` Sean Christopherson
2019-12-06 20:31     ` Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 11/28] kvm: mmu: Optimize for freeing direct MMU PTs on teardown Ben Gardon
2019-12-02 23:54   ` Sean Christopherson
2019-09-26 23:18 ` [RFC PATCH 12/28] kvm: mmu: Set tlbs_dirty atomically Ben Gardon
2019-12-03  0:13   ` Sean Christopherson
2019-09-26 23:18 ` [RFC PATCH 13/28] kvm: mmu: Add an iterator for concurrent paging structure walks Ben Gardon
2019-12-03  2:15   ` Sean Christopherson
2019-12-18 18:25     ` Ben Gardon
2019-12-18 19:14       ` Sean Christopherson
2019-09-26 23:18 ` [RFC PATCH 14/28] kvm: mmu: Batch updates to the direct mmu disconnected list Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 15/28] kvm: mmu: Support invalidate_zap_all_pages Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 16/28] kvm: mmu: Add direct MMU page fault handler Ben Gardon
2020-01-08 17:20   ` Peter Xu
2020-01-08 18:15     ` Ben Gardon
2020-01-08 19:00       ` Peter Xu
2019-09-26 23:18 ` [RFC PATCH 17/28] kvm: mmu: Add direct MMU fast " Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 18/28] kvm: mmu: Add an hva range iterator for memslot GFNs Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 19/28] kvm: mmu: Make address space ID a property of memslots Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 20/28] kvm: mmu: Implement the invalidation MMU notifiers for the direct MMU Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 21/28] kvm: mmu: Integrate the direct mmu with the changed pte notifier Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 22/28] kvm: mmu: Implement access tracking for the direct MMU Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 23/28] kvm: mmu: Make mark_page_dirty_in_slot usable from outside kvm_main Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 24/28] kvm: mmu: Support dirty logging in the direct MMU Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 25/28] kvm: mmu: Support kvm_zap_gfn_range " Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 26/28] kvm: mmu: Integrate direct MMU with nesting Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 27/28] kvm: mmu: Lazily allocate rmap when direct MMU is enabled Ben Gardon
2019-09-26 23:18 ` [RFC PATCH 28/28] kvm: mmu: Support MMIO in the direct MMU Ben Gardon
2019-10-17 18:50 ` [RFC PATCH 00/28] kvm: mmu: Rework the x86 TDP direct mapped case Sean Christopherson
2019-10-18 13:42   ` Paolo Bonzini
2019-11-27 19:09 ` Sean Christopherson
2019-12-06 19:55   ` Ben Gardon
2019-12-06 19:57     ` Sean Christopherson
2019-12-06 20:42       ` Ben Gardon
