* [PATCH v3 0/8] KVM: Scalable memslots implementation
@ 2021-05-16 21:44 Maciej S. Szmigiero
  2021-05-16 21:44 ` [PATCH v3 1/8] KVM: x86: Cache total page count to avoid traversing the memslot array Maciej S. Szmigiero
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-16 21:44 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Igor Mammedov,
	Marc Zyngier, James Morse, Julien Thierry, Suzuki K Poulose,
	Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Christian Borntraeger, Janosch Frank, David Hildenbrand,
	Cornelia Huck, Claudio Imbrenda, Joerg Roedel, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

The current memslot code uses a (reverse) gfn-ordered memslot array
for keeping track of memslots.
This only allows quick binary search by gfn; quick lookup by hva is
not possible, since the implementation has to do a linear scan of the
whole memslot array.

Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change
flags) has to make a copy of the whole array so it has a scratch copy
to work on.

Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified; copying all the memslots currently
present is just a limitation of the array-based memslot implementation.

Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.

In order to have two memslot sets, but only one copy of the actual
memslots, it is necessary to split out the memslot data from the
memslot sets.

The memslots themselves should also be kept independent of each other
so they can be individually added or deleted.

These two memslot sets should normally point to the same set of memslots.
They can, however, be desynchronized when performing a memslot management
operation by replacing the memslot to be modified by its copy.
After the operation is complete, both memslot sets once again point to
the same, common set of memslot data.
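
To make this possible, each memslot carries two sets of linkage nodes,
one per memslot set, so the very same structure can be a member of
both sets at the same time.  A rough sketch of the resulting layout
(simplified from patch 6 of this series; the dirty bitmap, flags and
arch data are omitted):

struct kvm_memory_slot {
	struct hlist_node id_node[2];          /* per-set ID hash linkage */
	struct interval_tree_node hva_node[2]; /* per-set hva tree linkage */
	struct rb_node gfn_node[2];            /* per-set gfn tree linkage */
	gfn_t base_gfn;
	unsigned long npages;
	/* ... */
};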

This series implements the aforementioned idea.

The new implementation uses two trees to perform quick lookups:
For tracking of gfns, an ordinary rbtree is used, since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.

For tracking of hvas, however, an interval tree is needed, since hva
ranges can overlap between memslots.

ID to memslot mappings are kept in a hash table instead of using a
statically allocated "id_to_index" array.

The "lru slot" mini-cache, that keeps track of the last found-by-gfn
memslot, is still present in the new code.
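
Putting these parts together, the per-address-space memslot set then
looks roughly like this (field names as in patch 6 of this series; the
is_idx_0 flag selects which of the two per-memslot node sets belongs
to this particular memslot set):

struct kvm_memslots {
	u64 generation;
	atomic_long_t lru_slot;          /* the "lru slot" mini-cache */
	struct rb_root_cached hva_tree;  /* interval tree keyed by hva range */
	struct rb_root gfn_tree;         /* ordinary rbtree keyed by base gfn */
	DECLARE_HASHTABLE(id_hash, 7);   /* memslot ID -> memslot mapping */
	bool is_idx_0;                   /* selects the per-memslot node set */
};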

There was also a desire to make the new structure operate on a "pay as
you go" basis, that is, the user only pays the price of the memslot
count that is actually used, not of the maximum count allowed.

The operation semantics were carefully matched to the original
implementation; the outside-visible behavior should not change,
only the timing will be different.

Making lookup and memslot management operations O(log(n)) brings some
performance benefits (tested on a Xeon 8167M machine):
509 slots in use:
Test            Before          After           Improvement
Map             0.0232s         0.0223s          4%
Unmap           0.0724s         0.0315s         56%
Unmap 2M        0.00155s        0.000869s       44%
Move active     0.000101s       0.0000792s      22%
Move inactive   0.000108s       0.0000847s      21%
Slot setup      0.0135s         0.00803s        41%

100 slots in use:
Test            Before          After           Improvement
Map             0.0195s         0.0191s         None
Unmap           0.0374s         0.0312s         17%
Unmap 2M        0.000470s       0.000447s        5%
Move active     0.0000956s      0.0000800s      16%
Move inactive   0.000101s       0.0000840s      17%
Slot setup      0.00260s        0.00174s        33%

50 slots in use:
Test            Before          After           Improvement
Map             0.0192s         0.0190s         None
Unmap           0.0339s         0.0311s          8%
Unmap 2M        0.000399s       0.000395s       None
Move active     0.0000999s      0.0000854s      15%
Move inactive   0.0000992s      0.0000826s      17%
Slot setup      0.00141s        0.000990s       30%

30 slots in use:
Test            Before          After           Improvement
Map             0.0192s         0.0190s         None
Unmap           0.0325s         0.0310s          5%
Unmap 2M        0.000373s       0.000373s       None
Move active     0.000100s       0.0000865s      14%
Move inactive   0.000106s       0.0000918s      13%
Slot setup      0.000989s       0.000775s       22%

10 slots in use:
Test            Before          After           Improvement
Map             0.0192s         0.0186s          3%
Unmap           0.0313s         0.0310s         None
Unmap 2M        0.000348s       0.000351s       None
Move active     0.000110s       0.0000948s      14%
Move inactive   0.000111s       0.0000933s      16%
Slot setup      0.000342s       0.000283s       17%

32k slots in use:
Test            Before          After           Improvement
Map (8194)       0.200s         0.0541s         73%
Unmap            3.88s          0.0351s         99%
Unmap 2M         3.88s          0.0348s         99%
Move active      0.00142s       0.0000786s      94%
Move inactive    0.00148s       0.0000880s      94%
Slot setup      16.1s           0.59s           96%

Since the map test can only be run with up to 8194 slots, the result
above for this test was obtained by running it with that maximum
number of slots.

For both the old and the new memslot code, the measurements were done
using the new KVM selftest framework, with the TDP MMU disabled
(that is, with its default setting).

On x86-64 the code was tested thoroughly: it passed KVM unit tests and
KVM selftests with KASAN enabled and, of course, booted various guests
successfully (including nested ones with the TDP MMU enabled).
On other KVM platforms the code was compile-tested only.

Changes since v1:
* Drop the already merged HVA handler retpoline-friendliness patch,

* Split the scalable memslots patch into 8 smaller ones,

* Rebase onto the current kvm/queue,

* Make sure that the ranges at both of a memslot's hva_nodes are always
initialized,

* Remove kvm_mmu_calculate_default_mmu_pages() prototype, too,
when removing this function,

* Redo benchmarks, measure 32k memslots on the old implementation, too.

Changes since v2:
* Rebase onto the current kvm/queue, taking into account the now-merged
MMU notifiers rewrite.
This reduces the diffstat by ~50 lines.

 arch/arm64/kvm/Kconfig              |   1 +
 arch/arm64/kvm/mmu.c                |   8 +-
 arch/mips/kvm/Kconfig               |   1 +
 arch/powerpc/kvm/Kconfig            |   1 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c |   4 +-
 arch/powerpc/kvm/book3s_64_vio.c    |   2 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c |   2 +-
 arch/powerpc/kvm/book3s_hv.c        |   3 +-
 arch/powerpc/kvm/book3s_hv_nested.c |   4 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c  |  14 +-
 arch/s390/kvm/Kconfig               |   1 +
 arch/s390/kvm/kvm-s390.c            |  66 +--
 arch/s390/kvm/kvm-s390.h            |  15 +
 arch/s390/kvm/pv.c                  |   4 +-
 arch/x86/include/asm/kvm_host.h     |   2 +-
 arch/x86/kvm/Kconfig                |   1 +
 arch/x86/kvm/mmu/mmu.c              |  65 +--
 arch/x86/kvm/x86.c                  |  18 +-
 include/linux/kvm_host.h            | 139 ++++---
 virt/kvm/kvm_main.c                 | 604 ++++++++++++++++------------
 20 files changed, 555 insertions(+), 400 deletions(-)


* [PATCH v3 1/8] KVM: x86: Cache total page count to avoid traversing the memslot array
  2021-05-16 21:44 [PATCH v3 0/8] KVM: Scalable memslots implementation Maciej S. Szmigiero
@ 2021-05-16 21:44 ` Maciej S. Szmigiero
  2021-05-19 21:00   ` Sean Christopherson
  2021-05-16 21:44 ` [PATCH v3 2/8] KVM: Integrate gfn_to_memslot_approx() into search_memslots() Maciej S. Szmigiero
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-16 21:44 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Igor Mammedov,
	Marc Zyngier, James Morse, Julien Thierry, Suzuki K Poulose,
	Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Christian Borntraeger, Janosch Frank, David Hildenbrand,
	Cornelia Huck, Claudio Imbrenda, Joerg Roedel, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

There is no point in recalculating from scratch the total number of pages
in all memslots each time a memslot is created or deleted.

Just cache the value and update it accordingly on each such operation so
the code doesn't need to traverse the whole memslot array each time.
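
For illustration, this is roughly how the cached value then feeds the
(otherwise unchanged) MMU page limit computation in
kvm_arch_commit_memory_region(); the constants quoted in the comment
are the current x86 values and serve only as an example:

	/*
	 * With KVM_PERMILLE_MMU_PAGES == 20 and KVM_MIN_ALLOC_MMU_PAGES == 64,
	 * 4 GiB of memslots (1048576 guest pages) gives
	 * 1048576 * 20 / 1000 = 20971 MMU pages, well above the 64-page floor.
	 */
	nr_mmu_pages = kvm->arch.n_memslots_pages *
		       KVM_PERMILLE_MMU_PAGES / 1000;
	nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);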

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 24 ------------------------
 arch/x86/kvm/x86.c              | 18 +++++++++++++++---
 3 files changed, 16 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55efbacfc244..e594f54a3875 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -975,6 +975,7 @@ struct kvm_x86_msr_filter {
 #define APICV_INHIBIT_REASON_X2APIC	5
 
 struct kvm_arch {
+	unsigned long n_memslots_pages;
 	unsigned long n_used_mmu_pages;
 	unsigned long n_requested_mmu_pages;
 	unsigned long n_max_mmu_pages;
@@ -1472,7 +1473,6 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   struct kvm_memory_slot *memslot);
 void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
-unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
 
 int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0144c40d09c7..1e46a0ce034b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5890,30 +5890,6 @@ int kvm_mmu_module_init(void)
 	return ret;
 }
 
-/*
- * Calculate mmu pages needed for kvm.
- */
-unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm)
-{
-	unsigned long nr_mmu_pages;
-	unsigned long nr_pages = 0;
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
-	int i;
-
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		slots = __kvm_memslots(kvm, i);
-
-		kvm_for_each_memslot(memslot, slots)
-			nr_pages += memslot->npages;
-	}
-
-	nr_mmu_pages = nr_pages * KVM_PERMILLE_MMU_PAGES / 1000;
-	nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
-
-	return nr_mmu_pages;
-}
-
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_unload(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5bd550eaf683..8c7738b75393 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11112,9 +11112,21 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 				const struct kvm_memory_slot *new,
 				enum kvm_mr_change change)
 {
-	if (!kvm->arch.n_requested_mmu_pages)
-		kvm_mmu_change_mmu_pages(kvm,
-				kvm_mmu_calculate_default_mmu_pages(kvm));
+	if (change == KVM_MR_CREATE)
+		kvm->arch.n_memslots_pages += new->npages;
+	else if (change == KVM_MR_DELETE) {
+		WARN_ON(kvm->arch.n_memslots_pages < old->npages);
+		kvm->arch.n_memslots_pages -= old->npages;
+	}
+
+	if (!kvm->arch.n_requested_mmu_pages) {
+		unsigned long nr_mmu_pages;
+
+		nr_mmu_pages = kvm->arch.n_memslots_pages *
+			       KVM_PERMILLE_MMU_PAGES / 1000;
+		nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
+		kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
+	}
 
 	/*
 	 * FIXME: const-ify all uses of struct kvm_memory_slot.


* [PATCH v3 2/8] KVM: Integrate gfn_to_memslot_approx() into search_memslots()
  2021-05-16 21:44 [PATCH v3 0/8] KVM: Scalable memslots implementation Maciej S. Szmigiero
  2021-05-16 21:44 ` [PATCH v3 1/8] KVM: x86: Cache total page count to avoid traversing the memslot array Maciej S. Szmigiero
@ 2021-05-16 21:44 ` Maciej S. Szmigiero
  2021-05-19 21:24   ` Sean Christopherson
  2021-05-16 21:44 ` [PATCH v3 3/8] KVM: Resolve memslot ID via a hash table instead of via a static array Maciej S. Szmigiero
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-16 21:44 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Igor Mammedov,
	Marc Zyngier, James Morse, Julien Thierry, Suzuki K Poulose,
	Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Christian Borntraeger, Janosch Frank, David Hildenbrand,
	Cornelia Huck, Claudio Imbrenda, Joerg Roedel, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

The s390 arch has gfn_to_memslot_approx(), which is almost identical to
search_memslots(); it differs only in that when the gfn falls in a hole
one of the memslots bordering the hole is returned.

Add this lookup mode as an option to search_memslots() so we don't have two
almost identical functions for looking up a memslot by its gfn.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 arch/powerpc/kvm/book3s_64_vio.c    |  2 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  2 +-
 arch/s390/kvm/kvm-s390.c            | 39 ++---------------------------
 include/linux/kvm_host.h            | 13 +++++++---
 4 files changed, 14 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 8da93fdfa59e..148525120504 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -346,7 +346,7 @@ static long kvmppc_tce_to_ua(struct kvm *kvm, unsigned long tce,
 	unsigned long gfn = tce >> PAGE_SHIFT;
 	struct kvm_memory_slot *memslot;
 
-	memslot = search_memslots(kvm_memslots(kvm), gfn);
+	memslot = search_memslots(kvm_memslots(kvm), gfn, false);
 	if (!memslot)
 		return -EINVAL;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 083a4e037718..a4042403630d 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -80,7 +80,7 @@ static long kvmppc_rm_tce_to_ua(struct kvm *kvm,
 	unsigned long gfn = tce >> PAGE_SHIFT;
 	struct kvm_memory_slot *memslot;
 
-	memslot = search_memslots(kvm_memslots_raw(kvm), gfn);
+	memslot = search_memslots(kvm_memslots_raw(kvm), gfn, false);
 	if (!memslot)
 		return -EINVAL;
 
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 1296fc10f80c..75e635ede6ff 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1921,41 +1921,6 @@ static long kvm_s390_set_skeys(struct kvm *kvm, struct kvm_s390_skeys *args)
 /* for consistency */
 #define KVM_S390_CMMA_SIZE_MAX ((u32)KVM_S390_SKEYS_MAX)
 
-/*
- * Similar to gfn_to_memslot, but returns the index of a memslot also when the
- * address falls in a hole. In that case the index of one of the memslots
- * bordering the hole is returned.
- */
-static int gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn)
-{
-	int start = 0, end = slots->used_slots;
-	int slot = atomic_read(&slots->lru_slot);
-	struct kvm_memory_slot *memslots = slots->memslots;
-
-	if (gfn >= memslots[slot].base_gfn &&
-	    gfn < memslots[slot].base_gfn + memslots[slot].npages)
-		return slot;
-
-	while (start < end) {
-		slot = start + (end - start) / 2;
-
-		if (gfn >= memslots[slot].base_gfn)
-			end = slot;
-		else
-			start = slot + 1;
-	}
-
-	if (start >= slots->used_slots)
-		return slots->used_slots - 1;
-
-	if (gfn >= memslots[start].base_gfn &&
-	    gfn < memslots[start].base_gfn + memslots[start].npages) {
-		atomic_set(&slots->lru_slot, start);
-	}
-
-	return start;
-}
-
 static int kvm_s390_peek_cmma(struct kvm *kvm, struct kvm_s390_cmma_log *args,
 			      u8 *res, unsigned long bufsize)
 {
@@ -1982,8 +1947,8 @@ static int kvm_s390_peek_cmma(struct kvm *kvm, struct kvm_s390_cmma_log *args,
 static unsigned long kvm_s390_next_dirty_cmma(struct kvm_memslots *slots,
 					      unsigned long cur_gfn)
 {
-	int slotidx = gfn_to_memslot_approx(slots, cur_gfn);
-	struct kvm_memory_slot *ms = slots->memslots + slotidx;
+	struct kvm_memory_slot *ms = search_memslots(slots, cur_gfn, true);
+	int slotidx = ms - slots->memslots;
 	unsigned long ofs = cur_gfn - ms->base_gfn;
 
 	if (ms->base_gfn + ms->npages <= cur_gfn) {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8895b95b6a22..3c40c7d32f7e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1091,10 +1091,14 @@ bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
  * gfn_to_memslot() itself isn't here as an inline because that would
  * bloat other code too much.
  *
+ * With "approx" set returns the memslot also when the address falls
+ * in a hole. In that case one of the memslots bordering the hole is
+ * returned.
+ *
  * IMPORTANT: Slots are sorted from highest GFN to lowest GFN!
  */
 static inline struct kvm_memory_slot *
-search_memslots(struct kvm_memslots *slots, gfn_t gfn)
+search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)
 {
 	int start = 0, end = slots->used_slots;
 	int slot = atomic_read(&slots->lru_slot);
@@ -1116,19 +1120,22 @@ search_memslots(struct kvm_memslots *slots, gfn_t gfn)
 			start = slot + 1;
 	}
 
+	if (approx && start >= slots->used_slots)
+		return &memslots[slots->used_slots - 1];
+
 	if (start < slots->used_slots && gfn >= memslots[start].base_gfn &&
 	    gfn < memslots[start].base_gfn + memslots[start].npages) {
 		atomic_set(&slots->lru_slot, start);
 		return &memslots[start];
 	}
 
-	return NULL;
+	return approx ? &memslots[start] : NULL;
 }
 
 static inline struct kvm_memory_slot *
 __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
 {
-	return search_memslots(slots, gfn);
+	return search_memslots(slots, gfn, false);
 }
 
 static inline unsigned long


* [PATCH v3 3/8] KVM: Resolve memslot ID via a hash table instead of via a static array
  2021-05-16 21:44 [PATCH v3 0/8] KVM: Scalable memslots implementation Maciej S. Szmigiero
  2021-05-16 21:44 ` [PATCH v3 1/8] KVM: x86: Cache total page count to avoid traversing the memslot array Maciej S. Szmigiero
  2021-05-16 21:44 ` [PATCH v3 2/8] KVM: Integrate gfn_to_memslot_approx() into search_memslots() Maciej S. Szmigiero
@ 2021-05-16 21:44 ` Maciej S. Szmigiero
  2021-05-19 22:31   ` Sean Christopherson
  2021-05-16 21:44 ` [PATCH v3 4/8] KVM: Introduce memslots hva tree Maciej S. Szmigiero
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-16 21:44 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Igor Mammedov,
	Marc Zyngier, James Morse, Julien Thierry, Suzuki K Poulose,
	Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Christian Borntraeger, Janosch Frank, David Hildenbrand,
	Cornelia Huck, Claudio Imbrenda, Joerg Roedel, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Mappings from a memslot ID to the corresponding memslot are currently
kept as indices in a static id_to_index array.
The size of this array depends on the maximum allowed memslot count
(regardless of the number of memslots actually in use).

This has become especially problematic recently, when the memslot count
cap was removed, so the maximum count is now a full 32k memslots, the
maximum allowed by the current KVM API.

Keeping these IDs in a hash table (instead of an array) avoids this
problem.

Resolving a memslot ID to the actual memslot (instead of its index) will
also enable transitioning away from an array-based implementation of the
whole memslots structure in a later commit.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/linux/kvm_host.h | 16 +++++------
 virt/kvm/kvm_main.c      | 58 ++++++++++++++++++++++++++++++----------
 2 files changed, 52 insertions(+), 22 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3c40c7d32f7e..d3a35646dfd8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -27,6 +27,7 @@
 #include <linux/rcuwait.h>
 #include <linux/refcount.h>
 #include <linux/nospec.h>
+#include <linux/hashtable.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -356,6 +357,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
 
 struct kvm_memory_slot {
+	struct hlist_node id_node;
 	gfn_t base_gfn;
 	unsigned long npages;
 	unsigned long *dirty_bitmap;
@@ -458,7 +460,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 struct kvm_memslots {
 	u64 generation;
 	/* The mapping table from slot id to the index in memslots[]. */
-	short id_to_index[KVM_MEM_SLOTS_NUM];
+	DECLARE_HASHTABLE(id_hash, 7);
 	atomic_t lru_slot;
 	int used_slots;
 	struct kvm_memory_slot memslots[];
@@ -680,16 +682,14 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
 static inline
 struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
 {
-	int index = slots->id_to_index[id];
 	struct kvm_memory_slot *slot;
 
-	if (index < 0)
-		return NULL;
-
-	slot = &slots->memslots[index];
+	hash_for_each_possible(slots->id_hash, slot, id_node, id) {
+		if (slot->id == id)
+			return slot;
+	}
 
-	WARN_ON(slot->id != id);
-	return slot;
+	return NULL;
 }
 
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6b4feb92dc79..50f9bc9bb1e0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -781,15 +781,13 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 static struct kvm_memslots *kvm_alloc_memslots(void)
 {
-	int i;
 	struct kvm_memslots *slots;
 
 	slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT);
 	if (!slots)
 		return NULL;
 
-	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
-		slots->id_to_index[i] = -1;
+	hash_init(slots->id_hash);
 
 	return slots;
 }
@@ -1097,14 +1095,16 @@ static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot)
 /*
  * Delete a memslot by decrementing the number of used slots and shifting all
  * other entries in the array forward one spot.
+ * @memslot is a detached dummy struct with just .id and .as_id filled.
  */
 static inline void kvm_memslot_delete(struct kvm_memslots *slots,
 				      struct kvm_memory_slot *memslot)
 {
 	struct kvm_memory_slot *mslots = slots->memslots;
+	struct kvm_memory_slot *dmemslot = id_to_memslot(slots, memslot->id);
 	int i;
 
-	if (WARN_ON(slots->id_to_index[memslot->id] == -1))
+	if (WARN_ON(!dmemslot))
 		return;
 
 	slots->used_slots--;
@@ -1112,12 +1112,13 @@ static inline void kvm_memslot_delete(struct kvm_memslots *slots,
 	if (atomic_read(&slots->lru_slot) >= slots->used_slots)
 		atomic_set(&slots->lru_slot, 0);
 
-	for (i = slots->id_to_index[memslot->id]; i < slots->used_slots; i++) {
+	for (i = dmemslot - mslots; i < slots->used_slots; i++) {
+		hash_del(&mslots[i].id_node);
 		mslots[i] = mslots[i + 1];
-		slots->id_to_index[mslots[i].id] = i;
+		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
 	}
+	hash_del(&mslots[i].id_node);
 	mslots[i] = *memslot;
-	slots->id_to_index[memslot->id] = -1;
 }
 
 /*
@@ -1135,31 +1136,41 @@ static inline int kvm_memslot_insert_back(struct kvm_memslots *slots)
  * itself is not preserved in the array, i.e. not swapped at this time, only
  * its new index into the array is tracked.  Returns the changed memslot's
  * current index into the memslots array.
+ * The memslot at the returned index will not be in @slots->id_hash by then.
+ * @memslot is a detached struct with desired final data of the changed slot.
  */
 static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
 					    struct kvm_memory_slot *memslot)
 {
 	struct kvm_memory_slot *mslots = slots->memslots;
+	struct kvm_memory_slot *mmemslot = id_to_memslot(slots, memslot->id);
 	int i;
 
-	if (WARN_ON_ONCE(slots->id_to_index[memslot->id] == -1) ||
+	if (WARN_ON_ONCE(!mmemslot) ||
 	    WARN_ON_ONCE(!slots->used_slots))
 		return -1;
 
+	/*
+	 * update_memslots() will unconditionally overwrite and re-add the
+	 * target memslot so it has to be removed here first
+	 */
+	hash_del(&mmemslot->id_node);
+
 	/*
 	 * Move the target memslot backward in the array by shifting existing
 	 * memslots with a higher GFN (than the target memslot) towards the
 	 * front of the array.
 	 */
-	for (i = slots->id_to_index[memslot->id]; i < slots->used_slots - 1; i++) {
+	for (i = mmemslot - mslots; i < slots->used_slots - 1; i++) {
 		if (memslot->base_gfn > mslots[i + 1].base_gfn)
 			break;
 
 		WARN_ON_ONCE(memslot->base_gfn == mslots[i + 1].base_gfn);
 
 		/* Shift the next memslot forward one and update its index. */
+		hash_del(&mslots[i + 1].id_node);
 		mslots[i] = mslots[i + 1];
-		slots->id_to_index[mslots[i].id] = i;
+		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
 	}
 	return i;
 }
@@ -1170,6 +1181,10 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
  * is not preserved in the array, i.e. not swapped at this time, only its new
  * index into the array is tracked.  Returns the changed memslot's final index
  * into the memslots array.
+ * The memslot at the returned index will not be in @slots->id_hash by then.
+ * @memslot is a detached struct with desired final data of the new or
+ * changed slot.
+ * Assumes that the memslot at @start index is not in @slots->id_hash.
  */
 static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
 					   struct kvm_memory_slot *memslot,
@@ -1185,8 +1200,9 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
 		WARN_ON_ONCE(memslot->base_gfn == mslots[i - 1].base_gfn);
 
 		/* Shift the next memslot back one and update its index. */
+		hash_del(&mslots[i - 1].id_node);
 		mslots[i] = mslots[i - 1];
-		slots->id_to_index[mslots[i].id] = i;
+		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
 	}
 	return i;
 }
@@ -1231,6 +1247,9 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
  * most likely to be referenced, sorting it to the front of the array was
  * advantageous.  The current binary search starts from the middle of the array
  * and uses an LRU pointer to improve performance for all memslots and GFNs.
+ *
+ * @memslot is a detached struct, not a part of the current or new memslot
+ * array.
  */
 static void update_memslots(struct kvm_memslots *slots,
 			    struct kvm_memory_slot *memslot,
@@ -1247,12 +1266,16 @@ static void update_memslots(struct kvm_memslots *slots,
 			i = kvm_memslot_move_backward(slots, memslot);
 		i = kvm_memslot_move_forward(slots, memslot, i);
 
+		if (i < 0)
+			return;
+
 		/*
 		 * Copy the memslot to its new position in memslots and update
 		 * its index accordingly.
 		 */
 		slots->memslots[i] = *memslot;
-		slots->id_to_index[memslot->id] = i;
+		hash_add(slots->id_hash, &slots->memslots[i].id_node,
+			 memslot->id);
 	}
 }
 
@@ -1316,6 +1339,7 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
 {
 	struct kvm_memslots *slots;
 	size_t old_size, new_size;
+	struct kvm_memory_slot *memslot;
 
 	old_size = sizeof(struct kvm_memslots) +
 		   (sizeof(struct kvm_memory_slot) * old->used_slots);
@@ -1326,8 +1350,14 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
 		new_size = old_size;
 
 	slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
-	if (likely(slots))
-		memcpy(slots, old, old_size);
+	if (unlikely(!slots))
+		return NULL;
+
+	memcpy(slots, old, old_size);
+
+	hash_init(slots->id_hash);
+	kvm_for_each_memslot(memslot, slots)
+		hash_add(slots->id_hash, &memslot->id_node, memslot->id);
 
 	return slots;
 }


* [PATCH v3 4/8] KVM: Introduce memslots hva tree
  2021-05-16 21:44 [PATCH v3 0/8] KVM: Scalable memslots implementation Maciej S. Szmigiero
                   ` (2 preceding siblings ...)
  2021-05-16 21:44 ` [PATCH v3 3/8] KVM: Resolve memslot ID via a hash table instead of via a static array Maciej S. Szmigiero
@ 2021-05-16 21:44 ` Maciej S. Szmigiero
  2021-05-19 23:07   ` Sean Christopherson
  2021-05-16 21:44 ` [PATCH v3 5/8] KVM: s390: Introduce kvm_s390_get_gfn_end() Maciej S. Szmigiero
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-16 21:44 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Igor Mammedov,
	Marc Zyngier, James Morse, Julien Thierry, Suzuki K Poulose,
	Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Christian Borntraeger, Janosch Frank, David Hildenbrand,
	Cornelia Huck, Claudio Imbrenda, Joerg Roedel, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

The current memslots implementation only allows quick binary search by
gfn; quick lookup by hva is not possible, as the implementation has to
do a linear scan of the whole memslots array, even though the operation
being performed might apply just to a single memslot.

This significantly hurts performance of per-hva operations with higher
memslot counts.

Since hva ranges can overlap between memslots, an interval tree is
needed for tracking them.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 arch/arm64/kvm/Kconfig   |  1 +
 arch/mips/kvm/Kconfig    |  1 +
 arch/powerpc/kvm/Kconfig |  1 +
 arch/s390/kvm/Kconfig    |  1 +
 arch/x86/kvm/Kconfig     |  1 +
 include/linux/kvm_host.h |  8 ++++++++
 virt/kvm/kvm_main.c      | 43 +++++++++++++++++++++++++++++++++-------
 7 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 3964acf5451e..f075e9939a2a 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -40,6 +40,7 @@ menuconfig KVM
 	select HAVE_KVM_VCPU_RUN_PID_CHANGE
 	select TASKSTATS
 	select TASK_DELAY_ACCT
+	select INTERVAL_TREE
 	help
 	  Support hosting virtualized guest machines.
 
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index a77297480f56..91d197bee9c0 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -27,6 +27,7 @@ config KVM
 	select KVM_MMIO
 	select MMU_NOTIFIER
 	select SRCU
+	select INTERVAL_TREE
 	help
 	  Support for hosting Guest kernels.
 
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index e45644657d49..519d6d3642a5 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -26,6 +26,7 @@ config KVM
 	select KVM_VFIO
 	select IRQ_BYPASS_MANAGER
 	select HAVE_KVM_IRQ_BYPASS
+	select INTERVAL_TREE
 
 config KVM_BOOK3S_HANDLER
 	bool
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 67a8e770e369..2e84d3922f7c 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -33,6 +33,7 @@ config KVM
 	select HAVE_KVM_NO_POLL
 	select SRCU
 	select KVM_VFIO
+	select INTERVAL_TREE
 	help
 	  Support hosting paravirtualized guest machines using the SIE
 	  virtualization capability on the mainframe. This should work
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index f6b93a35ce14..ee15f7e113c8 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -46,6 +46,7 @@ config KVM
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
 	select SRCU
+	select INTERVAL_TREE
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d3a35646dfd8..f59847b6e9b3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -27,6 +27,7 @@
 #include <linux/rcuwait.h>
 #include <linux/refcount.h>
 #include <linux/nospec.h>
+#include <linux/interval_tree.h>
 #include <linux/hashtable.h>
 #include <asm/signal.h>
 
@@ -358,6 +359,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 
 struct kvm_memory_slot {
 	struct hlist_node id_node;
+	struct interval_tree_node hva_node;
 	gfn_t base_gfn;
 	unsigned long npages;
 	unsigned long *dirty_bitmap;
@@ -459,6 +461,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
  */
 struct kvm_memslots {
 	u64 generation;
+	struct rb_root_cached hva_tree;
 	/* The mapping table from slot id to the index in memslots[]. */
 	DECLARE_HASHTABLE(id_hash, 7);
 	atomic_t lru_slot;
@@ -679,6 +682,11 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
 	return __kvm_memslots(vcpu->kvm, as_id);
 }
 
+#define kvm_for_each_hva_range_memslot(node, slots, start, last)	     \
+	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
+	     node;							     \
+	     node = interval_tree_iter_next(node, start, last))	     \
+
 static inline
 struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 50f9bc9bb1e0..a55309432c9a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -488,6 +488,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 	struct kvm_memslots *slots;
 	int i, idx;
 
+	if (range->end == range->start || WARN_ON(range->end < range->start))
+		return 0;
+
 	/* A null handler is allowed if and only if on_lock() is provided. */
 	if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
 			 IS_KVM_NULL_FN(range->handler)))
@@ -507,15 +510,18 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		struct interval_tree_node *node;
+
 		slots = __kvm_memslots(kvm, i);
-		kvm_for_each_memslot(slot, slots) {
+		kvm_for_each_hva_range_memslot(node, slots,
+					       range->start, range->end - 1) {
 			unsigned long hva_start, hva_end;
 
+			slot = container_of(node, struct kvm_memory_slot,
+					    hva_node);
 			hva_start = max(range->start, slot->userspace_addr);
 			hva_end = min(range->end, slot->userspace_addr +
 						  (slot->npages << PAGE_SHIFT));
-			if (hva_start >= hva_end)
-				continue;
 
 			/*
 			 * To optimize for the likely case where the address
@@ -787,6 +793,7 @@ static struct kvm_memslots *kvm_alloc_memslots(void)
 	if (!slots)
 		return NULL;
 
+	slots->hva_tree = RB_ROOT_CACHED;
 	hash_init(slots->id_hash);
 
 	return slots;
@@ -1113,10 +1120,14 @@ static inline void kvm_memslot_delete(struct kvm_memslots *slots,
 		atomic_set(&slots->lru_slot, 0);
 
 	for (i = dmemslot - mslots; i < slots->used_slots; i++) {
+		interval_tree_remove(&mslots[i].hva_node, &slots->hva_tree);
 		hash_del(&mslots[i].id_node);
+
 		mslots[i] = mslots[i + 1];
+		interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
 		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
 	}
+	interval_tree_remove(&mslots[i].hva_node, &slots->hva_tree);
 	hash_del(&mslots[i].id_node);
 	mslots[i] = *memslot;
 }
@@ -1136,7 +1147,8 @@ static inline int kvm_memslot_insert_back(struct kvm_memslots *slots)
  * itself is not preserved in the array, i.e. not swapped at this time, only
  * its new index into the array is tracked.  Returns the changed memslot's
  * current index into the memslots array.
- * The memslot at the returned index will not be in @slots->id_hash by then.
+ * The memslot at the returned index will not be in @slots->hva_tree or
+ * @slots->id_hash by then.
  * @memslot is a detached struct with desired final data of the changed slot.
  */
 static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
@@ -1154,6 +1166,7 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
 	 * update_memslots() will unconditionally overwrite and re-add the
 	 * target memslot so it has to be removed here first
 	 */
+	interval_tree_remove(&mmemslot->hva_node, &slots->hva_tree);
 	hash_del(&mmemslot->id_node);
 
 	/*
@@ -1168,8 +1181,11 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
 		WARN_ON_ONCE(memslot->base_gfn == mslots[i + 1].base_gfn);
 
 		/* Shift the next memslot forward one and update its index. */
+		interval_tree_remove(&mslots[i + 1].hva_node, &slots->hva_tree);
 		hash_del(&mslots[i + 1].id_node);
+
 		mslots[i] = mslots[i + 1];
+		interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
 		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
 	}
 	return i;
@@ -1181,10 +1197,12 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
  * is not preserved in the array, i.e. not swapped at this time, only its new
  * index into the array is tracked.  Returns the changed memslot's final index
  * into the memslots array.
- * The memslot at the returned index will not be in @slots->id_hash by then.
+ * The memslot at the returned index will not be in @slots->hva_tree or
+ * @slots->id_hash by then.
  * @memslot is a detached struct with desired final data of the new or
  * changed slot.
- * Assumes that the memslot at @start index is not in @slots->id_hash.
+ * Assumes that the memslot at @start index is not in @slots->hva_tree or
+ * @slots->id_hash.
  */
 static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
 					   struct kvm_memory_slot *memslot,
@@ -1200,8 +1218,11 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
 		WARN_ON_ONCE(memslot->base_gfn == mslots[i - 1].base_gfn);
 
 		/* Shift the next memslot back one and update its index. */
+		interval_tree_remove(&mslots[i - 1].hva_node, &slots->hva_tree);
 		hash_del(&mslots[i - 1].id_node);
+
 		mslots[i] = mslots[i - 1];
+		interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
 		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
 	}
 	return i;
@@ -1274,6 +1295,11 @@ static void update_memslots(struct kvm_memslots *slots,
 		 * its index accordingly.
 		 */
 		slots->memslots[i] = *memslot;
+		slots->memslots[i].hva_node.start = memslot->userspace_addr;
+		slots->memslots[i].hva_node.last = memslot->userspace_addr +
+			(memslot->npages << PAGE_SHIFT) - 1;
+		interval_tree_insert(&slots->memslots[i].hva_node,
+				     &slots->hva_tree);
 		hash_add(slots->id_hash, &slots->memslots[i].id_node,
 			 memslot->id);
 	}
@@ -1355,9 +1381,12 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
 
 	memcpy(slots, old, old_size);
 
+	slots->hva_tree = RB_ROOT_CACHED;
 	hash_init(slots->id_hash);
-	kvm_for_each_memslot(memslot, slots)
+	kvm_for_each_memslot(memslot, slots) {
+		interval_tree_insert(&memslot->hva_node, &slots->hva_tree);
 		hash_add(slots->id_hash, &memslot->id_node, memslot->id);
+	}
 
 	return slots;
 }


* [PATCH v3 5/8] KVM: s390: Introduce kvm_s390_get_gfn_end()
  2021-05-16 21:44 [PATCH v3 0/8] KVM: Scalable memslots implementation Maciej S. Szmigiero
                   ` (3 preceding siblings ...)
  2021-05-16 21:44 ` [PATCH v3 4/8] KVM: Introduce memslots hva tree Maciej S. Szmigiero
@ 2021-05-16 21:44 ` Maciej S. Szmigiero
  2021-05-16 21:44 ` [PATCH v3 6/8] KVM: Keep memslots in tree-based structures instead of array-based ones Maciej S. Szmigiero
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-16 21:44 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Igor Mammedov,
	Marc Zyngier, James Morse, Julien Thierry, Suzuki K Poulose,
	Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Christian Borntraeger, Janosch Frank, David Hildenbrand,
	Cornelia Huck, Claudio Imbrenda, Joerg Roedel, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

And use it where the s390 code would otherwise access the memslot with
the highest gfn directly.

No functional change intended.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 arch/s390/kvm/kvm-s390.c |  2 +-
 arch/s390/kvm/kvm-s390.h | 12 ++++++++++++
 arch/s390/kvm/pv.c       |  4 +---
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 75e635ede6ff..c6dfb9c24c8c 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1986,7 +1986,7 @@ static int kvm_s390_get_cmma(struct kvm *kvm, struct kvm_s390_cmma_log *args,
 	if (!ms)
 		return 0;
 	next_gfn = kvm_s390_next_dirty_cmma(slots, cur_gfn + 1);
-	mem_end = slots->memslots[0].base_gfn + slots->memslots[0].npages;
+	mem_end = kvm_s390_get_gfn_end(slots);
 
 	while (args->count < bufsize) {
 		hva = gfn_to_hva(kvm, cur_gfn);
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 9fad25109b0d..5787b12aff7e 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -208,6 +208,18 @@ static inline int kvm_s390_user_cpu_state_ctrl(struct kvm *kvm)
 	return kvm->arch.user_cpu_state_ctrl != 0;
 }
 
+/* get the end gfn of the last (highest gfn) memslot */
+static inline unsigned long kvm_s390_get_gfn_end(struct kvm_memslots *slots)
+{
+	struct kvm_memory_slot *ms;
+
+	if (WARN_ON(!slots->used_slots))
+		return 0;
+
+	ms = slots->memslots;
+	return ms->base_gfn + ms->npages;
+}
+
 /* implemented in pv.c */
 int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
 int kvm_s390_pv_create_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 813b6e93dc83..6bf42cdf4013 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -117,7 +117,6 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
 	unsigned long base = uv_info.guest_base_stor_len;
 	unsigned long virt = uv_info.guest_virt_var_stor_len;
 	unsigned long npages = 0, vlen = 0;
-	struct kvm_memory_slot *memslot;
 
 	kvm->arch.pv.stor_var = NULL;
 	kvm->arch.pv.stor_base = __get_free_pages(GFP_KERNEL_ACCOUNT, get_order(base));
@@ -131,8 +130,7 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
 	 * Slots are sorted by GFN
 	 */
 	mutex_lock(&kvm->slots_lock);
-	memslot = kvm_memslots(kvm)->memslots;
-	npages = memslot->base_gfn + memslot->npages;
+	npages = kvm_s390_get_gfn_end(kvm_memslots(kvm));
 	mutex_unlock(&kvm->slots_lock);
 
 	kvm->arch.pv.guest_len = npages * PAGE_SIZE;


* [PATCH v3 6/8] KVM: Keep memslots in tree-based structures instead of array-based ones
  2021-05-16 21:44 [PATCH v3 0/8] KVM: Scalable memslots implementation Maciej S. Szmigiero
                   ` (4 preceding siblings ...)
  2021-05-16 21:44 ` [PATCH v3 5/8] KVM: s390: Introduce kvm_s390_get_gfn_end() Maciej S. Szmigiero
@ 2021-05-16 21:44 ` Maciej S. Szmigiero
  2021-05-19 23:10   ` Sean Christopherson
  2021-05-25 23:21   ` Sean Christopherson
  2021-05-16 21:44 ` [PATCH v3 7/8] KVM: Optimize gfn lookup in kvm_zap_gfn_range() Maciej S. Szmigiero
  2021-05-16 21:44 ` [PATCH v3 8/8] KVM: Optimize overlapping memslots check Maciej S. Szmigiero
  7 siblings, 2 replies; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-16 21:44 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Igor Mammedov,
	Marc Zyngier, James Morse, Julien Thierry, Suzuki K Poulose,
	Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Christian Borntraeger, Janosch Frank, David Hildenbrand,
	Cornelia Huck, Claudio Imbrenda, Joerg Roedel, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of memslots.

Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.

Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified; copying all the memslots currently present
is just a limitation of the array-based memslot implementation.

Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.

In order to have two memslot sets, but only one copy of the actual
memslots, it is necessary to split out the memslot data from the memslot
sets.

The memslots themselves should also be kept independent of each other
so they can be individually added or deleted.

These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy.  After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.

This commit implements the aforementioned idea.

For tracking of gfns, an ordinary rbtree is used, since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.

The "lru slot" mini-cache, that keeps track of the last found-by-gfn
memslot, is still present in the new code.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 arch/arm64/kvm/mmu.c                |   8 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |   4 +-
 arch/powerpc/kvm/book3s_hv.c        |   3 +-
 arch/powerpc/kvm/book3s_hv_nested.c |   4 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c  |  14 +-
 arch/s390/kvm/kvm-s390.c            |  27 +-
 arch/s390/kvm/kvm-s390.h            |   7 +-
 arch/x86/kvm/mmu/mmu.c              |   4 +-
 include/linux/kvm_host.h            | 100 ++---
 virt/kvm/kvm_main.c                 | 580 ++++++++++++++--------------
 10 files changed, 379 insertions(+), 372 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c5d1f3c87dbd..2b4ced4f1e55 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -199,13 +199,13 @@ static void stage2_flush_vm(struct kvm *kvm)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
-	int idx;
+	int idx, ctr;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
 
 	slots = kvm_memslots(kvm);
-	kvm_for_each_memslot(memslot, slots)
+	kvm_for_each_memslot(memslot, ctr, slots)
 		stage2_flush_memslot(kvm, memslot);
 
 	spin_unlock(&kvm->mmu_lock);
@@ -536,14 +536,14 @@ void stage2_unmap_vm(struct kvm *kvm)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
-	int idx;
+	int idx, ctr;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	mmap_read_lock(current->mm);
 	spin_lock(&kvm->mmu_lock);
 
 	slots = kvm_memslots(kvm);
-	kvm_for_each_memslot(memslot, slots)
+	kvm_for_each_memslot(memslot, ctr, slots)
 		stage2_unmap_memslot(kvm, memslot);
 
 	spin_unlock(&kvm->mmu_lock);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 2d9193cd73be..dbdb6d1b2de8 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -734,11 +734,11 @@ void kvmppc_rmap_reset(struct kvm *kvm)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
-	int srcu_idx;
+	int srcu_idx, ctr;
 
 	srcu_idx = srcu_read_lock(&kvm->srcu);
 	slots = kvm_memslots(kvm);
-	kvm_for_each_memslot(memslot, slots) {
+	kvm_for_each_memslot(memslot, ctr, slots) {
 		/* Mutual exclusion with kvm_unmap_hva_range etc. */
 		spin_lock(&kvm->mmu_lock);
 		/*
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 28a80d240b76..15c9c0db06ba 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5619,11 +5619,12 @@ static int kvmhv_svm_off(struct kvm *kvm)
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		struct kvm_memory_slot *memslot;
 		struct kvm_memslots *slots = __kvm_memslots(kvm, i);
+		int ctr;
 
 		if (!slots)
 			continue;
 
-		kvm_for_each_memslot(memslot, slots) {
+		kvm_for_each_memslot(memslot, ctr, slots) {
 			kvmppc_uvmem_drop_pages(memslot, kvm, true);
 			uv_unregister_mem_slot(kvm->arch.lpid, memslot->id);
 		}
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 60724f674421..822fff767407 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -721,7 +721,7 @@ void kvmhv_release_all_nested(struct kvm *kvm)
 	struct kvm_nested_guest *gp;
 	struct kvm_nested_guest *freelist = NULL;
 	struct kvm_memory_slot *memslot;
-	int srcu_idx;
+	int srcu_idx, ctr;
 
 	spin_lock(&kvm->mmu_lock);
 	for (i = 0; i <= kvm->arch.max_nested_lpid; i++) {
@@ -742,7 +742,7 @@ void kvmhv_release_all_nested(struct kvm *kvm)
 	}
 
 	srcu_idx = srcu_read_lock(&kvm->srcu);
-	kvm_for_each_memslot(memslot, kvm_memslots(kvm))
+	kvm_for_each_memslot(memslot, ctr, kvm_memslots(kvm))
 		kvmhv_free_memslot_nest_rmap(memslot);
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
 }
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..671c8f6d605e 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -458,7 +458,7 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot, *m;
 	int ret = H_SUCCESS;
-	int srcu_idx;
+	int srcu_idx, ctr;
 
 	kvm->arch.secure_guest = KVMPPC_SECURE_INIT_START;
 
@@ -477,7 +477,7 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 
 	/* register the memslot */
 	slots = kvm_memslots(kvm);
-	kvm_for_each_memslot(memslot, slots) {
+	kvm_for_each_memslot(memslot, ctr, slots) {
 		ret = __kvmppc_uvmem_memslot_create(kvm, memslot);
 		if (ret)
 			break;
@@ -485,7 +485,7 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 
 	if (ret) {
 		slots = kvm_memslots(kvm);
-		kvm_for_each_memslot(m, slots) {
+		kvm_for_each_memslot(m, ctr, slots) {
 			if (m == memslot)
 				break;
 			__kvmppc_uvmem_memslot_delete(kvm, memslot);
@@ -646,7 +646,7 @@ void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 
 unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
 {
-	int srcu_idx;
+	int srcu_idx, ctr;
 	struct kvm_memory_slot *memslot;
 
 	/*
@@ -661,7 +661,7 @@ unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
 
 	srcu_idx = srcu_read_lock(&kvm->srcu);
 
-	kvm_for_each_memslot(memslot, kvm_memslots(kvm))
+	kvm_for_each_memslot(memslot, ctr, kvm_memslots(kvm))
 		kvmppc_uvmem_drop_pages(memslot, kvm, false);
 
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
@@ -820,7 +820,7 @@ unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
-	int srcu_idx;
+	int srcu_idx, ctr;
 	long ret = H_SUCCESS;
 
 	if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
@@ -829,7 +829,7 @@ unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
 	/* migrate any unmoved normal pfn to device pfns*/
 	srcu_idx = srcu_read_lock(&kvm->srcu);
 	slots = kvm_memslots(kvm);
-	kvm_for_each_memslot(memslot, slots) {
+	kvm_for_each_memslot(memslot, ctr, slots) {
 		ret = kvmppc_uv_migrate_mem_slot(kvm, memslot);
 		if (ret) {
 			/*
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index c6dfb9c24c8c..9f6da4ac58b7 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1015,13 +1015,13 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
 	struct kvm_memory_slot *ms;
 	struct kvm_memslots *slots;
 	unsigned long ram_pages = 0;
-	int slotnr;
+	int ctr;
 
 	/* migration mode already enabled */
 	if (kvm->arch.migration_mode)
 		return 0;
 	slots = kvm_memslots(kvm);
-	if (!slots || !slots->used_slots)
+	if (!slots || kvm_memslots_empty(slots))
 		return -EINVAL;
 
 	if (!kvm->arch.use_cmma) {
@@ -1029,8 +1029,7 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
 		return 0;
 	}
 	/* mark all the pages in active slots as dirty */
-	for (slotnr = 0; slotnr < slots->used_slots; slotnr++) {
-		ms = slots->memslots + slotnr;
+	kvm_for_each_memslot(ms, ctr, slots) {
 		if (!ms->dirty_bitmap)
 			return -EINVAL;
 		/*
@@ -1948,22 +1947,24 @@ static unsigned long kvm_s390_next_dirty_cmma(struct kvm_memslots *slots,
 					      unsigned long cur_gfn)
 {
 	struct kvm_memory_slot *ms = search_memslots(slots, cur_gfn, true);
-	int slotidx = ms - slots->memslots;
 	unsigned long ofs = cur_gfn - ms->base_gfn;
+	int idxactive = kvm_memslots_idx(slots);
+	struct rb_node *mnode = &ms->gfn_node[idxactive];
 
 	if (ms->base_gfn + ms->npages <= cur_gfn) {
-		slotidx--;
+		mnode = rb_next(mnode);
 		/* If we are above the highest slot, wrap around */
-		if (slotidx < 0)
-			slotidx = slots->used_slots - 1;
+		if (!mnode)
+			mnode = rb_first(&slots->gfn_tree);
 
-		ms = slots->memslots + slotidx;
+		ms = container_of(mnode, struct kvm_memory_slot,
+				  gfn_node[idxactive]);
 		ofs = 0;
 	}
 	ofs = find_next_bit(kvm_second_dirty_bitmap(ms), ms->npages, ofs);
-	while ((slotidx > 0) && (ofs >= ms->npages)) {
-		slotidx--;
-		ms = slots->memslots + slotidx;
+	while (ofs >= ms->npages && (mnode = rb_next(mnode))) {
+		ms = container_of(mnode, struct kvm_memory_slot,
+				  gfn_node[idxactive]);
 		ofs = find_next_bit(kvm_second_dirty_bitmap(ms), ms->npages, 0);
 	}
 	return ms->base_gfn + ofs;
@@ -1976,7 +1977,7 @@ static int kvm_s390_get_cmma(struct kvm *kvm, struct kvm_s390_cmma_log *args,
 	struct kvm_memslots *slots = kvm_memslots(kvm);
 	struct kvm_memory_slot *ms;
 
-	if (unlikely(!slots->used_slots))
+	if (unlikely(kvm_memslots_empty(slots)))
 		return 0;
 
 	cur_gfn = kvm_s390_next_dirty_cmma(slots, args->start_gfn);
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 5787b12aff7e..79548da748dd 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -211,12 +211,15 @@ static inline int kvm_s390_user_cpu_state_ctrl(struct kvm *kvm)
 /* get the end gfn of the last (highest gfn) memslot */
 static inline unsigned long kvm_s390_get_gfn_end(struct kvm_memslots *slots)
 {
+	struct rb_node *node;
+	int idxactive = kvm_memslots_idx(slots);
 	struct kvm_memory_slot *ms;
 
-	if (WARN_ON(!slots->used_slots))
+	if (WARN_ON(kvm_memslots_empty(slots)))
 		return 0;
 
-	ms = slots->memslots;
+	node = rb_last(&slots->gfn_tree);
+	ms = container_of(node, struct kvm_memory_slot, gfn_node[idxactive]);
 	return ms->base_gfn + ms->npages;
 }
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e46a0ce034b..7222b552d139 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5492,8 +5492,10 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	write_lock(&kvm->mmu_lock);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		int ctr;
+
 		slots = __kvm_memslots(kvm, i);
-		kvm_for_each_memslot(memslot, slots) {
+		kvm_for_each_memslot(memslot, ctr, slots) {
 			gfn_t start, end;
 
 			start = max(gfn_start, memslot->base_gfn);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f59847b6e9b3..a9c5b0df2311 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -29,6 +29,7 @@
 #include <linux/nospec.h>
 #include <linux/interval_tree.h>
 #include <linux/hashtable.h>
+#include <linux/rbtree.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -358,8 +359,9 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
 
 struct kvm_memory_slot {
-	struct hlist_node id_node;
-	struct interval_tree_node hva_node;
+	struct hlist_node id_node[2];
+	struct interval_tree_node hva_node[2];
+	struct rb_node gfn_node[2];
 	gfn_t base_gfn;
 	unsigned long npages;
 	unsigned long *dirty_bitmap;
@@ -454,19 +456,14 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 }
 #endif
 
-/*
- * Note:
- * memslots are not sorted by id anymore, please use id_to_memslot()
- * to get the memslot by its id.
- */
 struct kvm_memslots {
 	u64 generation;
+	atomic_long_t lru_slot;
 	struct rb_root_cached hva_tree;
-	/* The mapping table from slot id to the index in memslots[]. */
+	struct rb_root gfn_tree;
+	/* The mapping table from slot id to memslot. */
 	DECLARE_HASHTABLE(id_hash, 7);
-	atomic_t lru_slot;
-	int used_slots;
-	struct kvm_memory_slot memslots[];
+	bool is_idx_0;
 };
 
 struct kvm {
@@ -478,6 +475,7 @@ struct kvm {
 
 	struct mutex slots_lock;
 	struct mm_struct *mm; /* userspace tied to this vm */
+	struct kvm_memslots memslots_all[KVM_ADDRESS_SPACE_NUM][2];
 	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 
@@ -617,12 +615,6 @@ static inline int kvm_vcpu_get_idx(struct kvm_vcpu *vcpu)
 	return vcpu->vcpu_idx;
 }
 
-#define kvm_for_each_memslot(memslot, slots)				\
-	for (memslot = &slots->memslots[0];				\
-	     memslot < slots->memslots + slots->used_slots; memslot++)	\
-		if (WARN_ON_ONCE(!memslot->npages)) {			\
-		} else
-
 void kvm_vcpu_destroy(struct kvm_vcpu *vcpu);
 
 void vcpu_load(struct kvm_vcpu *vcpu);
@@ -682,6 +674,22 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
 	return __kvm_memslots(vcpu->kvm, as_id);
 }
 
+static inline bool kvm_memslots_empty(struct kvm_memslots *slots)
+{
+	return RB_EMPTY_ROOT(&slots->gfn_tree);
+}
+
+static inline int kvm_memslots_idx(struct kvm_memslots *slots)
+{
+	return slots->is_idx_0 ? 0 : 1;
+}
+
+#define kvm_for_each_memslot(memslot, ctr, slots)	\
+	hash_for_each(slots->id_hash, ctr, memslot,	\
+		      id_node[kvm_memslots_idx(slots)]) \
+		if (WARN_ON_ONCE(!memslot->npages)) {	\
+		} else
+
 #define kvm_for_each_hva_range_memslot(node, slots, start, last)	     \
 	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
 	     node;							     \
@@ -690,9 +698,10 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
 static inline
 struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
 {
+	int idxactive = kvm_memslots_idx(slots);
 	struct kvm_memory_slot *slot;
 
-	hash_for_each_possible(slots->id_hash, slot, id_node, id) {
+	hash_for_each_possible(slots->id_hash, slot, id_node[idxactive], id) {
 		if (slot->id == id)
 			return slot;
 	}
@@ -1102,42 +1111,39 @@ bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
  * With "approx" set returns the memslot also when the address falls
  * in a hole. In that case one of the memslots bordering the hole is
  * returned.
- *
- * IMPORTANT: Slots are sorted from highest GFN to lowest GFN!
  */
 static inline struct kvm_memory_slot *
 search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)
 {
-	int start = 0, end = slots->used_slots;
-	int slot = atomic_read(&slots->lru_slot);
-	struct kvm_memory_slot *memslots = slots->memslots;
-
-	if (unlikely(!slots->used_slots))
-		return NULL;
-
-	if (gfn >= memslots[slot].base_gfn &&
-	    gfn < memslots[slot].base_gfn + memslots[slot].npages)
-		return &memslots[slot];
-
-	while (start < end) {
-		slot = start + (end - start) / 2;
-
-		if (gfn >= memslots[slot].base_gfn)
-			end = slot;
-		else
-			start = slot + 1;
+	int idxactive = kvm_memslots_idx(slots);
+	struct kvm_memory_slot *slot;
+	struct rb_node *prevnode, *node;
+
+	slot = (struct kvm_memory_slot *)atomic_long_read(&slots->lru_slot);
+	if (slot &&
+	    gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
+		return slot;
+
+	for (prevnode = NULL, node = slots->gfn_tree.rb_node; node; ) {
+		prevnode = node;
+		slot = container_of(node, struct kvm_memory_slot,
+				    gfn_node[idxactive]);
+		if (gfn >= slot->base_gfn) {
+			if (gfn < slot->base_gfn + slot->npages) {
+				atomic_long_set(&slots->lru_slot,
+						(unsigned long)slot);
+				return slot;
+			}
+			node = node->rb_right;
+		} else
+			node = node->rb_left;
 	}
 
-	if (approx && start >= slots->used_slots)
-		return &memslots[slots->used_slots - 1];
+	if (approx && prevnode)
+		return container_of(prevnode, struct kvm_memory_slot,
+				    gfn_node[idxactive]);
 
-	if (start < slots->used_slots && gfn >= memslots[start].base_gfn &&
-	    gfn < memslots[start].base_gfn + memslots[start].npages) {
-		atomic_set(&slots->lru_slot, start);
-		return &memslots[start];
-	}
-
-	return approx ? &memslots[start] : NULL;
+	return NULL;
 }
 
 static inline struct kvm_memory_slot *
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a55309432c9a..189504b27ca6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -510,15 +510,17 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		int idxactive;
 		struct interval_tree_node *node;
 
 		slots = __kvm_memslots(kvm, i);
+		idxactive = kvm_memslots_idx(slots);
 		kvm_for_each_hva_range_memslot(node, slots,
 					       range->start, range->end - 1) {
 			unsigned long hva_start, hva_end;
 
 			slot = container_of(node, struct kvm_memory_slot,
-					    hva_node);
+					    hva_node[idxactive]);
 			hva_start = max(range->start, slot->userspace_addr);
 			hva_end = min(range->end, slot->userspace_addr +
 						  (slot->npages << PAGE_SHIFT));
@@ -785,18 +787,12 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
-static struct kvm_memslots *kvm_alloc_memslots(void)
+static void kvm_init_memslots(struct kvm_memslots *slots)
 {
-	struct kvm_memslots *slots;
-
-	slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT);
-	if (!slots)
-		return NULL;
-
+	atomic_long_set(&slots->lru_slot, (unsigned long)NULL);
 	slots->hva_tree = RB_ROOT_CACHED;
+	slots->gfn_tree = RB_ROOT;
 	hash_init(slots->id_hash);
-
-	return slots;
 }
 
 static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
@@ -808,27 +804,31 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 	memslot->dirty_bitmap = NULL;
 }
 
+/* This does not remove the slot from struct kvm_memslots data structures */
 static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 	kvm_destroy_dirty_bitmap(slot);
 
 	kvm_arch_free_memslot(kvm, slot);
 
-	slot->flags = 0;
-	slot->npages = 0;
+	kfree(slot);
 }
 
 static void kvm_free_memslots(struct kvm *kvm, struct kvm_memslots *slots)
 {
+	int ctr;
+	struct hlist_node *idnode;
 	struct kvm_memory_slot *memslot;
 
-	if (!slots)
+	/*
+	 * Both active and inactive struct kvm_memslots should point to
+	 * the same set of memslots, so it's enough to free them once
+	 */
+	if (slots->is_idx_0)
 		return;
 
-	kvm_for_each_memslot(memslot, slots)
+	hash_for_each_safe(slots->id_hash, ctr, idnode, memslot, id_node[1])
 		kvm_free_memslot(kvm, memslot);
-
-	kvfree(slots);
 }
 
 static void kvm_destroy_vm_debugfs(struct kvm *kvm)
@@ -924,13 +924,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
 
 	refcount_set(&kvm->users_count, 1);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_memslots *slots = kvm_alloc_memslots();
+		kvm_init_memslots(&kvm->memslots_all[i][0]);
+		kvm_init_memslots(&kvm->memslots_all[i][1]);
+		kvm->memslots_all[i][0].is_idx_0 = true;
+		kvm->memslots_all[i][1].is_idx_0 = false;
 
-		if (!slots)
-			goto out_err_no_arch_destroy_vm;
 		/* Generations must be different for each address space. */
-		slots->generation = i;
-		rcu_assign_pointer(kvm->memslots[i], slots);
+		kvm->memslots_all[i][0].generation = i;
+		rcu_assign_pointer(kvm->memslots[i], &kvm->memslots_all[i][0]);
 	}
 
 	for (i = 0; i < KVM_NR_BUSES; i++) {
@@ -983,8 +984,6 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
 	for (i = 0; i < KVM_NR_BUSES; i++)
 		kfree(kvm_get_bus(kvm, i));
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
-		kvm_free_memslots(kvm, __kvm_memslots(kvm, i));
 	cleanup_srcu_struct(&kvm->irq_srcu);
 out_err_no_irq_srcu:
 	cleanup_srcu_struct(&kvm->srcu);
@@ -1038,8 +1037,10 @@ static void kvm_destroy_vm(struct kvm *kvm)
 #endif
 	kvm_arch_destroy_vm(kvm);
 	kvm_destroy_devices(kvm);
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
-		kvm_free_memslots(kvm, __kvm_memslots(kvm, i));
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		kvm_free_memslots(kvm, &kvm->memslots_all[i][0]);
+		kvm_free_memslots(kvm, &kvm->memslots_all[i][1]);
+	}
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	cleanup_srcu_struct(&kvm->srcu);
 	kvm_arch_free_vm(kvm);
@@ -1099,212 +1100,6 @@ static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot)
 	return 0;
 }
 
-/*
- * Delete a memslot by decrementing the number of used slots and shifting all
- * other entries in the array forward one spot.
- * @memslot is a detached dummy struct with just .id and .as_id filled.
- */
-static inline void kvm_memslot_delete(struct kvm_memslots *slots,
-				      struct kvm_memory_slot *memslot)
-{
-	struct kvm_memory_slot *mslots = slots->memslots;
-	struct kvm_memory_slot *dmemslot = id_to_memslot(slots, memslot->id);
-	int i;
-
-	if (WARN_ON(!dmemslot))
-		return;
-
-	slots->used_slots--;
-
-	if (atomic_read(&slots->lru_slot) >= slots->used_slots)
-		atomic_set(&slots->lru_slot, 0);
-
-	for (i = dmemslot - mslots; i < slots->used_slots; i++) {
-		interval_tree_remove(&mslots[i].hva_node, &slots->hva_tree);
-		hash_del(&mslots[i].id_node);
-
-		mslots[i] = mslots[i + 1];
-		interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
-		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
-	}
-	interval_tree_remove(&mslots[i].hva_node, &slots->hva_tree);
-	hash_del(&mslots[i].id_node);
-	mslots[i] = *memslot;
-}
-
-/*
- * "Insert" a new memslot by incrementing the number of used slots.  Returns
- * the new slot's initial index into the memslots array.
- */
-static inline int kvm_memslot_insert_back(struct kvm_memslots *slots)
-{
-	return slots->used_slots++;
-}
-
-/*
- * Move a changed memslot backwards in the array by shifting existing slots
- * with a higher GFN toward the front of the array.  Note, the changed memslot
- * itself is not preserved in the array, i.e. not swapped at this time, only
- * its new index into the array is tracked.  Returns the changed memslot's
- * current index into the memslots array.
- * The memslot at the returned index will not be in @slots->hva_tree or
- * @slots->id_hash by then.
- * @memslot is a detached struct with desired final data of the changed slot.
- */
-static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
-					    struct kvm_memory_slot *memslot)
-{
-	struct kvm_memory_slot *mslots = slots->memslots;
-	struct kvm_memory_slot *mmemslot = id_to_memslot(slots, memslot->id);
-	int i;
-
-	if (WARN_ON_ONCE(!mmemslot) ||
-	    WARN_ON_ONCE(!slots->used_slots))
-		return -1;
-
-	/*
-	 * update_memslots() will unconditionally overwrite and re-add the
-	 * target memslot so it has to be removed here first
-	 */
-	interval_tree_remove(&mmemslot->hva_node, &slots->hva_tree);
-	hash_del(&mmemslot->id_node);
-
-	/*
-	 * Move the target memslot backward in the array by shifting existing
-	 * memslots with a higher GFN (than the target memslot) towards the
-	 * front of the array.
-	 */
-	for (i = mmemslot - mslots; i < slots->used_slots - 1; i++) {
-		if (memslot->base_gfn > mslots[i + 1].base_gfn)
-			break;
-
-		WARN_ON_ONCE(memslot->base_gfn == mslots[i + 1].base_gfn);
-
-		/* Shift the next memslot forward one and update its index. */
-		interval_tree_remove(&mslots[i + 1].hva_node, &slots->hva_tree);
-		hash_del(&mslots[i + 1].id_node);
-
-		mslots[i] = mslots[i + 1];
-		interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
-		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
-	}
-	return i;
-}
-
-/*
- * Move a changed memslot forwards in the array by shifting existing slots with
- * a lower GFN toward the back of the array.  Note, the changed memslot itself
- * is not preserved in the array, i.e. not swapped at this time, only its new
- * index into the array is tracked.  Returns the changed memslot's final index
- * into the memslots array.
- * The memslot at the returned index will not be in @slots->hva_tree or
- * @slots->id_hash by then.
- * @memslot is a detached struct with desired final data of the new or
- * changed slot.
- * Assumes that the memslot at @start index is not in @slots->hva_tree or
- * @slots->id_hash.
- */
-static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
-					   struct kvm_memory_slot *memslot,
-					   int start)
-{
-	struct kvm_memory_slot *mslots = slots->memslots;
-	int i;
-
-	for (i = start; i > 0; i--) {
-		if (memslot->base_gfn < mslots[i - 1].base_gfn)
-			break;
-
-		WARN_ON_ONCE(memslot->base_gfn == mslots[i - 1].base_gfn);
-
-		/* Shift the next memslot back one and update its index. */
-		interval_tree_remove(&mslots[i - 1].hva_node, &slots->hva_tree);
-		hash_del(&mslots[i - 1].id_node);
-
-		mslots[i] = mslots[i - 1];
-		interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
-		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
-	}
-	return i;
-}
-
-/*
- * Re-sort memslots based on their GFN to account for an added, deleted, or
- * moved memslot.  Sorting memslots by GFN allows using a binary search during
- * memslot lookup.
- *
- * IMPORTANT: Slots are sorted from highest GFN to lowest GFN!  I.e. the entry
- * at memslots[0] has the highest GFN.
- *
- * The sorting algorithm takes advantage of having initially sorted memslots
- * and knowing the position of the changed memslot.  Sorting is also optimized
- * by not swapping the updated memslot and instead only shifting other memslots
- * and tracking the new index for the update memslot.  Only once its final
- * index is known is the updated memslot copied into its position in the array.
- *
- *  - When deleting a memslot, the deleted memslot simply needs to be moved to
- *    the end of the array.
- *
- *  - When creating a memslot, the algorithm "inserts" the new memslot at the
- *    end of the array and then it forward to its correct location.
- *
- *  - When moving a memslot, the algorithm first moves the updated memslot
- *    backward to handle the scenario where the memslot's GFN was changed to a
- *    lower value.  update_memslots() then falls through and runs the same flow
- *    as creating a memslot to move the memslot forward to handle the scenario
- *    where its GFN was changed to a higher value.
- *
- * Note, slots are sorted from highest->lowest instead of lowest->highest for
- * historical reasons.  Originally, invalid memslots where denoted by having
- * GFN=0, thus sorting from highest->lowest naturally sorted invalid memslots
- * to the end of the array.  The current algorithm uses dedicated logic to
- * delete a memslot and thus does not rely on invalid memslots having GFN=0.
- *
- * The other historical motiviation for highest->lowest was to improve the
- * performance of memslot lookup.  KVM originally used a linear search starting
- * at memslots[0].  On x86, the largest memslot usually has one of the highest,
- * if not *the* highest, GFN, as the bulk of the guest's RAM is located in a
- * single memslot above the 4gb boundary.  As the largest memslot is also the
- * most likely to be referenced, sorting it to the front of the array was
- * advantageous.  The current binary search starts from the middle of the array
- * and uses an LRU pointer to improve performance for all memslots and GFNs.
- *
- * @memslot is a detached struct, not a part of the current or new memslot
- * array.
- */
-static void update_memslots(struct kvm_memslots *slots,
-			    struct kvm_memory_slot *memslot,
-			    enum kvm_mr_change change)
-{
-	int i;
-
-	if (change == KVM_MR_DELETE) {
-		kvm_memslot_delete(slots, memslot);
-	} else {
-		if (change == KVM_MR_CREATE)
-			i = kvm_memslot_insert_back(slots);
-		else
-			i = kvm_memslot_move_backward(slots, memslot);
-		i = kvm_memslot_move_forward(slots, memslot, i);
-
-		if (i < 0)
-			return;
-
-		/*
-		 * Copy the memslot to its new position in memslots and update
-		 * its index accordingly.
-		 */
-		slots->memslots[i] = *memslot;
-		slots->memslots[i].hva_node.start = memslot->userspace_addr;
-		slots->memslots[i].hva_node.last = memslot->userspace_addr +
-			(memslot->npages << PAGE_SHIFT) - 1;
-		interval_tree_insert(&slots->memslots[i].hva_node,
-				     &slots->hva_tree);
-		hash_add(slots->id_hash, &slots->memslots[i].id_node,
-			 memslot->id);
-	}
-}
-
 static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -1319,10 +1114,12 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
 	return 0;
 }
 
-static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
-		int as_id, struct kvm_memslots *slots)
+static void swap_memslots(struct kvm *kvm, int as_id)
 {
 	struct kvm_memslots *old_memslots = __kvm_memslots(kvm, as_id);
+	int idxactive = kvm_memslots_idx(old_memslots);
+	int idxina = idxactive == 0 ? 1 : 0;
+	struct kvm_memslots *slots = &kvm->memslots_all[as_id][idxina];
 	u64 gen = old_memslots->generation;
 
 	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
@@ -1351,44 +1148,129 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
 	kvm_arch_memslots_updated(kvm, gen);
 
 	slots->generation = gen;
+}
+
+static void kvm_memslot_gfn_insert(struct rb_root *gfn_tree,
+				  struct kvm_memory_slot *slot,
+				  int which)
+{
+	struct rb_node **cur, *parent;
+
+	for (cur = &gfn_tree->rb_node, parent = NULL; *cur; ) {
+		struct kvm_memory_slot *cslot;
+
+		cslot = container_of(*cur, typeof(*cslot), gfn_node[which]);
+		parent = *cur;
+		if (slot->base_gfn < cslot->base_gfn)
+			cur = &(*cur)->rb_left;
+		else if (slot->base_gfn > cslot->base_gfn)
+			cur = &(*cur)->rb_right;
+		else
+			BUG();
+	}
 
-	return old_memslots;
+	rb_link_node(&slot->gfn_node[which], parent, cur);
+	rb_insert_color(&slot->gfn_node[which], gfn_tree);
 }
 
 /*
- * Note, at a minimum, the current number of used slots must be allocated, even
- * when deleting a memslot, as we need a complete duplicate of the memslots for
- * use when invalidating a memslot prior to deleting/moving the memslot.
+ * Just copies the memslot data.
+ * Does not copy or touch the embedded nodes, including the ranges at hva_nodes.
  */
-static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
-					     enum kvm_mr_change change)
+static void kvm_copy_memslot(struct kvm_memory_slot *dest,
+			     struct kvm_memory_slot *src)
 {
-	struct kvm_memslots *slots;
-	size_t old_size, new_size;
-	struct kvm_memory_slot *memslot;
+	dest->base_gfn = src->base_gfn;
+	dest->npages = src->npages;
+	dest->dirty_bitmap = src->dirty_bitmap;
+	dest->arch = src->arch;
+	dest->userspace_addr = src->userspace_addr;
+	dest->flags = src->flags;
+	dest->id = src->id;
+	dest->as_id = src->as_id;
+}
 
-	old_size = sizeof(struct kvm_memslots) +
-		   (sizeof(struct kvm_memory_slot) * old->used_slots);
+/*
+ * Initializes the ranges at both hva_nodes from the memslot userspace_addr
+ * and npages fields.
+ */
+static void kvm_init_memslot_hva_ranges(struct kvm_memory_slot *slot)
+{
+	slot->hva_node[0].start = slot->hva_node[1].start =
+		slot->userspace_addr;
+	slot->hva_node[0].last = slot->hva_node[1].last =
+		slot->userspace_addr + (slot->npages << PAGE_SHIFT) - 1;
+}
 
-	if (change == KVM_MR_CREATE)
-		new_size = old_size + sizeof(struct kvm_memory_slot);
-	else
-		new_size = old_size;
+/*
+ * Replaces the @oldslot with @nslot in the memslot set indicated by
+ * @slots_idx.
+ *
+ * With NULL @oldslot this simply adds the @nslot to the set.
+ * With NULL @nslot this simply removes the @oldslot from the set.
+ *
+ * If @nslot is non-NULL its hva_node[slots_idx] range has to be set
+ * appropriately.
+ */
+static void kvm_replace_memslot(struct kvm *kvm,
+				int as_id, int slots_idx,
+				struct kvm_memory_slot *oldslot,
+				struct kvm_memory_slot *nslot)
+{
+	struct kvm_memslots *slots = &kvm->memslots_all[as_id][slots_idx];
 
-	slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
-	if (unlikely(!slots))
-		return NULL;
+	if (WARN_ON(!oldslot && !nslot))
+		return;
+
+	if (oldslot) {
+		hash_del(&oldslot->id_node[slots_idx]);
+		interval_tree_remove(&oldslot->hva_node[slots_idx],
+				     &slots->hva_tree);
+		atomic_long_cmpxchg(&slots->lru_slot,
+				    (unsigned long)oldslot,
+				    (unsigned long)nslot);
+		if (!nslot) {
+			rb_erase(&oldslot->gfn_node[slots_idx],
+				 &slots->gfn_tree);
+			return;
+		}
+	}
 
-	memcpy(slots, old, old_size);
+	hash_add(slots->id_hash, &nslot->id_node[slots_idx],
+		 nslot->id);
+	WARN_ON(PAGE_SHIFT > 0 &&
+		nslot->hva_node[slots_idx].start >=
+		nslot->hva_node[slots_idx].last);
+	interval_tree_insert(&nslot->hva_node[slots_idx],
+			     &slots->hva_tree);
 
-	slots->hva_tree = RB_ROOT_CACHED;
-	hash_init(slots->id_hash);
-	kvm_for_each_memslot(memslot, slots) {
-		interval_tree_insert(&memslot->hva_node, &slots->hva_tree);
-		hash_add(slots->id_hash, &memslot->id_node, memslot->id);
+	/* Shame there is no O(1) interval_tree_replace()... */
+	if (oldslot && oldslot->base_gfn == nslot->base_gfn)
+		rb_replace_node(&oldslot->gfn_node[slots_idx],
+				&nslot->gfn_node[slots_idx],
+				&slots->gfn_tree);
+	else {
+		if (oldslot)
+			rb_erase(&oldslot->gfn_node[slots_idx],
+				 &slots->gfn_tree);
+		kvm_memslot_gfn_insert(&slots->gfn_tree,
+				       nslot, slots_idx);
 	}
+}
+
+/*
+ * Copies the @oldslot data into @nslot and uses this slot to replace
+ * @oldslot in the memslot set indicated by @slots_idx.
+ */
+static void kvm_copy_replace_memslot(struct kvm *kvm,
+				     int as_id, int slots_idx,
+				     struct kvm_memory_slot *oldslot,
+				     struct kvm_memory_slot *nslot)
+{
+	kvm_copy_memslot(nslot, oldslot);
+	kvm_init_memslot_hva_ranges(nslot);
 
-	return slots;
+	kvm_replace_memslot(kvm, as_id, slots_idx, oldslot, nslot);
 }
 
 static int kvm_set_memslot(struct kvm *kvm,
@@ -1397,56 +1279,178 @@ static int kvm_set_memslot(struct kvm *kvm,
 			   struct kvm_memory_slot *new, int as_id,
 			   enum kvm_mr_change change)
 {
-	struct kvm_memory_slot *slot;
-	struct kvm_memslots *slots;
+	struct kvm_memslots *slotsact = __kvm_memslots(kvm, as_id);
+	int idxact = kvm_memslots_idx(slotsact);
+	int idxina = idxact == 0 ? 1 : 0;
+	struct kvm_memslots *slotsina = &kvm->memslots_all[as_id][idxina];
+	struct kvm_memory_slot *slotina, *slotact;
 	int r;
 
-	slots = kvm_dup_memslots(__kvm_memslots(kvm, as_id), change);
-	if (!slots)
+	slotina = kzalloc(sizeof(*slotina), GFP_KERNEL_ACCOUNT);
+	if (!slotina)
 		return -ENOMEM;
 
+	if (change != KVM_MR_CREATE)
+		slotact = id_to_memslot(slotsact, old->id);
+
 	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
 		/*
-		 * Note, the INVALID flag needs to be in the appropriate entry
-		 * in the freshly allocated memslots, not in @old or @new.
+		 * Replace the slot to be deleted or moved in the inactive
+		 * memslot set by its copy with KVM_MEMSLOT_INVALID flag set.
 		 */
-		slot = id_to_memslot(slots, old->id);
-		slot->flags |= KVM_MEMSLOT_INVALID;
+		kvm_copy_replace_memslot(kvm, as_id, idxina, slotact, slotina);
+		slotina->flags |= KVM_MEMSLOT_INVALID;
 
 		/*
-		 * We can re-use the old memslots, the only difference from the
-		 * newly installed memslots is the invalid flag, which will get
-		 * dropped by update_memslots anyway.  We'll also revert to the
-		 * old memslots if preparing the new memory region fails.
+		 * Swap the active <-> inactive memslot set.
+		 * Now the active memslot set still contains the memslot to be
+		 * deleted or moved, but with the KVM_MEMSLOT_INVALID flag set.
 		 */
-		slots = install_new_memslots(kvm, as_id, slots);
+		swap_memslots(kvm, as_id);
+		swap(idxact, idxina);
+		swap(slotsina, slotsact);
+		swap(slotact, slotina);
 
-		/* From this point no new shadow pages pointing to a deleted,
+		/*
+		 * From this point no new shadow pages pointing to a deleted,
 		 * or moved, memslot will be created.
 		 *
 		 * validation of sp->gfn happens in:
 		 *	- gfn_to_hva (kvm_read_guest, gfn_to_pfn)
 		 *	- kvm_is_visible_gfn (mmu_check_root)
 		 */
-		kvm_arch_flush_shadow_memslot(kvm, slot);
+		kvm_arch_flush_shadow_memslot(kvm, slotact);
 	}
 
 	r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
 	if (r)
 		goto out_slots;
 
-	update_memslots(slots, new, change);
-	slots = install_new_memslots(kvm, as_id, slots);
+	if (change == KVM_MR_MOVE) {
+		/*
+		 * Since we are going to be changing the memslot gfn we need to
+		 * remove it from the gfn tree so it can be re-added there with
+		 * the updated gfn.
+		 */
+		rb_erase(&slotina->gfn_node[idxina],
+			 &slotsina->gfn_tree);
+
+		slotina->base_gfn = new->base_gfn;
+		slotina->flags = new->flags;
+		slotina->dirty_bitmap = new->dirty_bitmap;
+		/* kvm_arch_prepare_memory_region() might have modified arch */
+		slotina->arch = new->arch;
+
+		/* Re-add to the gfn tree with the updated gfn */
+		kvm_memslot_gfn_insert(&slotsina->gfn_tree,
+				       slotina, idxina);
+
+		/*
+		 * Swap the active <-> inactive memslot set.
+		 * Now the active memslot set contains the new, final memslot.
+		 */
+		swap_memslots(kvm, as_id);
+		swap(idxact, idxina);
+		swap(slotsina, slotsact);
+		swap(slotact, slotina);
+
+		/*
+		 * Replace the temporary KVM_MEMSLOT_INVALID slot with the
+		 * new, final memslot in the inactive memslot set and
+		 * free the temporary memslot.
+		 */
+		kvm_replace_memslot(kvm, as_id, idxina, slotina, slotact);
+		kfree(slotina);
+	} else if (change == KVM_MR_FLAGS_ONLY) {
+		/*
+		 * Almost like the move case above, but we don't use a temporary
+		 * KVM_MEMSLOT_INVALID slot.
+		 * Instead, we simply replace the old memslot with a new, updated
+		 * copy in both memslot sets.
+		 *
+		 * Since we aren't going to be changing the memslot gfn we can
+		 * simply use kvm_copy_replace_memslot(), which will use
+		 * rb_replace_node() to switch the memslot node in the gfn tree
+		 * instead of removing the old one and inserting the new one
+		 * as two separate operations.
+		 * It's a performance win since node replacement is a single
+		 * O(1) operation as opposed to two O(log(n)) operations for
+		 * slot removal and then re-insertion.
+		 */
+		kvm_copy_replace_memslot(kvm, as_id, idxina, slotact, slotina);
+		slotina->flags = new->flags;
+		slotina->dirty_bitmap = new->dirty_bitmap;
+		/* kvm_arch_prepare_memory_region() might have modified arch */
+		slotina->arch = new->arch;
+
+		/* Swap the active <-> inactive memslot set. */
+		swap_memslots(kvm, as_id);
+		swap(idxact, idxina);
+		swap(slotsina, slotsact);
+		swap(slotact, slotina);
+
+		/*
+		 * Replace the old memslot in the other memslot set and
+		 * then finally free it.
+		 */
+		kvm_replace_memslot(kvm, as_id, idxina, slotina, slotact);
+		kfree(slotina);
+	} else if (change == KVM_MR_CREATE) {
+		/*
+		 * Add the new memslot to the current inactive set as a copy
+		 * of the provided new memslot data.
+		 */
+		kvm_copy_memslot(slotina, new);
+		kvm_init_memslot_hva_ranges(slotina);
+
+		kvm_replace_memslot(kvm, as_id, idxina, NULL, slotina);
+
+		/* Swap the active <-> inactive memslot set. */
+		swap_memslots(kvm, as_id);
+		swap(idxact, idxina);
+		swap(slotsina, slotsact);
+
+		/* Now add it also to the other memslot set */
+		kvm_replace_memslot(kvm, as_id, idxina, NULL, slotina);
+	} else if (change == KVM_MR_DELETE) {
+		/*
+		 * Remove the old memslot from the current inactive set
+		 * (the other, active set contains the temporary
+		 * KVM_MEMSLOT_INVALID slot)
+		 */
+		kvm_replace_memslot(kvm, as_id, idxina, slotina, NULL);
+
+		/* Swap the active <-> inactive memslot set. */
+		swap_memslots(kvm, as_id);
+		swap(idxact, idxina);
+		swap(slotsina, slotsact);
+		swap(slotact, slotina);
+
+		/* Remove the temporary KVM_MEMSLOT_INVALID slot and free it. */
+		kvm_replace_memslot(kvm, as_id, idxina, slotina, NULL);
+		kfree(slotina);
+		/* slotact will be freed by kvm_free_memslot() */
+	} else
+		BUG();
 
 	kvm_arch_commit_memory_region(kvm, mem, old, new, change);
 
-	kvfree(slots);
+	if (change == KVM_MR_DELETE)
+		kvm_free_memslot(kvm, slotact);
+
 	return 0;
 
 out_slots:
-	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
-		slots = install_new_memslots(kvm, as_id, slots);
-	kvfree(slots);
+	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
+		swap_memslots(kvm, as_id);
+		swap(idxact, idxina);
+		swap(slotsina, slotsact);
+		swap(slotact, slotina);
+
+		kvm_replace_memslot(kvm, as_id, idxina, slotina, slotact);
+	}
+	kfree(slotina);
+
 	return r;
 }
 
@@ -1455,7 +1459,6 @@ static int kvm_delete_memslot(struct kvm *kvm,
 			      struct kvm_memory_slot *old, int as_id)
 {
 	struct kvm_memory_slot new;
-	int r;
 
 	if (!old->npages)
 		return -EINVAL;
@@ -1468,12 +1471,7 @@ static int kvm_delete_memslot(struct kvm *kvm,
 	 */
 	new.as_id = as_id;
 
-	r = kvm_set_memslot(kvm, mem, old, &new, as_id, KVM_MR_DELETE);
-	if (r)
-		return r;
-
-	kvm_free_memslot(kvm, old);
-	return 0;
+	return kvm_set_memslot(kvm, mem, old, &new, as_id, KVM_MR_DELETE);
 }
 
 /*
@@ -1516,12 +1514,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
 		return -EINVAL;
 
-	/*
-	 * Make a full copy of the old memslot, the pointer will become stale
-	 * when the memslots are re-sorted by update_memslots(), and the old
-	 * memslot needs to be referenced after calling update_memslots(), e.g.
-	 * to free its resources and for arch specific behavior.
-	 */
 	tmp = id_to_memslot(__kvm_memslots(kvm, as_id), id);
 	if (tmp) {
 		old = *tmp;
@@ -1567,8 +1559,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	}
 
 	if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
+		int ctr;
+
 		/* Check for overlaps */
-		kvm_for_each_memslot(tmp, __kvm_memslots(kvm, as_id)) {
+		kvm_for_each_memslot(tmp, ctr, __kvm_memslots(kvm, as_id)) {
 			if (tmp->id == id)
 				continue;
 			if (!((new.base_gfn + new.npages <= tmp->base_gfn) ||

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 7/8] KVM: Optimize gfn lookup in kvm_zap_gfn_range()
  2021-05-16 21:44 [PATCH v3 0/8] KVM: Scalable memslots implementation Maciej S. Szmigiero
                   ` (5 preceding siblings ...)
  2021-05-16 21:44 ` [PATCH v3 6/8] KVM: Keep memslots in tree-based structures instead of array-based ones Maciej S. Szmigiero
@ 2021-05-16 21:44 ` Maciej S. Szmigiero
  2021-05-26 17:33   ` Sean Christopherson
  2021-05-16 21:44 ` [PATCH v3 8/8] KVM: Optimize overlapping memslots check Maciej S. Szmigiero
  7 siblings, 1 reply; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-16 21:44 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Igor Mammedov,
	Marc Zyngier, James Morse, Julien Thierry, Suzuki K Poulose,
	Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Christian Borntraeger, Janosch Frank, David Hildenbrand,
	Cornelia Huck, Claudio Imbrenda, Joerg Roedel, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Introduce a memslots gfn upper bound operation and use it to optimize
kvm_zap_gfn_range().
This way the handler can do a quick lookup for intersecting gfns and won't
have to do a linear scan of the whole memslot set.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 arch/x86/kvm/mmu/mmu.c   | 41 ++++++++++++++++++++++++++++++++++++++--
 include/linux/kvm_host.h | 22 +++++++++++++++++++++
 2 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7222b552d139..f23398cf0316 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5490,14 +5490,51 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	int i;
 	bool flush = false;
 
+	if (gfn_end == gfn_start || WARN_ON(gfn_end < gfn_start))
+		return;
+
 	write_lock(&kvm->mmu_lock);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		int ctr;
+		int idxactive;
+		struct rb_node *node;
 
 		slots = __kvm_memslots(kvm, i);
-		kvm_for_each_memslot(memslot, ctr, slots) {
+		idxactive = kvm_memslots_idx(slots);
+
+		/*
+		 * Find the slot with the lowest gfn that can possibly intersect with
+		 * the range, so we'll ideally have slot start <= range start
+		 */
+		node = kvm_memslots_gfn_upper_bound(slots, gfn_start);
+		if (node) {
+			struct rb_node *pnode;
+
+			/*
+			 * A NULL previous node means that the very first slot
+			 * already has a higher start gfn.
+			 * In this case slot start > range start.
+			 */
+			pnode = rb_prev(node);
+			if (pnode)
+				node = pnode;
+		} else {
+			/* a NULL node below means no slots */
+			node = rb_last(&slots->gfn_tree);
+		}
+
+		for ( ; node; node = rb_next(node)) {
 			gfn_t start, end;
 
+			memslot = container_of(node, struct kvm_memory_slot,
+					       gfn_node[idxactive]);
+
+			/*
+			 * If this slot starts beyond or at the end of the range,
+			 * so does every next one
+			 */
+			if (memslot->base_gfn >= gfn_end)
+				break;
+
 			start = max(gfn_start, memslot->base_gfn);
 			end = min(gfn_end, memslot->base_gfn + memslot->npages);
 			if (start >= end)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a9c5b0df2311..fd88e971eef2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -709,6 +709,28 @@ struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
 	return NULL;
 }
 
+static inline
+struct rb_node *kvm_memslots_gfn_upper_bound(struct kvm_memslots *slots,
+					     gfn_t gfn)
+{
+	int idxactive = kvm_memslots_idx(slots);
+	struct rb_node *node, *result = NULL;
+
+	for (node = slots->gfn_tree.rb_node; node; ) {
+		struct kvm_memory_slot *slot;
+
+		slot = container_of(node, struct kvm_memory_slot,
+				    gfn_node[idxactive]);
+		if (gfn < slot->base_gfn) {
+			result = node;
+			node = node->rb_left;
+		} else
+			node = node->rb_right;
+	}
+
+	return result;
+}
+
 /*
  * KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
  * - create a new memory slot

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 8/8] KVM: Optimize overlapping memslots check
  2021-05-16 21:44 [PATCH v3 0/8] KVM: Scalable memslots implementation Maciej S. Szmigiero
                   ` (6 preceding siblings ...)
  2021-05-16 21:44 ` [PATCH v3 7/8] KVM: Optimize gfn lookup in kvm_zap_gfn_range() Maciej S. Szmigiero
@ 2021-05-16 21:44 ` Maciej S. Szmigiero
  7 siblings, 0 replies; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-16 21:44 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Igor Mammedov,
	Marc Zyngier, James Morse, Julien Thierry, Suzuki K Poulose,
	Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Christian Borntraeger, Janosch Frank, David Hildenbrand,
	Cornelia Huck, Claudio Imbrenda, Joerg Roedel, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Do a quick lookup for possibly overlapping gfns when creating or moving
a memslot instead of performing a linear scan of the whole memslot set.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 virt/kvm/kvm_main.c | 65 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 56 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 189504b27ca6..11814730cbcb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1474,6 +1474,59 @@ static int kvm_delete_memslot(struct kvm *kvm,
 	return kvm_set_memslot(kvm, mem, old, &new, as_id, KVM_MR_DELETE);
 }
 
+static bool kvm_check_memslot_overlap(struct kvm_memslots *slots,
+				      struct kvm_memory_slot *nslot)
+{
+	int idxactive = kvm_memslots_idx(slots);
+	struct rb_node *node;
+
+	/*
+	 * Find the slot with the lowest gfn that can possibly intersect with
+	 * the new slot, so we'll ideally have slot start <= nslot start
+	 */
+	node = kvm_memslots_gfn_upper_bound(slots, nslot->base_gfn);
+	if (node) {
+		struct rb_node *pnode;
+
+		/*
+		 * A NULL previous node means that the very first slot
+		 * already has a higher start gfn.
+		 * In this case slot start > nslot start.
+		 */
+		pnode = rb_prev(node);
+		if (pnode)
+			node = pnode;
+	} else {
+		/* a NULL node below means no existing slots */
+		node = rb_last(&slots->gfn_tree);
+	}
+
+	for ( ; node; node = rb_next(node)) {
+		struct kvm_memory_slot *cslot;
+
+		cslot = container_of(node, struct kvm_memory_slot,
+				     gfn_node[idxactive]);
+
+		/*
+		 * if this slot starts beyond or at the end of the new slot
+		 * so does every next one
+		 */
+		if (cslot->base_gfn >= nslot->base_gfn + nslot->npages)
+			break;
+
+		if (cslot->id == nslot->id)
+			continue;
+
+		if (cslot->base_gfn >= nslot->base_gfn)
+			return true;
+
+		if (cslot->base_gfn + cslot->npages > nslot->base_gfn)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Allocate some memory and give it an address in the guest physical address
  * space.
@@ -1559,16 +1612,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	}
 
 	if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
-		int ctr;
-
 		/* Check for overlaps */
-		kvm_for_each_memslot(tmp, ctr, __kvm_memslots(kvm, as_id)) {
-			if (tmp->id == id)
-				continue;
-			if (!((new.base_gfn + new.npages <= tmp->base_gfn) ||
-			      (new.base_gfn >= tmp->base_gfn + tmp->npages)))
-				return -EEXIST;
-		}
+		if (kvm_check_memslot_overlap(__kvm_memslots(kvm, as_id),
+					      &new))
+			return -EEXIST;
 	}
 
 	/* Allocate/free page dirty bitmap as needed */

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/8] KVM: x86: Cache total page count to avoid traversing the memslot array
  2021-05-16 21:44 ` [PATCH v3 1/8] KVM: x86: Cache total page count to avoid traversing the memslot array Maciej S. Szmigiero
@ 2021-05-19 21:00   ` Sean Christopherson
  2021-05-21  7:03     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2021-05-19 21:00 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> There is no point in recalculating from scratch the total number of pages
> in all memslots each time a memslot is created or deleted.
> 
> Just cache the value and update it accordingly on each such operation so
> the code doesn't need to traverse the whole memslot array each time.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5bd550eaf683..8c7738b75393 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11112,9 +11112,21 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>  				const struct kvm_memory_slot *new,
>  				enum kvm_mr_change change)
>  {
> -	if (!kvm->arch.n_requested_mmu_pages)
> -		kvm_mmu_change_mmu_pages(kvm,
> -				kvm_mmu_calculate_default_mmu_pages(kvm));
> +	if (change == KVM_MR_CREATE)
> +		kvm->arch.n_memslots_pages += new->npages;
> +	else if (change == KVM_MR_DELETE) {
> +		WARN_ON(kvm->arch.n_memslots_pages < old->npages);

Heh, so I think this WARN can be triggered at will by userspace on 32-bit KVM by
causing the running count to wrap.  KVM artificially caps the size of a single
memslot at ((1UL << 31) - 1), but userspace could create multiple gigantic slots
to overflow arch.n_memslots_pages.

I _think_ changing it to a u64 would fix the problem since KVM forbids overlapping
memslots in the GPA space.

Also, what about moving the check-and-WARN to prepare_memory_region() so that
KVM can error out if the check fails?  Doesn't really matter, but an explicit
error for userspace is preferable to underflowing the number of pages and getting
weird MMU errors/behavior down the line.
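
Something like this, perhaps (completely untested, and assuming the old slot
is still available at that point in some form; the error code is arbitrary):

	if (change == KVM_MR_DELETE &&
	    WARN_ON_ONCE(kvm->arch.n_memslots_pages < old->npages))
		return -EINVAL;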

> +		kvm->arch.n_memslots_pages -= old->npages;
> +	}
> +
> +	if (!kvm->arch.n_requested_mmu_pages) {

If we're going to bother caching the number of pages then we should also skip
the update when the number of pages isn't changing, e.g.

	if (change == KVM_MR_CREATE || change == KVM_MR_DELETE) {
		if (change == KVM_MR_CREATE)
			kvm->arch.n_memslots_pages += new->npages;
		else
			kvm->arch.n_memslots_pages -= old->npages;

		if (!kvm->arch.n_requested_mmu_pages) {
			unsigned long nr_mmu_pages;

			nr_mmu_pages = kvm->arch.n_memslots_pages *
				       KVM_PERMILLE_MMU_PAGES / 1000;
			nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
			kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
		}
	}

> +		unsigned long nr_mmu_pages;
> +
> +		nr_mmu_pages = kvm->arch.n_memslots_pages *
> +			       KVM_PERMILLE_MMU_PAGES / 1000;
> +		nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
> +		kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
> +	}
>  
>  	/*
>  	 * FIXME: const-ify all uses of struct kvm_memory_slot.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/8] KVM: Integrate gfn_to_memslot_approx() into search_memslots()
  2021-05-16 21:44 ` [PATCH v3 2/8] KVM: Integrate gfn_to_memslot_approx() into search_memslots() Maciej S. Szmigiero
@ 2021-05-19 21:24   ` Sean Christopherson
  2021-05-21  7:03     ` Maciej S. Szmigiero
  2021-06-10 16:17     ` Paolo Bonzini
  0 siblings, 2 replies; 25+ messages in thread
From: Sean Christopherson @ 2021-05-19 21:24 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8895b95b6a22..3c40c7d32f7e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1091,10 +1091,14 @@ bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
>   * gfn_to_memslot() itself isn't here as an inline because that would
>   * bloat other code too much.
>   *
> + * With "approx" set returns the memslot also when the address falls
> + * in a hole. In that case one of the memslots bordering the hole is
> + * returned.
> + *
>   * IMPORTANT: Slots are sorted from highest GFN to lowest GFN!
>   */
>  static inline struct kvm_memory_slot *
> -search_memslots(struct kvm_memslots *slots, gfn_t gfn)
> +search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)

An alternative to modifying the PPC code would be to make the existing
search_memslots() a wrapper to __search_memslots(), with the latter taking
@approx.

We might also want to make this __always_inline to improve the likelihood of the
compiler optimizing away @approx.  I doubt it matters in practice...
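
E.g. something along these lines (untested sketch; the PPC callers could then
keep using plain search_memslots()):

	static __always_inline struct kvm_memory_slot *
	__search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)
	{
		/* current search_memslots() body, including the @approx handling */
		...
	}

	static inline struct kvm_memory_slot *
	search_memslots(struct kvm_memslots *slots, gfn_t gfn)
	{
		return __search_memslots(slots, gfn, false);
	}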

>  {
>  	int start = 0, end = slots->used_slots;
>  	int slot = atomic_read(&slots->lru_slot);
> @@ -1116,19 +1120,22 @@ search_memslots(struct kvm_memslots *slots, gfn_t gfn)
>  			start = slot + 1;
>  	}
>  
> +	if (approx && start >= slots->used_slots)
> +		return &memslots[slots->used_slots - 1];
> +
>  	if (start < slots->used_slots && gfn >= memslots[start].base_gfn &&
>  	    gfn < memslots[start].base_gfn + memslots[start].npages) {
>  		atomic_set(&slots->lru_slot, start);
>  		return &memslots[start];
>  	}
>  
> -	return NULL;
> +	return approx ? &memslots[start] : NULL;
>  }
>  
>  static inline struct kvm_memory_slot *
>  __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
>  {
> -	return search_memslots(slots, gfn);
> +	return search_memslots(slots, gfn, false);
>  }
>  
>  static inline unsigned long

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/8] KVM: Resolve memslot ID via a hash table instead of via a static array
  2021-05-16 21:44 ` [PATCH v3 3/8] KVM: Resolve memslot ID via a hash table instead of via a static array Maciej S. Szmigiero
@ 2021-05-19 22:31   ` Sean Christopherson
  2021-05-21  7:05     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2021-05-19 22:31 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
> @@ -356,6 +357,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>  #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
>  
>  struct kvm_memory_slot {
> +	struct hlist_node id_node;
>  	gfn_t base_gfn;
>  	unsigned long npages;
>  	unsigned long *dirty_bitmap;
> @@ -458,7 +460,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  struct kvm_memslots {
>  	u64 generation;
>  	/* The mapping table from slot id to the index in memslots[]. */
> -	short id_to_index[KVM_MEM_SLOTS_NUM];
> +	DECLARE_HASHTABLE(id_hash, 7);

Is there any specific motivation for using 7 bits?

>  	atomic_t lru_slot;
>  	int used_slots;
>  	struct kvm_memory_slot memslots[];

...

> @@ -1097,14 +1095,16 @@ static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot)
>  /*
>   * Delete a memslot by decrementing the number of used slots and shifting all
>   * other entries in the array forward one spot.
> + * @memslot is a detached dummy struct with just .id and .as_id filled.
>   */
>  static inline void kvm_memslot_delete(struct kvm_memslots *slots,
>  				      struct kvm_memory_slot *memslot)
>  {
>  	struct kvm_memory_slot *mslots = slots->memslots;
> +	struct kvm_memory_slot *dmemslot = id_to_memslot(slots, memslot->id);

I vote to call these local vars "old", or something along those lines.  dmemslot
isn't too bad, but mmemslot in the helpers below is far too similar to memslot,
and using the wrong one will cause nasty explosions.
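
E.g.:

	struct kvm_memory_slot *old = id_to_memslot(slots, memslot->id);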

>  	int i;
>  
> -	if (WARN_ON(slots->id_to_index[memslot->id] == -1))
> +	if (WARN_ON(!dmemslot))
>  		return;
>  
>  	slots->used_slots--;
> @@ -1112,12 +1112,13 @@ static inline void kvm_memslot_delete(struct kvm_memslots *slots,
>  	if (atomic_read(&slots->lru_slot) >= slots->used_slots)
>  		atomic_set(&slots->lru_slot, 0);
>  
> -	for (i = slots->id_to_index[memslot->id]; i < slots->used_slots; i++) {
> +	for (i = dmemslot - mslots; i < slots->used_slots; i++) {
> +		hash_del(&mslots[i].id_node);
>  		mslots[i] = mslots[i + 1];
> -		slots->id_to_index[mslots[i].id] = i;
> +		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
>  	}
> +	hash_del(&mslots[i].id_node);
>  	mslots[i] = *memslot;
> -	slots->id_to_index[memslot->id] = -1;
>  }
>  
>  /*
> @@ -1135,31 +1136,41 @@ static inline int kvm_memslot_insert_back(struct kvm_memslots *slots)
>   * itself is not preserved in the array, i.e. not swapped at this time, only
>   * its new index into the array is tracked.  Returns the changed memslot's
>   * current index into the memslots array.
> + * The memslot at the returned index will not be in @slots->id_hash by then.
> + * @memslot is a detached struct with desired final data of the changed slot.
>   */
>  static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
>  					    struct kvm_memory_slot *memslot)
>  {
>  	struct kvm_memory_slot *mslots = slots->memslots;
> +	struct kvm_memory_slot *mmemslot = id_to_memslot(slots, memslot->id);
>  	int i;
>  
> -	if (WARN_ON_ONCE(slots->id_to_index[memslot->id] == -1) ||
> +	if (WARN_ON_ONCE(!mmemslot) ||
>  	    WARN_ON_ONCE(!slots->used_slots))
>  		return -1;
>  
> +	/*
> +	 * update_memslots() will unconditionally overwrite and re-add the
> +	 * target memslot so it has to be removed here first
> +	 */

It would be helpful to explain "why" this is necessary.  Something like:

	/*
	 * The memslot is being moved, delete its previous hash entry; its new
	 * entry will be added by updated_memslots().  The old entry cannot be
	 * kept even though its id is unchanged, because the old entry points at
	 * the memslot in the old instance of memslots.
	 */

> +	hash_del(&mmemslot->id_node);
> +
>  	/*
>  	 * Move the target memslot backward in the array by shifting existing
>  	 * memslots with a higher GFN (than the target memslot) towards the
>  	 * front of the array.
>  	 */
> -	for (i = slots->id_to_index[memslot->id]; i < slots->used_slots - 1; i++) {
> +	for (i = mmemslot - mslots; i < slots->used_slots - 1; i++) {
>  		if (memslot->base_gfn > mslots[i + 1].base_gfn)
>  			break;
>  
>  		WARN_ON_ONCE(memslot->base_gfn == mslots[i + 1].base_gfn);
>  
>  		/* Shift the next memslot forward one and update its index. */
> +		hash_del(&mslots[i + 1].id_node);
>  		mslots[i] = mslots[i + 1];
> -		slots->id_to_index[mslots[i].id] = i;
> +		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
>  	}
>  	return i;
>  }
> @@ -1170,6 +1181,10 @@ static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
>   * is not preserved in the array, i.e. not swapped at this time, only its new
>   * index into the array is tracked.  Returns the changed memslot's final index
>   * into the memslots array.
> + * The memslot at the returned index will not be in @slots->id_hash by then.
> + * @memslot is a detached struct with desired final data of the new or
> + * changed slot.
> + * Assumes that the memslot at @start index is not in @slots->id_hash.
>   */
>  static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
>  					   struct kvm_memory_slot *memslot,
> @@ -1185,8 +1200,9 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
>  		WARN_ON_ONCE(memslot->base_gfn == mslots[i - 1].base_gfn);
>  
>  		/* Shift the next memslot back one and update its index. */
> +		hash_del(&mslots[i - 1].id_node);
>  		mslots[i] = mslots[i - 1];
> -		slots->id_to_index[mslots[i].id] = i;
> +		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
>  	}
>  	return i;
>  }
> @@ -1231,6 +1247,9 @@ static inline int kvm_memslot_move_forward(struct kvm_memslots *slots,
>   * most likely to be referenced, sorting it to the front of the array was
>   * advantageous.  The current binary search starts from the middle of the array
>   * and uses an LRU pointer to improve performance for all memslots and GFNs.
> + *
> + * @memslot is a detached struct, not a part of the current or new memslot
> + * array.
>   */
>  static void update_memslots(struct kvm_memslots *slots,
>  			    struct kvm_memory_slot *memslot,
> @@ -1247,12 +1266,16 @@ static void update_memslots(struct kvm_memslots *slots,
>  			i = kvm_memslot_move_backward(slots, memslot);
>  		i = kvm_memslot_move_forward(slots, memslot, i);
>  
> +		if (i < 0)
> +			return;

Hmm, this is essentially a "fix" to existing code, it should be in a separate
patch.  And since kvm_memslot_move_forward() can theoretically hit this even if
kvm_memslot_move_backward() doesn't return -1, i.e. doesn't WARN, what about
doing WARN_ON_ONCE() here and dropping the WARNs in kvm_memslot_move_backward()?
It'll be slightly less developer friendly, but anyone that has the unfortunate
pleasure of breaking and debugging this code is already in for a world of pain.
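
I.e. something like (untested):

		i = kvm_memslot_move_forward(slots, memslot, i);
		if (WARN_ON_ONCE(i < 0))
			return;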

> +
>  		/*
>  		 * Copy the memslot to its new position in memslots and update
>  		 * its index accordingly.
>  		 */
>  		slots->memslots[i] = *memslot;
> -		slots->id_to_index[memslot->id] = i;
> +		hash_add(slots->id_hash, &slots->memslots[i].id_node,
> +			 memslot->id);
>  	}
>  }
>  
> @@ -1316,6 +1339,7 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
>  {
>  	struct kvm_memslots *slots;
>  	size_t old_size, new_size;
> +	struct kvm_memory_slot *memslot;
>  
>  	old_size = sizeof(struct kvm_memslots) +
>  		   (sizeof(struct kvm_memory_slot) * old->used_slots);
> @@ -1326,8 +1350,14 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
>  		new_size = old_size;
>  
>  	slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
> -	if (likely(slots))
> -		memcpy(slots, old, old_size);
> +	if (unlikely(!slots))
> +		return NULL;
> +
> +	memcpy(slots, old, old_size);
> +
> +	hash_init(slots->id_hash);
> +	kvm_for_each_memslot(memslot, slots)
> +		hash_add(slots->id_hash, &memslot->id_node, memslot->id);

What's the perf penalty if the number of memslots gets large?  I ask because the
lazy rmap allocation is adding multiple calls to kvm_dup_memslots().

>  	return slots;
>  }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 4/8] KVM: Introduce memslots hva tree
  2021-05-16 21:44 ` [PATCH v3 4/8] KVM: Introduce memslots hva tree Maciej S. Szmigiero
@ 2021-05-19 23:07   ` Sean Christopherson
  2021-05-21  7:06     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2021-05-19 23:07 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

Nit: something like "KVM: Use interval tree to do fast hva lookup in memslots"
would be more helpful when perusing the shortlogs.  Stating that a tree is being
added doesn't provide any hint as to why, and even the "what" is somewhat unclear.

On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> The current memslots implementation only allows quick binary search by gfn,
> quick lookup by hva is not possible - the implementation has to do a linear
> scan of the whole memslots array, even though the operation being performed
> might apply just to a single memslot.
> 
> This significantly hurts performance of per-hva operations with higher
> memslot counts.
> 
> Since hva ranges can overlap between memslots an interval tree is needed
> for tracking them.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---

...

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d3a35646dfd8..f59847b6e9b3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -27,6 +27,7 @@
>  #include <linux/rcuwait.h>
>  #include <linux/refcount.h>
>  #include <linux/nospec.h>
> +#include <linux/interval_tree.h>
>  #include <linux/hashtable.h>
>  #include <asm/signal.h>
>  
> @@ -358,6 +359,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>  
>  struct kvm_memory_slot {
>  	struct hlist_node id_node;
> +	struct interval_tree_node hva_node;
>  	gfn_t base_gfn;
>  	unsigned long npages;
>  	unsigned long *dirty_bitmap;
> @@ -459,6 +461,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>   */
>  struct kvm_memslots {
>  	u64 generation;
> +	struct rb_root_cached hva_tree;
>  	/* The mapping table from slot id to the index in memslots[]. */
>  	DECLARE_HASHTABLE(id_hash, 7);
>  	atomic_t lru_slot;
> @@ -679,6 +682,11 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
>  	return __kvm_memslots(vcpu->kvm, as_id);
>  }
>  
> +#define kvm_for_each_hva_range_memslot(node, slots, start, last)	     \

kvm_for_each_memslot_in_range()?  Or kvm_for_each_memslot_in_hva_range()?

Please add a comment about whether start is inclusive or exclusive.

I'd also be in favor of hiding this in kvm_main.c, just above the MMU notifier
usage.  It'd be nice to discourage arch code from adding lookups that more than
likely belong in generic code.
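
E.g. something like this in kvm_main.c, with the interval tree's inclusive
@last made explicit (untested):

/* Iterate over each memslot intersecting the [start, last] (inclusive) range */
#define kvm_for_each_memslot_in_hva_range(node, slots, start, last)	     \
	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
	     node;							     \
	     node = interval_tree_iter_next(node, start, last))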

> +	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
> +	     node;							     \
> +	     node = interval_tree_iter_next(node, start, last))	     \
> +
>  static inline
>  struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
>  {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 50f9bc9bb1e0..a55309432c9a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -488,6 +488,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  	struct kvm_memslots *slots;
>  	int i, idx;
>  
> +	if (range->end == range->start || WARN_ON(range->end < range->start))

I'm pretty sure both of these are WARNable offenses, i.e. they can be combined.
It'd also be a good idea to use WARN_ON_ONCE(); if a caller does manage to
trigger this, odds are good it will get spammed.

Also, does interval_tree_iter_first() explode if given bad inputs?  If not, I'd
probably say just omit this entirely.  If it does explode, it might be a good idea
to work the sanity check into the macro, even if the macro is hidden here.
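
E.g. simply (untested):

	if (WARN_ON_ONCE(range->end <= range->start))
		return 0;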

> +		return 0;
> +
>  	/* A null handler is allowed if and only if on_lock() is provided. */
>  	if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
>  			 IS_KVM_NULL_FN(range->handler)))
> @@ -507,15 +510,18 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  	}
>  
>  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +		struct interval_tree_node *node;
> +
>  		slots = __kvm_memslots(kvm, i);
> -		kvm_for_each_memslot(slot, slots) {
> +		kvm_for_each_hva_range_memslot(node, slots,
> +					       range->start, range->end - 1) {
>  			unsigned long hva_start, hva_end;
>  
> +			slot = container_of(node, struct kvm_memory_slot,
> +					    hva_node);

Eh, let that poke out.  The 80 limit is more of a guideline.

>  			hva_start = max(range->start, slot->userspace_addr);
>  			hva_end = min(range->end, slot->userspace_addr +
>  						  (slot->npages << PAGE_SHIFT));
> -			if (hva_start >= hva_end)
> -				continue;
>  
>  			/*
>  			 * To optimize for the likely case where the address
> @@ -787,6 +793,7 @@ static struct kvm_memslots *kvm_alloc_memslots(void)
>  	if (!slots)
>  		return NULL;
>  
> +	slots->hva_tree = RB_ROOT_CACHED;
>  	hash_init(slots->id_hash);
>  
>  	return slots;
> @@ -1113,10 +1120,14 @@ static inline void kvm_memslot_delete(struct kvm_memslots *slots,
>  		atomic_set(&slots->lru_slot, 0);
>  
>  	for (i = dmemslot - mslots; i < slots->used_slots; i++) {
> +		interval_tree_remove(&mslots[i].hva_node, &slots->hva_tree);
>  		hash_del(&mslots[i].id_node);

I think it would make sense to add helpers for these?  Not sure I like the names,
but it would certainly dedup the code a bit.

static void kvm_memslot_remove(struct kvm_memslots *slots,
			       struct kvm_memslot *memslot)
{
	interval_tree_remove(&memslot->hva_node, &slots->hva_tree);
	hash_del(&memslot->id_node);
}

static void kvm_memslot_insert(struct kvm_memslots *slots,
			       struct kvm_memslot *memslot)
{
	interval_tree_insert(&memslot->hva_node, &slots->hva_tree);
	hash_add(slots->id_hash, &memslot->id_node, memslot->id);
}

> +
>  		mslots[i] = mslots[i + 1];
> +		interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
>  		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
>  	}
> +	interval_tree_remove(&mslots[i].hva_node, &slots->hva_tree);
>  	hash_del(&mslots[i].id_node);
>  	mslots[i] = *memslot;
>  }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 6/8] KVM: Keep memslots in tree-based structures instead of array-based ones
  2021-05-16 21:44 ` [PATCH v3 6/8] KVM: Keep memslots in tree-based structures instead of array-based ones Maciej S. Szmigiero
@ 2021-05-19 23:10   ` Sean Christopherson
  2021-05-21  7:06     ` Maciej S. Szmigiero
  2021-05-25 23:21   ` Sean Christopherson
  1 sibling, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2021-05-19 23:10 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

...

>  arch/arm64/kvm/mmu.c                |   8 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c |   4 +-
>  arch/powerpc/kvm/book3s_hv.c        |   3 +-
>  arch/powerpc/kvm/book3s_hv_nested.c |   4 +-
>  arch/powerpc/kvm/book3s_hv_uvmem.c  |  14 +-
>  arch/s390/kvm/kvm-s390.c            |  27 +-
>  arch/s390/kvm/kvm-s390.h            |   7 +-
>  arch/x86/kvm/mmu/mmu.c              |   4 +-
>  include/linux/kvm_host.h            | 100 ++---
>  virt/kvm/kvm_main.c                 | 580 ++++++++++++++--------------
>  10 files changed, 379 insertions(+), 372 deletions(-)

I got through the easy ones, I'll circle back to this one a different day when
my brain is fresh :-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/8] KVM: x86: Cache total page count to avoid traversing the memslot array
  2021-05-19 21:00   ` Sean Christopherson
@ 2021-05-21  7:03     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-21  7:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On 19.05.2021 23:00, Sean Christopherson wrote:
> On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> There is no point in recalculating from scratch the total number of pages
>> in all memslots each time a memslot is created or deleted.
>>
>> Just cache the value and update it accordingly on each such operation so
>> the code doesn't need to traverse the whole memslot array each time.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 5bd550eaf683..8c7738b75393 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -11112,9 +11112,21 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>>   				const struct kvm_memory_slot *new,
>>   				enum kvm_mr_change change)
>>   {
>> -	if (!kvm->arch.n_requested_mmu_pages)
>> -		kvm_mmu_change_mmu_pages(kvm,
>> -				kvm_mmu_calculate_default_mmu_pages(kvm));
>> +	if (change == KVM_MR_CREATE)
>> +		kvm->arch.n_memslots_pages += new->npages;
>> +	else if (change == KVM_MR_DELETE) {
>> +		WARN_ON(kvm->arch.n_memslots_pages < old->npages);
> 
> Heh, so I think this WARN can be triggered at will by userspace on 32-bit KVM by
> causing the running count to wrap.  KVM artificially caps the size of a single
> memslot at ((1UL << 31) - 1), but userspace could create multiple gigantic slots
> to overflow arch.n_memslots_pages.
> 
> I _think_ changing it to a u64 would fix the problem since KVM forbids overlapping
> memslots in the GPA space.

You are right, n_memslots_pages needs to be u64 so it does not overflow
on 32-bit KVM.

The memslot count is limited to 32k in each of the 2 address spaces, so in
the worst case the variable has to hold a 15-bit + 1-bit + 31-bit = 47-bit number.
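
For illustration, a minimal sketch of the fix being discussed (the field is
presumably declared in x86's struct kvm_arch; the comment wording here is mine):

	/*
	 * Worst case: 2 address spaces * 32k slots * (2^31 - 1) pages per
	 * slot is a ~47-bit page count, so a u64 cannot overflow even on
	 * 32-bit KVM.
	 */
	u64 n_memslots_pages;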

> Also, what about moving the check-and-WARN to prepare_memory_region() so that
> KVM can error out if the check fails?  Doesn't really matter, but an explicit
> error for userspace is preferable to underflowing the number of pages and getting
> weird MMU errors/behavior down the line.

In principle this seems like a possibility; however, it is a more
regression-risky option, in case something has (perhaps unintentionally)
relied on the fact that the kvm_mmu_zap_oldest_mmu_pages() call from
kvm_mmu_change_mmu_pages() was being done only in the memslot commit
function.

>> +		kvm->arch.n_memslots_pages -= old->npages;
>> +	}
>> +
>> +	if (!kvm->arch.n_requested_mmu_pages) {
> 
> If we're going to bother caching the number of pages then we should also skip
> the update when the number pages isn't changing, e.g.
> 
> 	if (change == KVM_MR_CREATE || change == KVM_MR_DELETE) {
> 		if (change == KVM_MR_CREATE)
> 			kvm->arch.n_memslots_pages += new->npages;
> 		else
> 			kvm->arch.n_memslots_pages -= old->npages;
> 
> 		if (!kvm->arch.n_requested_mmu_pages) {
> 			unsigned long nr_mmu_pages;
> 
> 			nr_mmu_pages = kvm->arch.n_memslots_pages *
> 				       KVM_PERMILLE_MMU_PAGES / 1000;
> 			nr_mmu_pages = max(nr_mmu_pages, KVM_MIN_ALLOC_MMU_PAGES);
> 			kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
> 		}
> 	}

The old code did it that way (unconditionally) and, as in the case above,
I didn't want to risk a regression.
If we are going to change this fact then I think it should happen in a
separate patch.

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/8] KVM: Integrate gfn_to_memslot_approx() into search_memslots()
  2021-05-19 21:24   ` Sean Christopherson
@ 2021-05-21  7:03     ` Maciej S. Szmigiero
  2021-06-10 16:17     ` Paolo Bonzini
  1 sibling, 0 replies; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-21  7:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On 19.05.2021 23:24, Sean Christopherson wrote:
> On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 8895b95b6a22..3c40c7d32f7e 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -1091,10 +1091,14 @@ bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
>>    * gfn_to_memslot() itself isn't here as an inline because that would
>>    * bloat other code too much.
>>    *
>> + * With "approx" set returns the memslot also when the address falls
>> + * in a hole. In that case one of the memslots bordering the hole is
>> + * returned.
>> + *
>>    * IMPORTANT: Slots are sorted from highest GFN to lowest GFN!
>>    */
>>   static inline struct kvm_memory_slot *
>> -search_memslots(struct kvm_memslots *slots, gfn_t gfn)
>> +search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)
> 
> An alternative to modifying the PPC code would be to make the existing
> search_memslots() a wrapper to __search_memslots(), with the latter taking
> @approx.

I guess you mean that if search_memslots() only does an exact search
(like the current code does) its 3 callers won't have to be modified.
Will do it then.
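
For reference, a minimal sketch of the wrapper shape being discussed (with the
existing lookup body moved into a __search_memslots() that takes @approx):

static inline struct kvm_memory_slot *
search_memslots(struct kvm_memslots *slots, gfn_t gfn)
{
	return __search_memslots(slots, gfn, false);
}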

> We might also want to make this __always_inline to improve the likelihood of the
> compiler optimizing away @approx.  I doubt it matters in practice...

Sounds like a good idea, will do.

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/8] KVM: Resolve memslot ID via a hash table instead of via a static array
  2021-05-19 22:31   ` Sean Christopherson
@ 2021-05-21  7:05     ` Maciej S. Szmigiero
  2021-05-22 11:11       ` Maciej S. Szmigiero
  0 siblings, 1 reply; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-21  7:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On 20.05.2021 00:31, Sean Christopherson wrote:
> On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
>> @@ -356,6 +357,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>>   #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
>>   
>>   struct kvm_memory_slot {
>> +	struct hlist_node id_node;
>>   	gfn_t base_gfn;
>>   	unsigned long npages;
>>   	unsigned long *dirty_bitmap;
>> @@ -458,7 +460,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>>   struct kvm_memslots {
>>   	u64 generation;
>>   	/* The mapping table from slot id to the index in memslots[]. */
>> -	short id_to_index[KVM_MEM_SLOTS_NUM];
>> +	DECLARE_HASHTABLE(id_hash, 7);
> 
> Is there any specific motivation for using 7 bits?

At the time this code was written "id_to_index" was 512 entries * 2 bytes =
1024 bytes in size and I didn't want to unnecessarily make
struct kvm_memslots bigger, so I tried using a hashtable array of the
same size (128 bucket heads * 8 bytes).
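
For comparison, the two layouts weighed here end up the same size on a 64-bit
kernel (a sketch based on the declarations quoted above):

	/* Old: id -> array index table, 512 entries * 2 bytes = 1024 bytes */
	short id_to_index[KVM_MEM_SLOTS_NUM];

	/* New: 2^7 = 128 hlist_head bucket heads * 8 bytes = 1024 bytes */
	DECLARE_HASHTABLE(id_hash, 7);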

I did a few performance measurements back then and I remember there was
only a small performance difference in comparison to using a larger
hashtable (for 509 memslots), so it seemed like a good compromise.

The KVM selftest framework patch actually uses a 9-bit hashtable so the
509 original memslots have a chance to be stored without hash collisions.

Another option would be to use a dynamically-resizable hashtable but this
would make the code significantly more complex and possibly introduce new
performance corner cases (like a workload that forces the hashtable to grow
and shrink repeatedly).

>>   	atomic_t lru_slot;
>>   	int used_slots;
>>   	struct kvm_memory_slot memslots[];
> 
> ...
> 
>> @@ -1097,14 +1095,16 @@ static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot)
>>   /*
>>    * Delete a memslot by decrementing the number of used slots and shifting all
>>    * other entries in the array forward one spot.
>> + * @memslot is a detached dummy struct with just .id and .as_id filled.
>>    */
>>   static inline void kvm_memslot_delete(struct kvm_memslots *slots,
>>   				      struct kvm_memory_slot *memslot)
>>   {
>>   	struct kvm_memory_slot *mslots = slots->memslots;
>> +	struct kvm_memory_slot *dmemslot = id_to_memslot(slots, memslot->id);
> 
> I vote to call these local vars "old", or something along those lines.  dmemslot
> isn't too bad, but mmemslot in the helpers below is far too similar to memslot,
> and using the wrong will cause nasty explosions.

Will rename to "oldslot" then.

(..)
>> @@ -1135,31 +1136,41 @@ static inline int kvm_memslot_insert_back(struct kvm_memslots *slots)
>>    * itself is not preserved in the array, i.e. not swapped at this time, only
>>    * its new index into the array is tracked.  Returns the changed memslot's
>>    * current index into the memslots array.
>> + * The memslot at the returned index will not be in @slots->id_hash by then.
>> + * @memslot is a detached struct with desired final data of the changed slot.
>>    */
>>   static inline int kvm_memslot_move_backward(struct kvm_memslots *slots,
>>   					    struct kvm_memory_slot *memslot)
>>   {
>>   	struct kvm_memory_slot *mslots = slots->memslots;
>> +	struct kvm_memory_slot *mmemslot = id_to_memslot(slots, memslot->id);
>>   	int i;
>>   
>> -	if (WARN_ON_ONCE(slots->id_to_index[memslot->id] == -1) ||
>> +	if (WARN_ON_ONCE(!mmemslot) ||
>>   	    WARN_ON_ONCE(!slots->used_slots))
>>   		return -1;
>>   
>> +	/*
>> +	 * update_memslots() will unconditionally overwrite and re-add the
>> +	 * target memslot so it has to be removed here first
>> +	 */
> 
> It would be helpful to explain "why" this is necessary.  Something like:
> 
> 	/*
> 	 * The memslot is being moved, delete its previous hash entry; its new
> 	 * entry will be added by updated_memslots().  The old entry cannot be
> 	 * kept even though its id is unchanged, because the old entry points at
> 	 * the memslot in the old instance of memslots.
> 	 */

Well, this isn't technically true, since kvm_dup_memslots() reinits
the hashtable of the copied memslots array and re-adds all the existing
memslots there.

The reasons this memslot is getting removed from the hashtable are that:
a) The loop below will (possibly) overwrite it with data of the next
memslot, or a similar loop in kvm_memslot_move_forward() will overwrite
it with data of the previous memslot,

b) update_memslots() will overwrite it with data of the target memslot.

The comment above only refers to case b), so I will update it to
also cover case a).

(..)
>> @@ -1247,12 +1266,16 @@ static void update_memslots(struct kvm_memslots *slots,
>>   			i = kvm_memslot_move_backward(slots, memslot);
>>   		i = kvm_memslot_move_forward(slots, memslot, i);
>>   
>> +		if (i < 0)
>> +			return;
> 
> Hmm, this is essentially a "fix" to existing code, it should be in a separate
> patch.  And since kvm_memslot_move_forward() can theoretically hit this even if
> kvm_memslot_move_backward() doesn't return -1, i.e. doesn't WARN, what about
> doing WARN_ON_ONCE() here and dropping the WARNs in kvm_memslot_move_backward()?
> It'll be slightly less developer friendly, but anyone that has the unfortunate
> pleasure of breaking and debugging this code is already in for a world of pain.

Will do.

>> +
>>   		/*
>>   		 * Copy the memslot to its new position in memslots and update
>>   		 * its index accordingly.
>>   		 */
>>   		slots->memslots[i] = *memslot;
>> -		slots->id_to_index[memslot->id] = i;
>> +		hash_add(slots->id_hash, &slots->memslots[i].id_node,
>> +			 memslot->id);
>>   	}
>>   }
>>   
>> @@ -1316,6 +1339,7 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
>>   {
>>   	struct kvm_memslots *slots;
>>   	size_t old_size, new_size;
>> +	struct kvm_memory_slot *memslot;
>>   
>>   	old_size = sizeof(struct kvm_memslots) +
>>   		   (sizeof(struct kvm_memory_slot) * old->used_slots);
>> @@ -1326,8 +1350,14 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
>>   		new_size = old_size;
>>   
>>   	slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
>> -	if (likely(slots))
>> -		memcpy(slots, old, old_size);
>> +	if (unlikely(!slots))
>> +		return NULL;
>> +
>> +	memcpy(slots, old, old_size);
>> +
>> +	hash_init(slots->id_hash);
>> +	kvm_for_each_memslot(memslot, slots)
>> +		hash_add(slots->id_hash, &memslot->id_node, memslot->id);
> 
> What's the perf penalty if the number of memslots gets large?  I ask because the
> lazy rmap allocation is adding multiple calls to kvm_dup_memslots().

I would expect the "move inactive" benchmark to be closest to measuring
the performance of just a memslot array copy operation but the results
suggest that the performance stays within a ~10% window from 10 to 509
memslots on the old code (it then climbs 13x for the 32k case).

That suggests that something else is dominating this benchmark for these
memslot counts (probably zapping of shadow pages).

At the same time, the tree-based memslots implementation is clearly
faster in this benchmark, even for smaller memslot counts, so apparently
copying of the memslot array has some performance impact, too.

Measuring just kvm_dup_memslots() performance would probably be done
best by benchmarking the KVM_MR_FLAGS_ONLY operation - I will try to add this
operation to my set of benchmarks and see how it performs with different
memslot counts.

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 4/8] KVM: Introduce memslots hva tree
  2021-05-19 23:07   ` Sean Christopherson
@ 2021-05-21  7:06     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-21  7:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On 20.05.2021 01:07, Sean Christopherson wrote:
> Nit: something like "KVM: Use interval tree to do fast hva lookup in memslots"
> would be more helpful when perusing the shortlogs.  Stating that a tree is being
> added doesn't provide any hint as to why, and even the "what" is somewhat unclear.

Will do.

> On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> The current memslots implementation only allows quick binary search by gfn,
>> quick lookup by hva is not possible - the implementation has to do a linear
>> scan of the whole memslots array, even though the operation being performed
>> might apply just to a single memslot.
>>
>> This significantly hurts performance of per-hva operations with higher
>> memslot counts.
>>
>> Since hva ranges can overlap between memslots an interval tree is needed
>> for tracking them.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
> 
> ...
> 
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index d3a35646dfd8..f59847b6e9b3 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -27,6 +27,7 @@
>>   #include <linux/rcuwait.h>
>>   #include <linux/refcount.h>
>>   #include <linux/nospec.h>
>> +#include <linux/interval_tree.h>
>>   #include <linux/hashtable.h>
>>   #include <asm/signal.h>
>>   
>> @@ -358,6 +359,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>>   
>>   struct kvm_memory_slot {
>>   	struct hlist_node id_node;
>> +	struct interval_tree_node hva_node;
>>   	gfn_t base_gfn;
>>   	unsigned long npages;
>>   	unsigned long *dirty_bitmap;
>> @@ -459,6 +461,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>>    */
>>   struct kvm_memslots {
>>   	u64 generation;
>> +	struct rb_root_cached hva_tree;
>>   	/* The mapping table from slot id to the index in memslots[]. */
>>   	DECLARE_HASHTABLE(id_hash, 7);
>>   	atomic_t lru_slot;
>> @@ -679,6 +682,11 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
>>   	return __kvm_memslots(vcpu->kvm, as_id);
>>   }
>>   
>> +#define kvm_for_each_hva_range_memslot(node, slots, start, last)	     \
> 
> kvm_for_each_memslot_in_range()?  Or kvm_for_each_memslot_in_hva_range()?

Will change the name to kvm_for_each_memslot_in_hva_range(), so it is
obvious it's the *hva* range this iterates over.

> Please add a comment about whether start is inclusive or exclusive.

Will do.

> I'd also be in favor of hiding this in kvm_main.c, just above the MMU notifier
> usage.  It'd be nice to discourage arch code from adding lookups that more than
> likely belong in generic code.

Will do.

>> +	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
>> +	     node;							     \
>> +	     node = interval_tree_iter_next(node, start, last))	     \
>> +
>>   static inline
>>   struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
>>   {
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 50f9bc9bb1e0..a55309432c9a 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -488,6 +488,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>>   	struct kvm_memslots *slots;
>>   	int i, idx;
>>   
>> +	if (range->end == range->start || WARN_ON(range->end < range->start))
> 
> I'm pretty sure both of these are WARNable offenses, i.e. they can be combined.
> It'd also be a good idea to use WARN_ON_ONCE(); if a caller does manage to
> trigger this, odds are good it will get spammed.

Will do.

> Also, does interval_tree_iter_first() explode if given bad inputs?  If not, I'd
> probably say just omit this entirely.  

Looking at the interval tree code it seems it does not account for this
possibility.
But even if, after a deeper analysis, it turns out to be safe (as of now),
there is always a possibility that in the future somebody will optimize
how this data structure performs its operations.
After all, garbage in, garbage out.

> If it does explode, it might be a good idea
> to work the sanity check into the macro, even if the macro is hidden here.

Can be done, although this will make the macro a bit uglier.
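
For illustration, one way the sanity check could be folded in, with the rename
applied (just a sketch of the idea, not the final macro):

/*
 * Iterate over memslots whose hva range intersects [start, last]; note that
 * 'last' is inclusive.  A reversed range triggers a one-time WARN and the
 * loop body is simply never entered.
 */
#define kvm_for_each_memslot_in_hva_range(node, slots, start, last)	      \
	for (node = WARN_ON_ONCE((start) > (last)) ? NULL :		      \
		    interval_tree_iter_first(&(slots)->hva_tree, start, last); \
	     node;							      \
	     node = interval_tree_iter_next(node, start, last))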

>> +		return 0;
>> +
>>   	/* A null handler is allowed if and only if on_lock() is provided. */
>>   	if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
>>   			 IS_KVM_NULL_FN(range->handler)))
>> @@ -507,15 +510,18 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>>   	}
>>   
>>   	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>> +		struct interval_tree_node *node;
>> +
>>   		slots = __kvm_memslots(kvm, i);
>> -		kvm_for_each_memslot(slot, slots) {
>> +		kvm_for_each_hva_range_memslot(node, slots,
>> +					       range->start, range->end - 1) {
>>   			unsigned long hva_start, hva_end;
>>   
>> +			slot = container_of(node, struct kvm_memory_slot,
>> +					    hva_node);
> 
> Eh, let that poke out.  The 80 limit is more of a guideline.

Okay.

>>   			hva_start = max(range->start, slot->userspace_addr);
>>   			hva_end = min(range->end, slot->userspace_addr +
>>   						  (slot->npages << PAGE_SHIFT));
>> -			if (hva_start >= hva_end)
>> -				continue;
>>   
>>   			/*
>>   			 * To optimize for the likely case where the address
>> @@ -787,6 +793,7 @@ static struct kvm_memslots *kvm_alloc_memslots(void)
>>   	if (!slots)
>>   		return NULL;
>>   
>> +	slots->hva_tree = RB_ROOT_CACHED;
>>   	hash_init(slots->id_hash);
>>   
>>   	return slots;
>> @@ -1113,10 +1120,14 @@ static inline void kvm_memslot_delete(struct kvm_memslots *slots,
>>   		atomic_set(&slots->lru_slot, 0);
>>   
>>   	for (i = dmemslot - mslots; i < slots->used_slots; i++) {
>> +		interval_tree_remove(&mslots[i].hva_node, &slots->hva_tree);
>>   		hash_del(&mslots[i].id_node);
> 
> I think it would make sense to add helpers for these?  Not sure I like the names,
> but it would certainly dedup the code a bit.
> 
> static void kvm_memslot_remove(struct kvm_memslots *slots,
> 			       struct kvm_memslot *memslot)
> {
> 	interval_tree_remove(&memslot->hva_node, &slots->hva_tree);
> 	hash_del(&memslot->id_node);
> }
> 
> static void kvm_memslot_insert(struct kvm_memslots *slots,
> 			       struct kvm_memslot *memslot)
> {
> 	interval_tree_insert(&memslot->hva_node, &slots->hva_tree);
> 	hash_add(slots->id_hash, &memslot->id_node, memslot->id);
> }

This is possible; however, patch 6 replaces the whole code anyway
(and it has kvm_memslot_gfn_insert() and kvm_replace_memslot() helpers).

>> +
>>   		mslots[i] = mslots[i + 1];
>> +		interval_tree_insert(&mslots[i].hva_node, &slots->hva_tree);
>>   		hash_add(slots->id_hash, &mslots[i].id_node, mslots[i].id);
>>   	}
>> +	interval_tree_remove(&mslots[i].hva_node, &slots->hva_tree);
>>   	hash_del(&mslots[i].id_node);
>>   	mslots[i] = *memslot;
>>   }

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 6/8] KVM: Keep memslots in tree-based structures instead of array-based ones
  2021-05-19 23:10   ` Sean Christopherson
@ 2021-05-21  7:06     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-21  7:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On 20.05.2021 01:10, Sean Christopherson wrote:
> On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> ...
> 
>>   arch/arm64/kvm/mmu.c                |   8 +-
>>   arch/powerpc/kvm/book3s_64_mmu_hv.c |   4 +-
>>   arch/powerpc/kvm/book3s_hv.c        |   3 +-
>>   arch/powerpc/kvm/book3s_hv_nested.c |   4 +-
>>   arch/powerpc/kvm/book3s_hv_uvmem.c  |  14 +-
>>   arch/s390/kvm/kvm-s390.c            |  27 +-
>>   arch/s390/kvm/kvm-s390.h            |   7 +-
>>   arch/x86/kvm/mmu/mmu.c              |   4 +-
>>   include/linux/kvm_host.h            | 100 ++---
>>   virt/kvm/kvm_main.c                 | 580 ++++++++++++++--------------
>>   10 files changed, 379 insertions(+), 372 deletions(-)
> 
> I got through the easy ones, I'll circle back to this one a different day when
> my brain is fresh :-)
> 

Sure, thanks for doing the review.

Maciej

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/8] KVM: Resolve memslot ID via a hash table instead of via a static array
  2021-05-21  7:05     ` Maciej S. Szmigiero
@ 2021-05-22 11:11       ` Maciej S. Szmigiero
  0 siblings, 0 replies; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-05-22 11:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On 21.05.2021 09:05, Maciej S. Szmigiero wrote:
> On 20.05.2021 00:31, Sean Christopherson wrote:
>> On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
(..)
>>>           new_size = old_size;
>>>       slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
>>> -    if (likely(slots))
>>> -        memcpy(slots, old, old_size);
>>> +    if (unlikely(!slots))
>>> +        return NULL;
>>> +
>>> +    memcpy(slots, old, old_size);
>>> +
>>> +    hash_init(slots->id_hash);
>>> +    kvm_for_each_memslot(memslot, slots)
>>> +        hash_add(slots->id_hash, &memslot->id_node, memslot->id);
>>
>> What's the perf penalty if the number of memslots gets large?  I ask because the
>> lazy rmap allocation is adding multiple calls to kvm_dup_memslots().
> 
> I would expect the "move inactive" benchmark to be closest to measuring
> the performance of just a memslot array copy operation but the results
> suggest that the performance stays within a ~10% window from 10 to 509
> memslots on the old code (it then climbs 13x for the 32k case).
> 
> That suggests that something else is dominating this benchmark for these
> memslot counts (probably zapping of shadow pages).
> 
> At the same time, the tree-based memslots implementation is clearly
> faster in this benchmark, even for smaller memslot counts, so apparently
> copying of the memslot array has some performance impact, too.
> 
> Measuring just kvm_dup_memslots() performance would probably be done
> best by benchmarking the KVM_MR_FLAGS_ONLY operation - I will try to add this
> operation to my set of benchmarks and see how it performs with different
> memslot counts.

Update:
I've implemented a simple KVM_MR_FLAGS_ONLY benchmark that repeatedly
sets and unsets the KVM_MEM_LOG_DIRTY_PAGES flag on a memslot with a single
page of memory in it. [1]
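
For reference, stripped down to the raw ioctls the toggling loop is essentially
the following (a sketch; the actual benchmark in [1] uses the KVM selftest
framework helpers, and TEST_SLOT, TEST_GPA, backing, vm_fd and iterations are
placeholders):

	struct kvm_userspace_memory_region region = {
		.slot = TEST_SLOT,
		.guest_phys_addr = TEST_GPA,
		.memory_size = 4096,	/* a single page */
		.userspace_addr = (__u64)(unsigned long)backing,
	};

	for (i = 0; i < iterations; i++) {
		region.flags = KVM_MEM_LOG_DIRTY_PAGES;
		ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

		region.flags = 0;
		ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
	}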

Since with the current code and higher memslot counts the "set flags"
operation spends significant time in kvm_mmu_calculate_default_mmu_pages(),
a second set of measurements was done with patch [2] applied.

In this case, the top functions in the perf trace are "memcpy" and
"clear_page" (called from kvm_set_memslot(), most likely from inlined
kvm_dup_memslots()).

For reference, a set of measurements with the whole patch series
(patches 1 - 8) applied was also done, as "new code".
In this case, SRCU-related functions dominate the perf trace.

32k memslots:
Current code:             0.00130s
Current code + patch [2]: 0.00104s (13x 4k result)
New code:                 0.0000144s

4k memslots:
Current code:             0.0000899s
Current code + patch [2]: 0.0000799s (+78% 2k result)
New code:                 0.0000144s

2k memslots:
Current code:             0.0000495s
Current code + patch [2]: 0.0000447s (+54% 509 result)
New code:                 0.0000143s

509 memslots:
Current code:             0.0000305s
Current code + patch [2]: 0.0000290s (+5% 100 result)
New code:                 0.0000141s

100 memslots:
Current code:             0.0000280s
Current code + patch [2]: 0.0000275s (same as for 10 slots)
New code:                 0.0000142s

10 memslots:
Current code:             0.0000272s
Current code + patch [2]: 0.0000272s
New code:                 0.0000141s

Thanks,
Maciej

[1]: The patch against memslot_perf_test.c is available here:
https://github.com/maciejsszmigiero/linux/commit/841e94898a55ff79af9d20a08205aa80808bd2a8

[2]: "[PATCH v3 1/8] KVM: x86: Cache total page count to avoid traversing the memslot array"

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 6/8] KVM: Keep memslots in tree-based structures instead of array-based ones
  2021-05-16 21:44 ` [PATCH v3 6/8] KVM: Keep memslots in tree-based structures instead of array-based ones Maciej S. Szmigiero
  2021-05-19 23:10   ` Sean Christopherson
@ 2021-05-25 23:21   ` Sean Christopherson
  2021-06-01 20:24     ` Maciej S. Szmigiero
  1 sibling, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2021-05-25 23:21 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 33290 bytes --]

Overall, I like it!  Didn't see any functional issues, though that comes with a
disclaimer that functionality was a secondary focus for this pass.

I have lots of comments, but they're (almost?) all mechanical.

The most impactful feedback is to store the actual node index in the memslots
instead of storing whether or not an instance is node0.  This has a cascading
effect that allows for substantial cleanup, specifically that it obviates the
motivation for caching the active vs. inactive indices in local variables.  That
in turn reduces naming collisions, which allows using more generic (but easily
parsed/read) names.

I also tweaked/added quite a few comments, mostly as documentation of my own
(mis)understanding of the code.  Patch attached (hopefully), with another
disclaimer that it's compile tested only, and only on x86.

On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
>  arch/arm64/kvm/mmu.c                |   8 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c |   4 +-
>  arch/powerpc/kvm/book3s_hv.c        |   3 +-
>  arch/powerpc/kvm/book3s_hv_nested.c |   4 +-
>  arch/powerpc/kvm/book3s_hv_uvmem.c  |  14 +-
>  arch/s390/kvm/kvm-s390.c            |  27 +-
>  arch/s390/kvm/kvm-s390.h            |   7 +-
>  arch/x86/kvm/mmu/mmu.c              |   4 +-
>  include/linux/kvm_host.h            | 100 ++---
>  virt/kvm/kvm_main.c                 | 580 ++++++++++++++--------------
>  10 files changed, 379 insertions(+), 372 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c5d1f3c87dbd..2b4ced4f1e55 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -199,13 +199,13 @@ static void stage2_flush_vm(struct kvm *kvm)
>  {
>  	struct kvm_memslots *slots;
>  	struct kvm_memory_slot *memslot;
> -	int idx;
> +	int idx, ctr;

Let's use 'bkt' instead of 'ctr', purely because that's what the interval tree uses.  KVM
itself shouldn't care since it shouldn't be poking into those details anyways.

>  
>  	idx = srcu_read_lock(&kvm->srcu);
>  	spin_lock(&kvm->mmu_lock);
>  
>  	slots = kvm_memslots(kvm);
> -	kvm_for_each_memslot(memslot, slots)
> +	kvm_for_each_memslot(memslot, ctr, slots)
>  		stage2_flush_memslot(kvm, memslot);
>  
>  	spin_unlock(&kvm->mmu_lock);

...

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f59847b6e9b3..a9c5b0df2311 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -29,6 +29,7 @@
>  #include <linux/nospec.h>
>  #include <linux/interval_tree.h>
>  #include <linux/hashtable.h>
> +#include <linux/rbtree.h>
>  #include <asm/signal.h>
>  
>  #include <linux/kvm.h>
> @@ -358,8 +359,9 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>  #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
>  
>  struct kvm_memory_slot {
> -	struct hlist_node id_node;
> -	struct interval_tree_node hva_node;
> +	struct hlist_node id_node[2];
> +	struct interval_tree_node hva_node[2];
> +	struct rb_node gfn_node[2];

This block needs a comment explaining the dual (duelling?) tree system.

>  	gfn_t base_gfn;
>  	unsigned long npages;
>  	unsigned long *dirty_bitmap;
> @@ -454,19 +456,14 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  }
>  #endif
>  
> -/*
> - * Note:
> - * memslots are not sorted by id anymore, please use id_to_memslot()
> - * to get the memslot by its id.
> - */
>  struct kvm_memslots {
>  	u64 generation;
> +	atomic_long_t lru_slot;
>  	struct rb_root_cached hva_tree;
> -	/* The mapping table from slot id to the index in memslots[]. */
> +	struct rb_root gfn_tree;
> +	/* The mapping table from slot id to memslot. */
>  	DECLARE_HASHTABLE(id_hash, 7);
> -	atomic_t lru_slot;
> -	int used_slots;
> -	struct kvm_memory_slot memslots[];
> +	bool is_idx_0;

This is where storing an int helps.  I was thinking 'int node_idx'.
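
I.e. something along these lines (a sketch; the comment wording is mine):

struct kvm_memslots {
	u64 generation;
	atomic_long_t lru_slot;
	struct rb_root_cached hva_tree;
	struct rb_root gfn_tree;
	/* The mapping table from slot id to memslot. */
	DECLARE_HASHTABLE(id_hash, 7);
	/* Which entry of the per-memslot node arrays belongs to this set. */
	int node_idx;
};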

>  };
>  
>  struct kvm {
> @@ -478,6 +475,7 @@ struct kvm {
>  
>  	struct mutex slots_lock;
>  	struct mm_struct *mm; /* userspace tied to this vm */
> +	struct kvm_memslots memslots_all[KVM_ADDRESS_SPACE_NUM][2];

I think it makes sense to call this '__memslots', to try to convey that it is
backing for a front end.  'memslots_all' could be misinterpreted as "memslots
for all address spaces".  A comment is probably warranted, too.

>  	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
>  	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>  
> @@ -617,12 +615,6 @@ static inline int kvm_vcpu_get_idx(struct kvm_vcpu *vcpu)
>  	return vcpu->vcpu_idx;
>  }
>  
> -#define kvm_for_each_memslot(memslot, slots)				\
> -	for (memslot = &slots->memslots[0];				\
> -	     memslot < slots->memslots + slots->used_slots; memslot++)	\
> -		if (WARN_ON_ONCE(!memslot->npages)) {			\
> -		} else
> -
>  void kvm_vcpu_destroy(struct kvm_vcpu *vcpu);
>  
>  void vcpu_load(struct kvm_vcpu *vcpu);
> @@ -682,6 +674,22 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
>  	return __kvm_memslots(vcpu->kvm, as_id);
>  }
>  
> +static inline bool kvm_memslots_empty(struct kvm_memslots *slots)
> +{
> +	return RB_EMPTY_ROOT(&slots->gfn_tree);
> +}
> +
> +static inline int kvm_memslots_idx(struct kvm_memslots *slots)
> +{
> +	return slots->is_idx_0 ? 0 : 1;
> +}

This helper can go away.

> +
> +#define kvm_for_each_memslot(memslot, ctr, slots)	\

Use 'bkt' again.

> +	hash_for_each(slots->id_hash, ctr, memslot,	\
> +		      id_node[kvm_memslots_idx(slots)]) \

With 'node_idx, this can squeak into a single line:

	hash_for_each(slots->id_hash, bkt, memslot, id_node[slots->node_idx]) \

> +		if (WARN_ON_ONCE(!memslot->npages)) {	\
> +		} else
> +
>  #define kvm_for_each_hva_range_memslot(node, slots, start, last)	     \
>  	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
>  	     node;							     \
> @@ -690,9 +698,10 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
>  static inline
>  struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
>  {
> +	int idxactive = kvm_memslots_idx(slots);

Use 'idx'.  Partly for readability, partly because this function doesn't (and
shouldn't) care whether or not @slots is the active set.

>  	struct kvm_memory_slot *slot;
>  
> -	hash_for_each_possible(slots->id_hash, slot, id_node, id) {
> +	hash_for_each_possible(slots->id_hash, slot, id_node[idxactive], id) {
>  		if (slot->id == id)
>  			return slot;
>  	}
> @@ -1102,42 +1111,39 @@ bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
>   * With "approx" set returns the memslot also when the address falls
>   * in a hole. In that case one of the memslots bordering the hole is
>   * returned.
> - *
> - * IMPORTANT: Slots are sorted from highest GFN to lowest GFN!
>   */
>  static inline struct kvm_memory_slot *
>  search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)
>  {
> -	int start = 0, end = slots->used_slots;
> -	int slot = atomic_read(&slots->lru_slot);
> -	struct kvm_memory_slot *memslots = slots->memslots;
> -
> -	if (unlikely(!slots->used_slots))
> -		return NULL;
> -
> -	if (gfn >= memslots[slot].base_gfn &&
> -	    gfn < memslots[slot].base_gfn + memslots[slot].npages)
> -		return &memslots[slot];
> -
> -	while (start < end) {
> -		slot = start + (end - start) / 2;
> -
> -		if (gfn >= memslots[slot].base_gfn)
> -			end = slot;
> -		else
> -			start = slot + 1;
> +	int idxactive = kvm_memslots_idx(slots);

Same as above, s/idxactive/idx.

> +	struct kvm_memory_slot *slot;
> +	struct rb_node *prevnode, *node;
> +
> +	slot = (struct kvm_memory_slot *)atomic_long_read(&slots->lru_slot);
> +	if (slot &&
> +	    gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
> +		return slot;
> +
> +	for (prevnode = NULL, node = slots->gfn_tree.rb_node; node; ) {
> +		prevnode = node;
> +		slot = container_of(node, struct kvm_memory_slot,
> +				    gfn_node[idxactive]);

With 'idx', this can go on a single line.  It runs over by two chars, but the 80
char limit is a soft limit, and IMO avoiding line breaks for things like this
improves readability.

> +		if (gfn >= slot->base_gfn) {
> +			if (gfn < slot->base_gfn + slot->npages) {
> +				atomic_long_set(&slots->lru_slot,
> +						(unsigned long)slot);
> +				return slot;
> +			}
> +			node = node->rb_right;
> +		} else
> +			node = node->rb_left;
>  	}
>  
> -	if (approx && start >= slots->used_slots)
> -		return &memslots[slots->used_slots - 1];
> +	if (approx && prevnode)
> +		return container_of(prevnode, struct kvm_memory_slot,
> +				    gfn_node[idxactive]);

And arguably the same here, though the overrun is a wee bit worse.

>  
> -	if (start < slots->used_slots && gfn >= memslots[start].base_gfn &&
> -	    gfn < memslots[start].base_gfn + memslots[start].npages) {
> -		atomic_set(&slots->lru_slot, start);
> -		return &memslots[start];
> -	}
> -
> -	return approx ? &memslots[start] : NULL;
> +	return NULL;
>  }
>  
>  static inline struct kvm_memory_slot *
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index a55309432c9a..189504b27ca6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -510,15 +510,17 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  	}
>  
>  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +		int idxactive;

This variable can be avoided entirely be using slots->node_idx directly (my
favored 'idx' is stolen for SRCU).

>  		struct interval_tree_node *node;
>  
>  		slots = __kvm_memslots(kvm, i);
> +		idxactive = kvm_memslots_idx(slots);
>  		kvm_for_each_hva_range_memslot(node, slots,
>  					       range->start, range->end - 1) {
>  			unsigned long hva_start, hva_end;
>  
>  			slot = container_of(node, struct kvm_memory_slot,
> -					    hva_node);
> +					    hva_node[idxactive]);
>  			hva_start = max(range->start, slot->userspace_addr);
>  			hva_end = min(range->end, slot->userspace_addr +
>  						  (slot->npages << PAGE_SHIFT));
> @@ -785,18 +787,12 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>  
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>  
> -static struct kvm_memslots *kvm_alloc_memslots(void)
> +static void kvm_init_memslots(struct kvm_memslots *slots)
>  {
> -	struct kvm_memslots *slots;
> -
> -	slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT);
> -	if (!slots)
> -		return NULL;
> -
> +	atomic_long_set(&slots->lru_slot, (unsigned long)NULL);
>  	slots->hva_tree = RB_ROOT_CACHED;
> +	slots->gfn_tree = RB_ROOT;
>  	hash_init(slots->id_hash);
> -
> -	return slots;

With 'node_idx' in the slots, it's easier to open code this in the loop and
drop kvm_init_memslots().

>  }
>  
>  static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
> @@ -808,27 +804,31 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
>  	memslot->dirty_bitmap = NULL;
>  }
>  
> +/* This does not remove the slot from struct kvm_memslots data structures */
>  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
>  	kvm_destroy_dirty_bitmap(slot);
>  
>  	kvm_arch_free_memslot(kvm, slot);
>  
> -	slot->flags = 0;
> -	slot->npages = 0;
> +	kfree(slot);
>  }
>  
>  static void kvm_free_memslots(struct kvm *kvm, struct kvm_memslots *slots)
>  {
> +	int ctr;
> +	struct hlist_node *idnode;
>  	struct kvm_memory_slot *memslot;
>  
> -	if (!slots)
> +	/*
> +	 * Both active and inactive struct kvm_memslots should point to
> +	 * the same set of memslots, so it's enough to free them once
> +	 */

Thumbs up for comments!  It would be very helpful to state that which index is
used is completely arbitrary.

> +	if (slots->is_idx_0)
>  		return;
>  
> -	kvm_for_each_memslot(memslot, slots)
> +	hash_for_each_safe(slots->id_hash, ctr, idnode, memslot, id_node[1])
>  		kvm_free_memslot(kvm, memslot);
> -
> -	kvfree(slots);
>  }
>  
>  static void kvm_destroy_vm_debugfs(struct kvm *kvm)
> @@ -924,13 +924,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  
>  	refcount_set(&kvm->users_count, 1);
>  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> -		struct kvm_memslots *slots = kvm_alloc_memslots();
> +		kvm_init_memslots(&kvm->memslots_all[i][0]);
> +		kvm_init_memslots(&kvm->memslots_all[i][1]);
> +		kvm->memslots_all[i][0].is_idx_0 = true;
> +		kvm->memslots_all[i][1].is_idx_0 = false;
>  
> -		if (!slots)
> -			goto out_err_no_arch_destroy_vm;
>  		/* Generations must be different for each address space. */
> -		slots->generation = i;
> -		rcu_assign_pointer(kvm->memslots[i], slots);
> +		kvm->memslots_all[i][0].generation = i;
> +		rcu_assign_pointer(kvm->memslots[i], &kvm->memslots_all[i][0]);

Open coding this with node_idx looks like so:

		for (j = 0; j < 2; j++) {
			slots = &kvm->__memslots[i][j];

			atomic_long_set(&slots->lru_slot, (unsigned long)NULL);
			slots->hva_tree = RB_ROOT_CACHED;
			slots->gfn_tree = RB_ROOT;
			hash_init(slots->id_hash);
			slots->node_idx = j;

			/* Generations must be different for each address space. */
			slots->generation = i;
		}

		rcu_assign_pointer(kvm->memslots[i], &kvm->__memslots[i][0]);

>  	}
>  
>  	for (i = 0; i < KVM_NR_BUSES; i++) {
> @@ -983,8 +984,6 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  	WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
>  	for (i = 0; i < KVM_NR_BUSES; i++)
>  		kfree(kvm_get_bus(kvm, i));
> -	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> -		kvm_free_memslots(kvm, __kvm_memslots(kvm, i));
>  	cleanup_srcu_struct(&kvm->irq_srcu);
>  out_err_no_irq_srcu:
>  	cleanup_srcu_struct(&kvm->srcu);
> @@ -1038,8 +1037,10 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  #endif
>  	kvm_arch_destroy_vm(kvm);
>  	kvm_destroy_devices(kvm);
> -	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> -		kvm_free_memslots(kvm, __kvm_memslots(kvm, i));
> +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +		kvm_free_memslots(kvm, &kvm->memslots_all[i][0]);
> +		kvm_free_memslots(kvm, &kvm->memslots_all[i][1]);
> +	}
>  	cleanup_srcu_struct(&kvm->irq_srcu);
>  	cleanup_srcu_struct(&kvm->srcu);
>  	kvm_arch_free_vm(kvm);
> @@ -1099,212 +1100,6 @@ static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot)
>  	return 0;
>  }
>  

...

> @@ -1319,10 +1114,12 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
>  	return 0;
>  }
>  
> -static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
> -		int as_id, struct kvm_memslots *slots)
> +static void swap_memslots(struct kvm *kvm, int as_id)
>  {
>  	struct kvm_memslots *old_memslots = __kvm_memslots(kvm, as_id);
> +	int idxactive = kvm_memslots_idx(old_memslots);
> +	int idxina = idxactive == 0 ? 1 : 0;
> +	struct kvm_memslots *slots = &kvm->memslots_all[as_id][idxina];
>  	u64 gen = old_memslots->generation;
>  
>  	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
> @@ -1351,44 +1148,129 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
>  	kvm_arch_memslots_updated(kvm, gen);
>  
>  	slots->generation = gen;
> +}
> +
> +static void kvm_memslot_gfn_insert(struct rb_root *gfn_tree,
> +				  struct kvm_memory_slot *slot,
> +				  int which)

Pass slots instead of the tree, that way the index doesn't need to be passed
separately.  And similar to previous feedback, s/which/idx.

> +{
> +	struct rb_node **cur, *parent;
> +
> +	for (cur = &gfn_tree->rb_node, parent = NULL; *cur; ) {

I think it makes sense to initialize 'parent' outside of the loop, both to make the
loop control flow easier to read, and to make it more obvious that parent _must_
be initialized in the empty case.

> +		struct kvm_memory_slot *cslot;

I'd prefer s/cur/node and s/cslot/tmp.  'cslot' in particular is hard to parse.

> +		cslot = container_of(*cur, typeof(*cslot), gfn_node[which]);
> +		parent = *cur;
> +		if (slot->base_gfn < cslot->base_gfn)
> +			cur = &(*cur)->rb_left;
> +		else if (slot->base_gfn > cslot->base_gfn)
> +			cur = &(*cur)->rb_right;
> +		else
> +			BUG();
> +	}
>  
> -	return old_memslots;
> +	rb_link_node(&slot->gfn_node[which], parent, cur);
> +	rb_insert_color(&slot->gfn_node[which], gfn_tree);
>  }
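
Putting those together (pass @slots, hoist 'parent', rename the locals, and
assuming the 'node_idx' field suggested above), a rough sketch:

static void kvm_memslot_gfn_insert(struct kvm_memslots *slots,
				   struct kvm_memory_slot *slot)
{
	struct rb_node **node = &slots->gfn_tree.rb_node;
	struct rb_node *parent = NULL;
	int idx = slots->node_idx;
	struct kvm_memory_slot *tmp;

	while (*node) {
		tmp = container_of(*node, struct kvm_memory_slot, gfn_node[idx]);
		parent = *node;
		if (slot->base_gfn < tmp->base_gfn)
			node = &(*node)->rb_left;
		else if (slot->base_gfn > tmp->base_gfn)
			node = &(*node)->rb_right;
		else
			BUG();
	}

	rb_link_node(&slot->gfn_node[idx], parent, node);
	rb_insert_color(&slot->gfn_node[idx], &slots->gfn_tree);
}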
>  
>  /*
> - * Note, at a minimum, the current number of used slots must be allocated, even
> - * when deleting a memslot, as we need a complete duplicate of the memslots for
> - * use when invalidating a memslot prior to deleting/moving the memslot.
> + * Just copies the memslot data.
> + * Does not copy or touch the embedded nodes, including the ranges at hva_nodes.
>   */
> -static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
> -					     enum kvm_mr_change change)
> +static void kvm_copy_memslot(struct kvm_memory_slot *dest,
> +			     struct kvm_memory_slot *src)
>  {
> -	struct kvm_memslots *slots;
> -	size_t old_size, new_size;
> -	struct kvm_memory_slot *memslot;
> +	dest->base_gfn = src->base_gfn;
> +	dest->npages = src->npages;
> +	dest->dirty_bitmap = src->dirty_bitmap;
> +	dest->arch = src->arch;
> +	dest->userspace_addr = src->userspace_addr;
> +	dest->flags = src->flags;
> +	dest->id = src->id;
> +	dest->as_id = src->as_id;
> +}
>  
> -	old_size = sizeof(struct kvm_memslots) +
> -		   (sizeof(struct kvm_memory_slot) * old->used_slots);
> +/*
> + * Initializes the ranges at both hva_nodes from the memslot userspace_addr
> + * and npages fields.
> + */
> +static void kvm_init_memslot_hva_ranges(struct kvm_memory_slot *slot)
> +{
> +	slot->hva_node[0].start = slot->hva_node[1].start =
> +		slot->userspace_addr;
> +	slot->hva_node[0].last = slot->hva_node[1].last =
> +		slot->userspace_addr + (slot->npages << PAGE_SHIFT) - 1;

Fold this into kvm_copy_memslot().  It's always called immediately after, and
technically the node range does come from the src, e.g. calling this without
first calling kvm_copy_memslot() doesn't make sense.

> +}
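
Folded together, kvm_copy_memslot() would look roughly like this (a sketch):

static void kvm_copy_memslot(struct kvm_memory_slot *dest,
			     struct kvm_memory_slot *src)
{
	dest->base_gfn = src->base_gfn;
	dest->npages = src->npages;
	dest->dirty_bitmap = src->dirty_bitmap;
	dest->arch = src->arch;
	dest->userspace_addr = src->userspace_addr;
	dest->flags = src->flags;
	dest->id = src->id;
	dest->as_id = src->as_id;

	/* The hva ranges are fully determined by the fields copied above. */
	dest->hva_node[0].start = dest->hva_node[1].start = dest->userspace_addr;
	dest->hva_node[0].last = dest->hva_node[1].last =
		dest->userspace_addr + (dest->npages << PAGE_SHIFT) - 1;
}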
>  
> -	if (change == KVM_MR_CREATE)
> -		new_size = old_size + sizeof(struct kvm_memory_slot);
> -	else
> -		new_size = old_size;
> +/*
> + * Replaces the @oldslot with @nslot in the memslot set indicated by
> + * @slots_idx.
> + *
> + * With NULL @oldslot this simply adds the @nslot to the set.
> + * With NULL @nslot this simply removes the @oldslot from the set.
> + *
> + * If @nslot is non-NULL its hva_node[slots_idx] range has to be set
> + * appropriately.
> + */
> +static void kvm_replace_memslot(struct kvm *kvm,
> +				int as_id, int slots_idx,

Pass slots itself, then all three of these go away.

> +				struct kvm_memory_slot *oldslot,
> +				struct kvm_memory_slot *nslot)

s/oldslot/old and s/nslot/new, again to make it easier to identify which is what,
and for consistency.

> +{
> +	struct kvm_memslots *slots = &kvm->memslots_all[as_id][slots_idx];

s/slots_idx/idx for consistency.

>  
> -	slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
> -	if (unlikely(!slots))
> -		return NULL;
> +	if (WARN_ON(!oldslot && !nslot))

This should be moved to kvm_set_memslot() in the form of:

	if (change != KVM_MR_CREATE) {
		slot = id_to_memslot(active, old->id);
		if (WARN_ON_ONCE(!slot))
			return -EIO;
	}

Adding a WARN that the caller doesn't pass "NULL, NULL" is unnecessary, and
putting the WARN in this helper obfuscates the one case that warrants a guard.

> +		return;
> +
> +	if (oldslot) {
> +		hash_del(&oldslot->id_node[slots_idx]);
> +		interval_tree_remove(&oldslot->hva_node[slots_idx],
> +				     &slots->hva_tree);

Unnecessary newline.

> +		atomic_long_cmpxchg(&slots->lru_slot,
> +				    (unsigned long)oldslot,
> +				    (unsigned long)nslot);

Can be:

		atomic_long_cmpxchg(&slots->lru_slot,
				    (unsigned long)old, (unsigned long)new);


> +		if (!nslot) {
> +			rb_erase(&oldslot->gfn_node[slots_idx],
> +				 &slots->gfn_tree);

Unnecessary newline.

> +			return;
> +		}
> +	}
>  
> -	memcpy(slots, old, old_size);
> +	hash_add(slots->id_hash, &nslot->id_node[slots_idx],
> +		 nslot->id);
> +	WARN_ON(PAGE_SHIFT > 0 &&

Are there actually KVM architectures for which PAGE_SHIFT==0?

> +		nslot->hva_node[slots_idx].start >=
> +		nslot->hva_node[slots_idx].last);
> +	interval_tree_insert(&nslot->hva_node[slots_idx],
> +			     &slots->hva_tree);
>  
> -	slots->hva_tree = RB_ROOT_CACHED;
> -	hash_init(slots->id_hash);
> -	kvm_for_each_memslot(memslot, slots) {
> -		interval_tree_insert(&memslot->hva_node, &slots->hva_tree);
> -		hash_add(slots->id_hash, &memslot->id_node, memslot->id);
> +	/* Shame there is no O(1) interval_tree_replace()... */
> +	if (oldslot && oldslot->base_gfn == nslot->base_gfn)
> +		rb_replace_node(&oldslot->gfn_node[slots_idx],
> +				&nslot->gfn_node[slots_idx],
> +				&slots->gfn_tree);

Add wrappers for all the rb-tree mutators.  Partly for consistency, mostly for
readability.  Having the node index in the memslots helps in this case.  E.g.

	/* Shame there is no O(1) interval_tree_replace()... */
	if (old && old->base_gfn == new->base_gfn) {
		kvm_memslot_gfn_replace(slots, old, new);
	} else {
		if (old)
			kvm_memslot_gfn_erase(slots, old);
		kvm_memslot_gfn_insert(slots, new);
	}
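
The erase/replace wrappers used above could then be as small as (a sketch,
again assuming the 'node_idx' field):

static void kvm_memslot_gfn_erase(struct kvm_memslots *slots,
				  struct kvm_memory_slot *slot)
{
	rb_erase(&slot->gfn_node[slots->node_idx], &slots->gfn_tree);
}

static void kvm_memslot_gfn_replace(struct kvm_memslots *slots,
				    struct kvm_memory_slot *old,
				    struct kvm_memory_slot *new)
{
	int idx = slots->node_idx;

	rb_replace_node(&old->gfn_node[idx], &new->gfn_node[idx],
			&slots->gfn_tree);
}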

> +	else {
> +		if (oldslot)
> +			rb_erase(&oldslot->gfn_node[slots_idx],
> +				 &slots->gfn_tree);
> +		kvm_memslot_gfn_insert(&slots->gfn_tree,
> +				       nslot, slots_idx);

Unnecessary newlines.

>  	}
> +}
> +
> +/*
> + * Copies the @oldslot data into @nslot and uses this slot to replace
> + * @oldslot in the memslot set indicated by @slots_idx.
> + */
> +static void kvm_copy_replace_memslot(struct kvm *kvm,

I fiddled with this one, and I think it's best to drop this helper in favor of
open coding the calls to kvm_copy_memslot() and kvm_replace_memslot().  More on
this below.

> +				     int as_id, int slots_idx,
> +				     struct kvm_memory_slot *oldslot,
> +				     struct kvm_memory_slot *nslot)
> +{
> +	kvm_copy_memslot(nslot, oldslot);
> +	kvm_init_memslot_hva_ranges(nslot);
>  
> -	return slots;
> +	kvm_replace_memslot(kvm, as_id, slots_idx, oldslot, nslot);
>  }
>  
>  static int kvm_set_memslot(struct kvm *kvm,
> @@ -1397,56 +1279,178 @@ static int kvm_set_memslot(struct kvm *kvm,
>  			   struct kvm_memory_slot *new, int as_id,
>  			   enum kvm_mr_change change)
>  {
> -	struct kvm_memory_slot *slot;
> -	struct kvm_memslots *slots;
> +	struct kvm_memslots *slotsact = __kvm_memslots(kvm, as_id);
> +	int idxact = kvm_memslots_idx(slotsact);
> +	int idxina = idxact == 0 ? 1 : 0;

I strongly prefer using "active" and "inactive" for the memslots, dropping the
local indices completely, and using "slot" and "tmp" for the slot.

"slot" and "tmp" aren't great, but slotsact vs. slotact is really, really hard
to read.  And more importantly, slotact vs. slotina is misleading because they
don't always point at slots in the active set and inactive set respectively.  That
detail took me a while to fully grok.

> +	struct kvm_memslots *slotsina = &kvm->memslots_all[as_id][idxina];

To avoid local index variables, this can be the somewhat clever/gross:

	struct kvm_memslots *inactive = &kvm->__memslots[as_id][active->node_idx ^ 1];

> +	struct kvm_memory_slot *slotina, *slotact;
>  	int r;
>  
> -	slots = kvm_dup_memslots(__kvm_memslots(kvm, as_id), change);
> -	if (!slots)
> +	slotina = kzalloc(sizeof(*slotina), GFP_KERNEL_ACCOUNT);
> +	if (!slotina)
>  		return -ENOMEM;
>  
> +	if (change != KVM_MR_CREATE)
> +		slotact = id_to_memslot(slotsact, old->id);

And the WARN, as mentioned above.
> +
>  	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
>  		/*
> -		 * Note, the INVALID flag needs to be in the appropriate entry
> -		 * in the freshly allocated memslots, not in @old or @new.
> +		 * Replace the slot to be deleted or moved in the inactive
> +		 * memslot set by its copy with KVM_MEMSLOT_INVALID flag set.

This is where the inactive slot vs. inactive memslot terminology gets really
confusing.  "inactive memslots" refers precisely to the set that is not kvm->memslots,
whereas "inactive slot" might refer to a slot that is in the inactive set, but
it might also refer to a slot that is currently not in any tree at all.  E.g.

		/*
		 * Mark the current slot INVALID.  This must be done on the tmp
		 * slot to avoid modifying the current slot in the active tree.
		 */

>  		 */
> -		slot = id_to_memslot(slots, old->id);
> -		slot->flags |= KVM_MEMSLOT_INVALID;
> +		kvm_copy_replace_memslot(kvm, as_id, idxina, slotact, slotina);
> +		slotina->flags |= KVM_MEMSLOT_INVALID;

This is where I'd prefer to open code the copy and replace.  Functionally it works,
but setting the INVALID flag _after_ replacing the slot is not intuitive.  E.g.
this sequencing helps the reader see that the flag is set on a copy, not on the live slot:

		kvm_copy_memslot(tmp, slot);
		tmp->flags |= KVM_MEMSLOT_INVALID;
		kvm_replace_memslot(inactive, slot, tmp);


>  		/*
> -		 * We can re-use the old memslots, the only difference from the
> -		 * newly installed memslots is the invalid flag, which will get
> -		 * dropped by update_memslots anyway.  We'll also revert to the
> -		 * old memslots if preparing the new memory region fails.
> +		 * Swap the active <-> inactive memslot set.
> +		 * Now the active memslot set still contains the memslot to be
> +		 * deleted or moved, but with the KVM_MEMSLOT_INVALID flag set.
>  		 */
> -		slots = install_new_memslots(kvm, as_id, slots);
> +		swap_memslots(kvm, as_id);

kvm_swap_active_memslots() would be preferable; without context it's not clear
what's being swapped.

To dedup code, and hopefully improve readability, I think it makes sense to do
the swap() of the memslots in kvm_swap_active_memslots().  With the indices gone,
the number of swap() calls goes down dramatically, e.g.


		/*
		 * Activate the slot that is now marked INVALID, but don't
		 * propagate the slot to the now inactive slots.  The slot is
		 * either going to be deleted or recreated as a new slot.
		 */
		kvm_swap_active_memslots(kvm, as_id, &active, &inactive);

		/* The temporary and current slot have swapped roles. */
		swap(tmp, slot);

> +		swap(idxact, idxina);
> +		swap(slotsina, slotsact);
> +		swap(slotact, slotina);
>  
> -		/* From this point no new shadow pages pointing to a deleted,
> +		/*
> +		 * From this point no new shadow pages pointing to a deleted,
>  		 * or moved, memslot will be created.
>  		 *
>  		 * validation of sp->gfn happens in:
>  		 *	- gfn_to_hva (kvm_read_guest, gfn_to_pfn)
>  		 *	- kvm_is_visible_gfn (mmu_check_root)
>  		 */
> -		kvm_arch_flush_shadow_memslot(kvm, slot);
> +		kvm_arch_flush_shadow_memslot(kvm, slotact);
>  	}
>  
>  	r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
>  	if (r)
>  		goto out_slots;

Normally I like avoiding code churn, but in this case I think it makes sense to
avoid the goto.  Even after the if/else-if block below is trimmed down, the
error handling that is specific to the above DELETE|MOVE ends up landing too far
away from the code that it is reverting.  E.g.

	r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
	if (r) {
		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
			/*
			 * Revert the above INVALID change.  No modifications
			 * required since the original slot was preserved in
			 * the inactive slots.
			 */
			kvm_swap_active_memslots(kvm, as_id, &active, &inactive);
			swap(tmp, slot);
		}
		kfree(tmp);
		return r;
	}
  
> -	update_memslots(slots, new, change);
> -	slots = install_new_memslots(kvm, as_id, slots);
> +	if (change == KVM_MR_MOVE) {
> +		/*
> +		 * Since we are going to be changing the memslot gfn we need to

Try to use the imperative mood and avoid I/we/us/etc.  It requires creative
wording in some cases, but overall it does help avoid ambiguity (though "we" is
fairly obvious in this case).

> +		 * remove it from the gfn tree so it can be re-added there with
> +		 * the updated gfn.
> +		 */
> +		rb_erase(&slotina->gfn_node[idxina],
> +			 &slotsina->gfn_tree);

Unnecessary newline.

> +
> +		slotina->base_gfn = new->base_gfn;
> +		slotina->flags = new->flags;
> +		slotina->dirty_bitmap = new->dirty_bitmap;
> +		/* kvm_arch_prepare_memory_region() might have modified arch */
> +		slotina->arch = new->arch;
> +
> +		/* Re-add to the gfn tree with the updated gfn */
> +		kvm_memslot_gfn_insert(&slotsina->gfn_tree,
> +				       slotina, idxina);

Again, newline.  
> +
> +		/*
> +		 * Swap the active <-> inactive memslot set.
> +		 * Now the active memslot set contains the new, final memslot.
> +		 */
> +		swap_memslots(kvm, as_id);
> +		swap(idxact, idxina);
> +		swap(slotsina, slotsact);
> +		swap(slotact, slotina);
> +
> +		/*
> +		 * Replace the temporary KVM_MEMSLOT_INVALID slot with the
> +		 * new, final memslot in the inactive memslot set and
> +		 * free the temporary memslot.
> +		 */
> +		kvm_replace_memslot(kvm, as_id, idxina, slotina, slotact);
> +		kfree(slotina);

Adding a wrapper for the trifecta of swap + replace + kfree() cuts down on the
boilerplate tremendously, and sidesteps having to swap() "slot" and "tmp" for
these flows.   E.g. this can become:

		/*
		 * The memslot's gfn is changing, remove it from the inactive
		 * tree, it will be re-added with its updated gfn.  Because its
		 * range is changing, an in-place replace is not possible.
		 */
		kvm_memslot_gfn_erase(inactive, tmp);

		tmp->base_gfn = new->base_gfn;
		tmp->flags = new->flags;
		tmp->dirty_bitmap = new->dirty_bitmap;
		/* kvm_arch_prepare_memory_region() might have modified arch */
		tmp->arch = new->arch;

		/* Re-add to the gfn tree with the updated gfn */
		kvm_memslot_gfn_insert(inactive, tmp);

		/* Replace the current INVALID slot with the updated memslot. */
		kvm_activate_memslot(kvm, as_id, &active, &inactive, slot, tmp);

by adding:

static void kvm_activate_memslot(struct kvm *kvm, int as_id,
				 struct kvm_memslots **active,
				 struct kvm_memslots **inactive,
				 struct kvm_memory_slot *old,
				 struct kvm_memory_slot *new)
{
	/*
	 * Swap the active <-> inactive memslots.  Note, this also swaps
	 * the active and inactive pointers themselves.
	 */
	kvm_swap_active_memslots(kvm, as_id, active, inactive);

	/* Propagate the new memslot to the now inactive memslots. */
	kvm_replace_memslot(*inactive, old, new);

	/* And free the old slot. */
	if (old)
		kfree(old);
}

> +	} else if (change == KVM_MR_FLAGS_ONLY) {
> +		/*
> +		 * Almost like the move case above, but we don't use a temporary
> +		 * KVM_MEMSLOT_INVALID slot.

Let's use INVALID; it should be obvious to readers who make it this far :-)

> +		 * Instead, we simply replace the old memslot with a new, updated
> +		 * copy in both memslot sets.
> +		 *
> +		 * Since we aren't going to be changing the memslot gfn we can
> +		 * simply use kvm_copy_replace_memslot(), which will use
> +		 * rb_replace_node() to switch the memslot node in the gfn tree
> +		 * instead of removing the old one and inserting the new one
> +		 * as two separate operations.
> +		 * It's a performance win since node replacement is a single
> +		 * O(1) operation as opposed to two O(log(n)) operations for
> +		 * slot removal and then re-insertion.
> +		 */
> +		kvm_copy_replace_memslot(kvm, as_id, idxina, slotact, slotina);
> +		slotina->flags = new->flags;
> +		slotina->dirty_bitmap = new->dirty_bitmap;
> +		/* kvm_arch_prepare_memory_region() might have modified arch */
> +		slotina->arch = new->arch;
> +
> +		/* Swap the active <-> inactive memslot set. */
> +		swap_memslots(kvm, as_id);
> +		swap(idxact, idxina);
> +		swap(slotsina, slotsact);
> +		swap(slotact, slotina);
> +
> +		/*
> +		 * Replace the old memslot in the other memslot set and
> +		 * then finally free it.
> +		 */
> +		kvm_replace_memslot(kvm, as_id, idxina, slotina, slotact);
> +		kfree(slotina);
> +	} else if (change == KVM_MR_CREATE) {
> +		/*
> +		 * Add the new memslot to the current inactive set as a copy
> +		 * of the provided new memslot data.
> +		 */
> +		kvm_copy_memslot(slotina, new);
> +		kvm_init_memslot_hva_ranges(slotina);
> +
> +		kvm_replace_memslot(kvm, as_id, idxina, NULL, slotina);
> +
> +		/* Swap the active <-> inactive memslot set. */
> +		swap_memslots(kvm, as_id);
> +		swap(idxact, idxina);
> +		swap(slotsina, slotsact);
> +
> +		/* Now add it also to the other memslot set */
> +		kvm_replace_memslot(kvm, as_id, idxina, NULL, slotina);
> +	} else if (change == KVM_MR_DELETE) {
> +		/*
> +		 * Remove the old memslot from the current inactive set
> +		 * (the other, active set contains the temporary
> +		 * KVM_MEMSLOT_INVALID slot)
> +		 */
> +		kvm_replace_memslot(kvm, as_id, idxina, slotina, NULL);
> +
> +		/* Swap the active <-> inactive memslot set. */
> +		swap_memslots(kvm, as_id);
> +		swap(idxact, idxina);
> +		swap(slotsina, slotsact);
> +		swap(slotact, slotina);
> +
> +		/* Remove the temporary KVM_MEMSLOT_INVALID slot and free it. */
> +		kvm_replace_memslot(kvm, as_id, idxina, slotina, NULL);
> +		kfree(slotina);
> +		/* slotact will be freed by kvm_free_memslot() */

I think this comment can go away in favor of documenting the kvm_free_memslot()
call down below.  With all of the aforementioned changes:

		/*
		 * Remove the old memslot (in the inactive memslots) and activate
		 * the NULL slot.
		 */
		kvm_replace_memslot(inactive, tmp, NULL);
		kvm_activate_memslot(kvm, as_id, &active, &inactive, slot, NULL);

> +	} else
> +		BUG();
>  
>  	kvm_arch_commit_memory_region(kvm, mem, old, new, change);
>  
> -	kvfree(slots);
> +	if (change == KVM_MR_DELETE)
> +		kvm_free_memslot(kvm, slotact);
> +
>  	return 0;

[-- Attachment #2: 0001-tmp.patch --]
[-- Type: text/x-diff, Size: 23863 bytes --]

From 9909e51d83ba3e5abe5573946891b20e5fd50a22 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 25 May 2021 15:14:05 -0700
Subject: [PATCH] tmp

---
 include/linux/kvm_host.h |  31 ++-
 virt/kvm/kvm_main.c      | 427 ++++++++++++++++++---------------------
 2 files changed, 204 insertions(+), 254 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 84fd72d8bb23..85ee81318362 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -508,7 +508,7 @@ struct kvm_memslots {
 	struct rb_root gfn_tree;
 	/* The mapping table from slot id to memslot. */
 	DECLARE_HASHTABLE(id_hash, 7);
-	bool is_idx_0;
+	int node_idx;
 };
 
 struct kvm {
@@ -520,7 +520,7 @@ struct kvm {
 
 	struct mutex slots_lock;
 	struct mm_struct *mm; /* userspace tied to this vm */
-	struct kvm_memslots memslots_all[KVM_ADDRESS_SPACE_NUM][2];
+	struct kvm_memslots __memslots[KVM_ADDRESS_SPACE_NUM][2];
 	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 
@@ -724,15 +724,9 @@ static inline bool kvm_memslots_empty(struct kvm_memslots *slots)
 	return RB_EMPTY_ROOT(&slots->gfn_tree);
 }
 
-static inline int kvm_memslots_idx(struct kvm_memslots *slots)
-{
-	return slots->is_idx_0 ? 0 : 1;
-}
-
-#define kvm_for_each_memslot(memslot, ctr, slots)	\
-	hash_for_each(slots->id_hash, ctr, memslot,	\
-		      id_node[kvm_memslots_idx(slots)]) \
-		if (WARN_ON_ONCE(!memslot->npages)) {	\
+#define kvm_for_each_memslot(memslot, bkt, slots)			      \
+	hash_for_each(slots->id_hash, bkt, memslot, id_node[slots->node_idx]) \
+		if (WARN_ON_ONCE(!memslot->npages)) {			      \
 		} else
 
 #define kvm_for_each_hva_range_memslot(node, slots, start, last)	     \
@@ -743,10 +737,10 @@ static inline int kvm_memslots_idx(struct kvm_memslots *slots)
 static inline
 struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
 {
-	int idxactive = kvm_memslots_idx(slots);
 	struct kvm_memory_slot *slot;
+	int idx = slots->node_idx;
 
-	hash_for_each_possible(slots->id_hash, slot, id_node[idxactive], id) {
+	hash_for_each_possible(slots->id_hash, slot, id_node[idx], id) {
 		if (slot->id == id)
 			return slot;
 	}
@@ -1160,19 +1154,19 @@ bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
 static inline struct kvm_memory_slot *
 search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)
 {
-	int idxactive = kvm_memslots_idx(slots);
 	struct kvm_memory_slot *slot;
 	struct rb_node *prevnode, *node;
+	int idx = slots->node_idx;
 
 	slot = (struct kvm_memory_slot *)atomic_long_read(&slots->lru_slot);
 	if (slot &&
 	    gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
 		return slot;
 
-	for (prevnode = NULL, node = slots->gfn_tree.rb_node; node; ) {
+	prevnode = NULL;
+	for (node = slots->gfn_tree.rb_node; node; ) {
 		prevnode = node;
-		slot = container_of(node, struct kvm_memory_slot,
-				    gfn_node[idxactive]);
+		slot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
 		if (gfn >= slot->base_gfn) {
 			if (gfn < slot->base_gfn + slot->npages) {
 				atomic_long_set(&slots->lru_slot,
@@ -1185,8 +1179,7 @@ search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)
 	}
 
 	if (approx && prevnode)
-		return container_of(prevnode, struct kvm_memory_slot,
-				    gfn_node[idxactive]);
+		return container_of(prevnode, struct kvm_memory_slot, gfn_node[idx]);
 
 	return NULL;
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 189504b27ca6..0744b081b16b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -510,17 +510,15 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		int idxactive;
 		struct interval_tree_node *node;
 
 		slots = __kvm_memslots(kvm, i);
-		idxactive = kvm_memslots_idx(slots);
 		kvm_for_each_hva_range_memslot(node, slots,
 					       range->start, range->end - 1) {
 			unsigned long hva_start, hva_end;
 
 			slot = container_of(node, struct kvm_memory_slot,
-					    hva_node[idxactive]);
+					    hva_node[slots->node_idx]);
 			hva_start = max(range->start, slot->userspace_addr);
 			hva_end = min(range->end, slot->userspace_addr +
 						  (slot->npages << PAGE_SHIFT));
@@ -787,14 +785,6 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
-static void kvm_init_memslots(struct kvm_memslots *slots)
-{
-	atomic_long_set(&slots->lru_slot, (unsigned long)NULL);
-	slots->hva_tree = RB_ROOT_CACHED;
-	slots->gfn_tree = RB_ROOT;
-	hash_init(slots->id_hash);
-}
-
 static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 {
 	if (!memslot->dirty_bitmap)
@@ -816,18 +806,18 @@ static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 
 static void kvm_free_memslots(struct kvm *kvm, struct kvm_memslots *slots)
 {
-	int ctr;
 	struct hlist_node *idnode;
 	struct kvm_memory_slot *memslot;
+	int bkt;
 
 	/*
-	 * Both active and inactive struct kvm_memslots should point to
-	 * the same set of memslots, so it's enough to free them once
+	 * The same memslot objects live in both active and inactive sets;
+	 * arbitrarily free using index '0'.
 	 */
-	if (slots->is_idx_0)
+	if (slots->node_idx)
 		return;
 
-	hash_for_each_safe(slots->id_hash, ctr, idnode, memslot, id_node[1])
+	hash_for_each_safe(slots->id_hash, bkt, idnode, memslot, id_node[0])
 		kvm_free_memslot(kvm, memslot);
 }
 
@@ -900,8 +890,9 @@ void __weak kvm_arch_pre_destroy_vm(struct kvm *kvm)
 static struct kvm *kvm_create_vm(unsigned long type)
 {
 	struct kvm *kvm = kvm_arch_alloc_vm();
+	struct kvm_memslots *slots;
 	int r = -ENOMEM;
-	int i;
+	int i, j;
 
 	if (!kvm)
 		return ERR_PTR(-ENOMEM);
@@ -924,14 +915,20 @@ static struct kvm *kvm_create_vm(unsigned long type)
 
 	refcount_set(&kvm->users_count, 1);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		kvm_init_memslots(&kvm->memslots_all[i][0]);
-		kvm_init_memslots(&kvm->memslots_all[i][1]);
-		kvm->memslots_all[i][0].is_idx_0 = true;
-		kvm->memslots_all[i][1].is_idx_0 = false;
+		for (j = 0; j < 2; j++) {
+			slots = &kvm->__memslots[i][j];
 
-		/* Generations must be different for each address space. */
-		kvm->memslots_all[i][0].generation = i;
-		rcu_assign_pointer(kvm->memslots[i], &kvm->memslots_all[i][0]);
+			atomic_long_set(&slots->lru_slot, (unsigned long)NULL);
+			slots->hva_tree = RB_ROOT_CACHED;
+			slots->gfn_tree = RB_ROOT;
+			hash_init(slots->id_hash);
+			slots->node_idx = j;
+
+			/* Generations must be different for each address space. */
+			slots->generation = i;
+		}
+
+		rcu_assign_pointer(kvm->memslots[i], &kvm->__memslots[i][0]);
 	}
 
 	for (i = 0; i < KVM_NR_BUSES; i++) {
@@ -1038,8 +1035,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	kvm_arch_destroy_vm(kvm);
 	kvm_destroy_devices(kvm);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		kvm_free_memslots(kvm, &kvm->memslots_all[i][0]);
-		kvm_free_memslots(kvm, &kvm->memslots_all[i][1]);
+		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
+		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
 	}
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	cleanup_srcu_struct(&kvm->srcu);
@@ -1114,13 +1111,12 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
 	return 0;
 }
 
-static void swap_memslots(struct kvm *kvm, int as_id)
+static void kvm_swap_active_memslots(struct kvm *kvm, int as_id,
+				     struct kvm_memslots **active,
+				     struct kvm_memslots **inactive)
 {
-	struct kvm_memslots *old_memslots = __kvm_memslots(kvm, as_id);
-	int idxactive = kvm_memslots_idx(old_memslots);
-	int idxina = idxactive == 0 ? 1 : 0;
-	struct kvm_memslots *slots = &kvm->memslots_all[as_id][idxina];
-	u64 gen = old_memslots->generation;
+	struct kvm_memslots *slots = *inactive;
+	u64 gen = (*active)->generation;
 
 	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
 	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
@@ -1148,37 +1144,55 @@ static void swap_memslots(struct kvm *kvm, int as_id)
 	kvm_arch_memslots_updated(kvm, gen);
 
 	slots->generation = gen;
+
+	swap(*active, *inactive);
 }
 
-static void kvm_memslot_gfn_insert(struct rb_root *gfn_tree,
-				  struct kvm_memory_slot *slot,
-				  int which)
+static void kvm_memslot_gfn_insert(struct kvm_memslots *slots,
+				   struct kvm_memory_slot *slot)
 {
-	struct rb_node **cur, *parent;
+	struct rb_root *gfn_tree = &slots->gfn_tree;
+	struct rb_node **node, *parent;
+	int idx = slots->node_idx;
 
-	for (cur = &gfn_tree->rb_node, parent = NULL; *cur; ) {
-		struct kvm_memory_slot *cslot;
+	parent = NULL;
+	for (node = &gfn_tree->rb_node; *node; ) {
+		struct kvm_memory_slot *tmp;
 
-		cslot = container_of(*cur, typeof(*cslot), gfn_node[which]);
-		parent = *cur;
-		if (slot->base_gfn < cslot->base_gfn)
-			cur = &(*cur)->rb_left;
-		else if (slot->base_gfn > cslot->base_gfn)
-			cur = &(*cur)->rb_right;
+		tmp = container_of(*node, struct kvm_memory_slot, gfn_node[idx]);
+		parent = *node;
+		if (slot->base_gfn < tmp->base_gfn)
+			node = &(*node)->rb_left;
+		else if (slot->base_gfn > tmp->base_gfn)
+			node = &(*node)->rb_right;
 		else
 			BUG();
 	}
 
-	rb_link_node(&slot->gfn_node[which], parent, cur);
-	rb_insert_color(&slot->gfn_node[which], gfn_tree);
+	rb_link_node(&slot->gfn_node[idx], parent, node);
+	rb_insert_color(&slot->gfn_node[idx], gfn_tree);
+}
+
+static void kvm_memslot_gfn_erase(struct kvm_memslots *slots,
+				  struct kvm_memory_slot *slot)
+{
+	rb_erase(&slot->gfn_node[slots->node_idx], &slots->gfn_tree);
+}
+
+static void kvm_memslot_gfn_replace(struct kvm_memslots *slots,
+				    struct kvm_memory_slot *old,
+				    struct kvm_memory_slot *new)
+{
+	int idx = slots->node_idx;
+
+	WARN_ON_ONCE(old->base_gfn != new->base_gfn);
+
+	rb_replace_node(&old->gfn_node[idx], &new->gfn_node[idx],
+			&slots->gfn_tree);
 }
 
-/*
- * Just copies the memslot data.
- * Does not copy or touch the embedded nodes, including the ranges at hva_nodes.
- */
 static void kvm_copy_memslot(struct kvm_memory_slot *dest,
-			     struct kvm_memory_slot *src)
+			     const struct kvm_memory_slot *src)
 {
 	dest->base_gfn = src->base_gfn;
 	dest->npages = src->npages;
@@ -1188,89 +1202,72 @@ static void kvm_copy_memslot(struct kvm_memory_slot *dest,
 	dest->flags = src->flags;
 	dest->id = src->id;
 	dest->as_id = src->as_id;
-}
 
-/*
- * Initializes the ranges at both hva_nodes from the memslot userspace_addr
- * and npages fields.
- */
-static void kvm_init_memslot_hva_ranges(struct kvm_memory_slot *slot)
-{
-	slot->hva_node[0].start = slot->hva_node[1].start =
-		slot->userspace_addr;
-	slot->hva_node[0].last = slot->hva_node[1].last =
-		slot->userspace_addr + (slot->npages << PAGE_SHIFT) - 1;
+	dest->hva_node[0].start = dest->hva_node[1].start =
+		dest->userspace_addr;
+	dest->hva_node[0].last = dest->hva_node[1].last =
+		dest->userspace_addr + (dest->npages << PAGE_SHIFT) - 1;
 }
 
 /*
- * Replaces the @oldslot with @nslot in the memslot set indicated by
- * @slots_idx.
+ * Replace @old with @new in @slots.
  *
- * With NULL @oldslot this simply adds the @nslot to the set.
- * With NULL @nslot this simply removes the @oldslot from the set.
+ * With NULL @old this simply adds the @new to @slots.
+ * With NULL @new this simply removes the @old from @slots.
  *
- * If @nslot is non-NULL its hva_node[slots_idx] range has to be set
+ * If @new is non-NULL its hva_node[slots->node_idx] range has to be set
  * appropriately.
  */
-static void kvm_replace_memslot(struct kvm *kvm,
-				int as_id, int slots_idx,
-				struct kvm_memory_slot *oldslot,
-				struct kvm_memory_slot *nslot)
+static void kvm_replace_memslot(struct kvm_memslots *slots,
+				struct kvm_memory_slot *old,
+				struct kvm_memory_slot *new)
 {
-	struct kvm_memslots *slots = &kvm->memslots_all[as_id][slots_idx];
+	int idx = slots->node_idx;
 
-	if (WARN_ON(!oldslot && !nslot))
-		return;
-
-	if (oldslot) {
-		hash_del(&oldslot->id_node[slots_idx]);
-		interval_tree_remove(&oldslot->hva_node[slots_idx],
-				     &slots->hva_tree);
+	if (old) {
+		hash_del(&old->id_node[idx]);
+		interval_tree_remove(&old->hva_node[idx], &slots->hva_tree);
 		atomic_long_cmpxchg(&slots->lru_slot,
-				    (unsigned long)oldslot,
-				    (unsigned long)nslot);
-		if (!nslot) {
-			rb_erase(&oldslot->gfn_node[slots_idx],
-				 &slots->gfn_tree);
+				    (unsigned long)old, (unsigned long)new);
+		if (!new) {
+			kvm_memslot_gfn_erase(slots, old);
 			return;
 		}
 	}
 
-	hash_add(slots->id_hash, &nslot->id_node[slots_idx],
-		 nslot->id);
 	WARN_ON(PAGE_SHIFT > 0 &&
-		nslot->hva_node[slots_idx].start >=
-		nslot->hva_node[slots_idx].last);
-	interval_tree_insert(&nslot->hva_node[slots_idx],
-			     &slots->hva_tree);
+		new->hva_node[idx].start >= new->hva_node[idx].last);
+	hash_add(slots->id_hash, &new->id_node[idx], new->id);
+	interval_tree_insert(&new->hva_node[idx], &slots->hva_tree);
 
 	/* Shame there is no O(1) interval_tree_replace()... */
-	if (oldslot && oldslot->base_gfn == nslot->base_gfn)
-		rb_replace_node(&oldslot->gfn_node[slots_idx],
-				&nslot->gfn_node[slots_idx],
-				&slots->gfn_tree);
-	else {
-		if (oldslot)
-			rb_erase(&oldslot->gfn_node[slots_idx],
-				 &slots->gfn_tree);
-		kvm_memslot_gfn_insert(&slots->gfn_tree,
-				       nslot, slots_idx);
+	if (old && old->base_gfn == new->base_gfn) {
+		kvm_memslot_gfn_replace(slots, old, new);
+	} else {
+		if (old)
+			kvm_memslot_gfn_erase(slots, old);
+		kvm_memslot_gfn_insert(slots, new);
 	}
 }
 
-/*
- * Copies the @oldslot data into @nslot and uses this slot to replace
- * @oldslot in the memslot set indicated by @slots_idx.
- */
-static void kvm_copy_replace_memslot(struct kvm *kvm,
-				     int as_id, int slots_idx,
-				     struct kvm_memory_slot *oldslot,
-				     struct kvm_memory_slot *nslot)
+static void kvm_activate_memslot(struct kvm *kvm, int as_id,
+				 struct kvm_memslots **active,
+				 struct kvm_memslots **inactive,
+				 struct kvm_memory_slot *old,
+				 struct kvm_memory_slot *new)
 {
-	kvm_copy_memslot(nslot, oldslot);
-	kvm_init_memslot_hva_ranges(nslot);
+	/*
+	 * Swap the active <-> inactive memslots.  Note, this also swaps
+	 * the active and inactive pointers themselves.
+	 */
+	kvm_swap_active_memslots(kvm, as_id, active, inactive);
 
-	kvm_replace_memslot(kvm, as_id, slots_idx, oldslot, nslot);
+	/* Propagate the new memslot to the now inactive memslots. */
+	kvm_replace_memslot(*inactive, old, new);
+
+	/* And free the old slot. */
+	if (old)
+		kfree(old);
 }
 
 static int kvm_set_memslot(struct kvm *kvm,
@@ -1279,37 +1276,43 @@ static int kvm_set_memslot(struct kvm *kvm,
 			   struct kvm_memory_slot *new, int as_id,
 			   enum kvm_mr_change change)
 {
-	struct kvm_memslots *slotsact = __kvm_memslots(kvm, as_id);
-	int idxact = kvm_memslots_idx(slotsact);
-	int idxina = idxact == 0 ? 1 : 0;
-	struct kvm_memslots *slotsina = &kvm->memslots_all[as_id][idxina];
-	struct kvm_memory_slot *slotina, *slotact;
+	struct kvm_memslots *active = __kvm_memslots(kvm, as_id);
+	struct kvm_memslots *inactive = &kvm->__memslots[as_id][active->node_idx ^ 1];
+	struct kvm_memory_slot *slot, *tmp;
 	int r;
 
-	slotina = kzalloc(sizeof(*slotina), GFP_KERNEL_ACCOUNT);
-	if (!slotina)
+	if (change != KVM_MR_CREATE) {
+		slot = id_to_memslot(active, old->id);
+		if (WARN_ON_ONCE(!slot))
+			return -EIO;
+	}
+
+	/*
+	 * Modifications are done on a tmp, unreachable slot.  The changes are
+	 * then (eventually) propagated to both the active and inactive slots.
+	 */
+	tmp = kzalloc(sizeof(*tmp), GFP_KERNEL_ACCOUNT);
+	if (!tmp)
 		return -ENOMEM;
 
-	if (change != KVM_MR_CREATE)
-		slotact = id_to_memslot(slotsact, old->id);
-
 	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
 		/*
-		 * Replace the slot to be deleted or moved in the inactive
-		 * memslot set by its copy with KVM_MEMSLOT_INVALID flag set.
+		 * Mark the current slot INVALID.  This must be done on the tmp
+		 * slot to avoid modifying the current slot in the active tree.
 		 */
-		kvm_copy_replace_memslot(kvm, as_id, idxina, slotact, slotina);
-		slotina->flags |= KVM_MEMSLOT_INVALID;
+		kvm_copy_memslot(tmp, slot);
+		tmp->flags |= KVM_MEMSLOT_INVALID;
+		kvm_replace_memslot(inactive, slot, tmp);
 
 		/*
-		 * Swap the active <-> inactive memslot set.
-		 * Now the active memslot set still contains the memslot to be
-		 * deleted or moved, but with the KVM_MEMSLOT_INVALID flag set.
-		 */
-		swap_memslots(kvm, as_id);
-		swap(idxact, idxina);
-		swap(slotsina, slotsact);
-		swap(slotact, slotina);
+		 * Activate the slot that is now marked INVALID, but don't
+		 * propagate the slot to the now inactive slots.  The slot is
+		 * either going to be deleted or recreated as a new slot.
+		 */
+		kvm_swap_active_memslots(kvm, as_id, &active, &inactive);
+
+		/* The temporary and current slot have swapped roles. */
+		swap(tmp, slot);
 
 		/*
 		 * From this point no new shadow pages pointing to a deleted,
@@ -1319,139 +1322,93 @@ static int kvm_set_memslot(struct kvm *kvm,
 		 *	- gfn_to_hva (kvm_read_guest, gfn_to_pfn)
 		 *	- kvm_is_visible_gfn (mmu_check_root)
 		 */
-		kvm_arch_flush_shadow_memslot(kvm, slotact);
+		kvm_arch_flush_shadow_memslot(kvm, slot);
 	}
 
 	r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
-	if (r)
-		goto out_slots;
+	if (r) {
+		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
+			/*
+			 * Revert the above INVALID change.  No modifications
+			 * required since the original slot was preserved in
+			 * the inactive slots.
+			 */
+			kvm_swap_active_memslots(kvm, as_id, &active, &inactive);
+			swap(tmp, slot);
+		}
+		kfree(tmp);
+		return r;
+	}
 
 	if (change == KVM_MR_MOVE) {
 		/*
-		 * Since we are going to be changing the memslot gfn we need to
-		 * remove it from the gfn tree so it can be re-added there with
-		 * the updated gfn.
+		 * The memslot's gfn is changing, remove it from the inactive
+		 * tree, it will be re-added with its updated gfn.  Because its
+		 * range is changing, an in-place replace is not possible.
 		 */
-		rb_erase(&slotina->gfn_node[idxina],
-			 &slotsina->gfn_tree);
+		kvm_memslot_gfn_erase(inactive, tmp);
 
-		slotina->base_gfn = new->base_gfn;
-		slotina->flags = new->flags;
-		slotina->dirty_bitmap = new->dirty_bitmap;
+		tmp->base_gfn = new->base_gfn;
+		tmp->flags = new->flags;
+		tmp->dirty_bitmap = new->dirty_bitmap;
 		/* kvm_arch_prepare_memory_region() might have modified arch */
-		slotina->arch = new->arch;
+		tmp->arch = new->arch;
 
 		/* Re-add to the gfn tree with the updated gfn */
-		kvm_memslot_gfn_insert(&slotsina->gfn_tree,
-				       slotina, idxina);
+		kvm_memslot_gfn_insert(inactive, tmp);
 
-		/*
-		 * Swap the active <-> inactive memslot set.
-		 * Now the active memslot set contains the new, final memslot.
-		 */
-		swap_memslots(kvm, as_id);
-		swap(idxact, idxina);
-		swap(slotsina, slotsact);
-		swap(slotact, slotina);
-
-		/*
-		 * Replace the temporary KVM_MEMSLOT_INVALID slot with the
-		 * new, final memslot in the inactive memslot set and
-		 * free the temporary memslot.
-		 */
-		kvm_replace_memslot(kvm, as_id, idxina, slotina, slotact);
-		kfree(slotina);
+		/* Replace the current INVALID slot with the updated memslot. */
+		kvm_activate_memslot(kvm, as_id, &active, &inactive, slot, tmp);
 	} else if (change == KVM_MR_FLAGS_ONLY) {
 		/*
-		 * Almost like the move case above, but we don't use a temporary
-		 * KVM_MEMSLOT_INVALID slot.
-		 * Instead, we simply replace the old memslot with a new, updated
-		 * copy in both memslot sets.
+		 * Similar to the MOVE case, but the slot doesn't need to be
+		 * zapped as an intermediate step.  Instead, the old memslot is
+		 * simply replaced with a new, updated copy in both memslot sets.
 		 *
-		 * Since we aren't going to be changing the memslot gfn we can
-		 * simply use kvm_copy_replace_memslot(), which will use
-		 * rb_replace_node() to switch the memslot node in the gfn tree
-		 * instead of removing the old one and inserting the new one
-		 * as two separate operations.
-		 * It's a performance win since node replacement is a single
-		 * O(1) operation as opposed to two O(log(n)) operations for
-		 * slot removal and then re-insertion.
+		 * Since the memslot gfn is unchanged, kvm_replace_memslot()
+		 * can use kvm_memslot_gfn_replace() to switch the node
+		 * in the gfn tree instead of removing the old and inserting the
+		 * new as two separate operations.  Replacement is a single O(1)
+		 * operation versus two O(log(n)) operations for remove+insert.
 		 */
-		kvm_copy_replace_memslot(kvm, as_id, idxina, slotact, slotina);
-		slotina->flags = new->flags;
-		slotina->dirty_bitmap = new->dirty_bitmap;
+		kvm_copy_memslot(tmp, slot);
+		tmp->flags = new->flags;
+		tmp->dirty_bitmap = new->dirty_bitmap;
 		/* kvm_arch_prepare_memory_region() might have modified arch */
-		slotina->arch = new->arch;
+		tmp->arch = new->arch;
+		kvm_replace_memslot(inactive, slot, tmp);
 
-		/* Swap the active <-> inactive memslot set. */
-		swap_memslots(kvm, as_id);
-		swap(idxact, idxina);
-		swap(slotsina, slotsact);
-		swap(slotact, slotina);
-
-		/*
-		 * Replace the old memslot in the other memslot set and
-		 * then finally free it.
-		 */
-		kvm_replace_memslot(kvm, as_id, idxina, slotina, slotact);
-		kfree(slotina);
+		kvm_activate_memslot(kvm, as_id, &active, &inactive, slot, tmp);
 	} else if (change == KVM_MR_CREATE) {
 		/*
-		 * Add the new memslot to the current inactive set as a copy
-		 * of the provided new memslot data.
+		 * Add the new memslot to the inactive set as a copy of the
+		 * new memslot data provided by userspace.
 		 */
-		kvm_copy_memslot(slotina, new);
-		kvm_init_memslot_hva_ranges(slotina);
+		kvm_copy_memslot(tmp, new);
+		kvm_replace_memslot(inactive, NULL, tmp);
 
-		kvm_replace_memslot(kvm, as_id, idxina, NULL, slotina);
-
-		/* Swap the active <-> inactive memslot set. */
-		swap_memslots(kvm, as_id);
-		swap(idxact, idxina);
-		swap(slotsina, slotsact);
-
-		/* Now add it also to the other memslot set */
-		kvm_replace_memslot(kvm, as_id, idxina, NULL, slotina);
+		kvm_activate_memslot(kvm, as_id, &active, &inactive, NULL, tmp);
 	} else if (change == KVM_MR_DELETE) {
 		/*
-		 * Remove the old memslot from the current inactive set
-		 * (the other, active set contains the temporary
-		 * KVM_MEMSLOT_INVALID slot)
-		 */
-		kvm_replace_memslot(kvm, as_id, idxina, slotina, NULL);
-
-		/* Swap the active <-> inactive memslot set. */
-		swap_memslots(kvm, as_id);
-		swap(idxact, idxina);
-		swap(slotsina, slotsact);
-		swap(slotact, slotina);
-
-		/* Remove the temporary KVM_MEMSLOT_INVALID slot and free it. */
-		kvm_replace_memslot(kvm, as_id, idxina, slotina, NULL);
-		kfree(slotina);
-		/* slotact will be freed by kvm_free_memslot() */
-	} else
+		 * Remove the old memslot (in the inactive memslots) and activate
+		 * the NULL slot.
+		 */
+		kvm_replace_memslot(inactive, tmp, NULL);
+		kvm_activate_memslot(kvm, as_id, &active, &inactive, slot, NULL);
+	} else {
 		BUG();
+	}
 
 	kvm_arch_commit_memory_region(kvm, mem, old, new, change);
 
+	/*
+	 * Free the memslot and its metadata.  Note, slot and tmp hold the same
+	 * metadata, but slot is freed as part of activation.  It's tmp's turn.
+	 */
 	if (change == KVM_MR_DELETE)
-		kvm_free_memslot(kvm, slotact);
+		kvm_free_memslot(kvm, tmp);
 
 	return 0;
-
-out_slots:
-	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
-		swap_memslots(kvm, as_id);
-		swap(idxact, idxina);
-		swap(slotsina, slotsact);
-		swap(slotact, slotina);
-
-		kvm_replace_memslot(kvm, as_id, idxina, slotina, slotact);
-	}
-	kfree(slotina);
-
-	return r;
 }
 
 static int kvm_delete_memslot(struct kvm *kvm,
-- 
2.32.0.rc0.204.g9fa02ecfa5-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 7/8] KVM: Optimize gfn lookup in kvm_zap_gfn_range()
  2021-05-16 21:44 ` [PATCH v3 7/8] KVM: Optimize gfn lookup in kvm_zap_gfn_range() Maciej S. Szmigiero
@ 2021-05-26 17:33   ` Sean Christopherson
  2021-06-01 20:25     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2021-05-26 17:33 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> Introduce a memslots gfn upper bound operation and use it to optimize
> kvm_zap_gfn_range().
> This way this handler can do a quick lookup for intersecting gfns and won't
> have to do a linear scan of the whole memslot set.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 41 ++++++++++++++++++++++++++++++++++++++--
>  include/linux/kvm_host.h | 22 +++++++++++++++++++++
>  2 files changed, 61 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7222b552d139..f23398cf0316 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5490,14 +5490,51 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  	int i;
>  	bool flush = false;
>  
> +	if (gfn_end == gfn_start || WARN_ON(gfn_end < gfn_start))
> +		return;
> +
>  	write_lock(&kvm->mmu_lock);
>  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> -		int ctr;
> +		int idxactive;
> +		struct rb_node *node;
>  
>  		slots = __kvm_memslots(kvm, i);
> -		kvm_for_each_memslot(memslot, ctr, slots) {
> +		idxactive = kvm_memslots_idx(slots);
> +
> +		/*
> +		 * Find the slot with the lowest gfn that can possibly intersect with
> +		 * the range, so we'll ideally have slot start <= range start
> +		 */
> +		node = kvm_memslots_gfn_upper_bound(slots, gfn_start);
> +		if (node) {
> +			struct rb_node *pnode;
> +
> +			/*
> +			 * A NULL previous node means that the very first slot
> +			 * already has a higher start gfn.
> +			 * In this case slot start > range start.
> +			 */
> +			pnode = rb_prev(node);
> +			if (pnode)
> +				node = pnode;
> +		} else {
> +			/* a NULL node below means no slots */
> +			node = rb_last(&slots->gfn_tree);
> +		}
> +
> +		for ( ; node; node = rb_next(node)) {
>  			gfn_t start, end;

Can this be abstracted into something like:

		kvm_for_each_memslot_in_gfn_range(...) {

		}

and share that implementation with kvm_check_memslot_overlap() in the next patch?

I really don't think arch code should be poking into gfn_tree, and ideally arch
code wouldn't even be aware that gfn_tree exists.
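
Very rough, compile-untested sketch of what I have in mind; kvm_first_gfn_node()
and the iterator name are made up, this assumes the node_idx field suggested for
the previous patch, and kvm_memslots_gfn_upper_bound() is the helper this patch
adds:

static inline struct rb_node *kvm_first_gfn_node(struct kvm_memslots *slots,
						 gfn_t start)
{
	struct rb_node *node = kvm_memslots_gfn_upper_bound(slots, start);

	/*
	 * Step back one node so that a slot starting below @start but
	 * overlapping it isn't skipped; with no upper bound, fall back to
	 * the last slot.
	 */
	if (node) {
		struct rb_node *prev = rb_prev(node);

		if (prev)
			node = prev;
	} else {
		node = rb_last(&slots->gfn_tree);
	}

	return node;
}

#define kvm_for_each_memslot_in_gfn_range(memslot, node, slots, start, end)	\
	for (node = kvm_first_gfn_node(slots, start);				\
	     node && (memslot = container_of(node, struct kvm_memory_slot,	\
					     gfn_node[(slots)->node_idx]),	\
		      memslot->base_gfn < (end));				\
	     node = rb_next(node))

Callers would still need the max()/min() clamping and the start >= end check,
same as below, since the first slot returned may not actually overlap the range.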

> +			memslot = container_of(node, struct kvm_memory_slot,
> +					       gfn_node[idxactive]);
> +
> +			/*
> +			 * If this slot starts beyond or at the end of the range, so does
> +			 * every next one
> +			 */
> +			if (memslot->base_gfn >= gfn_end)
> +				break;
> +
>  			start = max(gfn_start, memslot->base_gfn);
>  			end = min(gfn_end, memslot->base_gfn + memslot->npages);
>  			if (start >= end)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 6/8] KVM: Keep memslots in tree-based structures instead of array-based ones
  2021-05-25 23:21   ` Sean Christopherson
@ 2021-06-01 20:24     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-06-01 20:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On 26.05.2021 01:21, Sean Christopherson wrote:
> Overall, I like it!  Didn't see any functional issues, though that comes with a
> disclaimer that functionality was a secondary focus for this pass.
> 
> I have lots of comments, but they're (almost?) all mechanical.
> 
> The most impactful feedback is to store the actual node index in the memslots
> instead of storing whether or not an instance is node0.  This has a cascading
> effect that allows for substantial cleanup, specifically that it obviates the
> motivation for caching the active vs. inactive indices in local variables.  That
> in turn reduces naming collisions, which allows using more generic (but easily
> parsed/read) names.
> 
> I also tweaked/added quite a few comments, mostly as documentation of my own
> (mis)understanding of the code.  Patch attached (hopefully), with another
> disclaimer that it's compile tested only, and only on x86.

Thanks for the review, Sean!

I agree that storing the actual node index in the memslot set makes the
code more readable.

Thanks for the suggested patch (which can mostly be integrated as-is into
this commit); however, I think its changes are deep enough that you should
at least be tagged "Co-developed-by:" here.

My remaining comments are below, inline.

Maciej

> On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
>>   arch/arm64/kvm/mmu.c                |   8 +-
>>   arch/powerpc/kvm/book3s_64_mmu_hv.c |   4 +-
>>   arch/powerpc/kvm/book3s_hv.c        |   3 +-
>>   arch/powerpc/kvm/book3s_hv_nested.c |   4 +-
>>   arch/powerpc/kvm/book3s_hv_uvmem.c  |  14 +-
>>   arch/s390/kvm/kvm-s390.c            |  27 +-
>>   arch/s390/kvm/kvm-s390.h            |   7 +-
>>   arch/x86/kvm/mmu/mmu.c              |   4 +-
>>   include/linux/kvm_host.h            | 100 ++---
>>   virt/kvm/kvm_main.c                 | 580 ++++++++++++++--------------
>>   10 files changed, 379 insertions(+), 372 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c5d1f3c87dbd..2b4ced4f1e55 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -199,13 +199,13 @@ static void stage2_flush_vm(struct kvm *kvm)
>>   {
>>   	struct kvm_memslots *slots;
>>   	struct kvm_memory_slot *memslot;
>> -	int idx;
>> +	int idx, ctr;
> 
> Let's use 'bkt' instead of 'ctr', purely because that's what the interval tree uses.  KVM
> itself shouldn't care since it shouldn't be poking into those details anyway.

Will do (BTW I guess you meant 'hash table' not 'interval tree' here).

>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index f59847b6e9b3..a9c5b0df2311 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -29,6 +29,7 @@
>>   #include <linux/nospec.h>
>>   #include <linux/interval_tree.h>
>>   #include <linux/hashtable.h>
>> +#include <linux/rbtree.h>
>>   #include <asm/signal.h>
>>   
>>   #include <linux/kvm.h>
>> @@ -358,8 +359,9 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>>   #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
>>   
>>   struct kvm_memory_slot {
>> -	struct hlist_node id_node;
>> -	struct interval_tree_node hva_node;
>> +	struct hlist_node id_node[2];
>> +	struct interval_tree_node hva_node[2];
>> +	struct rb_node gfn_node[2];
> 
> This block needs a comment explaining the dual (duelling?) tree system.

Will do.
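
Something along these lines, perhaps (draft wording only, to be polished for the
next version):

	/*
	 * Each memslot is a node in two instances of every tracking
	 * structure: index [0] ties it into one of the two per-address-space
	 * kvm_memslots sets, index [1] into the other.  The same slot object
	 * is therefore reachable from both the active and the inactive set
	 * at once, and can be added to or removed from the inactive set
	 * without disturbing the set readers currently see.
	 */
	struct hlist_node id_node[2];
	struct interval_tree_node hva_node[2];
	struct rb_node gfn_node[2];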

>>   	gfn_t base_gfn;
>>   	unsigned long npages;
>>   	unsigned long *dirty_bitmap;
>> @@ -454,19 +456,14 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>>   }
>>   #endif
>>   
>> -/*
>> - * Note:
>> - * memslots are not sorted by id anymore, please use id_to_memslot()
>> - * to get the memslot by its id.
>> - */
>>   struct kvm_memslots {
>>   	u64 generation;
>> +	atomic_long_t lru_slot;
>>   	struct rb_root_cached hva_tree;
>> -	/* The mapping table from slot id to the index in memslots[]. */
>> +	struct rb_root gfn_tree;
>> +	/* The mapping table from slot id to memslot. */
>>   	DECLARE_HASHTABLE(id_hash, 7);
>> -	atomic_t lru_slot;
>> -	int used_slots;
>> -	struct kvm_memory_slot memslots[];
>> +	bool is_idx_0;
> 
> This is where storing an int helps.  I was thinking 'int node_idx'.

Looks sensible.

>>   };
>>   
>>   struct kvm {
>> @@ -478,6 +475,7 @@ struct kvm {
>>   
>>   	struct mutex slots_lock;
>>   	struct mm_struct *mm; /* userspace tied to this vm */
>> +	struct kvm_memslots memslots_all[KVM_ADDRESS_SPACE_NUM][2];
> 
> I think it makes sense to call this '__memslots', to try to convey that it is
> backing for a front end.  'memslots_all' could be misinterpreted as "memslots
> for all address spaces".  A comment is probably warranted, too.

Will add comment here, your patch already renames the variable.

>>   	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
>>   	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>>   
>> @@ -617,12 +615,6 @@ static inline int kvm_vcpu_get_idx(struct kvm_vcpu *vcpu)
>>   	return vcpu->vcpu_idx;
>>   }
>>   
>> -#define kvm_for_each_memslot(memslot, slots)				\
>> -	for (memslot = &slots->memslots[0];				\
>> -	     memslot < slots->memslots + slots->used_slots; memslot++)	\
>> -		if (WARN_ON_ONCE(!memslot->npages)) {			\
>> -		} else
>> -
>>   void kvm_vcpu_destroy(struct kvm_vcpu *vcpu);
>>   
>>   void vcpu_load(struct kvm_vcpu *vcpu);
>> @@ -682,6 +674,22 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
>>   	return __kvm_memslots(vcpu->kvm, as_id);
>>   }
>>   
>> +static inline bool kvm_memslots_empty(struct kvm_memslots *slots)
>> +{
>> +	return RB_EMPTY_ROOT(&slots->gfn_tree);
>> +}
>> +
>> +static inline int kvm_memslots_idx(struct kvm_memslots *slots)
>> +{
>> +	return slots->is_idx_0 ? 0 : 1;
>> +}
> 
> This helper can go away.

Your patch already does that.

>> +
>> +#define kvm_for_each_memslot(memslot, ctr, slots)	\
> 
> Use 'bkt' again.

Will do.

>> +	hash_for_each(slots->id_hash, ctr, memslot,	\
>> +		      id_node[kvm_memslots_idx(slots)]) \
> 
> With 'node_idx, this can squeak into a single line:
> 
> 	hash_for_each(slots->id_hash, bkt, memslot, id_node[slots->node_idx]) \

Your patch already does that.

>> +		if (WARN_ON_ONCE(!memslot->npages)) {	\
>> +		} else
>> +
>>   #define kvm_for_each_hva_range_memslot(node, slots, start, last)	     \
>>   	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
>>   	     node;							     \
>> @@ -690,9 +698,10 @@ static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
>>   static inline
>>   struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
>>   {
>> +	int idxactive = kvm_memslots_idx(slots);
> 
> Use 'idx'.  Partly for readability, partly because this function doesn't (and
> shouldn't) care whether or not @slots is the active set.

Your patch already does that.

>>   	struct kvm_memory_slot *slot;
>>   
>> -	hash_for_each_possible(slots->id_hash, slot, id_node, id) {
>> +	hash_for_each_possible(slots->id_hash, slot, id_node[idxactive], id) {
>>   		if (slot->id == id)
>>   			return slot;
>>   	}
>> @@ -1102,42 +1111,39 @@ bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
>>    * With "approx" set returns the memslot also when the address falls
>>    * in a hole. In that case one of the memslots bordering the hole is
>>    * returned.
>> - *
>> - * IMPORTANT: Slots are sorted from highest GFN to lowest GFN!
>>    */
>>   static inline struct kvm_memory_slot *
>>   search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)
>>   {
>> -	int start = 0, end = slots->used_slots;
>> -	int slot = atomic_read(&slots->lru_slot);
>> -	struct kvm_memory_slot *memslots = slots->memslots;
>> -
>> -	if (unlikely(!slots->used_slots))
>> -		return NULL;
>> -
>> -	if (gfn >= memslots[slot].base_gfn &&
>> -	    gfn < memslots[slot].base_gfn + memslots[slot].npages)
>> -		return &memslots[slot];
>> -
>> -	while (start < end) {
>> -		slot = start + (end - start) / 2;
>> -
>> -		if (gfn >= memslots[slot].base_gfn)
>> -			end = slot;
>> -		else
>> -			start = slot + 1;
>> +	int idxactive = kvm_memslots_idx(slots);
> 
> Same as above, s/idxactive/idx.

Your patch already does that.

>> +	struct kvm_memory_slot *slot;
>> +	struct rb_node *prevnode, *node;
>> +
>> +	slot = (struct kvm_memory_slot *)atomic_long_read(&slots->lru_slot);
>> +	if (slot &&
>> +	    gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
>> +		return slot;
>> +
>> +	for (prevnode = NULL, node = slots->gfn_tree.rb_node; node; ) {
>> +		prevnode = node;
>> +		slot = container_of(node, struct kvm_memory_slot,
>> +				    gfn_node[idxactive]);
> 
> With 'idx', this can go on a single line.  It runs over by two chars, but the 80
> char limit is a soft limit, and IMO avoiding line breaks for things like this
> improves readability.

Your patch already does that.
    
>> +		if (gfn >= slot->base_gfn) {
>> +			if (gfn < slot->base_gfn + slot->npages) {
>> +				atomic_long_set(&slots->lru_slot,
>> +						(unsigned long)slot);
>> +				return slot;
>> +			}
>> +			node = node->rb_right;
>> +		} else
>> +			node = node->rb_left;
>>   	}
>>   
>> -	if (approx && start >= slots->used_slots)
>> -		return &memslots[slots->used_slots - 1];
>> +	if (approx && prevnode)
>> +		return container_of(prevnode, struct kvm_memory_slot,
>> +				    gfn_node[idxactive]);
> 
> And arguably the same here, though the overrun is a wee bit worse.

Your patch already does that.
    
>>   
>> -	if (start < slots->used_slots && gfn >= memslots[start].base_gfn &&
>> -	    gfn < memslots[start].base_gfn + memslots[start].npages) {
>> -		atomic_set(&slots->lru_slot, start);
>> -		return &memslots[start];
>> -	}
>> -
>> -	return approx ? &memslots[start] : NULL;
>> +	return NULL;
>>   }
>>   
>>   static inline struct kvm_memory_slot *
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index a55309432c9a..189504b27ca6 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -510,15 +510,17 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>>   	}
>>   
>>   	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>> +		int idxactive;
> 
> This variable can be avoided entirely by using slots->node_idx directly (my
> favored 'idx' is stolen for SRCU).

Your patch already does that.
    
>>   		struct interval_tree_node *node;
>>   
>>   		slots = __kvm_memslots(kvm, i);
>> +		idxactive = kvm_memslots_idx(slots);
>>   		kvm_for_each_hva_range_memslot(node, slots,
>>   					       range->start, range->end - 1) {
>>   			unsigned long hva_start, hva_end;
>>   
>>   			slot = container_of(node, struct kvm_memory_slot,
>> -					    hva_node);
>> +					    hva_node[idxactive]);
>>   			hva_start = max(range->start, slot->userspace_addr);
>>   			hva_end = min(range->end, slot->userspace_addr +
>>   						  (slot->npages << PAGE_SHIFT));
>> @@ -785,18 +787,12 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>>   
>>   #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>>   
>> -static struct kvm_memslots *kvm_alloc_memslots(void)
>> +static void kvm_init_memslots(struct kvm_memslots *slots)
>>   {
>> -	struct kvm_memslots *slots;
>> -
>> -	slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT);
>> -	if (!slots)
>> -		return NULL;
>> -
>> +	atomic_long_set(&slots->lru_slot, (unsigned long)NULL);
>>   	slots->hva_tree = RB_ROOT_CACHED;
>> +	slots->gfn_tree = RB_ROOT;
>>   	hash_init(slots->id_hash);
>> -
>> -	return slots;
> 
> With 'node_idx' in the slots, it's easier to open code this in the loop and
> drop kvm_init_memslots().

Your patch already does that.
    
>>   }
>>   
>>   static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
>> @@ -808,27 +804,31 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
>>   	memslot->dirty_bitmap = NULL;
>>   }
>>   
>> +/* This does not remove the slot from struct kvm_memslots data structures */
>>   static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>>   {
>>   	kvm_destroy_dirty_bitmap(slot);
>>   
>>   	kvm_arch_free_memslot(kvm, slot);
>>   
>> -	slot->flags = 0;
>> -	slot->npages = 0;
>> +	kfree(slot);
>>   }
>>   
>>   static void kvm_free_memslots(struct kvm *kvm, struct kvm_memslots *slots)
>>   {
>> +	int ctr;
>> +	struct hlist_node *idnode;
>>   	struct kvm_memory_slot *memslot;
>>   
>> -	if (!slots)
>> +	/*
>> +	 * Both active and inactive struct kvm_memslots should point to
>> +	 * the same set of memslots, so it's enough to free them once
>> +	 */
> 
> Thumbs up for comments!

Thanks :)

> It would be very helpful to state that which index is used is completely arbitrary.

Your patch already states that; however, it switches the actual freeing
from idx 1 to idx 0.

Even though technically it does not matter, I think it just looks better
when the first kvm_free_memslots() call does nothing and the second
invocation actually frees the slot data rather than the first call
freeing the data and the second invocation operating over a structure
with dangling pointers (even if the function isn't actually touching
them).

>> +	if (slots->is_idx_0)
>>   		return;
>>   
>> -	kvm_for_each_memslot(memslot, slots)
>> +	hash_for_each_safe(slots->id_hash, ctr, idnode, memslot, id_node[1])
>>   		kvm_free_memslot(kvm, memslot);
>> -
>> -	kvfree(slots);
>>   }
>>   
>>   static void kvm_destroy_vm_debugfs(struct kvm *kvm)
>> @@ -924,13 +924,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
>>   
>>   	refcount_set(&kvm->users_count, 1);
>>   	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>> -		struct kvm_memslots *slots = kvm_alloc_memslots();
>> +		kvm_init_memslots(&kvm->memslots_all[i][0]);
>> +		kvm_init_memslots(&kvm->memslots_all[i][1]);
>> +		kvm->memslots_all[i][0].is_idx_0 = true;
>> +		kvm->memslots_all[i][1].is_idx_0 = false;
>>   
>> -		if (!slots)
>> -			goto out_err_no_arch_destroy_vm;
>>   		/* Generations must be different for each address space. */
>> -		slots->generation = i;
>> -		rcu_assign_pointer(kvm->memslots[i], slots);
>> +		kvm->memslots_all[i][0].generation = i;
>> +		rcu_assign_pointer(kvm->memslots[i], &kvm->memslots_all[i][0]);
> 
> Open coding this with node_idx looks like so:
> 
> 		for (j = 0; j < 2; j++) {
> 			slots = &kvm->__memslots[i][j];
> 
> 			atomic_long_set(&slots->lru_slot, (unsigned long)NULL);
> 			slots->hva_tree = RB_ROOT_CACHED;
> 			slots->gfn_tree = RB_ROOT;
> 			hash_init(slots->id_hash);
> 			slots->node_idx = j;
> 
> 			/* Generations must be different for each address space. */
> 			slots->generation = i;
> 		}
> 
> 		rcu_assign_pointer(kvm->memslots[i], &kvm->__memslots[i][0]);

Your patch already does that.

(..)
>> @@ -1351,44 +1148,129 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
>>   	kvm_arch_memslots_updated(kvm, gen);
>>   
>>   	slots->generation = gen;
>> +}
>> +
>> +static void kvm_memslot_gfn_insert(struct rb_root *gfn_tree,
>> +				  struct kvm_memory_slot *slot,
>> +				  int which)
> 
> Pass slots instead of the tree; that way the index doesn't need to be passed
> separately.  And similar to previous feedback, s/which/idx.

Your patch already does that.

>> +{
>> +	struct rb_node **cur, *parent;
>> +
>> +	for (cur = &gfn_tree->rb_node, parent = NULL; *cur; ) {
> 
> I think it makes sense to initialize 'parent' outside of the loop, both to make the
> loop control flow easier to read, and to make it more obvious that parent _must_
> be initialized in the empty case.

Your patch already does that.

>> +		struct kvm_memory_slot *cslot;
> 
> I'd prefer s/cur/node and s/cslot/tmp.  'cslot' in particular is hard to parse.

Your patch already does that.

>> +		cslot = container_of(*cur, typeof(*cslot), gfn_node[which]);
>> +		parent = *cur;
>> +		if (slot->base_gfn < cslot->base_gfn)
>> +			cur = &(*cur)->rb_left;
>> +		else if (slot->base_gfn > cslot->base_gfn)
>> +			cur = &(*cur)->rb_right;
>> +		else
>> +			BUG();
>> +	}
>>   
>> -	return old_memslots;
>> +	rb_link_node(&slot->gfn_node[which], parent, cur);
>> +	rb_insert_color(&slot->gfn_node[which], gfn_tree);
>>   }
>>   
>>   /*
>> - * Note, at a minimum, the current number of used slots must be allocated, even
>> - * when deleting a memslot, as we need a complete duplicate of the memslots for
>> - * use when invalidating a memslot prior to deleting/moving the memslot.
>> + * Just copies the memslot data.
>> + * Does not copy or touch the embedded nodes, including the ranges at hva_nodes.
>>    */
>> -static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
>> -					     enum kvm_mr_change change)
>> +static void kvm_copy_memslot(struct kvm_memory_slot *dest,
>> +			     struct kvm_memory_slot *src)
>>   {
>> -	struct kvm_memslots *slots;
>> -	size_t old_size, new_size;
>> -	struct kvm_memory_slot *memslot;
>> +	dest->base_gfn = src->base_gfn;
>> +	dest->npages = src->npages;
>> +	dest->dirty_bitmap = src->dirty_bitmap;
>> +	dest->arch = src->arch;
>> +	dest->userspace_addr = src->userspace_addr;
>> +	dest->flags = src->flags;
>> +	dest->id = src->id;
>> +	dest->as_id = src->as_id;
>> +}
>>   
>> -	old_size = sizeof(struct kvm_memslots) +
>> -		   (sizeof(struct kvm_memory_slot) * old->used_slots);
>> +/*
>> + * Initializes the ranges at both hva_nodes from the memslot userspace_addr
>> + * and npages fields.
>> + */
>> +static void kvm_init_memslot_hva_ranges(struct kvm_memory_slot *slot)
>> +{
>> +	slot->hva_node[0].start = slot->hva_node[1].start =
>> +		slot->userspace_addr;
>> +	slot->hva_node[0].last = slot->hva_node[1].last =
>> +		slot->userspace_addr + (slot->npages << PAGE_SHIFT) - 1;
> 
> Fold this into kvm_copy_memslot().  It's always called immediately after, and
> technically the node range does come from the src, e.g. calling this without
> first calling kvm_copy_memslot() doesn't make sense.

Your patch already does that.

>> +}
>>   
>> -	if (change == KVM_MR_CREATE)
>> -		new_size = old_size + sizeof(struct kvm_memory_slot);
>> -	else
>> -		new_size = old_size;
>> +/*
>> + * Replaces the @oldslot with @nslot in the memslot set indicated by
>> + * @slots_idx.
>> + *
>> + * With NULL @oldslot this simply adds the @nslot to the set.
>> + * With NULL @nslot this simply removes the @oldslot from the set.
>> + *
>> + * If @nslot is non-NULL its hva_node[slots_idx] range has to be set
>> + * appropriately.
>> + */
>> +static void kvm_replace_memslot(struct kvm *kvm,
>> +				int as_id, int slots_idx,
> 
> Pass slots itself, then all three of these go away.

Your patch already does that.

>> +				struct kvm_memory_slot *oldslot,
>> +				struct kvm_memory_slot *nslot)
> 
> s/oldslot/old and s/nslot/new, again to make it easier to identify which is what,
> and for consistency.

Your patch already does that.

>> +{
>> +	struct kvm_memslots *slots = &kvm->memslots_all[as_id][slots_idx];
> 
> s/slots_idx/idx for consistency.

Your patch already does that.

>>   
>> -	slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
>> -	if (unlikely(!slots))
>> -		return NULL;
>> +	if (WARN_ON(!oldslot && !nslot))
> 
> This should be moved to kvm_set_memslot() in the form of:
> 
> 	if (change != KVM_MR_CREATE) {
> 		slot = id_to_memslot(active, old->id);
> 		if (WARN_ON_ONCE(!slot))
> 			return -EIO;
> 	}
>
> Adding a WARN that the caller doesn't pass "NULL, NULL" is unnecessary, and
> putting the WARN in this helper obfuscates the one case that warrants a guard.

Your patch already does both of these changes.

>> +		return;
>> +
>> +	if (oldslot) {
>> +		hash_del(&oldslot->id_node[slots_idx]);
>> +		interval_tree_remove(&oldslot->hva_node[slots_idx],
>> +				     &slots->hva_tree);
> 
> Unnecessary newline.

Your patch already removes it.

> 
>> +		atomic_long_cmpxchg(&slots->lru_slot,
>> +				    (unsigned long)oldslot,
>> +				    (unsigned long)nslot);
> 
> Can be:
> 
> 		atomic_long_cmpxchg(&slots->lru_slot,
> 				    (unsigned long)old, (unsigned long)new);

Your patch already does it.

> 
>> +		if (!nslot) {
>> +			rb_erase(&oldslot->gfn_node[slots_idx],
>> +				 &slots->gfn_tree);
> 
> Unnecessary newline.

Your patch already changes this into a single-line
kvm_memslot_gfn_erase() call.

> 
>> +			return;
>> +		}
>> +	}
>>   
>> -	memcpy(slots, old, old_size);
>> +	hash_add(slots->id_hash, &nslot->id_node[slots_idx],
>> +		 nslot->id);
>> +	WARN_ON(PAGE_SHIFT > 0 &&
> 
> Are there actually KVM architectures for which PAGE_SHIFT==0?

I don't think so; it's just future-proofing of the code.

>> +		nslot->hva_node[slots_idx].start >=
>> +		nslot->hva_node[slots_idx].last);
>> +	interval_tree_insert(&nslot->hva_node[slots_idx],
>> +			     &slots->hva_tree);
>>   
>> -	slots->hva_tree = RB_ROOT_CACHED;
>> -	hash_init(slots->id_hash);
>> -	kvm_for_each_memslot(memslot, slots) {
>> -		interval_tree_insert(&memslot->hva_node, &slots->hva_tree);
>> -		hash_add(slots->id_hash, &memslot->id_node, memslot->id);
>> +	/* Shame there is no O(1) interval_tree_replace()... */
>> +	if (oldslot && oldslot->base_gfn == nslot->base_gfn)
>> +		rb_replace_node(&oldslot->gfn_node[slots_idx],
>> +				&nslot->gfn_node[slots_idx],
>> +				&slots->gfn_tree);
> 
> Add wrappers for all the rb-tree mutators.  Partly for consistency, mostly for
> readability.  Having the node index in the memslots helps in this case.  E.g.
> 
> 	/* Shame there is no O(1) interval_tree_replace()... */
> 	if (old && old->base_gfn == new->base_gfn) {
> 		kvm_memslot_gfn_replace(slots, old, new);
> 	} else {
> 		if (old)
> 			kvm_memslot_gfn_erase(slots, old);
> 		kvm_memslot_gfn_insert(slots, new);
> 	}

Your patch already does it.

>> +	else {
>> +		if (oldslot)
>> +			rb_erase(&oldslot->gfn_node[slots_idx],
>> +				 &slots->gfn_tree);
>> +		kvm_memslot_gfn_insert(&slots->gfn_tree,
>> +				       nslot, slots_idx);
> 
> Unnecessary newlines.

Your patch already removes them.

>>   	}
>> +}
>> +
>> +/*
>> + * Copies the @oldslot data into @nslot and uses this slot to replace
>> + * @oldslot in the memslot set indicated by @slots_idx.
>> + */
>> +static void kvm_copy_replace_memslot(struct kvm *kvm,
> 
> I fiddled with this one, and I think it's best to drop this helper in favor of
> open coding the calls to kvm_copy_memslot() and kvm_replace_memslot().  More on
> this below.
> 
>> +				     int as_id, int slots_idx,
>> +				     struct kvm_memory_slot *oldslot,
>> +				     struct kvm_memory_slot *nslot)
>> +{
>> +	kvm_copy_memslot(nslot, oldslot);
>> +	kvm_init_memslot_hva_ranges(nslot);
>>   
>> -	return slots;
>> +	kvm_replace_memslot(kvm, as_id, slots_idx, oldslot, nslot);
>>   }
>>   
>>   static int kvm_set_memslot(struct kvm *kvm,
>> @@ -1397,56 +1279,178 @@ static int kvm_set_memslot(struct kvm *kvm,
>>   			   struct kvm_memory_slot *new, int as_id,
>>   			   enum kvm_mr_change change)
>>   {
>> -	struct kvm_memory_slot *slot;
>> -	struct kvm_memslots *slots;
>> +	struct kvm_memslots *slotsact = __kvm_memslots(kvm, as_id);
>> +	int idxact = kvm_memslots_idx(slotsact);
>> +	int idxina = idxact == 0 ? 1 : 0;
> 
> I strongly prefer using "active" and "inactive" for the memslots, dropping the
> local indices completely, and using "slot" and "tmp" for the slot.
> 
> "slot" and "tmp" aren't great, but slotsact vs. slotact is really, really hard
> to read.  And more importantly, slotact vs. slotina is misleading because they
> don't always point to slots in the active set and inactive set respectively.  That
> detail took me a while to fully grok.

Your patch already does these changes.
My comment on the "active" vs "inactive" slot terminology thing is below.

>> +	struct kvm_memslots *slotsina = &kvm->memslots_all[as_id][idxina];
> 
> To avoid local index variables, this can be the somewhat clever/gross:
> 
> 	struct kvm_memslots *inactive = &kvm->__memslots[as_id][active->node_idx ^ 1];

Or "active->node_idx == 0 ? 1 : 0" to make it more explicit.

>> +	struct kvm_memory_slot *slotina, *slotact;
>>   	int r;
>>   
>> -	slots = kvm_dup_memslots(__kvm_memslots(kvm, as_id), change);
>> -	if (!slots)
>> +	slotina = kzalloc(sizeof(*slotina), GFP_KERNEL_ACCOUNT);
>> +	if (!slotina)
>>   		return -ENOMEM;
>>   
>> +	if (change != KVM_MR_CREATE)
>> +		slotact = id_to_memslot(slotsact, old->id);
> 
> And the WARN, as mentioned above.

Your patch already does this.

>> +
>>   	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
>>   		/*
>> -		 * Note, the INVALID flag needs to be in the appropriate entry
>> -		 * in the freshly allocated memslots, not in @old or @new.
>> +		 * Replace the slot to be deleted or moved in the inactive
>> +		 * memslot set by its copy with KVM_MEMSLOT_INVALID flag set.
> 
> This is where the inactive slot vs. inactive memslot terminology gets really
> confusing.  "inactive memslots" refers precisely which set is not kvm->memslots,
> whereas "inactive slot" might refer to a slot that is in the inactive set, but
> it might also refer to a slot that is currently not in any tree at all.  E.g.

"Inactive slot" is a slot that is never in the active memslot set.
There is one exception to that in the second part of the KVM_MR_CREATE
branch, but that's easily fixable by adding a "swap(slotact, slotina)"
there, just as the other branches do (it's a consistency win, too).

The swap could be actually inside kvm_swap_active_memslots() and be
shared between all the branches - more about this later.

Also, KVM_MR_CREATE uses just one slot anyway, so technically it should
be called just a "slot" without any qualifier.

Conversely, an "active slot" is a slot that is in the active memslot set,
with the exception of the second part of the KVM_MR_DELETE branch, where the
slot should be called something like a "dead slot".

I have a bit of a mixed feeling here: these "slot" and "tmp" variable
names are even less descriptive ("tmp" in particular sounds totally
generic).

Additionally, the introduction of kvm_activate_memslot() removed the "slot"
and "tmp" swapping at memslot set swap time from the KVM_MR_MOVE,
KVM_MR_FLAGS_ONLY and KVM_MR_DELETE branches, while keeping it in the
code block that adds a temporary KVM_MEMSLOT_INVALID slot and in the
error cleanup code.

In terms of "slotact" and "slotina", the KVM_MR_CREATE case can be
trivially fixed, while the KVM_MR_DELETE branch can use either a
dedicated "slotdead" variable or a comment explaining what "slotina"
points to there.

At the same time, a comment describing their semantics can be added near
these variables so they are clear to anyone analyzing the code in the future.
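
For example, something along these lines (just a sketch, the exact
wording can of course be improved):

	/*
	 * "slot" points at the slot instance that is currently linked into
	 * the active memslot set (modulo the short windows described above),
	 * while "tmp" points at its scratch copy that is only ever linked
	 * into the inactive set, or into no set at all.
	 */
	struct kvm_memory_slot *slot, *tmp;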

> 
> 		/*
> 		 * Mark the current slot INVALID.  This must be done on the tmp
> 		 * slot to avoid modifying the current slot in the active tree.
> 		 */

Your patch already does this comment change.

>>   		 */
>> -		slot = id_to_memslot(slots, old->id);
>> -		slot->flags |= KVM_MEMSLOT_INVALID;
>> +		kvm_copy_replace_memslot(kvm, as_id, idxina, slotact, slotina);
>> +		slotina->flags |= KVM_MEMSLOT_INVALID;
> 
> This is where I'd prefer to open code the copy and replace.  Functionally it works,
> but setting the INVALID flag _after_ replacing the slot is not intuitive.  E.g.
> this sequencing helps the reader understand that it makes a copy
> 
> 		kvm_copy_memslot(tmp, slot);
> 		tmp->flags |= KVM_MEMSLOT_INVALID;
> 		kvm_replace_memslot(inactive, slot, tmp);
> 

Your patch already does this change.

>>   		/*
>> -		 * We can re-use the old memslots, the only difference from the
>> -		 * newly installed memslots is the invalid flag, which will get
>> -		 * dropped by update_memslots anyway.  We'll also revert to the
>> -		 * old memslots if preparing the new memory region fails.
>> +		 * Swap the active <-> inactive memslot set.
>> +		 * Now the active memslot set still contains the memslot to be
>> +		 * deleted or moved, but with the KVM_MEMSLOT_INVALID flag set.
>>   		 */
>> -		slots = install_new_memslots(kvm, as_id, slots);
>> +		swap_memslots(kvm, as_id);
> 
> kvm_swap_active_memslots() would be preferable, without context it's not clear
> what's being swapped.
> 
> To dedup code, and hopefully improve readability, I think it makes sense to do
> the swap() of the memslots in kvm_swap_active_memslots().  With the indices gone,
> the number of swap() calls goes down dramatically, e.g.
> 
> 
> 		/*
> 		 * Activate the slot that is now marked INVALID, but don't
> 		 * propagate the slot to the now inactive slots.  The slot is
> 		 * either going to be deleted or recreated as a new slot.
> 		*/
> 		kvm_swap_active_memslots(kvm, as_id, &active, &inactive);
> 
> 		/* The temporary and current slot have swapped roles. */
> 		swap(tmp, slot);

Your patch already does this change.

>> +		swap(idxact, idxina);
>> +		swap(slotsina, slotsact);
>> +		swap(slotact, slotina);
>>   
>> -		/* From this point no new shadow pages pointing to a deleted,
>> +		/*
>> +		 * From this point no new shadow pages pointing to a deleted,
>>   		 * or moved, memslot will be created.
>>   		 *
>>   		 * validation of sp->gfn happens in:
>>   		 *	- gfn_to_hva (kvm_read_guest, gfn_to_pfn)
>>   		 *	- kvm_is_visible_gfn (mmu_check_root)
>>   		 */
>> -		kvm_arch_flush_shadow_memslot(kvm, slot);
>> +		kvm_arch_flush_shadow_memslot(kvm, slotact);
>>   	}
>>   
>>   	r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
>>   	if (r)
>>   		goto out_slots;
> 
> Normally I like avoiding code churn, but in this case I think it makes sense to
> avoid the goto.  Even after the below if-else-elif block is trimmed down, the
> error handling that is specific to the above DELETE|MOVE ends up landing too far
> away from the code that it is reverting.  E.g.
> 
> 	r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
> 	if (r) {
> 		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
> 			/*
> 			 * Revert the above INVALID change.  No modifications
> 			 * required since the original slot was preserved in
> 			 * the inactive slots.
> 			 */
> 			kvm_swap_active_memslots(kvm, as_id, &active, &inactive);
> 			swap(tmp, slot);
> 		}
> 		kfree(tmp);
> 		return r;
> 	}
>

No problem; however, the error handler above also needs to re-add the
original memslot to the second memslot set, as the original error
handler (previously located at the end of the function) did:
> swap_memslots(kvm, as_id);
> swap(idxact, idxina);
> swap(slotsina, slotsact);
> swap(slotact, slotina);
> 
> kvm_replace_memslot(kvm, as_id, idxina, slotina, slotact);
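
So, in terms of your proposed naming, the error handler would then look
more or less like this (a rough sketch only):

	r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
	if (r) {
		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
			/*
			 * Revert the above INVALID change: make the original
			 * slot active again...
			 */
			kvm_swap_active_memslots(kvm, as_id, &active, &inactive);
			swap(tmp, slot);

			/*
			 * ...and replace the temporary INVALID slot with the
			 * original slot in the now inactive memslot set, too.
			 */
			kvm_replace_memslot(inactive, tmp, slot);
		}
		kfree(tmp);
		return r;
	}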


>> -	update_memslots(slots, new, change);
>> -	slots = install_new_memslots(kvm, as_id, slots);
>> +	if (change == KVM_MR_MOVE) {
>> +		/*
>> +		 * Since we are going to be changing the memslot gfn we need to
> 
> Try to use imperative mode and avoid I/we/us/etc....  It requires creative
> wording in some cases, but overall it does help avoid ambiguity (though "we" is
> fairly obvious in this case).

Your patch already rewrites this comment (will have this on mind for the
future comments).

> 
>> +		 * remove it from the gfn tree so it can be re-added there with
>> +		 * the updated gfn.
>> +		 */
>> +		rb_erase(&slotina->gfn_node[idxina],
>> +			 &slotsina->gfn_tree);
> 
> Unnecessary newline.

Your patch already removes it.

>> +
>> +		slotina->base_gfn = new->base_gfn;
>> +		slotina->flags = new->flags;
>> +		slotina->dirty_bitmap = new->dirty_bitmap;
>> +		/* kvm_arch_prepare_memory_region() might have modified arch */
>> +		slotina->arch = new->arch;
>> +
>> +		/* Re-add to the gfn tree with the updated gfn */
>> +		kvm_memslot_gfn_insert(&slotsina->gfn_tree,
>> +				       slotina, idxina);
> 
> Again, newline.

Your patch already removes it.

>> +
>> +		/*
>> +		 * Swap the active <-> inactive memslot set.
>> +		 * Now the active memslot set contains the new, final memslot.
>> +		 */
>> +		swap_memslots(kvm, as_id);
>> +		swap(idxact, idxina);
>> +		swap(slotsina, slotsact);
>> +		swap(slotact, slotina);
>> +
>> +		/*
>> +		 * Replace the temporary KVM_MEMSLOT_INVALID slot with the
>> +		 * new, final memslot in the inactive memslot set and
>> +		 * free the temporary memslot.
>> +		 */
>> +		kvm_replace_memslot(kvm, as_id, idxina, slotina, slotact);
>> +		kfree(slotina);
> 
> Adding a wrapper for the trifecta of swap + replace + kfree() cuts down on the
> boilerplate tremendously, and sidesteps having to swap() "slot" and "tmp" for
> these flows.   E.g. this can become:
> 
> 		/*
> 		 * The memslot's gfn is changing, remove it from the inactive
> 		 * tree, it will be re-added with its updated gfn.  Because its
> 		 * range is changing, an in-place replace is not possible.
> 		 */
> 		kvm_memslot_gfn_erase(inactive, tmp);
> 
> 		tmp->base_gfn = new->base_gfn;
> 		tmp->flags = new->flags;
> 		tmp->dirty_bitmap = new->dirty_bitmap;
> 		/* kvm_arch_prepare_memory_region() might have modified arch */
> 		tmp->arch = new->arch;
> 
> 		/* Re-add to the gfn tree with the updated gfn */
> 		kvm_memslot_gfn_insert(inactive, tmp);
> 
> 		/* Replace the current INVALID slot with the updated memslot. */
> 		kvm_activate_memslot(kvm, as_id, &active, &inactive, slot, tmp);
> 
> by adding:
> 
> static void kvm_activate_memslot(struct kvm *kvm, int as_id,
> 				 struct kvm_memslots **active,
> 				 struct kvm_memslots **inactive,
> 				 struct kvm_memory_slot *old,
> 				 struct kvm_memory_slot *new)
> {
> 	/*
> 	 * Swap the active <-> inactive memslots.  Note, this also swaps
> 	 * the active and inactive pointers themselves.
> 	 */
> 	kvm_swap_active_memslots(kvm, as_id, active, inactive);
> 
> 	/* Propagate the new memslot to the now inactive memslots. */
> 	kvm_replace_memslot(*inactive, old, new);
> 
> 	/* And free the old slot. */
> 	if (old)
> 		kfree(old);
> }

Generally, I agree, but as I wrote above, I think it would be good to
assign some semantics to the slot variables (however they are named).
Your refactoring actually helps with that, since there will then be only
a single place where both the memslot sets and the memslot pointers are
swapped: in kvm_swap_active_memslots().

I also think kvm_activate_memslot() would benefit from a good comment
describing what it actually does.

BTW, (k)free() can be safely called with a NULL pointer.
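
To illustrate the single swap point idea (a rough sketch only - the
actual publishing of the inactive set, that is the generation update,
rcu_assign_pointer() and the SRCU synchronization, stays as in your
proposal and is elided here):

	static void kvm_swap_active_memslots(struct kvm *kvm, int as_id,
					     struct kvm_memslots **active,
					     struct kvm_memslots **inactive,
					     struct kvm_memory_slot **slot,
					     struct kvm_memory_slot **tmp)
	{
		/* Publish the previously inactive memslot set (elided). */

		/*
		 * Both the memslot sets and the slot pointers swap their
		 * roles in this one place.
		 */
		swap(*active, *inactive);
		swap(*slot, *tmp);
	}

Every branch of kvm_set_memslot() (and the error path) could then just do:

	kvm_swap_active_memslots(kvm, as_id, &active, &inactive, &slot, &tmp);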

>> +	} else if (change == KVM_MR_FLAGS_ONLY) {
>> +		/*
>> +		 * Almost like the move case above, but we don't use a temporary
>> +		 * KVM_MEMSLOT_INVALID slot.
> 
> Let's use INVALID, it should be obvious to readers that make it his far :-)

All right :)
Your patch already does a similar change, so I guess it has the final
proposed wording.

>> +		 * Instead, we simply replace the old memslot with a new, updated
>> +		 * copy in both memslot sets.
>> +		 *
>> +		 * Since we aren't going to be changing the memslot gfn we can
>> +		 * simply use kvm_copy_replace_memslot(), which will use
>> +		 * rb_replace_node() to switch the memslot node in the gfn tree
>> +		 * instead of removing the old one and inserting the new one
>> +		 * as two separate operations.
>> +		 * It's a performance win since node replacement is a single
>> +		 * O(1) operation as opposed to two O(log(n)) operations for
>> +		 * slot removal and then re-insertion.
>> +		 */
>> +		kvm_copy_replace_memslot(kvm, as_id, idxina, slotact, slotina);
>> +		slotina->flags = new->flags;
>> +		slotina->dirty_bitmap = new->dirty_bitmap;
>> +		/* kvm_arch_prepare_memory_region() might have modified arch */
>> +		slotina->arch = new->arch;
>> +
>> +		/* Swap the active <-> inactive memslot set. */
>> +		swap_memslots(kvm, as_id);
>> +		swap(idxact, idxina);
>> +		swap(slotsina, slotsact);
>> +		swap(slotact, slotina);
>> +
>> +		/*
>> +		 * Replace the old memslot in the other memslot set and
>> +		 * then finally free it.
>> +		 */
>> +		kvm_replace_memslot(kvm, as_id, idxina, slotina, slotact);
>> +		kfree(slotina);
>> +	} else if (change == KVM_MR_CREATE) {
>> +		/*
>> +		 * Add the new memslot to the current inactive set as a copy
>> +		 * of the provided new memslot data.
>> +		 */
>> +		kvm_copy_memslot(slotina, new);
>> +		kvm_init_memslot_hva_ranges(slotina);
>> +
>> +		kvm_replace_memslot(kvm, as_id, idxina, NULL, slotina);
>> +
>> +		/* Swap the active <-> inactive memslot set. */
>> +		swap_memslots(kvm, as_id);
>> +		swap(idxact, idxina);
>> +		swap(slotsina, slotsact);
>> +
>> +		/* Now add it also to the other memslot set */
>> +		kvm_replace_memslot(kvm, as_id, idxina, NULL, slotina);
>> +	} else if (change == KVM_MR_DELETE) {
>> +		/*
>> +		 * Remove the old memslot from the current inactive set
>> +		 * (the other, active set contains the temporary
>> +		 * KVM_MEMSLOT_INVALID slot)
>> +		 */
>> +		kvm_replace_memslot(kvm, as_id, idxina, slotina, NULL);
>> +
>> +		/* Swap the active <-> inactive memslot set. */
>> +		swap_memslots(kvm, as_id);
>> +		swap(idxact, idxina);
>> +		swap(slotsina, slotsact);
>> +		swap(slotact, slotina);
>> +
>> +		/* Remove the temporary KVM_MEMSLOT_INVALID slot and free it. */
>> +		kvm_replace_memslot(kvm, as_id, idxina, slotina, NULL);
>> +		kfree(slotina);
>> +		/* slotact will be freed by kvm_free_memslot() */
> 
> I think this comment can go away in favor of documenting the kvm_free_memslot()
> call down below.  With all of the aforementioned changes:
> 
> 		/*
> 		 * Remove the old memslot (in the inactive memslots) and activate
> 		 * the NULL slot.
> 		*/
> 		kvm_replace_memslot(inactive, tmp, NULL);
> 		kvm_activate_memslot(kvm, as_id, &active, &inactive, slot, NULL);

Your patch already does this change; however, I will slightly change the
comment to say something like:
> Remove the old memslot (in the inactive memslots) by passing NULL as the new slot
since there isn't really any such thing as a NULL memslot.

>> +	} else
>> +		BUG();
>>   
>>   	kvm_arch_commit_memory_region(kvm, mem, old, new, change);
>>   
>> -	kvfree(slots);
>> +	if (change == KVM_MR_DELETE)
>> +		kvm_free_memslot(kvm, slotact);
>> +
>>   	return 0;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 7/8] KVM: Optimize gfn lookup in kvm_zap_gfn_range()
  2021-05-26 17:33   ` Sean Christopherson
@ 2021-06-01 20:25     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 25+ messages in thread
From: Maciej S. Szmigiero @ 2021-06-01 20:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Igor Mammedov, Marc Zyngier, James Morse, Julien Thierry,
	Suzuki K Poulose, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Christian Borntraeger, Janosch Frank,
	David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
	kvm, linux-kernel

On 26.05.2021 19:33, Sean Christopherson wrote:
> On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Introduce a memslots gfn upper bound operation and use it to optimize
>> kvm_zap_gfn_range().
>> This way this handler can do a quick lookup for intersecting gfns and won't
>> have to do a linear scan of the whole memslot set.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   arch/x86/kvm/mmu/mmu.c   | 41 ++++++++++++++++++++++++++++++++++++++--
>>   include/linux/kvm_host.h | 22 +++++++++++++++++++++
>>   2 files changed, 61 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 7222b552d139..f23398cf0316 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5490,14 +5490,51 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>>   	int i;
>>   	bool flush = false;
>>   
>> +	if (gfn_end == gfn_start || WARN_ON(gfn_end < gfn_start))
>> +		return;
>> +
>>   	write_lock(&kvm->mmu_lock);
>>   	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>> -		int ctr;
>> +		int idxactive;
>> +		struct rb_node *node;
>>   
>>   		slots = __kvm_memslots(kvm, i);
>> -		kvm_for_each_memslot(memslot, ctr, slots) {
>> +		idxactive = kvm_memslots_idx(slots);
>> +
>> +		/*
>> +		 * Find the slot with the lowest gfn that can possibly intersect with
>> +		 * the range, so we'll ideally have slot start <= range start
>> +		 */
>> +		node = kvm_memslots_gfn_upper_bound(slots, gfn_start);
>> +		if (node) {
>> +			struct rb_node *pnode;
>> +
>> +			/*
>> +			 * A NULL previous node means that the very first slot
>> +			 * already has a higher start gfn.
>> +			 * In this case slot start > range start.
>> +			 */
>> +			pnode = rb_prev(node);
>> +			if (pnode)
>> +				node = pnode;
>> +		} else {
>> +			/* a NULL node below means no slots */
>> +			node = rb_last(&slots->gfn_tree);
>> +		}
>> +
>> +		for ( ; node; node = rb_next(node)) {
>>   			gfn_t start, end;
> 
> Can this be abstracted into something like:
> 
> 		kvm_for_each_memslot_in_gfn_range(...) {
> 
> 		}
> 
> and share that implementation with kvm_check_memslot_overlap() in the next patch?
> 
> I really don't think arch code should be poking into gfn_tree, and ideally arch
> code wouldn't even be aware that gfn_tree exists.

That's a good idea, will do.
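
Something along these lines, perhaps (a rough sketch based on the
open-coded lookup above; the caller would still need to compute the
memslot from the returned node and break out of the loop once the slot
start gfn is at or past the end of the range):

	/*
	 * Find the node of the slot with the lowest gfn that can possibly
	 * intersect with a range starting at @start, so ideally
	 * slot start <= @start.
	 */
	static inline struct rb_node *kvm_memslots_gfn_first(struct kvm_memslots *slots,
							     gfn_t start)
	{
		struct rb_node *node = kvm_memslots_gfn_upper_bound(slots, start);

		if (node) {
			struct rb_node *pnode = rb_prev(node);

			/*
			 * A NULL previous node means that the very first
			 * slot already has a higher start gfn.
			 */
			if (pnode)
				node = pnode;
		} else {
			/* A NULL node below means no slots at all. */
			node = rb_last(&slots->gfn_tree);
		}

		return node;
	}

	#define kvm_for_each_memslot_in_gfn_range(node, slots, start)	\
		for (node = kvm_memslots_gfn_first(slots, start);	\
		     node;						\
		     node = rb_next(node))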

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/8] KVM: Integrate gfn_to_memslot_approx() into search_memslots()
  2021-05-19 21:24   ` Sean Christopherson
  2021-05-21  7:03     ` Maciej S. Szmigiero
@ 2021-06-10 16:17     ` Paolo Bonzini
  1 sibling, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2021-06-10 16:17 UTC (permalink / raw)
  To: Sean Christopherson, Maciej S. Szmigiero
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Igor Mammedov,
	Marc Zyngier, James Morse, Julien Thierry, Suzuki K Poulose,
	Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Christian Borntraeger, Janosch Frank, David Hildenbrand,
	Cornelia Huck, Claudio Imbrenda, Joerg Roedel, kvm, linux-kernel

On 19/05/21 23:24, Sean Christopherson wrote:
> An alternative to modifying the PPC code would be to make the existing
> search_memslots() a wrapper to __search_memslots(), with the latter taking
> @approx.

Let's just modify PPC to use __gfn_to_memslot instead of search_memslots().

__gfn_to_memslot() has never introduced any functionality over 
search_memslots(), ever since search_memslots() was introduced in 2011.

Paolo

> We might also want to make this __always_inline to improve the likelihood of the
> compiler optimizing away @approx.  I doubt it matters in practice...
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2021-06-10 16:17 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-16 21:44 [PATCH v3 0/8] KVM: Scalable memslots implementation Maciej S. Szmigiero
2021-05-16 21:44 ` [PATCH v3 1/8] KVM: x86: Cache total page count to avoid traversing the memslot array Maciej S. Szmigiero
2021-05-19 21:00   ` Sean Christopherson
2021-05-21  7:03     ` Maciej S. Szmigiero
2021-05-16 21:44 ` [PATCH v3 2/8] KVM: Integrate gfn_to_memslot_approx() into search_memslots() Maciej S. Szmigiero
2021-05-19 21:24   ` Sean Christopherson
2021-05-21  7:03     ` Maciej S. Szmigiero
2021-06-10 16:17     ` Paolo Bonzini
2021-05-16 21:44 ` [PATCH v3 3/8] KVM: Resolve memslot ID via a hash table instead of via a static array Maciej S. Szmigiero
2021-05-19 22:31   ` Sean Christopherson
2021-05-21  7:05     ` Maciej S. Szmigiero
2021-05-22 11:11       ` Maciej S. Szmigiero
2021-05-16 21:44 ` [PATCH v3 4/8] KVM: Introduce memslots hva tree Maciej S. Szmigiero
2021-05-19 23:07   ` Sean Christopherson
2021-05-21  7:06     ` Maciej S. Szmigiero
2021-05-16 21:44 ` [PATCH v3 5/8] KVM: s390: Introduce kvm_s390_get_gfn_end() Maciej S. Szmigiero
2021-05-16 21:44 ` [PATCH v3 6/8] KVM: Keep memslots in tree-based structures instead of array-based ones Maciej S. Szmigiero
2021-05-19 23:10   ` Sean Christopherson
2021-05-21  7:06     ` Maciej S. Szmigiero
2021-05-25 23:21   ` Sean Christopherson
2021-06-01 20:24     ` Maciej S. Szmigiero
2021-05-16 21:44 ` [PATCH v3 7/8] KVM: Optimize gfn lookup in kvm_zap_gfn_range() Maciej S. Szmigiero
2021-05-26 17:33   ` Sean Christopherson
2021-06-01 20:25     ` Maciej S. Szmigiero
2021-05-16 21:44 ` [PATCH v3 8/8] KVM: Optimize overlapping memslots check Maciej S. Szmigiero
