* [PATCH 0/2] Faster MMU lookups for Book3s v3
@ 2010-06-30 13:18 Alexander Graf
  2010-06-30 13:18 ` [PATCH 1/2] KVM: PPC: Add generic hpte management functions Alexander Graf
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Alexander Graf @ 2010-06-30 13:18 UTC (permalink / raw)
  To: kvm-ppc; +Cc: linuxppc-dev, KVM list

Book3s has suffered from my really bad shadow MMU implementation so far, so
I finally got around to implementing a combined hash and list mechanism that
allows for much faster lookup of mapped pages.
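
The lookup now boils down to hashing into a bucket and walking a short
hlist.  For example, the per-page flush path, condensed from
kvmppc_mmu_pte_flush_page() in patch 1:

	list = &vcpu->arch.hpte_hash_pte[kvmppc_mmu_hash_pte(guest_ea)];
	hlist_for_each_entry_safe(pte, node, tmp, list, list_pte)
		if ((pte->pte.eaddr & ~0xfffUL) == guest_ea)
			invalidate_pte(vcpu, pte);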

To show that it really is faster, I tried to run simple process spawning
code inside the guest with and without these patches:

[without]

debian-powerpc:~# time for i in {1..1000}; do /bin/echo hello > /dev/null; done

real    0m20.235s
user    0m10.418s
sys     0m9.766s

[with]

debian-powerpc:~# time for i in {1..1000}; do /bin/echo hello > /dev/null; done

real    0m14.659s
user    0m8.967s
sys     0m5.688s

So as you can see, performance improved significantly.

v2 -> v3:

  - use hlist
  - use global slab cache

Alexander Graf (2):
  KVM: PPC: Add generic hpte management functions
  KVM: PPC: Make use of hash based Shadow MMU

 arch/powerpc/include/asm/kvm_book3s.h |    9 +
 arch/powerpc/include/asm/kvm_host.h   |   17 ++-
 arch/powerpc/kvm/Makefile             |    2 +
 arch/powerpc/kvm/book3s.c             |   14 ++-
 arch/powerpc/kvm/book3s_32_mmu_host.c |  104 ++-----------
 arch/powerpc/kvm/book3s_64_mmu_host.c |   98 +-----------
 arch/powerpc/kvm/book3s_mmu_hpte.c    |  277 +++++++++++++++++++++++++++++++++
 7 files changed, 331 insertions(+), 190 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_mmu_hpte.c


* [PATCH 1/2] KVM: PPC: Add generic hpte management functions
  2010-06-30 13:18 [PATCH 0/2] Faster MMU lookups for Book3s v3 Alexander Graf
@ 2010-06-30 13:18 ` Alexander Graf
  2010-06-30 13:18 ` [PATCH 2/2] KVM: PPC: Make use of hash based Shadow MMU Alexander Graf
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Alexander Graf @ 2010-06-30 13:18 UTC (permalink / raw)
  To: kvm-ppc; +Cc: linuxppc-dev, KVM list

Currently the shadow paging code keeps an array of entries it knows about.
Whenever the guest invalidates an entry, we loop through the whole array,
trying to invalidate matching entries.

While this is a really simple implementation, it is probably the most
inefficient one possible. So instead, let's keep arrays of hlists around
that are indexed by a hash. This way each shadow PTE can be added with
three hlist_add_head calls, removed with three hlist_del calls, and a
lookup only needs to walk the entries that share the same hash.

This patch implements said lookup and exports generic functions that both
the 32-bit and 64-bit backends can use.
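
For illustration, this is roughly how a backend is expected to use the
exported helpers.  The function below is a hypothetical stand-in for the
map paths in the 32-bit/64-bit backends, not code from this patch:

	static void example_map_shadow_pte(struct kvm_vcpu *vcpu,
					   struct kvmppc_pte *orig_pte,
					   ulong hpaddr)
	{
		struct hpte_cache *pte = kvmppc_mmu_hpte_cache_next(vcpu);

		pte->pte = *orig_pte;			/* guest translation data */
		pte->pfn = hpaddr >> PAGE_SHIFT;	/* backing host page frame */
		/* ... backend specific: enter the mapping into the host HTAB ... */

		kvmppc_mmu_hpte_cache_map(vcpu, pte);	/* hook into the hash lists */
	}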

Signed-off-by: Alexander Graf <agraf@suse.de>

---

v1 -> v2:

  - remove hpte_all list
  - lookup all using vpte_long lists
  - decrease size of vpte_long hash
  - fix missing brackets

v2 -> v3:

  - use hlist
  - use global kmem cache
---
 arch/powerpc/kvm/book3s_mmu_hpte.c |  277 ++++++++++++++++++++++++++++++++++++
 1 files changed, 277 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_mmu_hpte.c

diff --git a/arch/powerpc/kvm/book3s_mmu_hpte.c b/arch/powerpc/kvm/book3s_mmu_hpte.c
new file mode 100644
index 0000000..4868d4a
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_mmu_hpte.c
@@ -0,0 +1,277 @@
+/*
+ * Copyright (C) 2010 SUSE Linux Products GmbH. All rights reserved.
+ *
+ * Authors:
+ *     Alexander Graf <agraf@suse.de>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/hash.h>
+#include <linux/slab.h>
+
+#include <asm/kvm_ppc.h>
+#include <asm/kvm_book3s.h>
+#include <asm/machdep.h>
+#include <asm/mmu_context.h>
+#include <asm/hw_irq.h>
+
+#define PTE_SIZE	12
+
+/* #define DEBUG_MMU */
+
+#ifdef DEBUG_MMU
+#define dprintk_mmu(a, ...) printk(KERN_INFO a, __VA_ARGS__)
+#else
+#define dprintk_mmu(a, ...) do { } while(0)
+#endif
+
+static struct kmem_cache *hpte_cache;
+
+static inline u64 kvmppc_mmu_hash_pte(u64 eaddr)
+{
+	return hash_64(eaddr >> PTE_SIZE, HPTEG_HASH_BITS_PTE);
+}
+
+static inline u64 kvmppc_mmu_hash_vpte(u64 vpage)
+{
+	return hash_64(vpage & 0xfffffffffULL, HPTEG_HASH_BITS_VPTE);
+}
+
+static inline u64 kvmppc_mmu_hash_vpte_long(u64 vpage)
+{
+	return hash_64((vpage & 0xffffff000ULL) >> 12,
+		       HPTEG_HASH_BITS_VPTE_LONG);
+}
+
+void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache *pte)
+{
+	u64 index;
+
+	/* Add to ePTE list */
+	index = kvmppc_mmu_hash_pte(pte->pte.eaddr);
+	hlist_add_head(&pte->list_pte, &vcpu->arch.hpte_hash_pte[index]);
+
+	/* Add to vPTE list */
+	index = kvmppc_mmu_hash_vpte(pte->pte.vpage);
+	hlist_add_head(&pte->list_vpte, &vcpu->arch.hpte_hash_vpte[index]);
+
+	/* Add to vPTE_long list */
+	index = kvmppc_mmu_hash_vpte_long(pte->pte.vpage);
+	hlist_add_head(&pte->list_vpte_long,
+		       &vcpu->arch.hpte_hash_vpte_long[index]);
+}
+
+static void invalidate_pte(struct kvm_vcpu *vcpu, struct hpte_cache *pte)
+{
+	dprintk_mmu("KVM: Flushing SPT: 0x%lx (0x%llx) -> 0x%llx\n",
+		    pte->pte.eaddr, pte->pte.vpage, pte->host_va);
+
+	/* Different for 32 and 64 bit */
+	kvmppc_mmu_invalidate_pte(vcpu, pte);
+
+	if (pte->pte.may_write)
+		kvm_release_pfn_dirty(pte->pfn);
+	else
+		kvm_release_pfn_clean(pte->pfn);
+
+	hlist_del(&pte->list_pte);
+	hlist_del(&pte->list_vpte);
+	hlist_del(&pte->list_vpte_long);
+
+	vcpu->arch.hpte_cache_count--;
+	kmem_cache_free(hpte_cache, pte);
+}
+
+static void kvmppc_mmu_pte_flush_all(struct kvm_vcpu *vcpu)
+{
+	struct hpte_cache *pte;
+	struct hlist_node *node, *tmp;
+	int i;
+
+	for (i = 0; i < HPTEG_HASH_NUM_VPTE_LONG; i++) {
+		struct hlist_head *list = &vcpu->arch.hpte_hash_vpte_long[i];
+
+		hlist_for_each_entry_safe(pte, node, tmp, list, list_vpte_long)
+			invalidate_pte(vcpu, pte);
+	}
+}
+
+static void kvmppc_mmu_pte_flush_page(struct kvm_vcpu *vcpu, ulong guest_ea)
+{
+	struct hlist_head *list;
+	struct hlist_node *node, *tmp;
+	struct hpte_cache *pte;
+
+	/* Find the list of entries in the map */
+	list = &vcpu->arch.hpte_hash_pte[kvmppc_mmu_hash_pte(guest_ea)];
+
+	/* Check the list for matching entries and invalidate */
+	hlist_for_each_entry_safe(pte, node, tmp, list, list_pte)
+		if ((pte->pte.eaddr & ~0xfffUL) == guest_ea)
+			invalidate_pte(vcpu, pte);
+}
+
+void kvmppc_mmu_pte_flush(struct kvm_vcpu *vcpu, ulong guest_ea, ulong ea_mask)
+{
+	u64 i;
+
+	dprintk_mmu("KVM: Flushing %d Shadow PTEs: 0x%lx & 0x%lx\n",
+		    vcpu->arch.hpte_cache_count, guest_ea, ea_mask);
+
+	guest_ea &= ea_mask;
+
+	switch (ea_mask) {
+	case ~0xfffUL:
+		kvmppc_mmu_pte_flush_page(vcpu, guest_ea);
+		break;
+	case 0x0ffff000:
+		/* 32-bit flush w/o segment, go through all possible segments */
+		for (i = 0; i < 0x100000000ULL; i += 0x10000000ULL)
+			kvmppc_mmu_pte_flush(vcpu, guest_ea | i, ~0xfffUL);
+		break;
+	case 0:
+		/* Doing a complete flush -> start from scratch */
+		kvmppc_mmu_pte_flush_all(vcpu);
+		break;
+	default:
+		WARN_ON(1);
+		break;
+	}
+}
+
+/* Flush with mask 0xfffffffff */
+static void kvmppc_mmu_pte_vflush_short(struct kvm_vcpu *vcpu, u64 guest_vp)
+{
+	struct hlist_head *list;
+	struct hlist_node *node, *tmp;
+	struct hpte_cache *pte;
+	u64 vp_mask = 0xfffffffffULL;
+
+	list = &vcpu->arch.hpte_hash_vpte[kvmppc_mmu_hash_vpte(guest_vp)];
+
+	/* Check the list for matching entries and invalidate */
+	hlist_for_each_entry_safe(pte, node, tmp, list, list_vpte)
+		if ((pte->pte.vpage & vp_mask) == guest_vp)
+			invalidate_pte(vcpu, pte);
+}
+
+/* Flush with mask 0xffffff000 */
+static void kvmppc_mmu_pte_vflush_long(struct kvm_vcpu *vcpu, u64 guest_vp)
+{
+	struct hlist_head *list;
+	struct hlist_node *node, *tmp;
+	struct hpte_cache *pte;
+	u64 vp_mask = 0xffffff000ULL;
+
+	list = &vcpu->arch.hpte_hash_vpte_long[
+		kvmppc_mmu_hash_vpte_long(guest_vp)];
+
+	/* Check the list for matching entries and invalidate */
+	hlist_for_each_entry_safe(pte, node, tmp, list, list_vpte_long)
+		if ((pte->pte.vpage & vp_mask) == guest_vp)
+			invalidate_pte(vcpu, pte);
+}
+
+void kvmppc_mmu_pte_vflush(struct kvm_vcpu *vcpu, u64 guest_vp, u64 vp_mask)
+{
+	dprintk_mmu("KVM: Flushing %d Shadow vPTEs: 0x%llx & 0x%llx\n",
+		    vcpu->arch.hpte_cache_count, guest_vp, vp_mask);
+	guest_vp &= vp_mask;
+
+	switch(vp_mask) {
+	case 0xfffffffffULL:
+		kvmppc_mmu_pte_vflush_short(vcpu, guest_vp);
+		break;
+	case 0xffffff000ULL:
+		kvmppc_mmu_pte_vflush_long(vcpu, guest_vp);
+		break;
+	default:
+		WARN_ON(1);
+		return;
+	}
+}
+
+void kvmppc_mmu_pte_pflush(struct kvm_vcpu *vcpu, ulong pa_start, ulong pa_end)
+{
+	struct hlist_node *node, *tmp;
+	struct hpte_cache *pte;
+	int i;
+
+	dprintk_mmu("KVM: Flushing %d Shadow pPTEs: 0x%lx - 0x%lx\n",
+		    vcpu->arch.hpte_cache_count, pa_start, pa_end);
+
+	for (i = 0; i < HPTEG_HASH_NUM_VPTE_LONG; i++) {
+		struct hlist_head *list = &vcpu->arch.hpte_hash_vpte_long[i];
+
+		hlist_for_each_entry_safe(pte, node, tmp, list, list_vpte_long)
+			if ((pte->pte.raddr >= pa_start) &&
+			    (pte->pte.raddr < pa_end))
+				invalidate_pte(vcpu, pte);
+	}
+}
+
+struct hpte_cache *kvmppc_mmu_hpte_cache_next(struct kvm_vcpu *vcpu)
+{
+	struct hpte_cache *pte;
+
+	pte = kmem_cache_zalloc(hpte_cache, GFP_KERNEL);
+	vcpu->arch.hpte_cache_count++;
+
+	if (vcpu->arch.hpte_cache_count == HPTEG_CACHE_NUM)
+		kvmppc_mmu_pte_flush_all(vcpu);
+
+	return pte;
+}
+
+void kvmppc_mmu_hpte_destroy(struct kvm_vcpu *vcpu)
+{
+	kvmppc_mmu_pte_flush(vcpu, 0, 0);
+}
+
+static void kvmppc_mmu_hpte_init_hash(struct hlist_head *hash_list, int len)
+{
+	int i;
+
+	for (i = 0; i < len; i++)
+		INIT_HLIST_HEAD(&hash_list[i]);
+}
+
+int kvmppc_mmu_hpte_init(struct kvm_vcpu *vcpu)
+{
+	/* init hpte lookup hashes */
+	kvmppc_mmu_hpte_init_hash(vcpu->arch.hpte_hash_pte,
+				  ARRAY_SIZE(vcpu->arch.hpte_hash_pte));
+	kvmppc_mmu_hpte_init_hash(vcpu->arch.hpte_hash_vpte,
+				  ARRAY_SIZE(vcpu->arch.hpte_hash_vpte));
+	kvmppc_mmu_hpte_init_hash(vcpu->arch.hpte_hash_vpte_long,
+				  ARRAY_SIZE(vcpu->arch.hpte_hash_vpte_long));
+
+	return 0;
+}
+
+int kvmppc_mmu_hpte_sysinit(void)
+{
+	/* init hpte slab cache */
+	hpte_cache = kmem_cache_create("kvm-spt", sizeof(struct hpte_cache),
+				       sizeof(struct hpte_cache), 0, NULL);
+
+	return 0;
+}
+
+void kvmppc_mmu_hpte_sysexit(void)
+{
+	kmem_cache_destroy(hpte_cache);
+}
-- 
1.6.0.2


* [PATCH 2/2] KVM: PPC: Make use of hash based Shadow MMU
  2010-06-30 13:18 [PATCH 0/2] Faster MMU lookups for Book3s v3 Alexander Graf
  2010-06-30 13:18 ` [PATCH 1/2] KVM: PPC: Add generic hpte management functions Alexander Graf
@ 2010-06-30 13:18 ` Alexander Graf
  2010-07-01  7:29 ` [PATCH 0/2] Faster MMU lookups for Book3s v3 Avi Kivity
  2010-07-01 15:40 ` Marcelo Tosatti
  3 siblings, 0 replies; 15+ messages in thread
From: Alexander Graf @ 2010-06-30 13:18 UTC (permalink / raw)
  To: kvm-ppc; +Cc: linuxppc-dev, KVM list

We just introduced generic functions to handle shadow pages on PPC.
This patch makes the respective backends use them, getting rid of a
lot of duplicate code along the way.
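
To give an idea of what remains per backend: after this patch each backend
essentially only provides the low-level invalidation hook, e.g. for 64-bit
(shape taken from the hunk below):

	void kvmppc_mmu_invalidate_pte(struct kvm_vcpu *vcpu,
				       struct hpte_cache *pte)
	{
		/* Drop the entry from the host HTAB */
		ppc_md.hpte_invalidate(pte->slot, pte->host_va,
				       MMU_PAGE_4K, MMU_SEGSIZE_256M, false);
	}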

Signed-off-by: Alexander Graf <agraf@suse.de>

---

v2 -> v3:

  - use hlist
  - use global kmem cache
---
 arch/powerpc/include/asm/kvm_book3s.h |    9 +++
 arch/powerpc/include/asm/kvm_host.h   |   17 +++++-
 arch/powerpc/kvm/Makefile             |    2 +
 arch/powerpc/kvm/book3s.c             |   14 ++++-
 arch/powerpc/kvm/book3s_32_mmu_host.c |  104 +++-----------------------------
 arch/powerpc/kvm/book3s_64_mmu_host.c |   98 ++----------------------------
 6 files changed, 54 insertions(+), 190 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 4e99559..8274a2d 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -115,6 +115,15 @@ extern void kvmppc_mmu_book3s_32_init(struct kvm_vcpu *vcpu);
 extern int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *pte);
 extern int kvmppc_mmu_map_segment(struct kvm_vcpu *vcpu, ulong eaddr);
 extern void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu);
+
+extern void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache *pte);
+extern struct hpte_cache *kvmppc_mmu_hpte_cache_next(struct kvm_vcpu *vcpu);
+extern void kvmppc_mmu_hpte_destroy(struct kvm_vcpu *vcpu);
+extern int kvmppc_mmu_hpte_init(struct kvm_vcpu *vcpu);
+extern void kvmppc_mmu_invalidate_pte(struct kvm_vcpu *vcpu, struct hpte_cache *pte);
+extern int kvmppc_mmu_hpte_sysinit(void);
+extern void kvmppc_mmu_hpte_sysexit(void);
+
 extern int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr, bool data);
 extern int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr, bool data);
 extern void kvmppc_book3s_queue_irqprio(struct kvm_vcpu *vcpu, unsigned int vec);
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 0c9ad86..e004eaf 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -38,7 +38,13 @@
 #define KVM_NR_PAGE_SIZES	1
 #define KVM_PAGES_PER_HPAGE(x)	(1UL<<31)
 
-#define HPTEG_CACHE_NUM 1024
+#define HPTEG_CACHE_NUM			(1 << 15)
+#define HPTEG_HASH_BITS_PTE		13
+#define HPTEG_HASH_BITS_VPTE		13
+#define HPTEG_HASH_BITS_VPTE_LONG	5
+#define HPTEG_HASH_NUM_PTE		(1 << HPTEG_HASH_BITS_PTE)
+#define HPTEG_HASH_NUM_VPTE		(1 << HPTEG_HASH_BITS_VPTE)
+#define HPTEG_HASH_NUM_VPTE_LONG	(1 << HPTEG_HASH_BITS_VPTE_LONG)
 
 struct kvm;
 struct kvm_run;
@@ -151,6 +157,9 @@ struct kvmppc_mmu {
 };
 
 struct hpte_cache {
+	struct hlist_node list_pte;
+	struct hlist_node list_vpte;
+	struct hlist_node list_vpte_long;
 	u64 host_va;
 	u64 pfn;
 	ulong slot;
@@ -282,8 +291,10 @@ struct kvm_vcpu_arch {
 	unsigned long pending_exceptions;
 
 #ifdef CONFIG_PPC_BOOK3S
-	struct hpte_cache hpte_cache[HPTEG_CACHE_NUM];
-	int hpte_cache_offset;
+	struct hlist_head hpte_hash_pte[HPTEG_HASH_NUM_PTE];
+	struct hlist_head hpte_hash_vpte[HPTEG_HASH_NUM_VPTE];
+	struct hlist_head hpte_hash_vpte_long[HPTEG_HASH_NUM_VPTE_LONG];
+	int hpte_cache_count;
 #endif
 };
 
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index ff43606..d45c818 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -45,6 +45,7 @@ kvm-book3s_64-objs := \
 	book3s.o \
 	book3s_emulate.o \
 	book3s_interrupts.o \
+	book3s_mmu_hpte.o \
 	book3s_64_mmu_host.o \
 	book3s_64_mmu.o \
 	book3s_32_mmu.o
@@ -57,6 +58,7 @@ kvm-book3s_32-objs := \
 	book3s.o \
 	book3s_emulate.o \
 	book3s_interrupts.o \
+	book3s_mmu_hpte.o \
 	book3s_32_mmu_host.o \
 	book3s_32_mmu.o
 kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 884d4a5..30c0bd5 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -1389,12 +1389,22 @@ int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 
 static int kvmppc_book3s_init(void)
 {
-	return kvm_init(NULL, sizeof(struct kvmppc_vcpu_book3s), 0,
-			THIS_MODULE);
+	int r;
+
+	r = kvm_init(NULL, sizeof(struct kvmppc_vcpu_book3s), 0,
+		     THIS_MODULE);
+
+	if (r)
+		return r;
+
+	r = kvmppc_mmu_hpte_sysinit();
+
+	return r;
 }
 
 static void kvmppc_book3s_exit(void)
 {
+	kvmppc_mmu_hpte_sysexit();
 	kvm_exit();
 }
 
diff --git a/arch/powerpc/kvm/book3s_32_mmu_host.c b/arch/powerpc/kvm/book3s_32_mmu_host.c
index 904f5ac..0b51ef8 100644
--- a/arch/powerpc/kvm/book3s_32_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_32_mmu_host.c
@@ -58,105 +58,19 @@
 static ulong htab;
 static u32 htabmask;
 
-static void invalidate_pte(struct kvm_vcpu *vcpu, struct hpte_cache *pte)
+void kvmppc_mmu_invalidate_pte(struct kvm_vcpu *vcpu, struct hpte_cache *pte)
 {
 	volatile u32 *pteg;
 
-	dprintk_mmu("KVM: Flushing SPTE: 0x%llx (0x%llx) -> 0x%llx\n",
-		    pte->pte.eaddr, pte->pte.vpage, pte->host_va);
-
+	/* Remove from host HTAB */
 	pteg = (u32*)pte->slot;
-
 	pteg[0] = 0;
+
+	/* And make sure it's gone from the TLB too */
 	asm volatile ("sync");
 	asm volatile ("tlbie %0" : : "r" (pte->pte.eaddr) : "memory");
 	asm volatile ("sync");
 	asm volatile ("tlbsync");
-
-	pte->host_va = 0;
-
-	if (pte->pte.may_write)
-		kvm_release_pfn_dirty(pte->pfn);
-	else
-		kvm_release_pfn_clean(pte->pfn);
-}
-
-void kvmppc_mmu_pte_flush(struct kvm_vcpu *vcpu, ulong guest_ea, ulong ea_mask)
-{
-	int i;
-
-	dprintk_mmu("KVM: Flushing %d Shadow PTEs: 0x%x & 0x%x\n",
-		    vcpu->arch.hpte_cache_offset, guest_ea, ea_mask);
-	BUG_ON(vcpu->arch.hpte_cache_offset > HPTEG_CACHE_NUM);
-
-	guest_ea &= ea_mask;
-	for (i = 0; i < vcpu->arch.hpte_cache_offset; i++) {
-		struct hpte_cache *pte;
-
-		pte = &vcpu->arch.hpte_cache[i];
-		if (!pte->host_va)
-			continue;
-
-		if ((pte->pte.eaddr & ea_mask) == guest_ea) {
-			invalidate_pte(vcpu, pte);
-		}
-	}
-
-	/* Doing a complete flush -> start from scratch */
-	if (!ea_mask)
-		vcpu->arch.hpte_cache_offset = 0;
-}
-
-void kvmppc_mmu_pte_vflush(struct kvm_vcpu *vcpu, u64 guest_vp, u64 vp_mask)
-{
-	int i;
-
-	dprintk_mmu("KVM: Flushing %d Shadow vPTEs: 0x%llx & 0x%llx\n",
-		    vcpu->arch.hpte_cache_offset, guest_vp, vp_mask);
-	BUG_ON(vcpu->arch.hpte_cache_offset > HPTEG_CACHE_NUM);
-
-	guest_vp &= vp_mask;
-	for (i = 0; i < vcpu->arch.hpte_cache_offset; i++) {
-		struct hpte_cache *pte;
-
-		pte = &vcpu->arch.hpte_cache[i];
-		if (!pte->host_va)
-			continue;
-
-		if ((pte->pte.vpage & vp_mask) == guest_vp) {
-			invalidate_pte(vcpu, pte);
-		}
-	}
-}
-
-void kvmppc_mmu_pte_pflush(struct kvm_vcpu *vcpu, ulong pa_start, ulong pa_end)
-{
-	int i;
-
-	dprintk_mmu("KVM: Flushing %d Shadow pPTEs: 0x%llx & 0x%llx\n",
-		    vcpu->arch.hpte_cache_offset, pa_start, pa_end);
-	BUG_ON(vcpu->arch.hpte_cache_offset > HPTEG_CACHE_NUM);
-
-	for (i = 0; i < vcpu->arch.hpte_cache_offset; i++) {
-		struct hpte_cache *pte;
-
-		pte = &vcpu->arch.hpte_cache[i];
-		if (!pte->host_va)
-			continue;
-
-		if ((pte->pte.raddr >= pa_start) &&
-		    (pte->pte.raddr < pa_end)) {
-			invalidate_pte(vcpu, pte);
-		}
-	}
-}
-
-static int kvmppc_mmu_hpte_cache_next(struct kvm_vcpu *vcpu)
-{
-	if (vcpu->arch.hpte_cache_offset == HPTEG_CACHE_NUM)
-		kvmppc_mmu_pte_flush(vcpu, 0, 0);
-
-	return vcpu->arch.hpte_cache_offset++;
 }
 
 /* We keep 512 gvsid->hvsid entries, mapping the guest ones to the array using
@@ -230,7 +144,6 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte)
 	register int rr = 0;
 	bool primary = false;
 	bool evict = false;
-	int hpte_id;
 	struct hpte_cache *pte;
 
 	/* Get host physical address for gpa */
@@ -315,8 +228,7 @@ next_pteg:
 
 	/* Now tell our Shadow PTE code about the new page */
 
-	hpte_id = kvmppc_mmu_hpte_cache_next(vcpu);
-	pte = &vcpu->arch.hpte_cache[hpte_id];
+	pte = kvmppc_mmu_hpte_cache_next(vcpu);
 
 	dprintk_mmu("KVM: %c%c Map 0x%llx: [%lx] 0x%llx (0x%llx) -> %lx\n",
 		    orig_pte->may_write ? 'w' : '-',
@@ -329,6 +241,8 @@ next_pteg:
 	pte->pte = *orig_pte;
 	pte->pfn = hpaddr >> PAGE_SHIFT;
 
+	kvmppc_mmu_hpte_cache_map(vcpu, pte);
+
 	return 0;
 }
 
@@ -413,7 +327,7 @@ void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu)
 
 void kvmppc_mmu_destroy(struct kvm_vcpu *vcpu)
 {
-	kvmppc_mmu_pte_flush(vcpu, 0, 0);
+	kvmppc_mmu_hpte_destroy(vcpu);
 	preempt_disable();
 	__destroy_context(to_book3s(vcpu)->context_id);
 	preempt_enable();
@@ -453,5 +367,7 @@ int kvmppc_mmu_init(struct kvm_vcpu *vcpu)
 	htabmask = ((sdr1 & 0x1FF) << 16) | 0xFFC0;
 	htab = (ulong)__va(sdr1 & 0xffff0000);
 
+	kvmppc_mmu_hpte_init(vcpu);
+
 	return 0;
 }
diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c
index 4ccdde1..384179a 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -47,98 +47,11 @@
 #define dprintk_slb(a, ...) do { } while(0)
 #endif
 
-static void invalidate_pte(struct hpte_cache *pte)
+void kvmppc_mmu_invalidate_pte(struct kvm_vcpu *vcpu, struct hpte_cache *pte)
 {
-	dprintk_mmu("KVM: Flushing SPT: 0x%lx (0x%llx) -> 0x%llx\n",
-		    pte->pte.eaddr, pte->pte.vpage, pte->host_va);
-
 	ppc_md.hpte_invalidate(pte->slot, pte->host_va,
 			       MMU_PAGE_4K, MMU_SEGSIZE_256M,
 			       false);
-	pte->host_va = 0;
-
-	if (pte->pte.may_write)
-		kvm_release_pfn_dirty(pte->pfn);
-	else
-		kvm_release_pfn_clean(pte->pfn);
-}
-
-void kvmppc_mmu_pte_flush(struct kvm_vcpu *vcpu, ulong guest_ea, ulong ea_mask)
-{
-	int i;
-
-	dprintk_mmu("KVM: Flushing %d Shadow PTEs: 0x%lx & 0x%lx\n",
-		    vcpu->arch.hpte_cache_offset, guest_ea, ea_mask);
-	BUG_ON(vcpu->arch.hpte_cache_offset > HPTEG_CACHE_NUM);
-
-	guest_ea &= ea_mask;
-	for (i = 0; i < vcpu->arch.hpte_cache_offset; i++) {
-		struct hpte_cache *pte;
-
-		pte = &vcpu->arch.hpte_cache[i];
-		if (!pte->host_va)
-			continue;
-
-		if ((pte->pte.eaddr & ea_mask) == guest_ea) {
-			invalidate_pte(pte);
-		}
-	}
-
-	/* Doing a complete flush -> start from scratch */
-	if (!ea_mask)
-		vcpu->arch.hpte_cache_offset = 0;
-}
-
-void kvmppc_mmu_pte_vflush(struct kvm_vcpu *vcpu, u64 guest_vp, u64 vp_mask)
-{
-	int i;
-
-	dprintk_mmu("KVM: Flushing %d Shadow vPTEs: 0x%llx & 0x%llx\n",
-		    vcpu->arch.hpte_cache_offset, guest_vp, vp_mask);
-	BUG_ON(vcpu->arch.hpte_cache_offset > HPTEG_CACHE_NUM);
-
-	guest_vp &= vp_mask;
-	for (i = 0; i < vcpu->arch.hpte_cache_offset; i++) {
-		struct hpte_cache *pte;
-
-		pte = &vcpu->arch.hpte_cache[i];
-		if (!pte->host_va)
-			continue;
-
-		if ((pte->pte.vpage & vp_mask) == guest_vp) {
-			invalidate_pte(pte);
-		}
-	}
-}
-
-void kvmppc_mmu_pte_pflush(struct kvm_vcpu *vcpu, ulong pa_start, ulong pa_end)
-{
-	int i;
-
-	dprintk_mmu("KVM: Flushing %d Shadow pPTEs: 0x%lx & 0x%lx\n",
-		    vcpu->arch.hpte_cache_offset, pa_start, pa_end);
-	BUG_ON(vcpu->arch.hpte_cache_offset > HPTEG_CACHE_NUM);
-
-	for (i = 0; i < vcpu->arch.hpte_cache_offset; i++) {
-		struct hpte_cache *pte;
-
-		pte = &vcpu->arch.hpte_cache[i];
-		if (!pte->host_va)
-			continue;
-
-		if ((pte->pte.raddr >= pa_start) &&
-		    (pte->pte.raddr < pa_end)) {
-			invalidate_pte(pte);
-		}
-	}
-}
-
-static int kvmppc_mmu_hpte_cache_next(struct kvm_vcpu *vcpu)
-{
-	if (vcpu->arch.hpte_cache_offset == HPTEG_CACHE_NUM)
-		kvmppc_mmu_pte_flush(vcpu, 0, 0);
-
-	return vcpu->arch.hpte_cache_offset++;
 }
 
 /* We keep 512 gvsid->hvsid entries, mapping the guest ones to the array using
@@ -246,8 +159,7 @@ map_again:
 		attempt++;
 		goto map_again;
 	} else {
-		int hpte_id = kvmppc_mmu_hpte_cache_next(vcpu);
-		struct hpte_cache *pte = &vcpu->arch.hpte_cache[hpte_id];
+		struct hpte_cache *pte = kvmppc_mmu_hpte_cache_next(vcpu);
 
 		dprintk_mmu("KVM: %c%c Map 0x%lx: [%lx] 0x%lx (0x%llx) -> %lx\n",
 			    ((rflags & HPTE_R_PP) == 3) ? '-' : 'w',
@@ -265,6 +177,8 @@ map_again:
 		pte->host_va = va;
 		pte->pte = *orig_pte;
 		pte->pfn = hpaddr >> PAGE_SHIFT;
+
+		kvmppc_mmu_hpte_cache_map(vcpu, pte);
 	}
 
 	return 0;
@@ -391,7 +305,7 @@ void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu)
 
 void kvmppc_mmu_destroy(struct kvm_vcpu *vcpu)
 {
-	kvmppc_mmu_pte_flush(vcpu, 0, 0);
+	kvmppc_mmu_hpte_destroy(vcpu);
 	__destroy_context(to_book3s(vcpu)->context_id);
 }
 
@@ -409,5 +323,7 @@ int kvmppc_mmu_init(struct kvm_vcpu *vcpu)
 	vcpu3s->vsid_first = vcpu3s->context_id << USER_ESID_BITS;
 	vcpu3s->vsid_next = vcpu3s->vsid_first;
 
+	kvmppc_mmu_hpte_init(vcpu);
+
 	return 0;
 }
-- 
1.6.0.2


* Re: [PATCH 0/2] Faster MMU lookups for Book3s v3
  2010-06-30 13:18 [PATCH 0/2] Faster MMU lookups for Book3s v3 Alexander Graf
  2010-06-30 13:18 ` [PATCH 1/2] KVM: PPC: Add generic hpte management functions Alexander Graf
  2010-06-30 13:18 ` [PATCH 2/2] KVM: PPC: Make use of hash based Shadow MMU Alexander Graf
@ 2010-07-01  7:29 ` Avi Kivity
  2010-07-01  8:18   ` Alexander Graf
  2010-07-01 15:40 ` Marcelo Tosatti
  3 siblings, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2010-07-01  7:29 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, KVM list, kvm-ppc

On 06/30/2010 04:18 PM, Alexander Graf wrote:
> Book3s suffered from my really bad shadow MMU implementation so far. So
> I finally got around to implement a combined hash and list mechanism that
> allows for much faster lookup of mapped pages.
>
> To show that it really is faster, I tried to run simple process spawning
> code inside the guest with and without these patches:
>
> [without]
>
> debian-powerpc:~# time for i in {1..1000}; do /bin/echo hello > /dev/null; done
>
> real    0m20.235s
> user    0m10.418s
> sys     0m9.766s
>
> [with]
>
> debian-powerpc:~# time for i in {1..1000}; do /bin/echo hello > /dev/null; done
>
> real    0m14.659s
> user    0m8.967s
> sys     0m5.688s
>
> So as you can see, performance improved significantly.
>
> v2 ->  v3:
>
>    - use hlist
>    - use global slab cache
>
>    

Looks good.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 0/2] Faster MMU lookups for Book3s v3
  2010-07-01  7:29 ` [PATCH 0/2] Faster MMU lookups for Book3s v3 Avi Kivity
@ 2010-07-01  8:18   ` Alexander Graf
  2010-07-01  8:40     ` Avi Kivity
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Graf @ 2010-07-01  8:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linuxppc-dev, KVM list, kvm-ppc


On 01.07.2010, at 09:29, Avi Kivity wrote:

> On 06/30/2010 04:18 PM, Alexander Graf wrote:
>> Book3s suffered from my really bad shadow MMU implementation so far. So
>> I finally got around to implement a combined hash and list mechanism that
>> allows for much faster lookup of mapped pages.
>>
>> To show that it really is faster, I tried to run simple process spawning
>> code inside the guest with and without these patches:
>>
>> [without]
>>
>> debian-powerpc:~# time for i in {1..1000}; do /bin/echo hello > /dev/null; done
>>
>> real    0m20.235s
>> user    0m10.418s
>> sys     0m9.766s
>>
>> [with]
>>
>> debian-powerpc:~# time for i in {1..1000}; do /bin/echo hello > /dev/null; done
>>
>> real    0m14.659s
>> user    0m8.967s
>> sys     0m5.688s
>>
>> So as you can see, performance improved significantly.
>>
>> v2 -> v3:
>>
>>   - use hlist
>>   - use global slab cache
>>
>
> Looks good.

Great :).

How does dirty bitmap flushing work on x86 atm? I loop through all mapped
pages and flush the ones that match the range of the region I need to
flush. But wouldn't it be a lot more efficient to have an hlist in the
memslot and loop through that when I need to flush that memslot?
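
Something like this, purely hypothetically -- assuming struct hpte_cache
grew a 'struct hlist_node list_slot' and the memslot an 'spte_list' head,
neither of which exists today:

	static void example_flush_memslot(struct kvm_vcpu *vcpu,
					  struct kvm_memory_slot *slot)
	{
		struct hpte_cache *pte;
		struct hlist_node *node, *tmp;

		/* Walk only the sPTEs backed by this memslot. */
		hlist_for_each_entry_safe(pte, node, tmp, &slot->spte_list,
					  list_slot)
			invalidate_pte(vcpu, pte);	/* as in book3s_mmu_hpte.c */
	}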

Alex


* Re: [PATCH 0/2] Faster MMU lookups for Book3s v3
  2010-07-01  8:18   ` Alexander Graf
@ 2010-07-01  8:40     ` Avi Kivity
  2010-07-01 10:00       ` Alexander Graf
  0 siblings, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2010-07-01  8:40 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, KVM list, kvm-ppc

On 07/01/2010 11:18 AM, Alexander Graf wrote:
>
> How does dirty bitmap flushing work on x86 atm? I loop through all mapped pages and flush the ones that match the range of the region I need to flush. But wouldn't it be a lot more efficient to have an hlist in the memslot and loop through that when I need to flush that memslot?
>    

x86 loops through the reverse-map linked list rooted at the memory slot.
The linked list links all sptes for a single hva.

So, it's like you describe, except it's an array of lists instead of a 
single list.  We need per-page rmap lists to be able to remove a page's 
sptes in response to an mmu notifier callback, and to be able to write 
protect a guest page if it's used as a page table.
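
Schematically, heavily simplified (the real code lives in
arch/x86/kvm/mmu.c; the iterator below is made up for illustration):

	/* slot->rmap[] has one head per guest page frame in the slot;
	 * each head chains the sptes that currently map that gfn. */
	for (i = 0; i < slot->npages; i++)
		for_each_spte_in_rmap(&slot->rmap[i], spte)
			/* write-protect, test/clear dirty, drop, ... */;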

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 0/2] Faster MMU lookups for Book3s v3
  2010-07-01  8:40     ` Avi Kivity
@ 2010-07-01 10:00       ` Alexander Graf
  2010-07-01 11:14         ` Avi Kivity
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Graf @ 2010-07-01 10:00 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linuxppc-dev, KVM list, kvm-ppc

Avi Kivity wrote:
> On 07/01/2010 11:18 AM, Alexander Graf wrote:
>>
>> How does dirty bitmap flushing work on x86 atm? I loop through all
>> mapped pages and flush the ones that match the range of the region I
>> need to flush. But wouldn't it be a lot more efficient to have an
>> hlist in the memslot and loop through that when I need to flush that
>> memslot?
>>    
>
> x86 loops through the reverse-map link list rooted at the memory
> slot.  The linked list links all sptes for a single hva.
>
> So, it's like you describe, except it's an array of lists instead of a
> single list.  We need per-page rmap lists to be able to remove a
> page's sptes in response to an mmu notifier callback, and to be able
> to write protect a guest page if it's used as a page table.
>

But doesn't that mean that you still need to loop through all the hvas
that you want to invalidate? Wouldn't it speed up dirty bitmap flushing
a lot if we'd just have a simple linked list of all sPTEs belonging to
that memslot?

Alex


* Re: [PATCH 0/2] Faster MMU lookups for Book3s v3
  2010-07-01 10:00       ` Alexander Graf
@ 2010-07-01 11:14         ` Avi Kivity
  2010-07-01 12:28           ` Alexander Graf
  0 siblings, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2010-07-01 11:14 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, KVM list, kvm-ppc

On 07/01/2010 01:00 PM, Alexander Graf wrote:
>
> But doesn't that mean that you still need to loop through all the hvas
> that you want to invalidate?

It does.

>   Wouldn't it speed up dirty bitmap flushing
> a lot if we'd just have a simple linked list of all sPTEs belonging to
> that memslot?
>    

The complexity is O(pages_in_slot) + O(sptes_for_slot).

Usually, every page is mapped at least once, so sptes_for_slot 
dominates.  Even when it isn't so, iterating the rmap base pointers is 
very fast since they are linear in memory, while sptes are scattered 
around, causing cache misses.

Another consideration is that on x86, an spte occupies just 64 bits (for 
the hardware pte); if there are multiple sptes per page (rare on modern 
hardware), there is also extra memory for rmap chains; sometimes we also 
allocate 64 bits for the gfn.  Having an extra linked list would require 
more memory to be allocated and maintained.
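
(Roughly, on a 64-bit host: the bare spte is 8 bytes, while an extra
hlist_node is another two pointers, i.e. 16 more bytes per spte -- tripling
the per-mapping footprint before counting any list heads.)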

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 0/2] Faster MMU lookups for Book3s v3
  2010-07-01 11:14         ` Avi Kivity
@ 2010-07-01 12:28           ` Alexander Graf
  2010-07-01 12:43             ` Avi Kivity
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Graf @ 2010-07-01 12:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linuxppc-dev, KVM list, kvm-ppc

Avi Kivity wrote:
> On 07/01/2010 01:00 PM, Alexander Graf wrote:
>>
>> But doesn't that mean that you still need to loop through all the hvas
>> that you want to invalidate?
>
> It does.
>
>>   Wouldn't it speed up dirty bitmap flushing
>> a lot if we'd just have a simple linked list of all sPTEs belonging to
>> that memslot?
>>    
>
> The complexity is O(pages_in_slot) + O(sptes_for_slot).
>
> Usually, every page is mapped at least once, so sptes_for_slot
> dominates.  Even when it isn't so, iterating the rmap base pointers is
> very fast since they are linear in memory, while sptes are scattered
> around, causing cache misses.

Why would pages be mapped often? Don't you use lazy spte updates?

>
> Another consideration is that on x86, an spte occupies just 64 bits
> (for the hardware pte); if there are multiple sptes per page (rare on
> modern hardware), there is also extra memory for rmap chains;
> sometimes we also allocate 64 bits for the gfn.  Having an extra
> linked list would require more memory to be allocated and maintained.

Hrm. I was thinking of not having an rmap but only using the chain. The
only slots that would require such a chain would be the ones with dirty
bitmapping enabled, so no penalty for normal RAM (unless you use kemari
or live migration of course).

But then again I probably do need an rmap for the mmu_notifier magic,
right? But I'd rather prefer to have that code path be slow and the
dirty bitmap invalidation fast than the other way around. Swapping is
slow either way.


Alex


* Re: [PATCH 0/2] Faster MMU lookups for Book3s v3
  2010-07-01 12:28           ` Alexander Graf
@ 2010-07-01 12:43             ` Avi Kivity
  2010-07-01 12:52               ` Alexander Graf
  0 siblings, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2010-07-01 12:43 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, KVM list, kvm-ppc

On 07/01/2010 03:28 PM, Alexander Graf wrote:
>
>>
>>>    Wouldn't it speed up dirty bitmap flushing
>>> a lot if we'd just have a simple linked list of all sPTEs belonging to
>>> that memslot?
>>>
>>>        
>> The complexity is O(pages_in_slot) + O(sptes_for_slot).
>>
>> Usually, every page is mapped at least once, so sptes_for_slot
>> dominates.  Even when it isn't so, iterating the rmap base pointers is
>> very fast since they are linear in memory, while sptes are scattered
>> around, causing cache misses.
>>      
> Why would pages be mapped often?

It's not a question of how often they are mapped (shadow: very often; 
tdp: very rarely) but what percentage of pages are mapped.  It's usually 
100%.

> Don't you use lazy spte updates?
>    

We do, but given enough time, the guest will touch its entire memory.


>> Another consideration is that on x86, an spte occupies just 64 bits
>> (for the hardware pte); if there are multiple sptes per page (rare on
>> modern hardware), there is also extra memory for rmap chains;
>> sometimes we also allocate 64 bits for the gfn.  Having an extra
>> linked list would require more memory to be allocated and maintained.
>>      
> Hrm. I was thinking of not having an rmap but only using the chain. The
> only slots that would require such a chain would be the ones with dirty
> bitmapping enabled, so no penalty for normal RAM (unless you use kemari
> or live migration of course).
>    

You could also only chain writeable ptes.
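
Sketch only, reusing the hypothetical per-slot chain from earlier in the
thread:

	/* Only dirty-loggable, writable mappings need the per-slot walk. */
	if (pte->pte.may_write && slot->dirty_bitmap)
		hlist_add_head(&pte->list_slot, &slot->spte_list);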

> But then again I probably do need an rmap for the mmu_notifier magic,
> right? But I'd rather prefer to have that code path be slow and the
> dirty bitmap invalidation fast than the other way around. Swapping is
> slow either way.
>    

It's not just swapping, it's also page ageing.  That needs to be fast.  
Does ppc have a hardware-set referenced bit?  If so, you need a fast 
rmap for mmu notifiers.

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 0/2] Faster MMU lookups for Book3s v3
  2010-07-01 12:43             ` Avi Kivity
@ 2010-07-01 12:52               ` Alexander Graf
  2010-07-01 13:42                 ` Avi Kivity
  2010-07-02  2:50                 ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 15+ messages in thread
From: Alexander Graf @ 2010-07-01 12:52 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linuxppc-dev, KVM list, kvm-ppc

Avi Kivity wrote:
> On 07/01/2010 03:28 PM, Alexander Graf wrote:
>>
>>>
>>>>    Wouldn't it speed up dirty bitmap flushing
>>>> a lot if we'd just have a simple linked list of all sPTEs belonging to
>>>> that memslot?
>>>>
>>>>        
>>> The complexity is O(pages_in_slot) + O(sptes_for_slot).
>>>
>>> Usually, every page is mapped at least once, so sptes_for_slot
>>> dominates.  Even when it isn't so, iterating the rmap base pointers is
>>> very fast since they are linear in memory, while sptes are scattered
>>> around, causing cache misses.
>>>      
>> Why would pages be mapped often?
>
> It's not a question of how often they are mapped (shadow: very often;
> tdp: very rarely) but what percentage of pages are mapped.  It's
> usually 100%.
>
>> Don't you use lazy spte updates?
>>    
>
> We do, but given enough time, the guest will touch its entire memory.

Oh, so that's the major difference. On PPC we have the HTAB with a
fraction of all the mapped pages in it. We don't have a notion of a full
page table for a guest process. We always only have a snapshot of some
mappings and shadow those lazily.

So at worst, we have HPTEG_CACHE_NUM shadow pages mapped, which would be
(1 << 15) * 4k which again would be at most 128MB of guest memory. We
can't hold more mappings than that anyways, so chances are low we have a
mapping for each hva.
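
(Spelled out: HPTEG_CACHE_NUM = 1 << 15 = 32768 entries, each covering a
4 KiB page, so 32768 * 4096 bytes = 128 MiB of guest memory shadowed at
most at any one time.)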

>
>
>>> Another consideration is that on x86, an spte occupies just 64 bits
>>> (for the hardware pte); if there are multiple sptes per page (rare on
>>> modern hardware), there is also extra memory for rmap chains;
>>> sometimes we also allocate 64 bits for the gfn.  Having an extra
>>> linked list would require more memory to be allocated and maintained.
>>>      
>> Hrm. I was thinking of not having an rmap but only using the chain. The
>> only slots that would require such a chain would be the ones with dirty
>> bitmapping enabled, so no penalty for normal RAM (unless you use kemari
>> or live migration of course).
>>    
>
> You could also only chain writeable ptes.

Very true. Probably even more useful :).

>
>> But then again I probably do need an rmap for the mmu_notifier magic,
>> right? But I'd rather prefer to have that code path be slow and the
>> dirty bitmap invalidation fast than the other way around. Swapping is
>> slow either way.
>>    
>
> It's not just swapping, it's also page ageing.  That needs to be
> fast.  Does ppc have a hardware-set referenced bit?  If so, you need a
> fast rmap for mmu notifiers.

Page ageing is difficult. The HTAB has a hardware set referenced bit,
but we don't have a guarantee that the entry is still there when we look
for it. Something else could have overwritten it by then, but the entry
could still be lingering around in the TLB.

So I think the only reasonable way to implement page ageing is to unmap
pages. And that's slow, because it means we have to map them again on
access. Bleks. Or we could look for the HTAB entry and only unmap them
if the entry is moot.
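
Roughly, as a hypothetical hook reusing the flush helper from this series:

	static void example_age_range(struct kvm_vcpu *vcpu,
				      ulong pa_start, ulong pa_end)
	{
		/* Drop the shadow mappings for the range; the next guest
		 * access faults them back in, and that fault is the
		 * "referenced" signal we can report. */
		kvmppc_mmu_pte_pflush(vcpu, pa_start, pa_end);
	}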


Alex


* Re: [PATCH 0/2] Faster MMU lookups for Book3s v3
  2010-07-01 12:52               ` Alexander Graf
@ 2010-07-01 13:42                 ` Avi Kivity
  2010-07-02  2:54                   ` Benjamin Herrenschmidt
  2010-07-02  2:50                 ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2010-07-01 13:42 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, KVM list, kvm-ppc

On 07/01/2010 03:52 PM, Alexander Graf wrote:
>>
>>> Don't you use lazy spte updates?
>>>
>>>        
>> We do, but given enough time, the guest will touch its entire memory.
>>      
> Oh, so that's the major difference. On PPC we have the HTAB with a
> fraction of all the mapped pages in it. We don't have a notion of a full
> page table for a guest process. We always only have a snapshot of some
> mappings and shadow those lazily.
>
> So at worst, we have HPTEG_CACHE_NUM shadow pages mapped, which would be
> (1<<  15) * 4k which again would be at most 128MB of guest memory. We
> can't hold more mappings than that anyways, so chances are low we have a
> mapping for each hva.
>    

Doesn't that seriously impact performance?  A guest that recycles pages 
from its lru will touch pages at random from its entire address space.  
On bare metal that isn't a problem (I imagine) due to large tlbs.  But 
virtualized on 4K pages that means the htlb will be thrashed.

>>> But then again I probably do need an rmap for the mmu_notifier magic,
>>> right? But I'd rather prefer to have that code path be slow and the
>>> dirty bitmap invalidation fast than the other way around. Swapping is
>>> slow either way.
>>>
>>>        
>> It's not just swapping, it's also page ageing.  That needs to be
>> fast.  Does ppc have a hardware-set referenced bit?  If so, you need a
>> fast rmap for mmu notifiers.
>>      
> Page ageing is difficult. The HTAB has a hardware set referenced bit,
> but we don't have a guarantee that the entry is still there when we look
> for it. Something else could have overwritten it by then, but the entry
> could still be lingering around in the TLB.
>    

Whoever's dropping the HTAB needs to update the host struct page, and 
also reflect the bit into the guest's HTAB, no?

In fact, on x86 shadow, we don't have an spte for a gpte that is not 
accessed, precisely so we know the exact point in time when the accessed 
bit is set.

> So I think the only reasonable way to implement page ageing is to unmap
> pages. And that's slow, because it means we have to map them again on
> access. Bleks. Or we could look for the HTAB entry and only unmap them
> if the entry is moot.
>    

I think it works out if you update struct page when you clear out an HTAB.
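
Something along these lines at HTAB teardown time (sketch only; 'hpte_r'
stands for the R/C word read back from the hash entry being dropped, and
the plumbing to get at it is hand-waved here):

	if (hpte_r & HPTE_R_R)
		mark_page_accessed(pfn_to_page(pte->pfn));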

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 0/2] Faster MMU lookups for Book3s v3
  2010-06-30 13:18 [PATCH 0/2] Faster MMU lookups for Book3s v3 Alexander Graf
                   ` (2 preceding siblings ...)
  2010-07-01  7:29 ` [PATCH 0/2] Faster MMU lookups for Book3s v3 Avi Kivity
@ 2010-07-01 15:40 ` Marcelo Tosatti
  3 siblings, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2010-07-01 15:40 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, KVM list, kvm-ppc

On Wed, Jun 30, 2010 at 03:18:44PM +0200, Alexander Graf wrote:
> Book3s suffered from my really bad shadow MMU implementation so far. So
> I finally got around to implement a combined hash and list mechanism that
> allows for much faster lookup of mapped pages.
> 
> To show that it really is faster, I tried to run simple process spawning
> code inside the guest with and without these patches:
> 
> [without]
> 
> debian-powerpc:~# time for i in {1..1000}; do /bin/echo hello > /dev/null; done
> 
> real    0m20.235s
> user    0m10.418s
> sys     0m9.766s
> 
> [with]
> 
> debian-powerpc:~# time for i in {1..1000}; do /bin/echo hello > /dev/null; done
> 
> real    0m14.659s
> user    0m8.967s
> sys     0m5.688s
> 
> So as you can see, performance improved significantly.
> 
> v2 -> v3:
> 
>   - use hlist
>   - use global slab cache
> 
> Alexander Graf (2):
>   KVM: PPC: Add generic hpte management functions
>   KVM: PPC: Make use of hash based Shadow MMU
> 
>  arch/powerpc/include/asm/kvm_book3s.h |    9 +
>  arch/powerpc/include/asm/kvm_host.h   |   17 ++-
>  arch/powerpc/kvm/Makefile             |    2 +
>  arch/powerpc/kvm/book3s.c             |   14 ++-
>  arch/powerpc/kvm/book3s_32_mmu_host.c |  104 ++-----------
>  arch/powerpc/kvm/book3s_64_mmu_host.c |   98 +-----------
>  arch/powerpc/kvm/book3s_mmu_hpte.c    |  277 +++++++++++++++++++++++++++++++++
>  7 files changed, 331 insertions(+), 190 deletions(-)
>  create mode 100644 arch/powerpc/kvm/book3s_mmu_hpte.c

Applied, thanks.


* Re: [PATCH 0/2] Faster MMU lookups for Book3s v3
  2010-07-01 12:52               ` Alexander Graf
  2010-07-01 13:42                 ` Avi Kivity
@ 2010-07-02  2:50                 ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2010-07-02  2:50 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, linuxppc-dev, Avi Kivity, KVM list

On Thu, 2010-07-01 at 14:52 +0200, Alexander Graf wrote:
> Page ageing is difficult. The HTAB has a hardware set referenced bit,
> but we don't have a guarantee that the entry is still there when we look
> for it. Something else could have overwritten it by then, but the entry
> could still be lingering around in the TLB.
> 
> So I think the only reasonable way to implement page ageing is to unmap
> pages. And that's slow, because it means we have to map them again on
> access. Bleks. Or we could look for the HTAB entry and only unmap them
> if the entry is moot.

Well, not quite.

We -could- use the HW reference bit. However, that means that whenever
we flush the hash PTE we get a snapshot of the HW bit and copy it over
to the PTE.

That's not -that- bad for normal invalidations. However, it's a problem
potentially for eviction. IE. When a hash bucket is full, we
pseudo-randomly evict a slot. If we were to use the HW ref bit, we would
need a way to go back to the PTE from the hash bucket to perform that
update (or something really tricky like sticking it in a list somewhere,
and have the young test walk that list when non-empty, etc...)

Cheers,
Ben.


* Re: [PATCH 0/2] Faster MMU lookups for Book3s v3
  2010-07-01 13:42                 ` Avi Kivity
@ 2010-07-02  2:54                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2010-07-02  2:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-ppc, linuxppc-dev, Alexander Graf, KVM list

On Thu, 2010-07-01 at 16:42 +0300, Avi Kivity wrote:
> > So I think the only reasonable way to implement page ageing is to
> unmap
> > pages. And that's slow, because it means we have to map them again
> on
> > access. Bleks. Or we could look for the HTAB entry and only unmap
> them
> > if the entry is moot.
> >    
> 
> I think it works out if you update struct page when you clear out an
> HTAB.

Hrm... going to struct page without going through the PTE might work out
indeed. We can get to the struct page from the RPN.

However, that means -reading- the hash entry we want to evict, and
that's a fairly expensive H-Call, especially if we ask phyp to
back-translate the real address into a logical (partition) address so we
can get to the struct page.... While we might be able to reconstitute
the virtual address from the hash content + bucket address. However,
from the vsid back to the page table might be tricky as well.

IE. Either way, it's not a simple process.

Now, eviction is rare, our MMU hash is generally big, so maybe the read
back with back translate to hit struct page might be the way to go here.

As for other kind of invalidations, we do have the PTE around when they
happen so we can go fetch the HW ref bit and update the PTE I suppose.

Ben.

