* [PATCH v2 2/10] KVM: MMU: fix conflict access permissions in direct sp
       [not found] <4C2498EC.2010006@cn.fujitsu.com>
@ 2010-06-25 12:05 ` Xiao Guangrong
  2010-06-28  9:43   ` Avi Kivity
  2010-06-25 12:06 ` [PATCH v2 3/10] KVM: MMU: fix direct sp's access corrupted Xiao Guangrong
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-25 12:05 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list

Under non-direct (i.e. shadow) paging, we mark the sp as 'direct' when we
map a guest large page with small host pages, but its access is encoded
only from the upper page-structure entries and does not include the
last-level mapping, which can cause an access conflict.

For example, consider this mapping:
        [W] 
      / PDE1 -> |---|
  P[W]          |   | LPA
      \ PDE2 -> |---|
        [R]

P has two children, PDE1 and PDE2, and both PDE1 and PDE2 map the same
large page (LPA). P's access is RW, PDE1's access is RW, and PDE2's
access is RO (considering only read/write permissions here).

When the guest accesses through PDE1, we create a direct sp for LPA; the
sp's access comes only from P, i.e. it is W, and we mark the ptes in this
sp writable.

Then the guest accesses through PDE2: we find LPA's shadow page, which is
the same one PDE1 uses, and mark its ptes read-only.

So if the guest now accesses through PDE1 again, a spurious #PF occurs.

Fix this by encoding the last-level mapping's access into the direct
shadow page.
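
With that fix, a rough worked example of the access that ends up in the
direct sp (read/write bits only, dirty bit assumed set):

        PDE1 path: P(W) & PDE1(W) = W  -> direct sp for LPA gets write access
        PDE2 path: P(W) & PDE2(R) = R  -> a different sp role, so no conflict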

This patch also cleans up the code that reads the last level's dirty flag.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/paging_tmpl.h |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 37c26cb..e46eb8a 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -306,6 +306,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	gfn_t table_gfn;
 	int r;
 	int level;
+	bool dirty = is_dirty_gpte(gw->ptes[gw->level-1]);
 	pt_element_t curr_pte;
 	struct kvm_shadow_walk_iterator iterator;
 
@@ -319,8 +320,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			mmu_set_spte(vcpu, sptep, access,
 				     gw->pte_access & access,
 				     user_fault, write_fault,
-				     is_dirty_gpte(gw->ptes[gw->level-1]),
-				     ptwrite, level,
+				     dirty, ptwrite, level,
 				     gw->gfn, pfn, false, true);
 			break;
 		}
@@ -335,10 +335,11 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 		}
 
 		if (level <= gw->level) {
-			int delta = level - gw->level + 1;
 			direct = 1;
-			if (!is_dirty_gpte(gw->ptes[level - delta]))
+			if (!dirty)
 				access &= ~ACC_WRITE_MASK;
+			access &= gw->pte_access;
+
 			/*
 			 * It is a large guest pages backed by small host pages,
 			 * So we set @direct(@sp->role.direct)=1, and set
-- 
1.6.1.2

* [PATCH v2 3/10] KVM: MMU: fix direct sp's access corrupted
       [not found] <4C2498EC.2010006@cn.fujitsu.com>
  2010-06-25 12:05 ` [PATCH v2 2/10] KVM: MMU: fix conflict access permissions in direct sp Xiao Guangrong
@ 2010-06-25 12:06 ` Xiao Guangrong
  2010-06-28  9:50   ` Avi Kivity
  2010-06-25 12:06 ` [PATCH v2 4/10] KVM: MMU: fix forgot to flush all vcpu's tlb Xiao Guangrong
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-25 12:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list

Consider using small pages to back a guest large-page mapping:

If the gpte is writable but its dirty flag is not set, we find the
read-only direct sp and set up the mapping there; then, when a write #PF
occurs, we mark this mapping writable inside the read-only direct sp, and
from then on other genuinely read-only mappings can happily write through
it without any #PF.
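
Step by step (a sketch of the failure mode):

        1. large gpte is writable but clean (D=0) -> write access is stripped
           and the read-only direct sp is used for the mapping
        2. the guest writes, write #PF            -> the spte inside that
           read-only sp is made writable
        3. a genuinely read-only gpte that maps the same direct sp can now
           write through it without any #PF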

This may break the guest's COW.

Fix this by re-installing the mapping when the write #PF occurs.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c         |    3 ++-
 arch/x86/kvm/paging_tmpl.h |   18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 556a798..0412ba4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -153,7 +153,8 @@ module_param(oos_shadow, bool, 0644);
 #define CREATE_TRACE_POINTS
 #include "mmutrace.h"
 
-#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+#define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+#define SPTE_NO_DIRTY		(2ULL << PT_FIRST_AVAIL_BITS_SHIFT)
 
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index e46eb8a..fdba751 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -325,6 +325,20 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			break;
 		}
 
+		if (*sptep & SPTE_NO_DIRTY) {
+			struct kvm_mmu_page *child;
+
+			WARN_ON(level !=  gw->level);
+			WARN_ON(!is_shadow_present_pte(*sptep));
+			if (dirty) {
+				child = page_header(*sptep &
+						      PT64_BASE_ADDR_MASK);
+				mmu_page_remove_parent_pte(child, sptep);
+				__set_spte(sptep, shadow_trap_nonpresent_pte);
+				kvm_flush_remote_tlbs(vcpu->kvm);
+			}
+		}
+
 		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
 			continue;
 
@@ -365,6 +379,10 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			}
 		}
 
+		if (level == gw->level && !dirty &&
+		      access & gw->pte_access & ACC_WRITE_MASK)
+			spte |= SPTE_NO_DIRTY;
+
 		spte = __pa(sp->spt)
 			| PT_PRESENT_MASK | PT_ACCESSED_MASK
 			| PT_WRITABLE_MASK | PT_USER_MASK;
-- 
1.6.1.2

* [PATCH v2 4/10] KVM: MMU: fix forgot to flush all vcpu's tlb
       [not found] <4C2498EC.2010006@cn.fujitsu.com>
  2010-06-25 12:05 ` [PATCH v2 2/10] KVM: MMU: fix conflict access permissions in direct sp Xiao Guangrong
  2010-06-25 12:06 ` [PATCH v2 3/10] KVM: MMU: fix direct sp's access corrupted Xiao Guangrong
@ 2010-06-25 12:06 ` Xiao Guangrong
  2010-06-28  9:55   ` Avi Kivity
  2010-06-25 12:06 ` [PATCH v2 5/10] KVM: MMU: introduce gfn_to_pfn_atomic() function Xiao Guangrong
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-25 12:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list

After removing an rmap, we should flush every vcpu's TLB.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0412ba4..f151540 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1933,6 +1933,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			pgprintk("hfn old %lx new %lx\n",
 				 spte_to_pfn(*sptep), pfn);
 			rmap_remove(vcpu->kvm, sptep);
+			__set_spte(sptep, shadow_trap_nonpresent_pte);
+			kvm_flush_remote_tlbs(vcpu->kvm);
 		} else
 			was_rmapped = 1;
 	}
-- 
1.6.1.2

* [PATCH v2 5/10] KVM: MMU: introduce gfn_to_pfn_atomic() function
       [not found] <4C2498EC.2010006@cn.fujitsu.com>
                   ` (2 preceding siblings ...)
  2010-06-25 12:06 ` [PATCH v2 4/10] KVM: MMU: fix forgot to flush all vcpu's tlb Xiao Guangrong
@ 2010-06-25 12:06 ` Xiao Guangrong
  2010-06-25 12:06 ` [PATCH v2 6/10] KVM: MMU: introduce gfn_to_hva_many() function Xiao Guangrong
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-25 12:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list

Introduce gfn_to_pfn_atomic(): it is the fast path and can be used in
atomic context.  A later patch will use it.
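
For illustration, a rough sketch of the intended calling pattern (it
mirrors how the pte prefetch patch later in this series uses it,
including the error handling):

        pfn_t pfn;

        /* fast, non-sleeping lookup, e.g. while holding mmu_lock */
        pfn = gfn_to_pfn_atomic(vcpu->kvm, gfn);
        if (is_error_pfn(pfn)) {
                kvm_release_pfn_clean(pfn);
                /* give up here; callers that may sleep keep using gfn_to_pfn() */
        }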

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/mm/gup.c        |    2 ++
 include/linux/kvm_host.h |    1 +
 virt/kvm/kvm_main.c      |   32 +++++++++++++++++++++++++-------
 3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 738e659..0c9034b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -6,6 +6,7 @@
  */
 #include <linux/sched.h>
 #include <linux/mm.h>
+#include <linux/module.h>
 #include <linux/vmstat.h>
 #include <linux/highmem.h>
 
@@ -274,6 +275,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	return nr;
 }
+EXPORT_SYMBOL_GPL(__get_user_pages_fast);
 
 /**
  * get_user_pages_fast() - pin user pages in memory
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9289d1a..515fefd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -295,6 +295,7 @@ void kvm_release_page_dirty(struct page *page);
 void kvm_set_page_dirty(struct page *page);
 void kvm_set_page_accessed(struct page *page);
 
+pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
 pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
 			 struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 885d3f5..60bb3d5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -924,19 +924,25 @@ unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(gfn_to_hva);
 
-static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr)
+static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic)
 {
 	struct page *page[1];
 	int npages;
 	pfn_t pfn;
 
-	might_sleep();
-
-	npages = get_user_pages_fast(addr, 1, 1, page);
+	if (atomic)
+		npages = __get_user_pages_fast(addr, 1, 1, page);
+	else {
+		might_sleep();
+		npages = get_user_pages_fast(addr, 1, 1, page);
+	}
 
 	if (unlikely(npages != 1)) {
 		struct vm_area_struct *vma;
 
+		if (atomic)
+			goto return_bad_page;
+
 		down_read(&current->mm->mmap_sem);
 		if (is_hwpoison_address(addr)) {
 			up_read(&current->mm->mmap_sem);
@@ -949,6 +955,7 @@ static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr)
 		if (vma == NULL || addr < vma->vm_start ||
 		    !(vma->vm_flags & VM_PFNMAP)) {
 			up_read(&current->mm->mmap_sem);
+return_bad_page:
 			get_page(bad_page);
 			return page_to_pfn(bad_page);
 		}
@@ -962,7 +969,7 @@ static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr)
 	return pfn;
 }
 
-pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
+pfn_t __gfn_to_pfn(struct kvm *kvm, gfn_t gfn, bool atomic)
 {
 	unsigned long addr;
 
@@ -972,7 +979,18 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 		return page_to_pfn(bad_page);
 	}
 
-	return hva_to_pfn(kvm, addr);
+	return hva_to_pfn(kvm, addr, atomic);
+}
+
+pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
+{
+	return __gfn_to_pfn(kvm, gfn, true);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_atomic);
+
+pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
+{
+	return __gfn_to_pfn(kvm, gfn, false);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
@@ -980,7 +998,7 @@ pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
 			 struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	unsigned long addr = gfn_to_hva_memslot(slot, gfn);
-	return hva_to_pfn(kvm, addr);
+	return hva_to_pfn(kvm, addr, false);
 }
 
 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
-- 
1.6.1.2

* [PATCH v2 6/10] KVM: MMU: introduce gfn_to_hva_many() function
       [not found] <4C2498EC.2010006@cn.fujitsu.com>
                   ` (3 preceding siblings ...)
  2010-06-25 12:06 ` [PATCH v2 5/10] KVM: MMU: introduce gfn_to_pfn_atomic() function Xiao Guangrong
@ 2010-06-25 12:06 ` Xiao Guangrong
  2010-06-25 12:06 ` [PATCH v2 7/10] KVM: MMU: introduce mmu_topup_memory_cache_atomic() Xiao Guangrong
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-25 12:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list

This function returns not only the gfn's hva but also the number of pages
after @gfn in the slot.

It is used in a later patch.
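
As a worked example of the extra out-parameter: for a slot with
base_gfn = 0x100 and npages = 0x100, gfn_to_hva_many(kvm, 0x180, &entry)
returns the hva of gfn 0x180 and sets entry = 0x100 - (0x180 - 0x100) = 0x80,
i.e. the number of pages left in the slot starting at @gfn.  The pte
prefetch patch later in this series uses this value to bound
__get_user_pages_fast().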

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 include/linux/kvm_host.h |    1 +
 virt/kvm/kvm_main.c      |   13 ++++++++++++-
 2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 515fefd..8f7af32 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -289,6 +289,7 @@ void kvm_disable_largepages(void);
 void kvm_arch_flush_shadow(struct kvm *kvm);
 
 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn);
+unsigned long gfn_to_hva_many(struct kvm *kvm, gfn_t gfn, int *entry);
 unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn);
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 60bb3d5..a007889 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -913,15 +913,26 @@ static unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 	return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
 }
 
-unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
+unsigned long gfn_to_hva_many(struct kvm *kvm, gfn_t gfn, int *entry)
 {
 	struct kvm_memory_slot *slot;
 
 	slot = gfn_to_memslot(kvm, gfn);
+
 	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
 		return bad_hva();
+
+	if (entry)
+		*entry = slot->npages - (gfn - slot->base_gfn);
+
 	return gfn_to_hva_memslot(slot, gfn);
 }
+EXPORT_SYMBOL_GPL(gfn_to_hva_many);
+
+unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
+{
+	return gfn_to_hva_many(kvm, gfn, NULL);
+}
 EXPORT_SYMBOL_GPL(gfn_to_hva);
 
 static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic)
-- 
1.6.1.2

* [PATCH v2 7/10] KVM: MMU: introduce mmu_topup_memory_cache_atomic()
       [not found] <4C2498EC.2010006@cn.fujitsu.com>
                   ` (4 preceding siblings ...)
  2010-06-25 12:06 ` [PATCH v2 6/10] KVM: MMU: introduce gfn_to_hva_many() function Xiao Guangrong
@ 2010-06-25 12:06 ` Xiao Guangrong
  2010-06-28 11:17   ` Avi Kivity
  2010-06-25 12:07 ` [PATCH v2 8/10] KVM: MMU: prefetch ptes when intercepted guest #PF Xiao Guangrong
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-25 12:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list

Introduce mmu_topup_memory_cache_atomic(), which supports topping up a
memory cache in atomic context.
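
For illustration, the pte prefetch patch later in this series calls the
pte_prefetch_topup_memory_cache() helper added below roughly like this
(the prefetch runs under mmu_lock, hence the GFP_ATOMIC variant):

        if (pte_prefetch_topup_memory_cache(vcpu, end - start))
                return -1;      /* atomic allocation failed, skip prefetching */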

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |   29 +++++++++++++++++++++++++----
 1 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f151540..6c06666 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -291,15 +291,16 @@ static void __set_spte(u64 *sptep, u64 spte)
 #endif
 }
 
-static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
-				  struct kmem_cache *base_cache, int min)
+static int __mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
+				    struct kmem_cache *base_cache, int min,
+				    int max, gfp_t flags)
 {
 	void *obj;
 
 	if (cache->nobjs >= min)
 		return 0;
-	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-		obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
+	while (cache->nobjs < max) {
+		obj = kmem_cache_zalloc(base_cache, flags);
 		if (!obj)
 			return -ENOMEM;
 		cache->objects[cache->nobjs++] = obj;
@@ -307,6 +308,26 @@ static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
 	return 0;
 }
 
+static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
+				  struct kmem_cache *base_cache, int min)
+{
+	return __mmu_topup_memory_cache(cache, base_cache, min,
+				ARRAY_SIZE(cache->objects), GFP_KERNEL);
+}
+
+static int mmu_topup_memory_cache_atomic(struct kvm_mmu_memory_cache *cache,
+				  struct kmem_cache *base_cache, int min)
+{
+	return __mmu_topup_memory_cache(cache, base_cache, min, min,
+					GFP_ATOMIC);
+}
+
+static int pte_prefetch_topup_memory_cache(struct kvm_vcpu *vcpu, int num)
+{
+	return mmu_topup_memory_cache_atomic(&vcpu->arch.mmu_rmap_desc_cache,
+					     rmap_desc_cache, num);
+}
+
 static void mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc,
 				  struct kmem_cache *cache)
 {
-- 
1.6.1.2

* [PATCH v2 8/10] KVM: MMU: prefetch ptes when intercepted guest #PF
       [not found] <4C2498EC.2010006@cn.fujitsu.com>
                   ` (5 preceding siblings ...)
  2010-06-25 12:06 ` [PATCH v2 7/10] KVM: MMU: introduce mmu_topup_memory_cache_atomic() Xiao Guangrong
@ 2010-06-25 12:07 ` Xiao Guangrong
  2010-06-28 13:04   ` Marcelo Tosatti
  2010-06-25 12:07 ` [PATCH v2 9/10] KVM: MMU: combine guest pte read between walk and pte prefetch Xiao Guangrong
  2010-06-25 12:07 ` [PATCH v2 10/10] KVM: MMU: trace " Xiao Guangrong
  8 siblings, 1 reply; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-25 12:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list

Support prefetching ptes when we intercept a guest #PF, to avoid further
#PFs on later accesses.

If we meet any failure in the prefetch path, we exit it and do not try
the remaining ptes, so that it never becomes a heavy path.
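
As a worked example of the prefetch window used below, with
PTE_PREFETCH_NUM = 16 a fault on spte index 21 of a last-level sp gives:

        index = sptep - sp->spt;                 /* 21 */
        i     = index & ~(PTE_PREFETCH_NUM - 1); /* 16, start of the aligned window */
        max   = index | (PTE_PREFETCH_NUM - 1);  /* 31, end of the aligned window   */

so the prefetch stays within that aligned 16-entry block and skips the
faulting spte itself as well as any spte that is already present.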

Note: this speculation marks pages dirty even though they are not really
accessed; the same issue exists in other speculative paths such as invlpg
and pte write.  Fortunately, it only affects host memory management.
After Avi's patchset "[PATCH v2 1/4] KVM: MMU: Introduce drop_spte()" is
merged, it will be easy to fix; we will do that in the future.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c         |   69 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/paging_tmpl.h |   74 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 143 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6c06666..b2ad723 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -89,6 +89,8 @@ module_param(oos_shadow, bool, 0644);
 	}
 #endif
 
+#define PTE_PREFETCH_NUM	16
+
 #define PT_FIRST_AVAIL_BITS_SHIFT 9
 #define PT64_SECOND_AVAIL_BITS_SHIFT 52
 
@@ -1998,6 +2000,72 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
 {
 }
 
+static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
+				    struct kvm_mmu_page *sp,
+				    u64 *start, u64 *end)
+{
+	gfn_t gfn;
+	struct page *pages[PTE_PREFETCH_NUM];
+
+	if (pte_prefetch_topup_memory_cache(vcpu, end - start))
+		return -1;
+
+	gfn = sp->gfn + start - sp->spt;
+	while (start < end) {
+		unsigned long addr;
+		int entry, j, ret;
+
+		addr = gfn_to_hva_many(vcpu->kvm, gfn, &entry);
+		if (kvm_is_error_hva(addr))
+			return -1;
+
+		entry = min(entry, (int)(end - start));
+		ret = __get_user_pages_fast(addr, entry, 1, pages);
+		if (ret <= 0)
+			return -1;
+
+		for (j = 0; j < ret; j++, gfn++, start++)
+			mmu_set_spte(vcpu, start, ACC_ALL,
+				     sp->role.access, 0, 0, 1, NULL,
+				     sp->role.level, gfn,
+				     page_to_pfn(pages[j]), true, false);
+
+		if (ret < entry)
+			return -1;
+	}
+	return 0;
+}
+
+static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
+{
+	struct kvm_mmu_page *sp;
+	u64 *start = NULL;
+	int index, i, max;
+
+	sp = page_header(__pa(sptep));
+	WARN_ON(!sp->role.direct);
+
+	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+		return;
+
+	index = sptep - sp->spt;
+	i = index & ~(PTE_PREFETCH_NUM - 1);
+	max = index | (PTE_PREFETCH_NUM - 1);
+
+	for (; i < max; i++) {
+		u64 *spte = sp->spt + i;
+
+		if (*spte != shadow_trap_nonpresent_pte || spte == sptep) {
+			if (!start)
+				continue;
+			if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
+				break;
+			start = NULL;
+		} else if (!start)
+			start = spte;
+	}
+}
+
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			int level, gfn_t gfn, pfn_t pfn)
 {
@@ -2012,6 +2080,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 				     0, write, 1, &pt_write,
 				     level, gfn, pfn, false, true);
 			++vcpu->stat.pf_fixed;
+			direct_pte_prefetch(vcpu, iterator.sptep);
 			break;
 		}
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index fdba751..134f031 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -291,6 +291,79 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		     gpte_to_gfn(gpte), pfn, true, true);
 }
 
+static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, u64 *sptep)
+{
+	struct kvm_mmu_page *sp;
+	pt_element_t gptep[PTE_PREFETCH_NUM];
+	gpa_t first_pte_gpa;
+	int offset = 0, index, i, j, max;
+
+	sp = page_header(__pa(sptep));
+	index = sptep - sp->spt;
+
+	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+		return;
+
+	if (sp->role.direct)
+		return direct_pte_prefetch(vcpu, sptep);
+
+	index = sptep - sp->spt;
+	i = index & ~(PTE_PREFETCH_NUM - 1);
+	max = index | (PTE_PREFETCH_NUM - 1);
+
+	if (PTTYPE == 32)
+		offset = sp->role.quadrant << PT64_LEVEL_BITS;
+
+	first_pte_gpa = gfn_to_gpa(sp->gfn) +
+				(offset + i) * sizeof(pt_element_t);
+
+	if (kvm_read_guest_atomic(vcpu->kvm, first_pte_gpa, gptep,
+					sizeof(gptep)) < 0)
+		return;
+
+	for (j = 0; i < max; i++, j++) {
+		pt_element_t gpte;
+		unsigned pte_access;
+		u64 *spte = sp->spt + i;
+		gfn_t gfn;
+		pfn_t pfn;
+
+		if (spte == sptep)
+			continue;
+
+		if (*spte != shadow_trap_nonpresent_pte)
+			continue;
+
+		gpte = gptep[j];
+
+		if (is_rsvd_bits_set(vcpu, gpte, PT_PAGE_TABLE_LEVEL))
+			break;
+
+		if (!(gpte & PT_ACCESSED_MASK))
+			continue;
+
+		if (!is_present_gpte(gpte)) {
+			if (!sp->unsync)
+				__set_spte(spte, shadow_notrap_nonpresent_pte);
+			continue;
+		}
+
+		gfn = gpte_to_gfn(gpte);
+
+		pfn = gfn_to_pfn_atomic(vcpu->kvm, gfn);
+		if (is_error_pfn(pfn) ||
+			pte_prefetch_topup_memory_cache(vcpu, 1)) {
+			kvm_release_pfn_clean(pfn);
+			break;
+		}
+
+		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
+		mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
+			     is_dirty_gpte(gpte), NULL, sp->role.level, gfn,
+			     pfn, true, false);
+	}
+}
+
 /*
  * Fetch a shadow pte for a specific level in the paging hierarchy.
  */
@@ -322,6 +395,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 				     user_fault, write_fault,
 				     dirty, ptwrite, level,
 				     gw->gfn, pfn, false, true);
+			FNAME(pte_prefetch)(vcpu, sptep);
 			break;
 		}
 
-- 
1.6.1.2

* [PATCH v2 9/10] KVM: MMU: combine guest pte read between walk and pte prefetch
       [not found] <4C2498EC.2010006@cn.fujitsu.com>
                   ` (6 preceding siblings ...)
  2010-06-25 12:07 ` [PATCH v2 8/10] KVM: MMU: prefetch ptes when intercepted guest #PF Xiao Guangrong
@ 2010-06-25 12:07 ` Xiao Guangrong
  2010-06-25 12:07 ` [PATCH v2 10/10] KVM: MMU: trace " Xiao Guangrong
  8 siblings, 0 replies; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-25 12:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list

Combine guest pte read between guest pte walk and pte prefetch
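
As a worked example of the combined read (64-bit gptes assumed):
mask = PTE_PREFETCH_NUM * sizeof(pt_element_t) - 1 = 16 * 8 - 1 = 127, so
base_gpa is pte_gpa rounded down to a 128-byte boundary, and a single
kvm_read_guest() fills walker->prefetch_ptes[] with the whole 16-entry
block.  FNAME(pte_prefetch)() can then reuse those gptes instead of doing
its own kvm_read_guest_atomic() in the fault path.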

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/paging_tmpl.h |   48 ++++++++++++++++++++++++++++++-------------
 1 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 134f031..641c2ca 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -67,6 +67,7 @@ struct guest_walker {
 	int level;
 	gfn_t table_gfn[PT_MAX_FULL_LEVELS];
 	pt_element_t ptes[PT_MAX_FULL_LEVELS];
+	pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
 	gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
 	unsigned pt_access;
 	unsigned pte_access;
@@ -110,6 +111,31 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte)
 	return access;
 }
 
+static int FNAME(read_guest_pte)(struct kvm_vcpu *vcpu,
+				 struct guest_walker *walker,
+				 gpa_t pte_gpa, pt_element_t *pte)
+{
+	int index, ret;
+	u64 mask;
+	gpa_t base_gpa;
+
+	if (walker->level > PT_PAGE_TABLE_LEVEL)
+		return kvm_read_guest(vcpu->kvm, pte_gpa, pte,
+					sizeof(*pte));
+
+	mask = PTE_PREFETCH_NUM * sizeof(*pte) - 1;
+	base_gpa = pte_gpa & ~mask;
+	index = (pte_gpa - base_gpa) / sizeof(*pte);
+
+	ret = kvm_read_guest(vcpu->kvm, base_gpa, walker->prefetch_ptes,
+				sizeof(walker->prefetch_ptes));
+	if (ret)
+		return ret;
+
+	*pte = walker->prefetch_ptes[index];
+	return 0;
+}
+
 /*
  * Fetch a guest pte for a guest virtual address
  */
@@ -151,7 +177,7 @@ walk:
 		walker->table_gfn[walker->level - 1] = table_gfn;
 		walker->pte_gpa[walker->level - 1] = pte_gpa;
 
-		if (kvm_read_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte)))
+		if (FNAME(read_guest_pte)(vcpu, walker, pte_gpa, &pte))
 			goto not_present;
 
 		trace_kvm_mmu_paging_element(pte, walker->level);
@@ -291,12 +317,12 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		     gpte_to_gfn(gpte), pfn, true, true);
 }
 
-static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, u64 *sptep)
+static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu,
+				struct guest_walker *gw, u64 *sptep)
 {
 	struct kvm_mmu_page *sp;
-	pt_element_t gptep[PTE_PREFETCH_NUM];
-	gpa_t first_pte_gpa;
-	int offset = 0, index, i, j, max;
+	pt_element_t *gptep;
+	int index, i, j, max;
 
 	sp = page_header(__pa(sptep));
 	index = sptep - sp->spt;
@@ -311,15 +337,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, u64 *sptep)
 	i = index & ~(PTE_PREFETCH_NUM - 1);
 	max = index | (PTE_PREFETCH_NUM - 1);
 
-	if (PTTYPE == 32)
-		offset = sp->role.quadrant << PT64_LEVEL_BITS;
-
-	first_pte_gpa = gfn_to_gpa(sp->gfn) +
-				(offset + i) * sizeof(pt_element_t);
-
-	if (kvm_read_guest_atomic(vcpu->kvm, first_pte_gpa, gptep,
-					sizeof(gptep)) < 0)
-		return;
+	gptep = gw->prefetch_ptes;
 
 	for (j = 0; i < max; i++, j++) {
 		pt_element_t gpte;
@@ -395,7 +413,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 				     user_fault, write_fault,
 				     dirty, ptwrite, level,
 				     gw->gfn, pfn, false, true);
-			FNAME(pte_prefetch)(vcpu, sptep);
+			FNAME(pte_prefetch)(vcpu, gw, sptep);
 			break;
 		}
 
-- 
1.6.1.2

* [PATCH v2 10/10] KVM: MMU: trace pte prefetch
       [not found] <4C2498EC.2010006@cn.fujitsu.com>
                   ` (7 preceding siblings ...)
  2010-06-25 12:07 ` [PATCH v2 9/10] KVM: MMU: combine guest pte read between walk and pte prefetch Xiao Guangrong
@ 2010-06-25 12:07 ` Xiao Guangrong
  8 siblings, 0 replies; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-25 12:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list

Trace pte prefetch; it can help us improve the prefetch's performance.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c         |   42 +++++++++++++++++++++++++++++++++---------
 arch/x86/kvm/mmutrace.h    |   33 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/paging_tmpl.h |   29 ++++++++++++++++++++++-------
 3 files changed, 88 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b2ad723..bcf4626 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -91,6 +91,12 @@ module_param(oos_shadow, bool, 0644);
 
 #define PTE_PREFETCH_NUM	16
 
+#define PREFETCH_SUCCESS		0
+#define PREFETCH_ERR_GFN2PFN		1
+#define PREFETCH_ERR_ALLOC_MEM		2
+#define PREFETCH_ERR_RSVD_BITS_SET	3
+#define PREFETCH_ERR_MMIO		4
+
 #define PT_FIRST_AVAIL_BITS_SHIFT 9
 #define PT64_SECOND_AVAIL_BITS_SHIFT 52
 
@@ -2002,13 +2008,16 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
 
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 				    struct kvm_mmu_page *sp,
-				    u64 *start, u64 *end)
+				    u64 *start, u64 *end, u64 address)
 {
 	gfn_t gfn;
 	struct page *pages[PTE_PREFETCH_NUM];
 
-	if (pte_prefetch_topup_memory_cache(vcpu, end - start))
+	if (pte_prefetch_topup_memory_cache(vcpu, end - start)) {
+		trace_pte_prefetch(true, address, 0,
+				   PREFETCH_ERR_ALLOC_MEM);
 		return -1;
+	}
 
 	gfn = sp->gfn + start - sp->spt;
 	while (start < end) {
@@ -2016,27 +2025,40 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 		int entry, j, ret;
 
 		addr = gfn_to_hva_many(vcpu->kvm, gfn, &entry);
-		if (kvm_is_error_hva(addr))
+		if (kvm_is_error_hva(addr)) {
+			trace_pte_prefetch(true, address, 0,
+					   PREFETCH_ERR_MMIO);
 			return -1;
+		}
 
 		entry = min(entry, (int)(end - start));
 		ret = __get_user_pages_fast(addr, entry, 1, pages);
-		if (ret <= 0)
+		if (ret <= 0) {
+			trace_pte_prefetch(true, address, 0,
+					   PREFETCH_ERR_GFN2PFN);
 			return -1;
+		}
 
-		for (j = 0; j < ret; j++, gfn++, start++)
+		for (j = 0; j < ret; j++, gfn++, start++) {
+			trace_pte_prefetch(true, address, 0,
+					   PREFETCH_SUCCESS);
 			mmu_set_spte(vcpu, start, ACC_ALL,
 				     sp->role.access, 0, 0, 1, NULL,
 				     sp->role.level, gfn,
 				     page_to_pfn(pages[j]), true, false);
+		}
 
-		if (ret < entry)
+		if (ret < entry) {
+			trace_pte_prefetch(true, address, 0,
+					   PREFETCH_ERR_GFN2PFN);
 			return -1;
+		}
 	}
 	return 0;
 }
 
-static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
+static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep,
+				u64 addr)
 {
 	struct kvm_mmu_page *sp;
 	u64 *start = NULL;
@@ -2058,7 +2080,8 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 		if (*spte != shadow_trap_nonpresent_pte || spte == sptep) {
 			if (!start)
 				continue;
-			if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
+			if (direct_pte_prefetch_many(vcpu, sp, start,
+							spte, addr) < 0)
 				break;
 			start = NULL;
 		} else if (!start)
@@ -2080,7 +2103,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 				     0, write, 1, &pt_write,
 				     level, gfn, pfn, false, true);
 			++vcpu->stat.pf_fixed;
-			direct_pte_prefetch(vcpu, iterator.sptep);
+			direct_pte_prefetch(vcpu, iterator.sptep,
+						gfn << PAGE_SHIFT);
 			break;
 		}
 
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 3aab0f0..c07b6a6 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -195,6 +195,39 @@ DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_prepare_zap_page,
 
 	TP_ARGS(sp)
 );
+
+#define pte_prefetch_err					\
+	{PREFETCH_SUCCESS,		"SUCCESS"	},	\
+	{PREFETCH_ERR_GFN2PFN,		"ERR_GFN2PFN"	},	\
+	{PREFETCH_ERR_ALLOC_MEM,	"ERR_ALLOC_MEM"	},	\
+	{PREFETCH_ERR_RSVD_BITS_SET,	"ERR_RSVD_BITS_SET"},	\
+	{PREFETCH_ERR_MMIO,		"ERR_MMIO"	}
+
+TRACE_EVENT(
+	pte_prefetch,
+	TP_PROTO(bool direct, u64 addr, u64 gpte, int err_code),
+
+	TP_ARGS(direct, addr, gpte, err_code),
+
+	TP_STRUCT__entry(
+		__field(bool, direct)
+		__field(u64, addr)
+		__field(u64, gpte)
+		__field(int, err_code)
+	),
+
+	TP_fast_assign(
+		__entry->direct = direct;
+		__entry->addr = addr;
+		__entry->gpte = gpte;
+		__entry->err_code = err_code;
+	),
+
+	TP_printk("%s address:%llx gpte:%llx %s",
+		  __entry->direct ? "direct" : "indirect",
+		  __entry->addr, __entry->gpte,
+		  __print_symbolic(__entry->err_code, pte_prefetch_err))
+	);
 #endif /* _TRACE_KVMMMU_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 641c2ca..7256a10 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -318,7 +318,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 }
 
 static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu,
-				struct guest_walker *gw, u64 *sptep)
+				struct guest_walker *gw, u64 *sptep, u64 addr)
 {
 	struct kvm_mmu_page *sp;
 	pt_element_t *gptep;
@@ -331,7 +331,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu,
 		return;
 
 	if (sp->role.direct)
-		return direct_pte_prefetch(vcpu, sptep);
+		return direct_pte_prefetch(vcpu, sptep, addr);
 
 	index = sptep - sp->spt;
 	i = index & ~(PTE_PREFETCH_NUM - 1);
@@ -354,27 +354,42 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu,
 
 		gpte = gptep[j];
 
-		if (is_rsvd_bits_set(vcpu, gpte, PT_PAGE_TABLE_LEVEL))
+		if (is_rsvd_bits_set(vcpu, gpte, PT_PAGE_TABLE_LEVEL)) {
+			trace_pte_prefetch(false, addr, gpte,
+						PREFETCH_ERR_RSVD_BITS_SET);
 			break;
+		}
 
 		if (!(gpte & PT_ACCESSED_MASK))
 			continue;
 
 		if (!is_present_gpte(gpte)) {
-			if (!sp->unsync)
+			if (!sp->unsync) {
+				trace_pte_prefetch(false, addr, gpte,
+							PREFETCH_SUCCESS);
 				__set_spte(spte, shadow_notrap_nonpresent_pte);
+			}
 			continue;
 		}
 
 		gfn = gpte_to_gfn(gpte);
 
 		pfn = gfn_to_pfn_atomic(vcpu->kvm, gfn);
-		if (is_error_pfn(pfn) ||
-			pte_prefetch_topup_memory_cache(vcpu, 1)) {
+		if (is_error_pfn(pfn)) {
+			trace_pte_prefetch(false, addr, gpte,
+						PREFETCH_ERR_GFN2PFN);
+			kvm_release_pfn_clean(pfn);
+			break;
+		}
+
+		if (pte_prefetch_topup_memory_cache(vcpu, 1)) {
+			trace_pte_prefetch(false, addr, gpte,
+						PREFETCH_ERR_ALLOC_MEM);
 			kvm_release_pfn_clean(pfn);
 			break;
 		}
 
+		trace_pte_prefetch(false, addr, gpte, PREFETCH_SUCCESS);
 		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
 		mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
 			     is_dirty_gpte(gpte), NULL, sp->role.level, gfn,
@@ -413,7 +428,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 				     user_fault, write_fault,
 				     dirty, ptwrite, level,
 				     gw->gfn, pfn, false, true);
-			FNAME(pte_prefetch)(vcpu, gw, sptep);
+			FNAME(pte_prefetch)(vcpu, gw, sptep, addr);
 			break;
 		}
 
-- 
1.6.1.2

* Re: [PATCH v2 2/10] KVM: MMU: fix conflict access permissions in direct sp
  2010-06-25 12:05 ` [PATCH v2 2/10] KVM: MMU: fix conflict access permissions in direct sp Xiao Guangrong
@ 2010-06-28  9:43   ` Avi Kivity
  2010-06-28  9:49     ` Xiao Guangrong
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2010-06-28  9:43 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM list

On 06/25/2010 03:05 PM, Xiao Guangrong wrote:
> In no-direct mapping, we mark sp is 'direct' when we mapping the
> guest's larger page, but its access is encoded form upper page-struct
> entire not include the last mapping, it will cause access conflict.
>
> For example, have this mapping:
>          [W]
>        / PDE1 ->  |---|
>    P[W]          |   | LPA
>        \ PDE2 ->  |---|
>          [R]
>
> P have two children, PDE1 and PDE2, both PDE1 and PDE2 mapping the
> same lage page(LPA). The P's access is WR, PDE1's access is WR,
> PDE2's access is RO(just consider read-write permissions here)
>
> When guest access PDE1, we will create a direct sp for LPA, the sp's
> access is from P, is W, then we will mark the ptes is W in this sp.
>
> Then, guest access PDE2, we will find LPA's shadow page, is the same as
> PDE's, and mark the ptes is RO.
>
> So, if guest access PDE1, the incorrect #PF is occured.
>
> Fixed by encode the last mapping access into direct shadow page
>
> And, it also cleanup the code that directly get the last level's dirty flag
>
>    

Looks good, but please split the cleanup from the fix (we'll want to 
backport the fix but not the cleanup).

-- 
error compiling committee.c: too many arguments to function

* Re: [PATCH v2 2/10] KVM: MMU: fix conflict access permissions in direct sp
  2010-06-28  9:43   ` Avi Kivity
@ 2010-06-28  9:49     ` Xiao Guangrong
  0 siblings, 0 replies; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-28  9:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list



Avi Kivity wrote:

>>    
> 
> Looks good, but please split the cleanup from the fix (we'll want to
> backport the fix but not the cleanup).
> 

OK, will do it

* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corrupted
  2010-06-25 12:06 ` [PATCH v2 3/10] KVM: MMU: fix direct sp's access corrupted Xiao Guangrong
@ 2010-06-28  9:50   ` Avi Kivity
  2010-06-28 10:02     ` Xiao Guangrong
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2010-06-28  9:50 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM list

On 06/25/2010 03:06 PM, Xiao Guangrong wrote:
> Consider using small page to fit guest's large page mapping:
>
> If the mapping is writable but the dirty flag is not set, we will find
> the read-only direct sp and setup the mapping, then if the write #PF
> occur, we will mark this mapping writable in the read-only direct sp,
> now, other real read-only mapping will happily write it without #PF.
>
> It may hurt guest's COW
>
> Fixed by re-install the mapping when write #PF occur.
>
> Signed-off-by: Xiao Guangrong<xiaoguangrong@cn.fujitsu.com>
> ---
>   arch/x86/kvm/mmu.c         |    3 ++-
>   arch/x86/kvm/paging_tmpl.h |   18 ++++++++++++++++++
>   2 files changed, 20 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 556a798..0412ba4 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -153,7 +153,8 @@ module_param(oos_shadow, bool, 0644);
>   #define CREATE_TRACE_POINTS
>   #include "mmutrace.h"
>
> -#define SPTE_HOST_WRITEABLE (1ULL<<  PT_FIRST_AVAIL_BITS_SHIFT)
> +#define SPTE_HOST_WRITEABLE	(1ULL<<  PT_FIRST_AVAIL_BITS_SHIFT)
> +#define SPTE_NO_DIRTY		(2ULL<<  PT_FIRST_AVAIL_BITS_SHIFT)
>
>   #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
>
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index e46eb8a..fdba751 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -325,6 +325,20 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>   			break;
>   		}
>
> +		if (*sptep&  SPTE_NO_DIRTY) {
> +			struct kvm_mmu_page *child;
> +
> +			WARN_ON(level !=  gw->level);
> +			WARN_ON(!is_shadow_present_pte(*sptep));
> +			if (dirty) {
> +				child = page_header(*sptep&
> +						      PT64_BASE_ADDR_MASK);
> +				mmu_page_remove_parent_pte(child, sptep);
> +				__set_spte(sptep, shadow_trap_nonpresent_pte);
> +				kvm_flush_remote_tlbs(vcpu->kvm);
> +			}
> +		}
> +
>    

Instead of adding a new bit, can you encode the protection in the direct 
sp's access bits?  So we'll have one sp for read-only or 
writeable-but-not-dirty small pages, and another sp for 
writeable-and-dirty small pages.

-- 
error compiling committee.c: too many arguments to function

* Re: [PATCH v2 4/10] KVM: MMU: fix forgot to flush all vcpu's tlb
  2010-06-25 12:06 ` [PATCH v2 4/10] KVM: MMU: fix forgot to flush all vcpu's tlb Xiao Guangrong
@ 2010-06-28  9:55   ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2010-06-28  9:55 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM list

On 06/25/2010 03:06 PM, Xiao Guangrong wrote:
> After remove a rmap, we should flush all vcpu's tlb
>
> Signed-off-by: Xiao Guangrong<xiaoguangrong@cn.fujitsu.com>
> ---
>   arch/x86/kvm/mmu.c |    2 ++
>   1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 0412ba4..f151540 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1933,6 +1933,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>   			pgprintk("hfn old %lx new %lx\n",
>   				 spte_to_pfn(*sptep), pfn);
>   			rmap_remove(vcpu->kvm, sptep);
> +			__set_spte(sptep, shadow_trap_nonpresent_pte);
> +			kvm_flush_remote_tlbs(vcpu->kvm);
>   		} else
>   			was_rmapped = 1;
>   	}
>    

Good catch.


-- 
error compiling committee.c: too many arguments to function

* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corrupted
  2010-06-28  9:50   ` Avi Kivity
@ 2010-06-28 10:02     ` Xiao Guangrong
  2010-06-28 11:13       ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-28 10:02 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list



Avi Kivity wrote:

> 
> Instead of adding a new bit, can you encode the protection in the direct
> sp's access bits?  So we'll have one sp for read-only or
> writeable-but-not-dirty small pages, and another sp for
> writeable-and-dirty small pages.
> 

It looks like that can't solve all the problems: it fixes the access
corruption, but it will cause the D bit to be lost:

mapping A and mapping B both are writable-and-dirty; when a write #PF
occurs through mapping A, the mapping is already writable, so we can't
set B's D bit anymore.

Anyway, I think we should re-install the mapping when the state is changed. :-(


* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corrupted
  2010-06-28 10:02     ` Xiao Guangrong
@ 2010-06-28 11:13       ` Avi Kivity
  2010-06-29  1:17         ` Xiao Guangrong
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2010-06-28 11:13 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM list

On 06/28/2010 01:02 PM, Xiao Guangrong wrote:
>
> Avi Kivity wrote:
>
>    
>> Instead of adding a new bit, can you encode the protection in the direct
>> sp's access bits?  So we'll have one sp for read-only or
>> writeable-but-not-dirty small pages, and another sp for
>> writeable-and-dirty small pages.
>>
>>      
> It looks like it can't solve all problems, it fix the access corrupted,
> but will cause D bit losed:
>
> mapping A and mapping B both are writable-and-dirty, when mapping A write
> #PF occurs, the mapping is writable, then we can't set B's D bit anymore.
>    

If B is writeable-and-dirty, then its D bit is already set, and we 
don't need to do anything.

If B is writeable-and-clean, then we'll have an spte pointing to a 
read-only sp, so we'll get a write fault on access and an opportunity to 
set the D bit.

> Anyway, i think we should re-intall the mapping when the state is changed. :-(
>    

When the gpte is changed from read-only to writeable or from clean to 
dirty, we need to update the spte, yes.  But that's true for other sptes 
as well, not just large gptes.

-- 
error compiling committee.c: too many arguments to function

* Re: [PATCH v2 7/10] KVM: MMU: introduce mmu_topup_memory_cache_atomic()
  2010-06-25 12:06 ` [PATCH v2 7/10] KVM: MMU: introduce mmu_topup_memory_cache_atomic() Xiao Guangrong
@ 2010-06-28 11:17   ` Avi Kivity
  2010-06-29  1:18     ` Xiao Guangrong
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2010-06-28 11:17 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM list

On 06/25/2010 03:06 PM, Xiao Guangrong wrote:
> Introduce mmu_topup_memory_cache_atomic(), it support topup memory
> cache in atomic context
>
>    

Can you instead preallocate enough for all the prefetches? Isn't that simpler?

-- 
error compiling committee.c: too many arguments to function

* Re: [PATCH v2 8/10] KVM: MMU: prefetch ptes when intercepted guest #PF
  2010-06-25 12:07 ` [PATCH v2 8/10] KVM: MMU: prefetch ptes when intercepted guest #PF Xiao Guangrong
@ 2010-06-28 13:04   ` Marcelo Tosatti
  2010-06-29  8:07     ` Xiao Guangrong
  0 siblings, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2010-06-28 13:04 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM list

On Fri, Jun 25, 2010 at 08:07:06PM +0800, Xiao Guangrong wrote:
> Support prefetch ptes when intercept guest #PF, avoid to #PF by later
> access
> 
> If we meet any failure in the prefetch path, we will exit it and
> not try other ptes to avoid become heavy path
> 
> Note: this speculative will mark page become dirty but it not really
> accessed, the same issue is in other speculative paths like invlpg,
> pte write, fortunately, it just affect host memory management. After
> Avi's patchset named "[PATCH v2 1/4] KVM: MMU: Introduce drop_spte()"
> merged, we will easily fix it. Will do it in the future.
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
> ---
>  arch/x86/kvm/mmu.c         |   69 +++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/paging_tmpl.h |   74 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 143 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 6c06666..b2ad723 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -89,6 +89,8 @@ module_param(oos_shadow, bool, 0644);
>  	}
>  #endif
>  
> +#define PTE_PREFETCH_NUM	16
> +
>  #define PT_FIRST_AVAIL_BITS_SHIFT 9
>  #define PT64_SECOND_AVAIL_BITS_SHIFT 52
>  
> @@ -1998,6 +2000,72 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
>  {
>  }
>  
> +static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
> +				    struct kvm_mmu_page *sp,
> +				    u64 *start, u64 *end)
> +{
> +	gfn_t gfn;
> +	struct page *pages[PTE_PREFETCH_NUM];
> +
> +	if (pte_prefetch_topup_memory_cache(vcpu, end - start))
> +		return -1;
> +
> +	gfn = sp->gfn + start - sp->spt;
> +	while (start < end) {
> +		unsigned long addr;
> +		int entry, j, ret;
> +
> +		addr = gfn_to_hva_many(vcpu->kvm, gfn, &entry);
> +		if (kvm_is_error_hva(addr))
> +			return -1;
> +
> +		entry = min(entry, (int)(end - start));
> +		ret = __get_user_pages_fast(addr, entry, 1, pages);
> +		if (ret <= 0)
> +			return -1;
> +
> +		for (j = 0; j < ret; j++, gfn++, start++)
> +			mmu_set_spte(vcpu, start, ACC_ALL,
> +				     sp->role.access, 0, 0, 1, NULL,
> +				     sp->role.level, gfn,
> +				     page_to_pfn(pages[j]), true, false);
> +
> +		if (ret < entry)
> +			return -1;
> +	}
> +	return 0;
> +}
> +
> +static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
> +{
> +	struct kvm_mmu_page *sp;
> +	u64 *start = NULL;
> +	int index, i, max;
> +
> +	sp = page_header(__pa(sptep));
> +	WARN_ON(!sp->role.direct);
> +
> +	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
> +		return;
> +
> +	index = sptep - sp->spt;
> +	i = index & ~(PTE_PREFETCH_NUM - 1);
> +	max = index | (PTE_PREFETCH_NUM - 1);
> +
> +	for (; i < max; i++) {
> +		u64 *spte = sp->spt + i;
> +
> +		if (*spte != shadow_trap_nonpresent_pte || spte == sptep) {
> +			if (!start)
> +				continue;
> +			if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
> +				break;
> +			start = NULL;
> +		} else if (!start)
> +			start = spte;
> +	}
> +}
> +
>  static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
>  			int level, gfn_t gfn, pfn_t pfn)
>  {
> @@ -2012,6 +2080,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
>  				     0, write, 1, &pt_write,
>  				     level, gfn, pfn, false, true);
>  			++vcpu->stat.pf_fixed;
> +			direct_pte_prefetch(vcpu, iterator.sptep);
>  			break;
>  		}
>  
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index fdba751..134f031 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -291,6 +291,79 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  		     gpte_to_gfn(gpte), pfn, true, true);
>  }
>  
> +static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, u64 *sptep)
> +{
> +	struct kvm_mmu_page *sp;
> +	pt_element_t gptep[PTE_PREFETCH_NUM];
> +	gpa_t first_pte_gpa;
> +	int offset = 0, index, i, j, max;
> +
> +	sp = page_header(__pa(sptep));
> +	index = sptep - sp->spt;
> +
> +	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
> +		return;
> +
> +	if (sp->role.direct)
> +		return direct_pte_prefetch(vcpu, sptep);

Can never happen.

> +
> +	index = sptep - sp->spt;
> +	i = index & ~(PTE_PREFETCH_NUM - 1);
> +	max = index | (PTE_PREFETCH_NUM - 1);
> +
> +	if (PTTYPE == 32)
> +		offset = sp->role.quadrant << PT64_LEVEL_BITS;
> +
> +	first_pte_gpa = gfn_to_gpa(sp->gfn) +
> +				(offset + i) * sizeof(pt_element_t);
> +
> +	if (kvm_read_guest_atomic(vcpu->kvm, first_pte_gpa, gptep,
> +					sizeof(gptep)) < 0)
> +		return;
> +
> +	for (j = 0; i < max; i++, j++) {
> +		pt_element_t gpte;
> +		unsigned pte_access;
> +		u64 *spte = sp->spt + i;
> +		gfn_t gfn;
> +		pfn_t pfn;
> +
> +		if (spte == sptep)
> +			continue;
> +
> +		if (*spte != shadow_trap_nonpresent_pte)
> +			continue;
> +
> +		gpte = gptep[j];
> +
> +		if (is_rsvd_bits_set(vcpu, gpte, PT_PAGE_TABLE_LEVEL))
> +			break;
> +
> +		if (!(gpte & PT_ACCESSED_MASK))
> +			continue;
> +
> +		if (!is_present_gpte(gpte)) {
> +			if (!sp->unsync)
> +				__set_spte(spte, shadow_notrap_nonpresent_pte);
> +			continue;
> +		}
> +
> +		gfn = gpte_to_gfn(gpte);
> +
> +		pfn = gfn_to_pfn_atomic(vcpu->kvm, gfn);
> +		if (is_error_pfn(pfn) ||
> +			pte_prefetch_topup_memory_cache(vcpu, 1)) {
> +			kvm_release_pfn_clean(pfn);
> +			break;
> +		}
> +
> +		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
> +		mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
> +			     is_dirty_gpte(gpte), NULL, sp->role.level, gfn,
> +			     pfn, true, false);
> +	}
> +}
> +
>  /*
>   * Fetch a shadow pte for a specific level in the paging hierarchy.
>   */
> @@ -322,6 +395,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>  				     user_fault, write_fault,
>  				     dirty, ptwrite, level,
>  				     gw->gfn, pfn, false, true);
> +			FNAME(pte_prefetch)(vcpu, sptep);
>  			break;
>  		}


I'm afraid this can introduce regressions since it increases mmu_lock
contention. Can you get some numbers with a 4-vcpu or 8-vcpu guest and
multi-threaded benchmarks, such as kernbench and apachebench? (on
non-EPT).

Also, prefetch should be disabled for EPT, due to the lack of an accessed
bit.


* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corrupted
  2010-06-28 11:13       ` Avi Kivity
@ 2010-06-29  1:17         ` Xiao Guangrong
  2010-06-29  7:06           ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-29  1:17 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list



Avi Kivity wrote:
> On 06/28/2010 01:02 PM, Xiao Guangrong wrote:
>>
>> Avi Kivity wrote:
>>
>>   
>>> Instead of adding a new bit, can you encode the protection in the direct
>>> sp's access bits?  So we'll have one sp for read-only or
>>> writeable-but-not-dirty small pages, and another sp for
>>> writeable-and-dirty small pages.
>>>
>>>      
>> It looks like it can't solve all problems, it fix the access corrupted,
>> but will cause D bit losed:
>>
>> mapping A and mapping B both are writable-and-dirty, when mapping A write
>> #PF occurs, the mapping is writable, then we can't set B's D bit anymore.
>>    
> 
> If B is writeable-and-dirty, then it's D bit is already set, and we
> don't need to do anything.
> 
> If B is writeable-and-clean, then we'll have an spte pointing to a
> read-only sp, so we'll get a write fault on access and an opportunity to
> set the D bit.
> 

Sorry, a typo in my reply: I meant that mapping A and B are both
writable-and-clean.  When a write #PF occurs through A, we should change
A's spte to map a writable sp; if we only update the spte in the
writable-and-clean sp (from read-only to writable), B's D bit will never
get set.

>> Anyway, i think we should re-intall the mapping when the state is
>> changed. :-(
>>    
> 
> When the gpte is changed from read-only to writeable or from clean to
> dirty, we need to update the spte, yes.  But that's true for other sptes
> as well, not just large gptes.
> 

I think the indirect sp is not hurt by this bug, since for an indirect sp
the access comes only from its upper levels and the D bit lives only in
the last level, so changing the pte's access does not affect the sp's
access.

But for a direct sp, the sp's access is built from all levels, and
different mappings that do not share the last-level mapping page will
still end up with the same last-level sp.

* Re: [PATCH v2 7/10] KVM: MMU: introduce mmu_topup_memory_cache_atomic()
  2010-06-28 11:17   ` Avi Kivity
@ 2010-06-29  1:18     ` Xiao Guangrong
  0 siblings, 0 replies; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-29  1:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list



Avi Kivity wrote:
> On 06/25/2010 03:06 PM, Xiao Guangrong wrote:
>> Introduce mmu_topup_memory_cache_atomic(), it support topup memory
>> cache in atomic context
>>
>>    
> 
> Can instead preallocate enough for all prefetches, isn't it simpler?

Good idea, will do it in the next version, thanks.


* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corrupted
  2010-06-29  1:17         ` Xiao Guangrong
@ 2010-06-29  7:06           ` Avi Kivity
  2010-06-29  7:35             ` Xiao Guangrong
  2010-06-29  7:38             ` Avi Kivity
  0 siblings, 2 replies; 32+ messages in thread
From: Avi Kivity @ 2010-06-29  7:06 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM list

On 06/29/2010 04:17 AM, Xiao Guangrong wrote:
>
>> If B is writeable-and-dirty, then it's D bit is already set, and we
>> don't need to do anything.
>>
>> If B is writeable-and-clean, then we'll have an spte pointing to a
>> read-only sp, so we'll get a write fault on access and an opportunity to
>> set the D bit.
>>
>>      
> Sorry, a typo in my reply, i mean mapping A and B both are writable-and-clean,
> while A occurs write-#PF, we should change A's spte map to writable sp, if we
> only update the spte in writable-and-clean sp(form readonly to writable), the B's
> D bit will miss set.
>    

Right.

We need to update something to notice this:

  - FNAME(fetch)() to replace the spte
  - FNAME(walk_addr)() to invalidate the spte

I think FNAME(walk_addr) is the right place, we're updating the gpte, so 
we should update the spte at the same time, just like a guest write.  
But that will be expensive (there could be many sptes, so we have to 
call kvm_mmu_pte_write()), so perhaps FNAME(fetch) is easier.

We have now

         if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
             continue;

So we need to add a check, if sp->role.access doesn't match pt_access & 
pte_access, we need to get a new sp with the correct access (can only 
change read->write).

>>> Anyway, i think we should re-intall the mapping when the state is
>>> changed. :-(
>>>
>>>        
>> When the gpte is changed from read-only to writeable or from clean to
>> dirty, we need to update the spte, yes.  But that's true for other sptes
>> as well, not just large gptes.
>>
>>      
> I think the indirect sp is not hurt by this bug since for the indirect sp, the access
> just form its upper-level, and the D bit is only in the last level, when we change the
> pte's access, is not affect its sp's access.
>
> But for direct sp, the sp's access is form all level. and different mapping that not share
> the last mapping page will have the same last sp.
>    

Right.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corrupted
  2010-06-29  7:06           ` Avi Kivity
@ 2010-06-29  7:35             ` Xiao Guangrong
  2010-06-29  8:49               ` Avi Kivity
  2010-06-29  7:38             ` Avi Kivity
  1 sibling, 1 reply; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-29  7:35 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list



Avi Kivity wrote:
> On 06/29/2010 04:17 AM, Xiao Guangrong wrote:
>>
>>> If B is writeable-and-dirty, then it's D bit is already set, and we
>>> don't need to do anything.
>>>
>>> If B is writeable-and-clean, then we'll have an spte pointing to a
>>> read-only sp, so we'll get a write fault on access and an opportunity to
>>> set the D bit.
>>>
>>>      
>> Sorry, a typo in my reply, i mean mapping A and B both are
>> writable-and-clean,
>> while A occurs write-#PF, we should change A's spte map to writable
>> sp, if we
>> only update the spte in writable-and-clean sp(form readonly to
>> writable), the B's
>> D bit will miss set.
>>    
> 
> Right.
> 
> We need to update something to notice this:
> 
>  - FNAME(fetch)() to replace the spte
>  - FNAME(walk_addr)() to invalidate the spte
> 
> I think FNAME(walk_addr) is the right place, we're updating the gpte, so
> we should update the spte at the same time, just like a guest write. 
> But that will be expensive (there could be many sptes, so we have to
> call kvm_mmu_pte_write()), so perhaps FNAME(fetch) is easier.
> 

I agree.

> We have now
> 
>         if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
>             continue;
> 
> So we need to add a check, if sp->role.access doesn't match pt_access &
> pte_access, we need to get a new sp with the correct access (can only
> change read->write).
> 

Umm, we should update the spte at gw->level, so we need to get the child
sp and compare its access at this point, like this:

if (level == gw->level && is_shadow_present_pte(*sptep)) {
	child_sp = page_header(*sptep & PT64_BASE_ADDR_MASK);

	if (child_sp->role.access !=
	      (pt_access & pte_access & (dirty ? ~0u : ~ACC_WRITE_MASK))) {
		/* Zap sptep */
		......
	}
}

So, why not use the new spte flag (SPTE_NO_DIRTY in my patch) to mark this spte, so we
can tell directly whether the spte needs updating? I think it is simpler ;-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corruptted
  2010-06-29  7:06           ` Avi Kivity
  2010-06-29  7:35             ` Xiao Guangrong
@ 2010-06-29  7:38             ` Avi Kivity
  2010-06-29  7:45               ` Xiao Guangrong
  1 sibling, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2010-06-29  7:38 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM list

On 06/29/2010 10:06 AM, Avi Kivity wrote:
> On 06/29/2010 04:17 AM, Xiao Guangrong wrote:
>>
>>> If B is writeable-and-dirty, then its D bit is already set, and we
>>> don't need to do anything.
>>>
>>> If B is writeable-and-clean, then we'll have an spte pointing to a
>>> read-only sp, so we'll get a write fault on access and an 
>>> opportunity to
>>> set the D bit.
>>>
>> Sorry, a typo in my reply. I meant that mappings A and B are both
>> writable-and-clean; when A takes a write #PF, we should switch A's
>> spte to map a writable sp. If we only update the spte inside the
>> writable-and-clean sp (from read-only to writable), B's D bit will
>> never get set.
>
> Right.
>
> We need to update something to notice this:
>
>  - FNAME(fetch)() to replace the spte
>  - FNAME(walk_addr)() to invalidate the spte
>
> I think FNAME(walk_addr) is the right place, we're updating the gpte, 
> so we should update the spte at the same time, just like a guest 
> write.  But that will be expensive (there could be many sptes, so we 
> have to call kvm_mmu_pte_write()), so perhaps FNAME(fetch) is easier.
>
> We have now
>
>         if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
>             continue;
>
> So we need to add a check, if sp->role.access doesn't match pt_access 
> & pte_access, we need to get a new sp with the correct access (can 
> only change read->write).

Note:

- modifying walk_addr() to call kvm_mmu_pte_write() is probably not so 
bad.  It's rare that a large pte walk sets the dirty bit, and it's 
probably rare to share those large ptes.  Still, I think the fetch() 
change is better since it's more local.

- there was once talk that instead of folding pt_access and pte_access 
together into the leaf sp->role.access, each sp level would have its own 
access permissions.  In this case we don't even have to get a new direct 
sp, only change the PT_DIRECTORY_LEVEL spte to add write permissions 
(all direct sp's would be writeable and permissions would be controlled 
at their parent_pte level).  Of course that's a much bigger change than 
this bug fix.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corruptted
  2010-06-29  7:38             ` Avi Kivity
@ 2010-06-29  7:45               ` Xiao Guangrong
  2010-06-29  8:51                 ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-29  7:45 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list



Avi Kivity wrote:

> 
> Note:
> 
> - modifying walk_addr() to call kvm_mmu_pte_write() is probably not so
> bad.  It's rare that a large pte walk sets the dirty bit, and it's
> probably rare to share those large ptes.  Still, I think the fetch()
> change is better since it's more local.
> 
> - there was once talk that instead of folding pt_access and pte_access
> together into the leaf sp->role.access, each sp level would have its own
> access permissions.  In this case we don't even have to get a new direct
> sp, only change the PT_DIRECTORY_LEVEL spte to add write permissions
> (all direct sp's would be writeable and permissions would be controlled
> at their parent_pte level).  Of course that's a much bigger change than
> this bug fix.
> 

Yeah, I have considered this approach, but it changes the way shadow pages are
mapped: it controls the access at the upper level, whereas in the current code we
give the upper levels ACC_ALL and control the access rights at the last level.
It would break many things, such as write protection...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 8/10] KVM: MMU: prefetch ptes when intercepted guest #PF
  2010-06-28 13:04   ` Marcelo Tosatti
@ 2010-06-29  8:07     ` Xiao Guangrong
  2010-06-29 11:44       ` Marcelo Tosatti
  0 siblings, 1 reply; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-29  8:07 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM list



Marcelo Tosatti wrote:

>> +
>> +	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
>> +		return;
>> +
>> +	if (sp->role.direct)
>> +		return direct_pte_prefetch(vcpu, sptep);
> 
> Can never happen.
> 

Marcelo,

Thanks for your comment. Do you mean we can never see sp->role.direct here?
Could you please tell me why? In my tests it can be triggered.


>> @@ -322,6 +395,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>>  				     user_fault, write_fault,
>>  				     dirty, ptwrite, level,
>>  				     gw->gfn, pfn, false, true);
>> +			FNAME(pte_prefetch)(vcpu, sptep);
>>  			break;
>>  		}
> 
> 
> I'm afraid this can introduce regressions since it increases mmu_lock
> contention. Can you get some numbers with 4-vcpu or 8-vcpu guest and
> many threads benchmarks, such as kernbench and apachebench? (on
> non-EPT).
> 

The pte prefetch is on the fast path and takes very little time; in the worst
case we only need to read 128 bytes of guest ptes. If the prefetch succeeds,
the #PFs caused by later accesses are avoided, so we skip exiting from the
guest, walking the guest ptes, walking the shadow pages, flushing the local
tlb... a lot of work is saved.
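
Roughly, the shape is like this (a simplified sketch only, not the code in
patch 8/10; read_gpte_window() and prefetch_one_gpte() are illustrative
names, not real helpers):

	/*
	 * On a shadow #PF, read a small window of neighbouring guest ptes
	 * in one go (at most 16 * 8 = 128 bytes, the worst case above) and
	 * map the ones that are already valid, so later accesses to those
	 * pages do not fault again.
	 */
	pt_element_t gptes[16];
	int i;

	if (FNAME(read_gpte_window)(vcpu, gw, gptes))
		return;

	for (i = 0; i < 16; i++) {
		if (!is_present_gpte(gptes[i]))
			continue;
		FNAME(prefetch_one_gpte)(vcpu, gw, i, gptes[i]);
	}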

Before I posted this patchset, I ran a performance test with unixbench;
it improved ~3.6% in the EPT-disabled case
(it's in the first version's changelog).

Today I ran kernbench with 4 vcpus and 1G of memory, and the result shows a
~1.6% improvement :-)

> Also prefetch should be disabled for EPT, due to lack of accessed bit.
> 

But we call mmu_set_spte() with speculative == false, so it does not touch the accessed bit.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corruptted
  2010-06-29  7:35             ` Xiao Guangrong
@ 2010-06-29  8:49               ` Avi Kivity
  2010-06-29  9:04                 ` Xiao Guangrong
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2010-06-29  8:49 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM list

On 06/29/2010 10:35 AM, Xiao Guangrong wrote:
>
>> We have now
>>
>>          if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
>>              continue;
>>
>> So we need to add a check, if sp->role.access doesn't match pt_access &
>> pte_access, we need to get a new sp with the correct access (can only
>> change read->write).
>>
>>      
> Umm, we should update the spte at the gw->level, so we need get the child
> sp, and compare its access at this point, just like this:
>
> if (level == gw->level && is_shadow_present_pte(*sptep)) {
> 	child_sp = page_header(*sptep & PT64_BASE_ADDR_MASK);
>
> 	if (child_sp->role.access !=
> 	      (pt_access & pte_access & (dirty ? ~0u : ~ACC_WRITE_MASK))) {
> 		/* Zap sptep */
> 		......
> 	}
> }
>
> So, why not use the new spte flag (SPTE_NO_DIRTY in my patch) to mark this spte then we can see
> this spte whether need updated directly? i think it more simpler ;-)
>    

It's new state, and new state means more maintenance of that state and 
the need to consider the state in all relevant code paths.

In terms of maintainability, changing walk_addr() is best, since it 
maintains the tight invariant that PT_PAGE_DIRECTORY_LEVEL sptes are 
always consistent with their gptes.  Updating fetch() to allow for a 
relaxed invariant (spte may be read-only while gpte is write-dirty) is 
more complicated, but performs better.  This is also consistent with 
what we do with PT_PAGE_TABLE_LEVEL gptes/sptes and with unsync pages.

btw, how can the patch work?

>
> +		if (level == gw->level && !dirty &&
> +		      access & gw->pte_access & ACC_WRITE_MASK)
> +			spte |= SPTE_NO_DIRTY;
> +
>   		spte = __pa(sp->spt)
>   			| PT_PRESENT_MASK | PT_ACCESSED_MASK
>   			| PT_WRITABLE_MASK | PT_USER_MASK;
>    

spte is immediately overwritten by the following assignment.

However, the other half of the patch can be adapted:

>
> +		if (*sptep & SPTE_NO_DIRTY) {
> +			struct kvm_mmu_page *child;
> +
> +			WARN_ON(level !=  gw->level);
> +			WARN_ON(!is_shadow_present_pte(*sptep));
> +			if (dirty) {
> +				child = page_header(*sptep &
> +						      PT64_BASE_ADDR_MASK);
> +				mmu_page_remove_parent_pte(child, sptep);
> +				__set_spte(sptep, shadow_trap_nonpresent_pte);
> +				kvm_flush_remote_tlbs(vcpu->kvm);
> +			}
> +		}
> +
>   		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
>   			continue;
>    

Simply replace (*spte & SPTE_NO_DIRTY) with a condition that checks 
whether sp->access is consistent with gw->pt(e)_access.
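
I.e. roughly (illustrative only, untested):

	if (is_shadow_present_pte(*sptep) && level == gw->level) {
		struct kvm_mmu_page *child;

		child = page_header(*sptep & PT64_BASE_ADDR_MASK);
		if (child->role.access !=
		    (access & gw->pte_access & (dirty ? ~0u : ~ACC_WRITE_MASK))) {
			/* zap as in the hunk above and fall through to remap */
		}
	}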

Can you write a test case for qemu-kvm.git/kvm/test that demonstrates 
the problem and the fix?  It will help ensure we don't regress in this area.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corruptted
  2010-06-29  7:45               ` Xiao Guangrong
@ 2010-06-29  8:51                 ` Avi Kivity
  2010-06-29  9:08                   ` Xiao Guangrong
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2010-06-29  8:51 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM list

On 06/29/2010 10:45 AM, Xiao Guangrong wrote:
>
>> - there was once talk that instead of folding pt_access and pte_access
>> together into the leaf sp->role.access, each sp level would have its own
>> access permissions.  In this case we don't even have to get a new direct
>> sp, only change the PT_DIRECTORY_LEVEL spte to add write permissions
>> (all direct sp's would be writeable and permissions would be controlled
>> at their parent_pte level).  Of course that's a much bigger change than
>> this bug fix.
>>
>>      
> Yeah, i have considered this way, but it will change the shadow page's mapping
> way: it control the access at the upper level, but in the current code, we allow
> the upper level have the ALL_ACCESS and control the access right at the last level.
> It will break many things, such as write-protected...
>    

spte's access bits have dual purpose, both to map guest protection and 
for host protection (like for shadowed pages, or ksm pages).  So the 
last level sptes still need to consider host write protection.
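
Roughly (an illustrative sketch, not the exact set_spte() code):

	/*
	 * Both the guest's permission and the host's write-protection
	 * requirements are folded into the final spte, so a last-level spte
	 * may be read-only even when the guest pte is writable.
	 */
	if (!(pte_access & ACC_WRITE_MASK) ||
	    mmu_need_write_protect(vcpu, gfn, can_unsync))
		spte &= ~PT_WRITABLE_MASK;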

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corruptted
  2010-06-29  8:49               ` Avi Kivity
@ 2010-06-29  9:04                 ` Xiao Guangrong
  2010-06-29  9:13                   ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-29  9:04 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list



Avi Kivity wrote:
> On 06/29/2010 10:35 AM, Xiao Guangrong wrote:
>>
>>> We have now
>>>
>>>          if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
>>>              continue;
>>>
>>> So we need to add a check, if sp->role.access doesn't match pt_access &
>>> pte_access, we need to get a new sp with the correct access (can only
>>> change read->write).
>>>
>>>      
>> Umm, we should update the spte at the gw->level, so we need get the child
>> sp, and compare its access at this point, just like this:
>>
>> if (level == gw->level && is_shadow_present_pte(*sptep)) {
>>     child_sp = page_header(*sptep & PT64_BASE_ADDR_MASK);
>>
>>     if (child_sp->role.access !=
>>           (pt_access & pte_access & (dirty ? ~0u : ~ACC_WRITE_MASK))) {
>>         /* Zap sptep */
>>         ......
>>     }
>> }
>>
>> So, why not use the new spte flag (SPTE_NO_DIRTY in my patch) to mark
>> this spte then we can see
>> this spte whether need updated directly? i think it more simpler ;-)
>>    
> 
> It's new state, and new state means more maintenance of that state and
> the need to consider the state in all relevant code paths.
> 
> In terms of maintainability, changing walk_addr() is best, since it
> maintains the tight invariant that PT_PAGE_DIRECTORY_LEVEL sptes are
> always consistent with their gptes.  Updating fetch() to allow for a
> relaxed invariant (spte may be read-only while gpte is write-dirty) is
> more complicated, but performs better.  This is also consistent with
> what we do with PT_PAGE_TABLE_LEVEL gptes/sptes and with unsync pages.
> 

Maybe you are right; I just thought it would be quicker to use the SPTE_NO_DIRTY flag
to judge whether the spte needs updating. I'll modify this patch as you suggest.

> btw, how can the patch work?
> 
>>
>> +        if (level == gw->level && !dirty &&
>> +              access & gw->pte_access & ACC_WRITE_MASK)
>> +            spte |= SPTE_NO_DIRTY;
>> +
>>           spte = __pa(sp->spt)
>>               | PT_PRESENT_MASK | PT_ACCESSED_MASK
>>               | PT_WRITABLE_MASK | PT_USER_MASK;
>>    
> 
> spte is immediately overwritten by the following assignment.
> 

Ah, sorry, I missed that; the spte |= SPTE_NO_DIRTY should come after the following assignment.

> However, the other half of the patch can be adapted:
> 
>>
>> +        if (*sptep & SPTE_NO_DIRTY) {
>> +            struct kvm_mmu_page *child;
>> +
>> +            WARN_ON(level !=  gw->level);
>> +            WARN_ON(!is_shadow_present_pte(*sptep));
>> +            if (dirty) {
>> +                child = page_header(*sptep &
>> +                              PT64_BASE_ADDR_MASK);
>> +                mmu_page_remove_parent_pte(child, sptep);
>> +                __set_spte(sptep, shadow_trap_nonpresent_pte);
>> +                kvm_flush_remote_tlbs(vcpu->kvm);
>> +            }
>> +        }
>> +
>>           if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
>>               continue;
>>    
> 
> Simply replace (*spte & SPTE_NO_DIRTY) with a condition that checks
> whether sp->access is consistent with gw->pt(e)_access.
> 

If the guest mapping is writable but not dirty, we set the SPTE_NO_DIRTY flag in
the spte. When the next #PF occurs, we just check this flag and see whether the
gpte's D bit is set; if it is, we zap this spte and map to the correct sp.

> Can you write a test case for qemu-kvm.git/kvm/test that demonstrates
> the problem and the fix?  It will help ensure we don't regress in this
> area.
> 

OK, but allow me to do it later :-)


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corruptted
  2010-06-29  8:51                 ` Avi Kivity
@ 2010-06-29  9:08                   ` Xiao Guangrong
  0 siblings, 0 replies; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-29  9:08 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list



Avi Kivity wrote:
> On 06/29/2010 10:45 AM, Xiao Guangrong wrote:
>>
>>> - there was once talk that instead of folding pt_access and pte_access
>>> together into the leaf sp->role.access, each sp level would have its own
>>> access permissions.  In this case we don't even have to get a new direct
>>> sp, only change the PT_DIRECTORY_LEVEL spte to add write permissions
>>> (all direct sp's would be writeable and permissions would be controlled
>>> at their parent_pte level).  Of course that's a much bigger change than
>>> this bug fix.
>>>
>>>      
>> Yeah, i have considered this way, but it will change the shadow page's
>> mapping
>> way: it control the access at the upper level, but in the current
>> code, we allow
>> the upper level have the ALL_ACCESS and control the access right at
>> the last level.
>> It will break many things, such as write-protected...
>>    
> 
> spte's access bits have dual purpose, both to map guest protection and
> for host protection (like for shadowed pages, or ksm pages).  So the
> last level sptes still need to consider host write protection.
> 

Yeah, I see what you mean, thanks :-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corruptted
  2010-06-29  9:04                 ` Xiao Guangrong
@ 2010-06-29  9:13                   ` Avi Kivity
  2010-06-29  9:13                     ` Xiao Guangrong
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2010-06-29  9:13 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM list

On 06/29/2010 12:04 PM, Xiao Guangrong wrote:
>
>> Simply replace (*spte & SPTE_NO_DIRTY) with a condition that checks
>> whether sp->access is consistent with gw->pt(e)_access.
>>
>>      
> If the guest mapping is writable and it !dirty, we mark SPTE_NO_DIRTY flag in
> the spte, when the next #PF occurs, we just need check this flag and see whether
> gpte's D bit is set, if it's true, we zap this spte and map to the correct sp.
>    

My point is, SPTE_NO_DIRTY is equivalent to an sp->role.access check 
(the access check is a bit slower, but that shouldn't matter).


>> Can you write a test case for qemu-kvm.git/kvm/test that demonstrates
>> the problem and the fix?  It will help ensure we don't regress in this
>> area.
>>
>>      
> OK, but allow me do it later :-)
>
>    

Sure, but please do it soon.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 3/10] KVM: MMU: fix direct sp's access corruptted
  2010-06-29  9:13                   ` Avi Kivity
@ 2010-06-29  9:13                     ` Xiao Guangrong
  0 siblings, 0 replies; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-29  9:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM list



Avi Kivity wrote:
> On 06/29/2010 12:04 PM, Xiao Guangrong wrote:
>>
>>> Simply replace (*spte & SPTE_NO_DIRTY) with a condition that checks
>>> whether sp->access is consistent with gw->pt(e)_access.
>>>
>>>      
>> If the guest mapping is writable and it !dirty, we mark SPTE_NO_DIRTY
>> flag in
>> the spte, when the next #PF occurs, we just need check this flag and
>> see whether
>> gpte's D bit is set, if it's true, we zap this spte and map to the
>> correct sp.
>>    
> 
> My point is, SPTE_NO_DIRTY is equivalent to an sp->role.access check
> (the access check is a bit slower, but that shouldn't matter).
> 

I see.

> 
>>> Can you write a test case for qemu-kvm.git/kvm/test that demonstrates
>>> the problem and the fix?  It will help ensure we don't regress in this
>>> area.
>>>
>>>      
>> OK, but allow me do it later :-)
>>
>>    
> 
> Sure, but please do it soon.

Sure, I will do it as soon as possible.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 8/10] KVM: MMU: prefetch ptes when intercepted guest #PF
  2010-06-29  8:07     ` Xiao Guangrong
@ 2010-06-29 11:44       ` Marcelo Tosatti
  2010-06-30  0:58         ` Xiao Guangrong
  0 siblings, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2010-06-29 11:44 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM list

On Tue, Jun 29, 2010 at 04:07:40PM +0800, Xiao Guangrong wrote:
> 
> 
> Marcelo Tosatti wrote:
> 
> >> +
> >> +	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
> >> +		return;
> >> +
> >> +	if (sp->role.direct)
> >> +		return direct_pte_prefetch(vcpu, sptep);
> > 
> > Can never happen.
> > 
> 
> Marcelo,
> 
> Thanks for your comment. You mean that we can't meet sp->role.direct here?
> could you please tell me why? During my test, it can be triggered.

Ah, for 1->1 emulation, right.

> >> @@ -322,6 +395,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
> >>  				     user_fault, write_fault,
> >>  				     dirty, ptwrite, level,
> >>  				     gw->gfn, pfn, false, true);
> >> +			FNAME(pte_prefetch)(vcpu, sptep);
> >>  			break;
> >>  		}
> > 
> > 
> > I'm afraid this can introduce regressions since it increases mmu_lock
> > contention. Can you get some numbers with 4-vcpu or 8-vcpu guest and
> > many threads benchmarks, such as kernbench and apachebench? (on
> > non-EPT).
> > 
> 
> The pte prefetch is the fast path, it only occupies little time, for the worst
> case, only need read 128 byte form the guest pte, and if it prefetched success,
> the #PF cause by later access will avoid, then we avoid to exit form the guest,
> and walk guest pte, walk shadow pages, flush local tlb... a lots of work can be
> reduced.
> 
> Before i post this patchset firstly, i do the performance test by using unixbench,
> it improved ~3.6% under EPT disable case.
> (it's in the first version's chagelog)
> 
> Today, i do the kernbench test with 4 vcpu and 1G memory, the result shows it
> improved ~1.6% :-)

OK, nice.

> > Also prefetch should be disabled for EPT, due to lack of accessed bit.
> > 
> 
> But we call mmu_set_spte() with speculative == false, so it does not touch the accessed bit.

There is no accessed bit on EPT. So the aging code (kvm_age_rmapp)
considers any present translation as accessed. There is no way to
distinguish between actually accessed translations and prefetched (but
unused) ones.
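
Roughly (illustrative only, not the exact aging code):

	/*
	 * With EPT the spte format has no hardware accessed bit
	 * (shadow_accessed_mask == 0), so an aging pass can only treat
	 * "present" as "accessed" -- a prefetched but never-touched spte is
	 * indistinguishable from one the guest really used.
	 */
	if (!shadow_accessed_mask)
		young = is_shadow_present_pte(*spte);
	else
		young = !!(*spte & shadow_accessed_mask);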

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 8/10] KVM: MMU: prefetch ptes when intercepted guest #PF
  2010-06-29 11:44       ` Marcelo Tosatti
@ 2010-06-30  0:58         ` Xiao Guangrong
  0 siblings, 0 replies; 32+ messages in thread
From: Xiao Guangrong @ 2010-06-30  0:58 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM list



Marcelo Tosatti wrote:

>>> Also prefetch should be disabled for EPT, due to lack of accessed bit.
>>>
>> But we call mmu_set_spte() with speculative == false, so it does not touch the accessed bit.
> 
> There is no accessed bit on EPT. So the aging code (kvm_age_rmapp)
> considers any present translation as accessed. There is no way to
> distinguish between actually accessed translations and prefetched (but
> unused) ones.
> 

You are right, I'll disable the prefetch for EPT in the next version; thanks for
pointing it out.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2010-06-30  1:02 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <4C2498EC.2010006@cn.fujitsu.com>
2010-06-25 12:05 ` [PATCH v2 2/10] KVM: MMU: fix conflict access permissions in direct sp Xiao Guangrong
2010-06-28  9:43   ` Avi Kivity
2010-06-28  9:49     ` Xiao Guangrong
2010-06-25 12:06 ` [PATCH v2 3/10] KVM: MMU: fix direct sp's access corruptted Xiao Guangrong
2010-06-28  9:50   ` Avi Kivity
2010-06-28 10:02     ` Xiao Guangrong
2010-06-28 11:13       ` Avi Kivity
2010-06-29  1:17         ` Xiao Guangrong
2010-06-29  7:06           ` Avi Kivity
2010-06-29  7:35             ` Xiao Guangrong
2010-06-29  8:49               ` Avi Kivity
2010-06-29  9:04                 ` Xiao Guangrong
2010-06-29  9:13                   ` Avi Kivity
2010-06-29  9:13                     ` Xiao Guangrong
2010-06-29  7:38             ` Avi Kivity
2010-06-29  7:45               ` Xiao Guangrong
2010-06-29  8:51                 ` Avi Kivity
2010-06-29  9:08                   ` Xiao Guangrong
2010-06-25 12:06 ` [PATCH v2 4/10] KVM: MMU: fix forgot to flush all vcpu's tlb Xiao Guangrong
2010-06-28  9:55   ` Avi Kivity
2010-06-25 12:06 ` [PATCH v2 5/10] KVM: MMU: introduce gfn_to_pfn_atomic() function Xiao Guangrong
2010-06-25 12:06 ` [PATCH v2 6/10] KVM: MMU: introduce gfn_to_hva_many() function Xiao Guangrong
2010-06-25 12:06 ` [PATCH v2 7/10] KVM: MMU: introduce mmu_topup_memory_cache_atomic() Xiao Guangrong
2010-06-28 11:17   ` Avi Kivity
2010-06-29  1:18     ` Xiao Guangrong
2010-06-25 12:07 ` [PATCH v2 8/10] KVM: MMU: prefetch ptes when intercepted guest #PF Xiao Guangrong
2010-06-28 13:04   ` Marcelo Tosatti
2010-06-29  8:07     ` Xiao Guangrong
2010-06-29 11:44       ` Marcelo Tosatti
2010-06-30  0:58         ` Xiao Guangrong
2010-06-25 12:07 ` [PATCH v2 9/10] KVM: MMU: combine guest pte read between walk and pte prefetch Xiao Guangrong
2010-06-25 12:07 ` [PATCH v2 10/10] KVM: MMU: trace " Xiao Guangrong
