* [PATCH 00/13] KVM: MMU: fast page fault
@ 2012-03-29  9:20 Xiao Guangrong
  2012-03-29  9:20 ` [PATCH 01/13] KVM: MMU: properly assert spte on rmap_next path Xiao Guangrong
                   ` (14 more replies)
  0 siblings, 15 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

* Idea
The present bit of the page fault error code (PFEC.P) indicates whether the
page table is populated on all levels. If this bit is set, we know the
page fault was caused by the page-protection bits (e.g. the W/R bit) or by
the reserved bits.

In KVM, most page faults of this kind (PFEC.P = 1) can be fixed very
simply: faults caused by reserved bits (PFEC.P = 1 && PFEC.RSV = 1) have
already been filtered out by the fast mmio path. To fix the remaining
faults (PFEC.P = 1 && PFEC.RSV = 0), we only need to widen the
corresponding access bits on the spte.

This patchset introduces a fast path to fix this kind of page fault: it
runs outside the mmu-lock and does not need to walk the host page table
to get the gfn-to-pfn mapping.
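
Roughly, the classification described above can be sketched as follows (a
minimal sketch; the helper name is only illustrative, and the real check
added later in this series, page_fault_can_be_fast(), also consults the
rmap and the vcpu id limit):

      static bool pf_may_use_fast_path(u32 error_code)
      {
              /*
               * Reserved-bit faults are mmio faults and have already
               * been handled by the fast mmio path.
               */
              if (error_code & PFERR_RSVD_MASK)
                      return false;

              /*
               * Only faults with PFEC.P = 1 are pure permission faults
               * that can be fixed by widening the access bits of the
               * spte.
               */
              return !!(error_code & PFERR_PRESENT_MASK);
      }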


* Advantage
- it is really fast
  it fixes the page fault outside the mmu-lock and uses a very lightweight
  way to avoid racing with other paths. It also fixes the fault before
  gfn_to_pfn is called, which means no host page table walking is needed.

- we get lots of page faults with PFEC.P = 1 in KVM:
  - in the case of ept/npt
   after the shadow page table becomes stable (all gfns are mapped in the
   shadow page table; this is a short stage since only one shadow page
   table is used and only a few pages are needed), almost all page faults
   are caused by write-protection (frame-buffer under Xwindow, migration),
   and the small remainder is caused by page merge/COW under KSM/THP.

   We do not expect it to fix the page faults caused by the read-only host
   pages of KSM, since after COW all the sptes pointing to the gfn are
   unmapped anyway.

  - in the case of soft mmu
   - many spurious page faults due to lazily flushed TLBs
   - lots of write-protect page faults (dirty-bit tracking for guest ptes,
     write-protected shadow page tables, frame-buffer under Xwindow,
     migration, ...)


* Implementation
We can freely walk the shadow page table between
walk_shadow_page_lockless_begin and walk_shadow_page_lockless_end, which
ensures that all the shadow pages we visit are valid.
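
For illustration, the fast page fault path added in patch 11 uses this
roughly as below (trimmed from the real code):

      walk_shadow_page_lockless_begin(vcpu);
      for_each_shadow_entry_lockless(vcpu, gva, iterator, spte)
              if (!is_shadow_present_pte(spte) || iterator.level < level)
                      break;

      /* ... examine and, if possible, fix the leaf spte here without
       * taking the mmu-lock ... */

      walk_shadow_page_lockless_end(vcpu);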

In most cases, cmpxchg is enough to change the access bits of the spte,
but the write-protect path on softmmu/nested mmu is a special case: it is
a read-check-modify path: read the spte, check the W bit, then clear the
W bit. In order to avoid marking the spte writable after/during page
write-protection, we use the trick below:

      fast page fault path:
            lock RCU
            set identification in the spte
            smp_mb()
            if (!rmap.PTE_LIST_WRITE_PROTECT)
                 cmpxchg + w - vcpu-id
            unlock RCU

      write protect path:
            lock mmu-lock
            set rmap.PTE_LIST_WRITE_PROTECT
                 smp_mb()
            if (spte.w || spte has identification)
                 clear w bit and identification
            unlock mmu-lock

Setting the identification in the spte notifies the page write-protect
path that it must modify the spte, so that the change becomes visible to
our cmpxchg.

Setting the identification is itself a trick: it only sets bits in the
last byte of the spte, which neither changes the mapping nor loses cpu
status bits.

The identification must be unique to avoid the race below:

     VCPU 0                VCPU 1            VCPU 2
      lock RCU
   spte + identification
   check conditions
                       do write-protect, clear
                          identification
                                              lock RCU
                                        set identification
     cmpxchg + w - identification
        OOPS!!!

We choose the vcpu id as the unique value; currently, up to 254 vcpus on
VMX and 127 vcpus on softmmu can use the fast path. Keep it simple for
now. :)
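
To make these limits concrete, the identification is encoded roughly as in
the sketch below (a simplified rendition of mark_vcpu_id_spte() from patch
10; the helper name and parameters here are only illustrative):

      #define VCPU_ID_SHIFT 56

      /*
       * Store "vcpu_id + 1" in bits 56..63 of the spte; 0 means "no
       * identification".  The NX bit (bit 63 on softmmu), if it exists,
       * is excluded from the id mask and always kept set, so setting the
       * identification can only ever reduce access -- which is also why
       * softmmu has 7 usable bits (~127 vcpus) while ept has the full
       * byte (~254 vcpus).
       */
      static u64 set_identification(u64 spte, int vcpu_id, u64 nx_mask)
      {
              u64 id_mask = (0xffull << VCPU_ID_SHIFT) & ~nx_mask;

              spte &= ~id_mask;
              spte |= ((u64)(vcpu_id + 1) << VCPU_ID_SHIFT) | nx_mask;

              return spte;
      }

The real helper additionally stores only the top byte into the live spte,
so the update can never disturb the pfn or the permission bits below.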


* Performance
It introduces a full memory barrier on the page write-protect path. I ran
kernbench in text mode, which generates no frame-buffer write-protect
page faults and therefore does not benefit from the optimization
introduced by this patch; it shows no regression.

And here are the results of x11perf and of migration under autotest:

x11perf (x11perf -repeat 10 -comppixwin500):
(Host: Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz * 4 + 4G
 Guest: 4 vcpus + 1G)

- For ept:
$ x11perfcomp baseline-hard optimaze-hard
1: baseline-hard
2: optimaze-hard

     1         2    Operation
--------  --------  ---------
  7060.0    7150.0  Composite 500x500 from pixmap to window

- For shadow mmu:
$ x11perfcomp baseline-soft optimaze-soft
1: baseline-soft
2: optimaze-soft

     1         2    Operation
--------  --------  ---------
  6980.0    7490.0  Composite 500x500 from pixmap to window

( It is interesting that, after this patch, x11perf performs better on
  softmmu than on hardmmu; I have tested it many times and it is really
  true. :) )

autotest migration:
(Host: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz * 12 + 32G)

- For ept:

Before:
                    smp2.Fedora.16.64.migrate
Times   .unix      .with_autotest.dbench.unix     total
 1       102           204                         309
 2       68            203                         275
 3       67            218                         289

After:
                    smp2.Fedora.16.64.migrate
Times   .unix      .with_autotest.dbench.unix     total
 1       103           189                         295
 2       67            188                         259
 3       64            202                         271


- For shadow mmu:

Before:
                    smp2.Fedora.16.64.migrate
Times   .unix      .with_autotest.dbench.unix     total
 1       102           262                         368
 2       68            220                         292
 3       68            234                         307

After:
                    smp2.Fedora.16.64.migrate
Times   .unix      .with_autotest.dbench.unix     total
 1       104           231                         341
 2       68            218                         289
 3       66            205                         275


Any comments are welcome. :)


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 01/13] KVM: MMU: properly assert spte on rmap_next path
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
@ 2012-03-29  9:20 ` Xiao Guangrong
  2012-03-29  9:21 ` [PATCH 02/13] KVM: MMU: abstract spte write-protect Xiao Guangrong
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:20 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

Only testing the present bit is not enough since mmio sptes also set this
bit; use is_rmap_spte() instead

Also move the BUG_ONs into the common function to clean up the code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dc5f245..c759e4f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -989,7 +989,12 @@ static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)

 static u64 *rmap_next(unsigned long *rmapp, u64 *spte)
 {
-	return pte_list_next(rmapp, spte);
+	u64 *sptep;
+
+	sptep = pte_list_next(rmapp, spte);
+
+	WARN_ON(sptep && !is_rmap_spte(*sptep));
+	return sptep;
 }

 static void rmap_remove(struct kvm *kvm, u64 *spte)
@@ -1016,7 +1021,6 @@ static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, int level
 	int write_protected = 0;

 	while ((spte = rmap_next(rmapp, spte))) {
-		BUG_ON(!(*spte & PT_PRESENT_MASK));
 		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);

 		if (!is_writable_pte(*spte))
@@ -1087,7 +1091,6 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
 	int need_tlb_flush = 0;

 	while ((spte = rmap_next(rmapp, NULL))) {
-		BUG_ON(!(*spte & PT_PRESENT_MASK));
 		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
 		drop_spte(kvm, spte);
 		need_tlb_flush = 1;
@@ -1107,7 +1110,6 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 	new_pfn = pte_pfn(*ptep);
 	spte = rmap_next(rmapp, NULL);
 	while (spte) {
-		BUG_ON(!is_shadow_present_pte(*spte));
 		rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", spte, *spte);
 		need_flush = 1;
 		if (pte_write(*ptep)) {
@@ -1200,7 +1202,6 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
 	while (spte) {
 		int _young;
 		u64 _spte = *spte;
-		BUG_ON(!(_spte & PT_PRESENT_MASK));
 		_young = _spte & PT_ACCESSED_MASK;
 		if (_young) {
 			young = 1;
@@ -1228,7 +1229,6 @@ static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
 	spte = rmap_next(rmapp, NULL);
 	while (spte) {
 		u64 _spte = *spte;
-		BUG_ON(!(_spte & PT_PRESENT_MASK));
 		young = _spte & PT_ACCESSED_MASK;
 		if (young) {
 			young = 1;
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 02/13] KVM: MMU: abstract spte write-protect
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
  2012-03-29  9:20 ` [PATCH 01/13] KVM: MMU: properly assert spte on rmap_next path Xiao Guangrong
@ 2012-03-29  9:21 ` Xiao Guangrong
  2012-03-29 11:11   ` Avi Kivity
  2012-03-29  9:22 ` [PATCH 03/13] KVM: MMU: split FNAME(prefetch_invalid_gpte) Xiao Guangrong
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:21 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

Introduce a common function that abstracts spte write-protection, to
clean up the code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |   57 ++++++++++++++++++++++++++++++---------------------
 1 files changed, 33 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c759e4f..ad40647 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1015,27 +1015,43 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 		rmap_remove(kvm, sptep);
 }

+/* Return true if the spte is dropped. */
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
+			       int *flush)
+{
+	u64 spte = *sptep;
+
+	if (!is_writable_pte(spte))
+		return false;
+
+	*flush |= true;
+
+	if (large) {
+		pgprintk("rmap_write_protect(large): spte %p %llx\n",
+			 spte, *spte);
+		BUG_ON(!is_large_pte(spte));
+
+		drop_spte(kvm, sptep);
+		--kvm->stat.lpages;
+		return true;
+	}
+
+	rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
+	spte = spte & ~PT_WRITABLE_MASK;
+	mmu_spte_update(sptep, spte);
+
+	return false;
+}
+
 static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, int level)
 {
 	u64 *spte = NULL;
 	int write_protected = 0;

 	while ((spte = rmap_next(rmapp, spte))) {
-		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
-
-		if (!is_writable_pte(*spte))
-			continue;
-
-		if (level == PT_PAGE_TABLE_LEVEL) {
-			mmu_spte_update(spte, *spte & ~PT_WRITABLE_MASK);
-		} else {
-			BUG_ON(!is_large_pte(*spte));
-			drop_spte(kvm, spte);
-			--kvm->stat.lpages;
+		if (spte_write_protect(kvm, spte, level > PT_PAGE_TABLE_LEVEL,
+		      &write_protected))
 			spte = NULL;
-		}
-
-		write_protected = 1;
 	}

 	return write_protected;
@@ -3858,6 +3874,7 @@ int kvm_mmu_setup(struct kvm_vcpu *vcpu)
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 {
 	struct kvm_mmu_page *sp;
+	int flush = 0;

 	list_for_each_entry(sp, &kvm->arch.active_mmu_pages, link) {
 		int i;
@@ -3872,16 +3889,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 			      !is_last_spte(pt[i], sp->role.level))
 				continue;

-			if (is_large_pte(pt[i])) {
-				drop_spte(kvm, &pt[i]);
-				--kvm->stat.lpages;
-				continue;
-			}
-
-			/* avoid RMW */
-			if (is_writable_pte(pt[i]))
-				mmu_spte_update(&pt[i],
-						pt[i] & ~PT_WRITABLE_MASK);
+			spte_write_protect(kvm, &pt[i],
+					   is_large_pte(pt[i]), &flush);
 		}
 	}
 	kvm_flush_remote_tlbs(kvm);
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 03/13] KVM: MMU: split FNAME(prefetch_invalid_gpte)
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
  2012-03-29  9:20 ` [PATCH 01/13] KVM: MMU: properly assert spte on rmap_next path Xiao Guangrong
  2012-03-29  9:21 ` [PATCH 02/13] KVM: MMU: abstract spte write-protect Xiao Guangrong
@ 2012-03-29  9:22 ` Xiao Guangrong
  2012-03-29 13:00   ` Avi Kivity
  2012-03-29  9:22 ` [PATCH 04/13] KVM: MMU: introduce FNAME(get_sp_gpa) Xiao Guangrong
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:22 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

Split FNAME(prefetch_invalid_gpte) so that the gpte check can be done
independently; this will be used in a later patch

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/paging_tmpl.h |   29 +++++++++++++++++------------
 1 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 1561028..b9fd1c4 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -337,24 +337,29 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
 					addr, access);
 }

-static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
-				    struct kvm_mmu_page *sp, u64 *spte,
-				    pt_element_t gpte)
+static bool FNAME(invalid_gpte)(struct kvm_vcpu *vcpu, pt_element_t gpte)
 {
 	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
-		goto no_present;
+		return true;

 	if (!is_present_gpte(gpte))
-		goto no_present;
+		return true;

 	if (!(gpte & PT_ACCESSED_MASK))
-		goto no_present;
+		return true;

 	return false;
+}

-no_present:
-	drop_spte(vcpu->kvm, spte);
-	return true;
+static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
+					 u64 *spte, pt_element_t gpte)
+{
+	if (FNAME(invalid_gpte)(vcpu, gpte)) {
+		drop_spte(vcpu->kvm, spte);
+		return true;
+	}
+
+	return false;
 }

 static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
@@ -365,7 +370,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	pfn_t pfn;

 	gpte = *(const pt_element_t *)pte;
-	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
+	if (FNAME(prefetch_invalid_gpte)(vcpu, spte, gpte))
 		return;

 	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
@@ -441,7 +446,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,

 		gpte = gptep[i];

-		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
+		if (FNAME(prefetch_invalid_gpte)(vcpu, spte, gpte))
 			continue;

 		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte,
@@ -792,7 +797,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 					  sizeof(pt_element_t)))
 			return -EINVAL;

-		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
+		if (FNAME(prefetch_invalid_gpte)(vcpu, &sp->spt[i], gpte)) {
 			vcpu->kvm->tlbs_dirty++;
 			continue;
 		}
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 04/13] KVM: MMU: introduce FNAME(get_sp_gpa)
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
                   ` (2 preceding siblings ...)
  2012-03-29  9:22 ` [PATCH 03/13] KVM: MMU: split FNAME(prefetch_invalid_gpte) Xiao Guangrong
@ 2012-03-29  9:22 ` Xiao Guangrong
  2012-03-29 13:07   ` Avi Kivity
  2012-03-29  9:23 ` [PATCH 05/13] KVM: MMU: reset shadow_mmio_mask Xiao Guangrong
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:22 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

It can calculate the base gpa of the specified shadow page at any level;
use it in place of FNAME(get_level1_sp_gpa)
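
For reference, a rough worked example of what the new computation yields
(assuming PAGE_SHIFT = 12, and PT_LEVEL_BITS = 10 vs PT64_LEVEL_BITS = 9
when PTTYPE == 32):

      /*
       * shift  = PAGE_SHIFT - (PT_LEVEL_BITS - PT64_LEVEL_BITS) * level
       * offset = quadrant << shift
       *
       * PTTYPE == 32, level 1: shift = 12 - 1 * 1 = 11,
       *     offset = quadrant << 11, i.e. the same value the old
       *     FNAME(get_level1_sp_gpa) produced:
       *     (quadrant << PT64_LEVEL_BITS) * sizeof(pt_element_t)
       *         = (quadrant << 9) * 4
       * PTTYPE == 32, level 2: shift = 12 - 1 * 2 = 10,
       *     offset = quadrant << 10
       *
       * PTTYPE == 64: PT_LEVEL_BITS == PT64_LEVEL_BITS, so shift = 12
       *     and quadrant is always 0, giving offset = 0.
       */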

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/paging_tmpl.h |   26 ++++++++++++--------------
 1 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index b9fd1c4..f0fbde3 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -558,6 +558,16 @@ out_gpte_changed:
 	return NULL;
 }

+static gpa_t FNAME(get_sp_gpa)(struct kvm_mmu_page *sp)
+{
+	int offset, shift;
+
+	shift = PAGE_SHIFT - (PT_LEVEL_BITS - PT64_LEVEL_BITS) * sp->role.level;
+	offset = sp->role.quadrant << shift;
+
+	return gfn_to_gpa(sp->gfn) + offset;
+}
+
 /*
  * Page fault handler.  There are several causes for a page fault:
  *   - there is no shadow pte for the guest pte
@@ -659,18 +669,6 @@ out_unlock:
 	return 0;
 }

-static gpa_t FNAME(get_level1_sp_gpa)(struct kvm_mmu_page *sp)
-{
-	int offset = 0;
-
-	WARN_ON(sp->role.level != 1);
-
-	if (PTTYPE == 32)
-		offset = sp->role.quadrant << PT64_LEVEL_BITS;
-
-	return gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t);
-}
-
 static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 {
 	struct kvm_shadow_walk_iterator iterator;
@@ -699,7 +697,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 			if (!sp->unsync)
 				break;

-			pte_gpa = FNAME(get_level1_sp_gpa)(sp);
+			pte_gpa = FNAME(get_sp_gpa)(sp);
 			pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);

 			if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
@@ -780,7 +778,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 	/* direct kvm_mmu_page can not be unsync. */
 	BUG_ON(sp->role.direct);

-	first_pte_gpa = FNAME(get_level1_sp_gpa)(sp);
+	first_pte_gpa = FNAME(get_sp_gpa)(sp);

 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
 		unsigned pte_access;
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 05/13] KVM: MMU: reset shadow_mmio_mask
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
                   ` (3 preceding siblings ...)
  2012-03-29  9:22 ` [PATCH 04/13] KVM: MMU: introduce FNAME(get_sp_gpa) Xiao Guangrong
@ 2012-03-29  9:23 ` Xiao Guangrong
  2012-03-29 13:10   ` Avi Kivity
  2012-03-29  9:23 ` [PATCH 06/13] KVM: VMX: export PFEC.P bit on ept Xiao Guangrong
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:23 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

Do not use the last byte (bit 56 ~ bit 63) of shadow_mmio_mask; a later
patch will store the vcpu id in that byte

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |    1 +
 arch/x86/kvm/vmx.c |    2 +-
 arch/x86/kvm/x86.c |    2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ad40647..bb4d292 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -190,6 +190,7 @@ static void mmu_spte_set(u64 *sptep, u64 spte);

 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
 {
+	WARN_ON(mmio_mask & (0xffull << 56));
 	shadow_mmio_mask = mmio_mask;
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2c22fc7..5ef2b35 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3706,7 +3706,7 @@ static void ept_set_mmio_spte_mask(void)
 	 * Also, magic bits (0xffull << 49) is set to quickly identify mmio
 	 * spte.
 	 */
-	kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull);
+	kvm_mmu_set_mmio_spte_mask(0xffull << 46 | 0x6ull);
 }

 /*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9970ee6..19ef25e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4834,7 +4834,7 @@ static void kvm_set_mmio_spte_mask(void)
 	 * Set the reserved bits and the present bit of an paging-structure
 	 * entry to generate page fault with PFER.RSV = 1.
 	 */
-	mask = ((1ull << (62 - maxphyaddr + 1)) - 1) << maxphyaddr;
+	mask = ((1ull << (55 - maxphyaddr + 1)) - 1) << maxphyaddr;
 	mask |= 1ull;

 #ifdef CONFIG_X86_64
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 06/13] KVM: VMX: export PFEC.P bit on ept
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
                   ` (4 preceding siblings ...)
  2012-03-29  9:23 ` [PATCH 05/13] KVM: MMU: reset shadow_mmio_mask Xiao Guangrong
@ 2012-03-29  9:23 ` Xiao Guangrong
  2012-03-29  9:24 ` [PATCH 07/13] KVM: MMU: store more bits in rmap Xiao Guangrong
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:23 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

Export the present bit of the page fault error code; a later patch will
use it

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/vmx.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5ef2b35..d31fa05 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4726,6 +4726,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification;
 	gpa_t gpa;
+	u32 error_code;
 	int gla_validity;

 	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -4750,7 +4751,13 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)

 	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
 	trace_kvm_page_fault(gpa, exit_qualification);
-	return kvm_mmu_page_fault(vcpu, gpa, exit_qualification & 0x3, NULL, 0);
+
+	/* It is a write fault? */
+	error_code = exit_qualification & (1U << 1);
+	/* ept page table is present? */
+	error_code |= (exit_qualification >> 3) & 0x1;
+
+	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }

 static u64 ept_rsvd_mask(u64 spte, int level)
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 07/13] KVM: MMU: store more bits in rmap
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
                   ` (5 preceding siblings ...)
  2012-03-29  9:23 ` [PATCH 06/13] KVM: VMX: export PFEC.P bit on ept Xiao Guangrong
@ 2012-03-29  9:24 ` Xiao Guangrong
  2012-03-29  9:25 ` [PATCH 08/13] KVM: MMU: fast check whether page is writable Xiao Guangrong
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:24 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

In the current code, only one bit (bit 0) is used in the rmap; this patch
exposes more bits of the rmap word. During pte add/remove, only bit 0 is
touched and the other bits are kept
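
In other words, the rmap word is now treated roughly as sketched below
(the second flag bit is only taken into use by the next patch; the low two
bits are free because the stored pointers are at least 4-byte aligned):

      /*
       * rmap / pte_list word layout after this patch:
       *
       *   bit 0 (PTE_LIST_DESC)   set:   the upper bits point to a
       *                                  struct pte_list_desc
       *                           clear: the upper bits are a single
       *                                  spte pointer (or 0)
       *   bit 1                   reserved for a flag (next patch)
       *   bits 2..63              the pointer, i.e. the "map" part
       *                           returned by pte_list_decode()
       */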

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |  143 +++++++++++++++++++++++++++++++++-------------------
 1 files changed, 91 insertions(+), 52 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bb4d292..84b9775 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -796,6 +796,17 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
 	return level - 1;
 }

+#define PTE_LIST_DESC		(0x1ull)
+#define PTE_LIST_FLAG_MASK	(0x3ull)
+
+static void
+pte_list_decode(const unsigned long *pte_list, unsigned long *map,
+		unsigned long *flags)
+{
+	*map = *pte_list & (~PTE_LIST_FLAG_MASK);
+	*flags = *pte_list & PTE_LIST_FLAG_MASK;
+}
+
 /*
  * Pte mapping structures:
  *
@@ -812,50 +823,67 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 			unsigned long *pte_list)
 {
 	struct pte_list_desc *desc;
+	unsigned long map, flags;
 	int i, count = 0;

-	if (!*pte_list) {
+	pte_list_decode(pte_list, &map, &flags);
+
+	if (!map) {
 		rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
-		*pte_list = (unsigned long)spte;
-	} else if (!(*pte_list & 1)) {
+		WARN_ON(flags & PTE_LIST_DESC);
+		*pte_list = (unsigned long)spte | flags ;
+
+		return 0;
+	}
+
+	if (!(flags & PTE_LIST_DESC)) {
 		rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
 		desc = mmu_alloc_pte_list_desc(vcpu);
-		desc->sptes[0] = (u64 *)*pte_list;
+		desc->sptes[0] = (u64 *)map;
 		desc->sptes[1] = spte;
-		*pte_list = (unsigned long)desc | 1;
-		++count;
-	} else {
-		rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
-		desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-		while (desc->sptes[PTE_LIST_EXT-1] && desc->more) {
-			desc = desc->more;
-			count += PTE_LIST_EXT;
-		}
-		if (desc->sptes[PTE_LIST_EXT-1]) {
-			desc->more = mmu_alloc_pte_list_desc(vcpu);
-			desc = desc->more;
-		}
-		for (i = 0; desc->sptes[i]; ++i)
-			++count;
-		desc->sptes[i] = spte;
+		*pte_list = (unsigned long)desc | flags | PTE_LIST_DESC;
+
+		return 1;
+	}
+
+	rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
+	desc = (struct pte_list_desc *)map;
+	while (desc->sptes[PTE_LIST_EXT - 1] && desc->more) {
+		desc = desc->more;
+		count += PTE_LIST_EXT;
+	}
+
+	if (desc->sptes[PTE_LIST_EXT - 1]) {
+		desc->more = mmu_alloc_pte_list_desc(vcpu);
+		desc = desc->more;
 	}
+
+	for (i = 0; desc->sptes[i]; ++i)
+		++count;
+	desc->sptes[i] = spte;
+
 	return count;
 }

 static u64 *pte_list_next(unsigned long *pte_list, u64 *spte)
 {
 	struct pte_list_desc *desc;
+	unsigned long map, flags;
 	u64 *prev_spte;
 	int i;

-	if (!*pte_list)
+	pte_list_decode(pte_list, &map, &flags);
+
+	if (!map)
 		return NULL;
-	else if (!(*pte_list & 1)) {
+
+	if (!(flags & PTE_LIST_DESC)) {
 		if (!spte)
-			return (u64 *)*pte_list;
+			return (u64 *)map;
 		return NULL;
 	}
-	desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+
+	desc = (struct pte_list_desc *)map;
 	prev_spte = NULL;
 	while (desc) {
 		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) {
@@ -870,7 +898,8 @@ static u64 *pte_list_next(unsigned long *pte_list, u64 *spte)

 static void
 pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
-			   int i, struct pte_list_desc *prev_desc)
+			   int i, struct pte_list_desc *prev_desc,
+			   unsigned long flags)
 {
 	int j;

@@ -881,12 +910,13 @@ pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
 	if (j != 0)
 		return;
 	if (!prev_desc && !desc->more)
-		*pte_list = (unsigned long)desc->sptes[0];
+		*pte_list = (unsigned long)desc->sptes[0] |
+					(flags & (~PTE_LIST_DESC)) ;
 	else
 		if (prev_desc)
 			prev_desc->more = desc->more;
 		else
-			*pte_list = (unsigned long)desc->more | 1;
+			*pte_list = (unsigned long)desc->more | flags;
 	mmu_free_pte_list_desc(desc);
 }

@@ -894,51 +924,60 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
 {
 	struct pte_list_desc *desc;
 	struct pte_list_desc *prev_desc;
+	unsigned long map, flags;
 	int i;

-	if (!*pte_list) {
+	pte_list_decode(pte_list, &map, &flags);
+
+	if (!map) {
 		printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
 		BUG();
-	} else if (!(*pte_list & 1)) {
+		return;
+	}
+
+	if (!(flags & PTE_LIST_DESC)) {
 		rmap_printk("pte_list_remove:  %p 1->0\n", spte);
-		if ((u64 *)*pte_list != spte) {
+		if ((u64 *)map != spte) {
 			printk(KERN_ERR "pte_list_remove:  %p 1->BUG\n", spte);
 			BUG();
 		}
-		*pte_list = 0;
-	} else {
-		rmap_printk("pte_list_remove:  %p many->many\n", spte);
-		desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-		prev_desc = NULL;
-		while (desc) {
-			for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
-				if (desc->sptes[i] == spte) {
-					pte_list_desc_remove_entry(pte_list,
-							       desc, i,
-							       prev_desc);
-					return;
-				}
-			prev_desc = desc;
-			desc = desc->more;
-		}
-		pr_err("pte_list_remove: %p many->many\n", spte);
-		BUG();
+		*pte_list = flags;
+		return;
+	}
+
+	rmap_printk("pte_list_remove:  %p many->many\n", spte);
+	desc = (struct pte_list_desc *)map;
+	prev_desc = NULL;
+	while (desc) {
+		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
+			if (desc->sptes[i] == spte) {
+				pte_list_desc_remove_entry(pte_list,
+						desc, i, prev_desc, flags);
+				return;
+			}
+		prev_desc = desc;
+		desc = desc->more;
 	}
+	pr_err("pte_list_remove: %p many->many\n", spte);
+	BUG();
 }

 typedef void (*pte_list_walk_fn) (u64 *spte);
 static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
 {
 	struct pte_list_desc *desc;
+	unsigned long map, flags;
 	int i;

-	if (!*pte_list)
+	pte_list_decode(pte_list, &map, &flags);
+
+	if (!map)
 		return;

-	if (!(*pte_list & 1))
-		return fn((u64 *)*pte_list);
+	if (!(flags & PTE_LIST_DESC))
+		return fn((u64 *)map);

-	desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+	desc = (struct pte_list_desc *)map;
 	while (desc) {
 		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
 			fn(desc->sptes[i]);
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 08/13] KVM: MMU: fast check whether page is writable
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
                   ` (6 preceding siblings ...)
  2012-03-29  9:24 ` [PATCH 07/13] KVM: MMU: store more bits in rmap Xiao Guangrong
@ 2012-03-29  9:25 ` Xiao Guangrong
  2012-03-29 15:49   ` Avi Kivity
  2012-04-01 15:52   ` Avi Kivity
  2012-03-29  9:25 ` [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock Xiao Guangrong
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:25 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

Use the PTE_LIST_WRITE_PROTECT bit in the rmap to cache the write-protect
status and avoid unnecessary shadow page walking

Also, if no shadow page is indirect, the page can be made writable without
any further checks

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |   34 +++++++++++++++++++++++++++++++++-
 1 files changed, 33 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 84b9775..3887a07 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -797,6 +797,7 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
 }

 #define PTE_LIST_DESC		(0x1ull)
+#define PTE_LIST_WRITE_PROTECT	(0x2ull)
 #define PTE_LIST_FLAG_MASK	(0x3ull)

 static void
@@ -1016,6 +1017,13 @@ static bool rmap_can_add(struct kvm_vcpu *vcpu)
 	return mmu_memory_cache_free_objects(cache);
 }

+static void host_page_write_protect(u64 *spte, unsigned long *rmapp)
+{
+	if (!(*spte & SPTE_HOST_WRITEABLE) &&
+	      !(*rmapp & PTE_LIST_WRITE_PROTECT))
+		*rmapp |= PTE_LIST_WRITE_PROTECT;
+}
+
 static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
@@ -1023,7 +1031,10 @@ static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)

 	sp = page_header(__pa(spte));
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+
 	rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
+	host_page_write_protect(spte, rmapp);
+
 	return pte_list_add(vcpu, spte, rmapp);
 }

@@ -1130,8 +1141,17 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 	int write_protected = 0;

 	slot = gfn_to_memslot(kvm, gfn);
+	rmapp = __gfn_to_rmap(gfn, PT_PAGE_TABLE_LEVEL, slot);

-	for (i = PT_PAGE_TABLE_LEVEL;
+	if (*rmapp & PTE_LIST_WRITE_PROTECT)
+		return 0;
+
+	*rmapp |= PTE_LIST_WRITE_PROTECT;
+
+	write_protected |= __rmap_write_protect(kvm, rmapp,
+						PT_PAGE_TABLE_LEVEL);
+
+	for (i = PT_DIRECTORY_LEVEL;
 	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
 		rmapp = __gfn_to_rmap(gfn, i, slot);
 		write_protected |= __rmap_write_protect(kvm, rmapp, i);
@@ -1180,6 +1200,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 			new_spte &= ~shadow_accessed_mask;
 			mmu_spte_clear_track_bits(spte);
 			mmu_spte_set(spte, new_spte);
+			host_page_write_protect(spte, rmapp);
 			spte = rmap_next(rmapp, spte);
 		}
 	}
@@ -2247,8 +2268,16 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 {
 	struct kvm_mmu_page *s;
 	struct hlist_node *node;
+	unsigned long *rmap;
 	bool need_unsync = false;

+	if (!vcpu->kvm->arch.indirect_shadow_pages)
+		return 0;
+
+	rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);
+	if (!(*rmap & PTE_LIST_WRITE_PROTECT))
+		return 0;
+
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node) {
 		if (!can_unsync)
 			return 1;
@@ -2262,6 +2291,9 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 	}
 	if (need_unsync)
 		kvm_unsync_pages(vcpu, gfn);
+
+	*rmap &= ~PTE_LIST_WRITE_PROTECT;
+
 	return 0;
 }

-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
                   ` (7 preceding siblings ...)
  2012-03-29  9:25 ` [PATCH 08/13] KVM: MMU: fast check whether page is writable Xiao Guangrong
@ 2012-03-29  9:25 ` Xiao Guangrong
  2012-04-01 15:53   ` Avi Kivity
  2012-03-29  9:26 ` [PATCH 10/13] KVM: MMU: store vcpu id in spte to notify page write-protect path Xiao Guangrong
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:25 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

It depends on the PTE_LIST_WRITE_PROTECT bit in the rmap, which lets us
quickly know whether the page is writable without taking the mmu-lock

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c         |   17 +++++++++++++----
 arch/x86/kvm/paging_tmpl.h |    2 +-
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3887a07..c029185 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1148,6 +1148,12 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)

 	*rmapp |= PTE_LIST_WRITE_PROTECT;

+	/*
+	 * Setting PTE_LIST_WRITE_PROTECT bit before doing page
+	 * write-protect.
+	 */
+	smp_mb();
+
 	write_protected |= __rmap_write_protect(kvm, rmapp,
 						PT_PAGE_TABLE_LEVEL);

@@ -2264,7 +2270,7 @@ static void kvm_unsync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)
 }

 static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
-				  bool can_unsync)
+				  bool can_unsync, bool unlock)
 {
 	struct kvm_mmu_page *s;
 	struct hlist_node *node;
@@ -2278,6 +2284,9 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 	if (!(*rmap & PTE_LIST_WRITE_PROTECT))
 		return 0;

+	if (unlock)
+		return 1;
+
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node) {
 		if (!can_unsync)
 			return 1;
@@ -2301,7 +2310,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		    unsigned pte_access, int user_fault,
 		    int write_fault, int level,
 		    gfn_t gfn, pfn_t pfn, bool speculative,
-		    bool can_unsync, bool host_writable)
+		    bool can_unsync, bool host_writable, bool unlock)
 {
 	u64 spte, entry = *sptep;
 	int ret = 0;
@@ -2367,7 +2376,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		if (!can_unsync && is_writable_pte(*sptep))
 			goto set_pte;

-		if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
+		if (mmu_need_write_protect(vcpu, gfn, can_unsync, unlock)) {
 			pgprintk("%s: found shadow page for %llx, marking ro\n",
 				 __func__, gfn);
 			ret = 1;
@@ -2433,7 +2442,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,

 	if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
 		      level, gfn, pfn, speculative, true,
-		      host_writable)) {
+		      host_writable, false)) {
 		if (write_fault)
 			*emulate = 1;
 		kvm_mmu_flush_tlb(vcpu);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index f0fbde3..e2af5a5 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -820,7 +820,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		set_spte(vcpu, &sp->spt[i], pte_access, 0, 0,
 			 PT_PAGE_TABLE_LEVEL, gfn,
 			 spte_to_pfn(sp->spt[i]), true, false,
-			 host_writable);
+			 host_writable, false);
 	}

 	return !nr_present;
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 10/13] KVM: MMU: store vcpu id in spte to notify page write-protect path
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
                   ` (8 preceding siblings ...)
  2012-03-29  9:25 ` [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock Xiao Guangrong
@ 2012-03-29  9:26 ` Xiao Guangrong
  2012-03-29  9:27 ` [PATCH 11/13] KVM: MMU: fast path of handling guest page fault Xiao Guangrong
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:26 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

The last byte of the spte can be modified freely since the cpu does not
touch it and no pfn mapping lives in that area; however, the NX bit (if
it exists) is in this byte, so we always keep it set in order to only
ever reduce access

In PAE shadow mode, a vcpu-id spte can cause a #PF with RSV = 1 (these
high bits are reserved in PAE paging entries) and would be treated as an
mmio spte, so we need to filter it out in the mmio path and clear the
vcpu id in the spte to avoid generating the same fault again

The vcpu id is used as a hint to notify the page write-protect path that
the spte is being fetched by the fast page fault path and should be reset

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |  120 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 102 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c029185..a7f7aea 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -147,7 +147,17 @@ module_param(dbg, bool, 0644);
 #define CREATE_TRACE_POINTS
 #include "mmutrace.h"

-#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+#define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+
+/*
+ * On the fast page fault path, we will store vcpu id in the last byte
+ * of spte out of mmu-lock, it is safe since cpu does not set these bits
+ * and no pfn mapping is in this area. But NX bit (if it exists) is in
+ * this area, we always set it to reduce access.
+ */
+#define VCPU_ID_SHIFT		56
+#define VCPU_ID_MASK		((1ull << (63 - VCPU_ID_SHIFT + 1)) - 1)
+#define SPTE_VCPU_ID_MASK	(VCPU_ID_MASK << VCPU_ID_SHIFT)

 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)

@@ -228,6 +238,49 @@ static bool set_mmio_spte(u64 *sptep, gfn_t gfn, pfn_t pfn, unsigned access)
 	return false;
 }

+static u64 spte_vcpu_id_mask(void)
+{
+	return SPTE_VCPU_ID_MASK & (~shadow_nx_mask);
+}
+
+static u64 mark_vcpu_id_spte(u64 *sptep, u64 spte, int vcpu_id)
+{
+	u64 mask = spte_vcpu_id_mask();
+	char *set = (char *)sptep + 7;
+
+	spte &= ~mask;
+	spte |= ((u64)(vcpu_id + 1) << VCPU_ID_SHIFT) | shadow_nx_mask;
+	*set = spte >> VCPU_ID_SHIFT;
+
+	return spte;
+}
+
+static bool is_vcpu_id_spte(u64 spte)
+{
+	return !!(spte & spte_vcpu_id_mask());
+}
+
+/*
+ * The NX bit can be lazily cleared since the executable mapping is hardly
+ * written in guest.
+ */
+static void clear_vcpu_id_spte(u64 *sptep)
+{
+	u64 mask = spte_vcpu_id_mask() >> VCPU_ID_SHIFT;
+	char *set = (char *)sptep + 7;
+
+	*set &= ~mask;
+}
+
+static int max_vcpu_spte(void)
+{
+	int max_vcpu;
+
+	max_vcpu = spte_vcpu_id_mask() >> VCPU_ID_SHIFT;
+
+	return max_vcpu - 1;
+}
+
 static inline u64 rsvd_bits(int s, int e)
 {
 	return ((1ULL << (e - s + 1)) - 1) << s;
@@ -1068,25 +1121,43 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)

 /* Return true if the spte is dropped. */
 static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
-			       int *flush)
+			       int *flush, bool page_table_protect)
 {
 	u64 spte = *sptep;

-	if (!is_writable_pte(spte))
-		return false;
+	if (is_writable_pte(spte)) {
+		*flush |= true;

-	*flush |= true;
+		if (large) {
+			pgprintk("rmap_write_protect(large): spte %p %llx\n",
+				 spte, *spte);
+			BUG_ON(!is_large_pte(spte));

-	if (large) {
-		pgprintk("rmap_write_protect(large): spte %p %llx\n",
-			 spte, *spte);
-		BUG_ON(!is_large_pte(spte));
+			drop_spte(kvm, sptep);
+			--kvm->stat.lpages;
+			return true;
+		}
+		goto reset_spte;
+	}

-		drop_spte(kvm, sptep);
-		--kvm->stat.lpages;
-		return true;
+	/*
+	 * Reset the spte to notify fast page fault path that the spte
+	 * has been changed.
+	 */
+	if (page_table_protect && is_vcpu_id_spte(spte)) {
+		/* The spte is fetched by the fast page fault path which
+		 * fixes the page fault out of mmu-lock, so we can safely
+		 * mask in the accessed bit here; it lets the spte be updated
+		 * fast.
+		 */
+		spte |= shadow_accessed_mask;
+		clear_vcpu_id_spte(&spte);
+		goto reset_spte;
 	}

+	return false;
+
+reset_spte:
 	rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
 	spte = spte & ~PT_WRITABLE_MASK;
 	mmu_spte_update(sptep, spte);
@@ -1094,14 +1165,15 @@ static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
 	return false;
 }

-static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, int level)
+static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
+				int level, bool page_table_protect)
 {
 	u64 *spte = NULL;
 	int write_protected = 0;

 	while ((spte = rmap_next(rmapp, spte))) {
 		if (spte_write_protect(kvm, spte, level > PT_PAGE_TABLE_LEVEL,
-		      &write_protected))
+		      &write_protected, page_table_protect))
 			spte = NULL;
 	}

@@ -1126,7 +1198,7 @@ void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,

 	while (mask) {
 		rmapp = &slot->rmap[gfn_offset + __ffs(mask)];
-		__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
+		__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL, false);

 		/* clear the first set bit */
 		mask &= mask - 1;
@@ -1155,12 +1227,12 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 	smp_mb();

 	write_protected |= __rmap_write_protect(kvm, rmapp,
-						PT_PAGE_TABLE_LEVEL);
+						PT_PAGE_TABLE_LEVEL, true);

 	for (i = PT_DIRECTORY_LEVEL;
 	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
 		rmapp = __gfn_to_rmap(gfn, i, slot);
-		write_protected |= __rmap_write_protect(kvm, rmapp, i);
+		write_protected |= __rmap_write_protect(kvm, rmapp, i, true);
 	}

 	return write_protected;
@@ -3034,6 +3106,18 @@ static u64 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr)
 	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte)
 		if (!is_shadow_present_pte(spte))
 			break;
+
+	/*
+	 * On the PAE shadow mode, vcpu id spte can cause the #PF with
+	 * RSV = 1, we need clean the vcpu id to avoid generating the
+	 * same fault again.
+	 */
+	if ((vcpu->arch.mmu.shadow_root_level == PT32E_ROOT_LEVEL) &&
+	      is_vcpu_id_spte(spte)) {
+		clear_vcpu_id_spte(iterator.sptep);
+		spte = 0ull;
+	}
+
 	walk_shadow_page_lockless_end(vcpu);

 	return spte;
@@ -3971,7 +4055,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 				continue;

 			spte_write_protect(kvm, &pt[i],
-					   is_large_pte(pt[i]), &flush);
+					   is_large_pte(pt[i]), &flush, false);
 		}
 	}
 	kvm_flush_remote_tlbs(kvm);
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 11/13] KVM: MMU: fast path of handling guest page fault
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
                   ` (9 preceding siblings ...)
  2012-03-29  9:26 ` [PATCH 10/13] KVM: MMU: store vcpu id in spte to notify page write-protect path Xiao Guangrong
@ 2012-03-29  9:27 ` Xiao Guangrong
  2012-03-31 12:24   ` Xiao Guangrong
  2012-04-01 16:23   ` Avi Kivity
  2012-03-29  9:27 ` [PATCH 12/13] KVM: MMU: trace fast " Xiao Guangrong
                   ` (3 subsequent siblings)
  14 siblings, 2 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:27 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

If the present bit of the page fault error code is set, it indicates that
the shadow page table is populated on all levels; that means all we need
to do is modify the access bits, which can be done outside the mmu-lock

The trick in this patch is avoiding the race between the fast page fault
path and the write-protect path. The write-protect path is a
read-check-modify path: read the spte, check the W bit, then clear the W
bit. What we do is put an identification in the spte; if the
write-protect path meets it, it modifies the spte even if the spte is
read-only. See the comment in the code for more information

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c         |  265 +++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/paging_tmpl.h |   41 +++++++
 2 files changed, 302 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a7f7aea..4a01be4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2767,18 +2767,267 @@ exit:
 	return ret;
 }

+static u32 page_fault_expected_access(u32 error_code)
+{
+	u32 access = 0;
+
+	if (error_code & PFERR_WRITE_MASK)
+		access |= ACC_WRITE_MASK;
+
+	if (error_code & PFERR_USER_MASK)
+		access |= ACC_USER_MASK;
+
+	if (error_code & PFERR_FETCH_MASK)
+		access |= ACC_EXEC_MASK;
+
+	return access;
+}
+
+static u32 spte_access(u64 spte)
+{
+	u32 access;
+
+	access = spte & PT_WRITABLE_MASK;
+
+	if (spte & shadow_user_mask)
+		access |= ACC_USER_MASK;
+
+	if (shadow_x_mask) {
+		if (spte & shadow_x_mask)
+			access |= ACC_EXEC_MASK;
+
+		return access;
+	}
+
+	if (!(spte & shadow_nx_mask))
+		access |= ACC_EXEC_MASK;
+
+	return access;
+}
+
+static bool spte_satisfied(u64 spte, u32 access)
+{
+	return (spte_access(spte) & access) == access;
+}
+
+static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, gfn_t gfn,
+				   u32 error_code)
+{
+	unsigned long *rmap;
+	bool write = error_code & PFERR_WRITE_MASK;
+
+	/*
+	 * #PF can be fast only if the shadow page table is present, that
+	 * means we just need change the access bits (e.g: R/W, U/S...)
+	 * which can be done out of mmu-lock.
+	 */
+	if (!(error_code & PFERR_PRESENT_MASK))
+		return false;
+
+	if (unlikely(vcpu->vcpu_id > max_vcpu_spte()))
+		return false;
+
+	rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);
+
+	/* Quickly check the page can be writable. */
+	if (write && (ACCESS_ONCE(*rmap) & PTE_LIST_WRITE_PROTECT))
+		return false;
+
+	return true;
+}
+
+typedef bool (*fast_pf_fetch_spte)(struct kvm_vcpu *vcpu, u64 *sptep,
+				   u64 *new_spte, gfn_t gfn, u32 expect_access,
+				   u64 spte);
+
+static bool
+fast_pf_fetch_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 *new_spte,
+			  gfn_t gfn, u32 expect_access, u64 spte)
+{
+	struct kvm_mmu_page *sp = page_header(__pa(sptep));
+
+	WARN_ON(!sp->role.direct);
+
+	if (kvm_mmu_page_get_gfn(sp, sptep - sp->spt) != gfn)
+		return false;
+
+	set_spte(vcpu, new_spte, sp->role.access,
+		 expect_access & ACC_USER_MASK, expect_access & ACC_WRITE_MASK,
+		 sp->role.level, gfn, spte_to_pfn(spte), false, false,
+		 spte & SPTE_HOST_WRITEABLE, true);
+
+	return true;
+}
+
+static bool
+fast_page_fault_fix_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte,
+			 gfn_t gfn, u32 expect_access,
+			 fast_pf_fetch_spte fn)
+{
+	u64 new_spte = 0ull;
+	int vcpu_id = vcpu->vcpu_id;
+
+	spte = mark_vcpu_id_spte(sptep, spte, vcpu_id);
+
+	/*
+	 * Storing vcpu id into spte should be before read
+	 * PTE_LIST_WRITABLE bit.
+	 */
+	smp_mb();
+
+	/*
+	 * In most case, cmpxchg is enough to set access bits but we
+	 * should pay more attention to page write-protect path, it is
+	 * a read-check-modify path: read spte, check W bit, then clear
+	 * W bit. In order to avoid marking spte writable after/during
+	 * page write-protect, we do the trick like below:
+	 *
+	 *      fast page fault path:
+	 *            lock RCU
+	 *            set identification in the spte
+	 *            smp_mb()
+	 *            if (!rmap.PTE_LIST_WRITE_PROTECT)
+	 *                 cmpxchg + w - vcpu-id
+	 *            unlock RCU
+	 *
+	 *      write protect path:
+	 *            lock mmu-lock
+	 *            set rmap.PTE_LIST_WRITE_PROTECT
+	 *                 smp_mb()
+	 *            if (spte.w || spte has identification)
+	 *                 clear w bit and identification
+	 *            unlock mmu-lock
+	 *
+	 * Setting identification in the spte is used to notify
+	 * page-protect path to modify the spte, then we can see the
+	 * change in the cmpxchg.
+	 *
+	 * Setting identification is also a trick: it only set the last
+	 * bit of spte that does not change the mapping and lose cpu
+	 * status bits.
+	 *
+	 * The identification should be unique to avoid the below race:
+	 *
+	 *      VCPU 0                VCPU 1            VCPU 2
+	 *      lock RCU
+	 *   spte + identification
+	 *   check conditions
+	 *                       do write-protect, clear
+	 *                          identification
+	 *                                              lock RCU
+	 *                                        set identification
+	 *     cmpxchg + w - identification
+	 *        OOPS!!!
+	 *
+	 * We choose the vcpu id as the unique value.
+	 */
+
+	new_spte = 0ull;
+	if (!fn(vcpu, sptep, &new_spte, gfn, expect_access, spte))
+		return false;
+
+	if (!spte_satisfied(new_spte, expect_access))
+		return false;
+
+	/*
+	 * We can not remap a spte from writable to read-only out of
+	 * mmu-lock, since it need flush tlbs to sync guest page
+	 * write-protect.
+	 * See the comment in set_spte().
+	 */
+	if (unlikely(is_writable_pte(spte) && !is_writable_pte(new_spte)))
+		return false;
+
+	cmpxchg(sptep, spte, new_spte);
+
+	return true;
+}
+
+/*
+ * Return value:
+ * - true: let the vcpu to access on the same address again.
+ * - false: let the real page fault path to fix it.
+ */
+static bool
+fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn, int level,
+		u32 error_code, fast_pf_fetch_spte fn)
+{
+	struct kvm_shadow_walk_iterator iterator;
+	struct kvm_mmu_page *sp;
+	u32 expected_access;
+	bool ret = false;
+	u64 spte = 0ull;
+
+	if (!page_fault_can_be_fast(vcpu, gfn, error_code))
+		return false;
+
+	walk_shadow_page_lockless_begin(vcpu);
+	for_each_shadow_entry_lockless(vcpu, gva, iterator, spte)
+		if (!is_shadow_present_pte(spte) || iterator.level < level)
+			break;
+
+	/*
+	 * If the mapping has been changed, let the vcpu fault on the
+	 * same address again.
+	 */
+	if (!is_rmap_spte(spte)) {
+		ret = true;
+		goto exit;
+	}
+
+	/*
+	 * Check if it is a spurious fault caused by TLB lazily flushed.
+	 *
+	 * Need not check the access of upper level table entries since
+	 * they are always ACC_ALL.
+	 */
+	expected_access = page_fault_expected_access(error_code);
+	if (spte_satisfied(spte, expected_access)) {
+		ret = true;
+		goto exit;
+	}
+
+	sp = page_header(__pa(iterator.sptep));
+	if (sp->role.level != level || !is_last_spte(spte, level))
+		goto exit;
+
+	/*
+	 * If the page fault is caused by write but host do not allow
+	 * to write the page, we need cow the host page.
+	 */
+	if ((error_code & PFERR_WRITE_MASK) && !(spte & SPTE_HOST_WRITEABLE))
+		goto exit;
+
+	/*
+	 * Do not expand the access of sp.
+	 *
+	 * Checking sp->role.access here is safe since it is never
+	 * changed after it is linked into shadow page table.
+	 */
+	if ((sp->role.access & expected_access) != expected_access)
+		goto exit;
+
+	ret = fast_page_fault_fix_spte(vcpu, iterator.sptep, spte, gfn,
+				       expected_access, fn);
+
+exit:
+	walk_shadow_page_lockless_end(vcpu);
+
+	return ret;
+}
+
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 			 gva_t gva, pfn_t *pfn, bool write, bool *writable);

-static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,
-			 bool prefault)
+static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
+			 gfn_t gfn, bool prefault)
 {
 	int r;
 	int level;
 	int force_pt_level;
 	pfn_t pfn;
 	unsigned long mmu_seq;
-	bool map_writable;
+	bool map_writable, write = error_code & PFERR_WRITE_MASK;

 	force_pt_level = mapping_level_dirty_bitmap(vcpu, gfn);
 	if (likely(!force_pt_level)) {
@@ -2795,6 +3044,10 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,
 	} else
 		level = PT_PAGE_TABLE_LEVEL;

+	if (fast_page_fault(vcpu, v, gfn, level, error_code,
+			    fast_pf_fetch_direct_spte))
+		return 0;
+
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();

@@ -3195,7 +3448,7 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 	gfn = gva >> PAGE_SHIFT;

 	return nonpaging_map(vcpu, gva & PAGE_MASK,
-			     error_code & PFERR_WRITE_MASK, gfn, prefault);
+			     error_code, gfn, prefault);
 }

 static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
@@ -3275,6 +3528,10 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	} else
 		level = PT_PAGE_TABLE_LEVEL;

+	if (fast_page_fault(vcpu, gpa, gfn, level, error_code,
+			    fast_pf_fetch_direct_spte))
+		return 0;
+
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index e2af5a5..e1694e8 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -568,6 +568,43 @@ static gpa_t FNAME(get_sp_gpa)(struct kvm_mmu_page *sp)
 	return gfn_to_gpa(sp->gfn) + offset;
 }

+static bool
+FNAME(fast_pf_fetch_indirect_spte)(struct kvm_vcpu *vcpu, u64 *sptep,
+				   u64 *new_spte, gfn_t gfn,
+				   u32 expect_access, u64 spte)
+
+{
+	struct kvm_mmu_page *sp = page_header(__pa(sptep));
+	pt_element_t gpte;
+	gpa_t pte_gpa;
+	unsigned pte_access;
+
+	if (sp->role.direct)
+		return fast_pf_fetch_direct_spte(vcpu, sptep, new_spte,
+						 gfn, expect_access, spte);
+
+	pte_gpa = FNAME(get_sp_gpa)(sp);
+	pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
+
+	if (kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &gpte,
+				      sizeof(pt_element_t)))
+		return false;
+
+	if (FNAME(invalid_gpte)(vcpu, gpte))
+		return false;
+
+	if (gpte_to_gfn(gpte) != gfn)
+		return false;
+
+	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte, true);
+	set_spte(vcpu, new_spte, pte_access, expect_access & ACC_USER_MASK,
+		 expect_access & ACC_WRITE_MASK, sp->role.level, gfn,
+		 spte_to_pfn(spte), false, false,
+		 spte & SPTE_HOST_WRITEABLE, true);
+
+	return true;
+}
+
 /*
  * Page fault handler.  There are several causes for a page fault:
  *   - there is no shadow pte for the guest pte
@@ -632,6 +669,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 		walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
 	}

+	if (fast_page_fault(vcpu, addr, walker.gfn, level,
+			    error_code, FNAME(fast_pf_fetch_indirect_spte)))
+		return 0;
+
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();

-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 12/13] KVM: MMU: trace fast page fault
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
                   ` (10 preceding siblings ...)
  2012-03-29  9:27 ` [PATCH 11/13] KVM: MMU: fast path of handling guest page fault Xiao Guangrong
@ 2012-03-29  9:27 ` Xiao Guangrong
  2012-03-29  9:28 ` [PATCH 13/13] KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint Xiao Guangrong
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:27 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

To see what happens on this path and to help us optimize it

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c      |    2 ++
 arch/x86/kvm/mmutrace.h |   43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4a01be4..94cd31d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3011,6 +3011,8 @@ fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn, int level,
 				       expected_access, fn);

 exit:
+	trace_fast_page_fault(vcpu, gva, gfn, error_code, iterator.sptep,
+			      spte, ret);
 	walk_shadow_page_lockless_end(vcpu);

 	return ret;
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 89fb0e8..a81c8a7 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -243,6 +243,49 @@ TRACE_EVENT(
 	TP_printk("addr:%llx gfn %llx access %x", __entry->addr, __entry->gfn,
 		  __entry->access)
 );
+
+static u32 page_fault_expected_access(u32 error_code);
+static bool spte_satisfied(u64 spte, u32 access);
+#define __spte_satisfied(__spte)				\
+	(__entry->success && spte_satisfied(__entry->__spte,	\
+	      page_fault_expected_access(__entry->error_code)))
+TRACE_EVENT(
+	fast_page_fault,
+	TP_PROTO(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn, u32 error_code,
+		 u64 *sptep, u64 old_spte, bool retry),
+	TP_ARGS(vcpu, gva, gfn, error_code, sptep, old_spte, retry),
+
+	TP_STRUCT__entry(
+		__field(int, vcpu_id)
+		__field(gva_t, gva)
+		__field(gfn_t, gfn)
+		__field(u32, error_code)
+		__field(u64 *, sptep)
+		__field(u64, old_spte)
+		__field(u64, new_spte)
+		__field(bool, success)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu->vcpu_id;
+		__entry->gva = gva;
+		__entry->gfn = gfn;
+		__entry->error_code = error_code;
+		__entry->sptep = sptep;
+		__entry->old_spte = old_spte;
+		__entry->new_spte = *sptep;
+		__entry->success = retry;
+	),
+
+	TP_printk("vcpu %d gva %lx gfn %llx error_code %s sptep %p "
+		  "old %#llx new %llx spurious %d fixed %d", __entry->vcpu_id,
+		  __entry->gva, __entry->gfn,
+		  __print_flags(__entry->error_code, "|",
+				kvm_mmu_trace_pferr_flags),
+		  __entry->sptep, __entry->old_spte, __entry->new_spte,
+		  __spte_satisfied(old_spte), __spte_satisfied(new_spte)
+	)
+);
 #endif /* _TRACE_KVMMMU_H */

 #undef TRACE_INCLUDE_PATH
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 13/13] KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
                   ` (11 preceding siblings ...)
  2012-03-29  9:27 ` [PATCH 12/13] KVM: MMU: trace fast " Xiao Guangrong
@ 2012-03-29  9:28 ` Xiao Guangrong
  2012-03-29 10:18 ` [PATCH 00/13] KVM: MMU: fast page fault Avi Kivity
  2012-04-09 13:12 ` Avi Kivity
  14 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29  9:28 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

The P bit of the page fault error code is missing in this tracepoint; fix it
by passing the full error code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmutrace.h    |    7 +++----
 arch/x86/kvm/paging_tmpl.h |    3 +--
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index a81c8a7..f6cac77 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -54,8 +54,8 @@
  */
 TRACE_EVENT(
 	kvm_mmu_pagetable_walk,
-	TP_PROTO(u64 addr, int write_fault, int user_fault, int fetch_fault),
-	TP_ARGS(addr, write_fault, user_fault, fetch_fault),
+	TP_PROTO(u64 addr, u32 pferr),
+	TP_ARGS(addr, pferr),

 	TP_STRUCT__entry(
 		__field(__u64, addr)
@@ -64,8 +64,7 @@ TRACE_EVENT(

 	TP_fast_assign(
 		__entry->addr = addr;
-		__entry->pferr = (!!write_fault << 1) | (!!user_fault << 2)
-		                 | (!!fetch_fault << 4);
+		__entry->pferr = pferr;
 	),

 	TP_printk("addr %llx pferr %x %s", __entry->addr, __entry->pferr,
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index e1694e8..ab6b925 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -154,8 +154,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	const int fetch_fault = access & PFERR_FETCH_MASK;
 	u16 errcode = 0;

-	trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault,
-				     fetch_fault);
+	trace_kvm_mmu_pagetable_walk(addr, access);
 retry_walk:
 	eperm = false;
 	walker->level = mmu->root_level;
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
                   ` (12 preceding siblings ...)
  2012-03-29  9:28 ` [PATCH 13/13] KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint Xiao Guangrong
@ 2012-03-29 10:18 ` Avi Kivity
  2012-03-29 11:40   ` Xiao Guangrong
  2012-04-09 13:12 ` Avi Kivity
  14 siblings, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-03-29 10:18 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> * Idea
> The present bit of page fault error code (EFEC.P) indicates whether the
> page table is populated on all levels, if this bit is set, we can know
> the page fault is caused by the page-protection bits (e.g. W/R bit) or
> the reserved bits.
>
> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
> simply fixed: the page fault caused by reserved bit
> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
> is just increasing the corresponding access on the spte.
>
> This pachset introduces a fast path to fix this kind of page fault: it
> is out of mmu-lock and need not walk host page table to get the mapping
> from gfn to pfn.

Wow!

Looks like interesting times are back in mmu-land.

Comments below are before review of actual patches, so maybe they're
already answered there, or maybe they're just nonsense.

> * Advantage
> - it is really fast
>   it fixes page fault out of mmu-lock, and uses a very light way to avoid
>   the race with other pathes. Also, it fixes page fault in the front of
>   gfn_to_pfn, it means no host page table walking.
>
> - we can get lots of page fault with PFEC.P = 1 in KVM:
>   - in the case of ept/npt
>    after shadow page become stable (all gfn is mapped in shadow page table,
>    it is a short stage since only one shadow page table is used and only a
>    few of page is needed), almost all page fault is caused by write-protect
>    (frame-buffer under Xwindow, migration), the other small part is caused
>    by page merge/COW under KSM/THP.
>
>   We do not hope it can fix the page fault caused by the read-only host
>   page of KSM, since after COW, all the spte pointing to the gfn will be
>   unmapped.
>
> - in the case of soft mmu
>   - many spurious page fault due to tlb lazily flushed
>   - lots of write-protect page fault (dirty bit track for guest pte, shadow
>     page table write-protected, frame-buffer under Xwindow, migration, ...)
>
>
> * Implementation
> We can freely walk the page between walk_shadow_page_lockless_begin and
> walk_shadow_page_lockless_end, it can ensure all the shadow page is valid.
>
> In the most case, cmpxchg is fair enough to change the access bit of spte,
> but the write-protect path on softmmu/nested mmu is a especial case: it is
> a read-check-modify path: read spte, check W bit, then clear W bit.

We also set gpte.D and gpte.A, no? How do you handle that?

>  In order
> to avoid marking spte writable after/during page write-protect, we do the
> trick like below:
>
>       fast page fault path:
>             lock RCU
>             set identification in the spte

What if you can't (already taken)?  Spin?  Slow path?

>             smp_mb()
>             if (!rmap.PTE_LIST_WRITE_PROTECT)
>                  cmpxchg + w - vcpu-id
>             unlock RCU
>
>       write protect path:
>             lock mmu-lock
>             set rmap.PTE_LIST_WRITE_PROTECT
>                  smp_mb()
>             if (spte.w || spte has identification)
>                  clear w bit and identification
>             unlock mmu-lock
>
> Setting identification in the spte is used to notify page-protect path to
> modify the spte, then we can see the change in the cmpxchg.
>
> Setting identification is also a trick: it only set the last bit of spte
> that does not change the mapping and lose cpu status bits.

There are plenty of available bits, 53-62.

>
> The identification should be unique to avoid the below race:
>
>      VCPU 0                VCPU 1            VCPU 2
>       lock RCU
>    spte + identification
>    check conditions
>                        do write-protect, clear
>                           identification
>                                               lock RCU
>                                         set identification
>      cmpxchg + w - identification
>         OOPS!!!

Is it not sufficient to use just two bits?

pf_lock - taken by page fault path
wp_lock - taken by write protect path

pf cmpxchg checks both bits.

> We choose the vcpu id as the unique value, currently, 254 vcpus on VMX
> and 127 vcpus on softmmu can be fast. Keep it simply firtsly. :)
>
>
> * Performance
> It introduces a full memory barrier on the page write-protect path, i
> have done the test of kernbench in the text mode which does not generate
> write-protect page fault by frame-buffer avoiding the optimization
> introduced by this patch, it shows no regression.
>
> And there is the result tested by x11perf and migration on autotest:
>
> x11perf (x11perf -repeat 10 -comppixwin500):
> (Host: Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz * 4 + 4G
>  Guest: 4 vcpus + 1G)
>
> - For ept:
> $ x11perfcomp baseline-hard optimaze-hard
> 1: baseline-hard
> 2: optimaze-hard
>
>      1         2    Operation
> --------  --------  ---------
>   7060.0    7150.0  Composite 500x500 from pixmap to window
>
> - For shadow mmu:
> $ x11perfcomp baseline-soft optimaze-soft
> 1: baseline-soft
> 2: optimaze-soft
>
>      1         2    Operation
> --------  --------  ---------
>   6980.0    7490.0  Composite 500x500 from pixmap to window
>
> ( It is interesting that after this patch, the performance of x11perf on
>   softmmu is better than it on hardmmu, i have tested it for many times,
>   it is really true. :) )

It could be because you cannot use THP with dirty logging, so you pay
the overhead of TDP.

> autotest migration:
> (Host: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz * 12 + 32G)
>
> - For ept:
>
> Before:
>                     smp2.Fedora.16.64.migrate
> Times   .unix      .with_autotest.dbench.unix     total
>  1       102           204                         309
>  2       68            203                         275
>  3       67            218                         289
>
> After:
>                     smp2.Fedora.16.64.migrate
> Times   .unix      .with_autotest.dbench.unix     total
>  1       103           189                         295
>  2       67            188                         259
>  3       64            202                         271
>
>
> - For shadow mmu:
>
> Before:
>                     smp2.Fedora.16.64.migrate
> Times   .unix      .with_autotest.dbench.unix     total
>  1       102           262                         368
>  2       68            220                         292
>  3       68            234                         307
>
> After:
>                     smp2.Fedora.16.64.migrate
> Times   .unix      .with_autotest.dbench.unix     total
>  1       104           231                         341
>  2       68            218                         289
>  3       66            205                         275
>
>
> Any comments are welcome. :)
>

Very impressive.  Now to review the patches (will take me some time).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 02/13] KVM: MMU: abstract spte write-protect
  2012-03-29  9:21 ` [PATCH 02/13] KVM: MMU: abstract spte write-protect Xiao Guangrong
@ 2012-03-29 11:11   ` Avi Kivity
  2012-03-29 11:51     ` Xiao Guangrong
  0 siblings, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-03-29 11:11 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 11:21 AM, Xiao Guangrong wrote:
> Introduce a common function to abstract spte write-protect to cleanup the
> code
>
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c |   57 ++++++++++++++++++++++++++++++---------------------
>  1 files changed, 33 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index c759e4f..ad40647 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1015,27 +1015,43 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
>  		rmap_remove(kvm, sptep);
>  }
>
> +/* Return true if the spte is dropped. */
> +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
> +			       int *flush)

bool *flush

> +{
> +	u64 spte = *sptep;
> +
> +	if (!is_writable_pte(spte))
> +		return false;
> +
> +	*flush |= true;
> +
> +	if (large) {
> +		pgprintk("rmap_write_protect(large): spte %p %llx\n",
> +			 spte, *spte);
> +		BUG_ON(!is_large_pte(spte));
> +
> +		drop_spte(kvm, sptep);
> +		--kvm->stat.lpages;
> +		return true;
> +	}

As I mentioned before, write-protecting a large spte is a good idea,
since it moves some work from protect-time to fault-time, so it reduces
jitter.  This removes the need for the return value.

It may also be a good idea to populate the lower level instead of
dropping the spte.

All outside this patch set of course.  I'd add those ideas to the wiki
but it won't let me edit at the moment.

> +
> +	rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
> +	spte = spte & ~PT_WRITABLE_MASK;
> +	mmu_spte_update(sptep, spte);
> +
> +	return false;
> +}
> +
>


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-03-29 10:18 ` [PATCH 00/13] KVM: MMU: fast page fault Avi Kivity
@ 2012-03-29 11:40   ` Xiao Guangrong
  2012-03-29 12:57     ` Avi Kivity
  0 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29 11:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 06:18 PM, Avi Kivity wrote:

> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
>> * Idea
>> The present bit of page fault error code (EFEC.P) indicates whether the
>> page table is populated on all levels, if this bit is set, we can know
>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
>> the reserved bits.
>>
>> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
>> simply fixed: the page fault caused by reserved bit
>> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
>> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
>> is just increasing the corresponding access on the spte.
>>
>> This pachset introduces a fast path to fix this kind of page fault: it
>> is out of mmu-lock and need not walk host page table to get the mapping
>> from gfn to pfn.
> 
> Wow!
> 
> Looks like interesting times are back in mmu-land.
> 


:)

> Comments below are before review of actual patches, so maybe they're
> already answered there, or maybe they're just nonsense.
> 


Your comments are always appreciated!

>> * Implementation
>> We can freely walk the page between walk_shadow_page_lockless_begin and
>> walk_shadow_page_lockless_end, it can ensure all the shadow page is valid.
>>
>> In the most case, cmpxchg is fair enough to change the access bit of spte,
>> but the write-protect path on softmmu/nested mmu is a especial case: it is
>> a read-check-modify path: read spte, check W bit, then clear W bit.
> 
> We also set gpte.D and gpte.A, no? How do you handle that?
> 


We still need to walk the guest page table before the fast page fault path to
check whether the access is valid.

>>  In order
>> to avoid marking spte writable after/during page write-protect, we do the
>> trick like below:
>>
>>       fast page fault path:
>>             lock RCU
>>             set identification in the spte
> 
> What if you can't (already taken)?  Spin?  Slow path?


In this patch, concurrent access to the same spte is allowed: each vcpu freely
sets its identification on the spte, because I did not want to introduce other
atomic operations.

You remind me that there may be a risk: if many vcpus fault on the same spte,
they may retry it forever. Hmm, how about fixing it like this:

if (spte.identification == 0) {
	set spte.identification = vcpu.id
	goto cmpxchg-path
}

if (spte.identification == vcpu.id)
	goto cmpxchg-path

return to guest and retry the address again;

cmpxchg-path:
	do checks and cmpxchg

It ensures the spte can eventually be updated.
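
(For illustration only: a minimal user-space sketch of the retry rule above,
assuming the identification lives in the low byte of a 64-bit spte, 0 means
"untagged", and vcpu ids start at 1. The names, bit layout and GCC builtins
are choices made for the sketch, not the patch's actual encoding.)

#include <stdbool.h>
#include <stdint.h>

#define SPTE_ID_MASK	0xffull		/* assumed: identification in the low byte */
#define SPTE_W		(1ull << 1)	/* assumed writable bit, sketch only */

/*
 * A vcpu only goes down the cmpxchg path if the spte is untagged or already
 * tagged with its own id; otherwise it returns to the guest and retries.
 */
static bool fast_pf_try_fix(uint64_t *sptep, uint64_t vcpu_id)
{
	uint64_t old = *sptep;
	uint64_t id = old & SPTE_ID_MASK;
	uint64_t tagged, fixed;

	if (id != 0 && id != vcpu_id)
		return false;				/* owned by another vcpu */

	tagged = (old & ~SPTE_ID_MASK) | vcpu_id;
	*sptep = tagged;				/* plain write sets the identification */
	__sync_synchronize();				/* pairs with the write-protect side */

	/* ... re-check rmap.PTE_LIST_WRITE_PROTECT here ... */

	fixed = (old & ~SPTE_ID_MASK) | SPTE_W;		/* clear the id, set the W bit */
	return __sync_bool_compare_and_swap(sptep, tagged, fixed);
}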

>> The identification should be unique to avoid the below race:
>>
>>      VCPU 0                VCPU 1            VCPU 2
>>       lock RCU
>>    spte + identification
>>    check conditions
>>                        do write-protect, clear
>>                           identification
>>                                               lock RCU
>>                                         set identification
>>      cmpxchg + w - identification
>>         OOPS!!!
> 
> Is it not sufficient to use just two bits?
> 
> pf_lock - taken by page fault path
> wp_lock - taken by write protect path
> 
> pf cmpxchg checks both bits.
> 


If we just use two bits as the identification, does it have to use atomic
operations to maintain these bits? Or did I misunderstand?

>> - For ept:
>> $ x11perfcomp baseline-hard optimaze-hard
>> 1: baseline-hard
>> 2: optimaze-hard
>>
>>      1         2    Operation
>> --------  --------  ---------
>>   7060.0    7150.0  Composite 500x500 from pixmap to window
>>
>> - For shadow mmu:
>> $ x11perfcomp baseline-soft optimaze-soft
>> 1: baseline-soft
>> 2: optimaze-soft
>>
>>      1         2    Operation
>> --------  --------  ---------
>>   6980.0    7490.0  Composite 500x500 from pixmap to window
>>
>> ( It is interesting that after this patch, the performance of x11perf on
>>   softmmu is better than it on hardmmu, i have tested it for many times,
>>   it is really true. :) )
> 
> It could be because you cannot use THP with dirty logging, so you pay
> the overhead of TDP.
> 


Yes, I think so.

>> Any comments are welcome. :)
>>
> 
> Very impressive.  Now to review the patches (will take me some time).
> 


Thank you, Avi!


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 02/13] KVM: MMU: abstract spte write-protect
  2012-03-29 11:11   ` Avi Kivity
@ 2012-03-29 11:51     ` Xiao Guangrong
  0 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-29 11:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 07:11 PM, Avi Kivity wrote:


>> +/* Return true if the spte is dropped. */
>> +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
>> +			       int *flush)
> 
> bool *flush
> 


Okay, will fix.

>> +{
>> +	u64 spte = *sptep;
>> +
>> +	if (!is_writable_pte(spte))
>> +		return false;
>> +
>> +	*flush |= true;
>> +
>> +	if (large) {
>> +		pgprintk("rmap_write_protect(large): spte %p %llx\n",
>> +			 spte, *spte);
>> +		BUG_ON(!is_large_pte(spte));
>> +
>> +		drop_spte(kvm, sptep);
>> +		--kvm->stat.lpages;
>> +		return true;
>> +	}
> 
> As I mentioned before, write-protecting a large spte is a good idea,
> since it moves some work from protect-time to fault-time, so it reduces
> jitter.  This removes the need for the return value.
> 
> It may also be a good idea to populate the lower level instead of
> dropping the spte.
> 
> All outside this patch set of course.  I'd add those ideas to the wiki
> but it won't let me edit at the moment.
> 


I saw your idea; I will pick it up after this patchset if no one has done it
by then. :)


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-03-29 11:40   ` Xiao Guangrong
@ 2012-03-29 12:57     ` Avi Kivity
  2012-03-30  9:18       ` Xiao Guangrong
  0 siblings, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-03-29 12:57 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 01:40 PM, Xiao Guangrong wrote:
> >> * Implementation
> >> We can freely walk the page between walk_shadow_page_lockless_begin and
> >> walk_shadow_page_lockless_end, it can ensure all the shadow page is valid.
> >>
> >> In the most case, cmpxchg is fair enough to change the access bit of spte,
> >> but the write-protect path on softmmu/nested mmu is a especial case: it is
> >> a read-check-modify path: read spte, check W bit, then clear W bit.
> > 
> > We also set gpte.D and gpte.A, no? How do you handle that?
> > 
>
>
> We still need walk gust page table before fast page fault to check
> whether the access is valid.

Ok.  But that's outside the mmu lock.

We can also skip that: if !sp->unsync, copy access rights and gpte.A/D
into spare bits in the spte, and use that to check.
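
(A rough stand-alone illustration of caching guest bits in the spare spte
bits; the field position and width below are assumptions made up only for
this sketch.)

#include <stdint.h>

#define SPTE_GPTE_CACHE_SHIFT	53
#define SPTE_GPTE_CACHE_MASK	(0x3full << SPTE_GPTE_CACHE_SHIFT)

/* stash the relevant gpte bits (access rights, D) in the spare spte bits */
static uint64_t spte_cache_gpte_bits(uint64_t spte, uint64_t gpte_bits)
{
	return (spte & ~SPTE_GPTE_CACHE_MASK) |
	       ((gpte_bits << SPTE_GPTE_CACHE_SHIFT) & SPTE_GPTE_CACHE_MASK);
}

/* read them back on the fast page fault path to validate the access */
static uint64_t spte_cached_gpte_bits(uint64_t spte)
{
	return (spte & SPTE_GPTE_CACHE_MASK) >> SPTE_GPTE_CACHE_SHIFT;
}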

>
> >>  In order
> >> to avoid marking spte writable after/during page write-protect, we do the
> >> trick like below:
> >>
> >>       fast page fault path:
> >>             lock RCU
> >>             set identification in the spte
> > 
> > What if you can't (already taken)?  Spin?  Slow path?
>
>
> In this patch, it allows to concurrently access on the same spte:
> it freely set its identification on the spte, because i did not
> want to introduce other atomic operations.
>
> You remind me that there may be a risk: if many vcpu fault on the
> same spte, it will retry the spte forever. Hmm, how about fix it
> like this:
>
> if ( spte.identification = 0) {
> 	set spte.identification = vcpu.id
> 	goto cmpxchg-path
> }
>
> if (spte.identification == vcpu.id)
> 	goto cmpxchg-path
>
> return to guest and retry the address again;
>
> cmpxchg-path:
> 	do checks and cmpxchg
>
> It can ensure the spte can be updated.

What's the difference between this and


  if test_and_set_bit(spte.lock)
       return_to_guest
  else
       do checks and cmpxchg

?
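
(For comparison, a purely illustrative sketch of that bit-lock variant, using
one spare spte bit as the per-spte lock; the bit number and names are
assumptions, not anything taken from the patches.)

#include <stdbool.h>
#include <stdint.h>

#define SPTE_PF_LOCK	(1ull << 53)	/* assumed spare software bit */

static bool fast_pf_try_fix_bitlock(uint64_t *sptep)
{
	/* atomic test-and-set of the lock bit: if it was already set,
	 * return to the guest and let it retry the access */
	if (__sync_fetch_and_or(sptep, SPTE_PF_LOCK) & SPTE_PF_LOCK)
		return false;

	/* ... do checks and the cmpxchg, which also clears SPTE_PF_LOCK ... */
	return true;
}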

>
> >> The identification should be unique to avoid the below race:
> >>
> >>      VCPU 0                VCPU 1            VCPU 2
> >>       lock RCU
> >>    spte + identification
> >>    check conditions
> >>                        do write-protect, clear
> >>                           identification
> >>                                               lock RCU
> >>                                         set identification
> >>      cmpxchg + w - identification
> >>         OOPS!!!
> > 
> > Is it not sufficient to use just two bits?
> > 
> > pf_lock - taken by page fault path
> > wp_lock - taken by write protect path
> > 
> > pf cmpxchg checks both bits.
> > 
>
>
> If we just use two byte as identification, it has to use atomic
> operations to maintain these bits? or i misunderstood?

Ah, you're using byte writes to set the identification? Remember, I didn't
read the patches yet.

Please check the optimization manual, I remember there are warnings
there about using different sized accesses for the same location,
particularly when locked ops are involved.  Maybe bitops are cheaper. 
If the cost is similar, they're at least more similar to the rest of the
kernel.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 03/13] KVM: MMU: split FNAME(prefetch_invalid_gpte)
  2012-03-29  9:22 ` [PATCH 03/13] KVM: MMU: split FNAME(prefetch_invalid_gpte) Xiao Guangrong
@ 2012-03-29 13:00   ` Avi Kivity
  2012-03-30  3:51     ` Xiao Guangrong
  0 siblings, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-03-29 13:00 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 11:22 AM, Xiao Guangrong wrote:
> Split FNAME(prefetch_invalid_gpte) to check gpte independently which will
> be used in the later patch
>
>
> -static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
> -				    struct kvm_mmu_page *sp, u64 *spte,
> -				    pt_element_t gpte)
> +static bool FNAME(invalid_gpte)(struct kvm_vcpu *vcpu, pt_element_t gpte)
>  {
>  	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
> -		goto no_present;
> +		return true;
>
>  	if (!is_present_gpte(gpte))
> -		goto no_present;
> +		return true;
>
>  	if (!(gpte & PT_ACCESSED_MASK))
> -		goto no_present;
> +		return true;
>
>  	return false;

A comment (or a name change) is needed to indicate that we're not
checking for an invalid gpte, just invalid for caching in an spte.

> +}
>
>

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 04/13] KVM: MMU: introduce FNAME(get_sp_gpa)
  2012-03-29  9:22 ` [PATCH 04/13] KVM: MMU: introduce FNAME(get_sp_gpa) Xiao Guangrong
@ 2012-03-29 13:07   ` Avi Kivity
  2012-03-30  5:01     ` Xiao Guangrong
  0 siblings, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-03-29 13:07 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 11:22 AM, Xiao Guangrong wrote:
> It can calculate the base gpa of the specified shadow page on any level,
> let it instead of FNAME(get_level1_sp_gpa)
>
>
> +static gpa_t FNAME(get_sp_gpa)(struct kvm_mmu_page *sp)
> +{

Maybe worth adding a pte index and calling it get_spte_gpa(), depends on
the next patches I guess.

> +	int offset, shift;
> +
> +	shift = PAGE_SHIFT - (PT_LEVEL_BITS - PT64_LEVEL_BITS) * sp->role.level;
> +	offset = sp->role.quadrant << shift;

if you add '& ~PAGE_MASK' the compiler may be smart enough to drop the
whole thing for 64-bit ptes.

> +
> +	return gfn_to_gpa(sp->gfn) + offset;
> +}
> +
>  /*
>   * Page fault handler.  There are several causes for a page fault:
>   *   - there is no shadow pte for the guest pte
> @@ -659,18 +669,6 @@ out_unlock:
>  	return 0;
>

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 05/13] KVM: MMU: reset shadow_mmio_mask
  2012-03-29  9:23 ` [PATCH 05/13] KVM: MMU: reset shadow_mmio_mask Xiao Guangrong
@ 2012-03-29 13:10   ` Avi Kivity
  2012-03-29 15:28     ` Avi Kivity
  0 siblings, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-03-29 13:10 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 11:23 AM, Xiao Guangrong wrote:
> Do not use the last byte (bit 56 ~ bit 63) in shadow_mmio_mask, the late
> patch will store vcpu id in the last byte

Bit 63 is PT64_NX_MASK.



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 05/13] KVM: MMU: reset shadow_mmio_mask
  2012-03-29 13:10   ` Avi Kivity
@ 2012-03-29 15:28     ` Avi Kivity
  2012-03-29 16:24       ` Avi Kivity
  0 siblings, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-03-29 15:28 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 03:10 PM, Avi Kivity wrote:
> On 03/29/2012 11:23 AM, Xiao Guangrong wrote:
> > Do not use the last byte (bit 56 ~ bit 63) in shadow_mmio_mask, the late
> > patch will store vcpu id in the last byte
>
> Bit 63 is PT64_NX_MASK.

Oh, you did say that with softmmu we're limited to 127 cpus.  But what
if spte.nx is changed while we update the identity?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 08/13] KVM: MMU: fask check whether page is writable
  2012-03-29  9:25 ` [PATCH 08/13] KVM: MMU: fask check whether page is writable Xiao Guangrong
@ 2012-03-29 15:49   ` Avi Kivity
  2012-03-30  5:10     ` Xiao Guangrong
  2012-04-01 15:52   ` Avi Kivity
  1 sibling, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-03-29 15:49 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 11:25 AM, Xiao Guangrong wrote:
> Using PTE_LIST_WRITE_PROTECT bit in rmap store the write-protect status to
> avoid unnecessary shadow page walking

Does kvm_set_pte_rmapp() need adjustment?

> Also if no shadow page is indirect, the page is write-free
>
>
> +	if (!vcpu->kvm->arch.indirect_shadow_pages)
> +		return 0;
> +
Best in its own little patch.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 05/13] KVM: MMU: reset shadow_mmio_mask
  2012-03-29 15:28     ` Avi Kivity
@ 2012-03-29 16:24       ` Avi Kivity
  0 siblings, 0 replies; 92+ messages in thread
From: Avi Kivity @ 2012-03-29 16:24 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 05:28 PM, Avi Kivity wrote:
> On 03/29/2012 03:10 PM, Avi Kivity wrote:
> > On 03/29/2012 11:23 AM, Xiao Guangrong wrote:
> > > Do not use the last byte (bit 56 ~ bit 63) in shadow_mmio_mask, the late
> > > patch will store vcpu id in the last byte
> >
> > Bit 63 is PT64_NX_MASK.
>
> Oh, you did say that with softmmu we're limited to 127 cpus.  But what
> if spte.nx is changed while we update the identity?

Ok, saw that too.  But I think test_and_set_bit() is better here, to
reduce the complexity (note we need cmpxchg64b on 32-bit instead).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 03/13] KVM: MMU: split FNAME(prefetch_invalid_gpte)
  2012-03-29 13:00   ` Avi Kivity
@ 2012-03-30  3:51     ` Xiao Guangrong
  0 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-30  3:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 09:00 PM, Avi Kivity wrote:

> On 03/29/2012 11:22 AM, Xiao Guangrong wrote:
>> Split FNAME(prefetch_invalid_gpte) to check gpte independently which will
>> be used in the later patch
>>
>>
>> -static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
>> -				    struct kvm_mmu_page *sp, u64 *spte,
>> -				    pt_element_t gpte)
>> +static bool FNAME(invalid_gpte)(struct kvm_vcpu *vcpu, pt_element_t gpte)
>>  {
>>  	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
>> -		goto no_present;
>> +		return true;
>>
>>  	if (!is_present_gpte(gpte))
>> -		goto no_present;
>> +		return true;
>>
>>  	if (!(gpte & PT_ACCESSED_MASK))
>> -		goto no_present;
>> +		return true;
>>
>>  	return false;
> 
> A comment (or a name change) is needed to indicate that we're not
> checking for an invalid gpte, just invalid for caching in an spte.


Yes, I will change the name to FNAME(is_cacheable_gpte) and add some
comments.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 04/13] KVM: MMU: introduce FNAME(get_sp_gpa)
  2012-03-29 13:07   ` Avi Kivity
@ 2012-03-30  5:01     ` Xiao Guangrong
  2012-04-01 12:42       ` Avi Kivity
  0 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-30  5:01 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 09:07 PM, Avi Kivity wrote:

> On 03/29/2012 11:22 AM, Xiao Guangrong wrote:
>> It can calculate the base gpa of the specified shadow page on any level,
>> let it instead of FNAME(get_level1_sp_gpa)
>>
>>
>> +static gpa_t FNAME(get_sp_gpa)(struct kvm_mmu_page *sp)
>> +{
> 
> Maybe worth adding a pte index and calling it get_spte_gpa(), depends on
> the next patches I guess.
> 


Yes.

But FNAME(get_sp_gpa) is still needed by the FNAME(sync_page) path, which
syncs all sptes. I will introduce get_spte_gpa() as you mentioned, which will
be used in the invlpg path and in the later patch.


>> +	int offset, shift;
>> +
>> +	shift = PAGE_SHIFT - (PT_LEVEL_BITS - PT64_LEVEL_BITS) * sp->role.level;
>> +	offset = sp->role.quadrant << shift;
> 
> if you add '& ~PAGE_MASK' the compiler may be smart enough to drop the
> whole thing for 64-bit ptes.
> 


Sorry, how do I do that?

But if you want to remove the unnecessary work, I will change it
like this:

static gpa_t FNAME(get_sp_gpa)(struct kvm_mmu_page *sp)
{
	int offset = 0;

#if PTTYPE == 32
	int shift;

	shift = PAGE_SHIFT - (PT_LEVEL_BITS - PT64_LEVEL_BITS) * sp->role.level;
	offset = sp->role.quadrant << shift;
#endif
	return gfn_to_gpa(sp->gfn) + offset;
}


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 08/13] KVM: MMU: fask check whether page is writable
  2012-03-29 15:49   ` Avi Kivity
@ 2012-03-30  5:10     ` Xiao Guangrong
  0 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-30  5:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 11:49 PM, Avi Kivity wrote:

> On 03/29/2012 11:25 AM, Xiao Guangrong wrote:
>> Using PTE_LIST_WRITE_PROTECT bit in rmap store the write-protect status to
>> avoid unnecessary shadow page walking
> 
> Does kvm_set_pte_rmapp() need adjustment?
> 


Yes, in kvm_set_pte_rmapp(), if the page is host write-protected, it will
set this bit:

static void host_page_write_protect(u64 *spte, unsigned long *rmapp)
{
	if (!(*spte & SPTE_HOST_WRITEABLE) &&
	      !(*rmapp & PTE_LIST_WRITE_PROTECT))
		*rmapp |= PTE_LIST_WRITE_PROTECT;
}

It is very useful for the fast page fault path to avoid useless shadow
page table walking when KSM is enabled.

>> Also if no shadow page is indirect, the page is write-free
>>
>>
>> +	if (!vcpu->kvm->arch.indirect_shadow_pages)
>> +		return 0;
>> +
> Best in its own little patch.
> 

Okay, will split it into a little patch.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-03-29 12:57     ` Avi Kivity
@ 2012-03-30  9:18       ` Xiao Guangrong
  2012-03-31 13:12         ` Xiao Guangrong
  2012-04-01 12:58         ` Avi Kivity
  0 siblings, 2 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-30  9:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 08:57 PM, Avi Kivity wrote:

> On 03/29/2012 01:40 PM, Xiao Guangrong wrote:
>>>> * Implementation
>>>> We can freely walk the page between walk_shadow_page_lockless_begin and
>>>> walk_shadow_page_lockless_end, it can ensure all the shadow page is valid.
>>>>
>>>> In the most case, cmpxchg is fair enough to change the access bit of spte,
>>>> but the write-protect path on softmmu/nested mmu is a especial case: it is
>>>> a read-check-modify path: read spte, check W bit, then clear W bit.
>>>
>>> We also set gpte.D and gpte.A, no? How do you handle that?
>>>
>>
>>
>> We still need walk gust page table before fast page fault to check
>> whether the access is valid.
> 
> Ok.  But that's outside the mmu lock.
> 
> We can also skip that: if !sp->unsync, copy access rights and gpte.A/D
> into spare bits in the spte, and use that to check.
> 


Great!

gpte.A need not be copied into the spte: PFEC.P = 1 means the shadow page
table is present, so gpte.A must already be set in this case.

And we do not need to cache the gpte access rights in the spte; instead, we
can atomically read the gpte to get this information (anyway, we need to read
the gpte to get the gfn).

I will do it in the next version.

>>
>>>>  In order
>>>> to avoid marking spte writable after/during page write-protect, we do the
>>>> trick like below:
>>>>
>>>>       fast page fault path:
>>>>             lock RCU
>>>>             set identification in the spte
>>>
>>> What if you can't (already taken)?  Spin?  Slow path?
>>
>>
>> In this patch, it allows to concurrently access on the same spte:
>> it freely set its identification on the spte, because i did not
>> want to introduce other atomic operations.
>>
>> You remind me that there may be a risk: if many vcpu fault on the
>> same spte, it will retry the spte forever. Hmm, how about fix it
>> like this:
>>
>> if ( spte.identification = 0) {
>> 	set spte.identification = vcpu.id
>> 	goto cmpxchg-path
>> }
>>
>> if (spte.identification == vcpu.id)
>> 	goto cmpxchg-path
>>
>> return to guest and retry the address again;
>>
>> cmpxchg-path:
>> 	do checks and cmpxchg
>>
>> It can ensure the spte can be updated.
> 
> What's the difference between this and
> 
> 
>   if test_and_set_bit(spte.lock)
>        return_to_guest
>   else
>        do checks and cmpxchg
> 
> ?
> 


test_and_set_bit is an atomic operation, which is what I want to avoid.

>>
>>>> The identification should be unique to avoid the below race:
>>>>
>>>>      VCPU 0                VCPU 1            VCPU 2
>>>>       lock RCU
>>>>    spte + identification
>>>>    check conditions
>>>>                        do write-protect, clear
>>>>                           identification
>>>>                                               lock RCU
>>>>                                         set identification
>>>>      cmpxchg + w - identification
>>>>         OOPS!!!
>>>
>>> Is it not sufficient to use just two bits?
>>>
>>> pf_lock - taken by page fault path
>>> wp_lock - taken by write protect path
>>>
>>> pf cmpxchg checks both bits.
>>>
>>
>>
>> If we just use two byte as identification, it has to use atomic
>> operations to maintain these bits? or i misunderstood?
> 
> Ah, you're using byte writes to set the identification? remember I
> didn't read the patches yet.
> 


Yes.

> Please check the optimization manual, I remember there are warnings
> there about using different sized accesses for the same location,
> particularly when locked ops are involved.


"using different sized accesses for the same location" is OK i guess,
anyway, we can have a data structure of this style:

union {
	struct {
		char byte1;
		char byte2;
		char byte3;
		char byte4;
	} foo;

	int bar;
};

Accessing .bar and .foo.byte1 is allowed.
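
(A tiny compilable example of that mixed-size access pattern, purely for
illustration:)

#include <stdio.h>

union spte_id {
	struct {
		char byte1, byte2, byte3, byte4;
	} foo;
	int bar;
};

int main(void)
{
	union spte_id s = { .bar = 0 };

	s.foo.byte1 = 7;			/* byte-sized store (the "identification") */
	printf("word view: %#x\n", s.bar);	/* word-sized load of the same location */
	return 0;
}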

And for the LOCK operation, I found these descriptions in the manual:

"Locked operations are atomic with respect to all other memory operations and all
externally visible events"

And there is a warning:
Software should access semaphores (shared memory used for signalling between
multiple processors) using identical addresses and operand lengths. For example, if
one processor accesses a semaphore using a word access, other processors should
not access the semaphore using a byte access.

Is it the warning you remember?

In our case, we do not treat it as a "semaphore": the "byte writes" do not
need to be atomic.

> Maybe bitops are cheaper. 
> If the cost is similar, they're at least more similar to the rest of the
> kernel.
> 


Are the "bitops" you mentioned the atomic-style bitops? Are they cheaper than
byte writes?


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 11/13] KVM: MMU: fast path of handling guest page fault
  2012-03-29  9:27 ` [PATCH 11/13] KVM: MMU: fast path of handling guest page fault Xiao Guangrong
@ 2012-03-31 12:24   ` Xiao Guangrong
  2012-04-01 16:23   ` Avi Kivity
  1 sibling, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-31 12:24 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

On 03/29/2012 05:27 PM, Xiao Guangrong wrote:


> +static bool
> +FNAME(fast_pf_fetch_indirect_spte)(struct kvm_vcpu *vcpu, u64 *sptep,
> +				   u64 *new_spte, gfn_t gfn,
> +				   u32 expect_access, u64 spte)
> +
> +{
> +	struct kvm_mmu_page *sp = page_header(__pa(sptep));
> +	pt_element_t gpte;
> +	gpa_t pte_gpa;
> +	unsigned pte_access;
> +
> +	if (sp->role.direct)
> +		return fast_pf_fetch_direct_spte(vcpu, sptep, new_spte,
> +						 gfn, expect_access, spte);
> +
> +	pte_gpa = FNAME(get_sp_gpa)(sp);
> +	pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
> +
> +	if (kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &gpte,
> +				      sizeof(pt_element_t)))
> +		return false;
> +
> +	if (FNAME(invalid_gpte)(vcpu, gpte))
> +		return false;
> +
> +	if (gpte_to_gfn(gpte) != gfn)
> +		return false;
> +


Oh, it cannot prevent the gpte from having been changed; the case below can be triggered:

      VCPU 0                       VCPU 1                          VCPU 2

gpte = gfn1 + RO + S + NX
spte = gfn1's pfn + RO + NX

                                modify gpte: gpte = gfn2 + W + U + X
                                (due to unsync-sp or write emulation
                                 before calling kvm_mmu_pte_write())

                                                         page fault on gpte:
                                                               gfn = gfn2
                                                         fast page fault:
                                                               spte = gfn1's pfn + W + U + X
                                                 (It also can break shadow page table write-protect)
                                                          OOPS!!!

The issue is that the gfn no longer matches the pfn in the spte.

Maybe we can properly use sp->gfns[] to avoid it:
- sp->gfns is freed in an RCU context
- sp->gfns[] is initialized to INVALID_GFN
- when a spte is dropped, set its sp->gfns[] entry to INVALID_GFN

On the fast page fault path, we can compare sp->gfns[] with the gfn read from
the gpte, then do the cmpxchg only if they are the same.

Then the thing becomes safe since:
- we have set the identification in the spte before the check, which means we
  can perceive any spte change in the later cmpxchg.
- checking sp->gfns[] ensures the spte is pointing to the gfn's pfn (see the
  sketch below).
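
(To make that ordering concrete, a heavily simplified sketch; every structure
and name below is a stand-in invented for the sketch, not the real KVM code.)

#include <stdbool.h>
#include <stdint.h>

#define INVALID_GFN	((uint64_t)-1)

struct sp_stub {
	uint64_t spt[512];	/* sptes */
	uint64_t gfns[512];	/* gfn each spte maps; INVALID_GFN once dropped */
};

/*
 * 'tagged' is the spte value after the identification has been set,
 * 'fixed' is the value we want to install (W bit set, id cleared).
 */
static bool fast_pf_fix_indirect(struct sp_stub *sp, int i, uint64_t gfn_from_gpte,
				 uint64_t tagged, uint64_t fixed)
{
	if (sp->gfns[i] != gfn_from_gpte)
		return false;	/* the spte no longer maps this gfn's pfn */

	/* any concurrent zap or write-protect changes the spte, so this fails */
	return __sync_bool_compare_and_swap(&sp->spt[i], tagged, fixed);
}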




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-03-30  9:18       ` Xiao Guangrong
@ 2012-03-31 13:12         ` Xiao Guangrong
  2012-04-01 12:58         ` Avi Kivity
  1 sibling, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-03-31 13:12 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

On 03/30/2012 05:18 PM, Xiao Guangrong wrote:

> On 03/29/2012 08:57 PM, Avi Kivity wrote:
> 
>> On 03/29/2012 01:40 PM, Xiao Guangrong wrote:
>>>>> * Implementation
>>>>> We can freely walk the page between walk_shadow_page_lockless_begin and
>>>>> walk_shadow_page_lockless_end, it can ensure all the shadow page is valid.
>>>>>
>>>>> In the most case, cmpxchg is fair enough to change the access bit of spte,
>>>>> but the write-protect path on softmmu/nested mmu is a especial case: it is
>>>>> a read-check-modify path: read spte, check W bit, then clear W bit.
>>>>
>>>> We also set gpte.D and gpte.A, no? How do you handle that?
>>>>
>>>
>>>
>>> We still need walk gust page table before fast page fault to check
>>> whether the access is valid.
>>
>> Ok.  But that's outside the mmu lock.
>>
>> We can also skip that: if !sp->unsync, copy access rights and gpte.A/D
>> into spare bits in the spte, and use that to check.
>>
> 
> 
> Great!
> 
> gpte.A need not be copied into spte since EFEC.P = 1 means the shadow page
> table is present, gpte.A must be set in this case.
> 
> And, we do not need to cache gpte access rights into spte, instead of it,
> we can atomicly read gpte to get these information (anyway, we need read gpte
> to get the gfn.)
> 


It needs more thinking. We can get an excellent improvement for dirty page
logging with this idea, but I am not sure we can gain performance in the
cases below:

- the page fault is triggered by an invalid access from the guest
  In the original way, it is fixed on the FNAME(walk_addr) path, which walks
  the guest page table; we may need to call gfn_to_pfn (it is fast since the
  page is usually not swapped out). With this idea, it is fixed on the fast
  page fault path, which walks the shadow page table with RCU locked and
  preemption disabled.

  They are not too different, I think.

- the page fault is caused by the host, but we cannot quickly check whether
  the page is writable since the gfn is unknown; only after the shadow page
  walk do we get the gfn (by reading the gpte). What will we do if the gfn is
  write-protected?

  - if the page is write-protected by the host (!spte.SPTE_HOST_WRITEABLE), we
    have no choice: just call gfn_to_pfn and wait for the page to be COWed.

    Compared with the original way, the time spent on shadow page table
    walking is wasted; unfortunately, this is triggered really frequently if
    KSM is enabled. It may be a regression.

  - if the write protection is caused by shadow page table protection, we have
    two choices:
    - call the slow page fault path. It is unacceptable, since the number of
      this kind of page fault is very large. We may waste a lot of CPU time.

    - hold mmu-lock and fix it in the last spte. It is OK but makes things a
      little complex; I am not sure if you agree with this. :)





^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 04/13] KVM: MMU: introduce FNAME(get_sp_gpa)
  2012-03-30  5:01     ` Xiao Guangrong
@ 2012-04-01 12:42       ` Avi Kivity
  0 siblings, 0 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-01 12:42 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/30/2012 08:01 AM, Xiao Guangrong wrote:
> On 03/29/2012 09:07 PM, Avi Kivity wrote:
>
> > On 03/29/2012 11:22 AM, Xiao Guangrong wrote:
> >> It can calculate the base gpa of the specified shadow page on any level,
> >> let it instead of FNAME(get_level1_sp_gpa)
> >>
> >>
> >> +static gpa_t FNAME(get_sp_gpa)(struct kvm_mmu_page *sp)
> >> +{
> > 
> > Maybe worth adding a pte index and calling it get_spte_gpa(), depends on
> > the next patches I guess.
> > 
>
>
> Yes.
>
> But FNAME(get_sp_gpa) is still needed by FNAME(sync_page) path which sync
> all sptes. I will introduce get_spte_gpa() as you mentioned which will be
> used in invlpg path and in the later patch.
>
>
> >> +	int offset, shift;
> >> +
> >> +	shift = PAGE_SHIFT - (PT_LEVEL_BITS - PT64_LEVEL_BITS) * sp->role.level;
> >> +	offset = sp->role.quadrant << shift;
> > 
> > if you add '& ~PAGE_MASK' the compiler may be smart enough to drop the
> > whole thing for 64-bit ptes.
> > 
>
>
> Sorry. how to do that?

   offset = (sp->role.quadrant << shift) & ~PAGE_MASK;

the compiler should see that shift == PAGE_SHIFT and therefore anything
in sp->role.quadrant will be masked out.
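
(A small stand-alone check of that claim, with the usual x86 constants
assumed: PAGE_SHIFT = 12, 9 index bits per level for 64-bit ptes, 10 for
32-bit ptes.)

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_MASK	(~((1ul << PAGE_SHIFT) - 1))

static unsigned long sp_offset(int pt_level_bits, int level, unsigned int quadrant)
{
	int shift = PAGE_SHIFT - (pt_level_bits - 9) * level;

	return ((unsigned long)quadrant << shift) & ~PAGE_MASK;
}

int main(void)
{
	/* 64-bit ptes: shift == PAGE_SHIFT, so the mask folds the offset to 0 */
	printf("64-bit, quadrant 1:          %#lx\n", sp_offset(9, 1, 1));
	/* 32-bit ptes: the quadrant selects a 2K half of the guest page */
	printf("32-bit, level 1, quadrant 1: %#lx\n", sp_offset(10, 1, 1));
	return 0;
}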

> But if you want to remove the unnecessary workload, i will change it
> like this:
>
> static gpa_t FNAME(get_sp_gpa)(struct kvm_mmu_page *sp)
> {
> 	int offset = 0;
>
> #if PTTYPE == 32
> 	int shift;
>
> 	shift = PAGE_SHIFT - (PT_LEVEL_BITS - PT64_LEVEL_BITS) * sp->role.level;
> 	offset = sp->role.quadrant << shift;
> #endif
> 	return gfn_to_gpa(sp->gfn) + offset;
> }

That works too.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-03-30  9:18       ` Xiao Guangrong
  2012-03-31 13:12         ` Xiao Guangrong
@ 2012-04-01 12:58         ` Avi Kivity
  2012-04-05 21:57           ` Xiao Guangrong
  1 sibling, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-04-01 12:58 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/30/2012 12:18 PM, Xiao Guangrong wrote:
> On 03/29/2012 08:57 PM, Avi Kivity wrote:
>
> > On 03/29/2012 01:40 PM, Xiao Guangrong wrote:
> >>>> * Implementation
> >>>> We can freely walk the page between walk_shadow_page_lockless_begin and
> >>>> walk_shadow_page_lockless_end, it can ensure all the shadow page is valid.
> >>>>
> >>>> In the most case, cmpxchg is fair enough to change the access bit of spte,
> >>>> but the write-protect path on softmmu/nested mmu is a especial case: it is
> >>>> a read-check-modify path: read spte, check W bit, then clear W bit.
> >>>
> >>> We also set gpte.D and gpte.A, no? How do you handle that?
> >>>
> >>
> >>
> >> We still need walk gust page table before fast page fault to check
> >> whether the access is valid.
> > 
> > Ok.  But that's outside the mmu lock.
> > 
> > We can also skip that: if !sp->unsync, copy access rights and gpte.A/D
> > into spare bits in the spte, and use that to check.
> > 
>
>
> Great!
>
> gpte.A need not be copied into spte since EFEC.P = 1 means the shadow page
> table is present, gpte.A must be set in this case.
>
> And, we do not need to cache gpte access rights into spte, instead of it,
> we can atomicly read gpte to get these information (anyway, we need read gpte
> to get the gfn.)

Why do we need the gfn?  If !sp->unsync then it is unchanged (it's also
in sp->gfns).

Oh, needed for mark_page_dirty().

Caching the gpte access in the spte will save decoding the access bits and
sp->role.access.

>
> I will do it in the next version.
>
> >>
> >>>>  In order
> >>>> to avoid marking spte writable after/during page write-protect, we do the
> >>>> trick like below:
> >>>>
> >>>>       fast page fault path:
> >>>>             lock RCU
> >>>>             set identification in the spte
> >>>
> >>> What if you can't (already taken)?  Spin?  Slow path?
> >>
> >>
> >> In this patch, it allows to concurrently access on the same spte:
> >> it freely set its identification on the spte, because i did not
> >> want to introduce other atomic operations.
> >>
> >> You remind me that there may be a risk: if many vcpu fault on the
> >> same spte, it will retry the spte forever. Hmm, how about fix it
> >> like this:
> >>
> >> if ( spte.identification = 0) {
> >> 	set spte.identification = vcpu.id
> >> 	goto cmpxchg-path
> >> }
> >>
> >> if (spte.identification == vcpu.id)
> >> 	goto cmpxchg-path
> >>
> >> return to guest and retry the address again;
> >>
> >> cmpxchg-path:
> >> 	do checks and cmpxchg
> >>
> >> It can ensure the spte can be updated.
> > 
> > What's the difference between this and
> > 
> > 
> >   if test_and_set_bit(spte.lock)
> >        return_to_guest
> >   else
> >        do checks and cmpxchg
> > 
> > ?
> > 
>
>
> test_and_set_bit is a atomic operation that is i want to avoid.

Right.  Can you check what the effect is (with, say,
test_and_set_bit(spte, 0) in the same path)?

I'm not convinced it's significant.

>
> >>
> >>>> The identification should be unique to avoid the below race:
> >>>>
> >>>>      VCPU 0                VCPU 1            VCPU 2
> >>>>       lock RCU
> >>>>    spte + identification
> >>>>    check conditions
> >>>>                        do write-protect, clear
> >>>>                           identification
> >>>>                                               lock RCU
> >>>>                                         set identification
> >>>>      cmpxchg + w - identification
> >>>>         OOPS!!!
> >>>
> >>> Is it not sufficient to use just two bits?
> >>>
> >>> pf_lock - taken by page fault path
> >>> wp_lock - taken by write protect path
> >>>
> >>> pf cmpxchg checks both bits.
> >>>
> >>
> >>
> >> If we just use two byte as identification, it has to use atomic
> >> operations to maintain these bits? or i misunderstood?
> > 
> > Ah, you're using byte writes to set the identification? remember I
> > didn't read the patches yet.
> > 
>
>
> Yes.
>
> > Please check the optimization manual, I remember there are warnings
> > there about using different sized accesses for the same location,
> > particularly when locked ops are involved.
>
>
> "using different sized accesses for the same location" is OK i guess,
> anyway, we can have the data struct of this style:
>
> union {
> 	struct	{
> 		char byte1,
> 		char byte2,
> 		char byte3,
> 		char byte4
>       	} foo;
>
> 	int bar;
> }
>
> Access .bar and .foo.byte1 is allowed.
>
> And for the LOCK operation, i find these descriptions in the manual:
>
> "Locked operations are atomic with respect to all other memory operations and all
> externally visible events"
>
> And there is a warning:
> Software should access semaphores (shared memory used for signalling between
> multiple processors) using identical addresses and operand lengths. For example, if
> one processor accesses a semaphore using a word access, other processors should
> not access the semaphore using a byte access.
>
> It is the warning you remember?

No, it should be safe, but performance is degraded if you mix loads and
stores of different sizes to the same location.

See for example "2.1.5.2 L1 DCache", Store Forwarding section in the
optimization guide.


>
> In our case, we do not treat it as "semaphores": the "byte writes" does not
> need be atomic.
>
> > Maybe bitops are cheaper. 
> > If the cost is similar, they're at least more similar to the rest of the
> > kernel.
> > 
>
>
> The "bitops" you mentioned is the atomic-style bitops? It is cheaper that
> byte writes?

They're not cheaper in general, but they may be cheaper than mixing byte
writes and word loads.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 08/13] KVM: MMU: fask check whether page is writable
  2012-03-29  9:25 ` [PATCH 08/13] KVM: MMU: fask check whether page is writable Xiao Guangrong
  2012-03-29 15:49   ` Avi Kivity
@ 2012-04-01 15:52   ` Avi Kivity
  2012-04-05 17:54     ` Xiao Guangrong
  1 sibling, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-04-01 15:52 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 11:25 AM, Xiao Guangrong wrote:
> Using PTE_LIST_WRITE_PROTECT bit in rmap store the write-protect status to
> avoid unnecessary shadow page walking
>
> Also if no shadow page is indirect, the page is write-free
>
>
> @@ -2262,6 +2291,9 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
>  	}
>  	if (need_unsync)
>  		kvm_unsync_pages(vcpu, gfn);
> +
> +	*rmap &= ~PTE_LIST_WRITE_PROTECT;
> +
>

So what are the rules for PTE_LIST_WRITE_PROTECT?  Is it a cache for
mmu_need_write_protect?

I'd like to understand it, I guess it can be set while write protection
is unneeded, and cleared on the next check?

Maybe split into two functions, one the normal mmu_need_write_protect
(but renamed) and a new mmu_need_write_protect(), with locked and
unlocked variants, calling the old one.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock
  2012-03-29  9:25 ` [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock Xiao Guangrong
@ 2012-04-01 15:53   ` Avi Kivity
  2012-04-05 18:25     ` Xiao Guangrong
  0 siblings, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-04-01 15:53 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 11:25 AM, Xiao Guangrong wrote:
> It depends on PTE_LIST_WRITE_PROTECT bit in rmap which let us quickly know
> whether the page is writable out of mmu-lock
>
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c         |   17 +++++++++++++----
>  arch/x86/kvm/paging_tmpl.h |    2 +-
>  2 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 3887a07..c029185 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1148,6 +1148,12 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
>
>  	*rmapp |= PTE_LIST_WRITE_PROTECT;
>
> +	/*
> +	 * Setting PTE_LIST_WRITE_PROTECT bit before doing page
> +	 * write-protect.
> +	 */
> +	smp_mb();
> +

Only a wmb is needed here.

Would it be better to store this bit in all the sptes instead?  We're
touching them in any case.  More work to clear them, but
un-write-protecting a page is beneficial anyway as it can save a fault.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 11/13] KVM: MMU: fast path of handling guest page fault
  2012-03-29  9:27 ` [PATCH 11/13] KVM: MMU: fast path of handling guest page fault Xiao Guangrong
  2012-03-31 12:24   ` Xiao Guangrong
@ 2012-04-01 16:23   ` Avi Kivity
  2012-04-03 13:04     ` Avi Kivity
  2012-04-05 19:39     ` Xiao Guangrong
  1 sibling, 2 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-01 16:23 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 11:27 AM, Xiao Guangrong wrote:
> If the the present bit of page fault error code is set, it indicates
> the shadow page is populated on all levels, it means what we do is only
> modify the access bit which can be done out of mmu-lock
>
> The tricks in this patch is avoiding the race between fast page fault
> path and write-protect path, write-protect path is a read-check-modify
> path:
> read spte, check W bit, then clear W bit. What we do is populating a
> identification in spte, if write-protect meets it, it modify the spte
> even if the spte is readonly. See the comment in the code to get more
> information
>
> +
> +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, gfn_t gfn,
> +				   u32 error_code)
> +{
> +	unsigned long *rmap;
> +	bool write = error_code & PFERR_WRITE_MASK;
> +
> +	/*
> +	 * #PF can be fast only if the shadow page table is present, that
> +	 * means we just need change the access bits (e.g: R/W, U/S...)
> +	 * which can be done out of mmu-lock.
> +	 */
> +	if (!(error_code & PFERR_PRESENT_MASK))
> +		return false;
> +
> +	if (unlikely(vcpu->vcpu_id > max_vcpu_spte()))
> +		return false;
> +
> +	rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);

What prevents sp->gfns from being freed while this is going on?  Did I
miss the patch that makes mmu pages freed by rcu?  Also need barriers in
kvm_mmu_get_page() to make sure sp->gfns is made visible before the page
is hashed in.

+ smp_rmb()

> +
> +	/* Quickly check the page can be writable. */
> +	if (write && (ACCESS_ONCE(*rmap) & PTE_LIST_WRITE_PROTECT))
> +		return false;
> +
> +	return true;
> +}
> +
> +typedef bool (*fast_pf_fetch_spte)(struct kvm_vcpu *vcpu, u64 *sptep,
> +				   u64 *new_spte, gfn_t gfn, u32 expect_access,
> +				   u64 spte);
> +
> +static bool
> +fast_pf_fetch_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 *new_spte,
> +			  gfn_t gfn, u32 expect_access, u64 spte)
> +{
> +	struct kvm_mmu_page *sp = page_header(__pa(sptep));
> +

Why is this stable?  Another cpu could have removed it.

>
> +	WARN_ON(!sp->role.direct);
> +
> +	if (kvm_mmu_page_get_gfn(sp, sptep - sp->spt) != gfn)
> +		return false;

Simpler to check if the page is sync; if it is, gfn has not changed.

(but the check for a sync page is itself unstable).

>
> +
> +	set_spte(vcpu, new_spte, sp->role.access,
> +		 expect_access & ACC_USER_MASK, expect_access & ACC_WRITE_MASK,
> +		 sp->role.level, gfn, spte_to_pfn(spte), false, false,
> +		 spte & SPTE_HOST_WRITEABLE, true);
> +

What if the conditions have changed meanwhile?  Should we set the spte
atomically via set_spte?  For example an mmu notifier clearing the spte
in parallel.

What if another code path has read *spte, and did not re-read it because
it holds the mmu lock and thinks only the guest can access it?

>
> +	return true;
> +}
> +
> +static bool
> +fast_page_fault_fix_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte,
> +			 gfn_t gfn, u32 expect_access,
> +			 fast_pf_fetch_spte fn)
> +{
> +	u64 new_spte = 0ull;
> +	int vcpu_id = vcpu->vcpu_id;
> +
> +	spte = mark_vcpu_id_spte(sptep, spte, vcpu_id);

What's the point of reading spte?  It can change any minute, so the
value is stale and we can't use it.  We have to read and lock it
atomically, no?

Your patches read the spte, then set the identity.  In between some
other path can change the spte.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 11/13] KVM: MMU: fast path of handling guest page fault
  2012-04-01 16:23   ` Avi Kivity
@ 2012-04-03 13:04     ` Avi Kivity
  2012-04-05 19:39     ` Xiao Guangrong
  1 sibling, 0 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-03 13:04 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 04/01/2012 07:23 PM, Avi Kivity wrote:
> On 03/29/2012 11:27 AM, Xiao Guangrong wrote:
> > If the the present bit of page fault error code is set, it indicates
> > the shadow page is populated on all levels, it means what we do is only
> > modify the access bit which can be done out of mmu-lock
> >
> > The tricks in this patch is avoiding the race between fast page fault
> > path and write-protect path, write-protect path is a read-check-modify
> > path:
> > read spte, check W bit, then clear W bit. What we do is populating a
> > identification in spte, if write-protect meets it, it modify the spte
> > even if the spte is readonly. See the comment in the code to get more
> > information
> >
> > +
> > +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, gfn_t gfn,
> > +				   u32 error_code)
> > +{
> > +	unsigned long *rmap;
> > +	bool write = error_code & PFERR_WRITE_MASK;
> > +
> > +	/*
> > +	 * #PF can be fast only if the shadow page table is present, that
> > +	 * means we just need change the access bits (e.g: R/W, U/S...)
> > +	 * which can be done out of mmu-lock.
> > +	 */
> > +	if (!(error_code & PFERR_PRESENT_MASK))
> > +		return false;
> > +
> > +	if (unlikely(vcpu->vcpu_id > max_vcpu_spte()))
> > +		return false;
> > +
> > +	rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);
>
> What prevents sp->gfns from being freed while this is going on?  Did I
> miss the patch that makes mmu pages freed by rcu?  Also need barriers in
> kvm_mmu_get_page() to make sure sp->gfns is made visible before the page
> is hashed in.
>

Ah, it's call_rcu in kvm_mmu_commit_zap_page().

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 08/13] KVM: MMU: fask check whether page is writable
  2012-04-01 15:52   ` Avi Kivity
@ 2012-04-05 17:54     ` Xiao Guangrong
  2012-04-12 23:08       ` Marcelo Tosatti
  0 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-05 17:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Xiao Guangrong, Marcelo Tosatti, LKML, KVM

Hi Avi,

Thanks very much for your review!

Sorry for the delayed reply; I was on vacation.

On 04/01/2012 11:52 PM, Avi Kivity wrote:

> On 03/29/2012 11:25 AM, Xiao Guangrong wrote:
>> Using PTE_LIST_WRITE_PROTECT bit in rmap store the write-protect status to
>> avoid unnecessary shadow page walking
>>
>> Also if no shadow page is indirect, the page is write-free
>>
>>
>> @@ -2262,6 +2291,9 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
>>  	}
>>  	if (need_unsync)
>>  		kvm_unsync_pages(vcpu, gfn);
>> +
>> +	*rmap &= ~PTE_LIST_WRITE_PROTECT;
>> +
>>
> 
> So what are the rules for PTE_LIST_WRITE_PROTECT?  Is is a cache for the
> mmu_need_write_protect?
> 
> I'd like to understand it, I guess it can be set while write protection
> is unneeded, and cleared on the next check?
> 


Yes, it is used as a cache for mmu_need_write_protect.

When the gfn is protected by a sync sp or a read-only host page we set this bit,
and it is cleared when the sp becomes unsync or the host page becomes writable.

> Maybe split into two functions, one the normal mmu_need_write_protect
> (but renamed) and a new mmu_need_write_protect(), with locked and
> unlocked variants, calling the old one.
> 


Okay, I will split it by introducing a new function named mmu_unsync_gfn_sp,
which checks whether the sp can become unsync and unsyncs the sp if that is
allowed, under the protection of mmu-lock.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock
  2012-04-01 15:53   ` Avi Kivity
@ 2012-04-05 18:25     ` Xiao Guangrong
  2012-04-09 12:28       ` Avi Kivity
  0 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-05 18:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Xiao Guangrong, Marcelo Tosatti, LKML, KVM

On 04/01/2012 11:53 PM, Avi Kivity wrote:

> On 03/29/2012 11:25 AM, Xiao Guangrong wrote:
>> It depends on PTE_LIST_WRITE_PROTECT bit in rmap which let us quickly know
>> whether the page is writable out of mmu-lock
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>> ---
>>  arch/x86/kvm/mmu.c         |   17 +++++++++++++----
>>  arch/x86/kvm/paging_tmpl.h |    2 +-
>>  2 files changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 3887a07..c029185 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1148,6 +1148,12 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
>>
>>  	*rmapp |= PTE_LIST_WRITE_PROTECT;
>>
>> +	/*
>> +	 * Setting PTE_LIST_WRITE_PROTECT bit before doing page
>> +	 * write-protect.
>> +	 */
>> +	smp_mb();
>> +
> 
> wmb only needed.
> 


We should ensure that this bit is set before reading the spte; it cooperates with
the fast page fault path to avoid this case:

On fast page fault path:                    On rmap_write_protect path:
                                            read spte: old_spte = *spte
                                       (reading spte is reordered to the front of
                                        setting PTE_LIST_WRITE_PROTECT bit)
set spte.identification
   smp_mb
if (!rmap.PTE_LIST_WRITE_PROTECT)
                                            set rmap.PTE_LIST_WRITE_PROTECT
    cmpxchg(sptep, spte, spte | WRITABLE)
                                            see old_spte.identification is not set,
                                            so it does not write-protect this page
                                                  OOPS!!!

> Would it be better to store this bit in all the sptes instead?  We're
> touching them in any case.  More work to clear them, but
> un-write-protecting a page is beneficial anyway as it can save a fault.
> 

There are two reasons:
- if we set this bit in rmap, we can do a quick check to see whether the page is
  writable before doing the shadow page walk.

- since a full barrier is needed, we would otherwise have to use smp_mb for
  every spte, like this:

  while ((spte = rmap_next(rmapp, spte))) {
	read spte
        smp_mb
        write-protect spte
  }

  smp_mb would be called inside the loop, and I think that is not good, yes?
  (A rough sketch of the intended barrier placement is at the end of this mail.)

If you just want to save the fault, we can let all sptes be writable in
mmu_need_write_protect, but we should cache the gpte access bits into the spte
first. That should be another patchset, I think. :)
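
Just to make the intended pairing concrete, here is a minimal user-space-style
sketch. C11 seq_cst atomics stand in for the kernel's smp_mb(); the bit
positions, helper names and signatures are only illustrative, not the ones in
the patches:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_LIST_WRITE_PROTECT 1UL           /* illustrative rmap-head bit */
#define SPTE_W                 (1ULL << 1)   /* illustrative writable bit  */
#define SPTE_ID_MASK           (7ULL << 61)  /* illustrative id bits       */

/* write-protect side: one full barrier covers the whole rmap chain */
static void write_protect_gfn(_Atomic unsigned long *rmapp,
			      _Atomic uint64_t *sptes[], int nr)
{
	atomic_fetch_or(rmapp, PTE_LIST_WRITE_PROTECT);
	/* the seq_cst RMW above plus the seq_cst loads below stand in
	 * for the single smp_mb() before the loop */
	for (int i = 0; i < nr; i++) {
		uint64_t spte = atomic_load(sptes[i]);
		if (spte & (SPTE_W | SPTE_ID_MASK))
			atomic_store(sptes[i],
				     spte & ~(SPTE_W | SPTE_ID_MASK));
	}
}

/* fast page fault side: set the identification, then check the rmap bit */
static bool fast_pf_try_writable(_Atomic unsigned long *rmapp,
				 _Atomic uint64_t *sptep, uint64_t id)
{
	uint64_t old = atomic_fetch_or(sptep, id);   /* set identification */

	if (atomic_load(rmapp) & PTE_LIST_WRITE_PROTECT)
		return false;                        /* fall back to mmu-lock */

	old |= id;
	return atomic_compare_exchange_strong(sptep, &old, old | SPTE_W);
}

Because each side does its store (rmap bit / identification) before loading the
other side's location, at least one of them is guaranteed to observe the other's
store; that is exactly the pairing the smp_mb() provides.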


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 11/13] KVM: MMU: fast path of handling guest page fault
  2012-04-01 16:23   ` Avi Kivity
  2012-04-03 13:04     ` Avi Kivity
@ 2012-04-05 19:39     ` Xiao Guangrong
  1 sibling, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-05 19:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Xiao Guangrong, Marcelo Tosatti, LKML, KVM

On 04/02/2012 12:23 AM, Avi Kivity wrote:

> On 03/29/2012 11:27 AM, Xiao Guangrong wrote:
>> If the the present bit of page fault error code is set, it indicates
>> the shadow page is populated on all levels, it means what we do is only
>> modify the access bit which can be done out of mmu-lock
>>
>> The tricks in this patch is avoiding the race between fast page fault
>> path and write-protect path, write-protect path is a read-check-modify
>> path:
>> read spte, check W bit, then clear W bit. What we do is populating a
>> identification in spte, if write-protect meets it, it modify the spte
>> even if the spte is readonly. See the comment in the code to get more
>> information
>>
>> +
>> +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, gfn_t gfn,
>> +				   u32 error_code)
>> +{
>> +	unsigned long *rmap;
>> +	bool write = error_code & PFERR_WRITE_MASK;
>> +
>> +	/*
>> +	 * #PF can be fast only if the shadow page table is present, that
>> +	 * means we just need change the access bits (e.g: R/W, U/S...)
>> +	 * which can be done out of mmu-lock.
>> +	 */
>> +	if (!(error_code & PFERR_PRESENT_MASK))
>> +		return false;
>> +
>> +	if (unlikely(vcpu->vcpu_id > max_vcpu_spte()))
>> +		return false;
>> +
>> +	rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);
> 
> What prevents sp->gfns from being freed while this is going on?  Did I
> miss the patch that makes mmu pages freed by rcu?  


sp->gfns is not used here; what we are using is the rmap, which comes from the
memslot and is protected by srcu.

In this patch, we call kvm_mmu_page_get_gfn in fast_pf_fetch_direct_spte, which
is only used for direct sps, so sp->gfns is not touched in this case either.

But I want to use it in the next version, and then it should be freed in an RCU
context, see:
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/88934

> Also need barriers in
> kvm_mmu_get_page() to make sure sp->gfns is made visible before the page
> is hashed in.
> 


We do not walk the hash tables of shadow pages; we walk the shadow page table
that is currently being used by the vcpu, so we only need to care about the
order of setting the spte and setting sp->gfns, and that has a data dependence
in the current code:

set spte
if (is_shadow_present_pte(*sptep))
   set sp->gfns[]

setting sp->gfns can not be reordered ahead of setting the spte


> + smp_rmb()
> 


Actually, I do not want to use a barrier here since a read barrier is not a
compiler barrier on X86. And we only quickly check whether the page can be
writable before the shadow page table walk; a real barrier is used before
updating the spte.

>> +
>> +	/* Quickly check the page can be writable. */
>> +	if (write && (ACCESS_ONCE(*rmap) & PTE_LIST_WRITE_PROTECT))
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +typedef bool (*fast_pf_fetch_spte)(struct kvm_vcpu *vcpu, u64 *sptep,
>> +				   u64 *new_spte, gfn_t gfn, u32 expect_access,
>> +				   u64 spte);
>> +
>> +static bool
>> +fast_pf_fetch_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 *new_spte,
>> +			  gfn_t gfn, u32 expect_access, u64 spte)
>> +{
>> +	struct kvm_mmu_page *sp = page_header(__pa(sptep));
>> +
> 
> Why is this stable?  Another cpu could have removed it.
> 


It is ok for a removed sp since it is not freed on this path. And if the
sp is removed, cmpxchg can see that change since the spte is zapped.

>>
>> +	WARN_ON(!sp->role.direct);
>> +
>> +	if (kvm_mmu_page_get_gfn(sp, sptep - sp->spt) != gfn)
>> +		return false;
> 
> Simpler to check if the page is sync; if it is, gfn has not changed.
> 
> (but the check for a sync page is itself unstable).
> 


Checking sp.sync is unsafe since there is a window between guest page write
emulation and updating the spte:

At the Beginning:
gpte = gfn1
spte = gfn1's pfn
(gfn1 is write-protect, gfn2 is write-free)

VCPU 1                          VCPU 2

modify gpte and let it point to gfn2
page fault
write-emulation: gpte = gfn2

                             page fault on gpte:
                                 gfn = gfn2
                             fast page fault:
                                 see gfn2 is write-free and update spte:
                             spte = gfn1's pfn + W
                                 OOPS!!!
zap spte

>>
>> +
>> +	set_spte(vcpu, new_spte, sp->role.access,
>> +		 expect_access & ACC_USER_MASK, expect_access & ACC_WRITE_MASK,
>> +		 sp->role.level, gfn, spte_to_pfn(spte), false, false,
>> +		 spte & SPTE_HOST_WRITEABLE, true);
>> +
> 
> What if the conditions have changed meanwhile?  Should we set the spte
> atomically via set_spte?  For example an mmu notifier clearing the spte
> in parallel.
> 


All of these cases are ok; we do not directly update the spte, it is updated
like this:

new_spte = 0x0ull /* set it to nonpresent, so rmap path can not be touched. */
set_spte(vcpu, &new_spte, ......) /* the expected spte is temporarily saved to new_spte.*/
cmpxchg(spte, spte_old, new_spte)

It depends on cmpxchg.
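
In other words, the shape of the update is roughly the following (a user-space
C sketch; set_spte() is only modelled here as "compute the desired value on the
side", the real function of course does much more):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Build the desired spte value on the side, then publish it only if the
 * spte has not changed since it was read.  Any concurrent change - an mmu
 * notifier zapping it, another vcpu fixing it, write-protect clearing the
 * W bit - makes the cmpxchg fail.
 */
static bool publish_spte(_Atomic uint64_t *sptep, uint64_t old_spte,
			 uint64_t new_spte)
{
	return atomic_compare_exchange_strong(sptep, &old_spte, new_spte);
}

If publish_spte() returns false, nothing has been written and the fault is
simply retried or handed to the slow path under mmu-lock.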

> What if another code path has read *spte, and did not re-read it because
> it holds the mmu lock and thinks only the guest can access it?
> 


We just change access bits here and the mapped pfn is not changed. All the read
paths are ok except the page write-protect path, since its work depends on the
W bit.

That is why we introduce spte.identification.

>>
>> +	return true;
>> +}
>> +
>> +static bool
>> +fast_page_fault_fix_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte,
>> +			 gfn_t gfn, u32 expect_access,
>> +			 fast_pf_fetch_spte fn)
>> +{
>> +	u64 new_spte = 0ull;
>> +	int vcpu_id = vcpu->vcpu_id;
>> +
>> +	spte = mark_vcpu_id_spte(sptep, spte, vcpu_id);
> 
> What's the point of reading spte?  It can change any minute, so the
> value is stale and we can't use it.  We have to read and lock it
> atomically, no?
> 


No. :)

> Your patches read the spte, then set the identity.  In between some
> other path can change the spte.
> 


It can set the new identity (vcpu id) whatever the current identity is; cmpxchg
can see the change, because it updates the spte only when the identity matches.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-01 12:58         ` Avi Kivity
@ 2012-04-05 21:57           ` Xiao Guangrong
  2012-04-06  5:24             ` Xiao Guangrong
  0 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-05 21:57 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Xiao Guangrong, Marcelo Tosatti, LKML, KVM

On 04/01/2012 08:58 PM, Avi Kivity wrote:

> On 03/30/2012 12:18 PM, Xiao Guangrong wrote:
>> On 03/29/2012 08:57 PM, Avi Kivity wrote:
>>
>>> On 03/29/2012 01:40 PM, Xiao Guangrong wrote:
>>>>>> * Implementation
>>>>>> We can freely walk the page between walk_shadow_page_lockless_begin and
>>>>>> walk_shadow_page_lockless_end, it can ensure all the shadow page is valid.
>>>>>>
>>>>>> In the most case, cmpxchg is fair enough to change the access bit of spte,
>>>>>> but the write-protect path on softmmu/nested mmu is a especial case: it is
>>>>>> a read-check-modify path: read spte, check W bit, then clear W bit.
>>>>>
>>>>> We also set gpte.D and gpte.A, no? How do you handle that?
>>>>>
>>>>
>>>>
>>>> We still need walk gust page table before fast page fault to check
>>>> whether the access is valid.
>>>
>>> Ok.  But that's outside the mmu lock.
>>>
>>> We can also skip that: if !sp->unsync, copy access rights and gpte.A/D
>>> into spare bits in the spte, and use that to check.
>>>
>>
>>
>> Great!
>>
>> gpte.A need not be copied into spte since EFEC.P = 1 means the shadow page
>> table is present, gpte.A must be set in this case.
>>
>> And, we do not need to cache gpte access rights into spte, instead of it,
>> we can atomicly read gpte to get these information (anyway, we need read gpte
>> to get the gfn.)
> 
> Why do we need the gfn?  If !sp->unsync then it is unchanged (it's also
> in sp->gfns).
> 
> Oh, needed for mark_page_dirty().
> 
> Caching gpte access in the spte will save decoding the access buts and
> sp->role.access.
> 


Yes, we can directly get the gfn from sp->gfns (it is also safe for an unsync
sp, I think), but we want to have a quick check (whether the gfn can be
writable) before the shadow page table walk.

I mentioned it in the previous reply, see:
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/88934

What is your comment?

>>
>> I will do it in the next version.
>>
>>>>
>>>>>>  In order
>>>>>> to avoid marking spte writable after/during page write-protect, we do the
>>>>>> trick like below:
>>>>>>
>>>>>>       fast page fault path:
>>>>>>             lock RCU
>>>>>>             set identification in the spte
>>>>>
>>>>> What if you can't (already taken)?  Spin?  Slow path?
>>>>
>>>>
>>>> In this patch, it allows to concurrently access on the same spte:
>>>> it freely set its identification on the spte, because i did not
>>>> want to introduce other atomic operations.
>>>>
>>>> You remind me that there may be a risk: if many vcpu fault on the
>>>> same spte, it will retry the spte forever. Hmm, how about fix it
>>>> like this:
>>>>
>>>> if ( spte.identification = 0) {
>>>> 	set spte.identification = vcpu.id
>>>> 	goto cmpxchg-path
>>>> }
>>>>
>>>> if (spte.identification == vcpu.id)
>>>> 	goto cmpxchg-path
>>>>
>>>> return to guest and retry the address again;
>>>>
>>>> cmpxchg-path:
>>>> 	do checks and cmpxchg
>>>>
>>>> It can ensure the spte can be updated.
>>>
>>> What's the difference between this and
>>>
>>>
>>>   if test_and_set_bit(spte.lock)
>>>        return_to_guest
>>>   else
>>>        do checks and cmpxchg
>>>
>>> ?
>>>
>>
>>
>> test_and_set_bit is a atomic operation that is i want to avoid.
> 
> Right.  Can you check what the effect is (with say
> test_and_set_bit(spte, 0) in the same path).
> 
> I'm not convinced it's significant.
> 


Okay, if you prefer test_and_set_bit, we may introduce two bits,
fast_pf and write_protect, and do things like this:

on fast page fault path:

	old_spte = *spte

	if (test_and_set_bit(spte.fast_pf))
		return_to_guest

	old_spte |= fast_pf

	if (old_spte.wp)
		clear-fast-pf bit and return_to_guest

	if (!rmap.PTE_LIST_WRITE_PROTECT)
		cmpxchg(spte, old_spte, old_spte +w - fast_pf)
	else
		clear-fast-pf bit

on page write-protect path:
	lock mmu-lock
	set rmap.PTE_LIST_WRITE_PROTECT
        smp_mb()
        if (spte.w || spte.fast_pf)
                spte -w -fast_pf + write_protect

        unlock mmu-lock

I think it works, but it can not make the page fault fast for an unsync sp
(e.g: two sptes point to the gfn, and a page fault on one of the sptes makes
the gfn's sp become unsync). That is not too bad, since we can cache the
access bits in the spte and properly make all sptes become writable on the
mmu_need_write_protect path in a new patchset.

>>
>>>>
>>>>>> The identification should be unique to avoid the below race:
>>>>>>
>>>>>>      VCPU 0                VCPU 1            VCPU 2
>>>>>>       lock RCU
>>>>>>    spte + identification
>>>>>>    check conditions
>>>>>>                        do write-protect, clear
>>>>>>                           identification
>>>>>>                                               lock RCU
>>>>>>                                         set identification
>>>>>>      cmpxchg + w - identification
>>>>>>         OOPS!!!
>>>>>
>>>>> Is it not sufficient to use just two bits?
>>>>>
>>>>> pf_lock - taken by page fault path
>>>>> wp_lock - taken by write protect path
>>>>>
>>>>> pf cmpxchg checks both bits.
>>>>>
>>>>
>>>>
>>>> If we just use two byte as identification, it has to use atomic
>>>> operations to maintain these bits? or i misunderstood?
>>>
>>> Ah, you're using byte writes to set the identification? remember I
>>> didn't read the patches yet.
>>>
>>
>>
>> Yes.
>>
>>> Please check the optimization manual, I remember there are warnings
>>> there about using different sized accesses for the same location,
>>> particularly when locked ops are involved.
>>
>>
>> "using different sized accesses for the same location" is OK i guess,
>> anyway, we can have the data struct of this style:
>>
>> union {
>> 	struct	{
>> 		char byte1,
>> 		char byte2,
>> 		char byte3,
>> 		char byte4
>>       	} foo;
>>
>> 	int bar;
>> }
>>
>> Access .bar and .foo.byte1 is allowed.
>>
>> And for the LOCK operation, i find these descriptions in the manual:
>>
>> "Locked operations are atomic with respect to all other memory operations and all
>> externally visible events"
>>
>> And there is a warning:
>> Software should access semaphores (shared memory used for signalling between
>> multiple processors) using identical addresses and operand lengths. For example, if
>> one processor accesses a semaphore using a word access, other processors should
>> not access the semaphore using a byte access.
>>
>> It is the warning you remember?
> 
> No, it should be safe, but performance is degraded if you mix loads and
> stores of different sizes to the same location.
> 
> See for example "2.1.5.2 L1 DCache", Store Forwarding section in the
> optimization guide.
> 


Oh, got it, that is new to me.

> 
>>
>> In our case, we do not treat it as "semaphores": the "byte writes" does not
>> need be atomic.
>>
>>> Maybe bitops are cheaper. 
>>> If the cost is similar, they're at least more similar to the rest of the
>>> kernel.
>>>
>>
>>
>> The "bitops" you mentioned is the atomic-style bitops? It is cheaper that
>> byte writes?
> 
> They're not cheaper in general but may be compared to mixing byte writes
> and word loads.
> 


Maybe you are right.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-05 21:57           ` Xiao Guangrong
@ 2012-04-06  5:24             ` Xiao Guangrong
  2012-04-09 13:20               ` Avi Kivity
  0 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-06  5:24 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

On 04/06/2012 05:57 AM, Xiao Guangrong wrote:


>>>>
>>>> What's the difference between this and
>>>>
>>>>
>>>>   if test_and_set_bit(spte.lock)
>>>>        return_to_guest
>>>>   else
>>>>        do checks and cmpxchg
>>>>
>>>> ?
>>>>
>>>
>>>
>>> test_and_set_bit is a atomic operation that is i want to avoid.
>>
>> Right.  Can you check what the effect is (with say
>> test_and_set_bit(spte, 0) in the same path).
>>
>> I'm not convinced it's significant.
>>
> 
> 
> Okay, if you prefer to test_and_set_bit, we may introduce two bits:
> fast_pf and write_protect, do things like this:
> 
> on fast page fault path:
> 
> 	old_spte = *spte
> 
> 	if (test_and_set_bit(spte.fast_pf))
> 		return_to_guest
> 
> 	old_spte |= fast_pf
> 
> 	if (old_spte.wp)
> 		clear-fast-pf bit and return_to_guest
> 
> 	if (!rmap.PTE_LIST_WRITE_PROTECT)
> 		cmpxchg(spte, old_spte, old_spte +w - fast_pf)
> 	else
> 		clear-fast-pf bit
> 
> on page write-protect path:
> 	lock mmu-lock
> 	set rmap.PTE_LIST_WRITE_PROTECT
>         smp_mb()
>         if (spte.w || spte.fast_pf)
>                 spte -w -fast_pf + write_protect
> 
>         unlock mmu-lock
> 
> I think it works but can not make page fault fast for unsync sp (
> e.g: two sptes point to the gfn, a page fault on one of the spte
> make gfn's sp become unsync.), it is not too bad since we can cache
> access bits in spte and properly make all sptes become writable on
> mmu_need_write_protect path in a new patchset.
> 


Foolish me, I must be crazy. Sorry for my mistake. :(

Unfortunately, it can not work: we can not get a stable gfn from the gpte or
sp->gfns[]. For example:

beginning:
Gpte = Gfn1
gfn_to_pfn(Gfn1) = Pfn
Spte = Pfn
Gfn1 is write-free
Gfn2 is write-protected


VCPU 0                              VCPU 1                     VCPU 2

fault on gpte
fast page fault path:
  set Spte.fast_pf
  get Gfn1 from Gpte/sp->gfns[]
  if (Gfn1 is writable)
                                Pfn is swapped out:
					Spte = 0
				Gpte is modified to Gfn2,
                                and Pfn is realloced and remapped
                                to Gfn2, so:
                                        Spte = Pfn

                                                          fast page fault path:
                                                             set Spte.fast_pf

         cmpxchg  Spte+w
            OOPS!!!
  <we see Spte is not changed and
   happily make it writable, so gfn2 can be writable>

It seems only a unique identification can prevent this. :(


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock
  2012-04-05 18:25     ` Xiao Guangrong
@ 2012-04-09 12:28       ` Avi Kivity
  2012-04-09 13:16         ` Takuya Yoshikawa
  0 siblings, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-04-09 12:28 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Xiao Guangrong, Marcelo Tosatti, LKML, KVM

On 04/05/2012 09:25 PM, Xiao Guangrong wrote:
> On 04/01/2012 11:53 PM, Avi Kivity wrote:
>
> > On 03/29/2012 11:25 AM, Xiao Guangrong wrote:
> >> It depends on PTE_LIST_WRITE_PROTECT bit in rmap which let us quickly know
> >> whether the page is writable out of mmu-lock
> >>
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> >> ---
> >>  arch/x86/kvm/mmu.c         |   17 +++++++++++++----
> >>  arch/x86/kvm/paging_tmpl.h |    2 +-
> >>  2 files changed, 14 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 3887a07..c029185 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -1148,6 +1148,12 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
> >>
> >>  	*rmapp |= PTE_LIST_WRITE_PROTECT;
> >>
> >> +	/*
> >> +	 * Setting PTE_LIST_WRITE_PROTECT bit before doing page
> >> +	 * write-protect.
> >> +	 */
> >> +	smp_mb();
> >> +
> > 
> > wmb only needed.
> > 
>
>
> We should ensure setting this bit before reading spte, it cooperates with
> fast page fault path to avoid this case:
>
> On fast page fault path:                    On rmap_write_protect path:
>                                             read spte: old_spte = *spte
>                                        (reading spte is reordered to the front of
>                                         setting PTE_LIST_WRITE_PROTECT bit)
> set spte.identification
>    smp_mb
> if (!rmap.PTE_LIST_WRITE_PROTECT)
>                                             set rmap.PTE_LIST_WRITE_PROTECT
>     cmpxchg(sptep, spte, spte | WRITABLE)
>                                             see old_spte.identification is not set,
>                                             so it does not write-protect this page
>                                                   OOPS!!!

Ah, so it's protecting two paths at the same time: rmap.write_protect ->
fast page fault, and lock(sptep) -> write protect.

The whole thing needs to be documented very carefully in locking.txt,
otherwise mmu.c will be write-protected to everyone except you.

> > Would it be better to store this bit in all the sptes instead?  We're
> > touching them in any case.  More work to clear them, but
> > un-write-protecting a page is beneficial anyway as it can save a fault.
> > 
>
> There are two reasons:
> - if we set this bit in rmap, we can do the quickly check to see the page is
>   writble before doing shadow page walking.
>
> - since a full barrier is needed, we should use smp_mb for every spte like this:
>
>   while ((spte = rmap_next(rmapp, spte))) {
> 	read spte
>         smp_mb
>         write-protect spte
>   }
>
>   smp_mb is called in the loop, i think it is not good, yes?

Yes, agree.

> If you just want to save the fault, we can let all spte to be writeable in
> mmu_need_write_protect, but we should cache gpte access bits into spte firstly.
> It should be another patchset i think. :)

Yes.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
                   ` (13 preceding siblings ...)
  2012-03-29 10:18 ` [PATCH 00/13] KVM: MMU: fast page fault Avi Kivity
@ 2012-04-09 13:12 ` Avi Kivity
  2012-04-09 13:55   ` Xiao Guangrong
  2012-04-09 17:58   ` Marcelo Tosatti
  14 siblings, 2 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-09 13:12 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> * Idea
> The present bit of page fault error code (EFEC.P) indicates whether the
> page table is populated on all levels, if this bit is set, we can know
> the page fault is caused by the page-protection bits (e.g. W/R bit) or
> the reserved bits.
>
> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
> simply fixed: the page fault caused by reserved bit
> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
> is just increasing the corresponding access on the spte.
>
> This pachset introduces a fast path to fix this kind of page fault: it
> is out of mmu-lock and need not walk host page table to get the mapping
> from gfn to pfn.
>
>

This patchset is really worrying to me.

It introduces a lot of concurrency into data structures that were not
designed for it.  Even if it is correct, it will be very hard to
convince ourselves that it is correct, and if it isn't, to debug those
subtle bugs.  It will also be much harder to maintain the mmu code than
it is now.

There are a lot of things to check.  Just as an example, we need to be
sure that if we use rcu_dereference() twice in the same code path, that
any inconsistencies due to a write in between are benign.  Doing that is
a huge task.

But I appreciate the performance improvement and would like to see a
simpler version make it in.  This needs to reduce the amount of data
touched in the fast path so it is easier to validate, and perhaps reduce
the number of cases that the fast path works on.

I would like to see the fast path as simple as

  rcu_read_lock();

  (lockless shadow walk)
  spte = ACCESS_ONCE(*sptep);

  if (!(spte & PT_MAY_ALLOW_WRITES))
        goto slow;

  gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
  mark_page_dirty(kvm, gfn);

  new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
  if (cmpxchg(sptep, spte, new_spte) != spte)
       goto slow;

  rcu_read_unlock();
  return;

slow:
  rcu_read_unlock();
  slow_path();

It now becomes the responsibility of the slow path to maintain *sptep &
PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
can be as simple as a clear_bit() before we update sp->gfns[] or if we
add host write protection.

Sorry, it's too complicated for me.  Marcelo, what's your take?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock
  2012-04-09 12:28       ` Avi Kivity
@ 2012-04-09 13:16         ` Takuya Yoshikawa
  2012-04-09 13:21           ` Avi Kivity
  0 siblings, 1 reply; 92+ messages in thread
From: Takuya Yoshikawa @ 2012-04-09 13:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Xiao Guangrong, Xiao Guangrong, Marcelo Tosatti, LKML, KVM

On Mon, 09 Apr 2012 15:28:47 +0300
Avi Kivity <avi@redhat.com> wrote:

> Ah, so it's protecting two paths at the same time: rmap.write_protect ->
> fast page fault, and lock(sptep) -> write protect.
> 
> The whole thing needs to be documented very carefully in locking.txt,
> otherwise mmu.c will be write-protected to everyone except you.

Before changing the mmu_lock usage, it would be better to describe it in
locking.txt.

I guess from mmu.c's git history that it is already restricted to the few
people who can describe that.

	Takuya

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-06  5:24             ` Xiao Guangrong
@ 2012-04-09 13:20               ` Avi Kivity
  2012-04-09 13:59                 ` Xiao Guangrong
  0 siblings, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-04-09 13:20 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Xiao Guangrong, Marcelo Tosatti, LKML, KVM

On 04/06/2012 08:24 AM, Xiao Guangrong wrote:
>
> Foolish me, i should be crazy. Sorry for my mistake. :(
>
> Unfortunately, it can not work, we can not get a stable gfn from gpte or
> sp->gfns[]. For example:
>
> beginning:
> Gpte = Gfn1
> gfn_to_pfn(Gfn1) = Pfn
> Spte = Pfn
> Gfn1 is write-free
> Gfn2 is write-protected
>
>
> VCPU 0                              VCPU 1                     VCPU 2
>
> fault on gpte
> fast page fault path:
>   set Spte.fast_pf
>   get Gfn1 from Gpte/sp->gfns[]
>   if (Gfn1 is writable)
>                                 Pfn is swapped out:
> 					Spte = 0
> 				Gpte is modified to Gfn2,
>                                 and Pfn is realloced and remapped
>                                 to Gfn2, so:
>                                         Spte = Pfn
>
>                                                           fast page fault path:
>                                                              set Spte.fast_pf
>
>          cmpxchg  Spte+w
>             OOPS!!!
>   <we see Spte is not changed and
>    happily make it writable, so gfn2 can be writable>
>
> It seems only a unique identification can prevent this. :(
>

Ouch.

What about restricting this to role.direct=1?  Then gfn is stable?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock
  2012-04-09 13:16         ` Takuya Yoshikawa
@ 2012-04-09 13:21           ` Avi Kivity
  0 siblings, 0 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-09 13:21 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Xiao Guangrong, Xiao Guangrong, Marcelo Tosatti, LKML, KVM

On 04/09/2012 04:16 PM, Takuya Yoshikawa wrote:
> On Mon, 09 Apr 2012 15:28:47 +0300
> Avi Kivity <avi@redhat.com> wrote:
>
> > Ah, so it's protecting two paths at the same time: rmap.write_protect ->
> > fast page fault, and lock(sptep) -> write protect.
> > 
> > The whole thing needs to be documented very carefully in locking.txt,
> > otherwise mmu.c will be write-protected to everyone except you.
>
> Before changing mmu_lock usage, better describe it in locking.txt.
>
> I guess from mmu.c git history, it is already restricted to a few
> people who can describe that.

It's getting harder and harder, yes.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 13:12 ` Avi Kivity
@ 2012-04-09 13:55   ` Xiao Guangrong
  2012-04-09 14:01     ` Xiao Guangrong
  2012-04-09 14:25     ` Avi Kivity
  2012-04-09 17:58   ` Marcelo Tosatti
  1 sibling, 2 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-09 13:55 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 04/09/2012 09:12 PM, Avi Kivity wrote:

> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
>> * Idea
>> The present bit of page fault error code (EFEC.P) indicates whether the
>> page table is populated on all levels, if this bit is set, we can know
>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
>> the reserved bits.
>>
>> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
>> simply fixed: the page fault caused by reserved bit
>> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
>> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
>> is just increasing the corresponding access on the spte.
>>
>> This pachset introduces a fast path to fix this kind of page fault: it
>> is out of mmu-lock and need not walk host page table to get the mapping
>> from gfn to pfn.
>>
>>
> 
> This patchset is really worrying to me.
> 
> It introduces a lot of concurrency into data structures that were not
> designed for it.  Even if it is correct, it will be very hard to
> convince ourselves that it is correct, and if it isn't, to debug those
> subtle bugs.  It will also be much harder to maintain the mmu code than
> it is now.
> 
> There are a lot of things to check.  Just as an example, we need to be
> sure that if we use rcu_dereference() twice in the same code path, that
> any inconsistencies due to a write in between are benign.  Doing that is
> a huge task.
> 
> But I appreciate the performance improvement and would like to see a
> simpler version make it in.  This needs to reduce the amount of data
> touched in the fast path so it is easier to validate, and perhaps reduce
> the number of cases that the fast path works on.
> 
> I would like to see the fast path as simple as
> 
>   rcu_read_lock();
> 
>   (lockless shadow walk)
>   spte = ACCESS_ONCE(*sptep);
> 
>   if (!(spte & PT_MAY_ALLOW_WRITES))
>         goto slow;
> 
>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
>   mark_page_dirty(kvm, gfn);
> 
>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
>   if (cmpxchg(sptep, spte, new_spte) != spte)
>        goto slow;
> 
>   rcu_read_unlock();
>   return;
> 
> slow:
>   rcu_read_unlock();
>   slow_path();
> 
> It now becomes the responsibility of the slow path to maintain *sptep &
> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
> can be as simple as a clear_bit() before we update sp->gfns[] or if we
> add host write protection.
> 


Okay, let's simplify it as much as possible:

- let it only fix the page fault with PFEC.P == 1 && PFEC.W = 0, that means
  the unlocked set_spte path can be dropped

- let it just fix the page fault caused by dirty-log, that means we always
  skip sptes which are write-protected by shadow page protection.

Then, things should be fairly simple:

In the set_spte path, if the spte can be writable, we set the ALLOW_WRITE bit.
In rmap_write_protect:
  if (spte.PT_WRITABLE_MASK) {
	WARN_ON(!(spte & ALLOW_WRITE));
	spte &= ~PT_WRITABLE_MASK;
	spte |= WRITE_PROTECT;
  }

in fast page fault:

if (spte & PT_WRITABLE_MASK)
	return_go_guest;

if ((spte & ALLOW_WRITE) && !(spte & WRITE_PROTECT))
	cmpxchg spte + PT_WRITABLE_MASK

All the information we need comes from the spte itself; it is independent of the
other paths, and no barriers are needed. A rough sketch of the fast-path check
is below.
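
For concreteness, a minimal user-space-style sketch of the scheme (the bit
positions here are made up for the example; a real patch would pick them from
the spare spte bits):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTE_WRITABLE       (1ULL << 1)    /* hardware W bit          */
#define SPTE_ALLOW_WRITE    (1ULL << 58)   /* illustrative spare bit  */
#define SPTE_WRITE_PROTECT  (1ULL << 59)   /* illustrative spare bit  */

/* fast page fault: decide from the spte alone, publish with cmpxchg */
static bool fast_fix_write_fault(_Atomic uint64_t *sptep)
{
	uint64_t spte = atomic_load(sptep);

	if (spte & SPTE_WRITABLE)
		return true;    /* spurious fault, just return to the guest */

	if (!(spte & SPTE_ALLOW_WRITE) || (spte & SPTE_WRITE_PROTECT))
		return false;   /* let the slow path handle it under mmu-lock */

	/*
	 * If rmap_write_protect ran in between, it cleared W and set
	 * SPTE_WRITE_PROTECT, so the spte changed and this cmpxchg fails.
	 */
	return atomic_compare_exchange_strong(sptep, &spte,
					      spte | SPTE_WRITABLE);
}

The write-protect side, under mmu-lock, only has to clear SPTE_WRITABLE and set
SPTE_WRITE_PROTECT for writable sptes, as in the pseudo code above.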


Hmm, how about this one?


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 13:20               ` Avi Kivity
@ 2012-04-09 13:59                 ` Xiao Guangrong
  0 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-09 13:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Xiao Guangrong, Marcelo Tosatti, LKML, KVM

On 04/09/2012 09:20 PM, Avi Kivity wrote:

> On 04/06/2012 08:24 AM, Xiao Guangrong wrote:
>>
>> Foolish me, i should be crazy. Sorry for my mistake. :(
>>
>> Unfortunately, it can not work, we can not get a stable gfn from gpte or
>> sp->gfns[]. For example:
>>
>> beginning:
>> Gpte = Gfn1
>> gfn_to_pfn(Gfn1) = Pfn
>> Spte = Pfn
>> Gfn1 is write-free
>> Gfn2 is write-protected
>>
>>
>> VCPU 0                              VCPU 1                     VCPU 2
>>
>> fault on gpte
>> fast page fault path:
>>   set Spte.fast_pf
>>   get Gfn1 from Gpte/sp->gfns[]
>>   if (Gfn1 is writable)
>>                                 Pfn is swapped out:
>> 					Spte = 0
>> 				Gpte is modified to Gfn2,
>>                                 and Pfn is realloced and remapped
>>                                 to Gfn2, so:
>>                                         Spte = Pfn
>>
>>                                                           fast page fault path:
>>                                                              set Spte.fast_pf
>>
>>          cmpxchg  Spte+w
>>             OOPS!!!
>>   <we see Spte is not changed and
>>    happily make it writable, so gfn2 can be writable>
>>
>> It seems only a unique identification can prevent this. :(
>>
> 
> Ouch.
> 
> What about restricting this to role.direct=1?  Then gfn is stable?
> 


Yes.

The gfn of a direct sp is stable since it is calculated from sp->gfn, which is
independent of the gpte.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 13:55   ` Xiao Guangrong
@ 2012-04-09 14:01     ` Xiao Guangrong
  2012-04-09 14:25     ` Avi Kivity
  1 sibling, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-09 14:01 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, LKML, KVM

On 04/09/2012 09:55 PM, Xiao Guangrong wrote:


> 
> 
> Okay, let's simplify it as possible:
> 
> - let it only fix the page fault with PFEC.P == 1 && PFEC.W = 0, that means


PFEC.P == 1 && PFEC.W = 1, my typo.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 13:55   ` Xiao Guangrong
  2012-04-09 14:01     ` Xiao Guangrong
@ 2012-04-09 14:25     ` Avi Kivity
  1 sibling, 0 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-09 14:25 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 04/09/2012 04:55 PM, Xiao Guangrong wrote:
>
> Okay, let's simplify it as possible:
>
> - let it only fix the page fault with PFEC.P == 1 && PFEC.W = 0, that means
>   unlock set_spte path can be dropped
>
> - let it just fixes the page fault caused by dirty-log that means we always
>   skip the spte which write-protected by shadow page protection.
>
> Then, things should be fair simper:
>
> In set_spte path, if the spte can be writable, we set ALLOW_WRITE bit
> In rmap_write_protect:
>   if (spte.PT_WRITABLE_MASK) {
> 	WARN_ON(!(spte & ALLOW_WRITE));
> 	spte &= ~PT_WRITABLE_MASK;
> 	spte |= WRITE_PROTECT;
>   }
>
> in fast page fault:
>
> if (spte & PT_WRITABLE_MASK)
> 	return_go_guest;
>
> if ((spte & ALLOW_WRITE) && !(spte & WRITE_PROTECT))
> 	cmpxchg spte + PT_WRITABLE_MASK
>
> The information all we needed comes from spte it is independence from other path,
> and no barriers.
>
>
> Hmm, how about this one?
>

I like it. WRITE_PROTECT is better than my ALLOW_WRITES, the meaning is
clearer.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 13:12 ` Avi Kivity
  2012-04-09 13:55   ` Xiao Guangrong
@ 2012-04-09 17:58   ` Marcelo Tosatti
  2012-04-09 18:13     ` Xiao Guangrong
  2012-04-09 18:26     ` Xiao Guangrong
  1 sibling, 2 replies; 92+ messages in thread
From: Marcelo Tosatti @ 2012-04-09 17:58 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Xiao Guangrong, LKML, KVM

On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> > * Idea
> > The present bit of page fault error code (EFEC.P) indicates whether the
> > page table is populated on all levels, if this bit is set, we can know
> > the page fault is caused by the page-protection bits (e.g. W/R bit) or
> > the reserved bits.
> >
> > In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
> > simply fixed: the page fault caused by reserved bit
> > (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
> > path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
> > is just increasing the corresponding access on the spte.
> >
> > This pachset introduces a fast path to fix this kind of page fault: it
> > is out of mmu-lock and need not walk host page table to get the mapping
> > from gfn to pfn.
> >
> >
> 
> This patchset is really worrying to me.
> 
> It introduces a lot of concurrency into data structures that were not
> designed for it.  Even if it is correct, it will be very hard to
> convince ourselves that it is correct, and if it isn't, to debug those
> subtle bugs.  It will also be much harder to maintain the mmu code than
> it is now.
> 
> There are a lot of things to check.  Just as an example, we need to be
> sure that if we use rcu_dereference() twice in the same code path, that
> any inconsistencies due to a write in between are benign.  Doing that is
> a huge task.
> 
> But I appreciate the performance improvement and would like to see a
> simpler version make it in.  This needs to reduce the amount of data
> touched in the fast path so it is easier to validate, and perhaps reduce
> the number of cases that the fast path works on.
> 
> I would like to see the fast path as simple as
> 
>   rcu_read_lock();
> 
>   (lockless shadow walk)
>   spte = ACCESS_ONCE(*sptep);
> 
>   if (!(spte & PT_MAY_ALLOW_WRITES))
>         goto slow;
> 
>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
>   mark_page_dirty(kvm, gfn);
> 
>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
>   if (cmpxchg(sptep, spte, new_spte) != spte)
>        goto slow;
> 
>   rcu_read_unlock();
>   return;
> 
> slow:
>   rcu_read_unlock();
>   slow_path();
> 
> It now becomes the responsibility of the slow path to maintain *sptep &
> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
> can be as simple as a clear_bit() before we update sp->gfns[] or if we
> add host write protection.
> 
> Sorry, it's too complicated for me.  Marcelo, what's your take?

The improvement is small and limited to special cases (migration should
be rare and framebuffer memory accounts for a small percentage of total
memory).

For one, how can this be safe against mmu notifier methods?

KSM			      |VCPU0		| VCPU1
		 	      | fault		| fault
			      | cow-page	|
			      |	set spte RW	|
			      |			|
write protect host pte	      |			|
grab mmu_lock		      |			|
remove writeable bit in spte  |			|
increase mmu_notifier_seq     |			|  spte = read-only spte
release mmu_lock	      |			|  cmpxchg succeeds, RO->RW!

MMU notifiers rely on the fault path sequence being

read host pte
read mmu_notifier_seq
spin_lock(mmu_lock)
if (mmu_notifier_seq changed)
	goodbye, host pte value is stale
spin_unlock(mmu_lock)

By the example above, you cannot rely on the spte value alone,
mmu_notifier_seq must be taken into account.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 17:58   ` Marcelo Tosatti
@ 2012-04-09 18:13     ` Xiao Guangrong
  2012-04-09 19:31       ` Marcelo Tosatti
  2012-04-09 18:26     ` Xiao Guangrong
  1 sibling, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-09 18:13 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, Xiao Guangrong, LKML, KVM

On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:

> On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
>> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
>>> * Idea
>>> The present bit of page fault error code (EFEC.P) indicates whether the
>>> page table is populated on all levels, if this bit is set, we can know
>>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
>>> the reserved bits.
>>>
>>> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
>>> simply fixed: the page fault caused by reserved bit
>>> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
>>> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
>>> is just increasing the corresponding access on the spte.
>>>
>>> This pachset introduces a fast path to fix this kind of page fault: it
>>> is out of mmu-lock and need not walk host page table to get the mapping
>>> from gfn to pfn.
>>>
>>>
>>
>> This patchset is really worrying to me.
>>
>> It introduces a lot of concurrency into data structures that were not
>> designed for it.  Even if it is correct, it will be very hard to
>> convince ourselves that it is correct, and if it isn't, to debug those
>> subtle bugs.  It will also be much harder to maintain the mmu code than
>> it is now.
>>
>> There are a lot of things to check.  Just as an example, we need to be
>> sure that if we use rcu_dereference() twice in the same code path, that
>> any inconsistencies due to a write in between are benign.  Doing that is
>> a huge task.
>>
>> But I appreciate the performance improvement and would like to see a
>> simpler version make it in.  This needs to reduce the amount of data
>> touched in the fast path so it is easier to validate, and perhaps reduce
>> the number of cases that the fast path works on.
>>
>> I would like to see the fast path as simple as
>>
>>   rcu_read_lock();
>>
>>   (lockless shadow walk)
>>   spte = ACCESS_ONCE(*sptep);
>>
>>   if (!(spte & PT_MAY_ALLOW_WRITES))
>>         goto slow;
>>
>>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
>>   mark_page_dirty(kvm, gfn);
>>
>>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
>>   if (cmpxchg(sptep, spte, new_spte) != spte)
>>        goto slow;
>>
>>   rcu_read_unlock();
>>   return;
>>
>> slow:
>>   rcu_read_unlock();
>>   slow_path();
>>
>> It now becomes the responsibility of the slow path to maintain *sptep &
>> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
>> can be as simple as a clear_bit() before we update sp->gfns[] or if we
>> add host write protection.
>>
>> Sorry, it's too complicated for me.  Marcelo, what's your take?
> 
> The improvement is small and limited to special cases (migration should
> be rare and framebuffer memory accounts for a small percentage of total
> memory).
> 
> For one, how can this be safe against mmu notifier methods?
> 
> KSM			      |VCPU0		| VCPU1
> 		 	      | fault		| fault
> 			      | cow-page	|
> 			      |	set spte RW	|
> 			      |			|
> write protect host pte	      |			|
> grab mmu_lock		      |			|
> remove writeable bit in spte  |			|
> increase mmu_notifier_seq     |			|  spte = read-only spte
> release mmu_lock	      |			|  cmpxchg succeeds, RO->RW!
> 
> MMU notifiers rely on the fault path sequence being
> 
> read host pte
> read mmu_notifier_seq
> spin_lock(mmu_lock)
> if (mmu_notifier_seq changed)
> 	goodbye, host pte value is stale
> spin_unlock(mmu_lock)
> 
> By the example above, you cannot rely on the spte value alone,
> mmu_notifier_seq must be taken into account.


No.

When KSM changes the host page to read-only, the HOST_WRITABLE bit of the
spte is removed; that means the spte is changed, and the change is caught by
the cmpxchg.

Note: we mark the spte as writable only if spte.HOST_WRITABLE is set.
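
Put differently, the fast path would look roughly like this (a user-space C
sketch; only SPTE_HOST_WRITEABLE's role matches the patches, the bit positions
are illustrative):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PT_WRITABLE_MASK      (1ULL << 1)
#define SPTE_HOST_WRITEABLE   (1ULL << 62)   /* illustrative position */

static bool fast_pf_fix_write(_Atomic uint64_t *sptep)
{
	uint64_t spte = atomic_load(sptep);

	/* KSM write-protecting the host pte removes SPTE_HOST_WRITEABLE
	 * (or zaps the spte) under mmu-lock, so either we bail out here... */
	if (!(spte & SPTE_HOST_WRITEABLE))
		return false;

	/* ...or the spte has changed since we read it and this fails. */
	return atomic_compare_exchange_strong(sptep, &spte,
					      spte | PT_WRITABLE_MASK);
}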

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 17:58   ` Marcelo Tosatti
  2012-04-09 18:13     ` Xiao Guangrong
@ 2012-04-09 18:26     ` Xiao Guangrong
  2012-04-09 19:46       ` Marcelo Tosatti
  2012-04-10 10:10       ` Avi Kivity
  1 sibling, 2 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-09 18:26 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, Xiao Guangrong, LKML, KVM

On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:

> On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
>> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
>>> * Idea
>>> The present bit of page fault error code (EFEC.P) indicates whether the
>>> page table is populated on all levels, if this bit is set, we can know
>>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
>>> the reserved bits.
>>>
>>> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
>>> simply fixed: the page fault caused by reserved bit
>>> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
>>> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
>>> is just increasing the corresponding access on the spte.
>>>
>>> This pachset introduces a fast path to fix this kind of page fault: it
>>> is out of mmu-lock and need not walk host page table to get the mapping
>>> from gfn to pfn.
>>>
>>>
>>
>> This patchset is really worrying to me.
>>
>> It introduces a lot of concurrency into data structures that were not
>> designed for it.  Even if it is correct, it will be very hard to
>> convince ourselves that it is correct, and if it isn't, to debug those
>> subtle bugs.  It will also be much harder to maintain the mmu code than
>> it is now.
>>
>> There are a lot of things to check.  Just as an example, we need to be
>> sure that if we use rcu_dereference() twice in the same code path, that
>> any inconsistencies due to a write in between are benign.  Doing that is
>> a huge task.
>>
>> But I appreciate the performance improvement and would like to see a
>> simpler version make it in.  This needs to reduce the amount of data
>> touched in the fast path so it is easier to validate, and perhaps reduce
>> the number of cases that the fast path works on.
>>
>> I would like to see the fast path as simple as
>>
>>   rcu_read_lock();
>>
>>   (lockless shadow walk)
>>   spte = ACCESS_ONCE(*sptep);
>>
>>   if (!(spte & PT_MAY_ALLOW_WRITES))
>>         goto slow;
>>
>>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
>>   mark_page_dirty(kvm, gfn);
>>
>>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
>>   if (cmpxchg(sptep, spte, new_spte) != spte)
>>        goto slow;
>>
>>   rcu_read_unlock();
>>   return;
>>
>> slow:
>>   rcu_read_unlock();
>>   slow_path();
>>
>> It now becomes the responsibility of the slow path to maintain *sptep &
>> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
>> can be as simple as a clear_bit() before we update sp->gfns[] or if we
>> add host write protection.
>>
>> Sorry, it's too complicated for me.  Marcelo, what's your take?
> 
> The improvement is small and limited to special cases (migration should
> be rare and framebuffer memory accounts for a small percentage of total
> memory).


Actually, although the framebuffer is small, it is modified really frequently,
and another unlucky thing is that dirty-log is also taken very frequently and
needs to hold mmu-lock to do the write-protection.

Yes, if Xwindow is not enabled, the benefit is limited. :)

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 18:13     ` Xiao Guangrong
@ 2012-04-09 19:31       ` Marcelo Tosatti
  0 siblings, 0 replies; 92+ messages in thread
From: Marcelo Tosatti @ 2012-04-09 19:31 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Xiao Guangrong, LKML, KVM

On Tue, Apr 10, 2012 at 02:13:41AM +0800, Xiao Guangrong wrote:
> On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:
> 
> > On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
> >> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> >>> * Idea
> >>> The present bit of page fault error code (EFEC.P) indicates whether the
> >>> page table is populated on all levels, if this bit is set, we can know
> >>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
> >>> the reserved bits.
> >>>
> >>> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
> >>> simply fixed: the page fault caused by reserved bit
> >>> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
> >>> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
> >>> is just increasing the corresponding access on the spte.
> >>>
> >>> This pachset introduces a fast path to fix this kind of page fault: it
> >>> is out of mmu-lock and need not walk host page table to get the mapping
> >>> from gfn to pfn.
> >>>
> >>>
> >>
> >> This patchset is really worrying to me.
> >>
> >> It introduces a lot of concurrency into data structures that were not
> >> designed for it.  Even if it is correct, it will be very hard to
> >> convince ourselves that it is correct, and if it isn't, to debug those
> >> subtle bugs.  It will also be much harder to maintain the mmu code than
> >> it is now.
> >>
> >> There are a lot of things to check.  Just as an example, we need to be
> >> sure that if we use rcu_dereference() twice in the same code path, that
> >> any inconsistencies due to a write in between are benign.  Doing that is
> >> a huge task.
> >>
> >> But I appreciate the performance improvement and would like to see a
> >> simpler version make it in.  This needs to reduce the amount of data
> >> touched in the fast path so it is easier to validate, and perhaps reduce
> >> the number of cases that the fast path works on.
> >>
> >> I would like to see the fast path as simple as
> >>
> >>   rcu_read_lock();
> >>
> >>   (lockless shadow walk)
> >>   spte = ACCESS_ONCE(*sptep);
> >>
> >>   if (!(spte & PT_MAY_ALLOW_WRITES))
> >>         goto slow;
> >>
> >>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
> >>   mark_page_dirty(kvm, gfn);
> >>
> >>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
> >>   if (cmpxchg(sptep, spte, new_spte) != spte)
> >>        goto slow;
> >>
> >>   rcu_read_unlock();
> >>   return;
> >>
> >> slow:
> >>   rcu_read_unlock();
> >>   slow_path();
> >>
> >> It now becomes the responsibility of the slow path to maintain *sptep &
> >> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
> >> can be as simple as a clear_bit() before we update sp->gfns[] or if we
> >> add host write protection.
> >>
> >> Sorry, it's too complicated for me.  Marcelo, what's your take?
> > 
> > The improvement is small and limited to special cases (migration should
> > be rare and framebuffer memory accounts for a small percentage of total
> > memory).
> > 
> > For one, how can this be safe against mmu notifier methods?
> > 
> > KSM			      |VCPU0		| VCPU1
> > 		 	      | fault		| fault
> > 			      | cow-page	|
> > 			      |	set spte RW	|
> > 			      |			|
> > write protect host pte	      |			|
> > grab mmu_lock		      |			|
> > remove writeable bit in spte  |			|
> > increase mmu_notifier_seq     |			|  spte = read-only spte
> > release mmu_lock	      |			|  cmpxchg succeeds, RO->RW!
> > 
> > MMU notifiers rely on the fault path sequence being
> > 
> > read host pte
> > read mmu_notifier_seq
> > spin_lock(mmu_lock)
> > if (mmu_notifier_seq changed)
> > 	goodbye, host pte value is stale
> > spin_unlock(mmu_lock)
> > 
> > By the example above, you cannot rely on the spte value alone,
> > mmu_notifier_seq must be taken into account.
> 
> 
> No.
> 
> When KSM change the host page to read-only, the HOST_WRITABLE bit
> of spte should be removed, that means, the spte should be changed
> that can be watched by cmpxchg.
> 
> Note: we mark spte to be writeable only if spte.HOST_WRITABLE is
> set.

Right. 
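
Concretely, a minimal sketch of why the cmpxchg is enough (illustrative
only, not the posted patch; it assumes the SPTE_HOST_WRITEABLE and
PT_WRITABLE_MASK bits of the 2012-era x86 mmu code):

  /*
   * The fast path only restores the W bit while SPTE_HOST_WRITEABLE is
   * still set.  KSM/mmu-notifier write protection clears that bit under
   * mmu_lock, so a racing protect changes the spte, the cmpxchg fails,
   * and we fall back to the slow path.
   */
  static bool fast_pf_restore_write(u64 *sptep)
  {
          u64 spte = ACCESS_ONCE(*sptep);

          /* host page went read-only (e.g. KSM COW): take the slow path */
          if (!(spte & SPTE_HOST_WRITEABLE))
                  return false;

          return cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte;
  }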

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 18:26     ` Xiao Guangrong
@ 2012-04-09 19:46       ` Marcelo Tosatti
  2012-04-10  3:06         ` Xiao Guangrong
                           ` (2 more replies)
  2012-04-10 10:10       ` Avi Kivity
  1 sibling, 3 replies; 92+ messages in thread
From: Marcelo Tosatti @ 2012-04-09 19:46 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Xiao Guangrong, LKML, KVM

On Tue, Apr 10, 2012 at 02:26:27AM +0800, Xiao Guangrong wrote:
> On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:
> 
> > On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
> >> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> >>> * Idea
> >>> The present bit of page fault error code (EFEC.P) indicates whether the
> >>> page table is populated on all levels, if this bit is set, we can know
> >>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
> >>> the reserved bits.
> >>>
> >>> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
> >>> simply fixed: the page fault caused by reserved bit
> >>> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
> >>> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
> >>> is just increasing the corresponding access on the spte.
> >>>
> >>> This pachset introduces a fast path to fix this kind of page fault: it
> >>> is out of mmu-lock and need not walk host page table to get the mapping
> >>> from gfn to pfn.
> >>>
> >>>
> >>
> >> This patchset is really worrying to me.
> >>
> >> It introduces a lot of concurrency into data structures that were not
> >> designed for it.  Even if it is correct, it will be very hard to
> >> convince ourselves that it is correct, and if it isn't, to debug those
> >> subtle bugs.  It will also be much harder to maintain the mmu code than
> >> it is now.
> >>
> >> There are a lot of things to check.  Just as an example, we need to be
> >> sure that if we use rcu_dereference() twice in the same code path, that
> >> any inconsistencies due to a write in between are benign.  Doing that is
> >> a huge task.
> >>
> >> But I appreciate the performance improvement and would like to see a
> >> simpler version make it in.  This needs to reduce the amount of data
> >> touched in the fast path so it is easier to validate, and perhaps reduce
> >> the number of cases that the fast path works on.
> >>
> >> I would like to see the fast path as simple as
> >>
> >>   rcu_read_lock();
> >>
> >>   (lockless shadow walk)
> >>   spte = ACCESS_ONCE(*sptep);
> >>
> >>   if (!(spte & PT_MAY_ALLOW_WRITES))
> >>         goto slow;
> >>
> >>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
> >>   mark_page_dirty(kvm, gfn);
> >>
> >>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
> >>   if (cmpxchg(sptep, spte, new_spte) != spte)
> >>        goto slow;
> >>
> >>   rcu_read_unlock();
> >>   return;
> >>
> >> slow:
> >>   rcu_read_unlock();
> >>   slow_path();
> >>
> >> It now becomes the responsibility of the slow path to maintain *sptep &
> >> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
> >> can be as simple as a clear_bit() before we update sp->gfns[] or if we
> >> add host write protection.
> >>
> >> Sorry, it's too complicated for me.  Marcelo, what's your take?
> > 
> > The improvement is small and limited to special cases (migration should
> > be rare and framebuffer memory accounts for a small percentage of total
> > memory).
> 
> 
> Actually, although the framebuffer is small but it is modified really
> frequently, and another unlucky things is that dirty-log is also
> very frequently and need hold mmu-lock to do write-protect.
> 
> Yes, if Xwindow is not enabled, the benefit is limited. :)

Ignoring that fact, the safety of lockless set_spte and friends is not
clear.

Perhaps the mmu_lock hold times by get_dirty are a large component here?
If that can be alleviated, not only RO->RW faults benefit.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 19:46       ` Marcelo Tosatti
@ 2012-04-10  3:06         ` Xiao Guangrong
  2012-04-10 10:04         ` Avi Kivity
  2012-04-10 10:39         ` Avi Kivity
  2 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-10  3:06 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Xiao Guangrong, Avi Kivity, LKML, KVM

On 04/10/2012 03:46 AM, Marcelo Tosatti wrote:

> On Tue, Apr 10, 2012 at 02:26:27AM +0800, Xiao Guangrong wrote:
>> On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:
>>
>>> On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
>>>> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
>>>>> * Idea
>>>>> The present bit of page fault error code (EFEC.P) indicates whether the
>>>>> page table is populated on all levels, if this bit is set, we can know
>>>>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
>>>>> the reserved bits.
>>>>>
>>>>> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
>>>>> simply fixed: the page fault caused by reserved bit
>>>>> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
>>>>> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
>>>>> is just increasing the corresponding access on the spte.
>>>>>
>>>>> This pachset introduces a fast path to fix this kind of page fault: it
>>>>> is out of mmu-lock and need not walk host page table to get the mapping
>>>>> from gfn to pfn.
>>>>>
>>>>>
>>>>
>>>> This patchset is really worrying to me.
>>>>
>>>> It introduces a lot of concurrency into data structures that were not
>>>> designed for it.  Even if it is correct, it will be very hard to
>>>> convince ourselves that it is correct, and if it isn't, to debug those
>>>> subtle bugs.  It will also be much harder to maintain the mmu code than
>>>> it is now.
>>>>
>>>> There are a lot of things to check.  Just as an example, we need to be
>>>> sure that if we use rcu_dereference() twice in the same code path, that
>>>> any inconsistencies due to a write in between are benign.  Doing that is
>>>> a huge task.
>>>>
>>>> But I appreciate the performance improvement and would like to see a
>>>> simpler version make it in.  This needs to reduce the amount of data
>>>> touched in the fast path so it is easier to validate, and perhaps reduce
>>>> the number of cases that the fast path works on.
>>>>
>>>> I would like to see the fast path as simple as
>>>>
>>>>   rcu_read_lock();
>>>>
>>>>   (lockless shadow walk)
>>>>   spte = ACCESS_ONCE(*sptep);
>>>>
>>>>   if (!(spte & PT_MAY_ALLOW_WRITES))
>>>>         goto slow;
>>>>
>>>>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
>>>>   mark_page_dirty(kvm, gfn);
>>>>
>>>>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
>>>>   if (cmpxchg(sptep, spte, new_spte) != spte)
>>>>        goto slow;
>>>>
>>>>   rcu_read_unlock();
>>>>   return;
>>>>
>>>> slow:
>>>>   rcu_read_unlock();
>>>>   slow_path();
>>>>
>>>> It now becomes the responsibility of the slow path to maintain *sptep &
>>>> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
>>>> can be as simple as a clear_bit() before we update sp->gfns[] or if we
>>>> add host write protection.
>>>>
>>>> Sorry, it's too complicated for me.  Marcelo, what's your take?
>>>
>>> The improvement is small and limited to special cases (migration should
>>> be rare and framebuffer memory accounts for a small percentage of total
>>> memory).
>>
>>
>> Actually, although the framebuffer is small but it is modified really
>> frequently, and another unlucky things is that dirty-log is also
>> very frequently and need hold mmu-lock to do write-protect.
>>
>> Yes, if Xwindow is not enabled, the benefit is limited. :)
> 
> Ignoring that fact, the safety of lockless set_spte and friends is not
> clear.
> 


That is why Avi suggested that I simplify the whole thing. :)


> Perhaps the mmu_lock hold times by get_dirty are a large component here?
> If that can be alleviated, not only RO->RW faults benefit.
> 

Yes.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 19:46       ` Marcelo Tosatti
  2012-04-10  3:06         ` Xiao Guangrong
@ 2012-04-10 10:04         ` Avi Kivity
  2012-04-11  1:47           ` Marcelo Tosatti
  2012-04-10 10:39         ` Avi Kivity
  2 siblings, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-04-10 10:04 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Xiao Guangrong, Xiao Guangrong, LKML, KVM

On 04/09/2012 10:46 PM, Marcelo Tosatti wrote:
> Perhaps the mmu_lock hold times by get_dirty are a large component here?

That's my concern, because it affects the scaling of migration for wider
guests.

> If that can be alleviated, not only RO->RW faults benefit.

Those are the most common types of faults on modern hardware, no?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 18:26     ` Xiao Guangrong
  2012-04-09 19:46       ` Marcelo Tosatti
@ 2012-04-10 10:10       ` Avi Kivity
  1 sibling, 0 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-10 10:10 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Marcelo Tosatti, Xiao Guangrong, LKML, KVM, qemu-devel, spice-devel

On 04/09/2012 09:26 PM, Xiao Guangrong wrote:
> Yes, if Xwindow is not enabled, the benefit is limited. :)

I'm more interested in migration.

We could optimize the framebuffer by disabling dirty logging when
VNC/Spice is not connected (which should usually be the case), or when
the SDL window is minimized (shouldn't be that often, unfortunately)

Related, qxl doesn't seem to stop the dirty log when switching to
accelerated mode.  vmsvga gets it right:

    case SVGA_REG_ENABLE:
        s->enable = value;
        s->config &= !!value;
        s->width = -1;
        s->height = -1;
        s->invalidated = 1;
        s->vga.invalidate(&s->vga);
        if (s->enable) {
            s->fb_size = ((s->depth + 7) >> 3) * s->new_width * s->new_height;
            vga_dirty_log_stop(&s->vga);
        } else {
            vga_dirty_log_start(&s->vga);
        }
        break;


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-09 19:46       ` Marcelo Tosatti
  2012-04-10  3:06         ` Xiao Guangrong
  2012-04-10 10:04         ` Avi Kivity
@ 2012-04-10 10:39         ` Avi Kivity
  2012-04-10 11:40           ` Takuya Yoshikawa
  2 siblings, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-04-10 10:39 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Xiao Guangrong, Xiao Guangrong, LKML, KVM

On 04/09/2012 10:46 PM, Marcelo Tosatti wrote:
> Perhaps the mmu_lock hold times by get_dirty are a large component here?
> If that can be alleviated, not only RO->RW faults benefit.
>
>

Currently the longest holder in normal use is probably reading the dirty
log and write protecting the shadow page tables.

We could fix that by switching to O(1) write protection
(write-protecting PML4Es instead of PTEs).  It would be interesting to
combine O(1) write protection with lockless write-enabling.
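
Roughly, and only as a sketch (the kvm_mmu_page layout is simplified and
the helper name is made up; the real interface would differ):

  /*
   * O(1) idea: caller holds mmu_lock and clears the writable bit on the
   * 512 top-level (PML4) entries of a shadow root instead of on every
   * leaf spte; faults then write-enable one subtree at a time.
   */
  static void write_protect_root(struct kvm *kvm, u64 *root_spt)
  {
          int i;

          for (i = 0; i < 512; i++)
                  if (is_shadow_present_pte(root_spt[i]))
                          root_spt[i] &= ~PT_WRITABLE_MASK;

          kvm_flush_remote_tlbs(kvm);
  }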

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-10 10:39         ` Avi Kivity
@ 2012-04-10 11:40           ` Takuya Yoshikawa
  2012-04-10 11:58             ` Xiao Guangrong
  0 siblings, 1 reply; 92+ messages in thread
From: Takuya Yoshikawa @ 2012-04-10 11:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, Xiao Guangrong, Xiao Guangrong, LKML, KVM

On Tue, 10 Apr 2012 13:39:14 +0300
Avi Kivity <avi@redhat.com> wrote:

> On 04/09/2012 10:46 PM, Marcelo Tosatti wrote:
> > Perhaps the mmu_lock hold times by get_dirty are a large component here?
> > If that can be alleviated, not only RO->RW faults benefit.
> >
> >
> 
> Currently the longest holder in normal use is probably reading the dirty
> log and write protecting the shadow page tables.
> 
> We could fix that by switching to O(1) write protection
> (write-protecting PML4Es instead of PTEs).  It would be interesting to
> combine O(1) write protection with lockless write-enabling.
> 

As Marcelo suggested during reviewing srcu-less dirty logging, we can
mitigate the get_dirty's mmu_lock hold time problem cleanly, locally in
get_dirty_log(), by using cond_resched_lock() -- although we need to
introduce cond_resched_lock_cb() to conditionally flush TLB.
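
Something along these lines -- a sketch only; cond_resched_lock() and
kvm_flush_remote_tlbs() are real, the per-gfn helper is hypothetical:

  static void write_protect_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
  {
          unsigned long i;
          bool flush = false;

          spin_lock(&kvm->mmu_lock);
          for (i = 0; i < slot->npages; i++) {
                  /* hypothetical helper: returns true if it cleared a W bit */
                  flush |= rmap_write_protect_one(kvm, slot, i);

                  if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
                          if (flush) {
                                  kvm_flush_remote_tlbs(kvm);
                                  flush = false;
                          }
                          cond_resched_lock(&kvm->mmu_lock);
                  }
          }
          if (flush)
                  kvm_flush_remote_tlbs(kvm);
          spin_unlock(&kvm->mmu_lock);
  }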

I have already started that work.

Actually I introduced rmap based get_dirty for that kind of fine-grained
contention control.

I think we should do our best not to affect mmu so much just for the
limited time of live migration.

	Takuya

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-10 11:40           ` Takuya Yoshikawa
@ 2012-04-10 11:58             ` Xiao Guangrong
  2012-04-11 12:15               ` Takuya Yoshikawa
  0 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-10 11:58 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Avi Kivity, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On 04/10/2012 07:40 PM, Takuya Yoshikawa wrote:

> On Tue, 10 Apr 2012 13:39:14 +0300
> Avi Kivity <avi@redhat.com> wrote:
> 
>> On 04/09/2012 10:46 PM, Marcelo Tosatti wrote:
>>> Perhaps the mmu_lock hold times by get_dirty are a large component here?
>>> If that can be alleviated, not only RO->RW faults benefit.
>>>
>>>
>>
>> Currently the longest holder in normal use is probably reading the dirty
>> log and write protecting the shadow page tables.
>>
>> We could fix that by switching to O(1) write protection
>> (write-protecting PML4Es instead of PTEs).  It would be interesting to
>> combine O(1) write protection with lockless write-enabling.
>>
> 
> As Marcelo suggested during reviewing srcu-less dirty logging, we can
> mitigate the get_dirty's mmu_lock hold time problem cleanly, locally in
> get_dirty_log(), by using cond_resched_lock() -- although we need to
> introduce cond_rescheck_lock_cb() to conditionally flush TLB.
> 


Although it can reduce the contention, it does not reduce the overhead
of dirty-log.


> I have already started that work.
> 
> Actually I introduced rmap based get_dirty for that kind of fine-grained
> contention control.
> 


I do not think this way is better than O(1).  Avi has explained the reason
many times, and I agree with that. :)

> I think we should do our best not to affect mmu so much just for the
> limited time of live migration.
> 


No, I do not really agree with that.

We really can get a great benefit from O(1), especially if lockless
write-protect is introduced for O(1); live migration is very useful in a
cloud computing architecture to balance the load across all nodes.

And there is no reason to forbid us from touching the MMU code; yes, it
needs to stay simple, but that does not mean stopping the development of
the MMU.

On the other hand, a mechanism like yours to improve dirty-log also needs
to introduce lots of code, and it does not make the MMU clearer. :)


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-10 10:04         ` Avi Kivity
@ 2012-04-11  1:47           ` Marcelo Tosatti
  2012-04-11  9:15             ` Avi Kivity
  0 siblings, 1 reply; 92+ messages in thread
From: Marcelo Tosatti @ 2012-04-11  1:47 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Xiao Guangrong, Xiao Guangrong, LKML, KVM

On Tue, Apr 10, 2012 at 01:04:13PM +0300, Avi Kivity wrote:
> On 04/09/2012 10:46 PM, Marcelo Tosatti wrote:
> > Perhaps the mmu_lock hold times by get_dirty are a large component here?
> 
> That's my concern, because it affects the scaling of migration for wider
> guests.
> 
> > If that can be alleviated, not only RO->RW faults benefit.
> 
> Those are the most common types of faults on modern hardware, no?

Depends on your workload, of course. If there is memory pressure,
0->PRESENT might be very frequent. My point is that reduction of
mmu_lock contention is a good thing overall.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-11  1:47           ` Marcelo Tosatti
@ 2012-04-11  9:15             ` Avi Kivity
  0 siblings, 0 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-11  9:15 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Xiao Guangrong, Xiao Guangrong, LKML, KVM

On 04/11/2012 04:47 AM, Marcelo Tosatti wrote:
> On Tue, Apr 10, 2012 at 01:04:13PM +0300, Avi Kivity wrote:
> > On 04/09/2012 10:46 PM, Marcelo Tosatti wrote:
> > > Perhaps the mmu_lock hold times by get_dirty are a large component here?
> > 
> > That's my concern, because it affects the scaling of migration for wider
> > guests.
> > 
> > > If that can be alleviated, not only RO->RW faults benefit.
> > 
> > Those are the most common types of faults on modern hardware, no?
>
> Depends on your workload, of course. If there is memory pressure,
> 0->PRESENT might be very frequent. My point is that reduction of
> mmu_lock contention is a good thing overall.
>

Agreed.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-10 11:58             ` Xiao Guangrong
@ 2012-04-11 12:15               ` Takuya Yoshikawa
  2012-04-11 12:38                 ` Xiao Guangrong
  0 siblings, 1 reply; 92+ messages in thread
From: Takuya Yoshikawa @ 2012-04-11 12:15 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On Tue, 10 Apr 2012 19:58:44 +0800
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:

> No, i do not really agree with that.
> 
> We really can get great benefit from O(1) especially if lockless write-protect
> is introduced for O(1), live migration is very useful for cloud computing
> architecture to balance the overload on all nodes.

Recently, you said to me that you were not familiar with live migration.
Actually you did not know the basics of pre-copy live migration.

I know live migration better than you because NTT has Kemari and it uses
live migration infrastructure.  My work originated from the data
I got during profiling Kemari.

SRCU-less dirty logging was also motivated by the pressures from scheduler
developers.  Everything was really needed.


Have you ever used live migration for real service?

I cannot say whether O(1) is OK with me without any real background.

	Takuya

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-11 12:15               ` Takuya Yoshikawa
@ 2012-04-11 12:38                 ` Xiao Guangrong
  2012-04-11 14:14                   ` Takuya Yoshikawa
  0 siblings, 1 reply; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-11 12:38 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Avi Kivity, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On 04/11/2012 08:15 PM, Takuya Yoshikawa wrote:

> On Tue, 10 Apr 2012 19:58:44 +0800
> Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:
> 
>> No, i do not really agree with that.
>>
>> We really can get great benefit from O(1) especially if lockless write-protect
>> is introduced for O(1), live migration is very useful for cloud computing
>> architecture to balance the overload on all nodes.
> 
> Recently, you said to me that you were not familiar with live migration.
> Actually you did not know the basics of pre-copy live migration.
> 
> I know live migration better than you because NTT has Kemari and it uses
> live migration infrastructure.  My work is originated from the data
> I got during profiling Kemari.


Well, my point is that live migration is so useful that it is worth
improving; your description also proves this point.

What do you really want to say that I missed?

> 
> SRCU-less dirty logging was also motivated by the pressures from scheduler
> developers.  Everything was really needed.
> 


Totally agree.  Please note, I did not negate your contribution to dirty
logging at all.

> 
> Have you ever used live migration for real service?
> 


I should admit that you know live migration better than I do, but that does
not hinder our discussion.  If you think I was wrong, you are welcome to
correct me at any time.

> I cannot say whether O(1) is OK with me without any real background.
> 


Okay, let us compare the performance numbers after O(1) is implemented.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-11 12:38                 ` Xiao Guangrong
@ 2012-04-11 14:14                   ` Takuya Yoshikawa
  2012-04-11 14:21                     ` Avi Kivity
  2012-04-13 14:25                     ` Takuya Yoshikawa
  0 siblings, 2 replies; 92+ messages in thread
From: Takuya Yoshikawa @ 2012-04-11 14:14 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On Wed, 11 Apr 2012 20:38:57 +0800
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:

> Well, my point is that live migration is so very useful that it is worth
> to be improved, the description of your also proves this point.
> 
> What is your really want to say but i missed?

How to improve and what we should pay for that.

Note that I am not objecting to O(1) itself.

Do you remember that when we discussed the O(1) issue last year with Avi,
the agreement was that we should take more time and look carefully
with more measurements to confirm if it's really worthwhile.

The point is whether we should do O(1) now, including near future.

My opinion is that we should do what we can do now and wait for feedback
from real users.

Before making the current code stable, I do not want to see it replaced
so dramatically.  Otherwise when can we use live migration with enough
confidence?  There may be other subtle bugs we should fix now.

In addition, XBZRLE and post-copy are now being developed in QEMU.


What do you think about this Avi, Marcelo?


I am testing the current live migration to see when and for what it can
be used.  I really want to see it become stable and usable for real
services.


> Okay, let us to compare the performance number after O(1) implemented.

From my experience, I want to say that it is very difficult to make claims
about live migration performance.  That is the problem I am now struggling with.

I developed dirty-log-perf unit-test for that but that was not enough.

Needless to say, checking the correctness is harder.


So I really do not want to see drastic change now without any real need
or feedback from real users -- this is my point.


Thanks,
	Takuya

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-11 14:14                   ` Takuya Yoshikawa
@ 2012-04-11 14:21                     ` Avi Kivity
  2012-04-11 22:26                       ` Takuya Yoshikawa
  2012-04-13 14:25                     ` Takuya Yoshikawa
  1 sibling, 1 reply; 92+ messages in thread
From: Avi Kivity @ 2012-04-11 14:21 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Xiao Guangrong, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On 04/11/2012 05:14 PM, Takuya Yoshikawa wrote:
> On Wed, 11 Apr 2012 20:38:57 +0800
> Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:
>
> > Well, my point is that live migration is so very useful that it is worth
> > to be improved, the description of your also proves this point.
> > 
> > What is your really want to say but i missed?
>
> How to improve and what we should pay for that.
>
> Note that I am not objecting to O(1) itself.
>
> Do you remember that when we discussed O(1) issue last year, with Avi,
> the agreement was that we should take more time and look carefully
> with more measurements to confirm if it's really worthwhile.
>
> The point is whether we should do O(1) now, including near future.
>
> My opinion is that we should do what we can do now and wait for feedback
> from real users.
>
> Before making the current code stable, I do not want to see it replaced
> so dramatically.  Otherwise when can we use live migration with enough
> confidence?  There may be another subtle bugs we should fix now.
>
> In addition, XBRZLE and post-copy is now being developed in QEMU.
>
>
> What do you think about this Avi, Marcelo?

Currently the main performance bottleneck for migration is qemu, which
is single threaded and generally inefficient.  However I am sure that
once the qemu bottlenecks will be removed we'll encounter kvm problems,
particularly with wide (many vcpus) and large (lots of memory) guests. 
So it's a good idea to improve in this area.  I agree we'll need to
measure each change, perhaps with a test program until qemu catches up.

> I am testing the current live migration to see when and for what it can
> be used.  I really want to see it become stable and usable for real
> services.

Well, it's used in production now.

> > Okay, let us to compare the performance number after O(1) implemented.
>
> From my experience, I want to say that live migration is very difficult
> to say about performance.  That is the problem I am now struggling with.
>
> I developed dirty-log-perf unit-test for that but that was not enough.
>
> Needless to say, checking the correctness is harder.
>
>
> So I really do not want to see drastic change now without any real need
> or feedback from real users -- this is my point.
>

It's a good point, we should avoid change for its own sake.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-11 14:21                     ` Avi Kivity
@ 2012-04-11 22:26                       ` Takuya Yoshikawa
  0 siblings, 0 replies; 92+ messages in thread
From: Takuya Yoshikawa @ 2012-04-11 22:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Xiao Guangrong, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On Wed, 11 Apr 2012 17:21:30 +0300
Avi Kivity <avi@redhat.com> wrote:

> Currently the main performance bottleneck for migration is qemu, which
> is single threaded and generally inefficient.  However I am sure that
> once the qemu bottlenecks will be removed we'll encounter kvm problems,
> particularly with wide (many vcpus) and large (lots of memory) guests. 
> So it's a good idea to improve in this area.  I agree we'll need to
> measure each change, perhaps with a test program until qemu catches up.

I agree.

I am especially interested in XBZRLE + the current srcu-less work.

> > I am testing the current live migration to see when and for what it can
> > be used.  I really want to see it become stable and usable for real
> > services.

> Well, it's used in production now.

About RHEL6 e.g., yes of course and we are ...

My comment was about the current srcu-less work and whether I can make it
stable enough in this rc cycle.  I think it will enlarge the real use cases
to some extent.

> > So I really do not want to see drastic change now without any real need
> > or feedback from real users -- this is my point.

> It's a good point, we should avoid change for its own sake.

Yes, especially because live migration users are limited to those who have
such services.

I hope that kernel developers start using it in their desktops!!!???

Thanks,
	Takuya

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 08/13] KVM: MMU: fask check whether page is writable
  2012-04-05 17:54     ` Xiao Guangrong
@ 2012-04-12 23:08       ` Marcelo Tosatti
  2012-04-13 10:26         ` Xiao Guangrong
  0 siblings, 1 reply; 92+ messages in thread
From: Marcelo Tosatti @ 2012-04-12 23:08 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, Xiao Guangrong, LKML, KVM

On Fri, Apr 06, 2012 at 01:54:52AM +0800, Xiao Guangrong wrote:
> Hi Avi,
> 
> Thanks very much for your review!
> 
> Sorry for the delay reply since i was on vacation.
> 
> On 04/01/2012 11:52 PM, Avi Kivity wrote:
> 
> > On 03/29/2012 11:25 AM, Xiao Guangrong wrote:
> >> Using PTE_LIST_WRITE_PROTECT bit in rmap store the write-protect status to
> >> avoid unnecessary shadow page walking
> >>
> >> Also if no shadow page is indirect, the page is write-free
> >>
> >>
> >> @@ -2262,6 +2291,9 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
> >>  	}
> >>  	if (need_unsync)
> >>  		kvm_unsync_pages(vcpu, gfn);
> >> +
> >> +	*rmap &= ~PTE_LIST_WRITE_PROTECT;
> >> +
> >>
> > 
> > So what are the rules for PTE_LIST_WRITE_PROTECT?  Is is a cache for the
> > mmu_need_write_protect?
> > 
> > I'd like to understand it, I guess it can be set while write protection
> > is unneeded, and cleared on the next check?
> > 
> 
> 
> Yes, it is used as a cache for mmu_need_write_protect.
> 
> When the gfn is protected by sync sp or read-only host page we set this bit,
> and it is be cleared when the sp become unsync or host page becomes writable.

Wouldn't dropping support for shadow entirely make it much simpler?

The idea to handle RO->RW without mmu_lock is very neat, but the
complexity with shadow is horrible.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 08/13] KVM: MMU: fask check whether page is writable
  2012-04-12 23:08       ` Marcelo Tosatti
@ 2012-04-13 10:26         ` Xiao Guangrong
  0 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-13 10:26 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Xiao Guangrong, Avi Kivity, LKML, KVM

On 04/13/2012 07:08 AM, Marcelo Tosatti wrote:


>> Yes, it is used as a cache for mmu_need_write_protect.
>>
>> When the gfn is protected by sync sp or read-only host page we set this bit,
>> and it is be cleared when the sp become unsync or host page becomes writable.
> 
> Wouldnt dropping support for shadow entirely make it much simpler? 
> 
> The idea to handle RO->RW without mmu_lock is very neat, but the
> complexity with shadow is horrible.
> 


I have posted v2, which is fairly simple; please review that one. :)
Thank you, Marcelo!


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-11 14:14                   ` Takuya Yoshikawa
  2012-04-11 14:21                     ` Avi Kivity
@ 2012-04-13 14:25                     ` Takuya Yoshikawa
  2012-04-15  9:32                       ` Avi Kivity
  1 sibling, 1 reply; 92+ messages in thread
From: Takuya Yoshikawa @ 2012-04-13 14:25 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Xiao Guangrong, Avi Kivity, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

Xiao,

Takuya Yoshikawa <takuya.yoshikawa@gmail.com> wrote:

> > What is your really want to say but i missed?
> 
> How to improve and what we should pay for that.
> 
> Note that I am not objecting to O(1) itself.
> 

I forgot to say one important thing -- I might have given you the wrong impression.

I am perfectly fine with your lock-less work.  It is really nice!

The reason I say much about O(1) is that O(1) and rmap based
GET_DIRTY_LOG have fundamentally different characteristics.

I am thinking really seriously how to make dirty page tracking work
well with QEMU in the future.

For example, I am thinking about multi-threaded and fine-grained
GET_DIRTY_LOG.

If we use rmap based GET_DIRTY_LOG, we can restrict write protection to
only a selected area of one guest memory slot.

So we may be able to make each thread process dirty pages independently
from other threads by calling GET_DIRTY_LOG for its own area.
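
For illustration only (the range interface and the rmap helper below are
assumptions here, not an existing ioctl or function), the per-thread piece
could look like:

  static void write_protect_slot_range(struct kvm *kvm,
                                       struct kvm_memory_slot *slot,
                                       gfn_t start, gfn_t end)
  {
          gfn_t gfn;

          spin_lock(&kvm->mmu_lock);
          for (gfn = start; gfn < end; gfn++)
                  /* assumed per-gfn rmap write-protect helper */
                  kvm_mmu_rmap_write_protect(kvm, gfn, slot);
          spin_unlock(&kvm->mmu_lock);
  }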

But I know that O(1) has its own good point.
So please wait a bit.  I will write up what I am thinking or send patches.

Anyway, I am looking forward to your lock-less work!
It will improve the current GET_DIRTY_LOG performance.

Thanks,
	Takuya

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-13 14:25                     ` Takuya Yoshikawa
@ 2012-04-15  9:32                       ` Avi Kivity
  2012-04-16 15:49                           ` Takuya Yoshikawa
  2012-04-17  6:16                         ` Xiao Guangrong
  0 siblings, 2 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-15  9:32 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Xiao Guangrong, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On 04/13/2012 05:25 PM, Takuya Yoshikawa wrote:
> I forgot to say one important thing -- I might give you wrong impression.
>
> I am perfectly fine with your lock-less work.  It is really nice!
>
> The reason I say much about O(1) is that O(1) and rmap based
> GET_DIRTY_LOG have fundamentally different characteristics.
>
> I am thinking really seriously how to make dirty page tracking work
> well with QEMU in the future.
>
> For example, I am thinking about multi-threaded and fine-grained
> GET_DIRTY_LOG.
>
> If we use rmap based GET_DIRTY_LOG, we can restrict write protection to
> only a selected area of one guest memory slot.
>
> So we may be able to make each thread process dirty pages independently
> from other threads by calling GET_DIRTY_LOG for its own area.
>
> But I know that O(1) has its own good point.
> So please wait a bit.  I will write up what I am thinking or send patches.
>
> Anyway, I am looking forward to your lock-less work!
> It will improve the current GET_DIRTY_LOG performance.
>
>

Just to throw another idea into the mix - we can have write-protect-less
dirty logging, too.  Instead of write protection, drop the dirty bit,
and check it again when reading the dirty log.  It might look like we're
accessing the spte twice here, but it's actually just once - when we
check it to report for GET_DIRTY_LOG call N, we also prepare it for call
N+1.

This doesn't work for EPT, which lacks a dirty bit.  But we can emulate
it: take a free bit and call it spte.NOTDIRTY; when it is set, we also
clear spte.WRITE, and teach the mmu that if it sees spte.NOTDIRTY it
can just set spte.WRITE and clear spte.NOTDIRTY.  Now that looks exactly
like Xiao's lockless write enabling.
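
As a hedged sketch of the dirty-bit variant (shadow_dirty_mask stands for
the hardware dirty bit; the EPT spte.NOTDIRTY emulation would simply invert
the polarity):

  /*
   * Report the dirty state for GET_DIRTY_LOG call N and clear it in the
   * same pass, so the spte is already armed for call N+1.
   */
  static bool test_and_clear_spte_dirty(u64 *sptep)
  {
          u64 spte = ACCESS_ONCE(*sptep);

          if (!(spte & shadow_dirty_mask))
                  return false;

          /* cmpxchg so a racing spte update is not lost */
          return cmpxchg64(sptep, spte, spte & ~shadow_dirty_mask) == spte;
  }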

Another note: O(1) write protection is not mutually exclusive with rmap
based write protection.  In GET_DIRTY_LOG, you write protect everything,
and proceed to write enable on faults.  When you reach the page table
level, you perform the rmap check to see if you should write protect or
not.  With role.direct=1 the check is very cheap (and sometimes you can
drop the entire page table and replace it with a large spte).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-15  9:32                       ` Avi Kivity
@ 2012-04-16 15:49                           ` Takuya Yoshikawa
  2012-04-17  6:16                         ` Xiao Guangrong
  1 sibling, 0 replies; 92+ messages in thread
From: Takuya Yoshikawa @ 2012-04-16 15:49 UTC (permalink / raw)
  To: Avi Kivity, kvm-ppc
  Cc: Xiao Guangrong, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On Sun, 15 Apr 2012 12:32:59 +0300
Avi Kivity <avi@redhat.com> wrote:

> Just to throw another idea into the mix - we can have write-protect-less
> dirty logging, too.  Instead of write protection, drop the dirty bit,
> and check it again when reading the dirty log.  It might look like we're
> accessing the spte twice here, but it's actually just once - when we
> check it to report for GET_DIRTY_LOG call N, we also prepare it for call
> N+1.

kvm-ppc's dirty tracking, implemented by Paul, is using information supplied
by hardware and seems to be similar to what you described here.

We may be able to get feedback from kvm-ppc developers.

> This doesn't work for EPT, which lacks a dirty bit.  But we can emulate
> it: take a free bit and call it spte.NOTDIRTY, when it is set, we also
> clear spte.WRITE, and teach the mmu that if it sees spte.NOTDIRTY and
> can just set spte.WRITE and clear spte.NOTDIRTY.  Now that looks exactly
> like Xiao's lockless write enabling.

How do we sync with dirty_bitmap?

> Another note: O(1) write protection is not mutually exclusive with rmap
> based write protection.  In GET_DIRTY_LOG, you write protect everything,
> and proceed to write enable on faults.  When you reach the page table
> level, you perform the rmap check to see if you should write protect or
> not.  With role.direct=1 the check is very cheap (and sometimes you can
> drop the entire page table and replace it with a large spte).

I understand that there are many possible combinations.

But the question is whether the complexity is really worth it.


Once, when we were searching for a way to achieve the atomic bitmap switch, you said
to me that we should do our best not to add overheads to VCPU threads.

From then, I tried my best to mitigate the latency problem without adding
code to VCPU thread paths: if we add cond_resched patch, we will get a simple
solution to the current known problem -- probably 64GB guests will work well
without big latencies, once QEMU gets improved.

	I also surveyed other known hypervisors internally.  We can easily see
	hundreds of ms latency during migration.  But people rarely complain
	about that if they are stable and usable in most situations.


Although O(1) is actually O(1) for GET_DIRTY_LOG thread, it adds some
overheads to page fault handling.  We may need to hold mmu_lock for properly
handling O(1)'s write protection and ~500 write protections will not be so
cheap.  And there is no answer to the question of how to achieve slot-wise write
protection.

Of course, we may need such a tree-wide write protection when we want to
support guests with hundreds of GB, or TB, of memory.  Sadly that is not the case now.


Well, if you need the best answer now, we should discuss the whole design:
KVM Forum may be a good place for that.

Thanks,
	Takuya

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-16 15:49                           ` Takuya Yoshikawa
@ 2012-04-16 16:02                             ` Avi Kivity
  -1 siblings, 0 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-16 16:02 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: kvm-ppc, Xiao Guangrong, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On 04/16/2012 06:49 PM, Takuya Yoshikawa wrote:
> > This doesn't work for EPT, which lacks a dirty bit.  But we can emulate
> > it: take a free bit and call it spte.NOTDIRTY, when it is set, we also
> > clear spte.WRITE, and teach the mmu that if it sees spte.NOTDIRTY and
> > can just set spte.WRITE and clear spte.NOTDIRTY.  Now that looks exactly
> > like Xiao's lockless write enabling.
>
> How do we sync with dirty_bitmap?

In Xiao's patch we call mark_page_dirty() at fault time.  With the
write-protect-less approach, we look at spte.DIRTY (or spte.NOTDIRTY)
during GET_DIRTY_LOG, or when the spte is torn down.
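
Sketched (the per-spte helper and its caller are assumptions, not the real
mmu interface), folding that state into dirty_bitmap looks like:

  static void sync_one_spte(u64 *sptep, gfn_t gfn,
                            struct kvm_memory_slot *slot,
                            unsigned long *dirty_bitmap)
  {
          u64 spte = ACCESS_ONCE(*sptep);

          /* report for call N and rearm for call N+1 in the same pass */
          if ((spte & shadow_dirty_mask) &&
              cmpxchg64(sptep, spte, spte & ~shadow_dirty_mask) == spte)
                  __set_bit(gfn - slot->base_gfn, dirty_bitmap);
  }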

> > Another note: O(1) write protection is not mutually exclusive with rmap
> > based write protection.  In GET_DIRTY_LOG, you write protect everything,
> > and proceed to write enable on faults.  When you reach the page table
> > level, you perform the rmap check to see if you should write protect or
> > not.  With role.direct=1 the check is very cheap (and sometimes you can
> > drop the entire page table and replace it with a large spte).
>
> I understand that there are many possible combinations.
>
> But the question is whether the complexity is really worth it.

We don't know yet.  I'm just throwing ideas around.

> Once, when we were searching a way to find atomic bitmap switch, you said
> to me that we should do our best not to add overheads to VCPU threads.
>
> From then, I tried my best to mitigate the latency problem without adding
> code to VCPU thread paths: if we add cond_resched patch, we will get a simple
> solution to the current known problem -- probably 64GB guests will work well
> without big latencies, once QEMU gets improved.

Sure, I'm not advocating doing the most nifty idea.  After all I'm the
one that suffers most from it.  Everything should be proven to improve,
and the improvement should be material, not just a random measurement
that doesn't matter to anyone.

>
> 	I also surveyed other known hypervisors internally.  We can easily see
> 	hundreds of ms latency during migration.  But people rarely complain
> 	about that if they are stable and usable in most situations.

There is also the unavoidable latency during the final stop-and-copy
phase, at least without post-copy.  And the migration thread (when we
have one) is hardly latency sensitive.

> Although O(1) is actually O(1) for GET_DIRTY_LOG thread, it adds some
> overheads to page fault handling.  We may need to hold mmu_lock for properly
> handling O(1)'s write protection and ~500 write protections will not be so
> cheap.  And there is no answer to the question how to achive slot-wise write
> protection.
>
> Of course, we may need such a tree-wide write protection when we want to
> support guests with hundreds of GB, or TB, of memory.  Sadly it's not now.
>
>
> Well, if you need the best answer now, we should discuss the whole design:
> KVM Forum may be a good place for that.

We don't need the best answer now, I'm satisfied with incremental
improvements.  But it's good to have the ideas out in the open, maybe
some of them will be adopted, or maybe they'll trigger a better idea.

(btw O(1) write protection is equally applicable to ordinary fork())

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-15  9:32                       ` Avi Kivity
  2012-04-16 15:49                           ` Takuya Yoshikawa
@ 2012-04-17  6:16                         ` Xiao Guangrong
  1 sibling, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-17  6:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Takuya Yoshikawa, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On 04/15/2012 05:32 PM, Avi Kivity wrote:

> On 04/13/2012 05:25 PM, Takuya Yoshikawa wrote:
>> I forgot to say one important thing -- I might give you wrong impression.
>>
>> I am perfectly fine with your lock-less work.  It is really nice!
>>
>> The reason I say much about O(1) is that O(1) and rmap based
>> GET_DIRTY_LOG have fundamentally different characteristics.
>>
>> I am thinking really seriously how to make dirty page tracking work
>> well with QEMU in the future.
>>
>> For example, I am thinking about multi-threaded and fine-grained
>> GET_DIRTY_LOG.
>>
>> If we use rmap based GET_DIRTY_LOG, we can restrict write protection to
>> only a selected area of one guest memory slot.
>>
>> So we may be able to make each thread process dirty pages independently
>> from other threads by calling GET_DIRTY_LOG for its own area.
>>
>> But I know that O(1) has its own good point.
>> So please wait a bit.  I will write up what I am thinking or send patches.
>>
>> Anyway, I am looking forward to your lock-less work!
>> It will improve the current GET_DIRTY_LOG performance.
>>
>>
> 
> Just to throw another idea into the mix - we can have write-protect-less
> dirty logging, too.  Instead of write protection, drop the dirty bit,
> and check it again when reading the dirty log.  It might look like we're
> accessing the spte twice here, but it's actually just once - when we
> check it to report for GET_DIRTY_LOG call N, we also prepare it for call
> N+1.
> 


Walking every gfn's rmap is still expensive; at least, it is not good for
scalability.

I want to use a generation number to tell the mmu to write-protect the PML4s.
It is completely out of mmu-lock and, together with lockless write enabling,
can let it run as parallel as possible.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-16 15:49                           ` Takuya Yoshikawa
@ 2012-04-17  6:26                             ` Xiao Guangrong
  -1 siblings, 0 replies; 92+ messages in thread
From: Xiao Guangrong @ 2012-04-17  6:26 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Avi Kivity, kvm-ppc, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On 04/16/2012 11:49 PM, Takuya Yoshikawa wrote:


> Although O(1) is actually O(1) for GET_DIRTY_LOG thread, it adds some
> overheads to page fault handling.  We may need to hold mmu_lock for properly
> handling O(1)'s write protection and ~500 write protections will not be so
> cheap.  And there is no answer to the question how to achive slot-wise write
> protection.
> 


Actually, no.

We do not increase the overhead of page faults for migration. The number of
page faults with O(1) is the same as with write-protecting every spte.

And we can also avoid holding mmu_lock to write-protect the PML4s: we can use
a generation number and notify the mmu to update its page tables when dirty-log
is enabled.

Anyway, no performance data, no truth. Let me implement it first.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-17  6:26                             ` Xiao Guangrong
@ 2012-04-17  7:51                               ` Avi Kivity
  -1 siblings, 0 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-17  7:51 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Takuya Yoshikawa, kvm-ppc, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On 04/17/2012 09:26 AM, Xiao Guangrong wrote:
> On 04/16/2012 11:49 PM, Takuya Yoshikawa wrote:
>
>
> > Although O(1) is actually O(1) for GET_DIRTY_LOG thread, it adds some
> > overheads to page fault handling.  We may need to hold mmu_lock for properly
> > handling O(1)'s write protection and ~500 write protections will not be so
> > cheap.  And there is no answer to the question how to achive slot-wise write
> > protection.
> > 
>
>
> Actually, no.
>
> We do not increase the overhead of page faults for migration: with O(1) the
> number of page faults is the same as when all sptes are write-protected.

That's true with the write protect everything approach we use now.  But
it's not true with range-based write protection, where you issue
GET_DIRTY_LOG on a range of pages and only need to re-write-protect them.

(the motivation for that is to decrease the time between GET_DIRTY_LOG
and sending the page; as the time increases, the chances that the page
got re-dirtied go up).

That doesn't mean O(1) is unusable for this, just that it requires more
thought.  Especially with direct maps, we can write-enable pages very
quickly.
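
Just to illustrate, a range-based call could look roughly like this (the
ioctl name and the flow below are invented, nothing like it exists yet):

      KVM_GET_DIRTY_LOG_RANGE(slot, first_gfn, nr_pages, user_bitmap):
            lock mmu-lock
            for (gfn = first_gfn; gfn < first_gfn + nr_pages; gfn++)
                    if (test-and-clear gfn's bit in slot->dirty_bitmap) {
                            set the matching bit in user_bitmap
                            write-protect the sptes on gfn's rmap
                    }
            unlock mmu-lock
            copy user_bitmap out to userspace

The re-write-protect cost is then proportional to the range, not to the whole
slot, and userspace can send exactly those pages right away.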

> Also, we can avoid holding mmu_lock to write-protect the PML4s: we can use a
> generation number and notify the mmu to update its page tables when dirty
> logging is enabled.

Generation numbers are also useful for O(1) invalidation.

>
> Anyway, no performance data, no truth. Let me implement it first.
>


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-17  7:51                               ` Avi Kivity
@ 2012-04-17 12:37                                 ` Takuya Yoshikawa
  -1 siblings, 0 replies; 92+ messages in thread
From: Takuya Yoshikawa @ 2012-04-17 12:37 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Xiao Guangrong, kvm-ppc, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On Tue, 17 Apr 2012 10:51:40 +0300
Avi Kivity <avi@redhat.com> wrote:

> That's true with the write protect everything approach we use now.  But
> it's not true with range-based write protection, where you issue
> GET_DIRTY_LOG on a range of pages and only need to re-write-protect them.
> 
> (the motivation for that is to decrease the time between GET_DIRTY_LOG
> and sending the page; as the time increases, the chances that the page
> got re-dirtied go up).

Thank you for explaining this.

I was planning to give userspace more freedom.

Since there are many known algorithms for predicting hot memory pages,
userspace would be able to tune the frequency of GET_DIRTY_LOG for such
parts so that they do not take too many faults repeatedly, if we can
restrict the range of pages to protect.

This is the fine-grained control.
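
For example, the migration thread could do something like this (purely
hypothetical, assuming some range-restricted variant of GET_DIRTY_LOG
existed):

      while migrating:
            for each range of guest memory:
                    if the range is predicted to be hot:
                            get its dirty log rarely and send it near the end
                    else:
                            get its dirty log frequently
                            send the pages reported dirty

The hot parts are then not write-protected again and again, so they do not
keep faulting.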

Thanks,
	Takuya

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-17 12:37                                 ` Takuya Yoshikawa
@ 2012-04-17 12:41                                   ` Avi Kivity
  -1 siblings, 0 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-17 12:41 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Xiao Guangrong, kvm-ppc, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On 04/17/2012 03:37 PM, Takuya Yoshikawa wrote:
> On Tue, 17 Apr 2012 10:51:40 +0300
> Avi Kivity <avi@redhat.com> wrote:
>
> > That's true with the write protect everything approach we use now.  But
> > it's not true with range-based write protection, where you issue
> > GET_DIRTY_LOG on a range of pages and only need to re-write-protect them.
> > 
> > (the motivation for that is to decrease the time between GET_DIRTY_LOG
> > and sending the page; as the time increases, the chances that the page
> > got re-dirtied go up).
>
> Thank you for explaining this.
>
> I was planning to give userspace more freedom.
>
> Since there are many known algorithms for predicting hot memory pages,
> userspace would be able to tune the frequency of GET_DIRTY_LOG for such
> parts so that they do not take too many faults repeatedly, if we can
> restrict the range of pages to protect.
>
> This is the fine-grained control.

Do you want per-page control, or just range-based?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-17 12:41                                   ` Avi Kivity
@ 2012-04-17 14:54                                     ` Takuya Yoshikawa
  -1 siblings, 0 replies; 92+ messages in thread
From: Takuya Yoshikawa @ 2012-04-17 14:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Xiao Guangrong, kvm-ppc, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On Tue, 17 Apr 2012 15:41:39 +0300
Avi Kivity <avi@redhat.com> wrote:

> > Since there are many known algorithms for predicting hot memory pages,
> > userspace would be able to tune the frequency of GET_DIRTY_LOG for such
> > parts so that they do not take too many faults repeatedly, if we can
> > restrict the range of pages to protect.
> >
> > This is the fine-grained control.
> 
> Do you want per-page control, or just range-based?

Difficult question.

For live migration, range-based control may be enough due to the locality
of the WWS.

Thanks,
	Takuya

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-17 14:54                                     ` Takuya Yoshikawa
@ 2012-04-17 14:56                                       ` Avi Kivity
  -1 siblings, 0 replies; 92+ messages in thread
From: Avi Kivity @ 2012-04-17 14:56 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Xiao Guangrong, kvm-ppc, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On 04/17/2012 05:54 PM, Takuya Yoshikawa wrote:
> On Tue, 17 Apr 2012 15:41:39 +0300
> Avi Kivity <avi@redhat.com> wrote:
>
> > > Since there are many known algorithms for predicting hot memory pages,
> > > userspace would be able to tune the frequency of GET_DIRTY_LOG for such
> > > parts so that they do not take too many faults repeatedly, if we can
> > > restrict the range of pages to protect.
> > >
> > > This is the fine-grained control.
> > 
> > Do you want per-page control, or just range-based?
>
> Difficult question.
>
> For live migration, range-based control may be enough due to the locality
> of the WWS.

What's WWS?

I don't see a reason to assume locality in gpa space (or in gva space
either, for some workloads).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 00/13] KVM: MMU: fast page fault
  2012-04-17 14:56                                       ` Avi Kivity
@ 2012-04-18 13:42                                         ` Takuya Yoshikawa
  -1 siblings, 0 replies; 92+ messages in thread
From: Takuya Yoshikawa @ 2012-04-18 13:42 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Xiao Guangrong, kvm-ppc, Marcelo Tosatti, Xiao Guangrong, LKML, KVM

On Tue, 17 Apr 2012 17:56:24 +0300
Avi Kivity <avi@redhat.com> wrote:

> > For live migration, range-based control may be enough due to the locality
> > of the WWS.
> 
> What's WWS?

IIRC it was mentioned in a USENIX paper: the Writable Working Set.

It may not be a commonly known concept; it is like the working set, but
restricted to the pages that are written often.

> I don't see a reason to assume locality in gpa space (or in gva space
> either, for some workloads).

Sorry, I may be wrong.

I once thought about logging dirty_bitmaps during live migration, but didn't.

Thanks,
	Takuya

^ permalink raw reply	[flat|nested] 92+ messages in thread

end of thread, other threads:[~2012-04-18 13:42 UTC | newest]

Thread overview: 92+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-29  9:20 [PATCH 00/13] KVM: MMU: fast page fault Xiao Guangrong
2012-03-29  9:20 ` [PATCH 01/13] KVM: MMU: properly assert spte on rmap_next path Xiao Guangrong
2012-03-29  9:21 ` [PATCH 02/13] KVM: MMU: abstract spte write-protect Xiao Guangrong
2012-03-29 11:11   ` Avi Kivity
2012-03-29 11:51     ` Xiao Guangrong
2012-03-29  9:22 ` [PATCH 03/13] KVM: MMU: split FNAME(prefetch_invalid_gpte) Xiao Guangrong
2012-03-29 13:00   ` Avi Kivity
2012-03-30  3:51     ` Xiao Guangrong
2012-03-29  9:22 ` [PATCH 04/13] KVM: MMU: introduce FNAME(get_sp_gpa) Xiao Guangrong
2012-03-29 13:07   ` Avi Kivity
2012-03-30  5:01     ` Xiao Guangrong
2012-04-01 12:42       ` Avi Kivity
2012-03-29  9:23 ` [PATCH 05/13] KVM: MMU: reset shadow_mmio_mask Xiao Guangrong
2012-03-29 13:10   ` Avi Kivity
2012-03-29 15:28     ` Avi Kivity
2012-03-29 16:24       ` Avi Kivity
2012-03-29  9:23 ` [PATCH 06/13] KVM: VMX: export PFEC.P bit on ept Xiao Guangrong
2012-03-29  9:24 ` [PATCH 07/13] KVM: MMU: store more bits in rmap Xiao Guangrong
2012-03-29  9:25 ` [PATCH 08/13] KVM: MMU: fask check whether page is writable Xiao Guangrong
2012-03-29 15:49   ` Avi Kivity
2012-03-30  5:10     ` Xiao Guangrong
2012-04-01 15:52   ` Avi Kivity
2012-04-05 17:54     ` Xiao Guangrong
2012-04-12 23:08       ` Marcelo Tosatti
2012-04-13 10:26         ` Xiao Guangrong
2012-03-29  9:25 ` [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock Xiao Guangrong
2012-04-01 15:53   ` Avi Kivity
2012-04-05 18:25     ` Xiao Guangrong
2012-04-09 12:28       ` Avi Kivity
2012-04-09 13:16         ` Takuya Yoshikawa
2012-04-09 13:21           ` Avi Kivity
2012-03-29  9:26 ` [PATCH 10/13] KVM: MMU: store vcpu id in spte to notify page write-protect path Xiao Guangrong
2012-03-29  9:27 ` [PATCH 11/13] KVM: MMU: fast path of handling guest page fault Xiao Guangrong
2012-03-31 12:24   ` Xiao Guangrong
2012-04-01 16:23   ` Avi Kivity
2012-04-03 13:04     ` Avi Kivity
2012-04-05 19:39     ` Xiao Guangrong
2012-03-29  9:27 ` [PATCH 12/13] KVM: MMU: trace fast " Xiao Guangrong
2012-03-29  9:28 ` [PATCH 13/13] KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint Xiao Guangrong
2012-03-29 10:18 ` [PATCH 00/13] KVM: MMU: fast page fault Avi Kivity
2012-03-29 11:40   ` Xiao Guangrong
2012-03-29 12:57     ` Avi Kivity
2012-03-30  9:18       ` Xiao Guangrong
2012-03-31 13:12         ` Xiao Guangrong
2012-04-01 12:58         ` Avi Kivity
2012-04-05 21:57           ` Xiao Guangrong
2012-04-06  5:24             ` Xiao Guangrong
2012-04-09 13:20               ` Avi Kivity
2012-04-09 13:59                 ` Xiao Guangrong
2012-04-09 13:12 ` Avi Kivity
2012-04-09 13:55   ` Xiao Guangrong
2012-04-09 14:01     ` Xiao Guangrong
2012-04-09 14:25     ` Avi Kivity
2012-04-09 17:58   ` Marcelo Tosatti
2012-04-09 18:13     ` Xiao Guangrong
2012-04-09 19:31       ` Marcelo Tosatti
2012-04-09 18:26     ` Xiao Guangrong
2012-04-09 19:46       ` Marcelo Tosatti
2012-04-10  3:06         ` Xiao Guangrong
2012-04-10 10:04         ` Avi Kivity
2012-04-11  1:47           ` Marcelo Tosatti
2012-04-11  9:15             ` Avi Kivity
2012-04-10 10:39         ` Avi Kivity
2012-04-10 11:40           ` Takuya Yoshikawa
2012-04-10 11:58             ` Xiao Guangrong
2012-04-11 12:15               ` Takuya Yoshikawa
2012-04-11 12:38                 ` Xiao Guangrong
2012-04-11 14:14                   ` Takuya Yoshikawa
2012-04-11 14:21                     ` Avi Kivity
2012-04-11 22:26                       ` Takuya Yoshikawa
2012-04-13 14:25                     ` Takuya Yoshikawa
2012-04-15  9:32                       ` Avi Kivity
2012-04-16 15:49                         ` Takuya Yoshikawa
2012-04-16 15:49                           ` Takuya Yoshikawa
2012-04-16 16:02                           ` Avi Kivity
2012-04-16 16:02                             ` Avi Kivity
2012-04-17  6:26                           ` Xiao Guangrong
2012-04-17  6:26                             ` Xiao Guangrong
2012-04-17  7:51                             ` Avi Kivity
2012-04-17  7:51                               ` Avi Kivity
2012-04-17 12:37                               ` Takuya Yoshikawa
2012-04-17 12:37                                 ` Takuya Yoshikawa
2012-04-17 12:41                                 ` Avi Kivity
2012-04-17 12:41                                   ` Avi Kivity
2012-04-17 14:54                                   ` Takuya Yoshikawa
2012-04-17 14:54                                     ` Takuya Yoshikawa
2012-04-17 14:56                                     ` Avi Kivity
2012-04-17 14:56                                       ` Avi Kivity
2012-04-18 13:42                                       ` Takuya Yoshikawa
2012-04-18 13:42                                         ` Takuya Yoshikawa
2012-04-17  6:16                         ` Xiao Guangrong
2012-04-10 10:10       ` Avi Kivity
