* [PATCH v3 00/15] KVM: MMU: locklessly write-protect
@ 2013-10-23 13:29 Xiao Guangrong
  2013-10-23 13:29 ` [PATCH v3 01/15] KVM: MMU: properly check last spte in fast_page_fault() Xiao Guangrong
                   ` (15 more replies)
  0 siblings, 16 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Changelog v3:
- the changes from Gleb's review:
  1) drop the patch that fixed the count of sptes in the rmap since it
     cannot be fixed easily and the problem is gone after applying this patchset

- ideas from Gleb and the discussion with Marcelo are also very much appreciated:
  2) change the way shadow pages are locklessly accessed - use SLAB_DESTROY_BY_RCU
     to protect shadow pages instead of conditionally using call_rcu()
  3) improve is_last_spte() so that it checks for a last spte using only bits in
     the spte itself; it can then be used safely when we locklessly write-protect
     the shadow page table

Changelog v2:

- the changes from Gleb's review:
  1) fix calculating the number of sptes in pte_list_add()
  2) set iter->desc to NULL when a nulls desc is met, to clean up the code of
     rmap_get_next()
  3) fix hlist corruption due to accessing sp->hlist outside of mmu-lock
  4) use rcu functions to access the rcu-protected pointer
  5) a spte is missed by the lockless walker if it is moved within a desc
     (when a spte is removed from a rmap that uses only one desc). Fix it by
     walking each desc from the highest index downward

- the changes from Paolo's review
  1) make the ordering and memory barriers between updating the spte / adding
     the spte into the rmap and setting the dirty bitmap clearer
  
- the changes from Marcelo's review:
  1) let fast page fault only fix the sptes on the last level (level = 1)
  2) improve some changelogs and comments

- the changes from Takuya's review:
  move the patch "flush tlb if the spte can be locklessly modified" forward
  to make it easier to merge

Thank you all very much for your time and patience with this patchset!
  
Since we use rcu_assign_pointer() to update the pointers in the desc even if
dirty logging is disabled, I have measured the performance:
Host: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz * 12 + 36G memory

- migrate-perf (benchmark the time of get-dirty-log)
  before: Run 10 times, Avg time:9009483 ns.
  after: Run 10 times, Avg time:4807343 ns.

- kernbench
  Guest: 12 VCPUs + 8G memory
  before:
EPT is enabled:
# cat 09-05-origin-ept | grep real       
real 85.58
real 83.47
real 82.95

EPT is disabled:
# cat 09-05-origin-shadow | grep real
real 138.77
real 138.99
real 139.55

  after:
EPT is enabled:
# cat 09-05-lockless-ept | grep real
real 83.40
real 82.81
real 83.39

EPT is disabled:
# cat 09-05-lockless-shadow | grep real
real 138.91
real 139.71
real 138.94

No performance regression!



Background
==========
Currently, when we enable dirty logging on a memslot or get its dirty pages, we
need to write-protect a large amount of guest memory. That is heavy work, and it
must be done under mmu-lock, which is also required by vcpus to fix their
page-table faults and by the mmu-notifier when a host page is being changed.
For guests with many cpus and a lot of memory, this becomes a scalability issue.

This patchset introduces a way to locklessly write-protect guest memory.

Idea
==========
These are the challenges we met and the ideas used to resolve them.

1) How do we locklessly walk the rmap?
The first idea to prevent a "desc" from being freed while we are walking the
rmap was to use RCU, but when a vcpu runs in shadow-page mode or nested-mmu
mode it updates the rmap very frequently.

So we use SLAB_DESTROY_BY_RCU to manage the "desc" objects instead, which
allows them to be reused more quickly. We also store a "nulls" value in the
last "desc" (in desc->more); it lets us detect that a "desc" has been moved to
another rmap, in which case we re-walk the rmap. I learned this idea from the
nulls-list.

Another issue: when a spte is deleted from a "desc", another spte from the
last "desc" used to be moved into that position to replace the deleted one.
If the deleted one had already been visited and we never visit the
replacement, the replacement is missed by the lockless walk.
To fix this, we no longer move a spte backward; we move it forward instead:
when a spte is deleted, the entry in the first desc is moved into that
position.
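
For reference, the walker added in patch 08 boils down to the following (the
caller holds rcu_read_lock(); this mirrors pte_list_walk_lockless() and uses
only helpers introduced by this series):

static void pte_list_walk_lockless_sketch(unsigned long *pte_list,
					  pte_list_walk_fn fn)
{
	struct pte_list_desc *desc;
	unsigned long value;
	int i;

restart:
	barrier();			/* force *pte_list to be reloaded */
	value = *rcu_dereference(pte_list);
	if (!value)
		return;
	if (!(value & 1))
		return fn((u64 *)value);	/* a single spte, no desc */

	desc = (struct pte_list_desc *)(value & ~1ul);
	while (!desc_is_a_nulls(desc)) {
		/*
		 * Walk each desc from the highest index down: a deleted
		 * entry is refilled from the first desc, so a moved spte
		 * may be visited twice but is never skipped.
		 */
		for (i = PTE_LIST_EXT - 1; i >= 0; i--)
			if (desc->sptes[i])
				fn(desc->sptes[i]);

		desc = rcu_dereference(desc->more);
		if (unlikely(!desc))
			goto restart;	/* the desc is being initialized */
	}

	/* the nulls does not match: the desc was reused on another rmap */
	if (unlikely(desc_get_nulls_value(desc) != pte_list))
		goto restart;
}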

2) How do we locklessly access the shadow page table?
It is easy when the handler runs in vcpu context: there we can use
walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
disable interrupts to keep shadow pages from being freed. But here we are in
ioctl context and the paths being optimized have a heavy workload, so
disabling interrupts is not good for system performance.

In earlier versions we added an indicator to the kvm struct
(kvm->arch.rcu_free_shadow_page) and conditionally used call_rcu() to free
shadow pages when it was set. In v3 we instead allocate shadow pages from a
slab created with SLAB_DESTROY_BY_RCU (patches 10 and 11), so the lockless
walker only needs to hold the rcu read lock: a shadow page may be reused as
another shadow page while the walker runs, but its memory is never returned
to the page allocator.
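
A minimal sketch of the resulting pattern (the cache creation is exactly what
patches 10 and 11 do; the write-protect callback name below is only
illustrative, the real one appears later in the series):

/* shadow pages come from a SLAB_DESTROY_BY_RCU cache */
mmu_shadow_page_cache = kmem_cache_create("mmu_shadow_page_cache",
					  PAGE_SIZE, PAGE_SIZE,
					  SLAB_DESTROY_BY_RCU,
					  init_shadow_page_table);

/*
 * A lockless user then only needs the rcu read lock: the sptes it
 * dereferences live in pages that cannot be returned to the page
 * allocator while the read-side critical section runs (they may be
 * reused as another shadow page table, which the walker tolerates).
 */
rcu_read_lock();
pte_list_walk_lockless(rmapp, write_protect_spte_lockless); /* illustrative callback */
rcu_read_unlock();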

3) How do we locklessly write-protect guest memory?
Currently there are two behaviors when we write-protect guest memory: one is
clearing the Writable bit of the spte, the other is dropping the spte when it
points to a large page. The former is easy, we only need to atomically clear a
bit, but the latter is hard since we need to remove the spte from the rmap.
So we unify the two behaviors and only make the spte readonly. Making a large
spte readonly instead of nonpresent is also good for reducing jitter.

We also need to pay attention to the ordering of making the spte writable,
adding the spte into the rmap and setting the corresponding bit in the dirty
bitmap: since kvm_vm_ioctl_get_dirty_log() write-protects sptes based on the
dirty bitmap, we must ensure the writable spte can be found in the rmap before
the dirty bitmap becomes visible. Otherwise we would clear the dirty bitmap
but fail to write-protect the page.
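
Condensed from set_spte() (patch 05) and kvm_vm_ioctl_get_dirty_log()
(patches 04 and 05), the two sides look roughly like this:

/* producer: set_spte() */
mmu_spte_update(sptep, spte);		/* 1) make the spte writable      */
rmap_add(vcpu, sptep, gfn);		/* 2) make it visible in the rmap */
smp_wmb();				/* order 1) and 2) before 3)      */
if (pte_access & ACC_WRITE_MASK)
	mark_page_dirty(vcpu->kvm, gfn);	/* 3) publish the dirty bit */

/* consumer: kvm_vm_ioctl_get_dirty_log() */
mask = xchg(&dirty_bitmap[i], 0);	/* full barrier, implies smp_rmb() */
kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
/* ... then, after dropping mmu-lock: */
kvm_flush_remote_tlbs(kvm);		/* can now be done out of mmu-lock */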

Performance result
====================
The performance result and the benchmark can be found at:
  http://permalink.gmane.org/gmane.linux.kernel/1534876

Xiao Guangrong (15):
  KVM: MMU: properly check last spte in fast_page_fault()
  KVM: MMU: lazily drop large spte
  KVM: MMU: flush tlb if the spte can be locklessly modified
  KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
  KVM: MMU: update spte and add it into rmap before dirty log
  KVM: MMU: redesign the algorithm of pte_list
  KVM: MMU: introduce nulls desc
  KVM: MMU: introduce pte-list lockless walker
  KVM: MMU: initialize the pointers in pte_list_desc properly
  KVM: MMU: allocate shadow pages from slab
  KVM: MMU: locklessly access shadow page under rcu protection
  KVM: MMU: check last spte with unawareness of mapping level
  KVM: MMU: locklessly write-protect the page
  KVM: MMU: clean up spte_write_protect
  KVM: MMU: use rcu functions to access the pointer

 arch/x86/include/asm/kvm_host.h |   7 +-
 arch/x86/kvm/mmu.c              | 586 ++++++++++++++++++++++++++++++----------
 arch/x86/kvm/mmu.h              |   6 +
 arch/x86/kvm/mmu_audit.c        |   6 +-
 arch/x86/kvm/paging_tmpl.h      |   6 +-
 arch/x86/kvm/x86.c              |  34 ++-
 6 files changed, 475 insertions(+), 170 deletions(-)

-- 
1.8.1.4


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v3 01/15] KVM: MMU: properly check last spte in fast_page_fault()
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-11-12  0:25   ` Marcelo Tosatti
  2013-10-23 13:29 ` [PATCH v3 02/15] KVM: MMU: lazily drop large spte Xiao Guangrong
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Use sp->role.level instead of @level since @level is not obtained from the
page-table hierarchy.

There is no issue in the current code since fast page fault currently only
fixes faults caused by dirty logging, which are always on the last level
(level = 1).

This patch makes the code more readable and avoids a potential issue in
further development.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 40772ef..d2aacc2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2798,9 +2798,9 @@ static bool page_fault_can_be_fast(u32 error_code)
 }
 
 static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
+fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+			u64 *sptep, u64 spte)
 {
-	struct kvm_mmu_page *sp = page_header(__pa(sptep));
 	gfn_t gfn;
 
 	WARN_ON(!sp->role.direct);
@@ -2826,6 +2826,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 			    u32 error_code)
 {
 	struct kvm_shadow_walk_iterator iterator;
+	struct kvm_mmu_page *sp;
 	bool ret = false;
 	u64 spte = 0ull;
 
@@ -2846,7 +2847,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 		goto exit;
 	}
 
-	if (!is_last_spte(spte, level))
+	sp = page_header(__pa(iterator.sptep));
+	if (!is_last_spte(spte, sp->role.level))
 		goto exit;
 
 	/*
@@ -2872,7 +2874,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	 * the gfn is not stable for indirect shadow page.
 	 * See Documentation/virtual/kvm/locking.txt to get more detail.
 	 */
-	ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte);
+	ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
 exit:
 	trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
 			      spte, ret);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 02/15] KVM: MMU: lazily drop large spte
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
  2013-10-23 13:29 ` [PATCH v3 01/15] KVM: MMU: properly check last spte in fast_page_fault() Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-11-12 22:44   ` Marcelo Tosatti
  2013-10-23 13:29 ` [PATCH v3 03/15] KVM: MMU: flush tlb if the spte can be locklessly modified Xiao Guangrong
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Currently, kvm zaps the large spte if write protection is needed, so a later
read can fault on that spte. Actually, we can make the large spte readonly
instead of making it non-present, and then the page fault caused by the read
access can be avoided.

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter.  This removes the need for the return value.

This version fixes the issue reported in 6b73a9606: fast_page_fault() directly
set the readonly large spte to writable but only marked the first page in the
dirty bitmap, which means the other pages were missed. Fix it by allowing only
normal sptes (on the PT_PAGE_TABLE_LEVEL level) to be fast-fixed.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 36 ++++++++++++++++++++----------------
 arch/x86/kvm/x86.c |  8 ++++++--
 2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d2aacc2..8739208 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1176,8 +1176,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
 
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
- * @flush indicates whether tlb need be flushed.
+ * spte write-protection is caused by protecting shadow page table.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1186,10 +1185,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
  * - for spte protection, the spte can be writable only after unsync-ing
  *   shadow page.
  *
- * Return true if the spte is dropped.
+ * Return true if tlb need be flushed.
  */
-static bool
-spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
 {
 	u64 spte = *sptep;
 
@@ -1199,17 +1197,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
 
 	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
-	if (__drop_large_spte(kvm, sptep)) {
-		*flush |= true;
-		return true;
-	}
-
 	if (pt_protect)
 		spte &= ~SPTE_MMU_WRITEABLE;
 	spte = spte & ~PT_WRITABLE_MASK;
 
-	*flush |= mmu_spte_update(sptep, spte);
-	return false;
+	return mmu_spte_update(sptep, spte);
 }
 
 static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1221,11 +1213,8 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
 
 	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
 		BUG_ON(!(*sptep & PT_PRESENT_MASK));
-		if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
-			sptep = rmap_get_first(*rmapp, &iter);
-			continue;
-		}
 
+		flush |= spte_write_protect(kvm, sptep, pt_protect);
 		sptep = rmap_get_next(&iter);
 	}
 
@@ -2669,6 +2658,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			break;
 		}
 
+		drop_large_spte(vcpu, iterator.sptep);
+
 		if (!is_shadow_present_pte(*iterator.sptep)) {
 			u64 base_addr = iterator.addr;
 
@@ -2870,6 +2861,19 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 		goto exit;
 
 	/*
+	 * Do not fix write-permission on the large spte since we only dirty
+	 * the first page into the dirty-bitmap in fast_pf_fix_direct_spte()
+	 * that means other pages are missed if its slot is dirty-logged.
+	 *
+	 * Instead, we let the slow page fault path create a normal spte to
+	 * fix the access.
+	 *
+	 * See the comments in kvm_arch_commit_memory_region().
+	 */
+	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+		goto exit;
+
+	/*
 	 * Currently, fast page fault only works for direct mapping since
 	 * the gfn is not stable for indirect shadow page.
 	 * See Documentation/virtual/kvm/locking.txt to get more detail.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index edf2a07..b3aa650 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7223,8 +7223,12 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 		kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
 	/*
 	 * Write protect all pages for dirty logging.
-	 * Existing largepage mappings are destroyed here and new ones will
-	 * not be created until the end of the logging.
+	 *
+	 * All the sptes including the large sptes which point to this
+	 * slot are set to readonly. We can not create any new large
+	 * spte on this slot until the end of the logging.
+	 *
+	 * See the comments in fast_page_fault().
 	 */
 	if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
 		kvm_mmu_slot_remove_write_access(kvm, mem->slot);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 03/15] KVM: MMU: flush tlb if the spte can be locklessly modified
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
  2013-10-23 13:29 ` [PATCH v3 01/15] KVM: MMU: properly check last spte in fast_page_fault() Xiao Guangrong
  2013-10-23 13:29 ` [PATCH v3 02/15] KVM: MMU: lazily drop large spte Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-11-13  0:10   ` Marcelo Tosatti
  2013-10-23 13:29 ` [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes Xiao Guangrong
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Relax the tlb flush condition since we will write-protect the spte out of mmu
lock. Note that lockless write-protection only marks a writable spte as
readonly, and the spte can be writable only if both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are set (which is what spte_is_locklessly_modifiable()
tests).

This patch is used to avoid this kind of race:

      VCPU 0                         VCPU 1
lockless write protection:
      set spte.w = 0
                                 lock mmu-lock

                                 write-protects the spte to sync the shadow page,
                                 sees spte.w = 0, so it does not flush the tlb

				 unlock mmu-lock

                                 !!! At this point, the shadow page can still be
                                     writable due to the corrupt tlb entry
     Flush all TLB

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8739208..62f18ec 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -595,7 +595,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	 * we always atomicly update it, see the comments in
 	 * spte_has_volatile_bits().
 	 */
-	if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
+	if (spte_is_locklessly_modifiable(old_spte) &&
+	      !is_writable_pte(new_spte))
 		ret = true;
 
 	if (!shadow_accessed_mask)
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (2 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 03/15] KVM: MMU: flush tlb if the spte can be locklessly modified Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-11-14  0:36   ` Marcelo Tosatti
  2013-10-23 13:29 ` [PATCH v3 05/15] KVM: MMU: update spte and add it into rmap before dirty log Xiao Guangrong
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Now we can flush all the TLBs out of the mmu lock without TLB corruption when
we write-protect the sptes, because:
- we have marked large sptes readonly instead of dropping them, which means we
  only change the spte from writable to readonly, so the only case we need to
  care about is changing a spte from present to present (changing a spte from
  present to nonpresent flushes all the TLBs immediately), in other words,
  the only case we need to care about is mmu_spte_update()

- in mmu_spte_update(), we have checked
  SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, which
  means it no longer depends on PT_WRITABLE_MASK

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 18 ++++++++++++++----
 arch/x86/kvm/x86.c |  9 +++++++--
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 62f18ec..337d173 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4273,15 +4273,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 			if (*rmapp)
 				__rmap_write_protect(kvm, rmapp, false);
 
-			if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
-				kvm_flush_remote_tlbs(kvm);
+			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
 				cond_resched_lock(&kvm->mmu_lock);
-			}
 		}
 	}
 
-	kvm_flush_remote_tlbs(kvm);
 	spin_unlock(&kvm->mmu_lock);
+
+	/*
+	 * We can flush all the TLBs out of the mmu lock without TLB
+	 * corruption since we just change the spte from writable to
+	 * readonly so that we only need to care about the case of changing
+	 * spte from present to present (changing the spte from present
+	 * to nonpresent will flush all the TLBs immediately), in other
+	 * words, the only case we care about is mmu_spte_update() where we
+	 * have checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
+	 * instead of PT_WRITABLE_MASK, that means it does not depend
+	 * on PT_WRITABLE_MASK anymore.
+	 */
+	kvm_flush_remote_tlbs(kvm);
 }
 
 #define BATCH_ZAP_PAGES	10
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b3aa650..573c6b3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3571,11 +3571,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 		offset = i * BITS_PER_LONG;
 		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
 	}
-	if (is_dirty)
-		kvm_flush_remote_tlbs(kvm);
 
 	spin_unlock(&kvm->mmu_lock);
 
+	/*
+	 * All the TLBs can be flushed out of mmu lock, see the comments in
+	 * kvm_mmu_slot_remove_write_access().
+	 */
+	if (is_dirty)
+		kvm_flush_remote_tlbs(kvm);
+
 	r = -EFAULT;
 	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
 		goto out;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 05/15] KVM: MMU: update spte and add it into rmap before dirty log
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (3 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-11-15  0:08   ` Marcelo Tosatti
  2013-10-23 13:29 ` [PATCH v3 06/15] KVM: MMU: redesign the algorithm of pte_list Xiao Guangrong
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

kvm_vm_ioctl_get_dirty_log() write-protects sptes based on the dirty bitmap,
so we should ensure the writable spte can be found in the rmap before the
dirty bitmap is visible. Otherwise, we clear the dirty bitmap but fail to
write-protect the page, which is detailed in the comments added by this patch.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 84 ++++++++++++++++++++++++++++++++++++++++++------------
 arch/x86/kvm/x86.c | 10 +++++++
 2 files changed, 76 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 337d173..e85eed6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2427,6 +2427,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 {
 	u64 spte;
 	int ret = 0;
+	bool remap = is_rmap_spte(*sptep);
 
 	if (set_mmio_spte(vcpu->kvm, sptep, gfn, pfn, pte_access))
 		return 0;
@@ -2488,12 +2489,73 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		}
 	}
 
-	if (pte_access & ACC_WRITE_MASK)
-		mark_page_dirty(vcpu->kvm, gfn);
-
 set_pte:
 	if (mmu_spte_update(sptep, spte))
 		kvm_flush_remote_tlbs(vcpu->kvm);
+
+	if (!remap) {
+		if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
+			rmap_recycle(vcpu, sptep, gfn);
+
+		if (level > PT_PAGE_TABLE_LEVEL)
+			++vcpu->kvm->stat.lpages;
+	}
+
+	/*
+	 * The orders we require are:
+	 * 1) set the spte to writable __before__ setting the dirty bitmap.
+	 *    It makes sure that dirty-logging is not missed when we do
+	 *    live migration at the final step where kvm should stop
+	 *    the guest and push the remaining dirty pages got from
+	 *    dirty-bitmap to the destination. The similar cases are
+	 *    in fast_pf_fix_direct_spte() and kvm_write_guest_page().
+	 *
+	 * 2) add the spte into the rmap __before__ setting the dirty bitmap.
+	 *
+	 * They can ensure we can find the writable spte on the rmap
+	 * when we do lockless write-protection since
+	 * kvm_vm_ioctl_get_dirty_log() write-protects the pages based
+	 * on its dirty-bitmap, otherwise these cases will happen:
+	 *
+	 *      CPU 0                         CPU 1
+	 *                              kvm ioctl doing get-dirty-pages
+	 * mark_page_dirty(gfn) which
+	 * set the gfn on the dirty maps
+	 *                              mask = xchg(dirty_bitmap, 0)
+	 *
+	 *                              try to write-protect gfns which
+	 *                              are set on "mask", then walk the
+	 *                              rmap, see no spte on that rmap
+	 * add the spte into rmap
+	 *
+	 * !!!!!! Then the page can be freely written but not recorded in
+	 * the dirty bitmap.
+	 *
+	 * And:
+	 *
+	 *      VCPU 0                        CPU 1
+	 *                                kvm ioctl doing get-dirty-pages
+	 * mark_page_dirty(gfn) which
+	 * set the gfn on the dirty maps
+	 *
+	 * add spte into rmap
+	 *                                mask = xchg(dirty_bitmap, 0)
+	 *
+	 *                                try to write-protect gfns which
+	 *                                are set on "mask", then walk the
+	 *                                rmap, see the spte is in the rmap
+	 *                                but it is readonly or nonpresent
+	 * Mark spte writable
+	 *
+	 * !!!!!! Then the page can be freely written but not recorded in the
+	 * dirty bitmap.
+	 *
+	 * See the comments in kvm_vm_ioctl_get_dirty_log().
+	 */
+	smp_wmb();
+
+	if (pte_access & ACC_WRITE_MASK)
+		mark_page_dirty(vcpu->kvm, gfn);
 done:
 	return ret;
 }
@@ -2503,9 +2565,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 int level, gfn_t gfn, pfn_t pfn, bool speculative,
 			 bool host_writable)
 {
-	int was_rmapped = 0;
-	int rmap_count;
-
 	pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
 		 *sptep, write_fault, gfn);
 
@@ -2527,8 +2586,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 				 spte_to_pfn(*sptep), pfn);
 			drop_spte(vcpu->kvm, sptep);
 			kvm_flush_remote_tlbs(vcpu->kvm);
-		} else
-			was_rmapped = 1;
+		}
 	}
 
 	if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative,
@@ -2546,16 +2604,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		 is_large_pte(*sptep)? "2MB" : "4kB",
 		 *sptep & PT_PRESENT_MASK ?"RW":"R", gfn,
 		 *sptep, sptep);
-	if (!was_rmapped && is_large_pte(*sptep))
-		++vcpu->kvm->stat.lpages;
-
-	if (is_shadow_present_pte(*sptep)) {
-		if (!was_rmapped) {
-			rmap_count = rmap_add(vcpu, sptep, gfn);
-			if (rmap_count > RMAP_RECYCLE_THRESHOLD)
-				rmap_recycle(vcpu, sptep, gfn);
-		}
-	}
 
 	kvm_release_pfn_clean(pfn);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 573c6b3..4ac3a27 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3566,6 +3566,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 		is_dirty = true;
 
 		mask = xchg(&dirty_bitmap[i], 0);
+		/*
+		 * smp_rmb();
+		 *
+		 * The read-barrier is implied in xchg() acting as a
+		 * full barrier; it ensures that fetching the dirty bitmap
+		 * happens before walking the rmap and write-protecting sptes.
+		 *
+		 * See the comments in set_spte().
+		 */
+
 		dirty_bitmap_buffer[i] = mask;
 
 		offset = i * BITS_PER_LONG;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 06/15] KVM: MMU: redesign the algorithm of pte_list
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (4 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 05/15] KVM: MMU: update spte and add it into rmap before dirty log Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-11-19  0:48   ` Marcelo Tosatti
  2013-10-23 13:29 ` [PATCH v3 07/15] KVM: MMU: introduce nulls desc Xiao Guangrong
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Change the algorithm to:
1) always add the new desc as the first desc (pointed to by parent_ptes/rmap);
   that is good for implementing rcu-nulls-list-like lockless rmap
   walking

2) when a spte is deleted from the parent_ptes/rmap, always refill its
   position with an entry from the first desc (forward-move).
   This is good for implementing the lockless rmap walk since, in the current
   code, when a spte is deleted from a "desc", another spte in the last
   "desc" is moved to this position to replace the deleted one. If the
   deleted one has been visited and we do not visit the replacement, the
   replacement is missed when we do the lockless walk.
   To fix this case, we do not move the spte backward; instead, we move it
   forward: when a spte is deleted, we move the entry in the first desc to
   that position

Both of these changes can also reduce cache misses

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 179 ++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 123 insertions(+), 56 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e85eed6..5cce039 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,50 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
 	return level - 1;
 }
 
+static int __find_first_free(struct pte_list_desc *desc)
+{
+	int i;
+
+	for (i = 0; i < PTE_LIST_EXT; i++)
+		if (!desc->sptes[i])
+			break;
+	return i;
+}
+
+static int find_first_free(struct pte_list_desc *desc)
+{
+	int free = __find_first_free(desc);
+
+	WARN_ON(free >= PTE_LIST_EXT);
+	return free;
+}
+
+static int find_last_used(struct pte_list_desc *desc)
+{
+	int used = __find_first_free(desc) - 1;
+
+	WARN_ON(used < 0 || used >= PTE_LIST_EXT);
+	return used;
+}
+
+/*
+ * TODO: we can encode the desc number into the rmap/parent_ptes
+ * since at least 10 physical/virtual address bits are reserved
+ * on x86. It is worthwhile if it shows that the desc walking is
+ * a performance issue.
+ */
+static int count_spte_number(struct pte_list_desc *desc)
+{
+	int first_free, desc_num;
+
+	first_free = __find_first_free(desc);
+
+	for (desc_num = 0; desc->more; desc = desc->more)
+		desc_num++;
+
+	return first_free + desc_num * PTE_LIST_EXT;
+}
+
 /*
  * Pte mapping structures:
  *
@@ -923,98 +967,121 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
  *
  * Returns the number of pte entries before the spte was added or zero if
  * the spte was not added.
- *
  */
 static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 			unsigned long *pte_list)
 {
 	struct pte_list_desc *desc;
-	int i, count = 0;
+	int free_pos;
 
 	if (!*pte_list) {
 		rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
 		*pte_list = (unsigned long)spte;
-	} else if (!(*pte_list & 1)) {
+		return 0;
+	}
+
+	if (!(*pte_list & 1)) {
 		rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
 		desc = mmu_alloc_pte_list_desc(vcpu);
 		desc->sptes[0] = (u64 *)*pte_list;
 		desc->sptes[1] = spte;
 		*pte_list = (unsigned long)desc | 1;
-		++count;
-	} else {
-		rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
-		desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-		while (desc->sptes[PTE_LIST_EXT-1] && desc->more) {
-			desc = desc->more;
-			count += PTE_LIST_EXT;
-		}
-		if (desc->sptes[PTE_LIST_EXT-1]) {
-			desc->more = mmu_alloc_pte_list_desc(vcpu);
-			desc = desc->more;
-		}
-		for (i = 0; desc->sptes[i]; ++i)
-			++count;
-		desc->sptes[i] = spte;
+		return 1;
 	}
-	return count;
+
+	rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
+	desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+
+	/* No empty entry in the desc. */
+	if (desc->sptes[PTE_LIST_EXT - 1]) {
+		struct pte_list_desc *new_desc;
+		new_desc = mmu_alloc_pte_list_desc(vcpu);
+		new_desc->more = desc;
+		desc = new_desc;
+		*pte_list = (unsigned long)desc | 1;
+	}
+
+	free_pos = find_first_free(desc);
+	desc->sptes[free_pos] = spte;
+	return count_spte_number(desc) - 1;
 }
 
 static void
-pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
-			   int i, struct pte_list_desc *prev_desc)
+pte_list_desc_remove_entry(unsigned long *pte_list,
+			   struct pte_list_desc *desc, int i)
 {
-	int j;
+	struct pte_list_desc *first_desc;
+	int last_used;
 
-	for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
-		;
-	desc->sptes[i] = desc->sptes[j];
-	desc->sptes[j] = NULL;
-	if (j != 0)
+	first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+	last_used = find_last_used(first_desc);
+
+	/*
+	 * Move the entry from the first desc to this position we want
+	 * to remove.
+	 */
+	desc->sptes[i] = first_desc->sptes[last_used];
+	first_desc->sptes[last_used] = NULL;
+
+	/* No valid entry in this desc, we can free this desc now. */
+	if (!first_desc->sptes[0]) {
+		struct pte_list_desc *next_desc = first_desc->more;
+
+		/*
+		 * Only one entry existing but still use a desc to store it?
+		 */
+		WARN_ON(!next_desc);
+
+		mmu_free_pte_list_desc(first_desc);
+		*pte_list = (unsigned long)next_desc | 1ul;
 		return;
-	if (!prev_desc && !desc->more)
-		*pte_list = (unsigned long)desc->sptes[0];
-	else
-		if (prev_desc)
-			prev_desc->more = desc->more;
-		else
-			*pte_list = (unsigned long)desc->more | 1;
-	mmu_free_pte_list_desc(desc);
+	}
+
+	/*
+	 * Only one entry in this desc, move the entry to the head
+	 * then the desc can be freed.
+	 */
+	if (!first_desc->sptes[1] && !first_desc->more) {
+		*pte_list = (unsigned long)first_desc->sptes[0];
+		mmu_free_pte_list_desc(first_desc);
+	}
 }
 
 static void pte_list_remove(u64 *spte, unsigned long *pte_list)
 {
 	struct pte_list_desc *desc;
-	struct pte_list_desc *prev_desc;
 	int i;
 
 	if (!*pte_list) {
-		printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
+		pr_err("pte_list_remove: %p 0->BUG\n", spte);
 		BUG();
-	} else if (!(*pte_list & 1)) {
+		return;
+	}
+
+	if (!(*pte_list & 1)) {
 		rmap_printk("pte_list_remove:  %p 1->0\n", spte);
 		if ((u64 *)*pte_list != spte) {
-			printk(KERN_ERR "pte_list_remove:  %p 1->BUG\n", spte);
+			pr_err("pte_list_remove:  %p 1->BUG\n", spte);
 			BUG();
 		}
 		*pte_list = 0;
-	} else {
-		rmap_printk("pte_list_remove:  %p many->many\n", spte);
-		desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-		prev_desc = NULL;
-		while (desc) {
-			for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
-				if (desc->sptes[i] == spte) {
-					pte_list_desc_remove_entry(pte_list,
-							       desc, i,
-							       prev_desc);
-					return;
-				}
-			prev_desc = desc;
-			desc = desc->more;
-		}
-		pr_err("pte_list_remove: %p many->many\n", spte);
-		BUG();
+		return;
 	}
+
+	rmap_printk("pte_list_remove:  %p many->many\n", spte);
+	desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+	while (desc) {
+		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
+			if (desc->sptes[i] == spte) {
+				pte_list_desc_remove_entry(pte_list,
+							       desc, i);
+				return;
+			}
+		desc = desc->more;
+	}
+
+	pr_err("pte_list_remove: %p many->many\n", spte);
+	BUG();
 }
 
 typedef void (*pte_list_walk_fn) (u64 *spte);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (5 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 06/15] KVM: MMU: redesign the algorithm of pte_list Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-11-22 19:14   ` Marcelo Tosatti
  2013-10-23 13:29 ` [PATCH v3 08/15] KVM: MMU: introduce pte-list lockless walker Xiao Guangrong
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

This is like a nulls list: we use the pte-list head as the nulls value, which
helps us detect whether a "desc" has been moved to another rmap, in which
case we re-walk the rmap.

kvm->slots_lock is held while we do the lockless walk, which prevents the rmap
from being reused (freeing an rmap requires holding that lock), so we cannot
see the same nulls value used on different rmaps.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5cce039..4687329 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
 	return level - 1;
 }
 
+static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
+{
+	unsigned long marker;
+
+	marker = (unsigned long)pte_list | 1UL;
+	desc->more = (struct pte_list_desc *)marker;
+}
+
+static bool desc_is_a_nulls(struct pte_list_desc *desc)
+{
+	return (unsigned long)desc & 1;
+}
+
+static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
+{
+	return (unsigned long *)((unsigned long)desc & ~1);
+}
+
 static int __find_first_free(struct pte_list_desc *desc)
 {
 	int i;
@@ -951,7 +969,7 @@ static int count_spte_number(struct pte_list_desc *desc)
 
 	first_free = __find_first_free(desc);
 
-	for (desc_num = 0; desc->more; desc = desc->more)
+	for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more)
 		desc_num++;
 
 	return first_free + desc_num * PTE_LIST_EXT;
@@ -985,6 +1003,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		desc = mmu_alloc_pte_list_desc(vcpu);
 		desc->sptes[0] = (u64 *)*pte_list;
 		desc->sptes[1] = spte;
+		desc_mark_nulls(pte_list, desc);
 		*pte_list = (unsigned long)desc | 1;
 		return 1;
 	}
@@ -1030,7 +1049,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 		/*
 		 * Only one entry existing but still use a desc to store it?
 		 */
-		WARN_ON(!next_desc);
+		WARN_ON(desc_is_a_nulls(next_desc));
 
 		mmu_free_pte_list_desc(first_desc);
 		*pte_list = (unsigned long)next_desc | 1ul;
@@ -1041,7 +1060,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 	 * Only one entry in this desc, move the entry to the head
 	 * then the desc can be freed.
 	 */
-	if (!first_desc->sptes[1] && !first_desc->more) {
+	if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
 		*pte_list = (unsigned long)first_desc->sptes[0];
 		mmu_free_pte_list_desc(first_desc);
 	}
@@ -1070,7 +1089,7 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
 
 	rmap_printk("pte_list_remove:  %p many->many\n", spte);
 	desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-	while (desc) {
+	while (!desc_is_a_nulls(desc)) {
 		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
 			if (desc->sptes[i] == spte) {
 				pte_list_desc_remove_entry(pte_list,
@@ -1097,11 +1116,13 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
 		return fn((u64 *)*pte_list);
 
 	desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-	while (desc) {
+	while (!desc_is_a_nulls(desc)) {
 		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
 			fn(desc->sptes[i]);
 		desc = desc->more;
 	}
+
+	WARN_ON(desc_get_nulls_value(desc) != pte_list);
 }
 
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
@@ -1184,6 +1205,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
 
 	iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
 	iter->pos = 0;
+	WARN_ON(desc_is_a_nulls(iter->desc));
 	return iter->desc->sptes[iter->pos];
 }
 
@@ -1204,7 +1226,8 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
 				return sptep;
 		}
 
-		iter->desc = iter->desc->more;
+		iter->desc = desc_is_a_nulls(iter->desc->more) ?
+				NULL : iter->desc->more;
 
 		if (iter->desc) {
 			iter->pos = 0;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 08/15] KVM: MMU: introduce pte-list lockless walker
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (6 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 07/15] KVM: MMU: introduce nulls desc Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-10-23 13:29 ` [PATCH v3 09/15] KVM: MMU: initialize the pointers in pte_list_desc properly Xiao Guangrong
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

The basic idea is from the nulls list, which uses a nulls value to indicate
whether the desc has been moved to a different pte-list.

Note, we walk each desc from the highest index down, since a deleted entry is
always refilled with the entry at the highest used index of the first desc.

Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4687329..a864140 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -975,6 +975,10 @@ static int count_spte_number(struct pte_list_desc *desc)
 	return first_free + desc_num * PTE_LIST_EXT;
 }
 
+#define rcu_assign_pte_list(pte_list_p, value)				\
+	rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),	\
+			(unsigned long *)(value))
+
 /*
  * Pte mapping structures:
  *
@@ -994,7 +998,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 
 	if (!*pte_list) {
 		rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
-		*pte_list = (unsigned long)spte;
+		rcu_assign_pte_list(pte_list, spte);
 		return 0;
 	}
 
@@ -1004,7 +1008,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		desc->sptes[0] = (u64 *)*pte_list;
 		desc->sptes[1] = spte;
 		desc_mark_nulls(pte_list, desc);
-		*pte_list = (unsigned long)desc | 1;
+		rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
 		return 1;
 	}
 
@@ -1017,7 +1021,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		new_desc = mmu_alloc_pte_list_desc(vcpu);
 		new_desc->more = desc;
 		desc = new_desc;
-		*pte_list = (unsigned long)desc | 1;
+		rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
 	}
 
 	free_pos = find_first_free(desc);
@@ -1125,6 +1129,51 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
 	WARN_ON(desc_get_nulls_value(desc) != pte_list);
 }
 
+/* The caller should hold rcu lock. */
+static void pte_list_walk_lockless(unsigned long *pte_list,
+				   pte_list_walk_fn fn)
+{
+	struct pte_list_desc *desc;
+	unsigned long pte_list_value;
+	int i;
+
+restart:
+	/*
+	 * Force the pte_list to be reloaded.
+	 *
+	 * See the comments in hlist_nulls_for_each_entry_rcu().
+	 */
+	barrier();
+	pte_list_value = *rcu_dereference(pte_list);
+	if (!pte_list_value)
+		return;
+
+	if (!(pte_list_value & 1))
+		return fn((u64 *)pte_list_value);
+
+	desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
+	while (!desc_is_a_nulls(desc)) {
+		/*
+		 * We should do top-down walk since we always use the higher
+		 * indices to replace the deleted entry if only one desc is
+		 * used in the rmap when a spte is removed. Otherwise the
+		 * moved entry will be missed.
+		 */
+		for (i = PTE_LIST_EXT - 1; i >= 0; i--)
+			if (desc->sptes[i])
+				fn(desc->sptes[i]);
+
+		desc = rcu_dereference(desc->more);
+
+		/* It is being initialized. */
+		if (unlikely(!desc))
+			goto restart;
+	}
+
+	if (unlikely(desc_get_nulls_value(desc) != pte_list))
+		goto restart;
+}
+
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
 				    struct kvm_memory_slot *slot)
 {
@@ -4615,7 +4664,7 @@ int kvm_mmu_module_init(void)
 {
 	pte_list_desc_cache = kmem_cache_create("pte_list_desc",
 					    sizeof(struct pte_list_desc),
-					    0, 0, NULL);
+					    0, SLAB_DESTROY_BY_RCU, NULL);
 	if (!pte_list_desc_cache)
 		goto nomem;
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 09/15] KVM: MMU: initialize the pointers in pte_list_desc properly
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (7 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 08/15] KVM: MMU: introduce pte-list lockless walker Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-10-23 13:29 ` [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab Xiao Guangrong
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Since pte_list_desc will be accessed locklessly, we need to initialize its
pointers atomically so that the lockless walker cannot see a partial value of
a pointer.

In this patch we initialize the pointers with plain pointer assignments, which
are always atomic, instead of using kmem_cache_zalloc.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a864140..f3ae74e6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -687,14 +687,15 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 }
 
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
-				  struct kmem_cache *base_cache, int min)
+				  struct kmem_cache *base_cache, int min,
+				  gfp_t flags)
 {
 	void *obj;
 
 	if (cache->nobjs >= min)
 		return 0;
 	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-		obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
+		obj = kmem_cache_alloc(base_cache, flags);
 		if (!obj)
 			return -ENOMEM;
 		cache->objects[cache->nobjs++] = obj;
@@ -741,14 +742,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
 	int r;
 
 	r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
-				   pte_list_desc_cache, 8 + PTE_PREFETCH_NUM);
+				   pte_list_desc_cache, 8 + PTE_PREFETCH_NUM,
+				   GFP_KERNEL);
 	if (r)
 		goto out;
 	r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
 	if (r)
 		goto out;
 	r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
-				   mmu_page_header_cache, 4);
+				   mmu_page_header_cache, 4,
+				   GFP_KERNEL | __GFP_ZERO);
 out:
 	return r;
 }
@@ -913,6 +916,17 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
 	return level - 1;
 }
 
+static void pte_list_desc_ctor(void *p)
+{
+	struct pte_list_desc *desc = p;
+	int i;
+
+	for (i = 0; i < PTE_LIST_EXT; i++)
+		desc->sptes[i] = NULL;
+
+	desc->more = NULL;
+}
+
 static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
 {
 	unsigned long marker;
@@ -1066,6 +1080,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 	 */
 	if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
 		*pte_list = (unsigned long)first_desc->sptes[0];
+		first_desc->sptes[0] = NULL;
 		mmu_free_pte_list_desc(first_desc);
 	}
 }
@@ -4663,8 +4678,8 @@ static void mmu_destroy_caches(void)
 int kvm_mmu_module_init(void)
 {
 	pte_list_desc_cache = kmem_cache_create("pte_list_desc",
-					    sizeof(struct pte_list_desc),
-					    0, SLAB_DESTROY_BY_RCU, NULL);
+				sizeof(struct pte_list_desc),
+				0, SLAB_DESTROY_BY_RCU, pte_list_desc_ctor);
 	if (!pte_list_desc_cache)
 		goto nomem;
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (8 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 09/15] KVM: MMU: initialize the pointers in pte_list_desc properly Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-10-24  9:19   ` Gleb Natapov
  2013-10-23 13:29 ` [PATCH v3 11/15] KVM: MMU: locklessly access shadow page under rcu protection Xiao Guangrong
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Allocate shadow pages from a slab instead of the page allocator, so that
frequent shadow-page allocation and freeing can hit the slab cache; this is
very useful for the shadow mmu.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu.c              | 46 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5cbf316..df9ae10 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -389,6 +389,7 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu *walk_mmu;
 
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
+	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
 	struct kvm_mmu_memory_cache mmu_page_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
@@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
 {
 	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
 
-	return (struct kvm_mmu_page *)page_private(page);
+	return (struct kvm_mmu_page *)(page->mapping);
 }
 
 static inline u16 kvm_read_ldt(void)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f3ae74e6..1bcc8c8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -178,6 +178,7 @@ struct kvm_shadow_walk_iterator {
 	     __shadow_walk_next(&(_walker), spte))
 
 static struct kmem_cache *pte_list_desc_cache;
+static struct kmem_cache *mmu_shadow_page_cache;
 static struct kmem_cache *mmu_page_header_cache;
 static struct percpu_counter kvm_total_used_mmu_pages;
 
@@ -746,7 +747,14 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
 				   GFP_KERNEL);
 	if (r)
 		goto out;
-	r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
+
+	r = mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+				   mmu_shadow_page_cache, 4,
+				   GFP_KERNEL);
+	if (r)
+		goto out;
+
+	r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 4);
 	if (r)
 		goto out;
 	r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
@@ -760,6 +768,8 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
 				pte_list_desc_cache);
+	mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
+				mmu_shadow_page_cache);
 	mmu_free_memory_cache_page(&vcpu->arch.mmu_page_cache);
 	mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache,
 				mmu_page_header_cache);
@@ -1675,12 +1685,28 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
+static void set_page_header(struct kvm_mmu_page *sp)
+{
+	struct page *page = virt_to_page(sp->spt);
+
+	WARN_ON(page->mapping);
+	page->mapping = (struct address_space *)sp;
+}
+
+static void clear_page_header(struct kvm_mmu_page *sp)
+{
+	struct page *page = virt_to_page(sp->spt);
+
+	page->mapping = NULL;
+}
+
 static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
 	list_del(&sp->link);
-	free_page((unsigned long)sp->spt);
+	clear_page_header(sp);
+	kmem_cache_free(mmu_shadow_page_cache, sp->spt);
 	if (!sp->role.direct)
 		free_page((unsigned long)sp->gfns);
 	kmem_cache_free(mmu_page_header_cache, sp);
@@ -1719,10 +1745,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
 	struct kvm_mmu_page *sp;
 
 	sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
+	sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
 	if (!direct)
 		sp->gfns = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
-	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+	set_page_header(sp);
 
 	/*
 	 * The active_mmu_pages list is the FIFO list, do not move the
@@ -2046,12 +2072,13 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 	}
 }
 
-static void init_shadow_page_table(struct kvm_mmu_page *sp)
+static void init_shadow_page_table(void *p)
 {
+	u64 *sptp = (u64 *)p;
 	int i;
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
-		sp->spt[i] = 0ull;
+		sptp[i] = 0ull;
 }
 
 static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
@@ -2137,7 +2164,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		account_shadowed(vcpu->kvm, gfn);
 	}
 	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
-	init_shadow_page_table(sp);
 	trace_kvm_mmu_get_page(sp, true);
 	return sp;
 }
@@ -4683,6 +4709,12 @@ int kvm_mmu_module_init(void)
 	if (!pte_list_desc_cache)
 		goto nomem;
 
+	mmu_shadow_page_cache = kmem_cache_create("mmu_shadow_page_cache",
+						  PAGE_SIZE, PAGE_SIZE, 0,
+						  init_shadow_page_table);
+	if (!mmu_shadow_page_cache)
+		goto nomem;
+
 	mmu_page_header_cache = kmem_cache_create("kvm_mmu_page_header",
 						  sizeof(struct kvm_mmu_page),
 						  0, 0, NULL);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 11/15] KVM: MMU: locklessly access shadow page under rcu protection
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (9 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-10-23 13:29 ` [PATCH v3 12/15] KVM: MMU: check last spte with unawareness of mapping level Xiao Guangrong
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Use SLAB_DESTROY_BY_RCU to prevent the shadow page from being freed back to
the page allocator, so that it can be accessed locklessly while holding the
rcu read lock.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1bcc8c8..5b42858 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4710,8 +4710,8 @@ int kvm_mmu_module_init(void)
 		goto nomem;
 
 	mmu_shadow_page_cache = kmem_cache_create("mmu_shadow_page_cache",
-						  PAGE_SIZE, PAGE_SIZE, 0,
-						  init_shadow_page_table);
+				   PAGE_SIZE, PAGE_SIZE, SLAB_DESTROY_BY_RCU,
+				   init_shadow_page_table);
 	if (!mmu_shadow_page_cache)
 		goto nomem;
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 12/15] KVM: MMU: check last spte with unawareness of mapping level
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (10 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 11/15] KVM: MMU: locklessly access shadow page under rcu protection Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-10-23 13:29 ` [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page Xiao Guangrong
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

The sptes on the middle levels obey these rules:
- they are always writable
- they do not point to a process page, so SPTE_HOST_WRITEABLE never has a
  chance to be set on them

So we can check for a last spte using PT_WRITABLE_MASK and SPTE_HOST_WRITEABLE,
which can be read from the spte itself, and then is_last_spte() no longer
depends on the mapping level.

This is important for implementing lockless write-protection since only the
spte is available at that time.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c         | 25 ++++++++++++-------------
 arch/x86/kvm/mmu_audit.c   |  6 +++---
 arch/x86/kvm/paging_tmpl.h |  6 ++----
 3 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5b42858..8b96d96 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -337,13 +337,13 @@ static int is_rmap_spte(u64 pte)
 	return is_shadow_present_pte(pte);
 }
 
-static int is_last_spte(u64 pte, int level)
+static int is_last_spte(u64 pte)
 {
-	if (level == PT_PAGE_TABLE_LEVEL)
-		return 1;
-	if (is_large_pte(pte))
-		return 1;
-	return 0;
+	/*
+	 * All the sptes on the middle level are writable but
+	 * SPTE_HOST_WRITEABLE is not set.
+	 */
+	return !(is_writable_pte(pte) && !(pte & SPTE_HOST_WRITEABLE));
 }
 
 static pfn_t spte_to_pfn(u64 pte)
@@ -2203,7 +2203,7 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
 static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator,
 			       u64 spte)
 {
-	if (is_last_spte(spte, iterator->level)) {
+	if (is_last_spte(spte)) {
 		iterator->level = 0;
 		return;
 	}
@@ -2255,15 +2255,14 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	}
 }
 
-static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
-			     u64 *spte)
+static bool mmu_page_zap_pte(struct kvm *kvm, u64 *spte)
 {
 	u64 pte;
 	struct kvm_mmu_page *child;
 
 	pte = *spte;
 	if (is_shadow_present_pte(pte)) {
-		if (is_last_spte(pte, sp->role.level)) {
+		if (is_last_spte(pte)) {
 			drop_spte(kvm, spte);
 			if (is_large_pte(pte))
 				--kvm->stat.lpages;
@@ -2286,7 +2285,7 @@ static void kvm_mmu_page_unlink_children(struct kvm *kvm,
 	unsigned i;
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
-		mmu_page_zap_pte(kvm, sp, sp->spt + i);
+		mmu_page_zap_pte(kvm, sp->spt + i);
 }
 
 static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
@@ -3068,7 +3067,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	}
 
 	sp = page_header(__pa(iterator.sptep));
-	if (!is_last_spte(spte, sp->role.level))
+	if (!is_last_spte(spte))
 		goto exit;
 
 	/*
@@ -4316,7 +4315,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		local_flush = true;
 		while (npte--) {
 			entry = *spte;
-			mmu_page_zap_pte(vcpu->kvm, sp, spte);
+			mmu_page_zap_pte(vcpu->kvm, spte);
 			if (gentry &&
 			      !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
 			      & mask.word) && rmap_can_add(vcpu))
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index daff69e..d54e2ad 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -45,7 +45,7 @@ static void __mmu_spte_walk(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		fn(vcpu, ent + i, level);
 
 		if (is_shadow_present_pte(ent[i]) &&
-		      !is_last_spte(ent[i], level)) {
+		      !is_last_spte(ent[i])) {
 			struct kvm_mmu_page *child;
 
 			child = page_header(ent[i] & PT64_BASE_ADDR_MASK);
@@ -110,7 +110,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
 		}
 	}
 
-	if (!is_shadow_present_pte(*sptep) || !is_last_spte(*sptep, level))
+	if (!is_shadow_present_pte(*sptep) || !is_last_spte(*sptep))
 		return;
 
 	gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
@@ -158,7 +158,7 @@ static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
 
 static void audit_sptes_have_rmaps(struct kvm_vcpu *vcpu, u64 *sptep, int level)
 {
-	if (is_shadow_present_pte(*sptep) && is_last_spte(*sptep, level))
+	if (is_shadow_present_pte(*sptep) && is_last_spte(*sptep))
 		inspect_spte_has_rmap(vcpu->kvm, sptep);
 }
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index ad75d77..33f0216 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -809,7 +809,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
-	int level;
 	u64 *sptep;
 
 	vcpu_clear_mmio_info(vcpu, gva);
@@ -822,11 +821,10 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 
 	spin_lock(&vcpu->kvm->mmu_lock);
 	for_each_shadow_entry(vcpu, gva, iterator) {
-		level = iterator.level;
 		sptep = iterator.sptep;
 
 		sp = page_header(__pa(sptep));
-		if (is_last_spte(*sptep, level)) {
+		if (is_last_spte(*sptep)) {
 			pt_element_t gpte;
 			gpa_t pte_gpa;
 
@@ -836,7 +834,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 			pte_gpa = FNAME(get_level1_sp_gpa)(sp);
 			pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
 
-			if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
+			if (mmu_page_zap_pte(vcpu->kvm, sptep))
 				kvm_flush_remote_tlbs(vcpu->kvm);
 
 			if (!rmap_can_add(vcpu))
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (11 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 12/15] KVM: MMU: check last spte with unawareness of mapping level Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-10-24  9:17   ` Gleb Natapov
  2013-10-23 13:29 ` [PATCH v3 14/15] KVM: MMU: clean up spte_write_protect Xiao Guangrong
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Currently, when a memslot is marked dirty-logged or the dirty log is
retrieved, we need to write-protect a large amount of guest memory. This is
heavy work, especially because we need to hold mmu-lock, which is also
required by vcpus to fix their page table faults and by the mmu-notifier when
a host page is being changed. In guests with extreme cpu / memory usage, this
becomes a scalability issue

This patch introduces a way to write-protect guest memory locklessly

Now that the lockless rmap walk, lockless shadow page table access and
lockless spte write-protection are ready, it is time to implement page
write-protection out of mmu-lock
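
The shape of the lockless path is roughly the sketch below (illustrative
pseudo-code only; for_each_rmap_spte_lockless stands in for the pte-list
lockless walker from the earlier patches, and the real code is in the hunks
that follow). The walk runs under rcu_read_lock() instead of mmu-lock, and
each spte is downgraded with cmpxchg64 so that a racing update is detected by
the failing cmpxchg:

	rcu_read_lock();
	for_each_rmap_spte_lockless(rmapp, sptep) {
		u64 spte = *rcu_dereference(sptep);

		if (is_last_spte(spte) && is_writable_pte(spte))
			/* the real code retries when the cmpxchg fails */
			cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK);
	}
	rcu_read_unlock();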

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ---
 arch/x86/kvm/mmu.c              | 59 ++++++++++++++++++++++++++++++-----------
 arch/x86/kvm/mmu.h              |  6 +++++
 arch/x86/kvm/x86.c              | 11 ++++----
 4 files changed, 55 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index df9ae10..cdb6f29 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -793,10 +793,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 		u64 dirty_mask, u64 nx_mask, u64 x_mask);
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-				     struct kvm_memory_slot *slot,
-				     gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8b96d96..d82bbec 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1386,8 +1386,37 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
 	return flush;
 }
 
-/**
- * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
+static void __rmap_write_protect_lockless(u64 *sptep)
+{
+	u64 spte;
+
+retry:
+	/*
+	 * Note we may partly read the sptep on 32bit host, however, we
+	 * allow this case because:
+	 * - we do not access the page got from the sptep.
+	 * - cmpxchg64 can detect that case and avoid setting a wrong value
+	 *   to the sptep.
+	 */
+	spte = *rcu_dereference(sptep);
+	if (unlikely(!is_last_spte(spte) || !is_writable_pte(spte)))
+		return;
+
+	if (likely(cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte))
+		return;
+
+	goto retry;
+}
+
+static void rmap_write_protect_lockless(unsigned long *rmapp)
+{
+	pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless);
+}
+
+/*
+ * kvm_mmu_write_protect_pt_masked_lockless - write protect selected PT level
+ * pages out of mmu-lock.
+ *
  * @kvm: kvm instance
  * @slot: slot to protect
  * @gfn_offset: start of the BITS_PER_LONG pages we care about
@@ -1396,16 +1425,17 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
  * Used when we do not need to care about huge page mappings: e.g. during dirty
  * logging we do not have any such mappings.
  */
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-				     struct kvm_memory_slot *slot,
-				     gfn_t gfn_offset, unsigned long mask)
+void
+kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
+					 struct kvm_memory_slot *slot,
+					 gfn_t gfn_offset, unsigned long mask)
 {
 	unsigned long *rmapp;
 
 	while (mask) {
 		rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
 				      PT_PAGE_TABLE_LEVEL, slot);
-		__rmap_write_protect(kvm, rmapp, false);
+		rmap_write_protect_lockless(rmapp);
 
 		/* clear the first set bit */
 		mask &= mask - 1;
@@ -4477,7 +4507,7 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu)
 	init_kvm_mmu(vcpu);
 }
 
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot)
 {
 	struct kvm_memory_slot *memslot;
 	gfn_t last_gfn;
@@ -4486,8 +4516,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 	memslot = id_to_memslot(kvm->memslots, slot);
 	last_gfn = memslot->base_gfn + memslot->npages - 1;
 
-	spin_lock(&kvm->mmu_lock);
-
+	rcu_read_lock();
 	for (i = PT_PAGE_TABLE_LEVEL;
 	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
 		unsigned long *rmapp;
@@ -4497,15 +4526,15 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 		last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
 
 		for (index = 0; index <= last_index; ++index, ++rmapp) {
-			if (*rmapp)
-				__rmap_write_protect(kvm, rmapp, false);
+			rmap_write_protect_lockless(rmapp);
 
-			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
-				cond_resched_lock(&kvm->mmu_lock);
+			if (need_resched()) {
+				rcu_read_lock();
+				rcu_read_unlock();
+			}
 		}
 	}
-
-	spin_unlock(&kvm->mmu_lock);
+	rcu_read_unlock();
 
 	/*
 	 * We can flush all the TLBs out of the mmu lock without TLB
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2926152..33f313b 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -117,4 +117,10 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
 }
 
 void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
+
+void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot);
+void
+kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
+					 struct kvm_memory_slot *slot,
+					 gfn_t gfn_offset, unsigned long mask);
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4ac3a27..c6233e1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3554,8 +3554,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
 	memset(dirty_bitmap_buffer, 0, n);
 
-	spin_lock(&kvm->mmu_lock);
-
+	rcu_read_lock();
 	for (i = 0; i < n / sizeof(long); i++) {
 		unsigned long mask;
 		gfn_t offset;
@@ -3579,10 +3578,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 		dirty_bitmap_buffer[i] = mask;
 
 		offset = i * BITS_PER_LONG;
-		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
+		kvm_mmu_write_protect_pt_masked_lockless(kvm, memslot,
+							 offset, mask);
 	}
-
-	spin_unlock(&kvm->mmu_lock);
+	rcu_read_unlock();
 
 	/*
 	 * All the TLBs can be flushed out of mmu lock, see the comments in
@@ -7246,7 +7245,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	 * See the comments in fast_page_fault().
 	 */
 	if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
-		kvm_mmu_slot_remove_write_access(kvm, mem->slot);
+		kvm_mmu_slot_remove_write_access_lockless(kvm, mem->slot);
 }
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 14/15] KVM: MMU: clean up spte_write_protect
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (12 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-10-23 13:29 ` [PATCH v3 15/15] KVM: MMU: use rcu functions to access the pointer Xiao Guangrong
  2013-11-03 12:29 ` [PATCH v3 00/15] KVM: MMU: locklessly write-protect Gleb Natapov
  15 siblings, 0 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Now the only user of spte_write_protect() is rmap_write_protect(), which
always calls it with pt_protect = true, so drop that parameter together with
the unused parameter @kvm

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d82bbec..3e4b941 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1340,8 +1340,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
 }
 
 /*
- * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte write-protection is caused by protecting shadow page table.
+ * Write-protect on the specified @sptep.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1352,25 +1351,23 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
  *
  * Return true if tlb need be flushed.
  */
-static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
+static bool spte_write_protect(u64 *sptep)
 {
 	u64 spte = *sptep;
 
 	if (!is_writable_pte(spte) &&
-	      !(pt_protect && spte_is_locklessly_modifiable(spte)))
+	      !spte_is_locklessly_modifiable(spte))
 		return false;
 
 	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
-	if (pt_protect)
-		spte &= ~SPTE_MMU_WRITEABLE;
-	spte = spte & ~PT_WRITABLE_MASK;
+	spte &= ~SPTE_MMU_WRITEABLE;
+	spte &= ~PT_WRITABLE_MASK;
 
 	return mmu_spte_update(sptep, spte);
 }
 
-static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
-				 bool pt_protect)
+static bool __rmap_write_protect(unsigned long *rmapp)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1379,7 +1376,7 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
 	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
 		BUG_ON(!(*sptep & PT_PRESENT_MASK));
 
-		flush |= spte_write_protect(kvm, sptep, pt_protect);
+		flush |= spte_write_protect(sptep);
 		sptep = rmap_get_next(&iter);
 	}
 
@@ -1454,7 +1451,7 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
 	for (i = PT_PAGE_TABLE_LEVEL;
 	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
 		rmapp = __gfn_to_rmap(gfn, i, slot);
-		write_protected |= __rmap_write_protect(kvm, rmapp, true);
+		write_protected |= __rmap_write_protect(rmapp);
 	}
 
 	return write_protected;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 15/15] KVM: MMU: use rcu functions to access the pointer
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (13 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 14/15] KVM: MMU: clean up spte_write_protect Xiao Guangrong
@ 2013-10-23 13:29 ` Xiao Guangrong
  2013-11-03 12:29 ` [PATCH v3 00/15] KVM: MMU: locklessly write-protect Gleb Natapov
  15 siblings, 0 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-23 13:29 UTC (permalink / raw)
  To: gleb; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm, Xiao Guangrong

Use rcu_assign_pointer() to update all the pointers in the desc
and use rcu_dereference() to read them locklessly
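
This is the usual RCU publish/read pairing: the writer initializes the new
entry and only then publishes the pointer, with a barrier in between, and the
lockless reader uses a dependency-ordered load. A minimal generic sketch of
the pattern (not tied to the exact pte_list encoding, which also carries a tag
bit in the low bit; the names here are illustrative):

	/* writer side: fill in the new entry, then publish it */
	new_desc->sptes[0] = sptep;
	rcu_assign_pointer(*pte_list_p, new_desc);	/* orders init before publish */

	/* lockless reader side */
	desc = rcu_dereference(*pte_list_p);		/* dependency-ordered load */
	if (desc)
		fn(desc->sptes[0]);

Stores into the slots of a desc that is already visible to lockless readers
need the same treatment, which is why the patch also wraps the sptes[] and
->more assignments.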

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 46 ++++++++++++++++++++++++++++------------------
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3e4b941..68dac26 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -937,12 +937,23 @@ static void pte_list_desc_ctor(void *p)
 	desc->more = NULL;
 }
 
+#define rcu_assign_pte_list(pte_list_p, value)				\
+	rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),	\
+			  (unsigned long *)(value))
+
+#define rcu_assign_desc_more(morep, value)				\
+	rcu_assign_pointer(*(unsigned long __rcu **)&morep,		\
+			  (unsigned long *)value)
+
+#define rcu_assign_spte(sptep, value)					\
+	rcu_assign_pointer(*(u64 __rcu **)&sptep, (u64 *)value)
+
 static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
 {
 	unsigned long marker;
 
 	marker = (unsigned long)pte_list | 1UL;
-	desc->more = (struct pte_list_desc *)marker;
+	rcu_assign_desc_more(desc->more, (struct pte_list_desc *)marker);
 }
 
 static bool desc_is_a_nulls(struct pte_list_desc *desc)
@@ -999,10 +1010,6 @@ static int count_spte_number(struct pte_list_desc *desc)
 	return first_free + desc_num * PTE_LIST_EXT;
 }
 
-#define rcu_assign_pte_list(pte_list_p, value)				\
-	rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),	\
-			(unsigned long *)(value))
-
 /*
  * Pte mapping structures:
  *
@@ -1029,8 +1036,8 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 	if (!(*pte_list & 1)) {
 		rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
 		desc = mmu_alloc_pte_list_desc(vcpu);
-		desc->sptes[0] = (u64 *)*pte_list;
-		desc->sptes[1] = spte;
+		rcu_assign_spte(desc->sptes[0], *pte_list);
+		rcu_assign_spte(desc->sptes[1], spte);
 		desc_mark_nulls(pte_list, desc);
 		rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
 		return 1;
@@ -1043,13 +1050,13 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 	if (desc->sptes[PTE_LIST_EXT - 1]) {
 		struct pte_list_desc *new_desc;
 		new_desc = mmu_alloc_pte_list_desc(vcpu);
-		new_desc->more = desc;
+		rcu_assign_desc_more(new_desc->more, desc);
 		desc = new_desc;
 		rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
 	}
 
 	free_pos = find_first_free(desc);
-	desc->sptes[free_pos] = spte;
+	rcu_assign_spte(desc->sptes[free_pos], spte);
 	return count_spte_number(desc) - 1;
 }
 
@@ -1067,8 +1074,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 	 * Move the entry from the first desc to this position we want
 	 * to remove.
 	 */
-	desc->sptes[i] = first_desc->sptes[last_used];
-	first_desc->sptes[last_used] = NULL;
+	rcu_assign_spte(desc->sptes[i], first_desc->sptes[last_used]);
+	rcu_assign_spte(first_desc->sptes[last_used], NULL);
 
 	/* No valid entry in this desc, we can free this desc now. */
 	if (!first_desc->sptes[0]) {
@@ -1080,7 +1087,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 		WARN_ON(desc_is_a_nulls(next_desc));
 
 		mmu_free_pte_list_desc(first_desc);
-		*pte_list = (unsigned long)next_desc | 1ul;
+		rcu_assign_pte_list(pte_list, (unsigned long)next_desc | 1ul);
 		return;
 	}
 
@@ -1089,8 +1096,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 	 * then the desc can be freed.
 	 */
 	if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
-		*pte_list = (unsigned long)first_desc->sptes[0];
-		first_desc->sptes[0] = NULL;
+		rcu_assign_pte_list(pte_list, first_desc->sptes[0]);
+		rcu_assign_spte(first_desc->sptes[0], NULL);
 		mmu_free_pte_list_desc(first_desc);
 	}
 }
@@ -1112,7 +1119,7 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
 			pr_err("pte_list_remove:  %p 1->BUG\n", spte);
 			BUG();
 		}
-		*pte_list = 0;
+		rcu_assign_pte_list(pte_list, 0);
 		return;
 	}
 
@@ -1184,9 +1191,12 @@ restart:
 		 * used in the rmap when a spte is removed. Otherwise the
 		 * moved entry will be missed.
 		 */
-		for (i = PTE_LIST_EXT - 1; i >= 0; i--)
-			if (desc->sptes[i])
-				fn(desc->sptes[i]);
+		for (i = PTE_LIST_EXT - 1; i >= 0; i--) {
+			u64 *sptep = rcu_dereference(desc->sptes[i]);
+
+			if (sptep)
+				fn(sptep);
+		}
 
 		desc = rcu_dereference(desc->more);
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page
  2013-10-23 13:29 ` [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page Xiao Guangrong
@ 2013-10-24  9:17   ` Gleb Natapov
  2013-10-24  9:24     ` Xiao Guangrong
  0 siblings, 1 reply; 69+ messages in thread
From: Gleb Natapov @ 2013-10-24  9:17 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On Wed, Oct 23, 2013 at 09:29:31PM +0800, Xiao Guangrong wrote:
> Currently, when mark memslot dirty logged or get dirty page, we need to
> write-protect large guest memory, it is the heavy work, especially, we
> need to hold mmu-lock which is also required by vcpu to fix its page table
> fault and mmu-notifier when host page is being changed. In the extreme
> cpu / memory used guest, it becomes a scalability issue
> 
> This patch introduces a way to locklessly write-protect guest memory
> 
> Now, lockless rmap walk, lockless shadow page table access and lockless
> spte wirte-protection are ready, it is the time to implements page
> write-protection out of mmu-lock
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  4 ---
>  arch/x86/kvm/mmu.c              | 59 ++++++++++++++++++++++++++++++-----------
>  arch/x86/kvm/mmu.h              |  6 +++++
>  arch/x86/kvm/x86.c              | 11 ++++----
>  4 files changed, 55 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index df9ae10..cdb6f29 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -793,10 +793,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>  		u64 dirty_mask, u64 nx_mask, u64 x_mask);
>  
>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
> -void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
> -void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> -				     struct kvm_memory_slot *slot,
> -				     gfn_t gfn_offset, unsigned long mask);
>  void kvm_mmu_zap_all(struct kvm *kvm);
>  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
>  unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 8b96d96..d82bbec 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1386,8 +1386,37 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
>  	return flush;
>  }
>  
> -/**
> - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
> +static void __rmap_write_protect_lockless(u64 *sptep)
> +{
> +	u64 spte;
> +
> +retry:
> +	/*
> +	 * Note we may partly read the sptep on 32bit host, however, we
> +	 * allow this case because:
> +	 * - we do not access the page got from the sptep.
> +	 * - cmpxchg64 can detect that case and avoid setting a wrong value
> +	 *   to the sptep.
> +	 */
> +	spte = *rcu_dereference(sptep);
> +	if (unlikely(!is_last_spte(spte) || !is_writable_pte(spte)))
is_last_spte gets two parameters.

> +		return;
> +
> +	if (likely(cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte))
> +		return;
> +
> +	goto retry;
> +}
> +
> +static void rmap_write_protect_lockless(unsigned long *rmapp)
> +{
> +	pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless);
> +}
> +
> +/*
> + * kvm_mmu_write_protect_pt_masked_lockless - write protect selected PT level
> + * pages out of mmu-lock.
> + *
>   * @kvm: kvm instance
>   * @slot: slot to protect
>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
> @@ -1396,16 +1425,17 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
>   * Used when we do not need to care about huge page mappings: e.g. during dirty
>   * logging we do not have any such mappings.
>   */
> -void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> -				     struct kvm_memory_slot *slot,
> -				     gfn_t gfn_offset, unsigned long mask)
> +void
> +kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
> +					 struct kvm_memory_slot *slot,
> +					 gfn_t gfn_offset, unsigned long mask)
>  {
>  	unsigned long *rmapp;
>  
>  	while (mask) {
>  		rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
>  				      PT_PAGE_TABLE_LEVEL, slot);
> -		__rmap_write_protect(kvm, rmapp, false);
> +		rmap_write_protect_lockless(rmapp);
>  
>  		/* clear the first set bit */
>  		mask &= mask - 1;
> @@ -4477,7 +4507,7 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu)
>  	init_kvm_mmu(vcpu);
>  }
>  
> -void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
> +void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot)
>  {
>  	struct kvm_memory_slot *memslot;
>  	gfn_t last_gfn;
> @@ -4486,8 +4516,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
>  	memslot = id_to_memslot(kvm->memslots, slot);
>  	last_gfn = memslot->base_gfn + memslot->npages - 1;
>  
> -	spin_lock(&kvm->mmu_lock);
> -
> +	rcu_read_lock();
>  	for (i = PT_PAGE_TABLE_LEVEL;
>  	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
>  		unsigned long *rmapp;
> @@ -4497,15 +4526,15 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
>  		last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
>  
>  		for (index = 0; index <= last_index; ++index, ++rmapp) {
> -			if (*rmapp)
> -				__rmap_write_protect(kvm, rmapp, false);
> +			rmap_write_protect_lockless(rmapp);
>  
> -			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> -				cond_resched_lock(&kvm->mmu_lock);
> +			if (need_resched()) {
> +				rcu_read_lock();
> +				rcu_read_unlock();
> +			}
>  		}
>  	}
> -
> -	spin_unlock(&kvm->mmu_lock);
> +	rcu_read_unlock();
>  
>  	/*
>  	 * We can flush all the TLBs out of the mmu lock without TLB
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 2926152..33f313b 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -117,4 +117,10 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
>  }
>  
>  void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
> +
> +void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot);
> +void
> +kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
> +					 struct kvm_memory_slot *slot,
> +					 gfn_t gfn_offset, unsigned long mask);
>  #endif
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4ac3a27..c6233e1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3554,8 +3554,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>  	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
>  	memset(dirty_bitmap_buffer, 0, n);
>  
> -	spin_lock(&kvm->mmu_lock);
> -
> +	rcu_read_lock();
>  	for (i = 0; i < n / sizeof(long); i++) {
>  		unsigned long mask;
>  		gfn_t offset;
> @@ -3579,10 +3578,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>  		dirty_bitmap_buffer[i] = mask;
>  
>  		offset = i * BITS_PER_LONG;
> -		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
> +		kvm_mmu_write_protect_pt_masked_lockless(kvm, memslot,
> +							 offset, mask);
>  	}
> -
> -	spin_unlock(&kvm->mmu_lock);
> +	rcu_read_unlock();
>  
>  	/*
>  	 * All the TLBs can be flushed out of mmu lock, see the comments in
> @@ -7246,7 +7245,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>  	 * See the comments in fast_page_fault().
>  	 */
>  	if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
> -		kvm_mmu_slot_remove_write_access(kvm, mem->slot);
> +		kvm_mmu_slot_remove_write_access_lockless(kvm, mem->slot);
>  }
>  
>  void kvm_arch_flush_shadow_all(struct kvm *kvm)
> -- 
> 1.8.1.4

--
			Gleb.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab
  2013-10-23 13:29 ` [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab Xiao Guangrong
@ 2013-10-24  9:19   ` Gleb Natapov
  2013-10-24  9:29     ` Xiao Guangrong
  0 siblings, 1 reply; 69+ messages in thread
From: Gleb Natapov @ 2013-10-24  9:19 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On Wed, Oct 23, 2013 at 09:29:28PM +0800, Xiao Guangrong wrote:
> Allocate shadow pages from slab instead of page-allocator, frequent
> shadow page allocation and free can be hit in the slab cache, it is
> very useful for shadow mmu
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 ++-
>  arch/x86/kvm/mmu.c              | 46 ++++++++++++++++++++++++++++++++++-------
>  2 files changed, 41 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5cbf316..df9ae10 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -389,6 +389,7 @@ struct kvm_vcpu_arch {
>  	struct kvm_mmu *walk_mmu;
>  
>  	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> +	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
>  	struct kvm_mmu_memory_cache mmu_page_cache;
>  	struct kvm_mmu_memory_cache mmu_page_header_cache;
>  
> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>  {
>  	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>  
> -	return (struct kvm_mmu_page *)page_private(page);
> +	return (struct kvm_mmu_page *)(page->mapping);
Why?

>  }
>  
>  static inline u16 kvm_read_ldt(void)
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index f3ae74e6..1bcc8c8 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -178,6 +178,7 @@ struct kvm_shadow_walk_iterator {
>  	     __shadow_walk_next(&(_walker), spte))
>  
>  static struct kmem_cache *pte_list_desc_cache;
> +static struct kmem_cache *mmu_shadow_page_cache;
>  static struct kmem_cache *mmu_page_header_cache;
>  static struct percpu_counter kvm_total_used_mmu_pages;
>  
> @@ -746,7 +747,14 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
>  				   GFP_KERNEL);
>  	if (r)
>  		goto out;
> -	r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
> +
> +	r = mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> +				   mmu_shadow_page_cache, 4,
> +				   GFP_KERNEL);
> +	if (r)
> +		goto out;
> +
> +	r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 4);
>  	if (r)
>  		goto out;
>  	r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
> @@ -760,6 +768,8 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>  {
>  	mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
>  				pte_list_desc_cache);
> +	mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> +				mmu_shadow_page_cache);
>  	mmu_free_memory_cache_page(&vcpu->arch.mmu_page_cache);
>  	mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache,
>  				mmu_page_header_cache);
> @@ -1675,12 +1685,28 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
>  	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
>  }
>  
> +static void set_page_header(struct kvm_mmu_page *sp)
> +{
> +	struct page *page = virt_to_page(sp->spt);
> +
> +	WARN_ON(page->mapping);
> +	page->mapping = (struct address_space *)sp;
> +}
> +
> +static void clear_page_header(struct kvm_mmu_page *sp)
> +{
> +	struct page *page = virt_to_page(sp->spt);
> +
> +	page->mapping = NULL;
> +}
> +
>  static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>  {
>  	ASSERT(is_empty_shadow_page(sp->spt));
>  	hlist_del(&sp->hash_link);
>  	list_del(&sp->link);
> -	free_page((unsigned long)sp->spt);
> +	clear_page_header(sp);
> +	kmem_cache_free(mmu_shadow_page_cache, sp->spt);
>  	if (!sp->role.direct)
>  		free_page((unsigned long)sp->gfns);
>  	kmem_cache_free(mmu_page_header_cache, sp);
> @@ -1719,10 +1745,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
>  	struct kvm_mmu_page *sp;
>  
>  	sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> -	sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
> +	sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
>  	if (!direct)
>  		sp->gfns = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
> -	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> +	set_page_header(sp);
>  
>  	/*
>  	 * The active_mmu_pages list is the FIFO list, do not move the
> @@ -2046,12 +2072,13 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
>  	}
>  }
>  
> -static void init_shadow_page_table(struct kvm_mmu_page *sp)
> +static void init_shadow_page_table(void *p)
>  {
> +	u64 *sptp = (u64 *)p;
>  	int i;
>  
>  	for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
> -		sp->spt[i] = 0ull;
> +		sptp[i] = 0ull;
>  }
>  
>  static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
> @@ -2137,7 +2164,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  		account_shadowed(vcpu->kvm, gfn);
>  	}
>  	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> -	init_shadow_page_table(sp);
>  	trace_kvm_mmu_get_page(sp, true);
>  	return sp;
>  }
> @@ -4683,6 +4709,12 @@ int kvm_mmu_module_init(void)
>  	if (!pte_list_desc_cache)
>  		goto nomem;
>  
> +	mmu_shadow_page_cache = kmem_cache_create("mmu_shadow_page_cache",
> +						  PAGE_SIZE, PAGE_SIZE, 0,
> +						  init_shadow_page_table);
> +	if (!mmu_shadow_page_cache)
> +		goto nomem;
> +
>  	mmu_page_header_cache = kmem_cache_create("kvm_mmu_page_header",
>  						  sizeof(struct kvm_mmu_page),
>  						  0, 0, NULL);
> -- 
> 1.8.1.4

--
			Gleb.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page
  2013-10-24  9:17   ` Gleb Natapov
@ 2013-10-24  9:24     ` Xiao Guangrong
  2013-10-24  9:32       ` Gleb Natapov
  0 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-24  9:24 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On 10/24/2013 05:17 PM, Gleb Natapov wrote:

>>  
>> -/**
>> - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
>> +static void __rmap_write_protect_lockless(u64 *sptep)
>> +{
>> +	u64 spte;
>> +
>> +retry:
>> +	/*
>> +	 * Note we may partly read the sptep on 32bit host, however, we
>> +	 * allow this case because:
>> +	 * - we do not access the page got from the sptep.
>> +	 * - cmpxchg64 can detect that case and avoid setting a wrong value
>> +	 *   to the sptep.
>> +	 */
>> +	spte = *rcu_dereference(sptep);
>> +	if (unlikely(!is_last_spte(spte) || !is_writable_pte(spte)))
> is_last_spte gets two parameters.

In patch [PATCH v3 12/15] KVM: MMU: check last spte with unawareness of mapping level,
we have removed the 'level' from the parameter list.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab
  2013-10-24  9:19   ` Gleb Natapov
@ 2013-10-24  9:29     ` Xiao Guangrong
  2013-10-24  9:52       ` Gleb Natapov
  0 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-24  9:29 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On 10/24/2013 05:19 PM, Gleb Natapov wrote:

>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>>  {
>>  	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>  
>> -	return (struct kvm_mmu_page *)page_private(page);
>> +	return (struct kvm_mmu_page *)(page->mapping);
> Why?

That's because page->private has been used by slab:

	/* Remainder is not double word aligned */
	union {
		unsigned long private;		/* Mapping-private opaque data:
					 	 * usually used for buffer_heads
						 * if PagePrivate set; used for
						 * swp_entry_t if PageSwapCache;
						 * indicates order in the buddy
						 * system if PG_buddy is set.
						 */
#if USE_SPLIT_PTLOCKS
		spinlock_t ptl;
#endif
		struct kmem_cache *slab_cache;	/* SL[AU]B: Pointer to slab */
		struct page *first_page;	/* Compound tail pages */
	};


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page
  2013-10-24  9:24     ` Xiao Guangrong
@ 2013-10-24  9:32       ` Gleb Natapov
  0 siblings, 0 replies; 69+ messages in thread
From: Gleb Natapov @ 2013-10-24  9:32 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On Thu, Oct 24, 2013 at 05:24:12PM +0800, Xiao Guangrong wrote:
> On 10/24/2013 05:17 PM, Gleb Natapov wrote:
> 
> >>  
> >> -/**
> >> - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
> >> +static void __rmap_write_protect_lockless(u64 *sptep)
> >> +{
> >> +	u64 spte;
> >> +
> >> +retry:
> >> +	/*
> >> +	 * Note we may partly read the sptep on 32bit host, however, we
> >> +	 * allow this case because:
> >> +	 * - we do not access the page got from the sptep.
> >> +	 * - cmpxchg64 can detect that case and avoid setting a wrong value
> >> +	 *   to the sptep.
> >> +	 */
> >> +	spte = *rcu_dereference(sptep);
> >> +	if (unlikely(!is_last_spte(spte) || !is_writable_pte(spte)))
> > is_last_spte gets two parameters.
> 
> In patch [PATCH v3 12/15] KVM: MMU: check last spte with unawareness of mapping level,
> we have removed the 'level' from the parameter list.
Ah, I haven't got 12/15 and git am did not complain :(

--
			Gleb.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab
  2013-10-24  9:29     ` Xiao Guangrong
@ 2013-10-24  9:52       ` Gleb Natapov
  2013-10-24 10:10         ` Xiao Guangrong
  0 siblings, 1 reply; 69+ messages in thread
From: Gleb Natapov @ 2013-10-24  9:52 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
> 
> >> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
> >>  {
> >>  	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> >>  
> >> -	return (struct kvm_mmu_page *)page_private(page);
> >> +	return (struct kvm_mmu_page *)(page->mapping);
> > Why?
> 
> That's because page->private has been used by slab:
>
But does the lockless path actually look at it?
  
> 	/* Remainder is not double word aligned */
> 	union {
> 		unsigned long private;		/* Mapping-private opaque data:
> 					 	 * usually used for buffer_heads
> 						 * if PagePrivate set; used for
> 						 * swp_entry_t if PageSwapCache;
> 						 * indicates order in the buddy
> 						 * system if PG_buddy is set.
> 						 */
> #if USE_SPLIT_PTLOCKS
> 		spinlock_t ptl;
> #endif
> 		struct kmem_cache *slab_cache;	/* SL[AU]B: Pointer to slab */
> 		struct page *first_page;	/* Compound tail pages */
> 	};

--
			Gleb.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab
  2013-10-24  9:52       ` Gleb Natapov
@ 2013-10-24 10:10         ` Xiao Guangrong
  2013-10-24 10:39           ` Gleb Natapov
  0 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-24 10:10 UTC (permalink / raw)
  To: Gleb Natapov, Xiao Guangrong
  Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On 10/24/2013 05:52 PM, Gleb Natapov wrote:
> On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
>> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
>>
>>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>>>>  {
>>>>  	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>>>  
>>>> -	return (struct kvm_mmu_page *)page_private(page);
>>>> +	return (struct kvm_mmu_page *)(page->mapping);
>>> Why?
>>
>> That's because page->private has been used by slab:
>>
> But does lockless path actually looks at it?

The lockless path does not use it; however, it is used by page_header():

static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
{
	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);

	return (struct kvm_mmu_page *)(page->mapping);
}

which is used in the common code.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab
  2013-10-24 10:10         ` Xiao Guangrong
@ 2013-10-24 10:39           ` Gleb Natapov
  2013-10-24 11:01             ` Xiao Guangrong
  0 siblings, 1 reply; 69+ messages in thread
From: Gleb Natapov @ 2013-10-24 10:39 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Xiao Guangrong, avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On Thu, Oct 24, 2013 at 06:10:46PM +0800, Xiao Guangrong wrote:
> On 10/24/2013 05:52 PM, Gleb Natapov wrote:
> > On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
> >> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
> >>
> >>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
> >>>>  {
> >>>>  	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> >>>>  
> >>>> -	return (struct kvm_mmu_page *)page_private(page);
> >>>> +	return (struct kvm_mmu_page *)(page->mapping);
> >>> Why?
> >>
> >> That's because page->private has been used by slab:
> >>
> > But does lockless path actually looks at it?
> 
> Lockless path does not use it, however, it is used by kvm_mmu_page():
> 
> static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
> {
> 	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> 
> 	return (struct kvm_mmu_page *)(page->mapping);
> }
> 
> which is used in the common code.
Ah, so the pointer is not available even after the object is allocated.
Makes sense, since we allocate an object, not a page, here, but is it safe
to use mapping like that?

--
			Gleb.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab
  2013-10-24 10:39           ` Gleb Natapov
@ 2013-10-24 11:01             ` Xiao Guangrong
  2013-10-24 12:32               ` Gleb Natapov
  0 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-24 11:01 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Xiao Guangrong, avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On 10/24/2013 06:39 PM, Gleb Natapov wrote:
> On Thu, Oct 24, 2013 at 06:10:46PM +0800, Xiao Guangrong wrote:
>> On 10/24/2013 05:52 PM, Gleb Natapov wrote:
>>> On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
>>>> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
>>>>
>>>>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>>>>>>  {
>>>>>>  	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>>>>>  
>>>>>> -	return (struct kvm_mmu_page *)page_private(page);
>>>>>> +	return (struct kvm_mmu_page *)(page->mapping);
>>>>> Why?
>>>>
>>>> That's because page->private has been used by slab:
>>>>
>>> But does lockless path actually looks at it?
>>
>> Lockless path does not use it, however, it is used by kvm_mmu_page():
>>
>> static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>> {
>> 	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>
>> 	return (struct kvm_mmu_page *)(page->mapping);
>> }
>>
>> which is used in the common code.
> Ah, so the pointer is not available even after object is allocated.
> Make sense since we allocate object, not page here, but is it safe to
> use mapping like that?

The comment says:

	struct address_space *mapping;	/* If low bit clear, points to
					 * inode address_space, or NULL.
					 * If page mapped as anonymous
					 * memory, low bit is set, and
					 * it points to anon_vma object:
					 * see PAGE_MAPPING_ANON below.

It seems 'mapping' is only used for an address_space or anonymous memory; in
our case the page is owned by slab, so I guess it is ok. And the check I put
in set_page_header() was not triggered with either slab or slub.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab
  2013-10-24 11:01             ` Xiao Guangrong
@ 2013-10-24 12:32               ` Gleb Natapov
  2013-10-28  3:16                 ` Xiao Guangrong
  0 siblings, 1 reply; 69+ messages in thread
From: Gleb Natapov @ 2013-10-24 12:32 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Xiao Guangrong, avi.kivity, mtosatti, pbonzini, linux-kernel,
	kvm, aarcange

On Thu, Oct 24, 2013 at 07:01:49PM +0800, Xiao Guangrong wrote:
> On 10/24/2013 06:39 PM, Gleb Natapov wrote:
> > On Thu, Oct 24, 2013 at 06:10:46PM +0800, Xiao Guangrong wrote:
> >> On 10/24/2013 05:52 PM, Gleb Natapov wrote:
> >>> On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
> >>>> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
> >>>>
> >>>>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
> >>>>>>  {
> >>>>>>  	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> >>>>>>  
> >>>>>> -	return (struct kvm_mmu_page *)page_private(page);
> >>>>>> +	return (struct kvm_mmu_page *)(page->mapping);
> >>>>> Why?
> >>>>
> >>>> That's because page->private has been used by slab:
> >>>>
> >>> But does lockless path actually looks at it?
> >>
> >> Lockless path does not use it, however, it is used by kvm_mmu_page():
> >>
> >> static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
> >> {
> >> 	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> >>
> >> 	return (struct kvm_mmu_page *)(page->mapping);
> >> }
> >>
> >> which is used in the common code.
> > Ah, so the pointer is not available even after object is allocated.
> > Make sense since we allocate object, not page here, but is it safe to
> > use mapping like that?
> 
> The commens says:
> 
> 	struct address_space *mapping;	/* If low bit clear, points to
> 					 * inode address_space, or NULL.
> 					 * If page mapped as anonymous
> 					 * memory, low bit is set, and
> 					 * it points to anon_vma object:
> 					 * see PAGE_MAPPING_ANON below.
> 
> It seems mapping is used for address_space or anonymous memory, in
> our case, the page is used by slab, so I guess it is ok. And the bug
> i put in set_page_header() was not tiggered on both slab and slub.
> 
Yeah, I also think so. I asked Andrea (copied) and he thinks that it is
currently safe too, but things change fast in this area. Andrea?
Another option is to save the slab_cache pointer and reset it before
freeing the object, but it looks ugly.

--
			Gleb.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab
  2013-10-24 12:32               ` Gleb Natapov
@ 2013-10-28  3:16                 ` Xiao Guangrong
  0 siblings, 0 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-10-28  3:16 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Xiao Guangrong, avi.kivity, mtosatti, pbonzini, linux-kernel,
	kvm, aarcange

On 10/24/2013 08:32 PM, Gleb Natapov wrote:
> On Thu, Oct 24, 2013 at 07:01:49PM +0800, Xiao Guangrong wrote:
>> On 10/24/2013 06:39 PM, Gleb Natapov wrote:
>>> On Thu, Oct 24, 2013 at 06:10:46PM +0800, Xiao Guangrong wrote:
>>>> On 10/24/2013 05:52 PM, Gleb Natapov wrote:
>>>>> On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
>>>>>> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
>>>>>>
>>>>>>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>>>>>>>>  {
>>>>>>>>  	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>>>>>>>  
>>>>>>>> -	return (struct kvm_mmu_page *)page_private(page);
>>>>>>>> +	return (struct kvm_mmu_page *)(page->mapping);
>>>>>>> Why?
>>>>>>
>>>>>> That's because page->private has been used by slab:
>>>>>>
>>>>> But does lockless path actually looks at it?
>>>>
>>>> Lockless path does not use it, however, it is used by kvm_mmu_page():
>>>>
>>>> static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>>>> {
>>>> 	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>>>
>>>> 	return (struct kvm_mmu_page *)(page->mapping);
>>>> }
>>>>
>>>> which is used in the common code.
>>> Ah, so the pointer is not available even after object is allocated.
>>> Make sense since we allocate object, not page here, but is it safe to
>>> use mapping like that?
>>
>> The commens says:
>>
>> 	struct address_space *mapping;	/* If low bit clear, points to
>> 					 * inode address_space, or NULL.
>> 					 * If page mapped as anonymous
>> 					 * memory, low bit is set, and
>> 					 * it points to anon_vma object:
>> 					 * see PAGE_MAPPING_ANON below.
>>
>> It seems mapping is used for address_space or anonymous memory, in
>> our case, the page is used by slab, so I guess it is ok. And the bug
>> i put in set_page_header() was not tiggered on both slab and slub.
>>
> Yeah, I also think so. I asked Andrea (copied) and he thinks that it is
> safe too currently, but things changes fast in this area. Andrea?
> Another option is too save slab_cache pointer and reset it before
> freeing the object but it looks ugly.

It's ugly but it isn't too bad. :)

Since kvm is currently used extensively to test and measure the linux kernel,
and the BUG_ON() in set_page_header() can help us detect the potential issue,
it would be easy to fix such a bug during the development cycle if 'mapping'
ever gets used by slab. If that really happens, maybe we can switch to your
way instead.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 00/15] KVM: MMU: locklessly write-protect
  2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
                   ` (14 preceding siblings ...)
  2013-10-23 13:29 ` [PATCH v3 15/15] KVM: MMU: use rcu functions to access the pointer Xiao Guangrong
@ 2013-11-03 12:29 ` Gleb Natapov
  2013-11-11  5:33   ` Xiao Guangrong
  15 siblings, 1 reply; 69+ messages in thread
From: Gleb Natapov @ 2013-11-03 12:29 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

Marcelo can you review it please?

On Wed, Oct 23, 2013 at 09:29:18PM +0800, Xiao Guangrong wrote:
> Changelog v3:
> - the changes from Gleb's review:
>   1) drop the patch which fixed the count of spte number in rmap since it
>      can not be easily fixed and it has gone after applying this patchset
> 
> - ideas from Gleb and discussion with Marcelo is also very appreciated:
>   2) change the way to locklessly access shadow page - use SLAB_DESTROY_BY_RCU
>      to protect shadow page instead of conditionally using call_rcu()
>   3) improve is_last_spte() that checks last spte by only using some bits on
>      the spte, then it is safely used when we locklessly write-protect the
>      shadow page table
> 
> Changelog v2:
> 
> - the changes from Gleb's review:
>   1) fix calculating the number of spte in the pte_list_add()
>   2) set iter->desc to NULL if meet a nulls desc to cleanup the code of
>      rmap_get_next()
>   3) fix hlist corruption due to accessing sp->hlish out of mmu-lock
>   4) use rcu functions to access the rcu protected pointer
>   5) spte will be missed in lockless walker if the spte is moved in a desc
>      (remove a spte from the rmap using only one desc). Fix it by bottom-up
>      walking the desc
> 
> - the changes from Paolo's review
>   1) make the order and memory barriers between update spte / add spte into
>      rmap and dirty-log more clear
>   
> - the changes from Marcelo's review:
>   1) let fast page fault only fix the spts on the last level (level = 1)
>   2) improve some changelogs and comments
> 
> - the changes from Takuya's review:
>   move the patch "flush tlb if the spte can be locklessly modified" forward
>   to make it's more easily merged
> 
> Thank all of you very much for your time and patience on this patchset!
>   
> Since we use rcu_assign_pointer() to update the points in desc even if dirty
> log is disabled, i have measured the performance:
> Host: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz * 12 + 36G memory
> 
> - migrate-perf (benchmark the time of get-dirty-log)
>   before: Run 10 times, Avg time:9009483 ns.
>   after: Run 10 times, Avg time:4807343 ns.
> 
> - kerbench
>   Guest: 12 VCPUs + 8G memory
>   before:
> EPT is enabled:
> # cat 09-05-origin-ept | grep real       
> real 85.58
> real 83.47
> real 82.95
> 
> EPT is disabled:
> # cat 09-05-origin-shadow | grep real
> real 138.77
> real 138.99
> real 139.55
> 
>   after:
> EPT is enabled:
> # cat 09-05-lockless-ept | grep real
> real 83.40
> real 82.81
> real 83.39
> 
> EPT is disabled:
> # cat 09-05-lockless-shadow | grep real
> real 138.91
> real 139.71
> real 138.94
> 
> No performance regression!
> 
> 
> 
> Background
> ==========
> Currently, when mark memslot dirty logged or get dirty page, we need to
> write-protect large guest memory, it is the heavy work, especially, we need to
> hold mmu-lock which is also required by vcpu to fix its page table fault and
> mmu-notifier when host page is being changed. In the extreme cpu / memory used
> guest, it becomes a scalability issue.
> 
> This patchset introduces a way to locklessly write-protect guest memory.
> 
> Idea
> ==========
> There are the challenges we meet and the ideas to resolve them.
> 
> 1) How to locklessly walk rmap?
> The first idea we got to prevent "desc" being freed when we are walking the
> rmap is using RCU. But when vcpu runs on shadow page mode or nested mmu mode,
> it updates the rmap really frequently.
> 
> So we uses SLAB_DESTROY_BY_RCU to manage "desc" instead, it allows the object
> to be reused more quickly. We also store a "nulls" in the last "desc"
> (desc->more) which can help us to detect whether the "desc" is moved to anther
> rmap then we can re-walk the rmap if that happened. I learned this idea from
> nulls-list.
> 
> Another issue is, when a spte is deleted from the "desc", another spte in the
> last "desc" will be moved to this position to replace the deleted one. If the
> deleted one has been accessed and we do not access the replaced one, the
> replaced one is missed when we do lockless walk.
> To fix this case, we do not backward move the spte, instead, we forward move
> the entry: when a spte is deleted, we move the entry in the first desc to that
> position.
> 
> 2) How to locklessly access shadow page table?
> It is easy if the handler is in the vcpu context, in that case we can use
> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
> disable interrupt to stop shadow page be freed. But we are on the ioctl context
> and the paths we are optimizing for have heavy workload, disabling interrupt is
> not good for the system performance.
> 
> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use
> call_rcu() to free the shadow page if that indicator is set. Set/Clear the
> indicator are protected by slot-lock, so it need not be atomic and does not
> hurt the performance and the scalability.
> 
> 3) How to locklessly write-protect guest memory?
> Currently, there are two behaviors when we write-protect guest memory, one is
> clearing the Writable bit on spte and the another one is dropping spte when it
> points to large page. The former is easy we only need to atomicly clear a bit
> but the latter is hard since we need to remove the spte from rmap. so we unify
> these two behaviors that only make the spte readonly. Making large spte
> readonly instead of nonpresent is also good for reducing jitter.
> 
> And we need to pay more attention on the order of making spte writable, adding
> spte into rmap and setting the corresponding bit on dirty bitmap since
> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty bitmap,
> we should ensure the writable spte can be found in rmap before the dirty bitmap
> is visible. Otherwise, we cleared the dirty bitmap and failed to write-protect
> the page.
> 
> Performance result
> ====================
> The performance result and the benchmark can be found at:
>   http://permalink.gmane.org/gmane.linux.kernel/1534876
> 
> Xiao Guangrong (15):
>   KVM: MMU: properly check last spte in fast_page_fault()
>   KVM: MMU: lazily drop large spte
>   KVM: MMU: flush tlb if the spte can be locklessly modified
>   KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
>   KVM: MMU: update spte and add it into rmap before dirty log
>   KVM: MMU: redesign the algorithm of pte_list
>   KVM: MMU: introduce nulls desc
>   KVM: MMU: introduce pte-list lockless walker
>   KVM: MMU: initialize the pointers in pte_list_desc properly
>   KVM: MMU: allocate shadow pages from slab
>   KVM: MMU: locklessly access shadow page under rcu protection
>   KVM: MMU: check last spte with unawareness of mapping level
>   KVM: MMU: locklessly write-protect the page
>   KVM: MMU: clean up spte_write_protect
>   KVM: MMU: use rcu functions to access the pointer
> 
>  arch/x86/include/asm/kvm_host.h |   7 +-
>  arch/x86/kvm/mmu.c              | 586 ++++++++++++++++++++++++++++++----------
>  arch/x86/kvm/mmu.h              |   6 +
>  arch/x86/kvm/mmu_audit.c        |   6 +-
>  arch/x86/kvm/paging_tmpl.h      |   6 +-
>  arch/x86/kvm/x86.c              |  34 ++-
>  6 files changed, 475 insertions(+), 170 deletions(-)
> 
> -- 
> 1.8.1.4

--
			Gleb.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 00/15] KVM: MMU: locklessly write-protect
  2013-11-03 12:29 ` [PATCH v3 00/15] KVM: MMU: locklessly write-protect Gleb Natapov
@ 2013-11-11  5:33   ` Xiao Guangrong
  0 siblings, 0 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-11-11  5:33 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: avi.kivity, mtosatti, pbonzini, linux-kernel, kvm

On 11/03/2013 08:29 PM, Gleb Natapov wrote:
> Marcelo can you review it please?
> 

Ping......

> On Wed, Oct 23, 2013 at 09:29:18PM +0800, Xiao Guangrong wrote:
>> Changelog v3:
>> - the changes from Gleb's review:
>>   1) drop the patch which fixed the count of spte number in rmap since it
>>      can not be easily fixed and it has gone after applying this patchset
>>
>> - ideas from Gleb and discussion with Marcelo is also very appreciated:
>>   2) change the way to locklessly access shadow page - use SLAB_DESTROY_BY_RCU
>>      to protect shadow page instead of conditionally using call_rcu()
>>   3) improve is_last_spte() that checks last spte by only using some bits on
>>      the spte, then it is safely used when we locklessly write-protect the
>>      shadow page table
>>
>> Changelog v2:
>>
>> - the changes from Gleb's review:
>>   1) fix calculating the number of spte in the pte_list_add()
>>   2) set iter->desc to NULL if meet a nulls desc to cleanup the code of
>>      rmap_get_next()
>>   3) fix hlist corruption due to accessing sp->hlish out of mmu-lock
>>   4) use rcu functions to access the rcu protected pointer
>>   5) spte will be missed in lockless walker if the spte is moved in a desc
>>      (remove a spte from the rmap using only one desc). Fix it by bottom-up
>>      walking the desc
>>
>> - the changes from Paolo's review
>>   1) make the order and memory barriers between update spte / add spte into
>>      rmap and dirty-log more clear
>>   
>> - the changes from Marcelo's review:
>>   1) let fast page fault only fix the spts on the last level (level = 1)
>>   2) improve some changelogs and comments
>>
>> - the changes from Takuya's review:
>>   move the patch "flush tlb if the spte can be locklessly modified" forward
>>   to make it's more easily merged
>>
>> Thank all of you very much for your time and patience on this patchset!
>>   
>> Since we use rcu_assign_pointer() to update the points in desc even if dirty
>> log is disabled, i have measured the performance:
>> Host: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz * 12 + 36G memory
>>
>> - migrate-perf (benchmark the time of get-dirty-log)
>>   before: Run 10 times, Avg time:9009483 ns.
>>   after: Run 10 times, Avg time:4807343 ns.
>>
>> - kerbench
>>   Guest: 12 VCPUs + 8G memory
>>   before:
>> EPT is enabled:
>> # cat 09-05-origin-ept | grep real       
>> real 85.58
>> real 83.47
>> real 82.95
>>
>> EPT is disabled:
>> # cat 09-05-origin-shadow | grep real
>> real 138.77
>> real 138.99
>> real 139.55
>>
>>   after:
>> EPT is enabled:
>> # cat 09-05-lockless-ept | grep real
>> real 83.40
>> real 82.81
>> real 83.39
>>
>> EPT is disabled:
>> # cat 09-05-lockless-shadow | grep real
>> real 138.91
>> real 139.71
>> real 138.94
>>
>> No performance regression!
>>
>>
>>
>> Background
>> ==========
>> Currently, when mark memslot dirty logged or get dirty page, we need to
>> write-protect large guest memory, it is the heavy work, especially, we need to
>> hold mmu-lock which is also required by vcpu to fix its page table fault and
>> mmu-notifier when host page is being changed. In the extreme cpu / memory used
>> guest, it becomes a scalability issue.
>>
>> This patchset introduces a way to locklessly write-protect guest memory.
>>
>> Idea
>> ==========
>> There are the challenges we meet and the ideas to resolve them.
>>
>> 1) How to locklessly walk rmap?
>> The first idea we got to prevent "desc" being freed when we are walking the
>> rmap is using RCU. But when vcpu runs on shadow page mode or nested mmu mode,
>> it updates the rmap really frequently.
>>
>> So we uses SLAB_DESTROY_BY_RCU to manage "desc" instead, it allows the object
>> to be reused more quickly. We also store a "nulls" in the last "desc"
>> (desc->more) which can help us to detect whether the "desc" is moved to anther
>> rmap then we can re-walk the rmap if that happened. I learned this idea from
>> nulls-list.
>>
>> Another issue is, when a spte is deleted from the "desc", another spte in the
>> last "desc" will be moved to this position to replace the deleted one. If the
>> deleted one has been accessed and we do not access the replaced one, the
>> replaced one is missed when we do lockless walk.
>> To fix this case, we do not backward move the spte, instead, we forward move
>> the entry: when a spte is deleted, we move the entry in the first desc to that
>> position.
>>
>> 2) How to locklessly access shadow page table?
>> It is easy if the handler is in the vcpu context, in that case we can use
>> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
>> disable interrupt to stop shadow page be freed. But we are on the ioctl context
>> and the paths we are optimizing for have heavy workload, disabling interrupt is
>> not good for the system performance.
>>
>> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use
>> call_rcu() to free the shadow page if that indicator is set. Set/Clear the
>> indicator are protected by slot-lock, so it need not be atomic and does not
>> hurt the performance and the scalability.
>>
>> 3) How to locklessly write-protect guest memory?
>> Currently, there are two behaviors when we write-protect guest memory, one is
>> clearing the Writable bit on spte and the another one is dropping spte when it
>> points to large page. The former is easy we only need to atomicly clear a bit
>> but the latter is hard since we need to remove the spte from rmap. so we unify
>> these two behaviors that only make the spte readonly. Making large spte
>> readonly instead of nonpresent is also good for reducing jitter.
>>
>> And we need to pay more attention on the order of making spte writable, adding
>> spte into rmap and setting the corresponding bit on dirty bitmap since
>> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty bitmap,
>> we should ensure the writable spte can be found in rmap before the dirty bitmap
>> is visible. Otherwise, we cleared the dirty bitmap and failed to write-protect
>> the page.
>>
>> Performance result
>> ====================
>> The performance result and the benchmark can be found at:
>>   http://permalink.gmane.org/gmane.linux.kernel/1534876
>>
>> Xiao Guangrong (15):
>>   KVM: MMU: properly check last spte in fast_page_fault()
>>   KVM: MMU: lazily drop large spte
>>   KVM: MMU: flush tlb if the spte can be locklessly modified
>>   KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
>>   KVM: MMU: update spte and add it into rmap before dirty log
>>   KVM: MMU: redesign the algorithm of pte_list
>>   KVM: MMU: introduce nulls desc
>>   KVM: MMU: introduce pte-list lockless walker
>>   KVM: MMU: initialize the pointers in pte_list_desc properly
>>   KVM: MMU: allocate shadow pages from slab
>>   KVM: MMU: locklessly access shadow page under rcu protection
>>   KVM: MMU: check last spte with unawareness of mapping level
>>   KVM: MMU: locklessly write-protect the page
>>   KVM: MMU: clean up spte_write_protect
>>   KVM: MMU: use rcu functions to access the pointer
>>
>>  arch/x86/include/asm/kvm_host.h |   7 +-
>>  arch/x86/kvm/mmu.c              | 586 ++++++++++++++++++++++++++++++----------
>>  arch/x86/kvm/mmu.h              |   6 +
>>  arch/x86/kvm/mmu_audit.c        |   6 +-
>>  arch/x86/kvm/paging_tmpl.h      |   6 +-
>>  arch/x86/kvm/x86.c              |  34 ++-
>>  6 files changed, 475 insertions(+), 170 deletions(-)
>>
>> -- 
>> 1.8.1.4
> 
> --
> 			Gleb.
> 
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 01/15] KVM: MMU: properly check last spte in fast_page_fault()
  2013-10-23 13:29 ` [PATCH v3 01/15] KVM: MMU: properly check last spte in fast_page_fault() Xiao Guangrong
@ 2013-11-12  0:25   ` Marcelo Tosatti
  0 siblings, 0 replies; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-12  0:25 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm

On Wed, Oct 23, 2013 at 09:29:19PM +0800, Xiao Guangrong wrote:
> Using sp->role.level instead of @level since @level is not got from the
> page table hierarchy
> 
> There is no issue in current code since the fast page fault currently only
> fixes the fault caused by dirty-log that is always on the last level
> (level = 1)
> 
> This patch makes the code more readable and avoids potential issue in the
> further development
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 40772ef..d2aacc2 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2798,9 +2798,9 @@ static bool page_fault_can_be_fast(u32 error_code)
>  }
>  
>  static bool
> -fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
> +fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> +			u64 *sptep, u64 spte)
>  {
> -	struct kvm_mmu_page *sp = page_header(__pa(sptep));
>  	gfn_t gfn;
>  
>  	WARN_ON(!sp->role.direct);
> @@ -2826,6 +2826,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  			    u32 error_code)
>  {
>  	struct kvm_shadow_walk_iterator iterator;
> +	struct kvm_mmu_page *sp;
>  	bool ret = false;
>  	u64 spte = 0ull;
>  
> @@ -2846,7 +2847,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  		goto exit;
>  	}
>  
> -	if (!is_last_spte(spte, level))
> +	sp = page_header(__pa(iterator.sptep));
> +	if (!is_last_spte(spte, sp->role.level))
>  		goto exit;
>  
>  	/*
> @@ -2872,7 +2874,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  	 * the gfn is not stable for indirect shadow page.
>  	 * See Documentation/virtual/kvm/locking.txt to get more detail.
>  	 */
> -	ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte);
> +	ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
>  exit:
>  	trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
>  			      spte, ret);
> -- 
> 1.8.1.4


Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 02/15] KVM: MMU: lazily drop large spte
  2013-10-23 13:29 ` [PATCH v3 02/15] KVM: MMU: lazily drop large spte Xiao Guangrong
@ 2013-11-12 22:44   ` Marcelo Tosatti
  0 siblings, 0 replies; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-12 22:44 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm

On Wed, Oct 23, 2013 at 09:29:20PM +0800, Xiao Guangrong wrote:
> Currently, kvm zaps the large spte if write-protected is needed, the later
> read can fault on that spte. Actually, we can make the large spte readonly
> instead of making them un-present, the page fault caused by read access can
> be avoided
> 
> The idea is from Avi:
> | As I mentioned before, write-protecting a large spte is a good idea,
> | since it moves some work from protect-time to fault-time, so it reduces
> | jitter.  This removes the need for the return value.
> 
> This version has fixed the issue reported in 6b73a9606, the reason of that
> issue is that fast_page_fault() directly sets the readonly large spte to
> writable but only dirty the first page into the dirty-bitmap that means
> other pages are missed. Fixed it by only the normal sptes (on the
> PT_PAGE_TABLE_LEVEL level) can be fast fixed
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 03/15] KVM: MMU: flush tlb if the spte can be locklessly modified
  2013-10-23 13:29 ` [PATCH v3 03/15] KVM: MMU: flush tlb if the spte can be locklessly modified Xiao Guangrong
@ 2013-11-13  0:10   ` Marcelo Tosatti
  0 siblings, 0 replies; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-13  0:10 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm

On Wed, Oct 23, 2013 at 09:29:21PM +0800, Xiao Guangrong wrote:
> Relax the tlb flush condition since we will write-protect the spte out of mmu
> lock. Note lockless write-protection only marks the writable spte to readonly
> and the spte can be writable only if both SPTE_HOST_WRITEABLE and
> SPTE_MMU_WRITEABLE are set (that are tested by spte_is_locklessly_modifiable)
> 
> This patch is used to avoid this kind of race:
> 
>       VCPU 0                         VCPU 1
> lockless wirte protection:
>       set spte.w = 0
>                                  lock mmu-lock
> 
>                                  write protection the spte to sync shadow page,
>                                  see spte.w = 0, then without flush tlb
> 
> 				 unlock mmu-lock
> 
>                                  !!! At this point, the shadow page can still be
>                                      writable due to the corrupt tlb entry
>      Flush all TLB
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
  2013-10-23 13:29 ` [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes Xiao Guangrong
@ 2013-11-14  0:36   ` Marcelo Tosatti
  2013-11-14  5:15     ` Xiao Guangrong
  0 siblings, 1 reply; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-14  0:36 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm

On Wed, Oct 23, 2013 at 09:29:22PM +0800, Xiao Guangrong wrote:
> Now we can flush all the TLBs out of the mmu lock without TLB corruption when
> write-proect the sptes, it is because:
> - we have marked large sptes readonly instead of dropping them that means we
>   just change the spte from writable to readonly so that we only need to care
>   the case of changing spte from present to present (changing the spte from
>   present to nonpresent will flush all the TLBs immediately), in other words,
>   the only case we need to care is mmu_spte_update()

Xiao,

Any code location which reads the writable bit in the spte and assumes that,
if it is not set, the translation which the spte refers to is not cached in a
remote CPU's TLB, can become buggy. (*)

It might be the case that right now it is not an issue, but it is so subtle
that it should be improved.

Can you add a fat comment on top of is_writeable_bit describing this?
(and explain why is_writable_pte users do not make an assumption
about (*)).

"Writeable bit of locklessly modifiable sptes might be cleared
but TLBs not flushed: so whenever reading locklessly modifiable sptes
you cannot assume TLBs are flushed".

For example this one is unclear:

                if (!can_unsync && is_writable_pte(*sptep))
                        goto set_pte;
And:

        if (!is_writable_pte(spte) &&
              !(pt_protect && spte_is_locklessly_modifiable(spte)))
                return false;

This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
serialized by a single mutex (if there were two mutexes, it would not be
safe). Can you add an assert to both
kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log
that slots_lock is held, and explain?
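
Something along these lines (untested, just to illustrate), in both
functions, together with the comment:

	/*
	 * Lockless spte write-protection and the TLB flush outside of
	 * mmu_lock are only safe because get_dirty_log and
	 * kvm_mmu_slot_remove_write_access are serialized by slots_lock.
	 */
	lockdep_assert_held(&kvm->slots_lock);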

So just improve the comments please, thanks (no need to resend whole
series).

> - in mmu_spte_update(), we haved checked
>   SPTE_HOST_WRITEABLE | PTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, that
>   means it does not depend on PT_WRITABLE_MASK anymore
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c | 18 ++++++++++++++----
>  arch/x86/kvm/x86.c |  9 +++++++--
>  2 files changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 62f18ec..337d173 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4273,15 +4273,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
>  			if (*rmapp)
>  				__rmap_write_protect(kvm, rmapp, false);
>  
> -			if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> -				kvm_flush_remote_tlbs(kvm);
> +			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>  				cond_resched_lock(&kvm->mmu_lock);
> -			}
>  		}
>  	}
>  
> -	kvm_flush_remote_tlbs(kvm);
>  	spin_unlock(&kvm->mmu_lock);
> +
> +	/*
> +	 * We can flush all the TLBs out of the mmu lock without TLB
> +	 * corruption since we just change the spte from writable to
> +	 * readonly so that we only need to care the case of changing
> +	 * spte from present to present (changing the spte from present
> +	 * to nonpresent will flush all the TLBs immediately), in other
> +	 * words, the only case we care is mmu_spte_update() where we
> +	 * haved checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
> +	 * instead of PT_WRITABLE_MASK, that means it does not depend
> +	 * on PT_WRITABLE_MASK anymore.
> +	 */
> +	kvm_flush_remote_tlbs(kvm);
>  }
>  
>  #define BATCH_ZAP_PAGES	10
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b3aa650..573c6b3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3571,11 +3571,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>  		offset = i * BITS_PER_LONG;
>  		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
>  	}
> -	if (is_dirty)
> -		kvm_flush_remote_tlbs(kvm);
>  
>  	spin_unlock(&kvm->mmu_lock);
>  
> +	/*
> +	 * All the TLBs can be flushed out of mmu lock, see the comments in
> +	 * kvm_mmu_slot_remove_write_access().
> +	 */
> +	if (is_dirty)
> +		kvm_flush_remote_tlbs(kvm);
> +
>  	r = -EFAULT;
>  	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
>  		goto out;
> -- 
> 1.8.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
  2013-11-14  0:36   ` Marcelo Tosatti
@ 2013-11-14  5:15     ` Xiao Guangrong
  2013-11-14 18:39       ` Marcelo Tosatti
  0 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-11-14  5:15 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm


Hi Marcelo,

On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:

> 
> Any code location which reads the writable bit in the spte and assumes if its not
> set, that the translation which the spte refers to is not cached in a
> remote CPU's TLB can become buggy. (*)
> 
> It might be the case that now its not an issue, but its so subtle that
> it should be improved.
> 
> Can you add a fat comment on top of is_writeable_bit describing this?
> (and explain why is_writable_pte users do not make an assumption
> about (*). 
> 
> "Writeable bit of locklessly modifiable sptes might be cleared
> but TLBs not flushed: so whenever reading locklessly modifiable sptes
> you cannot assume TLBs are flushed".
> 
> For example this one is unclear:
> 
>                 if (!can_unsync && is_writable_pte(*sptep))
>                         goto set_pte;
> And:
> 
>         if (!is_writable_pte(spte) &&
>               !(pt_protect && spte_is_locklessly_modifiable(spte)))
>                 return false;
> 
> This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
> serialized by a single mutex (if there were two mutexes, it would not be
> safe). Can you add an assert to both
> kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log 
> for (slots_lock) is locked, and explain?
> 
> So just improve the comments please, thanks (no need to resend whole
> series).

Thank you very much for taking the time to review it; I really appreciate
that you detailed the issue so clearly for me.

I will do it on top of this patchset or after it is merged
(if that is possible).





^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
  2013-11-14  5:15     ` Xiao Guangrong
@ 2013-11-14 18:39       ` Marcelo Tosatti
  2013-11-15  7:09         ` Xiao Guangrong
  0 siblings, 1 reply; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-14 18:39 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm

On Thu, Nov 14, 2013 at 01:15:24PM +0800, Xiao Guangrong wrote:
> 
> Hi Marcelo,
> 
> On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:
> 
> > 
> > Any code location which reads the writable bit in the spte and assumes if its not
> > set, that the translation which the spte refers to is not cached in a
> > remote CPU's TLB can become buggy. (*)
> > 
> > It might be the case that now its not an issue, but its so subtle that
> > it should be improved.
> > 
> > Can you add a fat comment on top of is_writeable_bit describing this?
> > (and explain why is_writable_pte users do not make an assumption
> > about (*). 
> > 
> > "Writeable bit of locklessly modifiable sptes might be cleared
> > but TLBs not flushed: so whenever reading locklessly modifiable sptes
> > you cannot assume TLBs are flushed".
> > 
> > For example this one is unclear:
> > 
> >                 if (!can_unsync && is_writable_pte(*sptep))
> >                         goto set_pte;
> > And:
> > 
> >         if (!is_writable_pte(spte) &&
> >               !(pt_protect && spte_is_locklessly_modifiable(spte)))
> >                 return false;
> > 
> > This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
> > serialized by a single mutex (if there were two mutexes, it would not be
> > safe). Can you add an assert to both
> > kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log 
> > for (slots_lock) is locked, and explain?
> > 
> > So just improve the comments please, thanks (no need to resend whole
> > series).
> 
> Thank you very much for your time to review it and really appreciate
> for you detailed the issue so clearly to me.
> 
> I will do it on the top of this patchset or after it is merged
> (if it's possiable).

Ok, can you explain why every individual caller of is_writable_pte has
no such assumption now? (The one mentioned above is not clear to me, for
example; please explain all of them.)

OK to improve comments later.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 05/15] KVM: MMU: update spte and add it into rmap before dirty log
  2013-10-23 13:29 ` [PATCH v3 05/15] KVM: MMU: update spte and add it into rmap before dirty log Xiao Guangrong
@ 2013-11-15  0:08   ` Marcelo Tosatti
  0 siblings, 0 replies; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-15  0:08 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm

On Wed, Oct 23, 2013 at 09:29:23PM +0800, Xiao Guangrong wrote:
> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the its dirty
> bitmap, so we should ensure the writable spte can be found in rmap before the
> dirty bitmap is visible. Otherwise, we clear the dirty bitmap but fail to
> write-protect the page which is detailed in the comments in this patch
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c | 84 ++++++++++++++++++++++++++++++++++++++++++------------
>  arch/x86/kvm/x86.c | 10 +++++++
>  2 files changed, 76 insertions(+), 18 deletions(-)

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
  2013-11-14 18:39       ` Marcelo Tosatti
@ 2013-11-15  7:09         ` Xiao Guangrong
  2013-11-19  0:19           ` Marcelo Tosatti
  0 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-11-15  7:09 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm

On 11/15/2013 02:39 AM, Marcelo Tosatti wrote:
> On Thu, Nov 14, 2013 at 01:15:24PM +0800, Xiao Guangrong wrote:
>>
>> Hi Marcelo,
>>
>> On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:
>>
>>>
>>> Any code location which reads the writable bit in the spte and assumes if its not
>>> set, that the translation which the spte refers to is not cached in a
>>> remote CPU's TLB can become buggy. (*)
>>>
>>> It might be the case that now its not an issue, but its so subtle that
>>> it should be improved.
>>>
>>> Can you add a fat comment on top of is_writeable_bit describing this?
>>> (and explain why is_writable_pte users do not make an assumption
>>> about (*). 
>>>
>>> "Writeable bit of locklessly modifiable sptes might be cleared
>>> but TLBs not flushed: so whenever reading locklessly modifiable sptes
>>> you cannot assume TLBs are flushed".
>>>
>>> For example this one is unclear:
>>>
>>>                 if (!can_unsync && is_writable_pte(*sptep))
>>>                         goto set_pte;
>>> And:
>>>
>>>         if (!is_writable_pte(spte) &&
>>>               !(pt_protect && spte_is_locklessly_modifiable(spte)))
>>>                 return false;
>>>
>>> This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
>>> serialized by a single mutex (if there were two mutexes, it would not be
>>> safe). Can you add an assert to both
>>> kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log 
>>> for (slots_lock) is locked, and explain?
>>>
>>> So just improve the comments please, thanks (no need to resend whole
>>> series).
>>
>> Thank you very much for your time to review it and really appreciate
>> for you detailed the issue so clearly to me.
>>
>> I will do it on the top of this patchset or after it is merged
>> (if it's possiable).
> 
> Ok, can you explain why every individual caller of is_writable_pte have
> no such assumption now? (the one mentioned above is not clear to me for
> example, should explain all of them).

Okay.

Generally speaking, 1) we need not care much about readonly sptes since they
cannot be locklessly write-protected, and 2) if is_writable_pte() is used
to check the mmu-mode's state, we can check SPTE_MMU_WRITEABLE instead.
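
For reference, the lockless-modifiable check relied on in 2) is roughly
(as of this series):

	static bool spte_is_locklessly_modifiable(u64 spte)
	{
		return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
			(SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
	}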

These are the places where is_writable_pte() is used:
1) in spte_has_volatile_bits():
 527 static bool spte_has_volatile_bits(u64 spte)
 528 {
 529         /*
 530          * Always atomicly update spte if it can be updated
 531          * out of mmu-lock, it can ensure dirty bit is not lost,
 532          * also, it can help us to get a stable is_writable_pte()
 533          * to ensure tlb flush is not missed.
 534          */
 535         if (spte_is_locklessly_modifiable(spte))
 536                 return true;
 537
 538         if (!shadow_accessed_mask)
 539                 return false;
 540
 541         if (!is_shadow_present_pte(spte))
 542                 return false;
 543
 544         if ((spte & shadow_accessed_mask) &&
 545               (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
 546                 return false;
 547
 548         return true;
 549 }

This path is not broken, since any spte that can be locklessly modified will
be updated atomically (we always return 'true' at line 536).

2): in mmu_spte_update()
594         /*
 595          * For the spte updated out of mmu-lock is safe, since
 596          * we always atomicly update it, see the comments in
 597          * spte_has_volatile_bits().
 598          */
 599         if (spte_is_locklessly_modifiable(old_spte) &&
 600               !is_writable_pte(new_spte))
 601                 ret = true;

The new_spte is a temporary value that cannot be observed by the lockless
write-protection path, and !is_writable_pte() is stable enough (a readonly
spte cannot be locklessly write-protected).

3) in spte_write_protect()
1368         if (!is_writable_pte(spte) &&
1369               !spte_is_locklessly_modifiable(spte))
1370                 return false;
1371

It always does write-protection if the spte is locklessly modifiable.
(This is how the code looks after applying the whole patchset; the code is also
safe before patch "[PATCH v3 14/15] KVM: MMU: clean up spte_write_protect" since
the lockless write-protection path is serialized by a single lock.)

4) in set_spte()
2690                 /*
2691                  * Optimization: for pte sync, if spte was writable the hash
2692                  * lookup is unnecessary (and expensive). Write protection
2693                  * is responsibility of mmu_get_page / kvm_sync_page.
2694                  * Same reasoning can be applied to dirty page accounting.
2695                  */
2696                 if (!can_unsync && is_writable_pte(*sptep))
2697                         goto set_pte;

It is used for an optimization, and the worst case is that the optimization is
disabled (we walk the shadow pages in the hash table) when the spte has been
locklessly write-protected. It does not hurt anything since that is a rare
event, and the optimization can be restored if we check SPTE_MMU_WRITEABLE
instead.

5) fast_page_fault()
3110         /*
3111          * Check if it is a spurious fault caused by TLB lazily flushed.
3112          *
3113          * Need not check the access of upper level table entries since
3114          * they are always ACC_ALL.
3115          */
3116          if (is_writable_pte(spte)) {
3117                 ret = true;
3118                 goto exit;
3119         }

Since kvm_vm_ioctl_get_dirty_log() first gets and clears the dirty bitmap
before doing the write-protection, the dirty bitmap will be properly set again
when fast_page_fault fixes a spte that was write-protected by the lockless
write-protection.

6) in fast_page_fault's tracepoint:
244 #define __spte_satisfied(__spte)                                \
245         (__entry->retry && is_writable_pte(__entry->__spte))
This can make the tracepoint report the wrong result when fast_page_fault
and tdp_page_fault/lockless write-protection run concurrently; I guess that is
okay since it is only used for tracing.

7) in audit_write_protection():
202                 if (is_writable_pte(*sptep))
203                         audit_printk(kvm, "shadow page has writable "
204                                      "mappings: gfn %llx role %x\n",
205                                      sp->gfn, sp->role.word);
It's okay since lockless-write-protection does not update the readonly sptes.

> 
> OK to improve comments later.

Thank you, Marcelo!



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
  2013-11-15  7:09         ` Xiao Guangrong
@ 2013-11-19  0:19           ` Marcelo Tosatti
  0 siblings, 0 replies; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-19  0:19 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm

On Fri, Nov 15, 2013 at 03:09:13PM +0800, Xiao Guangrong wrote:
> On 11/15/2013 02:39 AM, Marcelo Tosatti wrote:
> > On Thu, Nov 14, 2013 at 01:15:24PM +0800, Xiao Guangrong wrote:
> >>
> >> Hi Marcelo,
> >>
> >> On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:
> >>
> >>>
> >>> Any code location which reads the writable bit in the spte and assumes if its not
> >>> set, that the translation which the spte refers to is not cached in a
> >>> remote CPU's TLB can become buggy. (*)
> >>>
> >>> It might be the case that now its not an issue, but its so subtle that
> >>> it should be improved.
> >>>
> >>> Can you add a fat comment on top of is_writeable_bit describing this?
> >>> (and explain why is_writable_pte users do not make an assumption
> >>> about (*). 
> >>>
> >>> "Writeable bit of locklessly modifiable sptes might be cleared
> >>> but TLBs not flushed: so whenever reading locklessly modifiable sptes
> >>> you cannot assume TLBs are flushed".
> >>>
> >>> For example this one is unclear:
> >>>
> >>>                 if (!can_unsync && is_writable_pte(*sptep))
> >>>                         goto set_pte;
> >>> And:
> >>>
> >>>         if (!is_writable_pte(spte) &&
> >>>               !(pt_protect && spte_is_locklessly_modifiable(spte)))
> >>>                 return false;
> >>>
> >>> This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
> >>> serialized by a single mutex (if there were two mutexes, it would not be
> >>> safe). Can you add an assert to both
> >>> kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log 
> >>> for (slots_lock) is locked, and explain?
> >>>
> >>> So just improve the comments please, thanks (no need to resend whole
> >>> series).
> >>
> >> Thank you very much for your time to review it and really appreciate
> >> for you detailed the issue so clearly to me.
> >>
> >> I will do it on the top of this patchset or after it is merged
> >> (if it's possiable).
> > 
> > Ok, can you explain why every individual caller of is_writable_pte have
> > no such assumption now? (the one mentioned above is not clear to me for
> > example, should explain all of them).
> 
> Okay.

OK, thanks for checking.

> Generally speak, we 1) needn't care readonly spte too much since it
> can not be locklessly write-protected and 2) if is_writable_pte() is used
> to check mmu-mode's state we can check SPTE_MMU_WRITEABLE instead.
> 
> There are the places is_writable_pte is used:
> 1) in spte_has_volatile_bits():
>  527 static bool spte_has_volatile_bits(u64 spte)
>  528 {
>  529         /*
>  530          * Always atomicly update spte if it can be updated
>  531          * out of mmu-lock, it can ensure dirty bit is not lost,
>  532          * also, it can help us to get a stable is_writable_pte()
>  533          * to ensure tlb flush is not missed.
>  534          */
>  535         if (spte_is_locklessly_modifiable(spte))
>  536                 return true;
>  537
>  538         if (!shadow_accessed_mask)
>  539                 return false;
>  540
>  541         if (!is_shadow_present_pte(spte))
>  542                 return false;
>  543
>  544         if ((spte & shadow_accessed_mask) &&
>  545               (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
>  546                 return false;
>  547
>  548         return true;
>  549 }
> 
> this path is not broken since any spte can be lockless modifiable will do
> lockless update (will always return 'true' in  the line 536).
> 
> 2): in mmu_spte_update()
> 594         /*
>  595          * For the spte updated out of mmu-lock is safe, since
>  596          * we always atomicly update it, see the comments in
>  597          * spte_has_volatile_bits().
>  598          */
>  599         if (spte_is_locklessly_modifiable(old_spte) &&
>  600               !is_writable_pte(new_spte))
>  601                 ret = true;
> 
> The new_spte is a temp value that can not be fetched by lockless
> write-protection and !is_writable_pte() is stable enough (can not be
> locklessly write-protected).
> 
> 3) in spte_write_protect()
> 1368         if (!is_writable_pte(spte) &&
> 1369               !spte_is_locklessly_modifiable(spte))
> 1370                 return false;
> 1371
> 
> It always do write-protection if the spte is lockelss modifiable.
> (This code is the aspect after applying the whole pachset, the code is safe too
> before patch "[PATCH v3 14/15] KVM: MMU: clean up spte_write_protect" since
> the lockless write-protection path is serialized by a single lock.).
> 
> 4) in set_spte()
> 2690                 /*
> 2691                  * Optimization: for pte sync, if spte was writable the hash
> 2692                  * lookup is unnecessary (and expensive). Write protection
> 2693                  * is responsibility of mmu_get_page / kvm_sync_page.
> 2694                  * Same reasoning can be applied to dirty page accounting.
> 2695                  */
> 2696                 if (!can_unsync && is_writable_pte(*sptep))
> 2697                         goto set_pte;
> 
> It is used for a optimization and the worst case is the optimization is disabled
> (walking the shadow pages in the hast table) when the spte has been locklessly
> write-protected. It does not hurt anything since it is a rare event. And the
> optimization can be back if we check SPTE_MMU_WRITEABLE instead.
> 
> 5) fast_page_fault()
> 3110         /*
> 3111          * Check if it is a spurious fault caused by TLB lazily flushed.
> 3112          *
> 3113          * Need not check the access of upper level table entries since
> 3114          * they are always ACC_ALL.
> 3115          */
> 3116          if (is_writable_pte(spte)) {
> 3117                 ret = true;
> 3118                 goto exit;
> 3119         }
> 
> Since kvm_vm_ioctl_get_dirty_log() firstly get-and-clear dirty-bitmap before
> do write-protect, the dirty-bitmap will be properly set again when fast_page_fault
> fix the spte who is write-protected by lockless write-protection.
> 
> 6) in fast_page_fault's tracepoint:
> 244 #define __spte_satisfied(__spte)                                \
> 245         (__entry->retry && is_writable_pte(__entry->__spte))
> It causes the tracepoint reports the wrong result when fast_page_fault
> and tdp_page_fault/lockless-write-protect run concurrently, i guess it's
> okay since it's only used for trace.
> 
> 7) in audit_write_protection():
> 202                 if (is_writable_pte(*sptep))
> 203                         audit_printk(kvm, "shadow page has writable "
> 204                                      "mappings: gfn %llx role %x\n",
> 205                                      sp->gfn, sp->role.word);
> It's okay since lockless-write-protection does not update the readonly sptes.
> 
> > 
> > OK to improve comments later.
> 
> Thank you, Marcelo!
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 06/15] KVM: MMU: redesign the algorithm of pte_list
  2013-10-23 13:29 ` [PATCH v3 06/15] KVM: MMU: redesign the algorithm of pte_list Xiao Guangrong
@ 2013-11-19  0:48   ` Marcelo Tosatti
  0 siblings, 0 replies; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-19  0:48 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm

On Wed, Oct 23, 2013 at 09:29:24PM +0800, Xiao Guangrong wrote:
> Change the algorithm to:
> 1) always add new desc to the first desc (pointed by parent_ptes/rmap)
>    that is good to implement rcu-nulls-list-like lockless rmap
>    walking
> 
> 2) always move the entry in the first desc to the the position we want
>    to remove when delete a spte in the parent_ptes/rmap (backward-move).
>    It is good for us to implement lockless rmap walk since in the current
>    code, when a spte is deleted from the "desc", another spte in the last
>    "desc" will be moved to this position to replace the deleted one. If the
>    deleted one has been accessed and we do not access the replaced one, the
>    replaced one is missed when we do lockless walk.
>    To fix this case, we do not backward move the spte, instead, we forward
>    move the entry: when a spte is deleted, we move the entry in the first
>    desc to that position
> 
> Both of these also can reduce cache miss
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-10-23 13:29 ` [PATCH v3 07/15] KVM: MMU: introduce nulls desc Xiao Guangrong
@ 2013-11-22 19:14   ` Marcelo Tosatti
  2013-11-25  6:11     ` Xiao Guangrong
  2013-11-25  9:31     ` Peter Zijlstra
  0 siblings, 2 replies; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-22 19:14 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm, Eric Dumazet,
	Peter Zijlstra

On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
> It likes nulls list and we use the pte-list as the nulls which can help us to
> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
> if that happened
> 
> kvm->slots_lock is held when we do lockless walking that prevents rmap
> is reused (free rmap need to hold that lock) so that we can not see the same
> nulls used on different rmaps
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>

How about a simplified lockless walk on the slot while the rmapp entry
contains a single spte? (Which should be the case with two-dimensional
paging.)

That is, grab the lock when finding an rmap with more than one spte in
it (and then keep it locked until the end).

For example, nothing prevents the lockless walker from moving into some
parent_ptes chain, right?

Also, there is no guarantee of termination (as long as sptes are
deleted with the correct timing). BTW, I can't see any guarantee of
termination for rculist nulls either (a writer can race with a lockless
reader indefinitely, restarting the lockless walk every time).


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-22 19:14   ` Marcelo Tosatti
@ 2013-11-25  6:11     ` Xiao Guangrong
  2013-11-25  6:29       ` Xiao Guangrong
                         ` (3 more replies)
  2013-11-25  9:31     ` Peter Zijlstra
  1 sibling, 4 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-11-25  6:11 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra


On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
>> It likes nulls list and we use the pte-list as the nulls which can help us to
>> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
>> if that happened
>> 
>> kvm->slots_lock is held when we do lockless walking that prevents rmap
>> is reused (free rmap need to hold that lock) so that we can not see the same
>> nulls used on different rmaps
>> 
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> 
> How about simplified lockless walk on the slot while rmapp entry
> contains a single spte? (which should be the case with two-dimensional
> paging).
> 
> That is, grab the lock when finding a rmap with more than one spte in
> it (and then keep it locked until the end).

Hmm… that isn't straightforward and is more complex than the approach
in this patchset. Also, it would drop the improvement for the shadow mmu,
which benefits greatly from this patchset.

> 
> For example, nothing prevents lockless walker to move into some
> parent_ptes chain, right?

No.

The nulls can help us to detect this case: for parent_ptes, the nulls points
to the "shadow page", but for rmaps, the nulls points to slot.arch.rmap. There
is no chance that the “rmap" is used as a shadow page while the slot-lock is held.

> 
> Also, there is no guarantee of termination (as long as sptes are
> deleted with the correct timing). BTW, can't see any guarantee of
> termination for rculist nulls either (a writer can race with a lockless
> reader indefinately, restarting the lockless walk every time).

Hmm, that can be avoided by checking the dirty-bitmap before the rewalk;
that is, if the dirty-bitmap has been set during the lockless write-protection,
it is unnecessary to write-protect its sptes. What do you think?

But… do we really need to care about it? :(







^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25  6:11     ` Xiao Guangrong
@ 2013-11-25  6:29       ` Xiao Guangrong
  2013-11-25 18:12         ` Marcelo Tosatti
  2013-11-25 10:19       ` Gleb Natapov
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-11-25  6:29 UTC (permalink / raw)
  To: Xiao Guangrong, Marcelo Tosatti
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On 11/25/2013 02:11 PM, Xiao Guangrong wrote:
> 
> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
>> On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
>>> It likes nulls list and we use the pte-list as the nulls which can help us to
>>> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
>>> if that happened
>>>
>>> kvm->slots_lock is held when we do lockless walking that prevents rmap
>>> is reused (free rmap need to hold that lock) so that we can not see the same
>>> nulls used on different rmaps
>>>
>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>>
>> How about simplified lockless walk on the slot while rmapp entry
>> contains a single spte? (which should be the case with two-dimensional
>> paging).
>>
>> That is, grab the lock when finding a rmap with more than one spte in
>> it (and then keep it locked until the end).
> 
> Hmm… that isn't straightforward and more complex than the approach
> in this patchset. Also it can drop the improvement for shadow mmu that
> gets great improvement by this patchset.
> 
>>
>> For example, nothing prevents lockless walker to move into some
>> parent_ptes chain, right?
> 
> No.
> 
> The nulls can help us to detect this case, for parent_ptes, the nulls points
> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
> is no chance that the “rmap" is used as shadow page when slot-lock is held.
> 
>>
>> Also, there is no guarantee of termination (as long as sptes are
>> deleted with the correct timing). BTW, can't see any guarantee of
>> termination for rculist nulls either (a writer can race with a lockless
>> reader indefinately, restarting the lockless walk every time).
> 
> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> that means, if the dirty-bitmap has been set during lockless write-protection,
> it’s unnecessary to write-protect its sptes. Your idea?

This idea is based on the fact that the number of sptes in a rmap is limited
by RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
we can stop the rewalk at once; in the case of deleting, we rewalk at most
RMAP_RECYCLE_THRESHOLD times.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-22 19:14   ` Marcelo Tosatti
  2013-11-25  6:11     ` Xiao Guangrong
@ 2013-11-25  9:31     ` Peter Zijlstra
  2013-11-25 10:59       ` Xiao Guangrong
  1 sibling, 1 reply; 69+ messages in thread
From: Peter Zijlstra @ 2013-11-25  9:31 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Xiao Guangrong, gleb, avi.kivity, pbonzini, linux-kernel, kvm,
	Eric Dumazet

On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
> Also, there is no guarantee of termination (as long as sptes are
> deleted with the correct timing). BTW, can't see any guarantee of
> termination for rculist nulls either (a writer can race with a lockless
> reader indefinately, restarting the lockless walk every time).

What's an rculist null? rculists have regular termination conditions,
they'll reach the end (also head, due to circular etc..) in N steps,
where N is the number of elements.

Of course you can keep adding elements to protract this situation, but
you'll run out of memory eventually -- you also have to make sure to
insert them after the element being read, otherwise the iteration will
simply miss them.

Deleting an element preserves the element itself -- it has to be
RCU-freed to be part of an rculist, and the element also preserves its
fwd link, so any iterator will either not see the element, or if they're
at the element, they'll continue their iteration as normal (rculist
doesn't have backward iterators).

A deleted element may not be re-used before an RCU grace period has
lapsed. Same as for freeing such an element. So if you want to move an
rculist element you need to:

  list_del_rcu()
  synchronize_rcu();
  list_add_rcu();

Or use the list_splice_init_rcu() primitive which also explicitly takes
a @sync argument.
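
IOW, the simple move looks like this (illustrative only, 'struct foo' is
made up):

	struct foo {
		struct list_head link;
		int val;
	};

	/* move @f from whatever rculist it is on to @dst */
	static void foo_move(struct foo *f, struct list_head *dst)
	{
		list_del_rcu(&f->link);
		synchronize_rcu();	/* wait out current lockless readers */
		list_add_rcu(&f->link, dst);
	}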



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25  6:11     ` Xiao Guangrong
  2013-11-25  6:29       ` Xiao Guangrong
@ 2013-11-25 10:19       ` Gleb Natapov
  2013-11-25 10:25         ` Xiao Guangrong
  2013-11-25 12:48       ` Avi Kivity
  2013-11-25 14:08       ` Marcelo Tosatti
  3 siblings, 1 reply; 69+ messages in thread
From: Gleb Natapov @ 2013-11-25 10:19 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Marcelo Tosatti, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
> > 
> > For example, nothing prevents lockless walker to move into some
> > parent_ptes chain, right?
> 
> No.
> 
> The nulls can help us to detect this case, for parent_ptes, the nulls points
> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
> is no chance that the “rmap" is used as shadow page when slot-lock is held.
> 
But meanwhile we will write protect non-last level sptes, no? Better to
create separate slab caches for rmap and parent_ptes lists.
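
Something like this (untested; the cache names are only for illustration,
and presumably only the rmap cache needs to keep SLAB_DESTROY_BY_RCU for
the lockless walk):

	rmap_desc_cache = kmem_cache_create("kvm_rmap_desc",
					    sizeof(struct pte_list_desc),
					    0, SLAB_DESTROY_BY_RCU, NULL);
	parent_desc_cache = kmem_cache_create("kvm_parent_desc",
					      sizeof(struct pte_list_desc),
					      0, 0, NULL);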

--
			Gleb.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25 10:19       ` Gleb Natapov
@ 2013-11-25 10:25         ` Xiao Guangrong
  0 siblings, 0 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-11-25 10:25 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Marcelo Tosatti, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On 11/25/2013 06:19 PM, Gleb Natapov wrote:
> On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
>>>
>>> For example, nothing prevents lockless walker to move into some
>>> parent_ptes chain, right?
>>
>> No.
>>
>> The nulls can help us to detect this case, for parent_ptes, the nulls points
>> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
>> is no chance that the “rmap" is used as shadow page when slot-lock is held.
>>
> But meanwhile we will write protect non-last level sptes, no? Better to

It will meet non-last-level sptes but does not write-protect them, since
we do an is_last_spte() check before the cmpxchg.
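
i.e. the lockless side does roughly this (simplified sketch;
spte_write_protect_lockless is just an illustrative name here):

	static bool spte_write_protect_lockless(u64 *sptep, int level)
	{
		u64 spte = ACCESS_ONCE(*sptep);

		if (!is_last_spte(spte, level) || !is_writable_pte(spte))
			return false;

		return cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte;
	}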

> create separate slab caches for rmap and parent_ptes lists.

Yes, this is a good idea. Thank you, Gleb!


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25  9:31     ` Peter Zijlstra
@ 2013-11-25 10:59       ` Xiao Guangrong
  2013-11-25 11:05         ` Peter Zijlstra
  0 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-11-25 10:59 UTC (permalink / raw)
  To: Peter Zijlstra, Marcelo Tosatti
  Cc: gleb, avi.kivity, pbonzini, linux-kernel, kvm, Eric Dumazet


Hi Peter,

On 11/25/2013 05:31 PM, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
>> Also, there is no guarantee of termination (as long as sptes are
>> deleted with the correct timing). BTW, can't see any guarantee of
>> termination for rculist nulls either (a writer can race with a lockless
>> reader indefinately, restarting the lockless walk every time).
> 
> What's an rculist null? 

I guess Marcelo was talking about rculist_nulls.h
(Documentation/RCU/rculist_nulls.txt).

> rculists have regular termination conditions,
> they'll reach the end (also head, due to circular etc..) in N steps,
> where N is the number of elements.
> 
> Of course you can keep adding elements to protract this situation, but
> you'll run out of memory eventually -- you also have to make sure to
> insert them after the element being read, otherwise the iteration will
> simply miss them.
> 
> Deleting an element preserves the element itself -- it has to be
> RCU-freed to be part of an rculist, and the element also preserves its
> fwd link, so any iterator will either not see the element, or if they're
> at the element, they'll continue their iteration as normal (rculist
> doesn't have backward iterators).
> 
> A deleted element may not be re-used before an RCU grace period has
> lapsed. Same as for freeing such an element. So if you want to move an
> rculist element you need to:
> 
>   list_del_rcu()
>   synchronize_rcu();
>   list_add_rcu();
> 
> Or use the list_splice_init_rcu() primitive which also explicitly takes
> a @sync argument.

Thanks for your detailed explanation, Peter!

What if the element is allocated from a SLAB_DESTROY_BY_RCU slab? That
means the element may be reused while we are iterating.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25 10:59       ` Xiao Guangrong
@ 2013-11-25 11:05         ` Peter Zijlstra
  2013-11-25 11:29           ` Peter Zijlstra
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Zijlstra @ 2013-11-25 11:05 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Marcelo Tosatti, gleb, avi.kivity, pbonzini, linux-kernel, kvm,
	Eric Dumazet

On Mon, Nov 25, 2013 at 06:59:24PM +0800, Xiao Guangrong wrote:
> 
> Hi Peter,
> 
> On 11/25/2013 05:31 PM, Peter Zijlstra wrote:
> > On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
> >> Also, there is no guarantee of termination (as long as sptes are
> >> deleted with the correct timing). BTW, can't see any guarantee of
> >> termination for rculist nulls either (a writer can race with a lockless
> >> reader indefinately, restarting the lockless walk every time).
> > 
> > What's an rculist null? 
> 
> I guess Marcelo was talking about rculist_nulls.h
> (Documentation/RCU/rculist_nulls.txt).

Oh, let me have a look, I don't think I've really looked at that yet.

> > rculists have regular termination conditions,
> > they'll reach the end (also head, due to circular etc..) in N steps,
> > where N is the number of elements.
> > 
> > Of course you can keep adding elements to protract this situation, but
> > you'll run out of memory eventually -- you also have to make sure to
> > insert them after the element being read, otherwise the iteration will
> > simply miss them.
> > 
> > Deleting an element preserves the element itself -- it has to be
> > RCU-freed to be part of an rculist, and the element also preserves its
> > fwd link, so any iterator will either not see the element, or if they're
> > at the element, they'll continue their iteration as normal (rculist
> > doesn't have backward iterators).
> > 
> > A deleted element may not be re-used before an RCU grace period has
> > lapsed. Same as for freeing such an element. So if you want to move an
> > rculist element you need to:
> > 
> >   list_del_rcu()
> >   synchronize_rcu();
> >   list_add_rcu();
> > 
> > Or use the list_splice_init_rcu() primitive which also explicitly takes
> > a @sync argument.
> 
> Thanks for your detailed explanation, Peter!
> 
> What about if the element is allocated from SLAB_DESTROY_BY_RCU slab? That
> means the element may be reused while do iteration.

Then it's broken; SLAB_DESTROY_BY_RCU rarely does what people expect it
to, which is why I wrote that honkin' huge comment right next to it.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25 11:05         ` Peter Zijlstra
@ 2013-11-25 11:29           ` Peter Zijlstra
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Zijlstra @ 2013-11-25 11:29 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Marcelo Tosatti, gleb, avi.kivity, pbonzini, linux-kernel, kvm,
	Eric Dumazet

On Mon, Nov 25, 2013 at 12:05:00PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 25, 2013 at 06:59:24PM +0800, Xiao Guangrong wrote:
> > I guess Marcelo was talking about rculist_nulls.h
> > (Documentation/RCU/rculist_nulls.txt).
> 
> Oh, let me have a look, I don't think I've really looked at that yet.

Egads, that's far too clever ;-)

Yes, if that's what Marcelo was referring to, he's right: it doesn't
have a proper termination condition in the face of unlimited
modification.

And it can indeed use SLAB_DESTROY_BY_RCU as you alluded to -- although
the documentation is a tad scarce explaining why this is so.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25  6:11     ` Xiao Guangrong
  2013-11-25  6:29       ` Xiao Guangrong
  2013-11-25 10:19       ` Gleb Natapov
@ 2013-11-25 12:48       ` Avi Kivity
  2013-11-25 14:23         ` Marcelo Tosatti
  2013-11-25 14:08       ` Marcelo Tosatti
  3 siblings, 1 reply; 69+ messages in thread
From: Avi Kivity @ 2013-11-25 12:48 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Marcelo Tosatti, Gleb Natapov, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
<xiaoguangrong@linux.vnet.ibm.com> wrote:
>
> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

<snip complicated stuff about parent_pte>

I'm not really following, but note that parent_pte predates EPT (and
the use of rcu in kvm), so all the complexity that is the result of
trying to pack as many list entries into a cache line can be dropped.
Most setups now would have exactly one list entry, which is handled
specially anyway.

Alternatively, the trick of storing multiple entries in one list entry
can be moved to generic code, it may be useful to others.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25  6:11     ` Xiao Guangrong
                         ` (2 preceding siblings ...)
  2013-11-25 12:48       ` Avi Kivity
@ 2013-11-25 14:08       ` Marcelo Tosatti
  2013-11-26  3:02         ` Xiao Guangrong
  3 siblings, 1 reply; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-25 14:08 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
> 
> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
> >> It likes nulls list and we use the pte-list as the nulls which can help us to
> >> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
> >> if that happened
> >> 
> >> kvm->slots_lock is held when we do lockless walking that prevents rmap
> >> is reused (free rmap need to hold that lock) so that we can not see the same
> >> nulls used on different rmaps
> >> 
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> > 
> > How about simplified lockless walk on the slot while rmapp entry
> > contains a single spte? (which should be the case with two-dimensional
> > paging).
> > 
> > That is, grab the lock when finding a rmap with more than one spte in
> > it (and then keep it locked until the end).
> 
> Hmm… that isn't straightforward and more complex than the approach
> in this patchset. Also it can drop the improvement for shadow mmu that
> gets great improvement by this patchset.

It is not more complex, since it would remove the lockless list walk. Only
the spte pointer at rmap[spte] is accessed without a lock. It's much, much
simpler.

> > For example, nothing prevents lockless walker to move into some
> > parent_ptes chain, right?
> 
> No.
> 
> The nulls can help us to detect this case, for parent_ptes, the nulls points
> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
> is no chance that the “rmap" is used as shadow page when slot-lock is held.

The SLAB cache is the same, so entries can be reused. What prevents
a desc entry living in slot.arch.rmap from being freed and reused by a
parent_ptes desc?

> > Also, there is no guarantee of termination (as long as sptes are
> > deleted with the correct timing). BTW, can't see any guarantee of
> > termination for rculist nulls either (a writer can race with a lockless
> > reader indefinately, restarting the lockless walk every time).
> 
> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> that means, if the dirty-bitmap has been set during lockless write-protection,
> it’s unnecessary to write-protect its sptes. Your idea?
> 
> But… do we really need to care it. :(

See my reply to Avi.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25 12:48       ` Avi Kivity
@ 2013-11-25 14:23         ` Marcelo Tosatti
  2013-11-25 14:29           ` Gleb Natapov
  2013-11-26  3:10           ` Xiao Guangrong
  0 siblings, 2 replies; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-25 14:23 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Xiao Guangrong, Gleb Natapov, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> <xiaoguangrong@linux.vnet.ibm.com> wrote:
> >
> > On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> <snip complicated stuff about parent_pte>
> 
> I'm not really following, but note that parent_pte predates EPT (and
> the use of rcu in kvm), so all the complexity that is the result of
> trying to pack as many list entries into a cache line can be dropped.
> Most setups now would have exactly one list entry, which is handled
> specially antyway.
> 
> Alternatively, the trick of storing multiple entries in one list entry
> can be moved to generic code, it may be useful to others.

Yes -- can the lockless list walking code be transformed into generic
singly-linked list walking? Then the correctness can be verified
independently, and KVM becomes a simple user of that interface.

The simpler version is to maintain the lockless walk only for depth-1 rmap
entries (and grab the lock once a depth-2 entry is found).
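
(A rough sketch of that simpler scheme, just to make it concrete -- the two
write_protect_* helpers are made-up names, and only the usual pte_list
encoding, where an rmap value with bit 0 clear is a single spte pointer, is
assumed:)

static void write_protect_rmapp(struct kvm *kvm, unsigned long *rmapp)
{
	unsigned long entry = ACCESS_ONCE(*rmapp);

	if (!entry)
		return;

	if (!(entry & 1)) {
		/* depth 1: a single spte, write-protect it locklessly */
		write_protect_spte_lockless(kvm, (u64 *)entry);	/* made-up helper */
		return;
	}

	/* depth 2: a desc chain, fall back to the walk under mmu_lock */
	spin_lock(&kvm->mmu_lock);
	write_protect_rmapp_slowpath(kvm, rmapp);		/* made-up helper */
	spin_unlock(&kvm->mmu_lock);
}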


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25 14:23         ` Marcelo Tosatti
@ 2013-11-25 14:29           ` Gleb Natapov
  2013-11-25 18:06             ` Marcelo Tosatti
  2013-11-26  3:10           ` Xiao Guangrong
  1 sibling, 1 reply; 69+ messages in thread
From: Gleb Natapov @ 2013-11-25 14:29 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Avi Kivity, Xiao Guangrong, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On Mon, Nov 25, 2013 at 12:23:51PM -0200, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> > On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> > <xiaoguangrong@linux.vnet.ibm.com> wrote:
> > >
> > > On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > 
> > <snip complicated stuff about parent_pte>
> > 
> > I'm not really following, but note that parent_pte predates EPT (and
> > the use of rcu in kvm), so all the complexity that is the result of
> > trying to pack as many list entries into a cache line can be dropped.
> > Most setups now would have exactly one list entry, which is handled
> > specially antyway.
> > 
> > Alternatively, the trick of storing multiple entries in one list entry
> > can be moved to generic code, it may be useful to others.
> 
> Yes, can the lockless list walking code be transformed into generic
> single-linked list walking? So the correctness can be verified
> independently, and KVM becomes a simple user of that interface.
> 
The code will become simpler, but the problem of the never-ending walk of
rculist_nulls will remain.

> The simpler version is to maintain lockless walk on depth-1 rmap entries
> (and grab the lock once depth-2 entry is found).
And release it between each rmap walk, or only at the very end of the
write protection?

--
			Gleb.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25 14:29           ` Gleb Natapov
@ 2013-11-25 18:06             ` Marcelo Tosatti
  0 siblings, 0 replies; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-25 18:06 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Avi Kivity, Xiao Guangrong, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On Mon, Nov 25, 2013 at 04:29:28PM +0200, Gleb Natapov wrote:
> On Mon, Nov 25, 2013 at 12:23:51PM -0200, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> > > On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> > > <xiaoguangrong@linux.vnet.ibm.com> wrote:
> > > >
> > > > On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > 
> > > <snip complicated stuff about parent_pte>
> > > 
> > > I'm not really following, but note that parent_pte predates EPT (and
> > > the use of rcu in kvm), so all the complexity that is the result of
> > > trying to pack as many list entries into a cache line can be dropped.
> > > Most setups now would have exactly one list entry, which is handled
> > > specially antyway.
> > > 
> > > Alternatively, the trick of storing multiple entries in one list entry
> > > can be moved to generic code, it may be useful to others.
> > 
> > Yes, can the lockless list walking code be transformed into generic
> > single-linked list walking? So the correctness can be verified
> > independently, and KVM becomes a simple user of that interface.
> > 
> The code will become simpler but the problem of never ending walk of
> rculist_nulls will remain.

Yes. Hopefully it can be fixed in some way.

> > The simpler version is to maintain lockless walk on depth-1 rmap entries
> > (and grab the lock once depth-2 entry is found).
> And release it between each rmap walk or at the very end of write
> protect?

Or keep it locked until 10 consecutive sptes with depth 1 are found.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25  6:29       ` Xiao Guangrong
@ 2013-11-25 18:12         ` Marcelo Tosatti
  2013-11-26  3:21           ` Xiao Guangrong
  0 siblings, 1 reply; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-25 18:12 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
> >> Also, there is no guarantee of termination (as long as sptes are
> >> deleted with the correct timing). BTW, can't see any guarantee of
> >> termination for rculist nulls either (a writer can race with a lockless
> >> reader indefinately, restarting the lockless walk every time).
> > 
> > Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> > that means, if the dirty-bitmap has been set during lockless write-protection,
> > it's unnecessary to write-protect its sptes. Your idea?
> This idea is based on the fact that the number of rmap is limited by
> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
> we can break the rewalk at once, in the case of deleting, we can only
> rewalk RMAP_RECYCLE_THRESHOLD times.

Please explain in more detail.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25 14:08       ` Marcelo Tosatti
@ 2013-11-26  3:02         ` Xiao Guangrong
  0 siblings, 0 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-11-26  3:02 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On 11/25/2013 10:08 PM, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
>>
>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>
>>> On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
>>>> It likes nulls list and we use the pte-list as the nulls which can help us to
>>>> detect whether the "desc" is moved to anther rmap then we can re-walk the rmap
>>>> if that happened
>>>>
>>>> kvm->slots_lock is held when we do lockless walking that prevents rmap
>>>> is reused (free rmap need to hold that lock) so that we can not see the same
>>>> nulls used on different rmaps
>>>>
>>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>>>
>>> How about simplified lockless walk on the slot while rmapp entry
>>> contains a single spte? (which should be the case with two-dimensional
>>> paging).
>>>
>>> That is, grab the lock when finding a rmap with more than one spte in
>>> it (and then keep it locked until the end).
>>
>> Hmm… that isn't straightforward and more complex than the approach
>> in this patchset. Also it can drop the improvement for shadow mmu that
>> gets great improvement by this patchset.
> 
> It is not more complex, since it would remove list lockless walk. Only
> the spte pointer at rmap[spte] is accessed without a lock. Its much much
> simpler.
> 
>>> For example, nothing prevents lockless walker to move into some
>>> parent_ptes chain, right?
>>
>> No.
>>
>> The nulls can help us to detect this case, for parent_ptes, the nulls points
>> to "shadow page" but for rmaps, the nulls points to slot.arch.rmap. There
>> is no chance that the “rmap" is used as shadow page when slot-lock is held.
> 
> The SLAB cache is the same, so entries can be reused. What prevents
> a desc entry living in slot.arch.rmap to be freed and reused by a
> parent_ptes desc?
> 

We will check is_last_spte(), so all the sptes on parent_ptes will fail that check.
And Gleb suggested using a separate slab for rmap, which should be excellent.
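
(If the separate slab route is taken, the cache would be created along these
lines -- a sketch only, with rmap_desc_cache as a hypothetical name, not code
from the patchset:)

/* a dedicated cache for rmap descs, so a freed desc can only be reused
 * as another rmap desc, never as a parent_ptes desc */
static struct kmem_cache *rmap_desc_cache;

static int rmap_desc_cache_init(void)
{
	rmap_desc_cache = kmem_cache_create("kvm_rmap_desc",
					    sizeof(struct pte_list_desc),
					    0, SLAB_DESTROY_BY_RCU, NULL);
	return rmap_desc_cache ? 0 : -ENOMEM;
}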


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25 14:23         ` Marcelo Tosatti
  2013-11-25 14:29           ` Gleb Natapov
@ 2013-11-26  3:10           ` Xiao Guangrong
  2013-11-26 10:15             ` Gleb Natapov
  2013-11-26 19:58             ` Marcelo Tosatti
  1 sibling, 2 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-11-26  3:10 UTC (permalink / raw)
  To: Marcelo Tosatti, Avi Kivity
  Cc: Gleb Natapov, pbonzini@redhat.com Bonzini, linux-kernel, kvm,
	Eric Dumazet, Peter Zijlstra

On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
>> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
>> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>>>
>>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>
>> <snip complicated stuff about parent_pte>
>>
>> I'm not really following, but note that parent_pte predates EPT (and
>> the use of rcu in kvm), so all the complexity that is the result of
>> trying to pack as many list entries into a cache line can be dropped.
>> Most setups now would have exactly one list entry, which is handled
>> specially antyway.
>>
>> Alternatively, the trick of storing multiple entries in one list entry
>> can be moved to generic code, it may be useful to others.
> 
> Yes, can the lockless list walking code be transformed into generic
> single-linked list walking? So the correctness can be verified
> independently, and KVM becomes a simple user of that interface.

I'm afraid the single-entry list is not as good as we expected. In my
experience, there are too many entries on the rmap, more than 300 sometimes
(consider the case of a library shared by all processes).

> 
> The simpler version is to maintain lockless walk on depth-1 rmap entries
> (and grab the lock once depth-2 entry is found).

I still think the lockless rmap walk is more graceful: soft mmu also gets the
benefit, and it is promising for use in some mmu-notifier functions. :)



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-25 18:12         ` Marcelo Tosatti
@ 2013-11-26  3:21           ` Xiao Guangrong
  2013-11-26 10:12             ` Gleb Natapov
  2013-11-26 19:31             ` Marcelo Tosatti
  0 siblings, 2 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-11-26  3:21 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
>>>> Also, there is no guarantee of termination (as long as sptes are
>>>> deleted with the correct timing). BTW, can't see any guarantee of
>>>> termination for rculist nulls either (a writer can race with a lockless
>>>> reader indefinately, restarting the lockless walk every time).
>>>
>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
>>> that means, if the dirty-bitmap has been set during lockless write-protection,
>>> it's unnecessary to write-protect its sptes. Your idea?
>> This idea is based on the fact that the number of rmap is limited by
>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
>> we can break the rewalk at once, in the case of deleting, we can only
>> rewalk RMAP_RECYCLE_THRESHOLD times.
> 
> Please explain in more detail.

Okay.

My proposal is like this:

pte_list_walk_lockless()
{
restart:

+	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
+		return;

	code-doing-lockless-walking;
	......
}

Before doing the lockless walk, we check the dirty-bitmap first; if it is
set, we can simply skip write-protection for the gfn -- that covers the
case where a new spte is being added into the rmap while we access the
rmap locklessly.

For the case of deleting a spte from the rmap, the number of entries is
limited by RMAP_RECYCLE_THRESHOLD, so it is not endless.
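
(Spelled out slightly, and only as a sketch in the direction of the pseudocode
above -- the helper name is made up, the gfn-to-bit indexing is assumed to
follow mark_page_dirty(), i.e. gfn - base_gfn, and the endianness of the
bitmap accessors is glossed over:)

/* Sketch: true if the gfn has already been marked dirty in this slot,
 * in which case the lockless write-protection walk can simply bail out. */
static bool gfn_already_dirty(struct kvm_memory_slot *slot, gfn_t gfn)
{
	unsigned long rel_gfn = gfn - slot->base_gfn;

	return slot->dirty_bitmap && test_bit(rel_gfn, slot->dirty_bitmap);
}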


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-26  3:21           ` Xiao Guangrong
@ 2013-11-26 10:12             ` Gleb Natapov
  2013-11-26 19:31             ` Marcelo Tosatti
  1 sibling, 0 replies; 69+ messages in thread
From: Gleb Natapov @ 2013-11-26 10:12 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Marcelo Tosatti, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Peter Zijlstra

On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
> >>>> Also, there is no guarantee of termination (as long as sptes are
> >>>> deleted with the correct timing). BTW, can't see any guarantee of
> >>>> termination for rculist nulls either (a writer can race with a lockless
> >>>> reader indefinately, restarting the lockless walk every time).
> >>>
> >>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> >>> that means, if the dirty-bitmap has been set during lockless write-protection,
> >>> it's unnecessary to write-protect its sptes. Your idea?
> >> This idea is based on the fact that the number of rmap is limited by
> >> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
> >> we can break the rewalk at once, in the case of deleting, we can only
> >> rewalk RMAP_RECYCLE_THRESHOLD times.
> > 
> > Please explain in more detail.
> 
> Okay.
> 
> My proposal is like this:
> 
> pte_list_walk_lockless()
> {
> restart:
> 
> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
> +		return;
> 
> 	code-doing-lockless-walking;
> 	......
> }
> 
> Before do lockless-walking, we check the dirty-bitmap first, if
> it is set we can simply skip write-protection for the gfn, that
> is the case that new spte is being added into rmap when we lockless
> access the rmap.
> 
> For the case of deleting spte from rmap, the number of entry is limited
> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
The point is that the rmap entry you are inspecting can be constantly
deleted and added to the beginning of some other list, so the code that
traverses the list will never reach the end.

--
			Gleb.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-26  3:10           ` Xiao Guangrong
@ 2013-11-26 10:15             ` Gleb Natapov
  2013-11-26 19:58             ` Marcelo Tosatti
  1 sibling, 0 replies; 69+ messages in thread
From: Gleb Natapov @ 2013-11-26 10:15 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Marcelo Tosatti, Avi Kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Peter Zijlstra

On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote:
> On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> >> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> >> <xiaoguangrong@linux.vnet.ibm.com> wrote:
> >>>
> >>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >>
> >> <snip complicated stuff about parent_pte>
> >>
> >> I'm not really following, but note that parent_pte predates EPT (and
> >> the use of rcu in kvm), so all the complexity that is the result of
> >> trying to pack as many list entries into a cache line can be dropped.
> >> Most setups now would have exactly one list entry, which is handled
> >> specially antyway.
> >>
> >> Alternatively, the trick of storing multiple entries in one list entry
> >> can be moved to generic code, it may be useful to others.
> > 
> > Yes, can the lockless list walking code be transformed into generic
> > single-linked list walking? So the correctness can be verified
> > independently, and KVM becomes a simple user of that interface.
> 
> I'am afraid the signle-entry list is not so good as we expected. In my
> experience, there're too many entries on rmap, more than 300 sometimes.
> (consider a case that a lib shared by all processes).
> 
This is without EPT though, and non-EPT hardware is not a performance king
anyway. Nested EPT uses shadow paging too, but VMs hardly share any
pages. With KSM they may, though.

--
			Gleb.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-26  3:21           ` Xiao Guangrong
  2013-11-26 10:12             ` Gleb Natapov
@ 2013-11-26 19:31             ` Marcelo Tosatti
  2013-11-28  8:53               ` Xiao Guangrong
  1 sibling, 1 reply; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-26 19:31 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
> >>>> Also, there is no guarantee of termination (as long as sptes are
> >>>> deleted with the correct timing). BTW, can't see any guarantee of
> >>>> termination for rculist nulls either (a writer can race with a lockless
> >>>> reader indefinately, restarting the lockless walk every time).
> >>>
> >>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> >>> that means, if the dirty-bitmap has been set during lockless write-protection,
> >>> it's unnecessary to write-protect its sptes. Your idea?
> >> This idea is based on the fact that the number of rmap is limited by
> >> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
> >> we can break the rewalk at once, in the case of deleting, we can only
> >> rewalk RMAP_RECYCLE_THRESHOLD times.
> > 
> > Please explain in more detail.
> 
> Okay.
> 
> My proposal is like this:
> 
> pte_list_walk_lockless()
> {
> restart:
> 
> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
> +		return;
> 
> 	code-doing-lockless-walking;
> 	......
> }
> 
> Before do lockless-walking, we check the dirty-bitmap first, if
> it is set we can simply skip write-protection for the gfn, that
> is the case that new spte is being added into rmap when we lockless
> access the rmap.

The dirty bit could be set after the check.

> For the case of deleting spte from rmap, the number of entry is limited
> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.

It can shrink and grow while lockless walk is performed.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-26  3:10           ` Xiao Guangrong
  2013-11-26 10:15             ` Gleb Natapov
@ 2013-11-26 19:58             ` Marcelo Tosatti
  2013-11-28  8:32               ` Xiao Guangrong
  1 sibling, 1 reply; 69+ messages in thread
From: Marcelo Tosatti @ 2013-11-26 19:58 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Avi Kivity, Gleb Natapov, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote:
> On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
> > On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
> >> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
> >> <xiaoguangrong@linux.vnet.ibm.com> wrote:
> >>>
> >>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >>
> >> <snip complicated stuff about parent_pte>
> >>
> >> I'm not really following, but note that parent_pte predates EPT (and
> >> the use of rcu in kvm), so all the complexity that is the result of
> >> trying to pack as many list entries into a cache line can be dropped.
> >> Most setups now would have exactly one list entry, which is handled
> >> specially antyway.
> >>
> >> Alternatively, the trick of storing multiple entries in one list entry
> >> can be moved to generic code, it may be useful to others.
> > 
> > Yes, can the lockless list walking code be transformed into generic
> > single-linked list walking? So the correctness can be verified
> > independently, and KVM becomes a simple user of that interface.
> 
> I'am afraid the signle-entry list is not so good as we expected. In my
> experience, there're too many entries on rmap, more than 300 sometimes.
> (consider a case that a lib shared by all processes).

The singly-linked list suggestion was about moving singly-linked lockless
walking to generic code.

http://www.spinics.net/lists/linux-usb/msg39643.html
http://marc.info/?l=linux-kernel&m=103305635013575&w=3

> > The simpler version is to maintain lockless walk on depth-1 rmap entries
> > (and grab the lock once depth-2 entry is found).
> 
> I still think rmap-lockless is more graceful: soft mmu can get benefit
> from it also it is promising to be used in some mmu-notify functions. :)

OK.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-26 19:58             ` Marcelo Tosatti
@ 2013-11-28  8:32               ` Xiao Guangrong
  0 siblings, 0 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-11-28  8:32 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Avi Kivity, Gleb Natapov, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On 11/27/2013 03:58 AM, Marcelo Tosatti wrote:
> On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote:
>> On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
>>> On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
>>>> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
>>>> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>>>>>
>>>>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>>>
>>>> <snip complicated stuff about parent_pte>
>>>>
>>>> I'm not really following, but note that parent_pte predates EPT (and
>>>> the use of rcu in kvm), so all the complexity that is the result of
>>>> trying to pack as many list entries into a cache line can be dropped.
>>>> Most setups now would have exactly one list entry, which is handled
>>>> specially antyway.
>>>>
>>>> Alternatively, the trick of storing multiple entries in one list entry
>>>> can be moved to generic code, it may be useful to others.
>>>
>>> Yes, can the lockless list walking code be transformed into generic
>>> single-linked list walking? So the correctness can be verified
>>> independently, and KVM becomes a simple user of that interface.
>>
>> I'am afraid the signle-entry list is not so good as we expected. In my
>> experience, there're too many entries on rmap, more than 300 sometimes.
>> (consider a case that a lib shared by all processes).
> 
> single linked list was about moving singly-linked lockless walking
> to generic code.
> 
> http://www.spinics.net/lists/linux-usb/msg39643.html
> http://marc.info/?l=linux-kernel&m=103305635013575&w=3
> 

Oh, I confused "single linked" and "single entry", sorry about that.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-26 19:31             ` Marcelo Tosatti
@ 2013-11-28  8:53               ` Xiao Guangrong
  2013-12-03  7:10                 ` Xiao Guangrong
  0 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-11-28  8:53 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
>>>>>> Also, there is no guarantee of termination (as long as sptes are
>>>>>> deleted with the correct timing). BTW, can't see any guarantee of
>>>>>> termination for rculist nulls either (a writer can race with a lockless
>>>>>> reader indefinately, restarting the lockless walk every time).
>>>>>
>>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
>>>>> that means, if the dirty-bitmap has been set during lockless write-protection,
>>>>> it's unnecessary to write-protect its sptes. Your idea?
>>>> This idea is based on the fact that the number of rmap is limited by
>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
>>>> we can break the rewalk at once, in the case of deleting, we can only
>>>> rewalk RMAP_RECYCLE_THRESHOLD times.
>>>
>>> Please explain in more detail.
>>
>> Okay.
>>
>> My proposal is like this:
>>
>> pte_list_walk_lockless()
>> {
>> restart:
>>
>> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
>> +		return;
>>
>> 	code-doing-lockless-walking;
>> 	......
>> }
>>
>> Before do lockless-walking, we check the dirty-bitmap first, if
>> it is set we can simply skip write-protection for the gfn, that
>> is the case that new spte is being added into rmap when we lockless
>> access the rmap.
> 
> The dirty bit could be set after the check.
> 
>> For the case of deleting spte from rmap, the number of entry is limited
>> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
> 
> It can shrink and grow while lockless walk is performed.

Yes, indeed.

Hmmm, another idea in my mind to fix this is to encode the position into
the reserved bits of the desc->more pointer, for example:

          +------+    +------+    +------+
rmapp ->  |Desc 0| -> |Desc 1| -> |Desc 2|
          +------+    +------+    +------+

There are 3 descs on the rmap, and:
rmapp = &desc0 | 1UL | 3UL << 50;
desc0->more = desc1 | 2UL << 50;
desc1->more = desc2 | 1UL << 50;
desc2->more = &rmapp | 1UL; (The nulls pointer)

We will walk to the next desc only if the "position" of the current desc
is >= the position of the next desc. That makes sure we can reach the
last desc anyway.

And in order to avoid doing too many "rewalks", we will go to the slow
path (do the walk while holding the lock) instead once the walk has been
retried more than N times.
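
(To make the encoding above concrete, a sketch of the decoding helpers -- the
shift value and the low tag bit just mirror the example, nothing here is
final or taken from the patchset:)

#define DESC_POS_SHIFT	50

/* position stored in the reserved high bits of a desc->more value */
static unsigned long desc_get_pos(unsigned long more)
{
	return more >> DESC_POS_SHIFT;
}

/* strip both the position bits and the low tag bit to get the pointer back */
static struct pte_list_desc *desc_get_ptr(unsigned long more)
{
	return (struct pte_list_desc *)(more & ~(~0UL << DESC_POS_SHIFT) & ~1UL);
}

The walker would then only step from a desc to desc_get_ptr(desc->more) while
desc_get_pos() does not increase, so the number of forward steps is bounded by
the position encoded at the rmap head.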

Thanks all you guys in thanksgiving day. :)


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-11-28  8:53               ` Xiao Guangrong
@ 2013-12-03  7:10                 ` Xiao Guangrong
  2013-12-05 13:50                   ` Marcelo Tosatti
  0 siblings, 1 reply; 69+ messages in thread
From: Xiao Guangrong @ 2013-12-03  7:10 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On 11/28/2013 04:53 PM, Xiao Guangrong wrote:
> On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
>> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
>>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
>>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
>>>>>>> Also, there is no guarantee of termination (as long as sptes are
>>>>>>> deleted with the correct timing). BTW, can't see any guarantee of
>>>>>>> termination for rculist nulls either (a writer can race with a lockless
>>>>>>> reader indefinately, restarting the lockless walk every time).
>>>>>>
>>>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
>>>>>> that means, if the dirty-bitmap has been set during lockless write-protection,
>>>>>> it's unnecessary to write-protect its sptes. Your idea?
>>>>> This idea is based on the fact that the number of rmap is limited by
>>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
>>>>> we can break the rewalk at once, in the case of deleting, we can only
>>>>> rewalk RMAP_RECYCLE_THRESHOLD times.
>>>>
>>>> Please explain in more detail.
>>>
>>> Okay.
>>>
>>> My proposal is like this:
>>>
>>> pte_list_walk_lockless()
>>> {
>>> restart:
>>>
>>> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
>>> +		return;
>>>
>>> 	code-doing-lockless-walking;
>>> 	......
>>> }
>>>
>>> Before do lockless-walking, we check the dirty-bitmap first, if
>>> it is set we can simply skip write-protection for the gfn, that
>>> is the case that new spte is being added into rmap when we lockless
>>> access the rmap.
>>
>> The dirty bit could be set after the check.
>>
>>> For the case of deleting spte from rmap, the number of entry is limited
>>> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
>>
>> It can shrink and grow while lockless walk is performed.
> 
> Yes, indeed.
> 
> Hmmm, another idea in my mind to fix this is encoding the position into
> the reserved bits of desc->more pointer, for example:
> 
>           +------+    +------+    +------+
> rmapp ->  |Desc 0| -> |Desc 1| -> |Desc 2|
>           +------+    +------+    +------+
> 
> There are 3 descs on the rmap, and:
> rmapp = &desc0 | 1UL | 3UL << 50;
> desc0->more = desc1 | 2UL << 50;
> desc1->more = desc0 | 1UL << 50
> desc2->more = &rmapp | 1UL; (The nulls pointer)
> 
> We will walk to the next desc only if the "position" of current desc
> is >= the position of next desc. That can make sure we can reach the
> last desc anyway.
> 
> And in order to avoiding doing too many "rewalk", we will goto the
> slow path (do walk with holding the lock) instead when retry the walk
> more that N times.

How about this idea? Or do you guys still prefer the idea of going
lockless only on the first level?



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-12-03  7:10                 ` Xiao Guangrong
@ 2013-12-05 13:50                   ` Marcelo Tosatti
  2013-12-05 15:30                     ` Xiao Guangrong
  0 siblings, 1 reply; 69+ messages in thread
From: Marcelo Tosatti @ 2013-12-05 13:50 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On Tue, Dec 03, 2013 at 03:10:48PM +0800, Xiao Guangrong wrote:
> On 11/28/2013 04:53 PM, Xiao Guangrong wrote:
> > On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
> >> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
> >>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
> >>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
> >>>>>>> Also, there is no guarantee of termination (as long as sptes are
> >>>>>>> deleted with the correct timing). BTW, can't see any guarantee of
> >>>>>>> termination for rculist nulls either (a writer can race with a lockless
> >>>>>>> reader indefinately, restarting the lockless walk every time).
> >>>>>>
> >>>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
> >>>>>> that means, if the dirty-bitmap has been set during lockless write-protection,
> >>>>>> it's unnecessary to write-protect its sptes. Your idea?
> >>>>> This idea is based on the fact that the number of rmap is limited by
> >>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
> >>>>> we can break the rewalk at once, in the case of deleting, we can only
> >>>>> rewalk RMAP_RECYCLE_THRESHOLD times.
> >>>>
> >>>> Please explain in more detail.
> >>>
> >>> Okay.
> >>>
> >>> My proposal is like this:
> >>>
> >>> pte_list_walk_lockless()
> >>> {
> >>> restart:
> >>>
> >>> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
> >>> +		return;
> >>>
> >>> 	code-doing-lockless-walking;
> >>> 	......
> >>> }
> >>>
> >>> Before do lockless-walking, we check the dirty-bitmap first, if
> >>> it is set we can simply skip write-protection for the gfn, that
> >>> is the case that new spte is being added into rmap when we lockless
> >>> access the rmap.
> >>
> >> The dirty bit could be set after the check.
> >>
> >>> For the case of deleting spte from rmap, the number of entry is limited
> >>> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
> >>
> >> It can shrink and grow while lockless walk is performed.
> > 
> > Yes, indeed.
> > 
> > Hmmm, another idea in my mind to fix this is encoding the position into
> > the reserved bits of desc->more pointer, for example:
> > 
> >           +------+    +------+    +------+
> > rmapp ->  |Desc 0| -> |Desc 1| -> |Desc 2|
> >           +------+    +------+    +------+
> > 
> > There are 3 descs on the rmap, and:
> > rmapp = &desc0 | 1UL | 3UL << 50;
> > desc0->more = desc1 | 2UL << 50;
> > desc1->more = desc0 | 1UL << 50
> > desc2->more = &rmapp | 1UL; (The nulls pointer)
> > 
> > We will walk to the next desc only if the "position" of current desc
> > is >= the position of next desc. That can make sure we can reach the
> > last desc anyway.
> > 
> > And in order to avoiding doing too many "rewalk", we will goto the
> > slow path (do walk with holding the lock) instead when retry the walk
> > more that N times.
> 
> How about this idea? Or you guys still prefer to the idea of lockless on
> first-level?

Xiao,

Is it not the case that simply moving to the slow path, once a maximum
number of rewalks has been reached, is enough? (Looks like a good solution.)

Please move lockless rcu walking code to generic code where it belongs.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-12-05 13:50                   ` Marcelo Tosatti
@ 2013-12-05 15:30                     ` Xiao Guangrong
  2013-12-06  0:15                       ` Marcelo Tosatti
  2013-12-06  0:22                       ` Marcelo Tosatti
  0 siblings, 2 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-12-05 15:30 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra


On Dec 5, 2013, at 9:50 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

> GOn Tue, Dec 03, 2013 at 03:10:48PM +0800, Xiao Guangrong wrote:
>> On 11/28/2013 04:53 PM, Xiao Guangrong wrote:
>>> On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
>>>> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
>>>>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
>>>>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
>>>>>>>>> Also, there is no guarantee of termination (as long as sptes are
>>>>>>>>> deleted with the correct timing). BTW, can't see any guarantee of
>>>>>>>>> termination for rculist nulls either (a writer can race with a lockless
>>>>>>>>> reader indefinately, restarting the lockless walk every time).
>>>>>>>> 
>>>>>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
>>>>>>>> that means, if the dirty-bitmap has been set during lockless write-protection,
>>>>>>>> it's unnecessary to write-protect its sptes. Your idea?
>>>>>>> This idea is based on the fact that the number of rmap is limited by
>>>>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
>>>>>>> we can break the rewalk at once, in the case of deleting, we can only
>>>>>>> rewalk RMAP_RECYCLE_THRESHOLD times.
>>>>>> 
>>>>>> Please explain in more detail.
>>>>> 
>>>>> Okay.
>>>>> 
>>>>> My proposal is like this:
>>>>> 
>>>>> pte_list_walk_lockless()
>>>>> {
>>>>> restart:
>>>>> 
>>>>> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
>>>>> +		return;
>>>>> 
>>>>> 	code-doing-lockless-walking;
>>>>> 	......
>>>>> }
>>>>> 
>>>>> Before do lockless-walking, we check the dirty-bitmap first, if
>>>>> it is set we can simply skip write-protection for the gfn, that
>>>>> is the case that new spte is being added into rmap when we lockless
>>>>> access the rmap.
>>>> 
>>>> The dirty bit could be set after the check.
>>>> 
>>>>> For the case of deleting spte from rmap, the number of entry is limited
>>>>> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
>>>> 
>>>> It can shrink and grow while lockless walk is performed.
>>> 
>>> Yes, indeed.
>>> 
>>> Hmmm, another idea in my mind to fix this is encoding the position into
>>> the reserved bits of desc->more pointer, for example:
>>> 
>>>         +------+    +------+    +------+
>>> rmapp ->  |Desc 0| -> |Desc 1| -> |Desc 2|
>>>         +------+    +------+    +------+
>>> 
>>> There are 3 descs on the rmap, and:
>>> rmapp = &desc0 | 1UL | 3UL << 50;
>>> desc0->more = desc1 | 2UL << 50;
>>> desc1->more = desc0 | 1UL << 50
>>> desc2->more = &rmapp | 1UL; (The nulls pointer)
>>> 
>>> We will walk to the next desc only if the "position" of current desc
>>> is >= the position of next desc. That can make sure we can reach the
>>> last desc anyway.
>>> 
>>> And in order to avoiding doing too many "rewalk", we will goto the
>>> slow path (do walk with holding the lock) instead when retry the walk
>>> more that N times.
>> 
>> How about this idea? Or you guys still prefer to the idea of lockless on
>> first-level?
> 
> Xiao,
> 
> Is it not the case that simply moving to the slow path once a maximum of
> rewalks has been reached enough? (looks a like a good solution).

In some cases, the lockless walker will walk the descs endlessly even
without any rewalk; consider this case:

there are two descs, desc1 and desc2, where desc2 is pointed to by desc1->next:
desc1->next = desc2.

CPU 0                                                                            CPU 1

lockless walk on desc1
                                                                 deleting desc1 from the rmap
lockless walk on desc2 (desc1->next)
                                                                delete desc2 from the rmap
                                                                add desc1
                                                                add desc2, then desc2->next = desc1

lockless walk on desc1
                                                               delete desc2
                                                               delete desc1
                                                               add desc2
                                                               add desc1; the desc1->next = desc2
lockless walk on desc2

……

Then, the walker is endlessly walking on desc1 and desc2 without any rewalk.


> 
> Please move lockless rcu walking code to generic code where it belongs.

Okay.




^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-12-05 15:30                     ` Xiao Guangrong
@ 2013-12-06  0:15                       ` Marcelo Tosatti
  2013-12-06  0:22                       ` Marcelo Tosatti
  1 sibling, 0 replies; 69+ messages in thread
From: Marcelo Tosatti @ 2013-12-06  0:15 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On Thu, Dec 05, 2013 at 11:30:27PM +0800, Xiao Guangrong wrote:
> > Is it not the case that simply moving to the slow path once a maximum of
> > rewalks has been reached enough? (looks a like a good solution).
> 
> In some cases, the lockless walker will do endless-walking on desc and
> without rewalk, consider this case:
> 
> there are two descs: desc1 and desc2 who is pointed by desc1->next:
> desc1->next = desc2.
> 
> CPU 0                                                                            CPU 1
> 
> lockless walk on desc1
>                                                                  deleting desc1 from the rmap
> lockless walk on desc2 (desc1->next)
>                                                                 delete desc2 from the rmap
>                                                                 add desc1
>                                                                 add desc2, then desc2->next = desc1
> 
> lockless walk on desc1
>                                                                delete desc2
>                                                                delete desc1
>                                                                add desc2
>                                                                add desc1; the desc1->next = desc2
> lockless walk on desc2
> 
> ……
> 
> Then, the walker is endlessly walking on desc1 and desc2 without any rewalk.

Ouch, this is the sort of thing that is worrisome. Do you still think
that having the benefit for shadow paging is important enough to justify
this complexity?

Another, separate possibility is to use the dirty bit (which recent
Intel processors support). Then there are no faults at all to handle.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-12-05 15:30                     ` Xiao Guangrong
  2013-12-06  0:15                       ` Marcelo Tosatti
@ 2013-12-06  0:22                       ` Marcelo Tosatti
  2013-12-10  6:58                         ` Xiao Guangrong
  1 sibling, 1 reply; 69+ messages in thread
From: Marcelo Tosatti @ 2013-12-06  0:22 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On Thu, Dec 05, 2013 at 11:30:27PM +0800, Xiao Guangrong wrote:
> In some cases, the lockless walker will do endless-walking on desc and
> without rewalk, consider this case:
> 
> there are two descs: desc1 and desc2 who is pointed by desc1->next:
> desc1->next = desc2.
> 
> CPU 0                                                                            CPU 1
> 
> lockless walk on desc1
>                                                                  deleting desc1 from the rmap
> lockless walk on desc2 (desc1->next)
>                                                                 delete desc2 from the rmap
>                                                                 add desc1
>                                                                 add desc2, then desc2->next = desc1
> 
> lockless walk on desc1
>                                                                delete desc2
>                                                                delete desc1
>                                                                add desc2
>                                                                add desc1; the desc1->next = desc2
> lockless walk on desc2
> 
> ……
> 
> Then, the walker is endlessly walking on desc1 and desc2 without any rewalk.

The counter can be local to the walker. If it's more than the maximum rmap
size, break.
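
(A sketch of that bounded walk, for concreteness -- the two helpers are
made-up names, the desc layout is the existing pte_list_desc one, and the
nulls check / rewalk handling from the patchset is omitted:)

/* Walk the desc chain locklessly, but give up and take mmu_lock once the
 * local counter exceeds the maximum size an rmap is allowed to reach. */
static void rmap_walk_bounded(struct kvm *kvm, unsigned long *rmapp,
			      unsigned int max_descs)
{
	struct pte_list_desc *desc;
	unsigned int visited = 0;
	unsigned long entry = ACCESS_ONCE(*rmapp);

	if (!(entry & 1))
		return;	/* zero or one spte: nothing chained to walk */

	for (desc = (struct pte_list_desc *)(entry & ~1UL); desc;
	     desc = ACCESS_ONCE(desc->more)) {
		if (++visited > max_descs) {
			spin_lock(&kvm->mmu_lock);
			pte_list_walk_slowpath(kvm, rmapp);	/* made-up helper */
			spin_unlock(&kvm->mmu_lock);
			return;
		}
		handle_desc_lockless(kvm, desc);		/* made-up helper */
	}
}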

(but still, please consider carefully whether lockless list walking is
really necessary or can be avoided).


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
  2013-12-06  0:22                       ` Marcelo Tosatti
@ 2013-12-10  6:58                         ` Xiao Guangrong
  0 siblings, 0 replies; 69+ messages in thread
From: Xiao Guangrong @ 2013-12-10  6:58 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, avi.kivity, pbonzini@redhat.com Bonzini,
	linux-kernel, kvm, Eric Dumazet, Peter Zijlstra

On 12/06/2013 08:22 AM, Marcelo Tosatti wrote:
> On Thu, Dec 05, 2013 at 11:30:27PM +0800, Xiao Guangrong wrote:
>> In some cases, the lockless walker will do endless-walking on desc and
>> without rewalk, consider this case:
>>
>> there are two descs: desc1 and desc2 who is pointed by desc1->next:
>> desc1->next = desc2.
>>
>> CPU 0                                                                            CPU 1
>>
>> lockless walk on desc1
>>                                                                  deleting desc1 from the rmap
>> lockless walk on desc2 (desc1->next)
>>                                                                 delete desc2 from the rmap
>>                                                                 add desc1
>>                                                                 add desc2, then desc2->next = desc1
>>
>> lockless walk on desc1
>>                                                                delete desc2
>>                                                                delete desc1
>>                                                                add desc2
>>                                                                add desc1; the desc1->next = desc2
>> lockless walk on desc2
>>
>> ……
>>
>> Then, the walker is endlessly walking on desc1 and desc2 without any rewalk.
> 
> The counter can be local to the walker. If its more than maximum rmap
> size, break.
> 
> (but still, please consider carefully whether lockless list walking is
> really necessary or can be avoided).

Yep, Marcelo, you're right.

After thinking more, I do not have any idea to simplify this. Your approach
(lockless on the first level) seems a better solution. I will redo it based
on that approach.

Thanks!



^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2013-12-10  6:58 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-23 13:29 [PATCH v3 00/15] KVM: MMU: locklessly write-protect Xiao Guangrong
2013-10-23 13:29 ` [PATCH v3 01/15] KVM: MMU: properly check last spte in fast_page_fault() Xiao Guangrong
2013-11-12  0:25   ` Marcelo Tosatti
2013-10-23 13:29 ` [PATCH v3 02/15] KVM: MMU: lazily drop large spte Xiao Guangrong
2013-11-12 22:44   ` Marcelo Tosatti
2013-10-23 13:29 ` [PATCH v3 03/15] KVM: MMU: flush tlb if the spte can be locklessly modified Xiao Guangrong
2013-11-13  0:10   ` Marcelo Tosatti
2013-10-23 13:29 ` [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes Xiao Guangrong
2013-11-14  0:36   ` Marcelo Tosatti
2013-11-14  5:15     ` Xiao Guangrong
2013-11-14 18:39       ` Marcelo Tosatti
2013-11-15  7:09         ` Xiao Guangrong
2013-11-19  0:19           ` Marcelo Tosatti
2013-10-23 13:29 ` [PATCH v3 05/15] KVM: MMU: update spte and add it into rmap before dirty log Xiao Guangrong
2013-11-15  0:08   ` Marcelo Tosatti
2013-10-23 13:29 ` [PATCH v3 06/15] KVM: MMU: redesign the algorithm of pte_list Xiao Guangrong
2013-11-19  0:48   ` Marcelo Tosatti
2013-10-23 13:29 ` [PATCH v3 07/15] KVM: MMU: introduce nulls desc Xiao Guangrong
2013-11-22 19:14   ` Marcelo Tosatti
2013-11-25  6:11     ` Xiao Guangrong
2013-11-25  6:29       ` Xiao Guangrong
2013-11-25 18:12         ` Marcelo Tosatti
2013-11-26  3:21           ` Xiao Guangrong
2013-11-26 10:12             ` Gleb Natapov
2013-11-26 19:31             ` Marcelo Tosatti
2013-11-28  8:53               ` Xiao Guangrong
2013-12-03  7:10                 ` Xiao Guangrong
2013-12-05 13:50                   ` Marcelo Tosatti
2013-12-05 15:30                     ` Xiao Guangrong
2013-12-06  0:15                       ` Marcelo Tosatti
2013-12-06  0:22                       ` Marcelo Tosatti
2013-12-10  6:58                         ` Xiao Guangrong
2013-11-25 10:19       ` Gleb Natapov
2013-11-25 10:25         ` Xiao Guangrong
2013-11-25 12:48       ` Avi Kivity
2013-11-25 14:23         ` Marcelo Tosatti
2013-11-25 14:29           ` Gleb Natapov
2013-11-25 18:06             ` Marcelo Tosatti
2013-11-26  3:10           ` Xiao Guangrong
2013-11-26 10:15             ` Gleb Natapov
2013-11-26 19:58             ` Marcelo Tosatti
2013-11-28  8:32               ` Xiao Guangrong
2013-11-25 14:08       ` Marcelo Tosatti
2013-11-26  3:02         ` Xiao Guangrong
2013-11-25  9:31     ` Peter Zijlstra
2013-11-25 10:59       ` Xiao Guangrong
2013-11-25 11:05         ` Peter Zijlstra
2013-11-25 11:29           ` Peter Zijlstra
2013-10-23 13:29 ` [PATCH v3 08/15] KVM: MMU: introduce pte-list lockless walker Xiao Guangrong
2013-10-23 13:29 ` [PATCH v3 09/15] KVM: MMU: initialize the pointers in pte_list_desc properly Xiao Guangrong
2013-10-23 13:29 ` [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab Xiao Guangrong
2013-10-24  9:19   ` Gleb Natapov
2013-10-24  9:29     ` Xiao Guangrong
2013-10-24  9:52       ` Gleb Natapov
2013-10-24 10:10         ` Xiao Guangrong
2013-10-24 10:39           ` Gleb Natapov
2013-10-24 11:01             ` Xiao Guangrong
2013-10-24 12:32               ` Gleb Natapov
2013-10-28  3:16                 ` Xiao Guangrong
2013-10-23 13:29 ` [PATCH v3 11/15] KVM: MMU: locklessly access shadow page under rcu protection Xiao Guangrong
2013-10-23 13:29 ` [PATCH v3 12/15] KVM: MMU: check last spte with unawareness of mapping level Xiao Guangrong
2013-10-23 13:29 ` [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page Xiao Guangrong
2013-10-24  9:17   ` Gleb Natapov
2013-10-24  9:24     ` Xiao Guangrong
2013-10-24  9:32       ` Gleb Natapov
2013-10-23 13:29 ` [PATCH v3 14/15] KVM: MMU: clean up spte_write_protect Xiao Guangrong
2013-10-23 13:29 ` [PATCH v3 15/15] KVM: MMU: use rcu functions to access the pointer Xiao Guangrong
2013-11-03 12:29 ` [PATCH v3 00/15] KVM: MMU: locklessly write-protect Gleb Natapov
2013-11-11  5:33   ` Xiao Guangrong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).