* [PATCH 0/4] KVM: srcu-less dirty logging -v2
@ 2012-03-01 10:30 Takuya Yoshikawa
  2012-03-01 10:31 ` [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others Takuya Yoshikawa
                   ` (5 more replies)
  0 siblings, 6 replies; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-01 10:30 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, linux-kernel

v2: changed to protect masked pages

Live migration gets a bit faster than v1.

	Takuya


=== from v1

This patch series is the result of the integration of my dirty logging
optimization work, including preparation for the new GET_DIRTY_LOG API,
and the attempt to get rid of controversial synchronize_srcu_expedited().

1 - KVM: MMU: Split the main body of rmap_write_protect() off from others
2 - KVM: Avoid checking huge page mappings in get_dirty_log()
3 - KVM: Switch to srcu-less get_dirty_log()
4 - KVM: Remove unused dirty_bitmap_head and nr_dirty_pages

Although some tasks still remain, the test results obtained look very
promising.


Remaining tasks:

- Implement set_bit_le() for mark_page_dirty()

  Some drivers are using their own implementation of it and a bit of
  work is needed to make it generic.  I want to do this separately
  later because it cannot be done within kvm tree.

- Stop allocating extra dirty bitmap buffer area

  According to Peter, mmu_notifier has become preemptible.  If we can
  change mmu_lock from spin_lock to mutex_lock, as Avi said before, this
  would be straightforward because we can use __put_user() right after
  xchg() with the mmu_lock held; see the sketch below.
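
  Just to illustrate the idea, here is a minimal sketch (not working
  code: it assumes mmu_lock has already been converted to a mutex, uses
  the kvm_mmu_write_protect_pt_masked() introduced in this series, and
  omits error handling; n is kvm_dirty_bitmap_bytes(memslot)):

	mutex_lock(&kvm->mmu_lock);     /* assumes mmu_lock is a mutex */
	for (i = 0; i < n / sizeof(long); i++) {
		unsigned long mask = xchg(&memslot->dirty_bitmap[i], 0);

		/* sleeping in __put_user() is fine once mmu_lock is a mutex */
		if (__put_user(mask, (unsigned long __user *)log->dirty_bitmap + i))
			break;          /* real code would return -EFAULT */

		if (mask)
			kvm_mmu_write_protect_pt_masked(kvm, memslot,
							i * BITS_PER_LONG, mask);
	}
	kvm_flush_remote_tlbs(kvm);
	mutex_unlock(&kvm->mmu_lock);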


Test results:

1. dirty-log-perf unit test (on Sandy Bridge core-i3 32-bit host)

With some changes added since the previous post, the performance has
improved considerably: now, even when every page in the slot is dirty,
the numbers are reasonably close to the original ones.  For the other
cases, needless to say, we achieved a very nice improvement.

- kvm.git next
average(ns)    stdev     ns/page    pages

 147018.6    77604.9    147018.6        1
 158080.2    82211.9     79040.1        2
 127555.6    80619.8     31888.9        4
 108865.6    78499.3     13608.2        8
 114707.8    43508.6      7169.2       16
  76679.0    37659.8      2396.2       32
  59159.8    20417.1       924.3       64
  60418.2    19405.7       472.0      128
  76267.0    21450.5       297.9      256
 113182.0    22684.9       221.0      512
 930344.2   153766.5       908.5       1K
 939098.2   163800.3       458.5       2K
 996813.4    77921.0       243.3       4K
1113232.6   107782.6       135.8       8K
1241206.4    82282.5        75.7      16K
1529526.4   116388.2        46.6      32K
2147538.4   227375.9        32.7      64K
3309619.4    79356.8        25.2     128K
6016951.8   549873.4        22.9     256K

- kvm.git next + srcu-less series
average(ns)    stdev     ns/page    pages    improvement(%)

  14086.0     3532.3     14086.0        1     944
  13303.6     3317.7      6651.8        2    1088
  13455.6     3315.2      3363.9        4     848
  14125.8     3435.4      1765.7        8     671
  15322.4     3690.1       957.6       16     649
  17026.6     4037.2       532.0       32     350
  21258.6     4852.3       332.1       64     178
  33845.6    14115.8       264.4      128      79
  37893.0      681.8       148.0      256     101
  61707.4     1057.6       120.5      512      83
  88861.4     2131.0        86.7       1K     947
 151315.6     6490.5        73.8       2K     521
 290579.6     8523.0        70.9       4K     243
 518231.0    20412.6        63.2       8K     115
2271171.4    12064.9       138.6      16K     -45
3375866.2    14743.3       103.0      32K     -55
4408395.6    10720.0        67.2      64K     -51
5915336.2    26538.1        45.1     128K     -44
8497356.4    16441.0        32.4     256K     -29

Note that when the number of dirty pages was large, we spent less than
100ns to get the dirty information for each page: see the ns/page column.

As Avi noted before, this is much faster than userspace can send one
page to the destination node.

Furthermore, with the already proposed new GET_DIRTY_LOG API, we will
be able to restrict the area from which we get the log and will not need
to care about the ms-order latency observed for very large numbers of
dirty pages.

2. real workloads (on Xeon W3520 64-bit host)

I traced kvm_vm_ioctl_get_dirty_log() during heavy VGA updates and
during live migration.

2.1. VGA: guest was doing "x11perf -rect1 -rect10 -rect100 -rect500"

As can be guessed from the dirty-log-perf results, we observed a very
nice improvement.

- kvm.git next
For heavy updates: 100us to 300us.
Worst: 300us

- kvm.git next + srcu-less series
For heavy updates: 3us to 10us.
Worst: 50us.

2.2. live migration: guest was doing "dd if=/path/to/a/file of=/dev/null"

The improvement was significant again.

- kvm.git next
For heavy updates: 1ms to 3ms

- kvm.git next + srcu-less series
For heavy updates: 50us to 300us

We probably gained a lot from the locality of the WWS (writable working set).


	Takuya



* [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others
  2012-03-01 10:30 [PATCH 0/4] KVM: srcu-less dirty logging -v2 Takuya Yoshikawa
@ 2012-03-01 10:31 ` Takuya Yoshikawa
  2012-03-12  7:39   ` Takuya Yoshikawa
  2012-03-01 10:32 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-01 10:31 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, linux-kernel

We will use this in the following patch to implement another function
which needs to write protect pages using the rmap information.

Note that there is a small change in debug printing for large pages:
we do not differentiate them from others to avoid duplicating code.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/kvm/mmu.c |   53 ++++++++++++++++++++++++++-------------------------
 1 files changed, 27 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ff053ca..67857bd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1010,42 +1010,43 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 		rmap_remove(kvm, sptep);
 }
 
-int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
-			       struct kvm_memory_slot *slot)
+static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, int level)
 {
-	unsigned long *rmapp;
-	u64 *spte;
-	int i, write_protected = 0;
+	u64 *spte = NULL;
+	int write_protected = 0;
 
-	rmapp = __gfn_to_rmap(gfn, PT_PAGE_TABLE_LEVEL, slot);
-	spte = rmap_next(rmapp, NULL);
-	while (spte) {
+	while ((spte = rmap_next(rmapp, spte))) {
 		BUG_ON(!(*spte & PT_PRESENT_MASK));
 		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
-		if (is_writable_pte(*spte)) {
+
+		if (!is_writable_pte(*spte))
+			continue;
+
+		if (level == PT_PAGE_TABLE_LEVEL) {
 			mmu_spte_update(spte, *spte & ~PT_WRITABLE_MASK);
-			write_protected = 1;
+		} else {
+			BUG_ON(!is_large_pte(*spte));
+			drop_spte(kvm, spte);
+			--kvm->stat.lpages;
+			spte = NULL;
 		}
-		spte = rmap_next(rmapp, spte);
+
+		write_protected = 1;
 	}
 
-	/* check for huge page mappings */
-	for (i = PT_DIRECTORY_LEVEL;
+	return write_protected;
+}
+
+int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
+			       struct kvm_memory_slot *slot)
+{
+	unsigned long *rmapp;
+	int i, write_protected = 0;
+
+	for (i = PT_PAGE_TABLE_LEVEL;
 	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
 		rmapp = __gfn_to_rmap(gfn, i, slot);
-		spte = rmap_next(rmapp, NULL);
-		while (spte) {
-			BUG_ON(!(*spte & PT_PRESENT_MASK));
-			BUG_ON(!is_large_pte(*spte));
-			pgprintk("rmap_write_protect(large): spte %p %llx %lld\n", spte, *spte, gfn);
-			if (is_writable_pte(*spte)) {
-				drop_spte(kvm, spte);
-				--kvm->stat.lpages;
-				spte = NULL;
-				write_protected = 1;
-			}
-			spte = rmap_next(rmapp, spte);
-		}
+		write_protected |= __rmap_write_protect(kvm, rmapp, i);
 	}
 
 	return write_protected;
-- 
1.7.5.4



* [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-01 10:30 [PATCH 0/4] KVM: srcu-less dirty logging -v2 Takuya Yoshikawa
  2012-03-01 10:31 ` [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others Takuya Yoshikawa
@ 2012-03-01 10:32 ` Takuya Yoshikawa
  2012-03-02  2:56   ` Takuya Yoshikawa
                     ` (2 more replies)
  2012-03-01 10:33 ` [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log() Takuya Yoshikawa
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-01 10:32 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, linux-kernel

Such mappings are dropped when we enable dirty logging, and we never
create new ones until we stop the logging.

For this we introduce a new function which can be used to write protect
a range of PT level pages: although we do not need to care about a range
of pages at this point, the following patch will need this feature to
optimize the write protection of many pages.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/include/asm/kvm_host.h |    5 ++-
 arch/x86/kvm/mmu.c              |   40 +++++++++++++++++++++++++++++---------
 arch/x86/kvm/x86.c              |    8 ++----
 3 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 74c9edf..935cbcc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -712,8 +712,9 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 
 int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
-			       struct kvm_memory_slot *slot);
+void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+				     struct kvm_memory_slot *slot,
+				     gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 67857bd..be8a529 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1037,27 +1037,47 @@ static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, int level
 	return write_protected;
 }
 
-int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
-			       struct kvm_memory_slot *slot)
+/**
+ * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
+ * @kvm: kvm instance
+ * @slot: slot to protect
+ * @gfn_offset: start of the BITS_PER_LONG pages we care about
+ * @mask: indicates which pages we should protect
+ *
+ * Used when we do not need to care about huge page mappings: e.g. during dirty
+ * logging we do not have any such mappings.
+ */
+void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+				     struct kvm_memory_slot *slot,
+				     gfn_t gfn_offset, unsigned long mask)
 {
 	unsigned long *rmapp;
-	int i, write_protected = 0;
 
-	for (i = PT_PAGE_TABLE_LEVEL;
-	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
-		rmapp = __gfn_to_rmap(gfn, i, slot);
-		write_protected |= __rmap_write_protect(kvm, rmapp, i);
-	}
+	while (mask) {
+		rmapp = &slot->rmap[gfn_offset + __ffs(mask)];
+		__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
 
-	return write_protected;
+		/* clear the first set bit */
+		mask &= mask - 1;
+	}
 }
 
 static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
 	struct kvm_memory_slot *slot;
+	unsigned long *rmapp;
+	int i;
+	int write_protected = 0;
 
 	slot = gfn_to_memslot(kvm, gfn);
-	return kvm_mmu_rmap_write_protect(kvm, gfn, slot);
+
+	for (i = PT_PAGE_TABLE_LEVEL;
+	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+		rmapp = __gfn_to_rmap(gfn, i, slot);
+		write_protected |= __rmap_write_protect(kvm, rmapp, i);
+	}
+
+	return write_protected;
 }
 
 static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c9d99e5..3bc1922 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3069,13 +3069,11 @@ static void write_protect_slot(struct kvm *kvm,
 
 	/* Not many dirty pages compared to # of shadow pages. */
 	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
-		unsigned long gfn_offset;
+		gfn_t offset;
 
-		for_each_set_bit(gfn_offset, dirty_bitmap, memslot->npages) {
-			unsigned long gfn = memslot->base_gfn + gfn_offset;
+		for_each_set_bit(offset, dirty_bitmap, memslot->npages)
+			kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, 1);
 
-			kvm_mmu_rmap_write_protect(kvm, gfn, memslot);
-		}
 		kvm_flush_remote_tlbs(kvm);
 	} else
 		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
-- 
1.7.5.4



* [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-03-01 10:30 [PATCH 0/4] KVM: srcu-less dirty logging -v2 Takuya Yoshikawa
  2012-03-01 10:31 ` [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others Takuya Yoshikawa
  2012-03-01 10:32 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
@ 2012-03-01 10:33 ` Takuya Yoshikawa
  2012-03-03  5:21   ` [PATCH 3/4 changelog-v2] " Takuya Yoshikawa
  2012-03-16  5:03   ` [PATCH 3/4] " Xiao Guangrong
  2012-03-01 10:34 ` [PATCH 4/4] KVM: Remove unused dirty_bitmap_head and nr_dirty_pages Takuya Yoshikawa
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-01 10:33 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, linux-kernel

We have seen some problems with the current implementation of
get_dirty_log(), which uses synchronize_srcu_expedited() for updating
dirty bitmaps; e.g. it is noticeable that this sometimes gives us
ms-order latency when we use VGA displays.

Furthermore the recent discussion on the following thread
    "srcu: Implement call_srcu()"
    http://lkml.org/lkml/2012/1/31/211
also motivated us to implement get_dirty_log() without SRCU.

This patch achieves this goal without sacrificing the performance of
either VGA or live migration: in practice the new code is much faster
than the old one unless we have too many dirty pages.

Implementation:

The key part of the implementation is the use of the xchg() operation
for clearing dirty bits atomically.  Since this allows us to update only
BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
until every dirty bit is cleared again for the next call.

Although some people may worry about using atomic memory instructions
many times on the concurrently accessible bitmap, it is usually accessed
with mmu_lock held and we rarely see concurrent accesses: so what we
need to care about is the pure xchg() overhead.

Another point to note is that we do not use for_each_set_bit() to check
which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
simply use __ffs() in a loop.  This is much faster than repeatedly
calling find_next_bit().
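
For reference, the bit-walking pattern looks like this, where
handle_page() is just a hypothetical placeholder for the rmap write
protection done in patch 2:

	while (mask) {
		handle_page(gfn_offset + __ffs(mask));  /* lowest set bit */
		mask &= mask - 1;                       /* clear that bit */
	}

for_each_set_bit() would instead call find_next_bit() once per iteration.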

Performance:

The dirty-log-perf unit test showed nice improvements, sometimes several
times faster than before, when the number of dirty pages was below 8K.
For the other cases we saw a bit of regression, but it is still fast
enough compared to the processing of these dirty pages in userspace.

For real workloads, both VGA and live migration, we have observed pure
improvements: when the guest was reading a file, we originally saw a few
ms of latency, but with the new method the latency was less than 200us.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/kvm/x86.c |  116 +++++++++++++++++++--------------------------------
 1 files changed, 43 insertions(+), 73 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3bc1922..0748bab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3041,55 +3041,32 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
 }
 
 /**
- * write_protect_slot - write protect a slot for dirty logging
- * @kvm: the kvm instance
- * @memslot: the slot we protect
- * @dirty_bitmap: the bitmap indicating which pages are dirty
- * @nr_dirty_pages: the number of dirty pages
+ * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
+ * @kvm: kvm instance
+ * @log: slot id and address to which we copy the log
  *
- * We have two ways to find all sptes to protect:
- * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
- *    checks ones that have a spte mapping a page in the slot.
- * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
+ * We need to keep it in mind that VCPU threads can write to the bitmap
+ * concurrently.  So, to avoid losing data, we keep the following order for
+ * each bit:
  *
- * Generally speaking, if there are not so many dirty pages compared to the
- * number of shadow pages, we should use the latter.
+ *   1. Take a snapshot of the bit and clear it if needed.
+ *   2. Write protect the corresponding page.
+ *   3. Flush TLB's if needed.
+ *   4. Copy the snapshot to the userspace.
  *
- * Note that letting others write into a page marked dirty in the old bitmap
- * by using the remaining tlb entry is not a problem.  That page will become
- * write protected again when we flush the tlb and then be reported dirty to
- * the user space by copying the old bitmap.
+ * Between 2 and 3, the guest may write to the page using the remaining TLB
+ * entry.  This is not a problem because the page will be reported dirty at
+ * step 4 using the snapshot taken before and step 3 ensures that successive
+ * writes will be logged for the next call.
  */
-static void write_protect_slot(struct kvm *kvm,
-			       struct kvm_memory_slot *memslot,
-			       unsigned long *dirty_bitmap,
-			       unsigned long nr_dirty_pages)
-{
-	spin_lock(&kvm->mmu_lock);
-
-	/* Not many dirty pages compared to # of shadow pages. */
-	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
-		gfn_t offset;
-
-		for_each_set_bit(offset, dirty_bitmap, memslot->npages)
-			kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, 1);
-
-		kvm_flush_remote_tlbs(kvm);
-	} else
-		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
-
-	spin_unlock(&kvm->mmu_lock);
-}
-
-/*
- * Get (and clear) the dirty memory log for a memory slot.
- */
-int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
-				      struct kvm_dirty_log *log)
+int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 {
 	int r;
 	struct kvm_memory_slot *memslot;
-	unsigned long n, nr_dirty_pages;
+	unsigned long n, i;
+	unsigned long *dirty_bitmap;
+	unsigned long *dirty_bitmap_buffer;
+	bool is_dirty = false;
 
 	mutex_lock(&kvm->slots_lock);
 
@@ -3098,49 +3075,42 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 		goto out;
 
 	memslot = id_to_memslot(kvm->memslots, log->slot);
+
+	dirty_bitmap = memslot->dirty_bitmap;
 	r = -ENOENT;
-	if (!memslot->dirty_bitmap)
+	if (!dirty_bitmap)
 		goto out;
 
 	n = kvm_dirty_bitmap_bytes(memslot);
-	nr_dirty_pages = memslot->nr_dirty_pages;
 
-	/* If nothing is dirty, don't bother messing with page tables. */
-	if (nr_dirty_pages) {
-		struct kvm_memslots *slots, *old_slots;
-		unsigned long *dirty_bitmap, *dirty_bitmap_head;
+	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
+	memset(dirty_bitmap_buffer, 0, n);
 
-		dirty_bitmap = memslot->dirty_bitmap;
-		dirty_bitmap_head = memslot->dirty_bitmap_head;
-		if (dirty_bitmap == dirty_bitmap_head)
-			dirty_bitmap_head += n / sizeof(long);
-		memset(dirty_bitmap_head, 0, n);
+	spin_lock(&kvm->mmu_lock);
 
-		r = -ENOMEM;
-		slots = kmemdup(kvm->memslots, sizeof(*kvm->memslots), GFP_KERNEL);
-		if (!slots)
-			goto out;
+	for (i = 0; i < n / sizeof(long); i++) {
+		unsigned long mask;
+		gfn_t offset;
 
-		memslot = id_to_memslot(slots, log->slot);
-		memslot->nr_dirty_pages = 0;
-		memslot->dirty_bitmap = dirty_bitmap_head;
-		update_memslots(slots, NULL);
+		if (!dirty_bitmap[i])
+			continue;
 
-		old_slots = kvm->memslots;
-		rcu_assign_pointer(kvm->memslots, slots);
-		synchronize_srcu_expedited(&kvm->srcu);
-		kfree(old_slots);
+		is_dirty = true;
 
-		write_protect_slot(kvm, memslot, dirty_bitmap, nr_dirty_pages);
+		mask = xchg(&dirty_bitmap[i], 0);
+		dirty_bitmap_buffer[i] = mask;
 
-		r = -EFAULT;
-		if (copy_to_user(log->dirty_bitmap, dirty_bitmap, n))
-			goto out;
-	} else {
-		r = -EFAULT;
-		if (clear_user(log->dirty_bitmap, n))
-			goto out;
+		offset = i * BITS_PER_LONG;
+		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
 	}
+	if (is_dirty)
+		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+
+	r = -EFAULT;
+	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
+		goto out;
 
 	r = 0;
 out:
-- 
1.7.5.4



* [PATCH 4/4] KVM: Remove unused dirty_bitmap_head and nr_dirty_pages
  2012-03-01 10:30 [PATCH 0/4] KVM: srcu-less dirty logging -v2 Takuya Yoshikawa
                   ` (2 preceding siblings ...)
  2012-03-01 10:33 ` [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log() Takuya Yoshikawa
@ 2012-03-01 10:34 ` Takuya Yoshikawa
  2012-03-03  5:12 ` [PATCH 0/4] KVM: srcu-less dirty logging -v2 Takuya Yoshikawa
  2012-03-20 14:43 ` Avi Kivity
  5 siblings, 0 replies; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-01 10:34 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, linux-kernel

Now that we do neither double buffering nor heuristic selection of the
write protection method, these are not needed anymore.

Note: some drivers have their own implementation of set_bit_le() and
making it generic needs a bit of work; so we use test_and_set_bit_le()
for now and will replace it with a generic set_bit_le() later (see the
sketch below).
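
A generic set_bit_le() will probably end up looking something like the
following, mirroring how test_and_set_bit_le() is defined in
asm-generic/bitops/le.h (this is only a guess at its final form):

	static inline void set_bit_le(int nr, void *addr)
	{
		set_bit(nr ^ BITOP_LE_SWIZZLE, addr);
	}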

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 include/linux/kvm_host.h |    2 --
 virt/kvm/kvm_main.c      |   14 +++++---------
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 355e445..73c7d76 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -177,8 +177,6 @@ struct kvm_memory_slot {
 	unsigned long flags;
 	unsigned long *rmap;
 	unsigned long *dirty_bitmap;
-	unsigned long *dirty_bitmap_head;
-	unsigned long nr_dirty_pages;
 	struct kvm_arch_memory_slot arch;
 	unsigned long userspace_addr;
 	int user_alloc;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e4431ad..27a1083 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -522,12 +522,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 		return;
 
 	if (2 * kvm_dirty_bitmap_bytes(memslot) > PAGE_SIZE)
-		vfree(memslot->dirty_bitmap_head);
+		vfree(memslot->dirty_bitmap);
 	else
-		kfree(memslot->dirty_bitmap_head);
+		kfree(memslot->dirty_bitmap);
 
 	memslot->dirty_bitmap = NULL;
-	memslot->dirty_bitmap_head = NULL;
 }
 
 /*
@@ -611,8 +610,7 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
 
 /*
  * Allocation size is twice as large as the actual dirty bitmap size.
- * This makes it possible to do double buffering: see x86's
- * kvm_vm_ioctl_get_dirty_log().
+ * See x86's kvm_vm_ioctl_get_dirty_log() why this is needed.
  */
 static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
 {
@@ -627,8 +625,6 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
 	if (!memslot->dirty_bitmap)
 		return -ENOMEM;
 
-	memslot->dirty_bitmap_head = memslot->dirty_bitmap;
-	memslot->nr_dirty_pages = 0;
 #endif /* !CONFIG_S390 */
 	return 0;
 }
@@ -1476,8 +1472,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 
-		if (!test_and_set_bit_le(rel_gfn, memslot->dirty_bitmap))
-			memslot->nr_dirty_pages++;
+		/* TODO: introduce set_bit_le() and use it */
+		test_and_set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
 
-- 
1.7.5.4



* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-01 10:32 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
@ 2012-03-02  2:56   ` Takuya Yoshikawa
  2012-03-02  5:11     ` Takuya Yoshikawa
  2012-03-12 18:04   ` Avi Kivity
  2012-03-13 23:04   ` Marcelo Tosatti
  2 siblings, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-02  2:56 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel

Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> wrote:

> +	while (mask) {
> +		rmapp = &slot->rmap[gfn_offset + __ffs(mask)];
> +		__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
>  
> -	return write_protected;
> +		/* clear the first set bit */
> +		mask &= mask - 1;
> +	}
>  }

while (mask) {
	fsbit = __ffs(mask);
	gfn_offset += fsbit;
	mask >>= fsbit + 1;

	rmapp = &slot->rmap[gfn_offset];
	__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
}

Seems to be better: BSF friendly.

I will update this function and post the result with dirty-log-perf output!

	Takuya


* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-02  2:56   ` Takuya Yoshikawa
@ 2012-03-02  5:11     ` Takuya Yoshikawa
  0 siblings, 0 replies; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-02  5:11 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel

Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> wrote:

> Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> wrote:
> 
> > +	while (mask) {
> > +		rmapp = &slot->rmap[gfn_offset + __ffs(mask)];
> > +		__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
> >  
> > -	return write_protected;
> > +		/* clear the first set bit */
> > +		mask &= mask - 1;
> > +	}
> >  }
> 
> while (mask) {
> 	fsbit = __ffs(mask);
> 	gfn_offset += fsbit;
> 	mask >>= fsbit + 1;
> 
> 	rmapp = &slot->rmap[gfn_offset];
> 	__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
> }
> 
> Seems to be better: BSF friendly.

I was wrong.

"and < shift" and bsf did not show any difference for this kind of loop
on my box.  So we should keep the original implementation.

	Takuya


* Re: [PATCH 0/4] KVM: srcu-less dirty logging -v2
  2012-03-01 10:30 [PATCH 0/4] KVM: srcu-less dirty logging -v2 Takuya Yoshikawa
                   ` (3 preceding siblings ...)
  2012-03-01 10:34 ` [PATCH 4/4] KVM: Remove unused dirty_bitmap_head and nr_dirty_pages Takuya Yoshikawa
@ 2012-03-03  5:12 ` Takuya Yoshikawa
  2012-03-20 14:43 ` Avi Kivity
  5 siblings, 0 replies; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-03  5:12 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel

Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> wrote:

> v2: changed to protect masked pages
> 
> Live migration gets a bit faster than v1.

I have noticed that this version is much faster than version 1 when
nr-dirty-pages = 16K, 32K, 64K.

So I have updated the description of PATCH 3 a bit: please see
"PATCH 3/4 changelog-v2".

--------------------------------------------------------
                                  improvements
      avg     stdev    ns/page     /kvm    /v1   pages
--------------------------------------------------------
  12741.6    1049.9    12741.6     1054%    11%       1
  12117.4    1094.0     6058.7     1205%    10%       2
  12345.0    1094.3     3086.2      933%     9%       4
  12864.2    1188.0     1608.0      746%    10%       8
  14042.0    1138.8      877.6      717%     9%      16
  15738.6    1611.0      491.8      387%     8%      32
  19466.8    1691.5      304.1      204%     9%      64
  27220.0    2473.8      212.6      122%    24%     128
  42947.4    4228.7      167.7       78%   -12%     256
  71484.4   15812.6      139.6       58%   -14%     512
 110278.4   22409.1      107.6      744%   -19%      1K
 187205.4   39089.8       91.4      402%   -19%      2K
 343577.0   34272.8       83.8      190%   -15%      4K
 493900.4   15911.9       60.2      125%     5%      8K
 760268.2    5929.6       46.4       63%   199%     16K
1238709.6    7123.5       37.8       23%   173%     32K
2359523.6    3121.7       36.0       -9%    87%     64K
4540780.6   10155.6       34.6      -27%    30%    128K
8747274.0   10973.3       33.3      -31%    -3%    256K
--------------------------------------------------------

Even when every unsigned long word needs to be xchg'ed, we can see nice
improvements except for the last three cases.

I hope that everyone can see similar improvements for real workloads.

	Takuya




-- 
Takuya Yoshikawa <takuya.yoshikawa@gmail.com>


* [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-01 10:33 ` [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log() Takuya Yoshikawa
@ 2012-03-03  5:21   ` Takuya Yoshikawa
  2012-03-06 11:15     ` Marcelo Tosatti
  2012-03-16  5:03   ` [PATCH 3/4] " Xiao Guangrong
  1 sibling, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-03  5:21 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel

We have seen some problems with the current implementation of
get_dirty_log(), which uses synchronize_srcu_expedited() for updating
dirty bitmaps; e.g. it is noticeable that this sometimes gives us
ms-order latency when we use VGA displays.

Furthermore the recent discussion on the following thread
    "srcu: Implement call_srcu()"
    http://lkml.org/lkml/2012/1/31/211
also motivated us to implement get_dirty_log() without SRCU.

This patch achieves this goal without sacrificing the performance of
either VGA or live migration: in practice the new code is much faster
than the old one unless we have too many dirty pages.

Implementation:

The key part of the implementation is the use of the xchg() operation
for clearing dirty bits atomically.  Since this allows us to update only
BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
until every dirty bit is cleared again for the next call.

Although some people may worry about using atomic memory instructions
many times on the concurrently accessible bitmap, it is usually accessed
with mmu_lock held and we rarely see concurrent accesses: so what we
need to care about is the pure xchg() overhead.

Another point to note is that we do not use for_each_set_bit() to check
which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
simply use __ffs() in a loop.  This is much faster than repeatedly
calling find_next_bit().

Performance:

The dirty-log-perf unit test showed nice improvements, sometimes several
times faster than before, except for some extreme cases; even for such
cases, getting the dirty page information is still much faster than
userspace can process it.

For real workloads, both VGA and live migration, we have observed pure
improvements: when the guest was reading a file during live migration,
we originally saw a few ms of latency, but with the new method the
latency was less than 200us.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/kvm/x86.c |  116 +++++++++++++++++++--------------------------------
 1 files changed, 43 insertions(+), 73 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3bc1922..0748bab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3041,55 +3041,32 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
 }
 
 /**
- * write_protect_slot - write protect a slot for dirty logging
- * @kvm: the kvm instance
- * @memslot: the slot we protect
- * @dirty_bitmap: the bitmap indicating which pages are dirty
- * @nr_dirty_pages: the number of dirty pages
+ * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
+ * @kvm: kvm instance
+ * @log: slot id and address to which we copy the log
  *
- * We have two ways to find all sptes to protect:
- * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
- *    checks ones that have a spte mapping a page in the slot.
- * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
+ * We need to keep it in mind that VCPU threads can write to the bitmap
+ * concurrently.  So, to avoid losing data, we keep the following order for
+ * each bit:
  *
- * Generally speaking, if there are not so many dirty pages compared to the
- * number of shadow pages, we should use the latter.
+ *   1. Take a snapshot of the bit and clear it if needed.
+ *   2. Write protect the corresponding page.
+ *   3. Flush TLB's if needed.
+ *   4. Copy the snapshot to the userspace.
  *
- * Note that letting others write into a page marked dirty in the old bitmap
- * by using the remaining tlb entry is not a problem.  That page will become
- * write protected again when we flush the tlb and then be reported dirty to
- * the user space by copying the old bitmap.
+ * Between 2 and 3, the guest may write to the page using the remaining TLB
+ * entry.  This is not a problem because the page will be reported dirty at
+ * step 4 using the snapshot taken before and step 3 ensures that successive
+ * writes will be logged for the next call.
  */
-static void write_protect_slot(struct kvm *kvm,
-			       struct kvm_memory_slot *memslot,
-			       unsigned long *dirty_bitmap,
-			       unsigned long nr_dirty_pages)
-{
-	spin_lock(&kvm->mmu_lock);
-
-	/* Not many dirty pages compared to # of shadow pages. */
-	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
-		gfn_t offset;
-
-		for_each_set_bit(offset, dirty_bitmap, memslot->npages)
-			kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, 1);
-
-		kvm_flush_remote_tlbs(kvm);
-	} else
-		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
-
-	spin_unlock(&kvm->mmu_lock);
-}
-
-/*
- * Get (and clear) the dirty memory log for a memory slot.
- */
-int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
-				      struct kvm_dirty_log *log)
+int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 {
 	int r;
 	struct kvm_memory_slot *memslot;
-	unsigned long n, nr_dirty_pages;
+	unsigned long n, i;
+	unsigned long *dirty_bitmap;
+	unsigned long *dirty_bitmap_buffer;
+	bool is_dirty = false;
 
 	mutex_lock(&kvm->slots_lock);
 
@@ -3098,49 +3075,42 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 		goto out;
 
 	memslot = id_to_memslot(kvm->memslots, log->slot);
+
+	dirty_bitmap = memslot->dirty_bitmap;
 	r = -ENOENT;
-	if (!memslot->dirty_bitmap)
+	if (!dirty_bitmap)
 		goto out;
 
 	n = kvm_dirty_bitmap_bytes(memslot);
-	nr_dirty_pages = memslot->nr_dirty_pages;
 
-	/* If nothing is dirty, don't bother messing with page tables. */
-	if (nr_dirty_pages) {
-		struct kvm_memslots *slots, *old_slots;
-		unsigned long *dirty_bitmap, *dirty_bitmap_head;
+	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
+	memset(dirty_bitmap_buffer, 0, n);
 
-		dirty_bitmap = memslot->dirty_bitmap;
-		dirty_bitmap_head = memslot->dirty_bitmap_head;
-		if (dirty_bitmap == dirty_bitmap_head)
-			dirty_bitmap_head += n / sizeof(long);
-		memset(dirty_bitmap_head, 0, n);
+	spin_lock(&kvm->mmu_lock);
 
-		r = -ENOMEM;
-		slots = kmemdup(kvm->memslots, sizeof(*kvm->memslots), GFP_KERNEL);
-		if (!slots)
-			goto out;
+	for (i = 0; i < n / sizeof(long); i++) {
+		unsigned long mask;
+		gfn_t offset;
 
-		memslot = id_to_memslot(slots, log->slot);
-		memslot->nr_dirty_pages = 0;
-		memslot->dirty_bitmap = dirty_bitmap_head;
-		update_memslots(slots, NULL);
+		if (!dirty_bitmap[i])
+			continue;
 
-		old_slots = kvm->memslots;
-		rcu_assign_pointer(kvm->memslots, slots);
-		synchronize_srcu_expedited(&kvm->srcu);
-		kfree(old_slots);
+		is_dirty = true;
 
-		write_protect_slot(kvm, memslot, dirty_bitmap, nr_dirty_pages);
+		mask = xchg(&dirty_bitmap[i], 0);
+		dirty_bitmap_buffer[i] = mask;
 
-		r = -EFAULT;
-		if (copy_to_user(log->dirty_bitmap, dirty_bitmap, n))
-			goto out;
-	} else {
-		r = -EFAULT;
-		if (clear_user(log->dirty_bitmap, n))
-			goto out;
+		offset = i * BITS_PER_LONG;
+		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
 	}
+	if (is_dirty)
+		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+
+	r = -EFAULT;
+	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
+		goto out;
 
 	r = 0;
 out:
-- 
1.7.5.4



* Re: [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-03  5:21   ` [PATCH 3/4 changelog-v2] " Takuya Yoshikawa
@ 2012-03-06 11:15     ` Marcelo Tosatti
  2012-03-06 14:43       ` Takuya Yoshikawa
  0 siblings, 1 reply; 40+ messages in thread
From: Marcelo Tosatti @ 2012-03-06 11:15 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Takuya Yoshikawa, avi, kvm, linux-kernel

On Sat, Mar 03, 2012 at 02:21:48PM +0900, Takuya Yoshikawa wrote:
> We have seen some problems of the current implementation of
> get_dirty_log() which uses synchronize_srcu_expedited() for updating
> dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms
> order of latency when we use VGA displays.
> 
> Furthermore the recent discussion on the following thread
>     "srcu: Implement call_srcu()"
>     http://lkml.org/lkml/2012/1/31/211
> also motivated us to implement get_dirty_log() without SRCU.
> 
> This patch achieves this goal without sacrificing the performance of
> both VGA and live migration: in practice the new code is much faster
> than the old one unless we have too many dirty pages.
> 
> Implementation:
> 
> The key part of the implementation is the use of xchg() operation for
> clearing dirty bits atomically.  Since this allows us to update only
> BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
> until every dirty bit is cleared again for the next call.
> 
> Although some people may worry about the problem of using the atomic
> memory instruction many times to the concurrently accessible bitmap,
> it is usually accessed with mmu_lock held and we rarely see concurrent
> accesses: so what we need to care about is the pure xchg() overheads.
> 
> Another point to note is that we do not use for_each_set_bit() to check
> which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
> simply use __ffs() in a loop.  This is much faster than repeatedly call
> find_next_bit().
> 
> Performance:
> 
> The dirty-log-perf unit test showed nice improvements, some times faster
> than before, except for some extreme cases; for such cases the speed of
> getting dirty page information is much faster than we process it in the
> userspace.
> 
> For real workloads, both VGA and live migration, we have observed pure
> improvements: when the guest was reading a file during live migration,
> we originally saw a few ms of latency, but with the new method the
> latency was less than 200us.
> 
> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
> ---
>  arch/x86/kvm/x86.c |  116 +++++++++++++++++++--------------------------------
>  1 files changed, 43 insertions(+), 73 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3bc1922..0748bab 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3041,55 +3041,32 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
>  }
>  
>  /**
> - * write_protect_slot - write protect a slot for dirty logging
> - * @kvm: the kvm instance
> - * @memslot: the slot we protect
> - * @dirty_bitmap: the bitmap indicating which pages are dirty
> - * @nr_dirty_pages: the number of dirty pages
> + * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
> + * @kvm: kvm instance
> + * @log: slot id and address to which we copy the log
>   *
> - * We have two ways to find all sptes to protect:
> - * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
> - *    checks ones that have a spte mapping a page in the slot.
> - * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
> + * We need to keep it in mind that VCPU threads can write to the bitmap
> + * concurrently.  So, to avoid losing data, we keep the following order for
> + * each bit:
>   *
> - * Generally speaking, if there are not so many dirty pages compared to the
> - * number of shadow pages, we should use the latter.
> + *   1. Take a snapshot of the bit and clear it if needed.
> + *   2. Write protect the corresponding page.
> + *   3. Flush TLB's if needed.
> + *   4. Copy the snapshot to the userspace.
>   *
> - * Note that letting others write into a page marked dirty in the old bitmap
> - * by using the remaining tlb entry is not a problem.  That page will become
> - * write protected again when we flush the tlb and then be reported dirty to
> - * the user space by copying the old bitmap.
> + * Between 2 and 3, the guest may write to the page using the remaining TLB
> + * entry.  This is not a problem because the page will be reported dirty at
> + * step 4 using the snapshot taken before and step 3 ensures that successive
> + * writes will be logged for the next call.
>   */
> -static void write_protect_slot(struct kvm *kvm,
> -			       struct kvm_memory_slot *memslot,
> -			       unsigned long *dirty_bitmap,
> -			       unsigned long nr_dirty_pages)
> -{
> -	spin_lock(&kvm->mmu_lock);
> -
> -	/* Not many dirty pages compared to # of shadow pages. */
> -	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
> -		gfn_t offset;
> -
> -		for_each_set_bit(offset, dirty_bitmap, memslot->npages)
> -			kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, 1);
> -
> -		kvm_flush_remote_tlbs(kvm);
> -	} else
> -		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
> -
> -	spin_unlock(&kvm->mmu_lock);
> -}
> -
> -/*
> - * Get (and clear) the dirty memory log for a memory slot.
> - */
> -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> -				      struct kvm_dirty_log *log)
> +int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>  {
>  	int r;
>  	struct kvm_memory_slot *memslot;
> -	unsigned long n, nr_dirty_pages;
> +	unsigned long n, i;
> +	unsigned long *dirty_bitmap;
> +	unsigned long *dirty_bitmap_buffer;
> +	bool is_dirty = false;
>  
>  	mutex_lock(&kvm->slots_lock);
>  
> @@ -3098,49 +3075,42 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
>  		goto out;
>  
>  	memslot = id_to_memslot(kvm->memslots, log->slot);
> +
> +	dirty_bitmap = memslot->dirty_bitmap;
>  	r = -ENOENT;
> -	if (!memslot->dirty_bitmap)
> +	if (!dirty_bitmap)
>  		goto out;
>  
>  	n = kvm_dirty_bitmap_bytes(memslot);
> -	nr_dirty_pages = memslot->nr_dirty_pages;
>  
> -	/* If nothing is dirty, don't bother messing with page tables. */
> -	if (nr_dirty_pages) {
> -		struct kvm_memslots *slots, *old_slots;
> -		unsigned long *dirty_bitmap, *dirty_bitmap_head;
> +	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
> +	memset(dirty_bitmap_buffer, 0, n);
>  
> -		dirty_bitmap = memslot->dirty_bitmap;
> -		dirty_bitmap_head = memslot->dirty_bitmap_head;
> -		if (dirty_bitmap == dirty_bitmap_head)
> -			dirty_bitmap_head += n / sizeof(long);
> -		memset(dirty_bitmap_head, 0, n);
> +	spin_lock(&kvm->mmu_lock);

It is not clear why mmu_lock is needed. Dropping it across the xchg loop
should be similar to the srcu implementation, in that concurrent updates
will be visible only on the next get_dirty call? Well, it is necessary
anyway for write protecting the sptes.

A cond_resched_lock() would alleviate the potentially long hold
times for mmu_lock (can you measure them with large memslots?)

Otherwise looks nice.

>  
> -		r = -ENOMEM;
> -		slots = kmemdup(kvm->memslots, sizeof(*kvm->memslots), GFP_KERNEL);
> -		if (!slots)
> -			goto out;
> +	for (i = 0; i < n / sizeof(long); i++) {
> +		unsigned long mask;
> +		gfn_t offset;
>  
> -		memslot = id_to_memslot(slots, log->slot);
> -		memslot->nr_dirty_pages = 0;
> -		memslot->dirty_bitmap = dirty_bitmap_head;
> -		update_memslots(slots, NULL);
> +		if (!dirty_bitmap[i])
> +			continue;
>  
> -		old_slots = kvm->memslots;
> -		rcu_assign_pointer(kvm->memslots, slots);
> -		synchronize_srcu_expedited(&kvm->srcu);
> -		kfree(old_slots);
> +		is_dirty = true;
>  
> -		write_protect_slot(kvm, memslot, dirty_bitmap, nr_dirty_pages);
> +		mask = xchg(&dirty_bitmap[i], 0);
> +		dirty_bitmap_buffer[i] = mask;
>  
> -		r = -EFAULT;
> -		if (copy_to_user(log->dirty_bitmap, dirty_bitmap, n))
> -			goto out;
> -	} else {
> -		r = -EFAULT;
> -		if (clear_user(log->dirty_bitmap, n))
> -			goto out;
> +		offset = i * BITS_PER_LONG;
> +		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
>  	}
> +	if (is_dirty)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	r = -EFAULT;
> +	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
> +		goto out;
>  
>  	r = 0;
>  out:
> -- 
> 1.7.5.4
> 


* Re: [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-06 11:15     ` Marcelo Tosatti
@ 2012-03-06 14:43       ` Takuya Yoshikawa
  2012-03-06 15:01         ` Marcelo Tosatti
  0 siblings, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-06 14:43 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Takuya Yoshikawa, avi, kvm, linux-kernel

Marcelo Tosatti <mtosatti@redhat.com> wrote:

> > +	spin_lock(&kvm->mmu_lock);
> 
> It is not clear why mmu_lock is needed. Dropping it across the xchg loop
> should be similar to srcu implementation, in that concurrent updates
> will be visible only on the next get_dirty call? Well, it is necessary
> anyway for write protecting the sptes.

My implementation does the write protection inside the xchg loop and
then, after that loop, flushes the TLB.

mmu_lock must protect both of these together.

If we do not mind scanning the bitmap twice, we can decouple the
xchg loop and the write protection, but it will be a bit slower, and in
any case we need to hold mmu_lock until the TLB is flushed.

As can be seen from the unit-test results, the majority of the time
is spent on write protecting sptes, so decoupling the xchg loop
alone will not alleviate the problem much -- my guess.

> A cond_resched_lock() would alleviate the potentially long held 
> times for mmu_lock (can you measure it with large memslots?)

How to move the TLB flush out of mmu_lock critical sections was discussed
before, and there seemed to be some proposals.

Is anyone working on that?

After that we can do many things.

One idea is to shrink the extra bitmap buffer to one page or so and do
the xchg and write protection loop in chunks of that limited size.

Because we could then drop mmu_lock between chunks, it would be possible
to copy_to_user() part of the dirty bitmap and then go on to the next
part.

After everything is protected, we can do the TLB flush after dropping
mmu_lock, roughly as sketched below.
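
Just a sketch of the idea (it assumes the TLB flush can safely be done
after dropping mmu_lock, and that chunk/buffer are the one-page work
area mentioned above, with words = n / sizeof(long)):

	for (start = 0; start < words; start += chunk) {
		unsigned long len = min(chunk, words - start);

		spin_lock(&kvm->mmu_lock);
		for (i = 0; i < len; i++) {
			unsigned long mask = xchg(&dirty_bitmap[start + i], 0);

			buffer[i] = mask;
			if (mask)
				kvm_mmu_write_protect_pt_masked(kvm, memslot,
						(start + i) * BITS_PER_LONG, mask);
		}
		spin_unlock(&kvm->mmu_lock);

		if (copy_to_user((unsigned long __user *)log->dirty_bitmap + start,
				 buffer, len * sizeof(long)))
			return -EFAULT;
	}
	/* assumes flushing outside mmu_lock is made safe first */
	kvm_flush_remote_tlbs(kvm);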

> Otherwise looks nice.

Thanks,
	Takuya



* Re: [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-06 14:43       ` Takuya Yoshikawa
@ 2012-03-06 15:01         ` Marcelo Tosatti
  2012-03-06 15:23           ` Takuya Yoshikawa
  0 siblings, 1 reply; 40+ messages in thread
From: Marcelo Tosatti @ 2012-03-06 15:01 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Takuya Yoshikawa, avi, kvm, linux-kernel

On Tue, Mar 06, 2012 at 11:43:17PM +0900, Takuya Yoshikawa wrote:
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > > +	spin_lock(&kvm->mmu_lock);
> > 
> > It is not clear why mmu_lock is needed. Dropping it across the xchg loop
> > should be similar to srcu implementation, in that concurrent updates
> > will be visible only on the next get_dirty call? Well, it is necessary
> > anyway for write protecting the sptes.
> 
> My implementation does write protection inside the xchg loop.
> Then, after that loop, flushes TLB.
> 
> mmu_lock must protect both of these together.
> 
> If we do not mind scanning the bitmap twice, we can decouple the
> xchg loop and write protection, but it will be a bit slower, and in
> any case we need to hold mmu_lock until TLB is flushed.

Why is it necessary to scan twice? Simply continuing to the next set 
of pages, after dropping the lock, should be enough.

The potential problem I am referring to is:

- kvm.git next + srcu-less series
average(ns)    stdev     ns/page    pages    improvement(%)

8497356.4    16441.0        32.4     256K     -29

So about 8ms for 1GB (256K pages of 4KB each). Assuming it increases
linearly, it would take about 400ms for get_dirty on a 50GB slot (most
of that time spent with mmu_lock held). Is this correct?

> As can be seen from the unit-test result the majority of time
> is being spent on write protecting sptes, so decoupling xchg loop
> alone will not alleviate the problem so much -- my guess.
> 
> > A cond_resched_lock() would alleviate the potentially long held 
> > times for mmu_lock (can you measure it with large memslots?)
> 
> How to move TLB flush out of mmu_lock critical sections was discussed
> before, and there seemed to be some proposals.
> 
> Anyone is working on that?
> 
> After that we can do many things.
> 
> One idea is to make the extra bitmap buffer size shrink to one page
> or so and do xchg and write protection loop by that limited size.
> 
> Because we can drop mmu_lock, it is possible to copy_to_user part of
> the dirty bitmap, and then go to the next part.
> 
> After everything is protected, we can then do TLB flush after dropping
> mmu_lock.
> 
> > Otherwise looks nice.
> 
> Thanks,
> 	Takuya
> 
> 
> > > -		r = -ENOMEM;
> > > -		slots = kmemdup(kvm->memslots, sizeof(*kvm->memslots), GFP_KERNEL);
> > > -		if (!slots)
> > > -			goto out;
> > > +	for (i = 0; i < n / sizeof(long); i++) {
> > > +		unsigned long mask;
> > > +		gfn_t offset;
> > >  
> > > -		memslot = id_to_memslot(slots, log->slot);
> > > -		memslot->nr_dirty_pages = 0;
> > > -		memslot->dirty_bitmap = dirty_bitmap_head;
> > > -		update_memslots(slots, NULL);
> > > +		if (!dirty_bitmap[i])
> > > +			continue;
> > >  
> > > -		old_slots = kvm->memslots;
> > > -		rcu_assign_pointer(kvm->memslots, slots);
> > > -		synchronize_srcu_expedited(&kvm->srcu);
> > > -		kfree(old_slots);
> > > +		is_dirty = true;
> > >  
> > > -		write_protect_slot(kvm, memslot, dirty_bitmap, nr_dirty_pages);
> > > +		mask = xchg(&dirty_bitmap[i], 0);
> > > +		dirty_bitmap_buffer[i] = mask;
> > >  
> > > -		r = -EFAULT;
> > > -		if (copy_to_user(log->dirty_bitmap, dirty_bitmap, n))
> > > -			goto out;
> > > -	} else {
> > > -		r = -EFAULT;
> > > -		if (clear_user(log->dirty_bitmap, n))
> > > -			goto out;
> > > +		offset = i * BITS_PER_LONG;
> > > +		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
> > >  	}
> > > +	if (is_dirty)
> > > +		kvm_flush_remote_tlbs(kvm);
> > > +
> > > +	spin_unlock(&kvm->mmu_lock);
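
Putting the quoted hunk back together, the core of the srcu-less
get_dirty_log() looks roughly like this (a sketch only: the declarations of
n, is_dirty, dirty_bitmap and dirty_bitmap_buffer, and the final
copy_to_user() of the harvested buffer, are assumed from the rest of the
patch and are not quoted here):

	spin_lock(&kvm->mmu_lock);

	for (i = 0; i < n / sizeof(long); i++) {
		unsigned long mask;
		gfn_t offset;

		if (!dirty_bitmap[i])
			continue;		/* no dirty pages in this word */

		is_dirty = true;

		/* atomically grab and clear one word of the live bitmap */
		mask = xchg(&dirty_bitmap[i], 0);
		dirty_bitmap_buffer[i] = mask;	/* snapshot for userspace */

		/* write protect exactly the sptes behind the bits just taken */
		offset = i * BITS_PER_LONG;
		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
	}

	if (is_dirty)
		kvm_flush_remote_tlbs(kvm);	/* flush before the log is handed out */

	spin_unlock(&kvm->mmu_lock);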


* Re: [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-06 15:01         ` Marcelo Tosatti
@ 2012-03-06 15:23           ` Takuya Yoshikawa
  2012-03-06 15:28             ` Marcelo Tosatti
  0 siblings, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-06 15:23 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Takuya Yoshikawa, avi, kvm, linux-kernel

Marcelo Tosatti <mtosatti@redhat.com> wrote:

> > If we do not mind scanning the bitmap twice, we can decouple the
> > xchg loop and write protection, but it will be a bit slower, and in
> > any case we need to hold mmu_lock until TLB is flushed.
> 
> Why is it necessary to scan twice? Simply continuing to the next set 
> of pages, after dropping the lock, should be enough.

We cannot drop the lock.
Do you mean doing TLB flush each time before dropping the lock?

> The potential problem i am referring to is:
> 
> - kvm.git next + srcu-less series
> average(ns)    stdev     ns/page    pages    improvement(%)
> 
> 8497356.4    16441.0        32.4     256K     -29
> 
> So 8ms for 1GB. Assuming it increases linearly, it would take 
> 400ms for get_dirty on a 50GB slot (most of that time spent 
> with mmu_lock held). Is this correct?

Partly yes: my method mainly depends on the number of dirty pages,
not slot size.

But it is not a new problem: traversing all shadow pages for that
also takes linearly increasing time.

If that 1GB of dirty memory is in a 50GB slot, my method will reduce
the latency a great deal compared to the current way.  I do not want
to imagine checking every shadow page in such a huge slot.

Checking only the pages found in the dirty bitmap should be better.

	Takuya


* Re: [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-06 15:23           ` Takuya Yoshikawa
@ 2012-03-06 15:28             ` Marcelo Tosatti
  2012-03-07  8:07               ` Takuya Yoshikawa
  2012-03-07  8:18               ` Takuya Yoshikawa
  0 siblings, 2 replies; 40+ messages in thread
From: Marcelo Tosatti @ 2012-03-06 15:28 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Takuya Yoshikawa, avi, kvm, linux-kernel

On Wed, Mar 07, 2012 at 12:23:17AM +0900, Takuya Yoshikawa wrote:
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > > If we do not mind scanning the bitmap twice, we can decouple the
> > > xchg loop and write protection, but it will be a bit slower, and in
> > > any case we need to hold mmu_lock until TLB is flushed.
> > 
> > Why is it necessary to scan twice? Simply continuing to the next set 
> > of pages, after dropping the lock, should be enough.
> 
> We cannot drop the lock.
> Do you mean doing TLB flush each time before dropping the lock?

Yes, only if there is contention (see cond_resched_lock).

> > The potential problem i am referring to is:
> > 
> > - kvm.git next + srcu-less series
> > average(ns)    stdev     ns/page    pages    improvement(%)
> > 
> > 8497356.4    16441.0        32.4     256K     -29
> > 
> > So 8ms for 1GB. Assuming it increases linearly, it would take 
> > 400ms for get_dirty on a 50GB slot (most of that time spent 
> > with mmu_lock held). Is this correct?
> 
> Partly yes: my method mainly depends on the number of dirty pages,
> not slot size.
> 
> But it is not a new problem: traversing all shadow pages for that
> also takes linearly increasing time.

It was not necessary to read the bitmap under mmu_lock previously.

> If that 1GB dirty memory is in a 50GB slot, my method will alleviate
> the latency really a lot compared to the current way.  I do not want
> to imagine checking every shadow page in such a huge slot.
> 
> Checking pages found in the dirty bitmap only should be better.
> 
> 	Takuya


* Re: [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-06 15:28             ` Marcelo Tosatti
@ 2012-03-07  8:07               ` Takuya Yoshikawa
  2012-03-07 23:25                 ` Marcelo Tosatti
  2012-03-07  8:18               ` Takuya Yoshikawa
  1 sibling, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-07  8:07 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Takuya Yoshikawa, avi, kvm, linux-kernel

Marcelo Tosatti <mtosatti@redhat.com> wrote:

> > Partly yes: my method mainly depends on the number of dirty pages,
> > not slot size.
> > 
> > But it is not a new problem: traversing all shadow pages for that
> > also takes linearly increasing time.
> 
> It was not necessary to read the bitmap under mmu_lock previously.

Let's check actual data!

Below I pasted a simple test result to show that reading the bitmap is not
a problem at all compared to traversing shadow pages.

** While doing the same live migration test as:

  For real workloads, both VGA and live migration, we have observed pure
  improvements: when the guest was reading a file during live migration,
  we originally saw a few ms of latency, but with the new method the
  latency was less than 200us.

I measured how long the current method takes to just write protect sptes
with the mmu_lock held - kvm_mmu_slot_remove_write_access() time.


You can see protection times of many milliseconds in this result: for me this
is more problematic than the downtime problem many people want to improve.

In contrast, my method only took 200us in the worst case: what I actually
measured there was the entire kvm_vm_ioctl_get_dirty_log() time, which
includes extra work such as copy_to_user().

  FYI: changing the guest memory size from 4GB to 8GB did not show any
  significant change with my method, but, as you can guess, traversing
  shadow pages needs more time as the number of shadow pages increases.


If we have 4K shadow pages in the slot, kvm_mmu_slot_remove_write_access()
has to traverse all of them, checking all 512 entries in each.

Compared to that, the dirty bitmap of a 4GB memory slot is 1M bits = 128KB.
Reading this 128KB is negligible.

My unit-test experiment also showed that the xchg overhead is small
compared to the rest:

   493900.4   15911.9       60.2      125%     5%      8K
   760268.2    5929.6       46.4       63%   199%     16K
  1238709.6    7123.5       37.8       23%   173%     32K
  2359523.6    3121.7       36.0       -9%    87%     64K
  4540780.6   10155.6       34.6      -27%    30%    128K
  8747274.0   10973.3       33.3      -31%    -3%    256K

Note that these cases need to xchg the entire dirty bitmap because at least
one bit is set in each unsigned-long word.

The big difference comes from the number of sptes to protect alone.

	Takuya

===
funcgraph_entry:      + 25.123 us  |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      + 35.746 us  |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 922.886 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      + 20.153 us  |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      + 20.424 us  |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      + 17.595 us  |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      + 20.240 us  |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 9783.060 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1992.718 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1312.128 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2028.900 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1455.889 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1382.795 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2030.321 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1407.248 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2189.321 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1444.344 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2291.976 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1801.848 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1993.104 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1531.858 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2394.283 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1613.203 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1699.472 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2416.467 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1566.451 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1772.670 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1700.544 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1590.114 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2311.419 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1923.888 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2534.780 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2083.623 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1664.170 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2867.553 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2684.615 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1706.371 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2655.976 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1720.777 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2993.758 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1924.842 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 3091.190 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1776.427 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2808.984 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2669.008 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2359.525 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2703.617 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2623.198 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1942.833 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1906.551 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2981.093 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2168.301 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1949.932 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2992.925 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 3360.511 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1993.321 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 3187.857 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 1989.417 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2001.865 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2047.220 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 3107.808 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2039.732 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2057.575 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2417.748 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2076.445 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2308.323 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 3216.713 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2148.263 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2269.673 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 2133.566 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 3757.388 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 3372.302 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 3679.316 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 3516.200 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 630.067 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 3191.830 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      ! 658.717 us |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      + 66.683 us  |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:      + 31.027 us  |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:        0.274 us   |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:        0.568 us   |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:        0.460 us   |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:        0.358 us   |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:        0.197 us   |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:        0.306 us   |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:        0.259 us   |  kvm_mmu_slot_remove_write_access();
funcgraph_entry:        0.181 us   |  kvm_mmu_slot_remove_write_access();


* Re: [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-06 15:28             ` Marcelo Tosatti
  2012-03-07  8:07               ` Takuya Yoshikawa
@ 2012-03-07  8:18               ` Takuya Yoshikawa
  2012-03-07 23:20                 ` Marcelo Tosatti
  1 sibling, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-07  8:18 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Takuya Yoshikawa, avi, kvm, linux-kernel

Marcelo Tosatti <mtosatti@redhat.com> wrote:

> > We cannot drop the lock.
> > Do you mean doing TLB flush each time before dropping the lock?
> 
> Yes, only if there is contention (see cond_resched_lock).

But how can we conditionally call kvm_flush_remote_tlbs() from
inside __cond_resched_lock right before spin_unlock()?

We need another API.

	Takuya


* Re: [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-07  8:18               ` Takuya Yoshikawa
@ 2012-03-07 23:20                 ` Marcelo Tosatti
  0 siblings, 0 replies; 40+ messages in thread
From: Marcelo Tosatti @ 2012-03-07 23:20 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Takuya Yoshikawa, avi, kvm, linux-kernel

On Wed, Mar 07, 2012 at 05:18:26PM +0900, Takuya Yoshikawa wrote:
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > > We cannot drop the lock.
> > > Do you mean doing TLB flush each time before dropping the lock?
> > 
> > Yes, only if there is contention (see cond_resched_lock).
> 
> But how can we conditionally call kvm_flush_remote_tlbs() from
> inside __cond_resched_lock right before spin_unlock()?
>
> We need another API.

Yep, a callback into cond_resched_lock_cb (or something) 
could do it.
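
Such a helper does not exist today; a rough sketch of what it could look
like, modelled on the existing cond_resched_lock()/spin_needbreak() pattern
(purely illustrative, not part of any posted patch):

	/* hypothetical: not in sched.h */
	static inline int cond_resched_lock_cb(spinlock_t *lock,
					       void (*cb)(void *data), void *data)
	{
		int ret = 0;

		if (need_resched() || spin_needbreak(lock)) {
			if (cb)
				cb(data);	/* e.g. kvm_flush_remote_tlbs(kvm) */
			spin_unlock(lock);
			cond_resched();
			spin_lock(lock);
			ret = 1;
		}
		return ret;
	}

The bitmap walk in get_dirty_log() could then pass kvm_flush_remote_tlbs()
as the callback, so that no stale writable TLB entry survives the temporary
unlock.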



* Re: [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-07  8:07               ` Takuya Yoshikawa
@ 2012-03-07 23:25                 ` Marcelo Tosatti
  2012-03-08  1:35                   ` Takuya Yoshikawa
  0 siblings, 1 reply; 40+ messages in thread
From: Marcelo Tosatti @ 2012-03-07 23:25 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Takuya Yoshikawa, avi, kvm, linux-kernel

On Wed, Mar 07, 2012 at 05:07:45PM +0900, Takuya Yoshikawa wrote:
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > > Partly yes: my method mainly depends on the number of dirty pages,
> > > not slot size.
> > > 
> > > But it is not a new problem: traversing all shadow pages for that
> > > also takes linearly increasing time.
> > 
> > It was not necessary to read the bitmap under mmu_lock previously.
> 
> Let's check actual data!
> 
> Below I pasted a simple test result to show that reading bitmap is not
> a problem at all compared to traversing shadow pages.
> 
> ** During doing the same live migration test as:
> 
>   For real workloads, both VGA and live migration, we have observed pure
>   improvements: when the guest was reading a file during live migration,
>   we originally saw a few ms of latency, but with the new method the
>   latency was less than 200us.
> 
> I measured how long the current method takes to just write protect sptes
> with the mmu_lock held - kvm_mmu_slot_remove_write_access() time.
> 
> 
> You can see many ms order of protection times from this result: for me this
> is more problematic than downtime problem many people like to improve.
> 
> In contrast my method only took 200us in the worst case: actually what I
> measured for that was the entire kvm_vm_ioctl_get_dirty_log() time which
> contained more extra tasks, e.g. copy_to_user().
> 
>   FYI: changing the guest memory size from 4GB to 8GB did not show any
>   siginificant change to my method, but, as you can guess, traversing
>   shadow pages will need more time for increased shadow pages.
> 
> 
> If we have 4K shadow pages in the slot, kvm_mmu_slot_remove_write_access()
> have to traverse all of them, checking all 512 entries in them.
> 
> Compared to that, the bitmap size of 4GB memory slot is 1M bits = 128KB.
> Reading this 8 pages is negligible.

Right, thanks for checking it.

> My unit-test experiment has also showed that xchg overheads is not so much,
> compared to others:
> 
>    493900.4   15911.9       60.2      125%     5%      8K
>    760268.2    5929.6       46.4       63%   199%     16K
>   1238709.6    7123.5       37.8       23%   173%     32K
>   2359523.6    3121.7       36.0       -9%    87%     64K
>   4540780.6   10155.6       34.6      -27%    30%    128K
>   8747274.0   10973.3       33.3      -31%    -3%    256K
> 
> Note that these cases need to xchg the entire dirty bitmap because at least
> one bit is set for each unsigned-long-word.
> 
> The big difference came from the number of sptes to protect alone.
> 
> 	Takuya

What is worrying are large memory cases: think of the 50GB slot case.
100ms hold time is pretty bad (and reacquiring the lock is relatively
simple).

> 
> ===
> funcgraph_entry:      + 25.123 us  |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      + 35.746 us  |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 922.886 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      + 20.153 us  |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      + 20.424 us  |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      + 17.595 us  |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      + 20.240 us  |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 9783.060 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1992.718 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1312.128 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2028.900 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1455.889 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1382.795 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2030.321 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1407.248 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2189.321 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1444.344 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2291.976 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1801.848 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1993.104 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1531.858 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2394.283 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1613.203 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1699.472 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2416.467 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1566.451 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1772.670 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1700.544 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1590.114 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2311.419 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1923.888 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2534.780 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2083.623 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1664.170 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2867.553 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2684.615 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1706.371 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2655.976 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1720.777 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2993.758 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1924.842 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 3091.190 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1776.427 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2808.984 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2669.008 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2359.525 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2703.617 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2623.198 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1942.833 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1906.551 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2981.093 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2168.301 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1949.932 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2992.925 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 3360.511 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1993.321 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 3187.857 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 1989.417 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2001.865 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2047.220 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 3107.808 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2039.732 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2057.575 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2417.748 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2076.445 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2308.323 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 3216.713 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2148.263 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2269.673 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 2133.566 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 3757.388 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 3372.302 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 3679.316 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 3516.200 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 630.067 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 3191.830 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      ! 658.717 us |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      + 66.683 us  |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:      + 31.027 us  |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:        0.274 us   |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:        0.568 us   |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:        0.460 us   |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:        0.358 us   |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:        0.197 us   |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:        0.306 us   |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:        0.259 us   |  kvm_mmu_slot_remove_write_access();
> funcgraph_entry:        0.181 us   |  kvm_mmu_slot_remove_write_access();


* Re: [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-07 23:25                 ` Marcelo Tosatti
@ 2012-03-08  1:35                   ` Takuya Yoshikawa
  2012-03-09  0:08                     ` Marcelo Tosatti
  0 siblings, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-08  1:35 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Takuya Yoshikawa, avi, kvm, linux-kernel

Marcelo Tosatti <mtosatti@redhat.com> wrote:

> What is worrying are large memory cases: think of the 50GB slot case.
> 100ms hold time is pretty bad (and reacquiring the lock is relatively
> simple).
> 

OK, I agree basically.

But let me explain one thing before deciding what I should do next.

With my method, even when we use a 50GB slot, the hold time will be under
10ms -- not 100ms -- if the memory actually updated from the last time is
1GB (256K dirty pages).

> >   8747274.0   10973.3       33.3      -31%    -3%    256K
  Note that this unit test was done on my tiny Core i3 32-bit host.
  On servers that can hold more than 50GB of memory, this will become
  much faster: my live migration tests done on a Xeon actually showed much
  better numbers.  The unit test has been tuned for the worst case.

I admit that if the dirty memory size is more than 10GB, we may see over
100ms hold time you are worrying about.

  For that I was proposing a new GET_DIRTY_LOG API which can restrict the
  number of dirty pages for which we get the log - but this is a long-term
  goal.


So, I am OK with trying to introduce cond_resched_lock_cb() as you suggested.
But, as I explained above, my current implementation does not introduce
any real regression with respect to mmu_lock hold time:

  Before, we could see 10ms hold times in real workloads:
  > funcgraph_entry:      ! 9783.060 us |  kvm_mmu_slot_remove_write_access();

  I have never seen millisecond hold times with my method in the same workloads.

So, it would be helpful if you could apply the patch series so that I can
work on top of it: although I cannot use servers with 100GB of memory now,
migrating a guest with 16GB of memory or so may be possible later: I need
to reserve servers for that.

I also want to know about the "mmu_lock -- TLB flush" decoupling plan.  We will not
need to introduce cond_resched_lock_cb() in sched.h if that is possible.

Thanks,
	Takuya


* Re: [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-08  1:35                   ` Takuya Yoshikawa
@ 2012-03-09  0:08                     ` Marcelo Tosatti
  2012-03-12 12:05                       ` Avi Kivity
  0 siblings, 1 reply; 40+ messages in thread
From: Marcelo Tosatti @ 2012-03-09  0:08 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Takuya Yoshikawa, avi, kvm, linux-kernel

On Thu, Mar 08, 2012 at 10:35:45AM +0900, Takuya Yoshikawa wrote:
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > What is worrying are large memory cases: think of the 50GB slot case.
> > 100ms hold time is pretty bad (and reacquiring the lock is relatively
> > simple).
> > 
> 
> OK, I agree basically.
> 
> But let me explain one thing before deciding what I should do next.
> 
> With my method, even when we use a 50GB slot, the hold time will be under
> 10ms -- not 100ms -- if the memory actually updated from the last time is
> 1GB (256K dirty pages).
> 
> > >   8747274.0   10973.3       33.3      -31%    -3%    256K
>   Note that this unit-test was done on my tiny core-i3 32-bit host.
>   On servers which can install more than 50GB memory, this will become
>   much faster: actually my live migration tests done on Xeon saw much
>   better numbers.  Unit-test has been tuned for the worst case.
> 
> I admit that if the dirty memory size is more than 10GB, we may see over
> 100ms hold time you are worrying about.
> 
>   For that I was proposing introducing a new GET_DIRTY_LOG API which can
>   restrict the number of dirty pages we get the log - but this is a long
>   term goal.
> 
> 
> So, I am OK to try to introduce cond_resched_lock_cb() as you suggested.
> But, as I explained above, my current implementation does not introduce
> any real regression concerning to mmu_lock hold time:
> 
>   Before we could see 10ms hold time in real workloads:
>   > funcgraph_entry:      ! 9783.060 us |  kvm_mmu_slot_remove_write_access();
> 
>   I have never seen ms hold time with my method in the same workloads.
> 
> So, it would be helpful if you can apply the patch series and I can work
> on top of that: although I cannot use servers with 100GB memory now,
> migrating a guest with 16GB memory or so may be possible later: I need
> to reserve servers for that.

Makes sense.

It looks good to me, Avi can you review & ack please?

> I also want to know "mmu_lock -- TLB flush"-decoupling plan.  We will not
> need to introduce cond_resched_lock_cb() in sched.h if that is possible.
> 
> Thanks,
> 	Takuya


* Re: [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others
  2012-03-01 10:31 ` [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others Takuya Yoshikawa
@ 2012-03-12  7:39   ` Takuya Yoshikawa
  2012-03-12  7:52     ` Takuya Yoshikawa
  0 siblings, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-12  7:39 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel

Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> wrote:

> -int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
> -			       struct kvm_memory_slot *slot)
> +static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, int level)
>  {
> -	unsigned long *rmapp;
> -	u64 *spte;
> -	int i, write_protected = 0;
> +	u64 *spte = NULL;
> +	int write_protected = 0;
>  
> -	rmapp = __gfn_to_rmap(gfn, PT_PAGE_TABLE_LEVEL, slot);
> -	spte = rmap_next(rmapp, NULL);
> -	while (spte) {
> +	while ((spte = rmap_next(rmapp, spte))) {
>  		BUG_ON(!(*spte & PT_PRESENT_MASK));
>  		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
> -		if (is_writable_pte(*spte)) {
> +
> +		if (!is_writable_pte(*spte))
> +			continue;
> +
> +		if (level == PT_PAGE_TABLE_LEVEL) {
>  			mmu_spte_update(spte, *spte & ~PT_WRITABLE_MASK);
> -			write_protected = 1;
> +		} else {
> +			BUG_ON(!is_large_pte(*spte));
> +			drop_spte(kvm, spte);
> +			--kvm->stat.lpages;
> +			spte = NULL;
>  		}
> -		spte = rmap_next(rmapp, spte);
> +
> +		write_protected = 1;
>  	}

Something may change here: when level > PT_PAGE_TABLE_LEVEL, this loop
does not handle lower level mappings after dropping large-ptes.

This may be incorrect.

	Takuya
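
For reference, the loop in question with the diff markers stripped
(reconstructed from the hunk quoted above; the trailing return of
write_protected is assumed, since the end of the function is not quoted):

	static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
					int level)
	{
		u64 *spte = NULL;
		int write_protected = 0;

		while ((spte = rmap_next(rmapp, spte))) {
			BUG_ON(!(*spte & PT_PRESENT_MASK));
			rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);

			if (!is_writable_pte(*spte))
				continue;

			if (level == PT_PAGE_TABLE_LEVEL) {
				/* 4K mapping: just clear the W bit */
				mmu_spte_update(spte, *spte & ~PT_WRITABLE_MASK);
			} else {
				/* huge mapping: drop it and restart the rmap walk */
				BUG_ON(!is_large_pte(*spte));
				drop_spte(kvm, spte);
				--kvm->stat.lpages;
				spte = NULL;
			}

			write_protected = 1;
		}

		return write_protected;	/* assumed; not part of the quoted hunk */
	}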


* Re: [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others
  2012-03-12  7:39   ` Takuya Yoshikawa
@ 2012-03-12  7:52     ` Takuya Yoshikawa
  0 siblings, 0 replies; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-12  7:52 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel

Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> wrote:

> Something may change here: when level > PT_PAGE_TABLE_LEVEL, this loop
> does not handle lower level mappings after dropping large-ptes.
> 
> This may be incorrect.

On second thoughts, this seems to be no problem.
I was just confused.

Thanks,
	Takuya


* Re: [PATCH 3/4 changelog-v2] KVM: Switch to srcu-less get_dirty_log()
  2012-03-09  0:08                     ` Marcelo Tosatti
@ 2012-03-12 12:05                       ` Avi Kivity
  0 siblings, 0 replies; 40+ messages in thread
From: Avi Kivity @ 2012-03-12 12:05 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Takuya Yoshikawa, Takuya Yoshikawa, kvm, linux-kernel

On 03/09/2012 02:08 AM, Marcelo Tosatti wrote:
> > So, it would be helpful if you can apply the patch series and I can work
> > on top of that: although I cannot use servers with 100GB memory now,
> > migrating a guest with 16GB memory or so may be possible later: I need
> > to reserve servers for that.
>
> Makes sense.

To me too.

> It looks good to me, Avi can you review & ack please?

Patchset is fine from my point of view, I think it's a great improvement.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-01 10:32 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
  2012-03-02  2:56   ` Takuya Yoshikawa
@ 2012-03-12 18:04   ` Avi Kivity
  2012-03-13  9:20     ` Takuya Yoshikawa
  2012-03-13 23:04   ` Marcelo Tosatti
  2 siblings, 1 reply; 40+ messages in thread
From: Avi Kivity @ 2012-03-12 18:04 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: mtosatti, kvm, linux-kernel

On 03/01/2012 12:32 PM, Takuya Yoshikawa wrote:
> Dropped such mappings when we enabled dirty logging and we will never
> create new ones until we stop the logging.
>
> For this we introduce a new function which can be used to write protect
> a range of PT level pages: although we do not need to care about a range
> of pages at this point, the following patch will need this feature to
> optimize the write protection of many pages.
>
>

It occurs to me that we should write-protect huge page tables, since it
makes write protection much faster (we make up for this later at write
fault time, but that might not occur, and even if it does we reduce
guest jitter).  In fact I once proposed a more involved variant, called
O(1) write protection, in which we write-protect the topmost page table
only and only un-write-protect the paths that fault.

That can be done later however and shouldn't affect this patchset.
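
For reference, the range/mask write-protect helper that the changelog above
introduces (and that the srcu-less loop in patch 3 calls) presumably has
roughly this shape -- a sketch assuming the per-slot rmap array and the
__rmap_write_protect() from patch 1, and that only PT-level (4K) rmaps need
checking, per the changelog:

	void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
					     struct kvm_memory_slot *slot,
					     gfn_t gfn_offset, unsigned long mask)
	{
		unsigned long *rmapp;

		while (mask) {
			/* rmap of the next dirty gfn in this word */
			rmapp = &slot->rmap[gfn_offset + __ffs(mask)];
			__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);

			mask &= mask - 1;	/* clear the lowest set bit */
		}
	}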

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-12 18:04   ` Avi Kivity
@ 2012-03-13  9:20     ` Takuya Yoshikawa
  2012-03-13 10:12       ` Avi Kivity
  0 siblings, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-13  9:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: mtosatti, kvm, linux-kernel

Avi Kivity <avi@redhat.com> wrote:

> It occurs to me that we should write-protect huge page tables, since it
> makes write protection much faster (we make up for this later at write
> fault time, but that might not occur, and even if it does we reduce
> guest jitter).  In fact I once proposed a more involved variant, called

Do you mean protecting huge page tables when we start logging and dropping
them, one by one, at the time of write fault?

If so, we do not need to change get_dirty_log() implementation.
Changing kvm_mmu_remove_write_access() and fault handler seems to be enough.

> O(1) write protection, in which we write-protect the topmost page table
> only and only un-write-protect the paths that fault.

> That can be done later however and shouldn't affect this patchset.

I have some additional patches to optimize rmap handling which seem to
improve get_dirty_log() by about 10% in the worst case.

After that, I want to take some time to prepare for further developments
because my current test tools are not enough now.

	Takuya


* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-13  9:20     ` Takuya Yoshikawa
@ 2012-03-13 10:12       ` Avi Kivity
  0 siblings, 0 replies; 40+ messages in thread
From: Avi Kivity @ 2012-03-13 10:12 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: mtosatti, kvm, linux-kernel

On 03/13/2012 11:20 AM, Takuya Yoshikawa wrote:
> Avi Kivity <avi@redhat.com> wrote:
>
> > It occurs to me that we should write-protect huge page tables, since it
> > makes write protection much faster (we make up for this later at write
> > fault time, but that might not occur, and even if it does we reduce
> > guest jitter).  In fact I once proposed a more involved variant, called
>
> Do you mean protecting huge page tables when we start logging and dropping
> them, one by one, at the time of write fault?

Yes.  Even more generally, protecting PML4Es, then at fault time
un-protecting the faulting PML4E and write-protecting all underlying
PDPEs except the one for the faulting address, and similarly for PDEs
and PTEs.
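
As a purely illustrative sketch of that idea (entry_for(), child_table_of()
and write_protect_all_entries() are made-up helpers; only the level
constants are real):

	static void o1_wp_fault(struct kvm *kvm, u64 *root, gfn_t gfn)
	{
		u64 *table = root;
		int level;

		for (level = PT64_ROOT_LEVEL; level > PT_PAGE_TABLE_LEVEL; --level) {
			u64 *entry = entry_for(table, gfn, level);
			u64 *child = child_table_of(*entry);

			/* push the write protection one level down ... */
			write_protect_all_entries(kvm, child);
			/* ... then reopen only the entry on the faulting path */
			*entry |= PT_WRITABLE_MASK;

			table = child;
		}
		/*
		 * The leaf spte is then handled by the normal dirty-logging
		 * fault path: mark_page_dirty() and make it writable.
		 */
	}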

> If so, we do not need to change get_dirty_log() implementation.
> Changing kvm_mmu_remove_write_access() and fault handler seems to be enough.
>
> > O(1) write protection, in which we write-protect the topmost page table
> > only and only un-write-protect the paths that fault.
>
> > That can be done later however and shouldn't affect this patchset.
>
> I have some additional patches to optimize rmap handling which seems to
> improve get_dirty_log() about 10% in the worst case.

Great, looking forward.

> After that, I want to take some time to prepare for further developments
> because my current test tools are not enough now.
>

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-01 10:32 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
  2012-03-02  2:56   ` Takuya Yoshikawa
  2012-03-12 18:04   ` Avi Kivity
@ 2012-03-13 23:04   ` Marcelo Tosatti
  2012-03-14  1:04     ` Takuya Yoshikawa
  2012-03-14  5:34     ` Takuya Yoshikawa
  2 siblings, 2 replies; 40+ messages in thread
From: Marcelo Tosatti @ 2012-03-13 23:04 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, kvm, linux-kernel

On Thu, Mar 01, 2012 at 07:32:16PM +0900, Takuya Yoshikawa wrote:
> Dropped such mappings when we enabled dirty logging and we will never
> create new ones until we stop the logging.
> 
> For this we introduce a new function which can be used to write protect
> a range of PT level pages: although we do not need to care about a range
> of pages at this point, the following patch will need this feature to
> optimize the write protection of many pages.
> 
> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
> ---
>  arch/x86/include/asm/kvm_host.h |    5 ++-
>  arch/x86/kvm/mmu.c              |   40 +++++++++++++++++++++++++++++---------
>  arch/x86/kvm/x86.c              |    8 ++----
>  3 files changed, 36 insertions(+), 17 deletions(-)

This is a race with hugetlbfs which is not an issue ATM (it is 
hidden by the removal of huge sptes in get_dirty).


guest fault                                     enable dirty logging

tdp_page_fault (all _page_fault functions)      kvm_set_memory_region


level = mapping_level(vcpu, gfn)
(finds level == 2 or 3)


                                                rcu_assign_pointer(slot
								   with 
                                                                   ->dirty_bitmap)
                                                synchronize_srcu_expedited()
schedule()
                                                kvm_arch_commit_memory_region()
                                                spin_lock(mmu_lock)
                                                kvm_mmu_slot_remove_write_access()
						removes large sptes
                                                spin_unlock(mmu_lock)
spin_lock(mmu_lock)
create large spte accordingly
to level above
spin_unlock(mmu_lock)

Not removing large sptes in get_dirty means these racy sptes could
live from the start of migration to the end of it.

It can be fixed with a preceding patch that checks whether the
slot->dirty_bitmap value changes between mapping_level and after
mmu_lock acquisition, similarly to mmu_seq. Also please add a
WARN_ON in mmu_set_spte if (slot->dirty_bitmap && level > 1).
And document it clearly.
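
One possible shape of that check, following the mmu_seq pattern
(illustrative only; the helper name is made up, and see the follow-ups
below for whether it is needed at all):

	/*
	 * 'was_logging' is sampled before mmu_lock is taken, i.e. when
	 * mapping_level() chose 'level'; call this again under mmu_lock.
	 */
	static int fixup_level_for_dirty_logging(struct kvm_memory_slot *slot,
						 int level, bool was_logging)
	{
		if (!was_logging && slot->dirty_bitmap)
			return PT_PAGE_TABLE_LEVEL;	/* logging started meanwhile */

		return level;
	}

together with the WARN_ON(slot->dirty_bitmap && level > PT_PAGE_TABLE_LEVEL)
in mmu_set_spte() requested above.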




* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-13 23:04   ` Marcelo Tosatti
@ 2012-03-14  1:04     ` Takuya Yoshikawa
  2012-03-14  5:34     ` Takuya Yoshikawa
  1 sibling, 0 replies; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-14  1:04 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: avi, kvm, linux-kernel

Marcelo Tosatti <mtosatti@redhat.com> wrote:

> 
> This is a race with hugetlbfs which is not an issue ATM (it is 
> hidden by the removal of huge sptes in get_dirty).

Thank you!
I did not notice this possibility at all.

...

> It can be fixed with a preceding patch that checks whether
> slot->dirty_bitmap value changes between mapping_level and after
> mmu_lock acquision, similarly to mmu_seq. Also please add a 
> WARN_ON in mmu_set_spte if(slot->dirty_bitmap && level > 1).
> And document it clearly.

We may also be able to change the "if (slot->dirty_bitmap)" check to
use another flag so that we can delay the start of logging until
mmu_lock acquisition in kvm_set_memory_region().

Looking at dirty_bitmap directly should be limited to when we are in
the get_dirty functions.

	Takuya


* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-13 23:04   ` Marcelo Tosatti
  2012-03-14  1:04     ` Takuya Yoshikawa
@ 2012-03-14  5:34     ` Takuya Yoshikawa
  2012-03-14 10:58       ` Marcelo Tosatti
  1 sibling, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-14  5:34 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: avi, kvm, linux-kernel

Marcelo Tosatti <mtosatti@redhat.com> wrote:

> guest fault                                     enable dirty logging
> 
> tdp_page_fault (all _page_fault functions)      kvm_set_memory_region
> 
> 
> level = mapping_level(vcpu, gfn)
> (finds level == 2 or 3)
> 
> 
>                                                 rcu_assign_pointer(slot
> 								   with 
>                                                                    ->dirty_bitmap)
>                                                 synchronize_srcu_expedited()

Isn't this still inside the SRCU read-side critical section?

> schedule()
>                                                 kvm_arch_commit_memory_region()
>                                                 spin_lock(mmu_lock)
>                                                 kvm_mmu_slot_remove_write_access()
> 						removes large sptes
>                                                 spin_unlock(mmu_lock)
> spin_lock(mmu_lock)
> create large spte accordingly
> to level above
> spin_unlock(mmu_lock)
> 

If so, we cannot start kvm_arch_commit_memory_region() until the completion
of the SRCU critical section and this race will not happen.

	Takuya


* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-14  5:34     ` Takuya Yoshikawa
@ 2012-03-14 10:58       ` Marcelo Tosatti
  0 siblings, 0 replies; 40+ messages in thread
From: Marcelo Tosatti @ 2012-03-14 10:58 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, kvm, linux-kernel

On Wed, Mar 14, 2012 at 02:34:32PM +0900, Takuya Yoshikawa wrote:
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > guest fault                                     enable dirty logging
> > 
> > tdp_page_fault (all _page_fault functions)      kvm_set_memory_region
> > 
> > 
> > level = mapping_level(vcpu, gfn)
> > (finds level == 2 or 3)
> > 
> > 
> >                                                 rcu_assign_pointer(slot
> > 								   with 
> >                                                                    ->dirty_bitmap)
> >                                                 synchronize_srcu_expedited()
> 
> Isn't here still in the SRCU read-side critical section?
> 
> > schedule()
> >                                                 kvm_arch_commit_memory_region()
> >                                                 spin_lock(mmu_lock)
> >                                                 kvm_mmu_slot_remove_write_access()
> > 						removes large sptes
> >                                                 spin_unlock(mmu_lock)
> > spin_lock(mmu_lock)
> > create large spte accordingly
> > to level above
> > spin_unlock(mmu_lock)
> > 
> 
> If so, we cannot start kvm_arch_commit_memory_region() until the completion
> of the SRCU critical section and this race will not happen.
> 
> 	Takuya

Correct.



* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-03-01 10:33 ` [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log() Takuya Yoshikawa
  2012-03-03  5:21   ` [PATCH 3/4 changelog-v2] " Takuya Yoshikawa
@ 2012-03-16  5:03   ` Xiao Guangrong
  2012-03-16  6:55     ` Takuya Yoshikawa
  1 sibling, 1 reply; 40+ messages in thread
From: Xiao Guangrong @ 2012-03-16  5:03 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel

On 03/01/2012 06:33 PM, Takuya Yoshikawa wrote:


> +	spin_lock(&kvm->mmu_lock);
> 
> -		r = -ENOMEM;
> -		slots = kmemdup(kvm->memslots, sizeof(*kvm->memslots), GFP_KERNEL);
> -		if (!slots)
> -			goto out;
> +	for (i = 0; i < n / sizeof(long); i++) {
> +		unsigned long mask;
> +		gfn_t offset;
> 
> -		memslot = id_to_memslot(slots, log->slot);
> -		memslot->nr_dirty_pages = 0;
> -		memslot->dirty_bitmap = dirty_bitmap_head;
> -		update_memslots(slots, NULL);
> +		if (!dirty_bitmap[i])
> +			continue;
> 
> -		old_slots = kvm->memslots;
> -		rcu_assign_pointer(kvm->memslots, slots);
> -		synchronize_srcu_expedited(&kvm->srcu);
> -		kfree(old_slots);


From my quick review, mmu_lock cannot protect everything: if the guest page
is written without going through the shadow/EPT page tables, the dirty page
will be lost.

There is an example:

             CPU A                                   CPU  B
guest page is written by write-emulation

                                                  holds mmu_lock and sees that the
                                                  dirty bitmap has not changed, then
                                                  migration is completed.

calls mark_page_dirty() to set the dirty bitmap bit


Right?



* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-03-16  5:03   ` [PATCH 3/4] " Xiao Guangrong
@ 2012-03-16  6:55     ` Takuya Yoshikawa
  2012-03-16  7:30       ` Xiao Guangrong
  0 siblings, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-16  6:55 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi, mtosatti, kvm, linux-kernel

On Fri, 16 Mar 2012 13:03:48 +0800
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:

> For my quickly review, mmu_lock can not protect everything, if the guest page

Yes and ...

> is written out of the shadow page/ept table, dirty page will be lost.

No.

> 
> There is a example:
> 
>              CPU A                                   CPU  B
> guest page is written by write-emulation
> 
>                                                   hold mmu-lock and see dirty-bitmap
>                                                   is not be changed, then migration is
>                                                   completed.

We do not allow this break.

> 
> call mark_page_dirty() to set dirty_bit map
> 
> 
> Right?


As you pointed out, we cannot assume mutual exclusion by mmu_lock.
That is why we are using atomic bitmap operations: xchg and set_bit.

In this sense we are at least guaranteed to get the dirty page
information in dirty_bitmap - the current one or next one.

So what we should care about is to not miss the information written in
the next bitmap at the time we actually migrate the guest.

Actually the userspace stops the guest at the final stage and then sends the
remaining pages found in the bitmap.  So the above break between the write and
mark_page_dirty() cannot happen IIUC.


Thanks,
	Takuya
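
The guarantee described above boils down to two atomic operations on the
same bitmap word (a fragmentary sketch; the exact writer-side helper is the
set_bit mentioned above):

	/* writer side, mark_page_dirty() path */
	set_bit(gfn - memslot->base_gfn, memslot->dirty_bitmap);

	/* reader side, get_dirty_log() */
	mask = xchg(&dirty_bitmap[i], 0);

	/*
	 * Both are atomic read-modify-writes on the same unsigned long, so a
	 * bit set by the writer either shows up in 'mask' (reported by this
	 * GET_DIRTY_LOG) or stays in dirty_bitmap[i] (reported by the next
	 * one).  What is not guaranteed is that a write which has already
	 * hit guest memory is visible in *this* call's result -- hence the
	 * final GET_DIRTY_LOG with the VCPUs stopped.
	 */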


* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-03-16  6:55     ` Takuya Yoshikawa
@ 2012-03-16  7:30       ` Xiao Guangrong
  2012-03-16  7:55         ` Takuya Yoshikawa
  0 siblings, 1 reply; 40+ messages in thread
From: Xiao Guangrong @ 2012-03-16  7:30 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel

On 03/16/2012 02:55 PM, Takuya Yoshikawa wrote:

> On Fri, 16 Mar 2012 13:03:48 +0800
> Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:
> 
>> For my quickly review, mmu_lock can not protect everything, if the guest page
> 
> Yes and ...
> 
>> is written out of the shadow page/ept table, dirty page will be lost.
> 
> No.
> 
>>
>> There is a example:
>>
>>              CPU A                                   CPU  B
>> guest page is written by write-emulation
>>
>>                                                   hold mmu-lock and see dirty-bitmap
>>                                                   is not be changed, then migration is
>>                                                   completed.
> 
> We do not allow this break.
> 


Hmm? what can avoid this? Could you please point it out?


>>
>> call mark_page_dirty() to set dirty_bit map
>>
>>
>> Right?
> 
> 
> As you pointed out, we cannot assume mutual exclusion by mmu_lock.
> That is why we are using atomic bitmap operations: xchg and set_bit.
> 
> In this sense we are at least guaranteed to get the dirty page
> information in dirty_bitmap - the current one or next one.
> 


The problem is that the guest page is written before the dirty bitmap is set;
we may miss the dirty page in this window, as in the case above...

> So what we should care about is to not miss the information written in
> the next bitmap at the time we actually migrate the guest.
> 


Actually, the way we log dirty pages through the MMU page tables is tricky:

set dirty-bitmap

allow spte to be writeable

page can be written

That means we always set the dirty bitmap _before_ the page becomes dirty; that
is the reason why your bitmap approach can work.

> Actually the userspace stops the guest at the final stage and then send the
> remaining pages found in the bitmap.  So the above break between write and
> mark_page_dirty() cannot happen IIUC.
> 


Maybe I'd better first understand why "We do not allow this break" :)




* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-03-16  7:30       ` Xiao Guangrong
@ 2012-03-16  7:55         ` Takuya Yoshikawa
  2012-03-16  8:28           ` Xiao Guangrong
  0 siblings, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-16  7:55 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi, mtosatti, kvm, linux-kernel

On Fri, 16 Mar 2012 15:30:45 +0800
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:

> >> There is a example:
> >>
> >>              CPU A                                   CPU  B
> >> guest page is written by write-emulation
> >>
> >>                                                   hold mmu-lock and see dirty-bitmap
> >>                                                   is not be changed, then migration is
> >>                                                   completed.
> > 
> > We do not allow this break.
> > 
> 
> 
> Hmm? what can avoid this? Could you please point it out?

Stopping the guest before actually migrating it means the VCPU threads
must be back in userspace at that moment, no?

So when the final GET_DIRTY_LOG is being executed, thread A cannot be
in KVM.

> The problem is the guest page is written before dirty-bitmap is set,
> we may log the dirty page in this window like above case...

Exactly, but the next GET_DIRTY_LOG call can pick that up because, as I
wrote above, at this point the GET_DIRTY_LOG must not be the final one.

Makes sense?

	Takuya


* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-03-16  7:55         ` Takuya Yoshikawa
@ 2012-03-16  8:28           ` Xiao Guangrong
  2012-03-16  9:44             ` Takuya Yoshikawa
  0 siblings, 1 reply; 40+ messages in thread
From: Xiao Guangrong @ 2012-03-16  8:28 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel

On 03/16/2012 03:55 PM, Takuya Yoshikawa wrote:

> On Fri, 16 Mar 2012 15:30:45 +0800
> Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:
> 
>>>> There is a example:
>>>>
>>>>              CPU A                                   CPU  B
>>>> guest page is written by write-emulation
>>>>
>>>>                                                   hold mmu-lock and see dirty-bitmap
>>>>                                                   is not be changed, then migration is
>>>>                                                   completed.
>>>
>>> We do not allow this break.
>>>
>>
>>
>> Hmm? what can avoid this? Could you please point it out?
> 
> Stopping the guest before actualy migrating the guest means VCPU threads
> must be back in the userspace at the moment, no?
> 
> So when the final GET_DIRTY_LOG is being executed, thread A cannot be
> in KVM.
> 
>> The problem is the guest page is written before dirty-bitmap is set,
>> we may log the dirty page in this window like above case...
> 
> Exactly, but the next GET_DIRTY_LOG call can take that because, as I
> wrote above, at this time the GET_DIRTY_LOG must not be the final one.
> 


Thanks for your explanation, maybe you are right, I do not know migration
well.

What I am worried about is that you have changed the behaviour of GET_DIRTY_LOG:
with the current one, it can get all the dirty pages when it is called; after
your change, GET_DIRTY_LOG can return an empty dirty bitmap even though dirty
pages exist.

Migration may still work correctly because it depends on the final GET_DIRTY_LOG,
at which time the guest is stopped. But I am not sure whether other components
using GET_DIRTY_LOG, e.g. the frame buffer, are happy.




* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-03-16  8:28           ` Xiao Guangrong
@ 2012-03-16  9:44             ` Takuya Yoshikawa
  2012-03-19  9:34               ` Xiao Guangrong
  0 siblings, 1 reply; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-16  9:44 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi, mtosatti, kvm, linux-kernel

On Fri, 16 Mar 2012 16:28:56 +0800
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:

> Thanks for your explanation, maybe you are right, i do not know migration
> much.
> 
> What i worried about is, you have changed the behaviour of GET_DIRTY_LOG,
> in the current one, it can get all the dirty pages when it is called; after
> your change, GET_DIRTY_LOG can get a empty dirty bitmap but dirty page exists.

The current code also sees the same situation, because nothing prevents the
guest from writing to pages before GET_DIRTY_LOG returns and userspace checks
the bitmap.  Everything keeps running.

> Migration may work correctly because it depends on the final GET_DIRTY_LOG,
> at which time the guest is stopped. But I am not sure whether other components
> using GET_DIRTY_LOG are happy, e.g. the frame buffer.

Ah, you are probably worrying about what I discussed with Avi before.

In the end we cannot get a complete snapshot without stopping the guest
like migration does.  So that cannot be guaranteed.

The only thing it can promise is that the log can be retrieved after
mark_page_dirty().  Even when the bit is not yet visible at the Nth
GET_DIRTY_LOG call, we should be able to get it at the (N+1)th call - maybe
the (N+2)th.
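
To see why a late mark_page_dirty() is only deferred, never lost, here is a
simplified sketch of the xchg()-based harvesting this series switches to.
This is not the exact patch; harvest_dirty_bitmap() is an illustrative name,
and the write protection and copy-to-user steps are omitted.

/*
 * Each word of the slot's dirty bitmap is grabbed and cleared atomically
 * under mmu_lock.  A bit set after the xchg() simply stays in the bitmap,
 * so the next GET_DIRTY_LOG call reports it.
 */
static void harvest_dirty_bitmap(struct kvm *kvm,
				 struct kvm_memory_slot *memslot,
				 unsigned long *dest)
{
	unsigned long i, n = kvm_dirty_bitmap_bytes(memslot) / sizeof(long);

	spin_lock(&kvm->mmu_lock);
	for (i = 0; i < n; i++) {
		if (!memslot->dirty_bitmap[i])
			continue;
		dest[i] = xchg(&memslot->dirty_bitmap[i], 0);
	}
	spin_unlock(&kvm->mmu_lock);
}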

For VGA, the display is updated continuously and each page will be picked up
at some point: of course there may be a bit of lag.

BTW, marking framebuffer pages goes through the MMU and is protected by mmu_lock.


Thanks,
	Takuya

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-03-16  9:44             ` Takuya Yoshikawa
@ 2012-03-19  9:34               ` Xiao Guangrong
  2012-03-19 10:15                 ` Takuya Yoshikawa
  0 siblings, 1 reply; 40+ messages in thread
From: Xiao Guangrong @ 2012-03-19  9:34 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel

On 03/16/2012 05:44 PM, Takuya Yoshikawa wrote:

> On Fri, 16 Mar 2012 16:28:56 +0800
> Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:
> 
>> Thanks for your explanation; maybe you are right, I do not know much about
>> migration.
>>
>> What I am worried about is that you have changed the behaviour of GET_DIRTY_LOG:
>> currently it can get all the dirty pages when it is called; after your change,
>> GET_DIRTY_LOG can return an empty dirty bitmap even though dirty pages exist.
> 
> The current code also sees the same situation, because nothing prevents the
> guest from writing to pages before GET_DIRTY_LOG returns and userspace checks
> the bitmap.  Everything keeps running.
> 


The current code is under the protection of SRCU:
IIRC, it always holds SRCU when writing a guest page and setting the dirty bit,
which means the dirty page is logged either in the old dirty_bitmap or in the
current memslot->dirty_bitmap. Yes?
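
For reference, a condensed sketch of the SRCU-based path being described here,
reconstructed from memory of the pre-series x86 code rather than copied
verbatim (get_dirty_log_srcu() is an illustrative name): get_dirty_log()
publishes a memslots copy whose slot points at a fresh bitmap, then waits for
any SRCU reader that may still be marking the old one before reporting it.

/*
 * Writers call mark_page_dirty() inside an SRCU read-side section, so after
 * synchronize_srcu_expedited() nobody can still be setting bits in
 * old_bitmap and it is safe to hand it to userspace.
 */
static int get_dirty_log_srcu(struct kvm *kvm, struct kvm_dirty_log *log,
			      unsigned long *fresh_bitmap)
{
	struct kvm_memslots *slots, *old_slots;
	struct kvm_memory_slot *memslot = id_to_memslot(kvm->memslots, log->slot);
	unsigned long *old_bitmap = memslot->dirty_bitmap;
	unsigned long n = kvm_dirty_bitmap_bytes(memslot);

	slots = kmemdup(kvm->memslots, sizeof(*slots), GFP_KERNEL);
	if (!slots)
		return -ENOMEM;

	/* The published copy starts over with a clean bitmap. */
	id_to_memslot(slots, log->slot)->dirty_bitmap = fresh_bitmap;

	old_slots = kvm->memslots;
	rcu_assign_pointer(kvm->memslots, slots);
	synchronize_srcu_expedited(&kvm->srcu);
	kfree(old_slots);

	return copy_to_user(log->dirty_bitmap, old_bitmap, n) ? -EFAULT : 0;
}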


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-03-19  9:34               ` Xiao Guangrong
@ 2012-03-19 10:15                 ` Takuya Yoshikawa
  0 siblings, 0 replies; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-03-19 10:15 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi, mtosatti, kvm, linux-kernel

On Mon, 19 Mar 2012 17:34:49 +0800
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:

> The current code is under the protection of SRCU:
> IIRC, it always holds SRCU when writing a guest page and setting the dirty bit,
> which means the dirty page is logged either in the old dirty_bitmap or in the
> current memslot->dirty_bitmap. Yes?


Yes.

I just wanted to explain that getting a clear dirty bitmap from GET_DIRTY_LOG
does not mean there are no dirty pages: it just means that nothing had been
logged when we updated the bitmap in get_dirty_log().

We cannot know whether anything happened between the bitmap update and the
result check in userspace.

So even when we get a clear bitmap, we need to stop the VCPU threads and
then do GET_DIRTY_LOG once more for live migration.


The important thing is that every bit set by mark_page_dirty() will be found
by some future GET_DIRTY_LOG call, including the final one.

	Takuya

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/4] KVM: srcu-less dirty logging -v2
  2012-03-01 10:30 [PATCH 0/4] KVM: srcu-less dirty logging -v2 Takuya Yoshikawa
                   ` (4 preceding siblings ...)
  2012-03-03  5:12 ` [PATCH 0/4] KVM: srcu-less dirty logging -v2 Takuya Yoshikawa
@ 2012-03-20 14:43 ` Avi Kivity
  5 siblings, 0 replies; 40+ messages in thread
From: Avi Kivity @ 2012-03-20 14:43 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: mtosatti, kvm, linux-kernel

On 03/01/2012 12:30 PM, Takuya Yoshikawa wrote:
> This patch series is the result of the integration of my dirty logging
> optimization work, including preparation for the new GET_DIRTY_LOG API,
> and the attempt to get rid of controversial synchronize_srcu_expedited().
>
> 1 - KVM: MMU: Split the main body of rmap_write_protect() off from others
> 2 - KVM: Avoid checking huge page mappings in get_dirty_log()
> 3 - KVM: Switch to srcu-less get_dirty_log()
> 4 - KVM: Remove unused dirty_bitmap_head and nr_dirty_pages
>
> Although there are still some remaining tasks, the test result obtained
> looks very promising.
>

Thanks, applied.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-02-23 11:33 [PATCH 0/4] KVM: srcu-less dirty logging Takuya Yoshikawa
@ 2012-02-23 11:34 ` Takuya Yoshikawa
  0 siblings, 0 replies; 40+ messages in thread
From: Takuya Yoshikawa @ 2012-02-23 11:34 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, linux-kernel, peterz, paulmck

Huge page mappings are dropped when we enable dirty logging, and no new ones
are created until we stop the logging.

For this we introduce a new function which can be used to write protect
a range of PT level pages: although we do not need to care about a range
of pages at this point, the following patch will need this feature to
optimize the write protection of many pages.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/include/asm/kvm_host.h |    5 +++--
 arch/x86/kvm/mmu.c              |   38 ++++++++++++++++++++++++++++----------
 arch/x86/kvm/x86.c              |    8 +++-----
 3 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 74c9edf..bd0f78e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -712,8 +712,9 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 
 int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
-			       struct kvm_memory_slot *slot);
+void kvm_mmu_write_protect_pt_range(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    gfn_t start_offset, gfn_t end_offset);
 void kvm_mmu_zap_all(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 67857bd..c453ddd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1037,27 +1037,45 @@ static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, int level
 	return write_protected;
 }
 
-int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
-			       struct kvm_memory_slot *slot)
+/**
+ * kvm_mmu_write_protect_pt_range - write protect a range of PT level pages
+ * @kvm: kvm instance
+ * @slot: slot to protect
+ * @start_offset: offset of the first page to protect
+ * @end_offset: offset of the last page to protect
+ *
+ * Used when we do not need to care about huge page mappings: e.g. during dirty
+ * logging we do not have any such mappings.
+ */
+void kvm_mmu_write_protect_pt_range(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    gfn_t start_offset, gfn_t end_offset)
 {
+	gfn_t i;
 	unsigned long *rmapp;
-	int i, write_protected = 0;
 
-	for (i = PT_PAGE_TABLE_LEVEL;
-	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
-		rmapp = __gfn_to_rmap(gfn, i, slot);
-		write_protected |= __rmap_write_protect(kvm, rmapp, i);
+	for (i = start_offset; i <= end_offset; i++) {
+		rmapp = &slot->rmap[i];
+		__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
 	}
-
-	return write_protected;
 }
 
 static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
 	struct kvm_memory_slot *slot;
+	unsigned long *rmapp;
+	int i;
+	int write_protected = 0;
 
 	slot = gfn_to_memslot(kvm, gfn);
-	return kvm_mmu_rmap_write_protect(kvm, gfn, slot);
+
+	for (i = PT_PAGE_TABLE_LEVEL;
+	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+		rmapp = __gfn_to_rmap(gfn, i, slot);
+		write_protected |= __rmap_write_protect(kvm, rmapp, i);
+	}
+
+	return write_protected;
 }
 
 static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c9d99e5..3b3d1eb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3069,13 +3069,11 @@ static void write_protect_slot(struct kvm *kvm,
 
 	/* Not many dirty pages compared to # of shadow pages. */
 	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
-		unsigned long gfn_offset;
+		gfn_t offset;
 
-		for_each_set_bit(gfn_offset, dirty_bitmap, memslot->npages) {
-			unsigned long gfn = memslot->base_gfn + gfn_offset;
+		for_each_set_bit(offset, dirty_bitmap, memslot->npages)
+			kvm_mmu_write_protect_pt_range(kvm, memslot, offset, offset);
 
-			kvm_mmu_rmap_write_protect(kvm, gfn, memslot);
-		}
 		kvm_flush_remote_tlbs(kvm);
 	} else
 		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2012-03-20 14:43 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-01 10:30 [PATCH 0/4] KVM: srcu-less dirty logging -v2 Takuya Yoshikawa
2012-03-01 10:31 ` [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others Takuya Yoshikawa
2012-03-12  7:39   ` Takuya Yoshikawa
2012-03-12  7:52     ` Takuya Yoshikawa
2012-03-01 10:32 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
2012-03-02  2:56   ` Takuya Yoshikawa
2012-03-02  5:11     ` Takuya Yoshikawa
2012-03-12 18:04   ` Avi Kivity
2012-03-13  9:20     ` Takuya Yoshikawa
2012-03-13 10:12       ` Avi Kivity
2012-03-13 23:04   ` Marcelo Tosatti
2012-03-14  1:04     ` Takuya Yoshikawa
2012-03-14  5:34     ` Takuya Yoshikawa
2012-03-14 10:58       ` Marcelo Tosatti
2012-03-01 10:33 ` [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log() Takuya Yoshikawa
2012-03-03  5:21   ` [PATCH 3/4 changelog-v2] " Takuya Yoshikawa
2012-03-06 11:15     ` Marcelo Tosatti
2012-03-06 14:43       ` Takuya Yoshikawa
2012-03-06 15:01         ` Marcelo Tosatti
2012-03-06 15:23           ` Takuya Yoshikawa
2012-03-06 15:28             ` Marcelo Tosatti
2012-03-07  8:07               ` Takuya Yoshikawa
2012-03-07 23:25                 ` Marcelo Tosatti
2012-03-08  1:35                   ` Takuya Yoshikawa
2012-03-09  0:08                     ` Marcelo Tosatti
2012-03-12 12:05                       ` Avi Kivity
2012-03-07  8:18               ` Takuya Yoshikawa
2012-03-07 23:20                 ` Marcelo Tosatti
2012-03-16  5:03   ` [PATCH 3/4] " Xiao Guangrong
2012-03-16  6:55     ` Takuya Yoshikawa
2012-03-16  7:30       ` Xiao Guangrong
2012-03-16  7:55         ` Takuya Yoshikawa
2012-03-16  8:28           ` Xiao Guangrong
2012-03-16  9:44             ` Takuya Yoshikawa
2012-03-19  9:34               ` Xiao Guangrong
2012-03-19 10:15                 ` Takuya Yoshikawa
2012-03-01 10:34 ` [PATCH 4/4] KVM: Remove unused dirty_bitmap_head and nr_dirty_pages Takuya Yoshikawa
2012-03-03  5:12 ` [PATCH 0/4] KVM: srcu-less dirty logging -v2 Takuya Yoshikawa
2012-03-20 14:43 ` Avi Kivity
  -- strict thread matches above, loose matches on Subject: below --
2012-02-23 11:33 [PATCH 0/4] KVM: srcu-less dirty logging Takuya Yoshikawa
2012-02-23 11:34 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
