linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] KVM: srcu-less dirty logging
@ 2012-02-23 11:33 Takuya Yoshikawa
  2012-02-23 11:33 ` [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others Takuya Yoshikawa
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-23 11:33 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, linux-kernel, peterz, paulmck

This patch series is the result of integrating my dirty logging
optimization work, including preparation for the new GET_DIRTY_LOG API,
with an attempt to get rid of the controversial
synchronize_srcu_expedited().

1 - KVM: MMU: Split the main body of rmap_write_protect() off from others
2 - KVM: Avoid checking huge page mappings in get_dirty_log()
3 - KVM: Switch to srcu-less get_dirty_log()
4 - KVM: Remove unused dirty_bitmap_head and nr_dirty_pages

Although some tasks remain, the test results obtained look very
promising.


Remaining tasks:

- Implement set_bit_le() for mark_page_dirty()

  Some drivers are using their own implementation of it and a bit of
  work is needed to make it generic.  I want to do this separately
  later because it cannot be done within the kvm tree.

- Stop allocating extra dirty bitmap buffer area

  According to Peter, mmu_notifier has become preemptible.  If we can
  change mmu_lock from a spin_lock to a mutex_lock, as Avi said before,
  this would be straightforward because we could use __put_user() right
  after xchg() with the mmu_lock held; a rough sketch of this idea
  follows this list.
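
A rough sketch of the idea, reusing names from this series
(kvm_mmu_write_protect_pt_range() from patch 2 and the variables of the
new kvm_vm_ioctl_get_dirty_log() from patch 3).  This is an illustration
only, not tested code:

	/*
	 * Sketch only: assumes kvm->mmu_lock has been converted from a
	 * spinlock to a mutex so that __put_user(), which may fault, can
	 * be called while it is held.  Note the lock inversion concern
	 * raised later in this thread (a fault may trigger an mmu
	 * notifier).
	 */
	unsigned long __user *user_bitmap =
			(unsigned long __user *)log->dirty_bitmap;
	bool is_dirty = false;

	mutex_lock(&kvm->mmu_lock);
	for (i = 0; i < n / sizeof(long); i++) {
		unsigned long bits = 0;

		if (dirty_bitmap[i])
			bits = xchg(&dirty_bitmap[i], 0);

		if (bits) {
			is_dirty = true;
			kvm_mmu_write_protect_pt_range(kvm, memslot,
					i * BITS_PER_LONG + __ffs(bits),
					i * BITS_PER_LONG + __fls(bits));
		}

		/* copy the snapshot word directly, no second buffer needed */
		if (__put_user(bits, &user_bitmap[i])) {
			r = -EFAULT;
			break;
		}
	}
	if (is_dirty)
		kvm_flush_remote_tlbs(kvm);
	mutex_unlock(&kvm->mmu_lock);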


Test results:

1. dirty-log-perf unit test (on Sandy Bridge core-i3 32-bit host)

With some changes added since the previous post, the performance has
improved considerably: now, even when every page in the slot is dirty,
the numbers are reasonably close to the original ones.  For the other
cases, needless to say, we have achieved a very nice improvement.

- kvm.git next
average(ns)    stdev     ns/page    pages

 147018.6    77604.9    147018.6        1
 158080.2    82211.9     79040.1        2
 127555.6    80619.8     31888.9        4
 108865.6    78499.3     13608.2        8
 114707.8    43508.6      7169.2       16
  76679.0    37659.8      2396.2       32
  59159.8    20417.1       924.3       64
  60418.2    19405.7       472.0      128
  76267.0    21450.5       297.9      256
 113182.0    22684.9       221.0      512
 930344.2   153766.5       908.5       1K
 939098.2   163800.3       458.5       2K
 996813.4    77921.0       243.3       4K
1113232.6   107782.6       135.8       8K
1241206.4    82282.5        75.7      16K
1529526.4   116388.2        46.6      32K
2147538.4   227375.9        32.7      64K
3309619.4    79356.8        25.2     128K
6016951.8   549873.4        22.9     256K

- kvm.git next + srcu-less series
average(ns)    stdev     ns/page    pages    improvement(%)

  14086.0     3532.3     14086.0        1     944
  13303.6     3317.7      6651.8        2    1088
  13455.6     3315.2      3363.9        4     848
  14125.8     3435.4      1765.7        8     671
  15322.4     3690.1       957.6       16     649
  17026.6     4037.2       532.0       32     350
  21258.6     4852.3       332.1       64     178
  33845.6    14115.8       264.4      128      79
  37893.0      681.8       148.0      256     101
  61707.4     1057.6       120.5      512      83
  88861.4     2131.0        86.7       1K     947
 151315.6     6490.5        73.8       2K     521
 290579.6     8523.0        70.9       4K     243
 518231.0    20412.6        63.2       8K     115
2271171.4    12064.9       138.6      16K     -45
3375866.2    14743.3       103.0      32K     -55
4408395.6    10720.0        67.2      64K     -51
5915336.2    26538.1        45.1     128K     -44
8497356.4    16441.0        32.4     256K     -29

Note that when the number of dirty pages was large, we spent less than
100ns to retrieve the information for one dirty page: see the ns/page
column.

As Avi noted before, this is much faster than the time userspace needs
to send one page to the destination node.
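
For a rough sense of scale, assuming for example a 1 Gbps migration link
and ignoring protocol overhead, sending a single 4KB page takes about
4096 * 8 / 10^9 sec = 33us, which is more than two orders of magnitude
longer than the less-than-100ns/page spent here retrieving the log.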

Furthermore, with the already proposed new GET_DIRTY_LOG API, we will
be able to restrict the area from which we get the log and will not need
to care about the millisecond-order latency observed for very large
numbers of dirty pages.

2. real workloads (on Xeon W3520 64-bit host)

I traced kvm_vm_ioctl_get_dirty_log() during heavy VGA updates and
during live migration.

2.1. VGA: guest was doing "x11perf -rect1 -rect10 -rect100 -rect500"

As can be guessed from the dirty-log-perf results, we observed a very
nice improvement.

- kvm.git next
For heavy updates: 100us to 300us.
Worst: 300us

- kvm.git next + srcu-less series
For heavy updates: 3us to 10us.
Worst: 50us.

2.2. live migration: guest was doing "dd if=/path/to/a/file of=/dev/null"

The improvement was significant again.

- kvm.git next
For heavy updates: 1ms to 3ms

- kvm.git next + srcu-less series
For heavy updates: 50us to 300us

Probably we gained a lot from the locality of the writable working set
(WWS).


	Takuya

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others
  2012-02-23 11:33 [PATCH 0/4] KVM: srcu-less dirty logging Takuya Yoshikawa
@ 2012-02-23 11:33 ` Takuya Yoshikawa
  2012-02-23 11:34 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-23 11:33 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, linux-kernel, peterz, paulmck

We will use this in the following patch to implement another function
which needs to write protect pages using the rmap information.

Note that there is a small change in debug printing for large pages:
we do not differentiate them from others to avoid duplicating code.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/kvm/mmu.c |   53 ++++++++++++++++++++++++++-------------------------
 1 files changed, 27 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ff053ca..67857bd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1010,42 +1010,43 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 		rmap_remove(kvm, sptep);
 }
 
-int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
-			       struct kvm_memory_slot *slot)
+static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, int level)
 {
-	unsigned long *rmapp;
-	u64 *spte;
-	int i, write_protected = 0;
+	u64 *spte = NULL;
+	int write_protected = 0;
 
-	rmapp = __gfn_to_rmap(gfn, PT_PAGE_TABLE_LEVEL, slot);
-	spte = rmap_next(rmapp, NULL);
-	while (spte) {
+	while ((spte = rmap_next(rmapp, spte))) {
 		BUG_ON(!(*spte & PT_PRESENT_MASK));
 		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
-		if (is_writable_pte(*spte)) {
+
+		if (!is_writable_pte(*spte))
+			continue;
+
+		if (level == PT_PAGE_TABLE_LEVEL) {
 			mmu_spte_update(spte, *spte & ~PT_WRITABLE_MASK);
-			write_protected = 1;
+		} else {
+			BUG_ON(!is_large_pte(*spte));
+			drop_spte(kvm, spte);
+			--kvm->stat.lpages;
+			spte = NULL;
 		}
-		spte = rmap_next(rmapp, spte);
+
+		write_protected = 1;
 	}
 
-	/* check for huge page mappings */
-	for (i = PT_DIRECTORY_LEVEL;
+	return write_protected;
+}
+
+int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
+			       struct kvm_memory_slot *slot)
+{
+	unsigned long *rmapp;
+	int i, write_protected = 0;
+
+	for (i = PT_PAGE_TABLE_LEVEL;
 	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
 		rmapp = __gfn_to_rmap(gfn, i, slot);
-		spte = rmap_next(rmapp, NULL);
-		while (spte) {
-			BUG_ON(!(*spte & PT_PRESENT_MASK));
-			BUG_ON(!is_large_pte(*spte));
-			pgprintk("rmap_write_protect(large): spte %p %llx %lld\n", spte, *spte, gfn);
-			if (is_writable_pte(*spte)) {
-				drop_spte(kvm, spte);
-				--kvm->stat.lpages;
-				spte = NULL;
-				write_protected = 1;
-			}
-			spte = rmap_next(rmapp, spte);
-		}
+		write_protected |= __rmap_write_protect(kvm, rmapp, i);
 	}
 
 	return write_protected;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-02-23 11:33 [PATCH 0/4] KVM: srcu-less dirty logging Takuya Yoshikawa
  2012-02-23 11:33 ` [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others Takuya Yoshikawa
@ 2012-02-23 11:34 ` Takuya Yoshikawa
  2012-02-23 11:35 ` [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log() Takuya Yoshikawa
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-23 11:34 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, linux-kernel, peterz, paulmck

Such mappings are dropped when we enable dirty logging and we will
never create new ones until we stop the logging.

For this we introduce a new function which can be used to write protect
a range of PT level pages: although we do not need to care about a range
of pages at this point, the following patch will need this feature to
optimize the write protection of many pages.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/include/asm/kvm_host.h |    5 +++--
 arch/x86/kvm/mmu.c              |   38 ++++++++++++++++++++++++++++----------
 arch/x86/kvm/x86.c              |    8 +++-----
 3 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 74c9edf..bd0f78e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -712,8 +712,9 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 
 int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
-			       struct kvm_memory_slot *slot);
+void kvm_mmu_write_protect_pt_range(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    gfn_t start_offset, gfn_t end_offset);
 void kvm_mmu_zap_all(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 67857bd..c453ddd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1037,27 +1037,45 @@ static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, int level
 	return write_protected;
 }
 
-int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
-			       struct kvm_memory_slot *slot)
+/**
+ * kvm_mmu_write_protect_pt_range - write protect a range of PT level pages
+ * @kvm: kvm instance
+ * @slot: slot to protect
+ * @start_offset: offset of the first page to protect
+ * @end_offset: offset of the last page to protect
+ *
+ * Used when we do not need to care about huge page mappings: e.g. during dirty
+ * logging we do not have any such mappings.
+ */
+void kvm_mmu_write_protect_pt_range(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    gfn_t start_offset, gfn_t end_offset)
 {
+	gfn_t i;
 	unsigned long *rmapp;
-	int i, write_protected = 0;
 
-	for (i = PT_PAGE_TABLE_LEVEL;
-	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
-		rmapp = __gfn_to_rmap(gfn, i, slot);
-		write_protected |= __rmap_write_protect(kvm, rmapp, i);
+	for (i = start_offset; i <= end_offset; i++) {
+		rmapp = &slot->rmap[i];
+		__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
 	}
-
-	return write_protected;
 }
 
 static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
 	struct kvm_memory_slot *slot;
+	unsigned long *rmapp;
+	int i;
+	int write_protected = 0;
 
 	slot = gfn_to_memslot(kvm, gfn);
-	return kvm_mmu_rmap_write_protect(kvm, gfn, slot);
+
+	for (i = PT_PAGE_TABLE_LEVEL;
+	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+		rmapp = __gfn_to_rmap(gfn, i, slot);
+		write_protected |= __rmap_write_protect(kvm, rmapp, i);
+	}
+
+	return write_protected;
 }
 
 static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c9d99e5..3b3d1eb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3069,13 +3069,11 @@ static void write_protect_slot(struct kvm *kvm,
 
 	/* Not many dirty pages compared to # of shadow pages. */
 	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
-		unsigned long gfn_offset;
+		gfn_t offset;
 
-		for_each_set_bit(gfn_offset, dirty_bitmap, memslot->npages) {
-			unsigned long gfn = memslot->base_gfn + gfn_offset;
+		for_each_set_bit(offset, dirty_bitmap, memslot->npages)
+			kvm_mmu_write_protect_pt_range(kvm, memslot, offset, offset);
 
-			kvm_mmu_rmap_write_protect(kvm, gfn, memslot);
-		}
 		kvm_flush_remote_tlbs(kvm);
 	} else
 		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-02-23 11:33 [PATCH 0/4] KVM: srcu-less dirty logging Takuya Yoshikawa
  2012-02-23 11:33 ` [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others Takuya Yoshikawa
  2012-02-23 11:34 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
@ 2012-02-23 11:35 ` Takuya Yoshikawa
  2012-02-28 11:46   ` Avi Kivity
  2012-02-23 11:36 ` [PATCH 4/4] KVM: Remove unused dirty_bitmap_head and nr_dirty_pages Takuya Yoshikawa
  2012-02-23 13:25 ` [PATCH 0/4] KVM: srcu-less dirty logging Peter Zijlstra
  4 siblings, 1 reply; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-23 11:35 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, linux-kernel, peterz, paulmck

We have seen some problems with the current implementation of
get_dirty_log(), which uses synchronize_srcu_expedited() for updating
dirty bitmaps; e.g. it sometimes gives us millisecond-order latency
when we use VGA displays.

Furthermore the recent discussion on the following thread
    "srcu: Implement call_srcu()"
    http://lkml.org/lkml/2012/1/31/211
also motivated us to implement get_dirty_log() without SRCU.

This patch achieves that goal without sacrificing the performance of
either VGA or live migration: in practice the new code is much faster
than the old one unless we have too many dirty pages.

Implementation:

The key part of the implementation is the use of xchg() operation for
clearing dirty bits atomically.  Since this allows us to update only
BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
until every dirty bit is cleared again for the next call.

Although some people may worry about applying the atomic memory
instruction many times to the concurrently accessible bitmap, the bitmap
is usually accessed with mmu_lock held and we rarely see concurrent
accesses: so what we need to care about is the pure xchg() overhead.

Another point to note is that we do not use for_each_set_bit() to check
which of each word's BITS_PER_LONG pages are actually dirty.  Instead we
simply use __ffs() and __fls() and pass the range between the two
positions found by them to kvm_mmu_write_protect_pt_range().

Even though the passed range may include clean pages, it is much faster
than repeatedly calling find_next_bit() due to the locality of dirty
pages.
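
For example (an arbitrary bit pattern, just to illustrate): if one
bitmap word reads 0x00f0000000000f00, __ffs() returns 8 and __fls()
returns 55, so kvm_mmu_write_protect_pt_range() is called once for
offsets 8 to 55 of that word, even though offsets 12 to 51 are clean,
instead of calling find_next_bit() for each of the eight set bits.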

Performance:

The dirty-log-perf unit test showed a nice improvement, in some cases
about ten times faster than before, when the number of dirty pages was
below 8K.  For the other cases we saw a bit of regression, but it is
still fast enough compared to the processing of these dirty pages in
userspace.

For real workloads, both VGA and live migration, we have observed pure
improvement: when the guest was reading a file, we originally saw a few
ms of latency, but with the new method the latency was 50us to 300us.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/kvm/x86.c |  117 +++++++++++++++++++---------------------------------
 1 files changed, 43 insertions(+), 74 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3b3d1eb..be4c52b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3041,55 +3041,32 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
 }
 
 /**
- * write_protect_slot - write protect a slot for dirty logging
- * @kvm: the kvm instance
- * @memslot: the slot we protect
- * @dirty_bitmap: the bitmap indicating which pages are dirty
- * @nr_dirty_pages: the number of dirty pages
+ * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
+ * @kvm: kvm instance
+ * @log: slot id and address to which we copy the log
  *
- * We have two ways to find all sptes to protect:
- * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
- *    checks ones that have a spte mapping a page in the slot.
- * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
+ * We need to keep it in mind that VCPU threads can write to the bitmap
+ * concurrently.  So, to avoid losing data, we keep the following order for
+ * each bit:
  *
- * Generally speaking, if there are not so many dirty pages compared to the
- * number of shadow pages, we should use the latter.
+ *   1. Take a snapshot of the bit and clear it if needed.
+ *   2. Write protect the corresponding page.
+ *   3. Flush TLB's if needed.
+ *   4. Copy the snapshot to the userspace.
  *
- * Note that letting others write into a page marked dirty in the old bitmap
- * by using the remaining tlb entry is not a problem.  That page will become
- * write protected again when we flush the tlb and then be reported dirty to
- * the user space by copying the old bitmap.
+ * Between 2 and 3, the guest may write to the page using the remaining TLB
+ * entry.  This is not a problem because the page will be reported dirty at
+ * step 4 using the snapshot taken before and step 3 ensures that successive
+ * writes will be logged for the next call.
  */
-static void write_protect_slot(struct kvm *kvm,
-			       struct kvm_memory_slot *memslot,
-			       unsigned long *dirty_bitmap,
-			       unsigned long nr_dirty_pages)
-{
-	spin_lock(&kvm->mmu_lock);
-
-	/* Not many dirty pages compared to # of shadow pages. */
-	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
-		gfn_t offset;
-
-		for_each_set_bit(offset, dirty_bitmap, memslot->npages)
-			kvm_mmu_write_protect_pt_range(kvm, memslot, offset, offset);
-
-		kvm_flush_remote_tlbs(kvm);
-	} else
-		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
-
-	spin_unlock(&kvm->mmu_lock);
-}
-
-/*
- * Get (and clear) the dirty memory log for a memory slot.
- */
-int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
-				      struct kvm_dirty_log *log)
+int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 {
 	int r;
 	struct kvm_memory_slot *memslot;
-	unsigned long n, nr_dirty_pages;
+	unsigned long n, i;
+	unsigned long *dirty_bitmap;
+	unsigned long *dirty_bitmap_buffer;
+	bool is_dirty = false;
 
 	mutex_lock(&kvm->slots_lock);
 
@@ -3098,49 +3075,41 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 		goto out;
 
 	memslot = id_to_memslot(kvm->memslots, log->slot);
+	dirty_bitmap = memslot->dirty_bitmap;
 	r = -ENOENT;
-	if (!memslot->dirty_bitmap)
+	if (!dirty_bitmap)
 		goto out;
 
 	n = kvm_dirty_bitmap_bytes(memslot);
-	nr_dirty_pages = memslot->nr_dirty_pages;
+	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
+	memset(dirty_bitmap_buffer, 0, n);
 
-	/* If nothing is dirty, don't bother messing with page tables. */
-	if (nr_dirty_pages) {
-		struct kvm_memslots *slots, *old_slots;
-		unsigned long *dirty_bitmap, *dirty_bitmap_head;
+	spin_lock(&kvm->mmu_lock);
 
-		dirty_bitmap = memslot->dirty_bitmap;
-		dirty_bitmap_head = memslot->dirty_bitmap_head;
-		if (dirty_bitmap == dirty_bitmap_head)
-			dirty_bitmap_head += n / sizeof(long);
-		memset(dirty_bitmap_head, 0, n);
+	for (i = 0; i < n / sizeof(long); i++) {
+		unsigned long bits;
+		gfn_t start, end;
 
-		r = -ENOMEM;
-		slots = kmemdup(kvm->memslots, sizeof(*kvm->memslots), GFP_KERNEL);
-		if (!slots)
-			goto out;
-
-		memslot = id_to_memslot(slots, log->slot);
-		memslot->nr_dirty_pages = 0;
-		memslot->dirty_bitmap = dirty_bitmap_head;
-		update_memslots(slots, NULL);
+		if (!dirty_bitmap[i])
+			continue;
 
-		old_slots = kvm->memslots;
-		rcu_assign_pointer(kvm->memslots, slots);
-		synchronize_srcu_expedited(&kvm->srcu);
-		kfree(old_slots);
+		is_dirty = true;
+		bits = xchg(&dirty_bitmap[i], 0);
+		dirty_bitmap_buffer[i] = bits;
 
-		write_protect_slot(kvm, memslot, dirty_bitmap, nr_dirty_pages);
+		start = i * BITS_PER_LONG + __ffs(bits);
+		end   = i * BITS_PER_LONG + __fls(bits);
 
-		r = -EFAULT;
-		if (copy_to_user(log->dirty_bitmap, dirty_bitmap, n))
-			goto out;
-	} else {
-		r = -EFAULT;
-		if (clear_user(log->dirty_bitmap, n))
-			goto out;
+		kvm_mmu_write_protect_pt_range(kvm, memslot, start, end);
 	}
+	if (is_dirty)
+		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+
+	r = -EFAULT;
+	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
+		goto out;
 
 	r = 0;
 out:
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/4] KVM: Remove unused dirty_bitmap_head and nr_dirty_pages
  2012-02-23 11:33 [PATCH 0/4] KVM: srcu-less dirty logging Takuya Yoshikawa
                   ` (2 preceding siblings ...)
  2012-02-23 11:35 ` [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log() Takuya Yoshikawa
@ 2012-02-23 11:36 ` Takuya Yoshikawa
  2012-02-23 13:25 ` [PATCH 0/4] KVM: srcu-less dirty logging Peter Zijlstra
  4 siblings, 0 replies; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-23 11:36 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, linux-kernel, peterz, paulmck

Now that we do neither double buffering nor heuristic selection of the
write protection method, these are not needed anymore.

Note: some drivers have their own implementation of set_bit_le() and
making it generic needs a bit of work; so we use test_and_set_bit_le()
for now and will replace it with a generic set_bit_le() later.
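
For illustration, a generic set_bit_le() could follow the same pattern
as test_and_set_bit_le() in include/asm-generic/bitops/le.h; the
following is only a sketch of that idea, not part of this series:

	/*
	 * Sketch: BITOP_LE_SWIZZLE is 0 on little-endian and remaps the
	 * bit number to the little-endian byte layout on big-endian
	 * machines, as the existing *_bit_le() helpers do.
	 */
	static inline void set_bit_le(int nr, void *addr)
	{
		set_bit(nr ^ BITOP_LE_SWIZZLE, addr);
	}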

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 include/linux/kvm_host.h |    2 --
 virt/kvm/kvm_main.c      |   14 +++++---------
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 355e445..73c7d76 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -177,8 +177,6 @@ struct kvm_memory_slot {
 	unsigned long flags;
 	unsigned long *rmap;
 	unsigned long *dirty_bitmap;
-	unsigned long *dirty_bitmap_head;
-	unsigned long nr_dirty_pages;
 	struct kvm_arch_memory_slot arch;
 	unsigned long userspace_addr;
 	int user_alloc;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e4431ad..27a1083 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -522,12 +522,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 		return;
 
 	if (2 * kvm_dirty_bitmap_bytes(memslot) > PAGE_SIZE)
-		vfree(memslot->dirty_bitmap_head);
+		vfree(memslot->dirty_bitmap);
 	else
-		kfree(memslot->dirty_bitmap_head);
+		kfree(memslot->dirty_bitmap);
 
 	memslot->dirty_bitmap = NULL;
-	memslot->dirty_bitmap_head = NULL;
 }
 
 /*
@@ -611,8 +610,7 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
 
 /*
  * Allocation size is twice as large as the actual dirty bitmap size.
- * This makes it possible to do double buffering: see x86's
- * kvm_vm_ioctl_get_dirty_log().
+ * See x86's kvm_vm_ioctl_get_dirty_log() why this is needed.
  */
 static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
 {
@@ -627,8 +625,6 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
 	if (!memslot->dirty_bitmap)
 		return -ENOMEM;
 
-	memslot->dirty_bitmap_head = memslot->dirty_bitmap;
-	memslot->nr_dirty_pages = 0;
 #endif /* !CONFIG_S390 */
 	return 0;
 }
@@ -1476,8 +1472,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 
-		if (!test_and_set_bit_le(rel_gfn, memslot->dirty_bitmap))
-			memslot->nr_dirty_pages++;
+		/* TODO: introduce set_bit_le() and use it */
+		test_and_set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/4] KVM: srcu-less dirty logging
  2012-02-23 11:33 [PATCH 0/4] KVM: srcu-less dirty logging Takuya Yoshikawa
                   ` (3 preceding siblings ...)
  2012-02-23 11:36 ` [PATCH 4/4] KVM: Remove unused dirty_bitmap_head and nr_dirty_pages Takuya Yoshikawa
@ 2012-02-23 13:25 ` Peter Zijlstra
  2012-02-28 10:03   ` Avi Kivity
  4 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2012-02-23 13:25 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel, paulmck

On Thu, 2012-02-23 at 20:33 +0900, Takuya Yoshikawa wrote:
> - Stop allocating extra dirty bitmap buffer area
> 
>   According to Peter, mmu_notifier has become preemptible.  If we can
>   change mmu_lock from spin_lock to mutex_lock, as Avi said before, this
>   would be staightforward because we can use __put_user() right after
>   xchg() with the mmu_lock held 

So the 'only' thing to consider is running the end result with lockdep
enabled: since the mmu locks are rather deep in the nesting tree, it's
very easy to accidentally cause inversions.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/4] KVM: srcu-less dirty logging
  2012-02-23 13:25 ` [PATCH 0/4] KVM: srcu-less dirty logging Peter Zijlstra
@ 2012-02-28 10:03   ` Avi Kivity
  2012-02-29  4:30     ` Takuya Yoshikawa
  0 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2012-02-28 10:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Takuya Yoshikawa, mtosatti, kvm, linux-kernel, paulmck

On 02/23/2012 03:25 PM, Peter Zijlstra wrote:
> On Thu, 2012-02-23 at 20:33 +0900, Takuya Yoshikawa wrote:
> > - Stop allocating extra dirty bitmap buffer area
> > 
> >   According to Peter, mmu_notifier has become preemptible.  If we can
> >   change mmu_lock from spin_lock to mutex_lock, as Avi said before, this
> >   would be staightforward because we can use __put_user() right after
> >   xchg() with the mmu_lock held 
>
> So the 'only' thing to consider is running the end result with lockdep
> enabled since the mmu locks are rather deep in the nesting tree its very
> easy to accidentally cause inversions.

There will be an inversion for sure, if __put_user() faults and triggers
an mmu notifier (perhaps directly, perhaps through an allocation that
triggers a swap).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-02-23 11:35 ` [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log() Takuya Yoshikawa
@ 2012-02-28 11:46   ` Avi Kivity
  2012-02-29  7:49     ` Takuya Yoshikawa
  0 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2012-02-28 11:46 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: mtosatti, kvm, linux-kernel, peterz, paulmck

On 02/23/2012 01:35 PM, Takuya Yoshikawa wrote:
> We have seen some problems of the current implementation of
> get_dirty_log() which uses synchronize_srcu_expedited() for updating
> dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms
> order of latency when we use VGA displays.
>
> Furthermore the recent discussion on the following thread
>     "srcu: Implement call_srcu()"
>     http://lkml.org/lkml/2012/1/31/211
> also motivated us to implement get_dirty_log() without SRCU.
>
> This patch achieves this goal without sacrificing the performance of
> both VGA and live migration: in practice the new code is much faster
> than the old one unless we have too many dirty pages.
>
> Implementation:
>
> The key part of the implementation is the use of xchg() operation for
> clearing dirty bits atomically.  Since this allows us to update only
> BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
> until every dirty bit is cleared again for the next call.

What about using cmpxchg16b?  That should reduce locked ops by a factor
of 2 (but note it needs 16-byte alignment).

>
> Although some people may worry about the problem of using the atomic
> memory instruction many times to the concurrently accessible bitmap,
> it is usually accessed with mmu_lock held and we rarely see concurrent
> accesses: so what we need to care about is the pure xchg() overheads.
>
> Another point to note is that we do not use for_each_set_bit() to check
> which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
> simply use __ffs() and __fls() and pass the range in between the two
> positions found by them to kvm_mmu_write_protect_pt_range().

This seems artificial.

> Even though the passed range may include clean pages, it is much faster
> than repeatedly call find_next_bit() due to the locality of dirty pages.

Perhaps this is due to the implementation of find_next_bit()?  Would
using bsf improve things?

> Performance:
>
> The dirty-log-perf unit test showed nice improvement, some times faster
> than before, when the number of dirty pages was below 8K.  For other
> cases we saw a bit of regression but still enough fast compared to the
> processing of these dirty pages in the userspace.
>
> For real workloads, both VGA and live migration, we have observed pure
> improvement: when the guest was reading a file, we originally saw a few
> ms of latency, but with the new method the latency was 50us to 300us.
>
>  
>  /**
> - * write_protect_slot - write protect a slot for dirty logging
> - * @kvm: the kvm instance
> - * @memslot: the slot we protect
> - * @dirty_bitmap: the bitmap indicating which pages are dirty
> - * @nr_dirty_pages: the number of dirty pages
> + * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
> + * @kvm: kvm instance
> + * @log: slot id and address to which we copy the log
>   *
> - * We have two ways to find all sptes to protect:
> - * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
> - *    checks ones that have a spte mapping a page in the slot.
> - * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
> + * We need to keep it in mind that VCPU threads can write to the bitmap
> + * concurrently.  So, to avoid losing data, we keep the following order for
> + * each bit:
>   *
> - * Generally speaking, if there are not so many dirty pages compared to the
> - * number of shadow pages, we should use the latter.
> + *   1. Take a snapshot of the bit and clear it if needed.
> + *   2. Write protect the corresponding page.
> + *   3. Flush TLB's if needed.
> + *   4. Copy the snapshot to the userspace.
>   *
> - * Note that letting others write into a page marked dirty in the old bitmap
> - * by using the remaining tlb entry is not a problem.  That page will become
> - * write protected again when we flush the tlb and then be reported dirty to
> - * the user space by copying the old bitmap.
> + * Between 2 and 3, the guest may write to the page using the remaining TLB
> + * entry.  This is not a problem because the page will be reported dirty at
> + * step 4 using the snapshot taken before and step 3 ensures that successive
> + * writes will be logged for the next call.
>   */
> -static void write_protect_slot(struct kvm *kvm,
> -			       struct kvm_memory_slot *memslot,
> -			       unsigned long *dirty_bitmap,
> -			       unsigned long nr_dirty_pages)
> -{
> -	spin_lock(&kvm->mmu_lock);
> -
> -	/* Not many dirty pages compared to # of shadow pages. */
> -	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
> -		gfn_t offset;
> -
> -		for_each_set_bit(offset, dirty_bitmap, memslot->npages)
> -			kvm_mmu_write_protect_pt_range(kvm, memslot, offset, offset);
> -
> -		kvm_flush_remote_tlbs(kvm);
> -	} else
> -		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
> -
> -	spin_unlock(&kvm->mmu_lock);
> -}
> -
> -/*
> - * Get (and clear) the dirty memory log for a memory slot.
> - */
> -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> -				      struct kvm_dirty_log *log)
> +int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>  {
>  	int r;
>  	struct kvm_memory_slot *memslot;
> -	unsigned long n, nr_dirty_pages;
> +	unsigned long n, i;
> +	unsigned long *dirty_bitmap;
> +	unsigned long *dirty_bitmap_buffer;
> +	bool is_dirty = false;
>  
>  	mutex_lock(&kvm->slots_lock);
>  
> @@ -3098,49 +3075,41 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
>  		goto out;
>  
>  	memslot = id_to_memslot(kvm->memslots, log->slot);
> +	dirty_bitmap = memslot->dirty_bitmap;
>  	r = -ENOENT;
> -	if (!memslot->dirty_bitmap)
> +	if (!dirty_bitmap)
>  		goto out;
>  
>  	n = kvm_dirty_bitmap_bytes(memslot);
> -	nr_dirty_pages = memslot->nr_dirty_pages;
> +	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
> +	memset(dirty_bitmap_buffer, 0, n);
>  
> -	/* If nothing is dirty, don't bother messing with page tables. */
> -	if (nr_dirty_pages) {
> -		struct kvm_memslots *slots, *old_slots;
> -		unsigned long *dirty_bitmap, *dirty_bitmap_head;
> +	spin_lock(&kvm->mmu_lock);
>  
> -		dirty_bitmap = memslot->dirty_bitmap;
> -		dirty_bitmap_head = memslot->dirty_bitmap_head;
> -		if (dirty_bitmap == dirty_bitmap_head)
> -			dirty_bitmap_head += n / sizeof(long);
> -		memset(dirty_bitmap_head, 0, n);
> +	for (i = 0; i < n / sizeof(long); i++) {
> +		unsigned long bits;
> +		gfn_t start, end;
>  
> -		r = -ENOMEM;
> -		slots = kmemdup(kvm->memslots, sizeof(*kvm->memslots), GFP_KERNEL);
> -		if (!slots)
> -			goto out;
> -
> -		memslot = id_to_memslot(slots, log->slot);
> -		memslot->nr_dirty_pages = 0;
> -		memslot->dirty_bitmap = dirty_bitmap_head;
> -		update_memslots(slots, NULL);
> +		if (!dirty_bitmap[i])
> +			continue;
>  
> -		old_slots = kvm->memslots;
> -		rcu_assign_pointer(kvm->memslots, slots);
> -		synchronize_srcu_expedited(&kvm->srcu);
> -		kfree(old_slots);
> +		is_dirty = true;
> +		bits = xchg(&dirty_bitmap[i], 0);
> +		dirty_bitmap_buffer[i] = bits;
>  
> -		write_protect_slot(kvm, memslot, dirty_bitmap, nr_dirty_pages);
> +		start = i * BITS_PER_LONG + __ffs(bits);
> +		end   = i * BITS_PER_LONG + __fls(bits);
>  
> -		r = -EFAULT;
> -		if (copy_to_user(log->dirty_bitmap, dirty_bitmap, n))
> -			goto out;
> -	} else {
> -		r = -EFAULT;
> -		if (clear_user(log->dirty_bitmap, n))
> -			goto out;
> +		kvm_mmu_write_protect_pt_range(kvm, memslot, start, end);

If indeed the problem is find_next_bit(), then we could have
kvm_mmu_write_protect_slot_masked() which would just take the bitmap as
a parameter.  This would allow covering just this function with the
spinlock, not the xchg loop.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/4] KVM: srcu-less dirty logging
  2012-02-28 10:03   ` Avi Kivity
@ 2012-02-29  4:30     ` Takuya Yoshikawa
  0 siblings, 0 replies; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-29  4:30 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Peter Zijlstra, mtosatti, kvm, linux-kernel, paulmck

Avi Kivity <avi@redhat.com> wrote:

> There will be an inversion for sure, if __put_user() faults and triggers
> an mmu notifier (perhaps directly, perhaps through an allocation that
> triggers a swap).

Ah, I did not notice that possibility.

Thanks,
	Takuya

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-02-28 11:46   ` Avi Kivity
@ 2012-02-29  7:49     ` Takuya Yoshikawa
  2012-02-29  9:25       ` Avi Kivity
  0 siblings, 1 reply; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-29  7:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: mtosatti, kvm, linux-kernel, peterz, paulmck

Avi Kivity <avi@redhat.com> wrote:

> > The key part of the implementation is the use of xchg() operation for
> > clearing dirty bits atomically.  Since this allows us to update only
> > BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
> > until every dirty bit is cleared again for the next call.
> 
> What about using cmpxchg16b?  That should reduce locked ops by a factor
> of 2 (but note it needs 16 bytes alignment).

I tried cmpxchg16b first: the implementation could not be naturally
combined with the for loop over the unsigned long array.

	Extra "if not zero" checks, alignment checks and ... it was ugly
	and I guessed it would be slow.

Taking into account that cmpxchg16b needs more cycles than other
instructions, I think this should be tried carefully with more
measurement later.

How about concentrating on xchg for now?

The implementation is simple and gives us enough improvement for now.
At least, I want to see whether xchg-based implementation works well
for one release.

	GET_DIRTY_LOG can be easily tuned to one particular case and
	it is really hard to check whether the implementation works well
	for every important case.  I really want feedback from users
	before adding non-obvious optimization.

In addition, we should care about the new API.  It has not been decided
what kind of range can be specified.  I think restricting the range to
be long-size aligned is natural.  Do you have any plan?

> > Another point to note is that we do not use for_each_set_bit() to check
> > which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
> > simply use __ffs() and __fls() and pass the range in between the two
> > positions found by them to kvm_mmu_write_protect_pt_range().
> 
> This seems artificial.

OK, then I want to pass the bits (unsigned long) as a mask.

Non-NPT machines may gain some.

> > Even though the passed range may include clean pages, it is much faster
> > than repeatedly call find_next_bit() due to the locality of dirty pages.
> 
> Perhaps this is due to the implementation of find_next_bit()?  would
> using bsf improve things?

I need to explain what I did in the past.

Before srcu-less work, I had already noticed the slowness of
for_each_set_bit() and replaced it with simple for loop like now: the
improvement was significant.

Yes, find_next_bit() is for generic use and not at all good when there
are many consecutive bits set: it cannot assume anything so it needs to
check a lot of cases - we have a long-size-aligned bitmap and "bits" is
already known to be non-zero after the first check of the for loop.

Of course, doing 64 function calls alone should be avoided in our case.
I also do not want to call kvm_mmu_* for each bit.

So, above, I proposed just passing "bits" to kvm_mmu_*: we can check
each bit i in a register before using rmap[i] if needed.

__ffs is really fast compared to other APIs.
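
A minimal sketch of that idea (the helper name and signature are
hypothetical, only meant to illustrate the proposal;
__rmap_write_protect() is the one introduced in patch 1):

	/* Hypothetical helper: write protect only the PT-level pages whose
	 * bits are set in mask; gfn_offset is the slot offset of bit 0. */
	static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
						    struct kvm_memory_slot *slot,
						    gfn_t gfn_offset,
						    unsigned long mask)
	{
		while (mask) {
			unsigned long *rmapp =
				&slot->rmap[gfn_offset + __ffs(mask)];

			__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
			mask &= mask - 1;	/* clear the lowest set bit */
		}
	}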

One note is that we will lose in cases like bits = 0xffffffff..

2271171.4    12064.9       138.6      16K     -45
3375866.2    14743.3       103.0      32K     -55
4408395.6    10720.0        67.2      64K     -51
5915336.2    26538.1        45.1     128K     -44
8497356.4    16441.0        32.4     256K     -29

So the last one will become worse.  For the other 4 patterns I am not
sure.

I thought that we should tune for the last case to gain a lot from the
locality of WWS.  What do you think about this point?

> > -	} else {
> > -		r = -EFAULT;
> > -		if (clear_user(log->dirty_bitmap, n))
> > -			goto out;
> > +		kvm_mmu_write_protect_pt_range(kvm, memslot, start, end);
> 
> If indeed the problem is find_next_bit(), then we could hanve
> kvm_mmu_write_protect_slot_masked() which would just take the bitmap as
> a parameter.  This would allow covering just this function with the
> spinlock, not the xchg loop.

We may see partial display updates if we do not hold the mmu_lock
during the xchg loop: it is possible that only pages near the end of the
framebuffer get updated sometimes - I noticed this problem when I fixed
the TLB flush issue.

Not a big problem, but still a maybe-noticeable change, so I think we
should do it separately with some comments if needed.

In addition, we do not want to scan the dirty bitmap twice.  Using the
bits value soon after it is read into a register seems to be the fastest.


BTW, I also want to decide the design of the new API at this chance.

	Takuya

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-02-29  7:49     ` Takuya Yoshikawa
@ 2012-02-29  9:25       ` Avi Kivity
  2012-02-29 10:16         ` Takuya Yoshikawa
  0 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2012-02-29  9:25 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: mtosatti, kvm, linux-kernel, peterz, paulmck

On 02/29/2012 09:49 AM, Takuya Yoshikawa wrote:
> Avi Kivity <avi@redhat.com> wrote:
>
> > > The key part of the implementation is the use of xchg() operation for
> > > clearing dirty bits atomically.  Since this allows us to update only
> > > BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
> > > until every dirty bit is cleared again for the next call.
> > 
> > What about using cmpxchg16b?  That should reduce locked ops by a factor
> > of 2 (but note it needs 16 bytes alignment).
>
> I tried cmpxchg16b first: the implementation could not be naturally
> combined with the for loop over the unsigned long array.
>
> 	Extra "if not zero", alignement check and ... it was ugly
> 	and I guessed it would be slow.
>
> Taking it into account that cmpxchg16b needs more cycles than others,
> I think this should be tried carefully with more measurement later.
>
> How about concentrating on xchg now?

Okay.  But note you don't need the alignment check; simply allocate the
array aligned, and a multiple of 16 bytes, in the first place.

>
> The implementation is simple and gives us enough improvement for now.
> At least, I want to see whether xchg-based implementation works well
> for one release.
>
> 	GET_DIRTY_LOG can be easily tuned to one particular case and
> 	it is really hard to check whether the implementation works well
> 	for every important case.  I really want feedback from users
> 	before adding non-obvious optimization.
>
> In addition, we should care about the new API.  It is not decided about
> what kind of range can be ordered.  I think restricting the range to be
> long size aligned is natural.  Do you have any plan?

Not really.  But the current changes are all internal and don't affect
the user API.

>
> > > Another point to note is that we do not use for_each_set_bit() to check
> > > which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
> > > simply use __ffs() and __fls() and pass the range in between the two
> > > positions found by them to kvm_mmu_write_protect_pt_range().
> > 
> > This seems artificial.
>
> OK, then I want to pass the bits (unsingned long) as a mask.
>
> Non-NPT machines may gain some.
>
> > > Even though the passed range may include clean pages, it is much faster
> > > than repeatedly call find_next_bit() due to the locality of dirty pages.
> > 
> > Perhaps this is due to the implementation of find_next_bit()?  would
> > using bsf improve things?
>
> I need to explain what I did in the past.
>
> Before srcu-less work, I had already noticed the slowness of
> for_each_set_bit() and replaced it with simple for loop like now: the
> improvement was significant.
>
> Yes, find_next_bit() is for generic use and not at all good when there
> are many consecutive bits set: it cannot assume anything so needs to check
> a lot of cases - we have long size aligned bitmap and "bits" is already
> known to be non-zero after the first check of the for loop.
>
> Of course, doing 64 function calls alone should be avoided in our case.
> I also do not want to call kvm_mmu_* for each bit.
>
> So, above, I proposed just passing "bits" to kvm_mmu_*: we can check
> each bit i in a register before using rmap[i] if needed.
>
> __ffs is really fast compared to other APIs.

You could do something like this

    for (i = 0; i < bitmap_size_in_longs; ++i) {
        mask = bitmap[i];
        if (!mask)
               continue;
        for (j = __ffs(mask); mask; mask &= mask - 1, j = __ffs(mask))
                handle i * BITS_PER_LONG + j;
    }

This gives you the speed of __ffs() but without working on zero bits.

>
> One note is that we will lose in cases like bits = 0xffffffff..
>
> 2271171.4    12064.9       138.6      16K     -45
> 3375866.2    14743.3       103.0      32K     -55
> 4408395.6    10720.0        67.2      64K     -51
> 5915336.2    26538.1        45.1     128K     -44
> 8497356.4    16441.0        32.4     256K     -29
>
> So the last one will become worse.  For other 4 patterns I am not sure.
>
> I thought that we should tune to the last case for gaining a lot from
> the locality of WWS.  What do you think about this point?
>
> > > -	} else {
> > > -		r = -EFAULT;
> > > -		if (clear_user(log->dirty_bitmap, n))
> > > -			goto out;
> > > +		kvm_mmu_write_protect_pt_range(kvm, memslot, start, end);
> > 
> > If indeed the problem is find_next_bit(), then we could hanve
> > kvm_mmu_write_protect_slot_masked() which would just take the bitmap as
> > a parameter.  This would allow covering just this function with the
> > spinlock, not the xchg loop.
>
> We may see partial display updates if we do not hold the mmu_lock during
> xchg loop: it is possible that pages near the end of the framebuffer alone
> gets updated sometimes - I noticed this problem when I fixed the TLB flush
> issue.

I don't understand why this happens.

>
> Not a big problem but still maybe-noticeable change, so I think we should
> do it separately with some comments if needed.

Well, if it's noticeable on the framebuffer it's also noticeable with
live migration.  We could do it later, but we need to really understand
it first.

> In addition, we do not want to scan the dirty bitmap twice.  Using the
> bits value soon after it is read into a register seems to be the fastest.

Probably.

> BTW, I also want to decide the design of the new API at this chance.

Let's wait with that.  We're adding APIs too quickly.  Let's squeeze
everything we can out of the current APIs.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-02-29  9:25       ` Avi Kivity
@ 2012-02-29 10:16         ` Takuya Yoshikawa
  2012-02-29 12:27           ` Avi Kivity
  0 siblings, 1 reply; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-29 10:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: mtosatti, kvm, linux-kernel, peterz, paulmck

Avi Kivity <avi@redhat.com> wrote:

> Okay.  But note you don't need the alignment check; simply allocate the
> array aligned, and a multiple of 16 bytes, in the first place.

Okay, then we can do something like:

	for each (x = bitmap[i], y = bitmap[i+1])
		if (!x && !y)
			continue
		else if ...
			cmpxchg16b or 8b
		...

I will try later.

> > In addition, we should care about the new API.  It is not decided about
> > what kind of range can be ordered.  I think restricting the range to be
> > long size aligned is natural.  Do you have any plan?
> 
> Not really.  But the current changes are all internal and don't affect
> the user API.

Then I will do my best to improve the performance now!

> > __ffs is really fast compared to other APIs.
> 
> You could do something like this
> 
>     for (i = 0; i < bitmap_size_in_longs; ++i) {
>         mask = bitmap[i];
>         if (!mask)
>                continue;
>         for (j = __ffs(mask); mask; mask &= mask - 1, j = __ffs(mask))
>                 handle i * BITS_PER_LONG + j;
>     }
> 
> This gives you the speed of __ffs() but without working on zero bits.

Yes, I will do this in v2.

> > We may see partial display updates if we do not hold the mmu_lock during
> > xchg loop: it is possible that pages near the end of the framebuffer alone
> > gets updated sometimes - I noticed this problem when I fixed the TLB flush
> > issue.
> 
> I don't understand why this happens.

Because only mmu_lock protects the bitmap for VGA.

xchg i = 1
xchg i = 2
...
xchg i = N

We cannot get a complete snapshot without mmu_lock; if the guest faults
on the Nth page while we are xchg'ing i = 1, 2, ... then the word for
i = N alone will become newer.

But others will be updated by the next call, so the problem is restricted:
maybe not noticeable.

> > Not a big problem but still maybe-noticeable change, so I think we should
> > do it separately with some comments if needed.
> 
> Well if it's noticable on the framebuffer it's also noticable with live
> migration.  We could do it later, but we need to really understand it first.

About live migration, we do not mind whether the bitmap is a complete snapshot.
In addition, we cannot do anything because the emulator can access the bitmap
without mmu_lock.

What we are doing is calling GET_DIRTY_LOG slot by slot: so the result
is already not a snapshot at all.

In the end, at the last stage, we will stop the guest and get a complete snapshot.

> > In addition, we do not want to scan the dirty bitmap twice.  Using the
> > bits value soon after it is read into a register seems to be the fastest.
> 
> Probably.
> 
> > BTW, I also want to decide the design of the new API at this chance.
> 
> Let's wait with that.  We're adding APIs too quickly.  Let's squeeze
> everything we can out of the current APIs.

I agree with you of course.

At the same time, we cannot say anything without actually implementing
sample userspace programs.

So I want to see how much improvement the proposed API can achieve.

I thought this might be a good GSoC project but ...


Takuya

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-02-29 10:16         ` Takuya Yoshikawa
@ 2012-02-29 12:27           ` Avi Kivity
  2012-02-29 14:18             ` Takuya Yoshikawa
  0 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2012-02-29 12:27 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: mtosatti, kvm, linux-kernel, peterz, paulmck

On 02/29/2012 12:16 PM, Takuya Yoshikawa wrote:
> > > We may see partial display updates if we do not hold the mmu_lock during
> > > xchg loop: it is possible that pages near the end of the framebuffer alone
> > > gets updated sometimes - I noticed this problem when I fixed the TLB flush
> > > issue.
> > 
> > I don't understand why this happens.
>
> Because only mmu_lock protects the bitmap for VGA.
>
> xchg i = 1
> xchg i = 2
> ...
> xchg i = N
>
> We cannot get a complete snapshot without mmu_lock; if the guest faults on
> the Nth page during xchg'ing i = 1, 2, ... then the i = N alone will
> become newer.

Ah, so there is no data corruption (missed dirty bits), just the display
is updated inconsistently?

I don't think we can get a consistent snapshot anyway, since the guest
can update the framebuffer while userspace is processing it.

>
> But others will be updated by the next call, so the problem is restricted:
> maybe not noticeable.
>
> > > Not a big problem but still maybe-noticeable change, so I think we should
> > > do it separately with some comments if needed.
> > 
> > Well if it's noticable on the framebuffer it's also noticable with live
> > migration.  We could do it later, but we need to really understand it first.
>
> About live migration, we do not mind whether the bitmap is a complete snapshot.
> In addition, we cannot do anything because the emulator can access the bitmap
> without mmu_lock.
>
> What we are doing is calling GET_DIRTY_LOG slot by slot: so already the
> result is not a snapshot at all.
>
> In the end, at the last stage, we will stop the guest and get a complete snapshot.

Understood.  I don't think we can get a consistent vga snapshot without
stopping the guest, and even then, it depends on how the guest updates
the framebuffer.

>
> > > In addition, we do not want to scan the dirty bitmap twice.  Using the
> > > bits value soon after it is read into a register seems to be the fastest.
> > 
> > Probably.
> > 
> > > BTW, I also want to decide the design of the new API at this chance.
> > 
> > Let's wait with that.  We're adding APIs too quickly.  Let's squeeze
> > everything we can out of the current APIs.
>
> I agree with you of course.
>
> At the same time, we cannot say anything without actually implementing
> sample userspace programs.
>
> So I want to see how much improvement the proposed API can achieve.
>
> I thought this might be a good GSoC project but ...

It may be too involved for GSoC, the issues are difficult.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
  2012-02-29 12:27           ` Avi Kivity
@ 2012-02-29 14:18             ` Takuya Yoshikawa
  0 siblings, 0 replies; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-29 14:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Takuya Yoshikawa, mtosatti, kvm, linux-kernel, peterz, paulmck

Avi Kivity <avi@redhat.com> wrote:

> > We cannot get a complete snapshot without mmu_lock; if the guest faults on
> > the Nth page during xchg'ing i = 1, 2, ... then the i = N alone will
> > become newer.
> 
> Ah, so there is no data corruption (missed dirty bits), just the display
> is updated inconsistently?

Yes, no data corruption and just a matter of ... feeling.

The basic rule I wrote in the comment in this patch should be enough to
not lose dirty bits.

> I don't think we can get a consistent snapshot anyway, since the guest
> can update the framebuffer while userspace is processing it.

Yes, nothing will be broken.  I was just not sure what the API should promise
to the userspace.

To some extent the inconsistency may be felt more than before.

> Understood.  I don't think we can get a consistent vga snapshot without
> stopping the guest, and even then, it depends on how the guest updates
> the framebuffer.

Okay, I will not worry about this from now on.

> > So I want to see how much improvement the proposed API can achieve.
> >
> > I thought this might be a good GSoC project but ...
> 
> It may be too involved for GSoC, the issues are difficult.

I am now checking the QEMU code closely - it is more readable than before!

Though I can make experimental KVM patches for the student, just
changing QEMU's live migration code seems to be difficult - but I am not
sure yet.

Anyway, we should try this before deciding the new API: I mean, if I
cannot find anyone who wants to try this (it need not be a student), I
may do it myself at some point.

	Takuya

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-14  5:34     ` Takuya Yoshikawa
@ 2012-03-14 10:58       ` Marcelo Tosatti
  0 siblings, 0 replies; 24+ messages in thread
From: Marcelo Tosatti @ 2012-03-14 10:58 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, kvm, linux-kernel

On Wed, Mar 14, 2012 at 02:34:32PM +0900, Takuya Yoshikawa wrote:
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > guest fault                                     enable dirty logging
> > 
> > tdp_page_fault (all _page_fault functions)      kvm_set_memory_region
> > 
> > 
> > level = mapping_level(vcpu, gfn)
> > (finds level == 2 or 3)
> > 
> > 
> >                                                 rcu_assign_pointer(slot
> > 								   with 
> >                                                                    ->dirty_bitmap)
> >                                                 synchronize_srcu_expedited()
> 
> Isn't here still in the SRCU read-side critical section?
> 
> > schedule()
> >                                                 kvm_arch_commit_memory_region()
> >                                                 spin_lock(mmu_lock)
> >                                                 kvm_mmu_slot_remove_write_access()
> > 						removes large sptes
> >                                                 spin_unlock(mmu_lock)
> > spin_lock(mmu_lock)
> > create large spte accordingly
> > to level above
> > spin_unlock(mmu_lock)
> > 
> 
> If so, we cannot start kvm_arch_commit_memory_region() until the completion
> of the SRCU critical section and this race will not happen.
> 
> 	Takuya

Correct.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-13 23:04   ` Marcelo Tosatti
  2012-03-14  1:04     ` Takuya Yoshikawa
@ 2012-03-14  5:34     ` Takuya Yoshikawa
  2012-03-14 10:58       ` Marcelo Tosatti
  1 sibling, 1 reply; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-03-14  5:34 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: avi, kvm, linux-kernel

Marcelo Tosatti <mtosatti@redhat.com> wrote:

> guest fault                                     enable dirty logging
> 
> tdp_page_fault (all _page_fault functions)      kvm_set_memory_region
> 
> 
> level = mapping_level(vcpu, gfn)
> (finds level == 2 or 3)
> 
> 
>                                                 rcu_assign_pointer(slot
> 								   with 
>                                                                    ->dirty_bitmap)
>                                                 synchronize_srcu_expedited()

Isn't the fault path here still inside the SRCU read-side critical section?

> schedule()
>                                                 kvm_arch_commit_memory_region()
>                                                 spin_lock(mmu_lock)
>                                                 kvm_mmu_slot_remove_write_access()
> 						removes large sptes
>                                                 spin_unlock(mmu_lock)
> spin_lock(mmu_lock)
> create large spte accordingly
> to level above
> spin_unlock(mmu_lock)
> 

If so, kvm_arch_commit_memory_region() cannot start until the SRCU read-side
critical section completes, so this race will not happen.
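
To spell out the ordering (a simplified sketch, not the exact KVM code paths;
only srcu_read_lock()/srcu_read_unlock(), rcu_assign_pointer() and
synchronize_srcu_expedited() are real APIs here):

	/* vcpu side (sketch): the whole fault, including mapping_level()
	 * and the spte creation under mmu_lock, runs inside one SRCU
	 * read-side critical section. */
	idx = srcu_read_lock(&kvm->srcu);
	...
	level = mapping_level(vcpu, gfn);	/* may still see the old slot */
	...
	spin_lock(&kvm->mmu_lock);
	/* large spte created according to 'level' */
	spin_unlock(&kvm->mmu_lock);
	...
	srcu_read_unlock(&kvm->srcu, idx);

	/* memslot update side (sketch): */
	rcu_assign_pointer(kvm->memslots, new_slots);
	synchronize_srcu_expedited(&kvm->srcu);	/* waits for the reader above */
	kvm_arch_commit_memory_region(...);	/* so the large sptes are removed
						   only after that fault finished */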

	Takuya


* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-13 23:04   ` Marcelo Tosatti
@ 2012-03-14  1:04     ` Takuya Yoshikawa
  2012-03-14  5:34     ` Takuya Yoshikawa
  1 sibling, 0 replies; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-03-14  1:04 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: avi, kvm, linux-kernel

Marcelo Tosatti <mtosatti@redhat.com> wrote:

> 
> This is a race with hugetlbfs which is not an issue ATM (it is 
> hidden by the removal of huge sptes in get_dirty).

Thank you!
I did not notice this possibility at all.

...

> It can be fixed with a preceding patch that checks whether the
> slot->dirty_bitmap value changes between mapping_level() and the
> mmu_lock acquisition, similarly to mmu_seq.  Also please add a
> WARN_ON in mmu_set_spte() if (slot->dirty_bitmap && level > 1),
> and document it clearly.

We may also be able to change the "if (slot->dirty_bitmap)" check to
use another flag so that we can delay the start of logging until
mmu_lock acquisition in kvm_set_memory_region(); a rough sketch follows
below.

Looking at dirty_bitmap directly should be limited to the get_dirty
functions.
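
Very roughly, something like this (the flag name and its placement are made
up, just to illustrate the idea):

	/* in kvm_arch_commit_memory_region(), under mmu_lock: */
	spin_lock(&kvm->mmu_lock);
	kvm_mmu_slot_remove_write_access(kvm, memslot->id);
	memslot->dirty_logging_active = true;	/* made-up flag: logging
						   really starts here */
	spin_unlock(&kvm->mmu_lock);

	/* The fault path, also under mmu_lock, would then test this flag
	 * instead of slot->dirty_bitmap when deciding the mapping level,
	 * while the get_dirty functions keep using dirty_bitmap itself. */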

	Takuya


* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-01 10:32 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
  2012-03-02  2:56   ` Takuya Yoshikawa
  2012-03-12 18:04   ` Avi Kivity
@ 2012-03-13 23:04   ` Marcelo Tosatti
  2012-03-14  1:04     ` Takuya Yoshikawa
  2012-03-14  5:34     ` Takuya Yoshikawa
  2 siblings, 2 replies; 24+ messages in thread
From: Marcelo Tosatti @ 2012-03-13 23:04 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, kvm, linux-kernel

On Thu, Mar 01, 2012 at 07:32:16PM +0900, Takuya Yoshikawa wrote:
> We drop huge page mappings when we enable dirty logging and never create
> new ones until we stop the logging.
> 
> For this we introduce a new function which can be used to write protect
> a range of PT level pages: although we do not need to care about a range
> of pages at this point, the following patch will need this feature to
> optimize the write protection of many pages.
> 
> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
> ---
>  arch/x86/include/asm/kvm_host.h |    5 ++-
>  arch/x86/kvm/mmu.c              |   40 +++++++++++++++++++++++++++++---------
>  arch/x86/kvm/x86.c              |    8 ++----
>  3 files changed, 36 insertions(+), 17 deletions(-)

This is a race with hugetlbfs which is not an issue ATM (it is 
hidden by the removal of huge sptes in get_dirty).


guest fault                                     enable dirty logging

tdp_page_fault (all _page_fault functions)      kvm_set_memory_region


level = mapping_level(vcpu, gfn)
(finds level == 2 or 3)


                                                rcu_assign_pointer(slot
								   with 
                                                                   ->dirty_bitmap)
                                                synchronize_srcu_expedited()
schedule()
                                                kvm_arch_commit_memory_region()
                                                spin_lock(mmu_lock)
                                                kvm_mmu_slot_remove_write_access()
						removes large sptes
                                                spin_unlock(mmu_lock)
spin_lock(mmu_lock)
create large spte accordingly
to level above
spin_unlock(mmu_lock)

Not removing large sptes in get_dirty means this racy sptes could
live from the start of migration to the end of it.

It can be fixed with a preceding patch that checks whether the
slot->dirty_bitmap value changes between mapping_level() and the
mmu_lock acquisition, similarly to mmu_seq.  Also please add a
WARN_ON in mmu_set_spte() if (slot->dirty_bitmap && level > 1),
and document it clearly.
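
Roughly something like this (illustrative only, not a patch; the exact
placement and retry policy should follow the existing mmu_seq handling):

	/* fault path, e.g. tdp_page_fault() - sketch only */
	level = mapping_level(vcpu, gfn);	/* may race with enabling logging */
	...
	spin_lock(&vcpu->kvm->mmu_lock);
	/*
	 * Re-check after taking mmu_lock, similarly to the mmu_seq check:
	 * if dirty logging was enabled meanwhile, do not install a huge
	 * spte based on the stale level.
	 */
	if (slot->dirty_bitmap && level > PT_PAGE_TABLE_LEVEL)
		level = PT_PAGE_TABLE_LEVEL;	/* or retry the fault */
	...
	/* and in mmu_set_spte(): */
	WARN_ON(slot->dirty_bitmap && level > PT_PAGE_TABLE_LEVEL);
	...
	spin_unlock(&vcpu->kvm->mmu_lock);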




* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-13  9:20     ` Takuya Yoshikawa
@ 2012-03-13 10:12       ` Avi Kivity
  0 siblings, 0 replies; 24+ messages in thread
From: Avi Kivity @ 2012-03-13 10:12 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: mtosatti, kvm, linux-kernel

On 03/13/2012 11:20 AM, Takuya Yoshikawa wrote:
> Avi Kivity <avi@redhat.com> wrote:
>
> > It occurs to me that we should write-protect huge page tables, since it
> > makes write protection much faster (we make up for this later at write
> > fault time, but that might not occur, and even if it does we reduce
> > guest jitter).  In fact I once proposed a more involved variant, called
>
> Do you mean protecting huge page tables when we start logging and dropping
> them, one by one, at the time of write fault?

Yes.  Even more generally, protecting PML4Es, then at fault time
un-protecting the faulting PML4E and write-protecting all underlying
PDPEs except the one for the faulting address, and similarly for PDEs
and PTEs.
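
Roughly, in pseudocode (every name here - fake_pt, pt_index(), entry_wp() and
so on - is made up for illustration; this is not KVM code):

	/*
	 * Pseudocode: enabling dirty logging write-protects only the
	 * top-level entries; each write fault then pushes the protection
	 * one level down along the faulting path, so the enable-time cost
	 * is O(1) in the slot size and untouched subtrees stay protected
	 * as a whole.
	 */
	static void o1_write_fault(struct fake_pt *root, u64 addr)
	{
		struct fake_pt *table = root;
		int level;

		for (level = 4; level >= 1; level--) {
			int idx = pt_index(addr, level);

			if (entry_wp(table, idx)) {
				/* open the faulting path at this level ... */
				clear_entry_wp(table, idx);

				if (level > 1)
					/* ... and push the protection down by
					 * protecting every entry of the table
					 * below; the next iteration opens only
					 * the entry on the faulting path. */
					wp_all_entries(child_table(table, idx));
				else
					/* leaf: the page is now writable and
					 * is reported as dirty */
					mark_dirty(addr);
			}

			if (level > 1)
				table = child_table(table, idx);
		}
	}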

> If so, we do not need to change the get_dirty_log() implementation.
> Changing kvm_mmu_remove_write_access() and the fault handler seems to be enough.
>
> > O(1) write protection, in which we write-protect the topmost page table
> > only and only un-write-protect the paths that fault.
>
> > That can be done later however and shouldn't affect this patchset.
>
> I have some additional patches to optimize rmap handling which seem to
> improve get_dirty_log() by about 10% in the worst case.

Great, looking forward.

> After that, I want to take some time to prepare for further developments
> because my current test tools are not sufficient yet.
>

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-12 18:04   ` Avi Kivity
@ 2012-03-13  9:20     ` Takuya Yoshikawa
  2012-03-13 10:12       ` Avi Kivity
  0 siblings, 1 reply; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-03-13  9:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: mtosatti, kvm, linux-kernel

Avi Kivity <avi@redhat.com> wrote:

> It occurs to me that we should write-protect huge page tables, since it
> makes write protection much faster (we make up for this later at write
> fault time, but that might not occur, and even if it does we reduce
> guest jitter).  In fact I once proposed a more involved variant, called

Do you mean protecting huge page tables when we start logging and dropping
them, one by one, at the time of write fault?

If so, we do not need to change the get_dirty_log() implementation.
Changing kvm_mmu_remove_write_access() and the fault handler seems to be enough.

> O(1) write protection, in which we write-protect the topmost page table
> only and only un-write-protect the paths that fault.

> That can be done later however and shouldn't affect this patchset.

I have some additional patches to optimize rmap handling which seem to
improve get_dirty_log() by about 10% in the worst case.

After that, I want to take some time to prepare for further developments
because my current test tools are not sufficient yet.

	Takuya


* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-01 10:32 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
  2012-03-02  2:56   ` Takuya Yoshikawa
@ 2012-03-12 18:04   ` Avi Kivity
  2012-03-13  9:20     ` Takuya Yoshikawa
  2012-03-13 23:04   ` Marcelo Tosatti
  2 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2012-03-12 18:04 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: mtosatti, kvm, linux-kernel

On 03/01/2012 12:32 PM, Takuya Yoshikawa wrote:
> We drop huge page mappings when we enable dirty logging and never create
> new ones until we stop the logging.
>
> For this we introduce a new function which can be used to write protect
> a range of PT level pages: although we do not need to care about a range
> of pages at this point, the following patch will need this feature to
> optimize the write protection of many pages.
>
>

It occurs to me that we should write-protect huge page tables, since it
makes write protection much faster (we make up for this later at write
fault time, but that might not occur, and even if it does we reduce
guest jitter).  In fact I once proposed a more involved variant, called
O(1) write protection, in which we write-protect the topmost page table
only and only un-write-protect the paths that fault.

That can be done later however and shouldn't affect this patchset.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-02  2:56   ` Takuya Yoshikawa
@ 2012-03-02  5:11     ` Takuya Yoshikawa
  0 siblings, 0 replies; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-03-02  5:11 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel

Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> wrote:

> Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> wrote:
> 
> > +	while (mask) {
> > +		rmapp = &slot->rmap[gfn_offset + __ffs(mask)];
> > +		__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
> >  
> > -	return write_protected;
> > +		/* clear the first set bit */
> > +		mask &= mask - 1;
> > +	}
> >  }
> 
> while (mask) {
> 	fsbit = __ffs(mask);
> 	gfn_offset += fsbit;
> 
> 	rmapp = &slot->rmap[gfn_offset];
> 	__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
> 
> 	/* skip past the bit just handled */
> 	gfn_offset++;
> 	mask >>= fsbit + 1;
> }
> 
> Seems better: it is BSF-friendly.

I was wrong.

"and < shift" and bsf did not show any difference for this kind of loop
on my box.  So we should keep the original implementation.
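
For completeness, a stand-alone userspace sketch of the two idioms
(__builtin_ctzl() stands in for __ffs(), and printing replaces the rmap write
protection; note that shifting by fsbit + 1 == BITS_PER_LONG would be
undefined, which these test values avoid):

	#include <stdio.h>

	/* original style: bsf + "clear the lowest set bit" (and) */
	static void visit_and(unsigned long gfn_offset, unsigned long mask)
	{
		while (mask) {
			printf("%lu\n", gfn_offset + __builtin_ctzl(mask));
			mask &= mask - 1;
		}
	}

	/* proposed style: bsf + shift */
	static void visit_shift(unsigned long gfn_offset, unsigned long mask)
	{
		while (mask) {
			unsigned long fsbit = __builtin_ctzl(mask);

			gfn_offset += fsbit;
			printf("%lu\n", gfn_offset);
			gfn_offset++;
			mask >>= fsbit + 1;
		}
	}

	int main(void)
	{
		visit_and(100, 0x8421UL);	/* prints 100 105 110 115 */
		visit_shift(100, 0x8421UL);	/* prints the same offsets */
		return 0;
	}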

	Takuya


* Re: [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-01 10:32 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
@ 2012-03-02  2:56   ` Takuya Yoshikawa
  2012-03-02  5:11     ` Takuya Yoshikawa
  2012-03-12 18:04   ` Avi Kivity
  2012-03-13 23:04   ` Marcelo Tosatti
  2 siblings, 1 reply; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-03-02  2:56 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, linux-kernel

Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> wrote:

> +	while (mask) {
> +		rmapp = &slot->rmap[gfn_offset + __ffs(mask)];
> +		__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
>  
> -	return write_protected;
> +		/* clear the first set bit */
> +		mask &= mask - 1;
> +	}
>  }

while (mask) {
	fsbit = __ffs(mask);
	gfn_offset += fsbit;

	rmapp = &slot->rmap[gfn_offset];
	__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);

	/* skip past the bit just handled */
	gfn_offset++;
	mask >>= fsbit + 1;
}

Seems better: it is BSF-friendly.

I will update this function and post the result with dirty-log-perf output!

	Takuya


* [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log()
  2012-03-01 10:30 [PATCH 0/4] KVM: srcu-less dirty logging -v2 Takuya Yoshikawa
@ 2012-03-01 10:32 ` Takuya Yoshikawa
  2012-03-02  2:56   ` Takuya Yoshikawa
                     ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-03-01 10:32 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, linux-kernel

We drop huge page mappings when we enable dirty logging and never create
new ones until we stop the logging.

For this we introduce a new function which can be used to write protect
a range of PT level pages: although we do not need to care about a range
of pages at this point, the following patch will need this feature to
optimize the write protection of many pages.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/include/asm/kvm_host.h |    5 ++-
 arch/x86/kvm/mmu.c              |   40 +++++++++++++++++++++++++++++---------
 arch/x86/kvm/x86.c              |    8 ++----
 3 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 74c9edf..935cbcc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -712,8 +712,9 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 
 int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
-			       struct kvm_memory_slot *slot);
+void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+				     struct kvm_memory_slot *slot,
+				     gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 67857bd..be8a529 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1037,27 +1037,47 @@ static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, int level
 	return write_protected;
 }
 
-int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
-			       struct kvm_memory_slot *slot)
+/**
+ * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
+ * @kvm: kvm instance
+ * @slot: slot to protect
+ * @gfn_offset: start of the BITS_PER_LONG pages we care about
+ * @mask: indicates which pages we should protect
+ *
+ * Used when we do not need to care about huge page mappings: e.g. during dirty
+ * logging we do not have any such mappings.
+ */
+void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+				     struct kvm_memory_slot *slot,
+				     gfn_t gfn_offset, unsigned long mask)
 {
 	unsigned long *rmapp;
-	int i, write_protected = 0;
 
-	for (i = PT_PAGE_TABLE_LEVEL;
-	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
-		rmapp = __gfn_to_rmap(gfn, i, slot);
-		write_protected |= __rmap_write_protect(kvm, rmapp, i);
-	}
+	while (mask) {
+		rmapp = &slot->rmap[gfn_offset + __ffs(mask)];
+		__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
 
-	return write_protected;
+		/* clear the first set bit */
+		mask &= mask - 1;
+	}
 }
 
 static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
 	struct kvm_memory_slot *slot;
+	unsigned long *rmapp;
+	int i;
+	int write_protected = 0;
 
 	slot = gfn_to_memslot(kvm, gfn);
-	return kvm_mmu_rmap_write_protect(kvm, gfn, slot);
+
+	for (i = PT_PAGE_TABLE_LEVEL;
+	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+		rmapp = __gfn_to_rmap(gfn, i, slot);
+		write_protected |= __rmap_write_protect(kvm, rmapp, i);
+	}
+
+	return write_protected;
 }
 
 static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c9d99e5..3bc1922 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3069,13 +3069,11 @@ static void write_protect_slot(struct kvm *kvm,
 
 	/* Not many dirty pages compared to # of shadow pages. */
 	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
-		unsigned long gfn_offset;
+		gfn_t offset;
 
-		for_each_set_bit(gfn_offset, dirty_bitmap, memslot->npages) {
-			unsigned long gfn = memslot->base_gfn + gfn_offset;
+		for_each_set_bit(offset, dirty_bitmap, memslot->npages)
+			kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, 1);
 
-			kvm_mmu_rmap_write_protect(kvm, gfn, memslot);
-		}
 		kvm_flush_remote_tlbs(kvm);
 	} else
 		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
-- 
1.7.5.4



Thread overview: 24+ messages
2012-02-23 11:33 [PATCH 0/4] KVM: srcu-less dirty logging Takuya Yoshikawa
2012-02-23 11:33 ` [PATCH 1/4] KVM: MMU: Split the main body of rmap_write_protect() off from others Takuya Yoshikawa
2012-02-23 11:34 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
2012-02-23 11:35 ` [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log() Takuya Yoshikawa
2012-02-28 11:46   ` Avi Kivity
2012-02-29  7:49     ` Takuya Yoshikawa
2012-02-29  9:25       ` Avi Kivity
2012-02-29 10:16         ` Takuya Yoshikawa
2012-02-29 12:27           ` Avi Kivity
2012-02-29 14:18             ` Takuya Yoshikawa
2012-02-23 11:36 ` [PATCH 4/4] KVM: Remove unused dirty_bitmap_head and nr_dirty_pages Takuya Yoshikawa
2012-02-23 13:25 ` [PATCH 0/4] KVM: srcu-less dirty logging Peter Zijlstra
2012-02-28 10:03   ` Avi Kivity
2012-02-29  4:30     ` Takuya Yoshikawa
2012-03-01 10:30 [PATCH 0/4] KVM: srcu-less dirty logging -v2 Takuya Yoshikawa
2012-03-01 10:32 ` [PATCH 2/4] KVM: Avoid checking huge page mappings in get_dirty_log() Takuya Yoshikawa
2012-03-02  2:56   ` Takuya Yoshikawa
2012-03-02  5:11     ` Takuya Yoshikawa
2012-03-12 18:04   ` Avi Kivity
2012-03-13  9:20     ` Takuya Yoshikawa
2012-03-13 10:12       ` Avi Kivity
2012-03-13 23:04   ` Marcelo Tosatti
2012-03-14  1:04     ` Takuya Yoshikawa
2012-03-14  5:34     ` Takuya Yoshikawa
2012-03-14 10:58       ` Marcelo Tosatti
