* [PATCH RESEND v2 00/17] KVM: Dirty ring interface
From: Peter Xu @ 2019-12-21  1:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr. David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Vitaly Kuznetsov

Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
(based on 5.4.0)

This is v2 of the dirty ring series, and also the first non-RFC
version of it.  I didn't include a changelog from v1-rfc because I
feel it would be easier to dive into the patchset than to read that
lengthy and probably unhelpful changelog.  However, I would like to
summarize here what has changed most significantly, along with some
conclusions from the previous v1 discussions.

======================

* Per-vm ring is dropped

For x86 (which is still the major focus for now), we found that kvmgt
is probably the only user left that writes to the guest without a
vcpu context.  It would be a pity to keep the per-vm ring only for
kvmgt (which shouldn't write directly to the guest via the kvm api
after all...), so remove it.  Work should proceed in parallel to
refactor kvmgt so that it stops using kvm apis like kvm_write_guest().

However, I don't want to break kvmgt before it's fixed, so this
series uses an interim solution: no-vcpu-context writes fall back to
vcpu0 if one exists.  That keeps the interface clean (per-vcpu only)
without breaking the code base.  After kvmgt is fixed, we can
probably drop this special fallback and kvm->dirty_ring_lock
altogether.

* Waitqueue is still kept (for now)

We did plan to drop the waitqueue; however, with kvmgt there is still
a chance to completely fill a ring (and I suspect that will
definitely happen when migrating a kvmgt guest).  This series only
triggers the waitqueue mechanism in that special case
(no-vcpu-context), and doing so naturally avoids another mmu lock
deadlock I've encountered, which is good.

For vcpu context writes, the series is now even stricter: we directly
fail KVM_RUN if the dirty ring is soft full, until userspace collects
the dirty rings first.  That guarantees the ring will never be
completely full.  With that, I also dropped KVM_REQ_DIRTY_RING_FULL,
because it's no longer needed.
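
From the userspace side, this means the vcpu run loop has to react to
the new exit reason.  Here is a minimal sketch (not part of the
patchset; it assumes headers with this series applied, and
drain_all_rings() is a hypothetical stand-in for the harvesting step):

#include <linux/kvm.h>
#include <sys/ioctl.h>

void drain_all_rings(void);	/* hypothetical: walk every vcpu's ring */

static int vcpu_run_once(int vm_fd, int vcpu_fd, struct kvm_run *run)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, 0);

	/* KVM_RUN returns 0 with this exit reason on a soft-full ring */
	if (ret == 0 && run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
		drain_all_rings();
		/* ask the kernel to re-protect the harvested pages */
		return ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);
	}
	return ret;
}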

Potentially this could also be used by ARM, which has code paths that
dump device information into the guest
(e.g. KVM_DEV_ARM_ITS_SAVE_TABLES).  We'll see.  In any case, even
with the code in place, x86 (as long as kvmgt is not used) should
never trigger the waitqueue.

Although the waitqueue is kept, I dropped the waitqueue test
completely, simply because I can no longer trigger it without
kvmgt...

* Why not virtio?

There was already some discussion during the v1 patchset on whether
it would be good to use virtio for the data path of delivering dirty
pages [1].  I confess the only piece we might consider reusing is the
vring layout (virtqueues are tightly bound to devices, while we don't
have a device context here).  However, it's a pity that even the most
low-level vring api is iov based, which is already overkill for the
dirty ring (literally an array of addresses).  So I just kept things
simple.
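
To make the data path concrete, below is a rough sketch of how
userspace could walk one vcpu's ring (again not part of the patchset;
it assumes the uapi structures from patch 8, where the two
free-running indices live in the mmap-ed kvm_run page, and
collect_dirty_page() is a hypothetical caller-provided sink):

#include <linux/kvm.h>	/* headers with this series applied */
#include <stdint.h>

void collect_dirty_page(uint32_t slot, uint64_t offset);

/*
 * "gfns" is the kvm_dirty_gfn array previously mmap-ed from the vcpu
 * fd at page offset KVM_DIRTY_LOG_PAGE_OFFSET; "ring_size" is the
 * entry count configured via KVM_ENABLE_CAP(KVM_CAP_DIRTY_LOG_RING).
 */
static void harvest_dirty_ring(struct kvm_run *run,
			       struct kvm_dirty_gfn *gfns,
			       uint32_t ring_size)
{
	struct kvm_dirty_ring_indices *ind = &run->vcpu_ring_indices;
	uint32_t fetch = ind->fetch_index;
	/* acquire pairs with the kernel publishing entries first */
	uint32_t avail = __atomic_load_n(&ind->avail_index,
					 __ATOMIC_ACQUIRE);

	while (fetch != avail) {
		/* free-running counters: the array slot is index % size */
		struct kvm_dirty_gfn *e = &gfns[fetch % ring_size];

		collect_dirty_page(e->slot, e->offset);
		fetch++;
	}
	ind->fetch_index = fetch;	/* consumed by KVM_RESET_DIRTY_RINGS */
}

After one or more vcpus are drained this way, a single
KVM_RESET_DIRTY_RINGS ioctl on the VM fd asks the kernel to
re-protect the collected pages.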

======================

About the patchset:

Patch 1-5:    Mostly cleanups
Patch 6,7:    Prepare for the dirty ring interface
Patch 8-10:   Dirty ring implementation (mainly patch 8)
Patch 11-17:  Test cases update

Please have a look, thanks.

[1] V1 is here: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com

Paolo Bonzini (1):
  KVM: Move running VCPU from ARM to common code

Peter Xu (16):
  KVM: Remove kvm_read_guest_atomic()
  KVM: X86: Change parameter for fast_page_fault tracepoint
  KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  KVM: Cache as_id in kvm_memory_slot
  KVM: Add build-time error check on kvm_run size
  KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
  KVM: X86: Implement ring-based dirty memory tracking
  KVM: Make dirty ring exclusive to dirty bitmap log
  KVM: Don't allocate dirty bitmap if dirty ring is enabled
  KVM: selftests: Always clear dirty bitmap after iteration
  KVM: selftests: Sync uapi/linux/kvm.h to tools/
  KVM: selftests: Use a single binary for dirty/clear log test
  KVM: selftests: Introduce after_vcpu_run hook for dirty log test
  KVM: selftests: Add dirty ring buffer test
  KVM: selftests: Let dirty_log_test async for dirty ring test
  KVM: selftests: Add "-c" parameter to dirty log test

 Documentation/virt/kvm/api.txt                |  96 ++++
 arch/arm/include/asm/kvm_host.h               |   2 -
 arch/arm64/include/asm/kvm_host.h             |   2 -
 arch/x86/include/asm/kvm_host.h               |   3 +
 arch/x86/include/uapi/asm/kvm.h               |   1 +
 arch/x86/kvm/Makefile                         |   3 +-
 arch/x86/kvm/mmu.c                            |   6 +
 arch/x86/kvm/mmutrace.h                       |   9 +-
 arch/x86/kvm/vmx/vmx.c                        |  25 +-
 arch/x86/kvm/x86.c                            |   9 +
 include/linux/kvm_dirty_ring.h                |  57 +++
 include/linux/kvm_host.h                      |  44 +-
 include/trace/events/kvm.h                    |  78 ++++
 include/uapi/linux/kvm.h                      |  31 ++
 tools/include/uapi/linux/kvm.h                |  36 ++
 tools/testing/selftests/kvm/Makefile          |   2 -
 .../selftests/kvm/clear_dirty_log_test.c      |   2 -
 tools/testing/selftests/kvm/dirty_log_test.c  | 420 ++++++++++++++++--
 .../testing/selftests/kvm/include/kvm_util.h  |   4 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  64 +++
 .../selftests/kvm/lib/kvm_util_internal.h     |   3 +
 virt/kvm/arm/arch_timer.c                     |   2 +-
 virt/kvm/arm/arm.c                            |  29 --
 virt/kvm/arm/perf.c                           |   6 +-
 virt/kvm/arm/vgic/vgic-mmio.c                 |  15 +-
 virt/kvm/dirty_ring.c                         | 201 +++++++++
 virt/kvm/kvm_main.c                           | 269 +++++++++--
 27 files changed, 1274 insertions(+), 145 deletions(-)
 create mode 100644 include/linux/kvm_dirty_ring.h
 delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
 create mode 100644 virt/kvm/dirty_ring.c

-- 
2.24.1


* [PATCH RESEND v2 01/17] KVM: Remove kvm_read_guest_atomic()
From: Peter Xu @ 2019-12-21  1:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr. David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Vitaly Kuznetsov

Remove kvm_read_guest_atomic() because it's not used anywhere.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/kvm_host.h |  2 --
 virt/kvm/kvm_main.c      | 11 -----------
 2 files changed, 13 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d41c521a39da..2ea1ea79befd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -730,8 +730,6 @@ void kvm_get_pfn(kvm_pfn_t pfn);
 
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 			int len);
-int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
-			  unsigned long len);
 int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
 int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 			   void *data, unsigned long len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 13efc291b1c7..7ee28af9eb48 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2039,17 +2039,6 @@ static int __kvm_read_guest_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
 	return 0;
 }
 
-int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
-			  unsigned long len)
-{
-	gfn_t gfn = gpa >> PAGE_SHIFT;
-	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
-	int offset = offset_in_page(gpa);
-
-	return __kvm_read_guest_atomic(slot, gfn, data, offset, len);
-}
-EXPORT_SYMBOL_GPL(kvm_read_guest_atomic);
-
 int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 			       void *data, unsigned long len)
 {
-- 
2.24.1


* [PATCH RESEND v2 02/17] KVM: X86: Change parameter for fast_page_fault tracepoint
From: Peter Xu @ 2019-12-21  1:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr. David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Vitaly Kuznetsov

It is clearer to dump the return value, so that we can easily tell
whether the fast path was taken to handle the current page fault.
Remove the last two parameters, since the old/new sptes are already
dumped on the same line.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmutrace.h | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 7ca8831c7d1a..09bdc5c91650 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -244,9 +244,6 @@ TRACE_EVENT(
 		  __entry->access)
 );
 
-#define __spte_satisfied(__spte)				\
-	(__entry->retry && is_writable_pte(__entry->__spte))
-
 TRACE_EVENT(
 	fast_page_fault,
 	TP_PROTO(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code,
@@ -274,12 +271,10 @@ TRACE_EVENT(
 	),
 
 	TP_printk("vcpu %d gva %lx error_code %s sptep %p old %#llx"
-		  " new %llx spurious %d fixed %d", __entry->vcpu_id,
+		  " new %llx ret %d", __entry->vcpu_id,
 		  __entry->gva, __print_flags(__entry->error_code, "|",
 		  kvm_mmu_trace_pferr_flags), __entry->sptep,
-		  __entry->old_spte, __entry->new_spte,
-		  __spte_satisfied(old_spte), __spte_satisfied(new_spte)
-	)
+		  __entry->old_spte, __entry->new_spte, __entry->retry)
 );
 
 TRACE_EVENT(
-- 
2.24.1


* [PATCH RESEND v2 03/17] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
From: Peter Xu @ 2019-12-21  1:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr. David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Vitaly Kuznetsov

Originally, we have three code paths on X86 that can dirty a page
without a vcpu context:

  - init_rmode_identity_map
  - init_rmode_tss
  - kvmgt_rw_gpa

init_rmode_identity_map and init_rmode_tss will be set up on the
destination VM no matter what (and the guest cannot even see them),
so it does not make sense to track them at all.

To achieve this, a new parameter is added to
kvm_[write|clear]_guest_page() to indicate whether we would like to
track dirty bits for the operation.  With that, pass in "false" for
any guest memory write of these ioctls (KVM_SET_TSS_ADDR,
KVM_SET_IDENTITY_MAP_ADDR).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c   | 18 ++++++++++--------
 include/linux/kvm_host.h |  5 +++--
 virt/kvm/kvm_main.c      | 25 ++++++++++++++++---------
 3 files changed, 29 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 04a8212704c1..1ff5a428f489 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3452,24 +3452,24 @@ static int init_rmode_tss(struct kvm *kvm)
 
 	idx = srcu_read_lock(&kvm->srcu);
 	fn = to_kvm_vmx(kvm)->tss_addr >> PAGE_SHIFT;
-	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
+	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE, false);
 	if (r < 0)
 		goto out;
 	data = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE;
 	r = kvm_write_guest_page(kvm, fn++, &data,
-			TSS_IOPB_BASE_OFFSET, sizeof(u16));
+				 TSS_IOPB_BASE_OFFSET, sizeof(u16), false);
 	if (r < 0)
 		goto out;
-	r = kvm_clear_guest_page(kvm, fn++, 0, PAGE_SIZE);
+	r = kvm_clear_guest_page(kvm, fn++, 0, PAGE_SIZE, false);
 	if (r < 0)
 		goto out;
-	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
+	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE, false);
 	if (r < 0)
 		goto out;
 	data = ~0;
 	r = kvm_write_guest_page(kvm, fn, &data,
 				 RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1,
-				 sizeof(u8));
+				 sizeof(u8), false);
 out:
 	srcu_read_unlock(&kvm->srcu, idx);
 	return r;
@@ -3498,7 +3498,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
 		goto out2;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
+	r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE, false);
 	if (r < 0)
 		goto out;
 	/* Set up identity-mapping pagetable for EPT in real mode */
@@ -3506,7 +3506,8 @@ static int init_rmode_identity_map(struct kvm *kvm)
 		tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
 			_PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
 		r = kvm_write_guest_page(kvm, identity_map_pfn,
-				&tmp, i * sizeof(tmp), sizeof(tmp));
+					 &tmp, i * sizeof(tmp),
+					 sizeof(tmp), false);
 		if (r < 0)
 			goto out;
 	}
@@ -7265,7 +7266,8 @@ static int vmx_write_pml_buffer(struct kvm_vcpu *vcpu)
 		dst = vmcs12->pml_address + sizeof(u64) * vmcs12->guest_pml_index;
 
 		if (kvm_write_guest_page(vcpu->kvm, gpa_to_gfn(dst), &gpa,
-					 offset_in_page(dst), sizeof(gpa)))
+					 offset_in_page(dst), sizeof(gpa),
+					 false))
 			return 0;
 
 		vmcs12->guest_pml_index--;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2ea1ea79befd..4e34cf97ca90 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -734,7 +734,7 @@ int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
 int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 			   void *data, unsigned long len);
 int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
-			 int offset, int len);
+			 int offset, int len, bool track_dirty);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 		    unsigned long len);
 int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
@@ -744,7 +744,8 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 				  unsigned long len);
 int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 			      gpa_t gpa, unsigned long len);
-int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
+int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len,
+			 bool track_dirty);
 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7ee28af9eb48..b1047173d78e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2051,7 +2051,8 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
 static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
-			          const void *data, int offset, int len)
+			          const void *data, int offset, int len,
+				  bool track_dirty)
 {
 	int r;
 	unsigned long addr;
@@ -2062,16 +2063,19 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
 	r = __copy_to_user((void __user *)addr + offset, data, len);
 	if (r)
 		return -EFAULT;
-	mark_page_dirty_in_slot(memslot, gfn);
+	if (track_dirty)
+		mark_page_dirty_in_slot(memslot, gfn);
 	return 0;
 }
 
 int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
-			 const void *data, int offset, int len)
+			 const void *data, int offset, int len,
+			 bool track_dirty)
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(slot, gfn, data, offset, len,
+				      track_dirty);
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_page);
 
@@ -2080,7 +2084,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(slot, gfn, data, offset,
+				      len, true);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
 
@@ -2093,7 +2098,7 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 	int ret;
 
 	while ((seg = next_segment(len, offset)) != 0) {
-		ret = kvm_write_guest_page(kvm, gfn, data, offset, seg);
+		ret = kvm_write_guest_page(kvm, gfn, data, offset, seg, true);
 		if (ret < 0)
 			return ret;
 		offset = 0;
@@ -2232,11 +2237,13 @@ int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_cached);
 
-int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
+int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len,
+			 bool track_dirty)
 {
 	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
 
-	return kvm_write_guest_page(kvm, gfn, zero_page, offset, len);
+	return kvm_write_guest_page(kvm, gfn, zero_page, offset, len,
+				    track_dirty);
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest_page);
 
@@ -2248,7 +2255,7 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 	int ret;
 
 	while ((seg = next_segment(len, offset)) != 0) {
-		ret = kvm_clear_guest_page(kvm, gfn, offset, seg);
+		ret = kvm_clear_guest_page(kvm, gfn, offset, seg, true);
 		if (ret < 0)
 			return ret;
 		offset = 0;
-- 
2.24.1


* [PATCH RESEND v2 04/17] KVM: Cache as_id in kvm_memory_slot
From: Peter Xu @ 2019-12-21  1:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr. David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Vitaly Kuznetsov

Let's cache the address space ID just like the slot ID.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4e34cf97ca90..24854c9e3717 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -348,6 +348,7 @@ struct kvm_memory_slot {
 	unsigned long userspace_addr;
 	u32 flags;
 	short id;
+	u8 as_id;
 };
 
 static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b1047173d78e..cea4b8dd4ac9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1027,6 +1027,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
 	new = old = *slot;
 
+	BUILD_BUG_ON(U8_MAX < KVM_ADDRESS_SPACE_NUM);
+	new.as_id = as_id;
 	new.id = id;
 	new.base_gfn = base_gfn;
 	new.npages = npages;
-- 
2.24.1


* [PATCH RESEND v2 05/17] KVM: Add build-time error check on kvm_run size
From: Peter Xu @ 2019-12-21  1:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr. David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Vitaly Kuznetsov

struct kvm_run is already about to reach 2400 bytes (which is over
half of the page size on archs with 4K pages), so it's good to have
this build-time check in case it overflows a page when new fields are
added.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 virt/kvm/kvm_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cea4b8dd4ac9..c80a363831ae 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -338,6 +338,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->pre_pcpu = -1;
 	INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
 
+	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
 		r = -ENOMEM;
-- 
2.24.1


* [PATCH RESEND v2 06/17] KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
From: Peter Xu @ 2019-12-21  1:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr. David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Vitaly Kuznetsov

The context will be needed to implement the kvm dirty ring.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 virt/kvm/kvm_main.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c80a363831ae..17969cf110dd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -144,7 +144,9 @@ static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
+static void mark_page_dirty_in_slot(struct kvm *kvm,
+				    struct kvm_memory_slot *memslot,
+				    gfn_t gfn);
 
 __visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
@@ -2053,8 +2055,9 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
-static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
-			          const void *data, int offset, int len,
+static int __kvm_write_guest_page(struct kvm *kvm,
+				  struct kvm_memory_slot *memslot, gfn_t gfn,
+				  const void *data, int offset, int len,
 				  bool track_dirty)
 {
 	int r;
@@ -2067,7 +2070,7 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
 	if (r)
 		return -EFAULT;
 	if (track_dirty)
-		mark_page_dirty_in_slot(memslot, gfn);
+		mark_page_dirty_in_slot(kvm, memslot, gfn);
 	return 0;
 }
 
@@ -2077,7 +2080,7 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len,
+	return __kvm_write_guest_page(kvm, slot, gfn, data, offset, len,
 				      track_dirty);
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_page);
@@ -2087,7 +2090,7 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset,
+	return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset,
 				      len, true);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
@@ -2202,7 +2205,7 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
 	if (r)
 		return -EFAULT;
-	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
+	mark_page_dirty_in_slot(kvm, ghc->memslot, gpa >> PAGE_SHIFT);
 
 	return 0;
 }
@@ -2269,7 +2272,8 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
+static void mark_page_dirty_in_slot(struct kvm *kvm,
+				    struct kvm_memory_slot *memslot,
 				    gfn_t gfn)
 {
 	if (memslot && memslot->dirty_bitmap) {
@@ -2284,7 +2288,7 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	struct kvm_memory_slot *memslot;
 
 	memslot = gfn_to_memslot(kvm, gfn);
-	mark_page_dirty_in_slot(memslot, gfn);
+	mark_page_dirty_in_slot(kvm, memslot, gfn);
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty);
 
@@ -2293,7 +2297,7 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
 	struct kvm_memory_slot *memslot;
 
 	memslot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-	mark_page_dirty_in_slot(memslot, gfn);
+	mark_page_dirty_in_slot(vcpu->kvm, memslot, gfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
 
-- 
2.24.1


* [PATCH RESEND v2 07/17] KVM: Move running VCPU from ARM to common code
From: Peter Xu @ 2019-12-21  1:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr. David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Vitaly Kuznetsov

From: Paolo Bonzini <pbonzini@redhat.com>

For ring-based dirty log tracking, it will be more efficient to account
writes during schedule-out or schedule-in to the currently running VCPU.
We would like to do it even if the write doesn't use the current VCPU's
address space, as is the case for cached writes (see commit 4e335d9e7ddb,
"Revert "KVM: Support vCPU-based gfn->hva cache"", 2017-05-02).

Therefore, add a mechanism to track the currently-loaded kvm_vcpu struct.
There is already something similar in KVM/ARM; one important difference
is that kvm_arch_vcpu_{load,put} have two callers in virt/kvm/kvm_main.c:
we have to update both the architecture-independent vcpu_{load,put} and
the preempt notifiers.

Another change made in the process is to allow using kvm_get_running_vcpu()
in preemptible code.  This is allowed because preempt notifiers ensure
that the value does not change even after the VCPU thread is migrated.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm/include/asm/kvm_host.h   |  2 --
 arch/arm64/include/asm/kvm_host.h |  2 --
 include/linux/kvm_host.h          |  3 +++
 virt/kvm/arm/arch_timer.c         |  2 +-
 virt/kvm/arm/arm.c                | 29 -----------------------------
 virt/kvm/arm/perf.c               |  6 +++---
 virt/kvm/arm/vgic/vgic-mmio.c     | 15 +++------------
 virt/kvm/kvm_main.c               | 25 ++++++++++++++++++++++++-
 8 files changed, 34 insertions(+), 50 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 8a37c8e89777..40eff9cc3744 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -274,8 +274,6 @@ int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
-struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index f656169db8c3..df8d72f7c20e 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -430,8 +430,6 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
-struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 24854c9e3717..b4f7bef38e0d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1323,6 +1323,9 @@ static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 }
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 
+struct kvm_vcpu *kvm_get_running_vcpu(void);
+struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
+
 #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
 bool kvm_arch_has_irq_bypass(void);
 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *,
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index e2bb5bd60227..085e7fed850c 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -1022,7 +1022,7 @@ static bool timer_irqs_are_valid(struct kvm_vcpu *vcpu)
 
 bool kvm_arch_timer_get_input_level(int vintid)
 {
-	struct kvm_vcpu *vcpu = kvm_arm_get_running_vcpu();
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 	struct arch_timer_context *timer;
 
 	if (vintid == vcpu_vtimer(vcpu)->irq.irq)
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 86c6aa1cb58e..f7dbb94ec525 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -47,9 +47,6 @@ __asm__(".arch_extension	virt");
 DEFINE_PER_CPU(kvm_host_data_t, kvm_host_data);
 static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
 
-/* Per-CPU variable containing the currently running vcpu. */
-static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu);
-
 /* The VMID used in the VTTBR */
 static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
 static u32 kvm_next_vmid;
@@ -58,31 +55,8 @@ static DEFINE_SPINLOCK(kvm_vmid_lock);
 static bool vgic_present;
 
 static DEFINE_PER_CPU(unsigned char, kvm_arm_hardware_enabled);
-
-static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu)
-{
-	__this_cpu_write(kvm_arm_running_vcpu, vcpu);
-}
-
 DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
 
-/**
- * kvm_arm_get_running_vcpu - get the vcpu running on the current CPU.
- * Must be called from non-preemptible context
- */
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void)
-{
-	return __this_cpu_read(kvm_arm_running_vcpu);
-}
-
-/**
- * kvm_arm_get_running_vcpus - get the per-CPU array of currently running vcpus.
- */
-struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
-{
-	return &kvm_arm_running_vcpu;
-}
-
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
 	return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
@@ -374,7 +348,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vcpu->cpu = cpu;
 	vcpu->arch.host_cpu_context = &cpu_data->host_ctxt;
 
-	kvm_arm_set_running_vcpu(vcpu);
 	kvm_vgic_load(vcpu);
 	kvm_timer_vcpu_load(vcpu);
 	kvm_vcpu_load_sysregs(vcpu);
@@ -398,8 +371,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_vcpu_pmu_restore_host(vcpu);
 
 	vcpu->cpu = -1;
-
-	kvm_arm_set_running_vcpu(NULL);
 }
 
 static void vcpu_power_off(struct kvm_vcpu *vcpu)
diff --git a/virt/kvm/arm/perf.c b/virt/kvm/arm/perf.c
index 918cdc3839ea..d45b8b9a4415 100644
--- a/virt/kvm/arm/perf.c
+++ b/virt/kvm/arm/perf.c
@@ -13,14 +13,14 @@
 
 static int kvm_is_in_guest(void)
 {
-        return kvm_arm_get_running_vcpu() != NULL;
+        return kvm_get_running_vcpu() != NULL;
 }
 
 static int kvm_is_user_mode(void)
 {
 	struct kvm_vcpu *vcpu;
 
-	vcpu = kvm_arm_get_running_vcpu();
+	vcpu = kvm_get_running_vcpu();
 
 	if (vcpu)
 		return !vcpu_mode_priv(vcpu);
@@ -32,7 +32,7 @@ static unsigned long kvm_get_guest_ip(void)
 {
 	struct kvm_vcpu *vcpu;
 
-	vcpu = kvm_arm_get_running_vcpu();
+	vcpu = kvm_get_running_vcpu();
 
 	if (vcpu)
 		return *vcpu_pc(vcpu);
diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c
index 0d090482720d..d656ebd5f9d4 100644
--- a/virt/kvm/arm/vgic/vgic-mmio.c
+++ b/virt/kvm/arm/vgic/vgic-mmio.c
@@ -190,15 +190,6 @@ unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu,
  * value later will give us the same value as we update the per-CPU variable
  * in the preempt notifier handlers.
  */
-static struct kvm_vcpu *vgic_get_mmio_requester_vcpu(void)
-{
-	struct kvm_vcpu *vcpu;
-
-	preempt_disable();
-	vcpu = kvm_arm_get_running_vcpu();
-	preempt_enable();
-	return vcpu;
-}
 
 /* Must be called with irq->irq_lock held */
 static void vgic_hw_irq_spending(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
@@ -221,7 +212,7 @@ void vgic_mmio_write_spending(struct kvm_vcpu *vcpu,
 			      gpa_t addr, unsigned int len,
 			      unsigned long val)
 {
-	bool is_uaccess = !vgic_get_mmio_requester_vcpu();
+	bool is_uaccess = !kvm_get_running_vcpu();
 	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
 	int i;
 	unsigned long flags;
@@ -274,7 +265,7 @@ void vgic_mmio_write_cpending(struct kvm_vcpu *vcpu,
 			      gpa_t addr, unsigned int len,
 			      unsigned long val)
 {
-	bool is_uaccess = !vgic_get_mmio_requester_vcpu();
+	bool is_uaccess = !kvm_get_running_vcpu();
 	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
 	int i;
 	unsigned long flags;
@@ -335,7 +326,7 @@ static void vgic_mmio_change_active(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
 				    bool active)
 {
 	unsigned long flags;
-	struct kvm_vcpu *requester_vcpu = vgic_get_mmio_requester_vcpu();
+	struct kvm_vcpu *requester_vcpu = kvm_get_running_vcpu();
 
 	raw_spin_lock_irqsave(&irq->irq_lock, flags);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 17969cf110dd..5c606d158854 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -108,6 +108,7 @@ struct kmem_cache *kvm_vcpu_cache;
 EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
 
 static __read_mostly struct preempt_ops kvm_preempt_ops;
+static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
 
 struct dentry *kvm_debugfs_dir;
 EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
@@ -199,6 +200,8 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 void vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu = get_cpu();
+
+	__this_cpu_write(kvm_running_vcpu, vcpu);
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
@@ -210,6 +213,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
 	preempt_disable();
 	kvm_arch_vcpu_put(vcpu);
 	preempt_notifier_unregister(&vcpu->preempt_notifier);
+	__this_cpu_write(kvm_running_vcpu, NULL);
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(vcpu_put);
@@ -4294,8 +4298,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 	WRITE_ONCE(vcpu->preempted, false);
 	WRITE_ONCE(vcpu->ready, false);
 
+	__this_cpu_write(kvm_running_vcpu, vcpu);
 	kvm_arch_sched_in(vcpu, cpu);
-
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
 
@@ -4309,6 +4313,25 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 		WRITE_ONCE(vcpu->ready, true);
 	}
 	kvm_arch_vcpu_put(vcpu);
+	__this_cpu_write(kvm_running_vcpu, NULL);
+}
+
+/**
+ * kvm_get_running_vcpu - get the vcpu running on the current CPU.
+ * Thanks to preempt notifiers, this can also be called from
+ * preemptible context.
+ */
+struct kvm_vcpu *kvm_get_running_vcpu(void)
+{
+        return __this_cpu_read(kvm_running_vcpu);
+}
+
+/**
+ * kvm_get_running_vcpus - get the per-CPU array of currently running vcpus.
+ */
+struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
+{
+        return &kvm_running_vcpu;
 }
 
 static void check_processor_compat(void *rtn)
-- 
2.24.1


* [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking
From: Peter Xu @ 2019-12-21  1:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr. David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang, Vitaly Kuznetsov, Lei Cao

This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]

KVM currently uses large bitmaps to track dirty memory.  These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information.  The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are dirtied from one log-dirty
pass to another.  However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue exists for live migration when the guest memory is
huge but the rate of page dirtying is low.  In that case, each dirty
sync still needs to pull the whole dirty bitmap to userspace and
analyse every bit, even if it's mostly zeros.

The preferred data structure for the above scenarios is a dense list
of guest frame numbers (GFNs).  This patch series stores the dirty
list in kernel memory that can be memory-mapped into userspace to
allow speedy harvesting.
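
As an illustration (the numbers are mine, not from the series): a
64 GiB guest has 16M 4 KiB pages, so its dirty bitmap is 2 MiB and
has to be copied and scanned in full on every pass.  If only 8192
pages were dirtied since the last pass, a dense list of those GFNs at
16 bytes per entry takes just 128 KiB and needs no scanning at all.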

This patch enables the dirty ring for X86 only.  However, it should
be easy to extend it to other archs as well.

[1] https://patchwork.kernel.org/patch/10471409/

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 Documentation/virt/kvm/api.txt  |  89 ++++++++++++++
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/include/uapi/asm/kvm.h |   1 +
 arch/x86/kvm/Makefile           |   3 +-
 arch/x86/kvm/mmu.c              |   6 +
 arch/x86/kvm/vmx/vmx.c          |   7 ++
 arch/x86/kvm/x86.c              |   9 ++
 include/linux/kvm_dirty_ring.h  |  57 +++++++++
 include/linux/kvm_host.h        |  28 +++++
 include/trace/events/kvm.h      |  78 +++++++++++++
 include/uapi/linux/kvm.h        |  31 +++++
 virt/kvm/dirty_ring.c           | 201 ++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c             | 172 ++++++++++++++++++++++++++-
 13 files changed, 682 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/kvm_dirty_ring.h
 create mode 100644 virt/kvm/dirty_ring.c

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 4833904d32a5..c141b285e673 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
 It is thus encouraged to use the vm ioctl to query for capabilities (available
 with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
 
+
 4.5 KVM_GET_VCPU_MMAP_SIZE
 
 Capability: basic
@@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
 memory region.  This ioctl returns the size of that region.  See the
 KVM_RUN documentation for details.
 
+Besides the size of the KVM_RUN communication region, other areas of
+the VCPU file descriptor can be mmap-ed, including:
+
+- if KVM_CAP_COALESCED_MMIO is available, a page at
+  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
+  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
+  KVM_CAP_COALESCED_MMIO is not documented yet.
+
+- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
+  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
+  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
+
 
 4.6 KVM_SET_MEMORY_REGION
 
@@ -5302,6 +5315,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
 AArch64, this value will be reported in the ISS field of ESR_ELx.
 
 See KVM_CAP_VCPU_EVENTS for more details.
+
 8.20 KVM_CAP_HYPERV_SEND_IPI
 
 Architectures: x86
@@ -5309,6 +5323,7 @@ Architectures: x86
 This capability indicates that KVM supports paravirtualized Hyper-V IPI send
 hypercalls:
 HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
+
 8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
 
 Architecture: x86
@@ -5322,3 +5337,77 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
 flush hypercalls by Hyper-V) so userspace should disable KVM identification
 in CPUID and only exposes Hyper-V identification. In this case, guest
 thinks it's running on Hyper-V and only use Hyper-V hypercalls.
+
+8.22 KVM_CAP_DIRTY_LOG_RING
+
+Architectures: x86
+Parameters: args[0] - size of the dirty log ring
+
+KVM is capable of tracking dirty memory using ring buffers that are
+mmaped into userspace; there is one dirty ring per vcpu.
+
+One dirty ring is defined as below internally:
+
+struct kvm_dirty_ring {
+	u32 dirty_index;
+	u32 reset_index;
+	u32 size;
+	u32 soft_limit;
+	struct kvm_dirty_gfn *dirty_gfns;
+	struct kvm_dirty_ring_indices *indices;
+	int index;
+};
+
+Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
+For each of the dirty entry it's defined as:
+
+struct kvm_dirty_gfn {
+        __u32 pad;
+        __u32 slot; /* as_id | slot_id */
+        __u64 offset;
+};
+
+Most of the ring structure is used by KVM internally, while only the
+indices are exposed to userspace:
+
+struct kvm_dirty_ring_indices {
+	__u32 avail_index; /* set by kernel */
+	__u32 fetch_index; /* set by userspace */
+};
+
+The two indices in the ring buffer are free running counters.
+
+Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
+to enable this capability for the new guest and set the size of the
+rings.  It is only allowed before creating any vCPU, and the size of
+the ring must be a power of two.  The larger the ring buffer, the less
+likely the ring is full and the VM is forced to exit to userspace. The
+optimal size depends on the workload, but it is recommended that it be
+at least 64 KiB (4096 entries).
+
+Just like for dirty page bitmaps, the buffer tracks writes to
+all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
+set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
+with the flag set, userspace can start harvesting dirty pages from the
+ring buffer.
+
+To harvest the dirty pages, userspace accesses the mmaped ring buffer
+to read the dirty GFNs up to avail_index, and sets the fetch_index
+accordingly.  This can be done when the guest is running or paused,
+and dirty pages need not be collected all at once.  After processing
+one or more entries in the ring buffer, userspace calls the VM ioctl
+KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
+fetch_index and to mark those pages clean.  Therefore, the ioctl
+must be called *before* reading the content of the dirty pages.
+
+However, there is a major difference comparing to the
+KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
+userspace it's still possible that the kernel has not yet flushed the
+hardware dirty buffers into the kernel buffer (which was previously
+done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that, one needs to
+kick the vcpu out for a hardware buffer flush (vmexit) to make sure
+all the existing dirty gfns are flushed to the dirty rings.
+
+If one of the ring buffers is full, the guest will exit to userspace
+with the exit reason set to KVM_EXIT_DIRTY_RING_FULL, and the KVM_RUN
+ioctl will return to userspace with zero.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4fc61483919a..7e5e2d3f0509 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1159,6 +1159,7 @@ struct kvm_x86_ops {
 					   struct kvm_memory_slot *slot,
 					   gfn_t offset, unsigned long mask);
 	int (*write_log_dirty)(struct kvm_vcpu *vcpu);
+	int (*cpu_dirty_log_size)(void);
 
 	/* pmu operations of sub-arch */
 	const struct kvm_pmu_ops *pmu_ops;
@@ -1641,4 +1642,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 #define GET_SMSTATE(type, buf, offset)		\
 	(*(type *)((buf) + (offset) - 0x7e00))
 
+int kvm_cpu_dirty_log_size(void);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 503d3f42da16..b59bf356c478 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -12,6 +12,7 @@
 
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
 
 #define DE_VECTOR 0
 #define DB_VECTOR 1
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 31ecf7a76d5a..a66ddb552208 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
 KVM := ../../../virt/kvm
 
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
-				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
+				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
+				$(KVM)/dirty_ring.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
 
 kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2ce9da58611e..5f7d73730f73 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1818,7 +1818,13 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
 {
 	if (kvm_x86_ops->write_log_dirty)
 		return kvm_x86_ops->write_log_dirty(vcpu);
+	return 0;
+}
 
+int kvm_cpu_dirty_log_size(void)
+{
+	if (kvm_x86_ops->cpu_dirty_log_size)
+		return kvm_x86_ops->cpu_dirty_log_size();
 	return 0;
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1ff5a428f489..c3565319b481 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7686,6 +7686,7 @@ static __init int hardware_setup(void)
 		kvm_x86_ops->slot_disable_log_dirty = NULL;
 		kvm_x86_ops->flush_log_dirty = NULL;
 		kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
+		kvm_x86_ops->cpu_dirty_log_size = NULL;
 	}
 
 	if (!cpu_has_vmx_preemption_timer())
@@ -7750,6 +7751,11 @@ static __exit void hardware_unsetup(void)
 	free_kvm_area();
 }
 
+static int vmx_cpu_dirty_log_size(void)
+{
+	return enable_pml ? PML_ENTITY_NUM : 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -7873,6 +7879,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.flush_log_dirty = vmx_flush_log_dirty,
 	.enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked,
 	.write_log_dirty = vmx_write_pml_buffer,
+	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
 
 	.pre_block = vmx_pre_block,
 	.post_block = vmx_post_block,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5d530521f11d..f93262025a61 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7965,6 +7965,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	bool req_immediate_exit = false;
 
+	/* Forbid vmenter if vcpu dirty ring is soft-full */
+	if (unlikely(vcpu->kvm->dirty_ring_size &&
+		     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
+		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+		trace_kvm_dirty_ring_exit(vcpu);
+		r = 0;
+		goto out;
+	}
+
 	if (kvm_request_pending(vcpu)) {
 		if (kvm_check_request(KVM_REQ_GET_VMCS12_PAGES, vcpu)) {
 			if (unlikely(!kvm_x86_ops->get_vmcs12_pages(vcpu))) {
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
new file mode 100644
index 000000000000..06db2312b383
--- /dev/null
+++ b/include/linux/kvm_dirty_ring.h
@@ -0,0 +1,57 @@
+#ifndef KVM_DIRTY_RING_H
+#define KVM_DIRTY_RING_H
+
+/**
+ * kvm_dirty_ring: KVM internal dirty ring structure
+ *
+ * @dirty_index: free running counter that points to the next slot in
+ *               dirty_ring->dirty_gfns, where a new dirty page should go
+ * @reset_index: free running counter that points to the next dirty page
+ *               in dirty_ring->dirty_gfns for which dirty trap needs to
+ *               be reenabled
+ * @size:        size of the compact list, dirty_ring->dirty_gfns
+ * @soft_limit:  when the number of dirty pages in the list reaches this
+ *               limit, vcpu that owns this ring should exit to userspace
+ *               to allow userspace to harvest all the dirty pages
+ * @dirty_gfns:  the array to keep the dirty gfns
+ * @indices:     the pointer to the @kvm_dirty_ring_indices structure
+ *               of this specific ring
+ * @index:       index of this dirty ring
+ */
+struct kvm_dirty_ring {
+	u32 dirty_index;
+	u32 reset_index;
+	u32 size;
+	u32 soft_limit;
+	struct kvm_dirty_gfn *dirty_gfns;
+	struct kvm_dirty_ring_indices *indices;
+	int index;
+};
+
+u32 kvm_dirty_ring_get_rsvd_entries(void);
+int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
+			 struct kvm_dirty_ring_indices *indices,
+			 int index, u32 size);
+struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm);
+void kvm_dirty_ring_put(struct kvm *kvm,
+			struct kvm_dirty_ring *ring);
+
+/*
+ * called with kvm->slots_lock held, returns the number of
+ * processed pages.
+ */
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
+
+/*
+ * returns =0: successfully pushed
+ *         <0: unable to push, need to wait
+ */
+int kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
+
+/* for use in vm_operations_struct */
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
+bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
+
+#endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b4f7bef38e0d..dff214ab72eb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -34,6 +34,7 @@
 #include <linux/kvm_types.h>
 
 #include <asm/kvm_host.h>
+#include <linux/kvm_dirty_ring.h>
 
 #ifndef KVM_MAX_VCPU_ID
 #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
@@ -321,6 +322,7 @@ struct kvm_vcpu {
 	bool ready;
 	struct kvm_vcpu_arch arch;
 	struct dentry *debugfs_dentry;
+	struct kvm_dirty_ring dirty_ring;
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
@@ -502,6 +504,9 @@ struct kvm {
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
+	u32 dirty_ring_size;
+	struct spinlock dirty_ring_lock;
+	wait_queue_head_t dirty_ring_waitqueue;
 };
 
 #define kvm_err(fmt, ...) \
@@ -813,6 +818,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 					gfn_t gfn_offset,
 					unsigned long mask);
 
+void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
+
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 				struct kvm_dirty_log *log);
 int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
@@ -1392,4 +1399,25 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
 				uintptr_t data, const char *name,
 				struct task_struct **thread_ptr);
 
+/*
+ * This defines how many reserved entries we want to keep before we
+ * kick the vcpu to the userspace to avoid dirty ring full.  This
+ * value can be tuned to higher if e.g. PML is enabled on the host.
+ */
+#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
+
+/* Max number of entries allowed for each kvm dirty ring */
+#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
+
+/*
+ * Arch needs to define these macro after implementing the dirty ring
+ * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
+ * starting page offset of the dirty ring structures, while
+ * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
+ * feature is off on all archs.
+ */
+#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
+#define KVM_DIRTY_LOG_PAGE_OFFSET 0
+#endif
+
 #endif
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 2c735a3e6613..3d850997940c 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -399,6 +399,84 @@ TRACE_EVENT(kvm_halt_poll_ns,
 #define trace_kvm_halt_poll_ns_shrink(vcpu_id, new, old) \
 	trace_kvm_halt_poll_ns(false, vcpu_id, new, old)
 
+TRACE_EVENT(kvm_dirty_ring_push,
+	TP_PROTO(struct kvm_dirty_ring *ring, u32 slot, u64 offset),
+	TP_ARGS(ring, slot, offset),
+
+	TP_STRUCT__entry(
+		__field(int, index)
+		__field(u32, dirty_index)
+		__field(u32, reset_index)
+		__field(u32, slot)
+		__field(u64, offset)
+	),
+
+	TP_fast_assign(
+		__entry->index          = ring->index;
+		__entry->dirty_index    = ring->dirty_index;
+		__entry->reset_index    = ring->reset_index;
+		__entry->slot           = slot;
+		__entry->offset         = offset;
+	),
+
+	TP_printk("ring %d: dirty 0x%x reset 0x%x "
+		  "slot %u offset 0x%llx (used %u)",
+		  __entry->index, __entry->dirty_index,
+		  __entry->reset_index,  __entry->slot, __entry->offset,
+		  __entry->dirty_index - __entry->reset_index)
+);
+
+TRACE_EVENT(kvm_dirty_ring_reset,
+	TP_PROTO(struct kvm_dirty_ring *ring),
+	TP_ARGS(ring),
+
+	TP_STRUCT__entry(
+		__field(int, index)
+		__field(u32, dirty_index)
+		__field(u32, reset_index)
+	),
+
+	TP_fast_assign(
+		__entry->index          = ring->index;
+		__entry->dirty_index    = ring->dirty_index;
+		__entry->reset_index    = ring->reset_index;
+	),
+
+	TP_printk("ring %d: dirty 0x%x reset 0x%x (used %u)",
+		  __entry->index, __entry->dirty_index, __entry->reset_index,
+		  __entry->dirty_index - __entry->reset_index)
+);
+
+TRACE_EVENT(kvm_dirty_ring_waitqueue,
+	TP_PROTO(bool enter),
+	TP_ARGS(enter),
+
+	TP_STRUCT__entry(
+	    __field(bool, enter)
+	),
+
+	TP_fast_assign(
+	    __entry->enter = enter;
+	),
+
+	TP_printk("%s", __entry->enter ? "wait" : "awake")
+);
+
+TRACE_EVENT(kvm_dirty_ring_exit,
+	TP_PROTO(struct kvm_vcpu *vcpu),
+	TP_ARGS(vcpu),
+
+	TP_STRUCT__entry(
+	    __field(int, vcpu_id)
+	),
+
+	TP_fast_assign(
+	    __entry->vcpu_id = vcpu->vcpu_id;
+	),
+
+	TP_printk("vcpu %d", __entry->vcpu_id)
+);
+
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 52641d8ca9e8..5ea98e35a129 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -235,6 +235,7 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_S390_STSI        25
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
+#define KVM_EXIT_DIRTY_RING_FULL  28
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -246,6 +247,11 @@ struct kvm_hyperv_exit {
 /* Encounter unexpected vm-exit reason */
 #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
 
+struct kvm_dirty_ring_indices {
+	__u32 avail_index; /* set by kernel */
+	__u32 fetch_index; /* set by userspace */
+};
+
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
@@ -415,6 +421,8 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	struct kvm_dirty_ring_indices vcpu_ring_indices;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1000,6 +1008,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PMU_EVENT_FILTER 173
 #define KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 174
 #define KVM_CAP_HYPERV_DIRECT_TLBFLUSH 175
+#define KVM_CAP_DIRTY_LOG_RING 176
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1461,6 +1470,9 @@ struct kvm_enc_region {
 /* Available with KVM_CAP_ARM_SVE */
 #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
 
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
@@ -1611,4 +1623,23 @@ struct kvm_hyperv_eventfd {
 #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
 #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
 
+/*
+ * The following are the requirements for supporting dirty log ring
+ * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
+ *
+ * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
+ *    of kvm_write_* so that the global dirty ring is not filled up
+ *    too quickly.
+ * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
+ *    enabling dirty logging.
+ * 3. There should not be a separate step to synchronize hardware
+ *    dirty bitmap with KVM's.
+ */
+
+struct kvm_dirty_gfn {
+	__u32 pad;
+	__u32 slot;
+	__u64 offset;
+};
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
new file mode 100644
index 000000000000..c614822493ff
--- /dev/null
+++ b/virt/kvm/dirty_ring.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * KVM dirty ring implementation
+ *
+ * Copyright 2019 Red Hat, Inc.
+ */
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/vmalloc.h>
+#include <linux/kvm_dirty_ring.h>
+#include <trace/events/kvm.h>
+
+int __weak kvm_cpu_dirty_log_size(void)
+{
+	return 0;
+}
+
+u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
+}
+
+static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
+{
+	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
+}
+
+bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
+{
+	return kvm_dirty_ring_used(ring) >= ring->soft_limit;
+}
+
+bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
+{
+	return kvm_dirty_ring_used(ring) >= ring->size;
+}
+
+struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+	/*
+	 * TODO: Currently vcpu0's ring is used as the fallback.  This
+	 * should only happen when called without a vcpu context, i.e.
+	 * from kvmgt_rw_gpa on x86.  Once kvmgt is refactored, remove
+	 * this fallback together with kvm->dirty_ring_lock.
+	 */
+	if (!vcpu) {
+		pr_warn_once("Detected page dirtying without a vcpu context. "
+			     "Probably because kvm-gt is used. "
+			     "Expect unbalanced load on vcpu0.");
+		vcpu = kvm->vcpus[0];
+	}
+
+	WARN_ON_ONCE(vcpu->kvm != kvm);
+
+	if (vcpu == kvm->vcpus[0])
+		spin_lock(&kvm->dirty_ring_lock);
+
+	return &vcpu->dirty_ring;
+}
+
+void kvm_dirty_ring_put(struct kvm *kvm,
+			struct kvm_dirty_ring *ring)
+{
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+	if (!vcpu)
+		vcpu = kvm->vcpus[0];
+
+	WARN_ON_ONCE(vcpu->kvm != kvm);
+	WARN_ON_ONCE(&vcpu->dirty_ring != ring);
+
+	if (vcpu == kvm->vcpus[0])
+		spin_unlock(&kvm->dirty_ring_lock);
+}
+
+int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
+			 struct kvm_dirty_ring_indices *indices,
+			 int index, u32 size)
+{
+	ring->dirty_gfns = vmalloc(size);
+	if (!ring->dirty_gfns)
+		return -ENOMEM;
+	memset(ring->dirty_gfns, 0, size);
+
+	ring->size = size / sizeof(struct kvm_dirty_gfn);
+	ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
+	ring->dirty_index = 0;
+	ring->reset_index = 0;
+	ring->index = index;
+	ring->indices = indices;
+
+	return 0;
+}
+
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
+{
+	u32 cur_slot, next_slot;
+	u64 cur_offset, next_offset;
+	unsigned long mask;
+	u32 fetch;
+	int count = 0;
+	struct kvm_dirty_gfn *entry;
+	struct kvm_dirty_ring_indices *indices = ring->indices;
+	bool first_round = true;
+
+	fetch = READ_ONCE(indices->fetch_index);
+
+	/*
+	 * Note that fetch_index is written by userspace and hence
+	 * cannot be trusted.  If the check below fails, userspace
+	 * has most probably written a bogus fetch_index.
+	 */
+	if (fetch - ring->reset_index > ring->size)
+		return -EINVAL;
+
+	if (fetch == ring->reset_index)
+		return 0;
+
+	/* This is only needed to make compilers happy */
+	cur_slot = cur_offset = mask = 0;
+	while (ring->reset_index != fetch) {
+		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+		next_slot = READ_ONCE(entry->slot);
+		next_offset = READ_ONCE(entry->offset);
+		ring->reset_index++;
+		count++;
+		/*
+		 * Try to coalesce the reset operations when the guest is
+		 * scanning pages in the same slot.
+		 */
+		if (!first_round && next_slot == cur_slot) {
+			s64 delta = next_offset - cur_offset;
+
+			if (delta >= 0 && delta < BITS_PER_LONG) {
+				mask |= 1ull << delta;
+				continue;
+			}
+
+			/* Backwards visit, careful about overflows!  */
+			if (delta > -BITS_PER_LONG && delta < 0 &&
+			    (mask << -delta >> -delta) == mask) {
+				cur_offset = next_offset;
+				mask = (mask << -delta) | 1;
+				continue;
+			}
+		}
+		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+		cur_slot = next_slot;
+		cur_offset = next_offset;
+		mask = 1;
+		first_round = false;
+	}
+	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+	trace_kvm_dirty_ring_reset(ring);
+
+	return count;
+}
+
+int kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
+{
+	struct kvm_dirty_gfn *entry;
+	struct kvm_dirty_ring_indices *indices = ring->indices;
+
+	/*
+	 * Note: without a vcpu context we already bail out when the
+	 * ring is merely soft-full.  We can't risk filling it up
+	 * completely, since vcpu0 could push right after us, and
+	 * waiting for space with mmu_lock held would deadlock.
+	 */
+	if (kvm_get_running_vcpu() == NULL &&
+	    kvm_dirty_ring_soft_full(ring))
+		return -EBUSY;
+
+	/* The ring never gets completely full with a vcpu context */
+	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
+
+	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
+	entry->slot = slot;
+	entry->offset = offset;
+	smp_wmb();
+	ring->dirty_index++;
+	WRITE_ONCE(indices->avail_index, ring->dirty_index);
+
+	trace_kvm_dirty_ring_push(ring, slot, offset);
+
+	return 0;
+}
+
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
+{
+	return vmalloc_to_page((void *)ring->dirty_gfns + offset * PAGE_SIZE);
+}
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+	vfree(ring->dirty_gfns);
+	ring->dirty_gfns = NULL;
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5c606d158854..4050631d05f3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -64,6 +64,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>
 
+#include <linux/kvm_dirty_ring.h>
+
 /* Worst case buffer size needed for holding an integer. */
 #define ITOA_MAX_LEN 12
 
@@ -148,6 +150,9 @@ static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 static void mark_page_dirty_in_slot(struct kvm *kvm,
 				    struct kvm_memory_slot *memslot,
 				    gfn_t gfn);
+static void mark_page_dirty_in_ring(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    gfn_t gfn);
 
 __visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
@@ -357,11 +362,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->preempted = false;
 	vcpu->ready = false;
 
+	if (kvm->dirty_ring_size) {
+		r = kvm_dirty_ring_alloc(&vcpu->dirty_ring,
+					 &vcpu->run->vcpu_ring_indices,
+					 id, kvm->dirty_ring_size);
+		if (r)
+			goto fail_free_run;
+	}
+
 	r = kvm_arch_vcpu_init(vcpu);
 	if (r < 0)
-		goto fail_free_run;
+		goto fail_free_ring;
 	return 0;
 
+fail_free_ring:
+	if (kvm->dirty_ring_size)
+		kvm_dirty_ring_free(&vcpu->dirty_ring);
 fail_free_run:
 	free_page((unsigned long)vcpu->run);
 fail:
@@ -379,6 +395,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
 	put_pid(rcu_dereference_protected(vcpu->pid, 1));
 	kvm_arch_vcpu_uninit(vcpu);
 	free_page((unsigned long)vcpu->run);
+	if (vcpu->kvm->dirty_ring_size)
+		kvm_dirty_ring_free(&vcpu->dirty_ring);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
 
@@ -693,6 +711,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 		return ERR_PTR(-ENOMEM);
 
 	spin_lock_init(&kvm->mmu_lock);
+	spin_lock_init(&kvm->dirty_ring_lock);
 	mmgrab(current->mm);
 	kvm->mm = current->mm;
 	kvm_eventfd_init(kvm);
@@ -700,6 +719,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	mutex_init(&kvm->irq_lock);
 	mutex_init(&kvm->slots_lock);
 	INIT_LIST_HEAD(&kvm->devices);
+	init_waitqueue_head(&kvm->dirty_ring_waitqueue);
 
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
 
@@ -2283,7 +2303,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 
-		set_bit_le(rel_gfn, memslot->dirty_bitmap);
+		if (kvm->dirty_ring_size)
+			mark_page_dirty_in_ring(kvm, memslot, gfn);
+		else
+			set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
 
@@ -2630,6 +2653,16 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 
+static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
+{
+	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
+		return false;
+
+	return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
+	    (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
+	     kvm->dirty_ring_size / PAGE_SIZE);
+}
+
 static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 {
 	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
@@ -2645,6 +2678,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
 		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
 #endif
+	else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
+		page = kvm_dirty_ring_get_page(
+		    &vcpu->dirty_ring,
+		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
 	else
 		return kvm_arch_vcpu_fault(vcpu, vmf);
 	get_page(page);
@@ -3239,12 +3276,138 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 	case KVM_CAP_NR_MEMSLOTS:
 		return KVM_USER_MEM_SLOTS;
+	case KVM_CAP_DIRTY_LOG_RING:
+#ifdef CONFIG_X86
+		return KVM_DIRTY_RING_MAX_ENTRIES;
+#else
+		return 0;
+#endif
 	default:
 		break;
 	}
 	return kvm_vm_ioctl_check_extension(kvm, arg);
 }
 
+static void mark_page_dirty_in_ring(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    gfn_t gfn)
+{
+	struct kvm_dirty_ring *ring;
+	u64 offset;
+	int ret;
+
+	if (!kvm->dirty_ring_size)
+		return;
+
+	offset = gfn - slot->base_gfn;
+
+	ring = kvm_dirty_ring_get(kvm);
+
+retry:
+	ret = kvm_dirty_ring_push(ring, (slot->as_id << 16) | slot->id,
+				  offset);
+	if (ret < 0) {
+		/* We must be without a vcpu context. */
+		WARN_ON_ONCE(kvm_get_running_vcpu());
+
+		trace_kvm_dirty_ring_waitqueue(1);
+		/*
+		 * Ring is full; put us onto the per-vm waitqueue and
+		 * wait for a KVM_RESET_DIRTY_RINGS before retrying.
+		 */
+		wait_event_killable(kvm->dirty_ring_waitqueue,
+				    !kvm_dirty_ring_soft_full(ring));
+
+		trace_kvm_dirty_ring_waitqueue(0);
+
+		/* If we're killed, don't worry about losing dirty bits */
+		if (fatal_signal_pending(current))
+			return;
+
+		goto retry;
+	}
+
+	kvm_dirty_ring_put(kvm, ring);
+}
+
+void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
+{
+	struct kvm_memory_slot *memslot;
+	int as_id, id;
+
+	as_id = slot >> 16;
+	id = (u16)slot;
+	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+		return;
+
+	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
+	if (offset >= memslot->npages)
+		return;
+
+	spin_lock(&kvm->mmu_lock);
+	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
+	spin_unlock(&kvm->mmu_lock);
+}
+
+static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
+{
+	int r;
+
+	/* The size should be a power of 2 */
+	if (!size || (size & (size - 1)))
+		return -EINVAL;
+
+	/* Must cover the reserved entries and be at least a page */
+	if (size < kvm_dirty_ring_get_rsvd_entries() *
+	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
+		return -EINVAL;
+
+	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
+	    sizeof(struct kvm_dirty_gfn))
+		return -E2BIG;
+
+	/* We only allow the size to be set once */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->lock);
+
+	if (kvm->created_vcpus) {
+		/* The size can't be changed after vcpus are created */
+		r = -EINVAL;
+	} else {
+		kvm->dirty_ring_size = size;
+		r = 0;
+	}
+
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
+static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	int cleared = 0;
+
+	if (!kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->slots_lock);
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
+
+	mutex_unlock(&kvm->slots_lock);
+
+	if (cleared)
+		kvm_flush_remote_tlbs(kvm);
+
+	wake_up_all(&kvm->dirty_ring_waitqueue);
+
+	return cleared;
+}
+
 int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 						  struct kvm_enable_cap *cap)
 {
@@ -3262,6 +3425,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 		kvm->manual_dirty_log_protect = cap->args[0];
 		return 0;
 #endif
+	case KVM_CAP_DIRTY_LOG_RING:
+		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
@@ -3449,6 +3614,9 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;
+	case KVM_RESET_DIRTY_RINGS:
+		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
+		break;
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
-- 
2.24.1
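
For reference, a minimal sketch (not part of the patch) of how
userspace could harvest one vcpu's ring with this interface.  It
assumes "run" is the vcpu's mmap()ed kvm_run, "ring" is the array of
struct kvm_dirty_gfn mmap()ed at page offset KVM_DIRTY_LOG_PAGE_OFFSET
of the vcpu fd, "ring_size" is the number of entries (a power of 2),
and consume_dirty_gfn() is a hypothetical callback; error handling is
omitted:

	static void harvest_vcpu_ring(struct kvm_run *run,
				      struct kvm_dirty_gfn *ring,
				      uint32_t ring_size, int vm_fd)
	{
		struct kvm_dirty_ring_indices *ind = &run->vcpu_ring_indices;
		uint32_t fetch = ind->fetch_index;
		/* Acquire pairs with the kernel's smp_wmb() in push() */
		uint32_t avail = __atomic_load_n(&ind->avail_index,
						 __ATOMIC_ACQUIRE);

		while (fetch != avail) {
			struct kvm_dirty_gfn *e = &ring[fetch & (ring_size - 1)];

			/* e->slot is (as_id << 16) | slot_id, e->offset is
			 * the gfn offset within that memslot */
			consume_dirty_gfn(e->slot, e->offset);
			fetch++;
		}

		/* Publish how far we consumed, then let KVM reset the
		 * ring and write-protect the collected pages again */
		__atomic_store_n(&ind->fetch_index, fetch, __ATOMIC_RELEASE);
		ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);
	}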


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH RESEND v2 09/17] KVM: Make dirty ring exclusive to dirty bitmap log
  2019-12-21  1:49 [PATCH RESEND v2 00/17] KVM: Dirty ring interface Peter Xu
                   ` (7 preceding siblings ...)
  2019-12-21  1:49 ` [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
@ 2019-12-21  1:49 ` Peter Xu
  2019-12-21  1:58 ` [PATCH RESEND v2 10/17] KVM: Don't allocate dirty bitmap if dirty ring is enabled Peter Xu
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2019-12-21  1:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr . David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

There's no good reason to use both the dirty bitmap logging and the
new dirty ring buffer to track dirty bits.  We could probably even
support both of them at the same time, but it would complicate things
while helping little.  Let's simply make it a rule, before we enable
dirty ring on any arch, that the two interfaces can't be used
together.

The switching point is the enablement of the KVM_CAP_DIRTY_LOG_RING
capability.  That's where we switch from the default dirty logging
scheme to the dirty ring.  As long as kvm->dirty_ring_size is set up
correctly, the current virtual machine will switch to the dirty ring
mode once and for all.
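
For example, a userspace sketch of the switch (not from this series;
vm_fd is assumed to be the VM file descriptor, and this must run
before any vcpu is created):

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_DIRTY_LOG_RING,
		/* Per-vcpu ring size in bytes: a power of 2, at least
		 * one page, and a multiple of sizeof(struct
		 * kvm_dirty_gfn) */
		.args[0] = 4096 * sizeof(struct kvm_dirty_gfn),
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

Once this succeeds, KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG on the
same VM will return -EINVAL.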

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 Documentation/virt/kvm/api.txt |  7 +++++++
 virt/kvm/kvm_main.c            | 12 ++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index c141b285e673..b507b966f9f1 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -5411,3 +5411,10 @@ all the existing dirty gfns are flushed to the dirty rings.
 If one of the ring buffers is full, the guest will exit to userspace
 with the exit reason set to KVM_EXIT_DIRTY_RING_FULL, and the KVM_RUN
 ioctl will return to userspace with zero.
+
+NOTE: the KVM_CAP_DIRTY_LOG_RING capability and the new ioctl
+KVM_RESET_DIRTY_RINGS are mutually exclusive with the existing
+KVM_GET_DIRTY_LOG interface.  After enabling KVM_CAP_DIRTY_LOG_RING
+with an acceptable dirty ring size, the virtual machine will switch
+to the dirty ring tracking mode, and the KVM_GET_DIRTY_LOG and
+KVM_CLEAR_DIRTY_LOG ioctls will stop working.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4050631d05f3..b69d34425f8d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1204,6 +1204,10 @@ int kvm_get_dirty_log(struct kvm *kvm,
 	unsigned long n;
 	unsigned long any = 0;
 
+	/* Dirty ring tracking is exclusive to dirty log tracking */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
@@ -1261,6 +1265,10 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap;
 	unsigned long *dirty_bitmap_buffer;
 
+	/* Dirty ring tracking is exclusive to dirty log tracking */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
@@ -1332,6 +1340,10 @@ int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap;
 	unsigned long *dirty_bitmap_buffer;
 
+	/* Dirty ring tracking is exclusive to dirty log tracking */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH RESEND v2 10/17] KVM: Don't allocate dirty bitmap if dirty ring is enabled
  2019-12-21  1:49 [PATCH RESEND v2 00/17] KVM: Dirty ring interface Peter Xu
                   ` (8 preceding siblings ...)
  2019-12-21  1:49 ` [PATCH RESEND v2 09/17] KVM: Make dirty ring exclusive to dirty bitmap log Peter Xu
@ 2019-12-21  1:58 ` Peter Xu
  2019-12-21  2:04 ` [PATCH RESEND v2 11/17] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2019-12-21  1:58 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Peter Xu, Dr David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin ,
	Jason Wang, Vitaly Kuznetsov

Because the kvm dirty ring and the kvm dirty log are used in an
exclusive way, let's avoid creating the dirty_bitmap when the kvm
dirty ring is enabled.  Meanwhile, since the dirty_bitmap will be
conditionally created now, we can't use it as a sign of "whether this
memory slot has dirty tracking enabled".  Change such users to check
against the kvm memory slot flags.

Note that there is still a chance for a kvm memory slot to get its
dirty_bitmap allocated: if the slot is created with the dirty tracking
flag set before the dirty ring is enabled, it will still carry the
dirty_bitmap.  However that should not hurt much (e.g., the bitmaps
are always freed if present), and real users normally won't trigger
it, because the dirty tracking flag is in most cases only applied to
kvm slots right before migration starts, which is far later than kvm
initialization (VM start).
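
As an illustration, the leftover case is a sequence like the below
(hypothetical, with the memory region setup abbreviated):

	/* Slot created with dirty tracking before the ring is enabled:
	 * kvm_create_dirty_bitmap() still runs for it */
	region.flags = KVM_MEM_LOG_DIRTY_PAGES;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

	/* Ring enabled afterwards: the bitmap above stays allocated
	 * (unused, and freed together with the slot) */
	ioctl(vm_fd, KVM_ENABLE_CAP, &dirty_ring_cap);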

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/kvm_host.h | 5 +++++
 virt/kvm/kvm_main.c      | 5 +++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index dff214ab72eb..725650aac05d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -353,6 +353,11 @@ struct kvm_memory_slot {
 	u8 as_id;
 };
 
+static inline bool kvm_slot_dirty_track_enabled(struct kvm_memory_slot *slot)
+{
+	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
+}
+
 static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
 {
 	return ALIGN(memslot->npages, BITS_PER_LONG) / 8;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b69d34425f8d..bb11fec1bf08 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1113,7 +1113,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	}
 
 	/* Allocate page dirty bitmap if needed */
-	if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
+	if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap &&
+	    !kvm->dirty_ring_size) {
 		if (kvm_create_dirty_bitmap(&new) < 0)
 			goto out_free;
 	}
@@ -2312,7 +2313,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
 				    struct kvm_memory_slot *memslot,
 				    gfn_t gfn)
 {
-	if (memslot && memslot->dirty_bitmap) {
+	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 
 		if (kvm->dirty_ring_size)
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH RESEND v2 11/17] KVM: selftests: Always clear dirty bitmap after iteration
  2019-12-21  1:49 [PATCH RESEND v2 00/17] KVM: Dirty ring interface Peter Xu
                   ` (9 preceding siblings ...)
  2019-12-21  1:58 ` [PATCH RESEND v2 10/17] KVM: Don't allocate dirty bitmap if dirty ring is enabled Peter Xu
@ 2019-12-21  2:04 ` Peter Xu
  2019-12-21  2:04 ` [PATCH RESEND v2 12/17] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2019-12-21  2:04 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

We didn't clear the dirty bitmap before because KVM_GET_DIRTY_LOG
clears it for us before copying the dirty log onto it.  However we'd
better clear it explicitly instead of assuming the kernel will always
do it for us.

More importantly, in the upcoming dirty ring tests we'll start to
fetch dirty pages from a ring buffer, so no one will clear the dirty
bitmap for us.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 5614222a6628..3c0ffd34b3b0 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -197,7 +197,7 @@ static void vm_dirty_log_verify(unsigned long *bmap)
 				    page);
 		}
 
-		if (test_bit_le(page, bmap)) {
+		if (test_and_clear_bit_le(page, bmap)) {
 			host_dirty_count++;
 			/*
 			 * If the bit is set, the value written onto
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH RESEND v2 12/17] KVM: selftests: Sync uapi/linux/kvm.h to tools/
  2019-12-21  1:49 [PATCH RESEND v2 00/17] KVM: Dirty ring interface Peter Xu
                   ` (10 preceding siblings ...)
  2019-12-21  2:04 ` [PATCH RESEND v2 11/17] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
@ 2019-12-21  2:04 ` Peter Xu
  2019-12-21  2:04 ` [PATCH RESEND v2 13/17] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2019-12-21  2:04 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

This will be needed to extend the kvm selftest program.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/include/uapi/linux/kvm.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 52641d8ca9e8..17df0de21cce 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -235,6 +235,7 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_S390_STSI        25
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
+#define KVM_EXIT_DIRTY_RING_FULL  28
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -246,6 +247,11 @@ struct kvm_hyperv_exit {
 /* Encounter unexpected vm-exit reason */
 #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
 
+struct kvm_dirty_ring_indices {
+	__u32 avail_index; /* set by kernel */
+	__u32 fetch_index; /* set by userspace */
+};
+
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
@@ -415,6 +421,8 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	struct kvm_dirty_ring_indices vcpu_ring_indices;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1000,6 +1008,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PMU_EVENT_FILTER 173
 #define KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 174
 #define KVM_CAP_HYPERV_DIRECT_TLBFLUSH 175
+#define KVM_CAP_DIRTY_LOG_RING 176
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1461,6 +1470,9 @@ struct kvm_enc_region {
 /* Available with KVM_CAP_ARM_SVE */
 #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
 
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
@@ -1611,4 +1623,23 @@ struct kvm_hyperv_eventfd {
 #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
 #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
 
+/*
+ * The following are the requirements for supporting dirty log ring
+ * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
+ *
+ * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
+ *    of kvm_write_* so that the global dirty ring is not filled up
+ *    too quickly.
+ * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
+ *    enabling dirty logging.
+ * 3. There should not be a separate step to synchronize hardware
+ *    dirty bitmap with KVM's.
+ */
+
+struct kvm_dirty_gfn {
+	__u32 pad;
+	__u32 slot;
+	__u64 offset;
+};
+
 #endif /* __LINUX_KVM_H */
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH RESEND v2 13/17] KVM: selftests: Use a single binary for dirty/clear log test
  2019-12-21  1:49 [PATCH RESEND v2 00/17] KVM: Dirty ring interface Peter Xu
                   ` (11 preceding siblings ...)
  2019-12-21  2:04 ` [PATCH RESEND v2 12/17] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
@ 2019-12-21  2:04 ` Peter Xu
  2019-12-21  2:04 ` [PATCH RESEND v2 14/17] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2019-12-21  2:04 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

Remove the clear_dirty_log test and merge it into the existing
dirty_log_test instead.  It's cleaner to use a single binary for both
tests, and it also prepares for the upcoming dirty ring test.

The default test will still be the dirty_log test.  To run the clear
dirty log test, we need to specify "-M clear-log".
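
For example, a typical invocation would look like:

  $ ./dirty_log_test -M clear-log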

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/Makefile          |   2 -
 .../selftests/kvm/clear_dirty_log_test.c      |   2 -
 tools/testing/selftests/kvm/dirty_log_test.c  | 131 +++++++++++++++---
 3 files changed, 110 insertions(+), 25 deletions(-)
 delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index c5ec868fa1e5..ad91b7129a93 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -25,11 +25,9 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
-TEST_GEN_PROGS_x86_64 += clear_dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
 
-TEST_GEN_PROGS_aarch64 += clear_dirty_log_test
 TEST_GEN_PROGS_aarch64 += dirty_log_test
 TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
 
diff --git a/tools/testing/selftests/kvm/clear_dirty_log_test.c b/tools/testing/selftests/kvm/clear_dirty_log_test.c
deleted file mode 100644
index 749336937d37..000000000000
--- a/tools/testing/selftests/kvm/clear_dirty_log_test.c
+++ /dev/null
@@ -1,2 +0,0 @@
-#define USE_CLEAR_DIRTY_LOG
-#include "dirty_log_test.c"
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 3c0ffd34b3b0..a8ae8c0042a8 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -128,6 +128,66 @@ static uint64_t host_dirty_count;
 static uint64_t host_clear_count;
 static uint64_t host_track_next_count;
 
+enum log_mode_t {
+	/* Only use KVM_GET_DIRTY_LOG for logging */
+	LOG_MODE_DIRTY_LOG = 0,
+
+	/* Use both KVM_[GET|CLEAR]_DIRTY_LOG for logging */
+	LOG_MODE_CLEAR_LOG = 1,
+
+	LOG_MODE_NUM,
+};
+
+/* Mode of logging.  Default is LOG_MODE_DIRTY_LOG */
+static enum log_mode_t host_log_mode;
+
+static void clear_log_create_vm_done(struct kvm_vm *vm)
+{
+	struct kvm_enable_cap cap = {};
+
+	if (!kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2)) {
+		fprintf(stderr, "KVM_CLEAR_DIRTY_LOG not available, skipping tests\n");
+		exit(KSFT_SKIP);
+	}
+
+	cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2;
+	cap.args[0] = 1;
+	vm_enable_cap(vm, &cap);
+}
+
+static void dirty_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					  void *bitmap, uint32_t num_pages)
+{
+	kvm_vm_get_dirty_log(vm, slot, bitmap);
+}
+
+static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					  void *bitmap, uint32_t num_pages)
+{
+	kvm_vm_get_dirty_log(vm, slot, bitmap);
+	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
+}
+
+struct log_mode {
+	const char *name;
+	/* Hook when the vm creation is done (before vcpu creation) */
+	void (*create_vm_done)(struct kvm_vm *vm);
+	/* Hook to collect the dirty pages into the bitmap provided */
+	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
+				     void *bitmap, uint32_t num_pages);
+} log_modes[LOG_MODE_NUM] = {
+	{
+		.name = "dirty-log",
+		.create_vm_done = NULL,
+		.collect_dirty_pages = dirty_log_collect_dirty_pages,
+	},
+	{
+		.name = "clear-log",
+		.create_vm_done = clear_log_create_vm_done,
+		.collect_dirty_pages = clear_log_collect_dirty_pages,
+	},
+};
+
 /*
  * We use this bitmap to track some pages that should have its dirty
  * bit set in the _next_ iteration.  For example, if we detected the
@@ -137,6 +197,33 @@ static uint64_t host_track_next_count;
  */
 static unsigned long *host_bmap_track;
 
+static void log_modes_dump(void)
+{
+	int i;
+
+	for (i = 0; i < LOG_MODE_NUM; i++)
+		printf("%s, ", log_modes[i].name);
+	puts("\b\b  \b\b");
+}
+
+static void log_mode_create_vm_done(struct kvm_vm *vm)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->create_vm_done)
+		mode->create_vm_done(vm);
+}
+
+static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					 void *bitmap, uint32_t num_pages)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	TEST_ASSERT(mode->collect_dirty_pages != NULL,
+		    "collect_dirty_pages() is required for any log mode!");
+	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
+}
+
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -257,6 +344,7 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, uint32_t vcpuid,
 #ifdef __x86_64__
 	vm_create_irqchip(vm);
 #endif
+	log_mode_create_vm_done(vm);
 	vm_vcpu_add_default(vm, vcpuid, guest_code);
 	return vm;
 }
@@ -316,14 +404,6 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 	bmap = bitmap_alloc(host_num_pages);
 	host_bmap_track = bitmap_alloc(host_num_pages);
 
-#ifdef USE_CLEAR_DIRTY_LOG
-	struct kvm_enable_cap cap = {};
-
-	cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2;
-	cap.args[0] = 1;
-	vm_enable_cap(vm, &cap);
-#endif
-
 	/* Add an extra memory slot for testing dirty logging */
 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
 				    guest_test_phys_mem,
@@ -364,11 +444,8 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 	while (iteration < iterations) {
 		/* Give the vcpu thread some time to dirty some pages */
 		usleep(interval * 1000);
-		kvm_vm_get_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap);
-#ifdef USE_CLEAR_DIRTY_LOG
-		kvm_vm_clear_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap, 0,
-				       host_num_pages);
-#endif
+		log_mode_collect_dirty_pages(vm, TEST_MEM_SLOT_INDEX,
+					     bmap, host_num_pages);
 		vm_dirty_log_verify(bmap);
 		iteration++;
 		sync_global_to_guest(vm, iteration);
@@ -413,6 +490,9 @@ static void help(char *name)
 	       TEST_HOST_LOOP_INTERVAL);
 	printf(" -p: specify guest physical test memory offset\n"
 	       "     Warning: a low offset can conflict with the loaded test code.\n");
+	printf(" -M: specify the host logging mode "
+	       "(default: log-dirty).  Supported modes: \n\t");
+	log_modes_dump();
 	printf(" -m: specify the guest mode ID to test "
 	       "(default: test all supported modes)\n"
 	       "     This option may be used multiple times.\n"
@@ -437,13 +517,6 @@ int main(int argc, char *argv[])
 	unsigned int host_ipa_limit;
 #endif
 
-#ifdef USE_CLEAR_DIRTY_LOG
-	if (!kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2)) {
-		fprintf(stderr, "KVM_CLEAR_DIRTY_LOG not available, skipping tests\n");
-		exit(KSFT_SKIP);
-	}
-#endif
-
 #ifdef __x86_64__
 	vm_guest_mode_params_init(VM_MODE_PXXV48_4K, true, true);
 #endif
@@ -463,7 +536,7 @@ int main(int argc, char *argv[])
 	vm_guest_mode_params_init(VM_MODE_P40V48_4K, true, true);
 #endif
 
-	while ((opt = getopt(argc, argv, "hi:I:p:m:")) != -1) {
+	while ((opt = getopt(argc, argv, "hi:I:p:m:M:")) != -1) {
 		switch (opt) {
 		case 'i':
 			iterations = strtol(optarg, NULL, 10);
@@ -485,6 +558,22 @@ int main(int argc, char *argv[])
 				    "Guest mode ID %d too big", mode);
 			vm_guest_mode_params[mode].enabled = true;
 			break;
+		case 'M':
+			for (i = 0; i < LOG_MODE_NUM; i++) {
+				if (!strcmp(optarg, log_modes[i].name)) {
+					DEBUG("Setting log mode to: '%s'\n",
+					      optarg);
+					host_log_mode = i;
+					break;
+				}
+			}
+			if (i == LOG_MODE_NUM) {
+				printf("Log mode '%s' is invalid.  "
+				       "Please choose from: ", optarg);
+				log_modes_dump();
+				exit(-1);
+			}
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH RESEND v2 14/17] KVM: selftests: Introduce after_vcpu_run hook for dirty log test
  2019-12-21  1:49 [PATCH RESEND v2 00/17] KVM: Dirty ring interface Peter Xu
                   ` (12 preceding siblings ...)
  2019-12-21  2:04 ` [PATCH RESEND v2 13/17] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
@ 2019-12-21  2:04 ` Peter Xu
  2019-12-21  2:04 ` [PATCH RESEND v2 15/17] KVM: selftests: Add dirty ring buffer test Peter Xu
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2019-12-21  2:04 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

Provide a hook for the checks after vcpu_run() completes.  This is a
preparation for the dirty ring test, where we'll need to take care of
another exit reason.

While at it, drop pages_count, because we already have a better
summary with the statistics now, and clean things up a bit.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 39 ++++++++++++--------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index a8ae8c0042a8..3542311f56ff 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -168,6 +168,15 @@ static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
 }
 
+static void default_after_vcpu_run(struct kvm_vm *vm)
+{
+	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+
+	TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
+		    "Invalid guest sync status: exit_reason=%s\n",
+		    exit_reason_str(run->exit_reason));
+}
+
 struct log_mode {
 	const char *name;
 	/* Hook when the vm creation is done (before vcpu creation) */
@@ -175,16 +184,20 @@ struct log_mode {
 	/* Hook to collect the dirty pages into the bitmap provided */
 	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
 				     void *bitmap, uint32_t num_pages);
+	/* Hook to call after each vcpu run */
+	void (*after_vcpu_run)(struct kvm_vm *vm);
 } log_modes[LOG_MODE_NUM] = {
 	{
 		.name = "dirty-log",
 		.create_vm_done = NULL,
 		.collect_dirty_pages = dirty_log_collect_dirty_pages,
+		.after_vcpu_run = default_after_vcpu_run,
 	},
 	{
 		.name = "clear-log",
 		.create_vm_done = clear_log_create_vm_done,
 		.collect_dirty_pages = clear_log_collect_dirty_pages,
+		.after_vcpu_run = default_after_vcpu_run,
 	},
 };
 
@@ -224,6 +237,14 @@ static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
 }
 
+static void log_mode_after_vcpu_run(struct kvm_vm *vm)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->after_vcpu_run)
+		mode->after_vcpu_run(vm);
+}
+
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -237,31 +258,17 @@ static void *vcpu_worker(void *data)
 	int ret;
 	struct kvm_vm *vm = data;
 	uint64_t *guest_array;
-	uint64_t pages_count = 0;
-	struct kvm_run *run;
-
-	run = vcpu_state(vm, VCPU_ID);
 
 	guest_array = addr_gva2hva(vm, (vm_vaddr_t)random_array);
-	generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
 
 	while (!READ_ONCE(host_quit)) {
+		generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
 		/* Let the guest dirty the random pages */
 		ret = _vcpu_run(vm, VCPU_ID);
 		TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-		if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
-			pages_count += TEST_PAGES_PER_LOOP;
-			generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
-		} else {
-			TEST_ASSERT(false,
-				    "Invalid guest sync status: "
-				    "exit_reason=%s\n",
-				    exit_reason_str(run->exit_reason));
-		}
+		log_mode_after_vcpu_run(vm);
 	}
 
-	DEBUG("Dirtied %"PRIu64" pages\n", pages_count);
-
 	return NULL;
 }
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH RESEND v2 15/17] KVM: selftests: Add dirty ring buffer test
  2019-12-21  1:49 [PATCH RESEND v2 00/17] KVM: Dirty ring interface Peter Xu
                   ` (13 preceding siblings ...)
  2019-12-21  2:04 ` [PATCH RESEND v2 14/17] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
@ 2019-12-21  2:04 ` Peter Xu
  2019-12-24  6:18   ` Jason Wang
  2019-12-24  6:50   ` Jason Wang
  2019-12-21  2:04 ` [PATCH RESEND v2 16/17] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
                   ` (2 subsequent siblings)
  17 siblings, 2 replies; 45+ messages in thread
From: Peter Xu @ 2019-12-21  2:04 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

Add the initial dirty ring buffer test.

The current test implements userspace dirty ring collection by
reaping the dirty ring only when the ring is full.

So it still runs synchronously, like this:

            vcpu                             main thread

  1. vcpu dirties pages
  2. vcpu gets dirty ring full
     (userspace exit)

                                       3. main thread waits until full
                                          (so hardware buffers flushed)
                                       4. main thread collects
                                       5. main thread continues vcpu

  6. vcpu continues, goes back to 1

We can't directly collect dirty bits during vcpu execution, because
otherwise we can't guarantee that the hardware dirty bits have been
flushed when we collect them; and since we are very strict about the
dirty bits, the verify procedure could fail later on.  A follow-up
patch will make this test support async collection, just like the
existing dirty log test, by adding a vcpu kick mechanism.
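
A condensed sketch of that handshake (simplified from the test code
below; the step numbers refer to the diagram above):

	/* vcpu thread, on a KVM_EXIT_DIRTY_RING_FULL userspace exit */
	sem_post(&dirty_ring_vcpu_stop);	/* step 2: ring is full */
	sem_wait(&dirty_ring_vcpu_cont);	/* block until step 5 */

	/* main thread, one collection round */
	sem_wait(&dirty_ring_vcpu_stop);	/* step 3: vcpu parked */
	/* step 4: fetch entries, update fetch_index, then... */
	ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);
	sem_post(&dirty_ring_vcpu_cont);	/* step 5: continue vcpu */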

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c  | 174 +++++++++++++++++-
 .../testing/selftests/kvm/include/kvm_util.h  |   3 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  56 ++++++
 .../selftests/kvm/lib/kvm_util_internal.h     |   3 +
 4 files changed, 234 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 3542311f56ff..af9b1a16c7d1 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -12,8 +12,10 @@
 #include <unistd.h>
 #include <time.h>
 #include <pthread.h>
+#include <semaphore.h>
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
+#include <asm/barrier.h>
 
 #include "test_util.h"
 #include "kvm_util.h"
@@ -57,6 +59,8 @@
 # define test_and_clear_bit_le	test_and_clear_bit
 #endif
 
+#define TEST_DIRTY_RING_COUNT		1024
+
 /*
  * Guest/Host shared variables. Ensure addr_gva2hva() and/or
  * sync_global_to/from_guest() are used when accessing from
@@ -128,6 +132,10 @@ static uint64_t host_dirty_count;
 static uint64_t host_clear_count;
 static uint64_t host_track_next_count;
 
+/* Semaphores to coordinate the vcpu and the dirty ring collector */
+static sem_t dirty_ring_vcpu_stop;
+static sem_t dirty_ring_vcpu_cont;
+
 enum log_mode_t {
 	/* Only use KVM_GET_DIRTY_LOG for logging */
 	LOG_MODE_DIRTY_LOG = 0,
@@ -135,6 +143,9 @@ enum log_mode_t {
 	/* Use both KVM_[GET|CLEAR]_DIRTY_LOG for logging */
 	LOG_MODE_CLEAR_LOG = 1,
 
+	/* Use dirty ring for logging */
+	LOG_MODE_DIRTY_RING = 2,
+
 	LOG_MODE_NUM,
 };
 
@@ -177,6 +188,118 @@ static void default_after_vcpu_run(struct kvm_vm *vm)
 		    exit_reason_str(run->exit_reason));
 }
 
+static void dirty_ring_create_vm_done(struct kvm_vm *vm)
+{
+	/*
+	 * Switch to dirty ring mode after VM creation but before any
+	 * of the vcpu creation.
+	 */
+	vm_enable_dirty_ring(vm, TEST_DIRTY_RING_COUNT *
+			     sizeof(struct kvm_dirty_gfn));
+}
+
+static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
+				       struct kvm_dirty_ring_indices *indices,
+				       int slot, void *bitmap,
+				       uint32_t num_pages, int index)
+{
+	struct kvm_dirty_gfn *cur;
+	uint32_t avail, fetch, count = 0;
+
+	/*
+	 * Strictly speaking we should keep a private copy of
+	 * fetch_index; to keep it simple, we just re-read it here.
+	 */
+	fetch = READ_ONCE(indices->fetch_index);
+	avail = READ_ONCE(indices->avail_index);
+
+	/* Make sure the entries are only read after the indices */
+	rmb();
+
+	DEBUG("ring %d: fetch: 0x%x, avail: 0x%x\n", index, fetch, avail);
+
+	while (fetch != avail) {
+		cur = &dirty_gfns[fetch % TEST_DIRTY_RING_COUNT];
+		TEST_ASSERT(cur->pad == 0, "Padding is non-zero: 0x%x", cur->pad);
+		TEST_ASSERT(cur->slot == slot, "Slot number didn't match: "
+			    "%u != %u", cur->slot, slot);
+		TEST_ASSERT(cur->offset < num_pages, "Offset overflow: "
+			    "0x%llx >= 0x%llx", cur->offset, num_pages);
+		DEBUG("fetch 0x%x offset 0x%llx\n", fetch, cur->offset);
+		test_and_set_bit(cur->offset, bitmap);
+		fetch++;
+		count++;
+	}
+	WRITE_ONCE(indices->fetch_index, fetch);
+
+	return count;
+}
+
+static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					   void *bitmap, uint32_t num_pages)
+{
+	/* We only have one vcpu */
+	struct kvm_run *state = vcpu_state(vm, VCPU_ID);
+	uint32_t count = 0, cleared;
+
+	/*
+	 * Before fetching the dirty pages, we need a vmexit of the
+	 * worker vcpu to make sure the hardware dirty buffers were
+	 * flushed.  This is not needed for dirty-log/clear-log tests
+	 * because KVM_GET_DIRTY_LOG will naturally do so.
+	 *
+	 * For now we do it the simple way - we simply wait until the
+	 * vcpu uses up the soft dirty ring, at which point it always
+	 * does a vmexit that makes sure the PML buffers are flushed.
+	 * In real hypervisors, we'd probably need a vcpu kick, or to
+	 * stop the vcpus (before the final sync), to make sure we get
+	 * all the dirty PFNs even those still cached in hardware.
+	 */
+	sem_wait(&dirty_ring_vcpu_stop);
+
+	/* Only have one vcpu */
+	count = dirty_ring_collect_one(vcpu_map_dirty_ring(vm, VCPU_ID),
+				       &state->vcpu_ring_indices,
+				       slot, bitmap, num_pages, VCPU_ID);
+
+	cleared = kvm_vm_reset_dirty_ring(vm);
+
+	/* Cleared pages should be the same as collected */
+	TEST_ASSERT(cleared == count, "Reset dirty pages (%u) mismatch "
+		    "with collected (%u)", cleared, count);
+
+	DEBUG("Notifying vcpu to continue\n");
+	sem_post(&dirty_ring_vcpu_cont);
+
+	DEBUG("Iteration %ld collected %u pages\n", iteration, count);
+}
+
+static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
+{
+	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+
+	/* A ucall-sync or ring-full event is allowed */
+	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
+		/* We should allow this to continue */
+		;
+	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
+		sem_post(&dirty_ring_vcpu_stop);
+		DEBUG("vcpu stops because dirty ring full...\n");
+		sem_wait(&dirty_ring_vcpu_cont);
+		DEBUG("vcpu continues now.\n");
+	} else {
+		TEST_ASSERT(false, "Invalid guest sync status: "
+			    "exit_reason=%s\n",
+			    exit_reason_str(run->exit_reason));
+	}
+}
+
+static void dirty_ring_before_vcpu_join(void)
+{
+	/* Post once more so a vcpu blocked at the ring-full wait can quit */
+	sem_post(&dirty_ring_vcpu_cont);
+}
+
 struct log_mode {
 	const char *name;
 	/* Hook when the vm creation is done (before vcpu creation) */
@@ -186,6 +309,7 @@ struct log_mode {
 				     void *bitmap, uint32_t num_pages);
 	/* Hook to call after each vcpu run */
 	void (*after_vcpu_run)(struct kvm_vm *vm);
+	void (*before_vcpu_join)(void);
 } log_modes[LOG_MODE_NUM] = {
 	{
 		.name = "dirty-log",
@@ -199,6 +323,13 @@ struct log_mode {
 		.collect_dirty_pages = clear_log_collect_dirty_pages,
 		.after_vcpu_run = default_after_vcpu_run,
 	},
+	{
+		.name = "dirty-ring",
+		.create_vm_done = dirty_ring_create_vm_done,
+		.collect_dirty_pages = dirty_ring_collect_dirty_pages,
+		.before_vcpu_join = dirty_ring_before_vcpu_join,
+		.after_vcpu_run = dirty_ring_after_vcpu_run,
+	},
 };
 
 /*
@@ -245,6 +376,14 @@ static void log_mode_after_vcpu_run(struct kvm_vm *vm)
 		mode->after_vcpu_run(vm);
 }
 
+static void log_mode_before_vcpu_join(void)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->before_vcpu_join)
+		mode->before_vcpu_join();
+}
+
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -292,14 +431,41 @@ static void vm_dirty_log_verify(unsigned long *bmap)
 		}
 
 		if (test_and_clear_bit_le(page, bmap)) {
+			bool matched;
+
 			host_dirty_count++;
+
 			/*
 			 * If the bit is set, the value written onto
 			 * the corresponding page should be either the
 			 * previous iteration number or the current one.
+			 *
+			 * The (*value_ptr == iteration - 2) case is
+			 * special to the dirty ring test, where the
+			 * page is the last one dirtied before a kvm
+			 * dirty ring full userspace exit in the 2nd
+			 * iteration; without this we would probably
+			 * fail on the 4th iteration.  Anyway, let's
+			 * just loosen the check a little bit, for
+			 * all modes, for simplicity.
 			 */
-			TEST_ASSERT(*value_ptr == iteration ||
-				    *value_ptr == iteration - 1,
+			matched = (*value_ptr == iteration ||
+				   *value_ptr == iteration - 1 ||
+				   *value_ptr == iteration - 2);
+
+			/*
+			 * This is the common path for dirty ring
+			 * where this page is exactly the last page
+			 * touched before KVM_EXIT_DIRTY_RING_FULL.
+			 * If it happens, we should expect it to be
+			 * there for the next round.
+			 */
+			if (host_log_mode == LOG_MODE_DIRTY_RING && !matched) {
+				set_bit_le(page, host_bmap_track);
+				continue;
+			}
+
+			TEST_ASSERT(matched,
 				    "Set page %"PRIu64" value %"PRIu64
 				    " incorrect (iteration=%"PRIu64")",
 				    page, *value_ptr, iteration);
@@ -460,6 +626,7 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 
 	/* Tell the vcpu thread to quit */
 	host_quit = true;
+	log_mode_before_vcpu_join();
 	pthread_join(vcpu_thread, NULL);
 
 	DEBUG("Total bits checked: dirty (%"PRIu64"), clear (%"PRIu64"), "
@@ -524,6 +691,9 @@ int main(int argc, char *argv[])
 	unsigned int host_ipa_limit;
 #endif
 
+	sem_init(&dirty_ring_vcpu_stop, 0, 0);
+	sem_init(&dirty_ring_vcpu_cont, 0, 0);
+
 #ifdef __x86_64__
 	vm_guest_mode_params_init(VM_MODE_PXXV48_4K, true, true);
 #endif
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 29cccaf96baf..4b78a8d3e773 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -67,6 +67,7 @@ enum vm_mem_backing_src_type {
 
 int kvm_check_cap(long cap);
 int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap);
+void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
 
 struct kvm_vm *vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm);
 struct kvm_vm *_vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm);
@@ -76,6 +77,7 @@ void kvm_vm_release(struct kvm_vm *vmp);
 void kvm_vm_get_dirty_log(struct kvm_vm *vm, int slot, void *log);
 void kvm_vm_clear_dirty_log(struct kvm_vm *vm, int slot, void *log,
 			    uint64_t first_page, uint32_t num_pages);
+uint32_t kvm_vm_reset_dirty_ring(struct kvm_vm *vm);
 
 int kvm_memcmp_hva_gva(void *hva, struct kvm_vm *vm, const vm_vaddr_t gva,
 		       size_t len);
@@ -137,6 +139,7 @@ void vcpu_nested_state_get(struct kvm_vm *vm, uint32_t vcpuid,
 int vcpu_nested_state_set(struct kvm_vm *vm, uint32_t vcpuid,
 			  struct kvm_nested_state *state, bool ignore_error);
 #endif
+void *vcpu_map_dirty_ring(struct kvm_vm *vm, uint32_t vcpuid);
 
 const char *exit_reason_str(unsigned int exit_reason);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 41cf45416060..a119717bc84c 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -85,6 +85,26 @@ int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap)
 	return ret;
 }
 
+void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size)
+{
+	struct kvm_enable_cap cap = {};
+	int ret;
+
+	ret = kvm_check_cap(KVM_CAP_DIRTY_LOG_RING);
+
+	TEST_ASSERT(ret >= 0, "KVM_CAP_DIRTY_LOG_RING");
+
+	if (ret == 0) {
+		fprintf(stderr, "KVM does not support dirty ring, skipping tests\n");
+		exit(KSFT_SKIP);
+	}
+
+	cap.cap = KVM_CAP_DIRTY_LOG_RING;
+	cap.args[0] = ring_size;
+	vm_enable_cap(vm, &cap);
+	vm->dirty_ring_size = ring_size;
+}
+
 static void vm_open(struct kvm_vm *vm, int perm)
 {
 	vm->kvm_fd = open(KVM_DEV_PATH, perm);
@@ -297,6 +317,11 @@ void kvm_vm_clear_dirty_log(struct kvm_vm *vm, int slot, void *log,
 		    strerror(-ret));
 }
 
+uint32_t kvm_vm_reset_dirty_ring(struct kvm_vm *vm)
+{
+	return ioctl(vm->fd, KVM_RESET_DIRTY_RINGS);
+}
+
 /*
  * Userspace Memory Region Find
  *
@@ -408,6 +433,13 @@ static void vm_vcpu_rm(struct kvm_vm *vm, uint32_t vcpuid)
 	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
 	int ret;
 
+	if (vcpu->dirty_gfns) {
+		ret = munmap(vcpu->dirty_gfns, vm->dirty_ring_size);
+		TEST_ASSERT(ret == 0, "munmap of VCPU dirty ring failed, "
+			    "rc: %i errno: %i", ret, errno);
+		vcpu->dirty_gfns = NULL;
+	}
+
 	ret = munmap(vcpu->state, sizeof(*vcpu->state));
 	TEST_ASSERT(ret == 0, "munmap of VCPU fd failed, rc: %i "
 		"errno: %i", ret, errno);
@@ -1409,6 +1441,29 @@ int _vcpu_ioctl(struct kvm_vm *vm, uint32_t vcpuid,
 	return ret;
 }
 
+void *vcpu_map_dirty_ring(struct kvm_vm *vm, uint32_t vcpuid)
+{
+	struct vcpu *vcpu;
+	uint32_t size = vm->dirty_ring_size;
+
+	TEST_ASSERT(size > 0, "Should enable dirty ring first");
+
+	vcpu = vcpu_find(vm, vcpuid);
+
+	TEST_ASSERT(vcpu, "Cannot find vcpu %u", vcpuid);
+
+	if (!vcpu->dirty_gfns) {
+		vcpu->dirty_gfns_count = size / sizeof(struct kvm_dirty_gfn);
+		vcpu->dirty_gfns = mmap(NULL, size, PROT_READ | PROT_WRITE,
+					MAP_SHARED, vcpu->fd, vm->page_size *
+					KVM_DIRTY_LOG_PAGE_OFFSET);
+		TEST_ASSERT(vcpu->dirty_gfns != MAP_FAILED,
+			    "Dirty ring map failed");
+	}
+
+	return vcpu->dirty_gfns;
+}
+
 /*
  * VM Ioctl
  *
@@ -1503,6 +1558,7 @@ static struct exit_reason {
 	{KVM_EXIT_INTERNAL_ERROR, "INTERNAL_ERROR"},
 	{KVM_EXIT_OSI, "OSI"},
 	{KVM_EXIT_PAPR_HCALL, "PAPR_HCALL"},
+	{KVM_EXIT_DIRTY_RING_FULL, "DIRTY_RING_FULL"},
 #ifdef KVM_EXIT_MEMORY_NOT_PRESENT
 	{KVM_EXIT_MEMORY_NOT_PRESENT, "MEMORY_NOT_PRESENT"},
 #endif
diff --git a/tools/testing/selftests/kvm/lib/kvm_util_internal.h b/tools/testing/selftests/kvm/lib/kvm_util_internal.h
index ac50c42750cf..87edcc6746a2 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util_internal.h
+++ b/tools/testing/selftests/kvm/lib/kvm_util_internal.h
@@ -39,6 +39,8 @@ struct vcpu {
 	uint32_t id;
 	int fd;
 	struct kvm_run *state;
+	struct kvm_dirty_gfn *dirty_gfns;
+	uint32_t dirty_gfns_count;
 };
 
 struct kvm_vm {
@@ -61,6 +63,7 @@ struct kvm_vm {
 	vm_paddr_t pgd;
 	vm_vaddr_t gdt;
 	vm_vaddr_t tss;
+	uint32_t dirty_ring_size;
 };
 
 struct vcpu *vcpu_find(struct kvm_vm *vm, uint32_t vcpuid);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH RESEND v2 16/17] KVM: selftests: Let dirty_log_test async for dirty ring test
  2019-12-21  1:49 [PATCH RESEND v2 00/17] KVM: Dirty ring interface Peter Xu
                   ` (14 preceding siblings ...)
  2019-12-21  2:04 ` [PATCH RESEND v2 15/17] KVM: selftests: Add dirty ring buffer test Peter Xu
@ 2019-12-21  2:04 ` Peter Xu
  2019-12-21  2:04 ` [PATCH RESEND v2 17/17] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
  2019-12-24  6:34 ` [PATCH RESEND v2 00/17] KVM: Dirty ring interface Jason Wang
  17 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2019-12-21  2:04 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

Previously the dirty ring test was working in synchronous way, because
only with a vmexit (with that it was the ring full event) we'll know
the hardware dirty bits will be flushed to the dirty ring.

With this patch we first introduce a vcpu kick mechanism using
SIGUSR1, which guarantees a vmexit and, with it, a flush of the
hardware dirty bits.  With that in place, the vcpu dirtying work can
run asynchronously from the whole collection procedure.  Still, we
need to be careful: we can only stay asynchronous while the vcpu has
not hit the soft limit (no KVM_EXIT_DIRTY_RING_FULL); otherwise we
must collect the dirty bits before letting the vcpu continue.

Also increase the dirty ring size to the current maximum, so that we
torture the no-ring-full case harder; that should be the major
scenario when hypervisors like QEMU use this feature.
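
As a rough sketch (not part of the patch itself; the names are taken
from the diff below), the kick/collect handshake looks like:

    /* vcpu thread: a no-op handler just makes KVM_RUN return -EINTR */
    static void vcpu_sig_handler(int sig) { }

    /* collector thread */
    pthread_kill(vcpu_thread, SIG_IPI);     /* force a vmexit (PML flush) */
    sem_wait_until(&dirty_ring_vcpu_stop);  /* vcpu has parked itself */
    /* ... harvest and reset the dirty ring ... */
    sem_post(&dirty_ring_vcpu_cont);        /* let the vcpu run again */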

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c  | 123 +++++++++++++-----
 .../testing/selftests/kvm/include/kvm_util.h  |   1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   8 ++
 3 files changed, 103 insertions(+), 29 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index af9b1a16c7d1..4403c6770276 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -13,6 +13,9 @@
 #include <time.h>
 #include <pthread.h>
 #include <semaphore.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <errno.h>
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
 #include <asm/barrier.h>
@@ -59,7 +62,9 @@
 # define test_and_clear_bit_le	test_and_clear_bit
 #endif
 
-#define TEST_DIRTY_RING_COUNT		1024
+#define TEST_DIRTY_RING_COUNT		65536
+
+#define SIG_IPI SIGUSR1
 
 /*
  * Guest/Host shared variables. Ensure addr_gva2hva() and/or
@@ -135,6 +140,12 @@ static uint64_t host_track_next_count;
 /* Whether dirty ring reset is requested, or finished */
 static sem_t dirty_ring_vcpu_stop;
 static sem_t dirty_ring_vcpu_cont;
+/*
+ * This is updated by the vcpu thread to tell the host whether it's a
+ * ring-full event.  It should only be read after a sem_wait() on
+ * dirty_ring_vcpu_stop returns, and before the vcpu continues to run.
+ */
+static bool dirty_ring_vcpu_ring_full;
 
 enum log_mode_t {
 	/* Only use KVM_GET_DIRTY_LOG for logging */
@@ -151,6 +162,33 @@ enum log_mode_t {
 
 /* Mode of logging.  Default is LOG_MODE_DIRTY_LOG */
 static enum log_mode_t host_log_mode;
+pthread_t vcpu_thread;
+
+/* Only way to pass this to the signal handler */
+struct kvm_vm *current_vm;
+
+static void vcpu_sig_handler(int sig)
+{
+	TEST_ASSERT(sig == SIG_IPI, "unknown signal: %d", sig);
+}
+
+static void vcpu_kick(void)
+{
+	pthread_kill(vcpu_thread, SIG_IPI);
+}
+
+/*
+ * In our test we do signal tricks; use a wrapper around sem_wait()
+ * that retries when interrupted by a signal
+ */
+static void sem_wait_until(sem_t *sem)
+{
+	int ret;
+
+	do
+		ret = sem_wait(sem);
+	while (ret == -1 && errno == EINTR);
+}
 
 static void clear_log_create_vm_done(struct kvm_vm *vm)
 {
@@ -179,10 +217,13 @@ static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
 }
 
-static void default_after_vcpu_run(struct kvm_vm *vm)
+static void default_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
 
+	TEST_ASSERT(ret == 0 || (ret == -1 && err == EINTR),
+		    "vcpu run failed: errno=%d", err);
+
 	TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
 		    "Invalid guest sync status: exit_reason=%s\n",
 		    exit_reason_str(run->exit_reason));
@@ -235,27 +276,37 @@ static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
 	return count;
 }
 
+static void dirty_ring_wait_vcpu(void)
+{
+	/* This makes sure that the hardware PML cache is flushed */
+	vcpu_kick();
+	sem_wait_until(&dirty_ring_vcpu_stop);
+}
+
+static void dirty_ring_continue_vcpu(void)
+{
+	DEBUG("Notifying vcpu to continue\n");
+	sem_post(&dirty_ring_vcpu_cont);
+}
+
 static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
 					   void *bitmap, uint32_t num_pages)
 {
 	/* We only have one vcpu */
 	struct kvm_run *state = vcpu_state(vm, VCPU_ID);
 	uint32_t count = 0, cleared;
+	bool continued_vcpu = false;
 
-	/*
-	 * Before fetching the dirty pages, we need a vmexit of the
-	 * worker vcpu to make sure the hardware dirty buffers were
-	 * flushed.  This is not needed for dirty-log/clear-log tests
-	 * because get dirty log will natually do so.
-	 *
-	 * For now we do it in the simple way - we simply wait until
-	 * the vcpu uses up the soft dirty ring, then it'll always
-	 * do a vmexit to make sure that PML buffers will be flushed.
-	 * In real hypervisors, we probably need a vcpu kick or to
-	 * stop the vcpus (before the final sync) to make sure we'll
-	 * get all the existing dirty PFNs even cached in hardware.
-	 */
-	sem_wait(&dirty_ring_vcpu_stop);
+	dirty_ring_wait_vcpu();
+
+	if (!dirty_ring_vcpu_ring_full) {
+		/*
+		 * This is not a ring-full event, it's safe to allow
+		 * vcpu to continue
+		 */
+		dirty_ring_continue_vcpu();
+		continued_vcpu = true;
+	}
 
 	/* Only have one vcpu */
 	count = dirty_ring_collect_one(vcpu_map_dirty_ring(vm, VCPU_ID),
@@ -268,13 +319,16 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	TEST_ASSERT(cleared == count, "Reset dirty pages (%u) mismatch "
 		    "with collected (%u)", cleared, count);
 
-	DEBUG("Notifying vcpu to continue\n");
-	sem_post(&dirty_ring_vcpu_cont);
+	if (!continued_vcpu) {
+		TEST_ASSERT(dirty_ring_vcpu_ring_full,
+			    "Didn't continue vcpu even without ring full");
+		dirty_ring_continue_vcpu();
+	}
 
 	DEBUG("Iteration %ld collected %u pages\n", iteration, count);
 }
 
-static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
+static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
 
@@ -282,10 +336,16 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
 	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
 		/* We should allow this to continue */
 		;
-	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
+	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL ||
+		   (ret == -1 && err == EINTR)) {
+		/* Update the flag first before pause */
+		WRITE_ONCE(dirty_ring_vcpu_ring_full,
+			   run->exit_reason == KVM_EXIT_DIRTY_RING_FULL);
 		sem_post(&dirty_ring_vcpu_stop);
-		DEBUG("vcpu stops because dirty ring full...\n");
-		sem_wait(&dirty_ring_vcpu_cont);
+		DEBUG("vcpu stops because %s...\n",
+		      dirty_ring_vcpu_ring_full ?
+		      "dirty ring is full" : "vcpu is kicked out");
+		sem_wait_until(&dirty_ring_vcpu_cont);
 		DEBUG("vcpu continues now.\n");
 	} else {
 		TEST_ASSERT(false, "Invalid guest sync status: "
@@ -308,7 +368,7 @@ struct log_mode {
 	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
 				     void *bitmap, uint32_t num_pages);
 	/* Hook to call when after each vcpu run */
-	void (*after_vcpu_run)(struct kvm_vm *vm);
+	void (*after_vcpu_run)(struct kvm_vm *vm, int ret, int err);
 	void (*before_vcpu_join) (void);
 } log_modes[LOG_MODE_NUM] = {
 	{
@@ -368,12 +428,12 @@ static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
 }
 
-static void log_mode_after_vcpu_run(struct kvm_vm *vm)
+static void log_mode_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct log_mode *mode = &log_modes[host_log_mode];
 
 	if (mode->after_vcpu_run)
-		mode->after_vcpu_run(vm);
+		mode->after_vcpu_run(vm, ret, err);
 }
 
 static void log_mode_before_vcpu_join(void)
@@ -397,15 +457,21 @@ static void *vcpu_worker(void *data)
 	int ret;
 	struct kvm_vm *vm = data;
 	uint64_t *guest_array;
+	struct sigaction sigact;
+
+	current_vm = vm;
+	memset(&sigact, 0, sizeof(sigact));
+	sigact.sa_handler = vcpu_sig_handler;
+	sigaction(SIG_IPI, &sigact, NULL);
 
 	guest_array = addr_gva2hva(vm, (vm_vaddr_t)random_array);
 
 	while (!READ_ONCE(host_quit)) {
+		/* Clear any existing kick signals */
 		generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
 		/* Let the guest dirty the random pages */
-		ret = _vcpu_run(vm, VCPU_ID);
-		TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-		log_mode_after_vcpu_run(vm);
+		ret = __vcpu_run(vm, VCPU_ID);
+		log_mode_after_vcpu_run(vm, ret, errno);
 	}
 
 	return NULL;
@@ -528,7 +594,6 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, uint32_t vcpuid,
 static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 		     unsigned long interval, uint64_t phys_offset)
 {
-	pthread_t vcpu_thread;
 	struct kvm_vm *vm;
 	unsigned long *bmap;
 
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 4b78a8d3e773..e64fbfe6bbd5 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -115,6 +115,7 @@ vm_paddr_t addr_gva2gpa(struct kvm_vm *vm, vm_vaddr_t gva);
 struct kvm_run *vcpu_state(struct kvm_vm *vm, uint32_t vcpuid);
 void vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
 int _vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
+int __vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
 void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid);
 void vcpu_set_mp_state(struct kvm_vm *vm, uint32_t vcpuid,
 		       struct kvm_mp_state *mp_state);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index a119717bc84c..32c1aca55652 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1187,6 +1187,14 @@ int _vcpu_run(struct kvm_vm *vm, uint32_t vcpuid)
 	return rc;
 }
 
+int __vcpu_run(struct kvm_vm *vm, uint32_t vcpuid)
+{
+	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
+
+	TEST_ASSERT(vcpu != NULL, "vcpu not found, vcpuid: %u", vcpuid);
+	return ioctl(vcpu->fd, KVM_RUN, NULL);
+}
+
 void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid)
 {
 	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
-- 
2.24.1



* [PATCH RESEND v2 17/17] KVM: selftests: Add "-c" parameter to dirty log test
  2019-12-21  1:49 [PATCH RESEND v2 00/17] KVM: Dirty ring interface Peter Xu
                   ` (15 preceding siblings ...)
  2019-12-21  2:04 ` [PATCH RESEND v2 16/17] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
@ 2019-12-21  2:04 ` Peter Xu
  2019-12-24  6:34 ` [PATCH RESEND v2 00/17] KVM: Dirty ring interface Jason Wang
  17 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2019-12-21  2:04 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr David Alan Gilbert, Christophe de Dinechin, peterx,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

It's only used to override the default dirty ring size/count.  With a
bigger ring count we test the async path of the dirty ring; with a
smaller ring count we test the ring-full code path.  Async is the default.

It has no use for non-dirty-ring tests.
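
For example (hypothetical invocations; the dirty-ring log mode still
needs to be selected through the test's existing mode switches):

    ./dirty_log_test -c 1024     # small ring: stress the ring-full path
    ./dirty_log_test -c 65536    # maximum ring (the default): stress async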

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 4403c6770276..fde3fa751818 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -163,6 +163,7 @@ enum log_mode_t {
 /* Mode of logging.  Default is LOG_MODE_DIRTY_LOG */
 static enum log_mode_t host_log_mode;
 pthread_t vcpu_thread;
+static uint32_t test_dirty_ring_count = TEST_DIRTY_RING_COUNT;
 
 /* Only way to pass this to the signal handler */
 struct kvm_vm *current_vm;
@@ -235,7 +236,7 @@ static void dirty_ring_create_vm_done(struct kvm_vm *vm)
 	 * Switch to dirty ring mode after VM creation but before any
 	 * of the vcpu creation.
 	 */
-	vm_enable_dirty_ring(vm, TEST_DIRTY_RING_COUNT *
+	vm_enable_dirty_ring(vm, test_dirty_ring_count *
 			     sizeof(struct kvm_dirty_gfn));
 }
 
@@ -260,7 +261,7 @@ static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
 	DEBUG("ring %d: fetch: 0x%x, avail: 0x%x\n", index, fetch, avail);
 
 	while (fetch != avail) {
-		cur = &dirty_gfns[fetch % TEST_DIRTY_RING_COUNT];
+		cur = &dirty_gfns[fetch % test_dirty_ring_count];
 		TEST_ASSERT(cur->pad == 0, "Padding is non-zero: 0x%x", cur->pad);
 		TEST_ASSERT(cur->slot == slot, "Slot number didn't match: "
 			    "%u != %u", cur->slot, slot);
@@ -723,6 +724,9 @@ static void help(char *name)
 	printf("usage: %s [-h] [-i iterations] [-I interval] "
 	       "[-p offset] [-m mode]\n", name);
 	puts("");
+	printf(" -c: specify dirty ring size, in number of entries\n");
+	printf("     (only useful for dirty-ring test; default: %"PRIu32")\n",
+	       TEST_DIRTY_RING_COUNT);
 	printf(" -i: specify iteration counts (default: %"PRIu64")\n",
 	       TEST_HOST_LOOP_N);
 	printf(" -I: specify interval in ms (default: %"PRIu64" ms)\n",
@@ -778,8 +782,11 @@ int main(int argc, char *argv[])
 	vm_guest_mode_params_init(VM_MODE_P40V48_4K, true, true);
 #endif
 
-	while ((opt = getopt(argc, argv, "hi:I:p:m:M:")) != -1) {
+	while ((opt = getopt(argc, argv, "c:hi:I:p:m:M:")) != -1) {
 		switch (opt) {
+		case 'c':
+			test_dirty_ring_count = strtol(optarg, NULL, 10);
+			break;
 		case 'i':
 			iterations = strtol(optarg, NULL, 10);
 			break;
-- 
2.24.1



* Re: [PATCH RESEND v2 03/17] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2019-12-21  1:49 ` [PATCH RESEND v2 03/17] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] Peter Xu
@ 2019-12-21 13:51   ` Paolo Bonzini
  2019-12-23 17:27     ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Paolo Bonzini @ 2019-12-21 13:51 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Dr . David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Michael S . Tsirkin, Jason Wang,
	Vitaly Kuznetsov

On 21/12/19 02:49, Peter Xu wrote:
> Originally, we have three code paths that can dirty a page without
> vcpu context for X86:
> 
>   - init_rmode_identity_map
>   - init_rmode_tss
>   - kvmgt_rw_gpa
> 
> init_rmode_identity_map and init_rmode_tss will be setup on
> destination VM no matter what (and the guest cannot even see them), so
> it does not make sense to track them at all.
> 
> To do this, a new parameter is added to kvm_[write|clear]_guest_page()
> to show whether we would like to track dirty bits for the operations.
> With that, pass in "false" to this new parameter for any guest memory
> write of the ioctls (KVM_SET_TSS_ADDR, KVM_SET_IDENTITY_MAP_ADDR).

We can also return the hva from x86_set_memory_region and
__x86_set_memory_region.

Paolo

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/kvm/vmx/vmx.c   | 18 ++++++++++--------
>  include/linux/kvm_host.h |  5 +++--
>  virt/kvm/kvm_main.c      | 25 ++++++++++++++++---------
>  3 files changed, 29 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 04a8212704c1..1ff5a428f489 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3452,24 +3452,24 @@ static int init_rmode_tss(struct kvm *kvm)
>  
>  	idx = srcu_read_lock(&kvm->srcu);
>  	fn = to_kvm_vmx(kvm)->tss_addr >> PAGE_SHIFT;
> -	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
> +	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE, false);
>  	if (r < 0)
>  		goto out;
>  	data = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE;
>  	r = kvm_write_guest_page(kvm, fn++, &data,
> -			TSS_IOPB_BASE_OFFSET, sizeof(u16));
> +				 TSS_IOPB_BASE_OFFSET, sizeof(u16), false);
>  	if (r < 0)
>  		goto out;
> -	r = kvm_clear_guest_page(kvm, fn++, 0, PAGE_SIZE);
> +	r = kvm_clear_guest_page(kvm, fn++, 0, PAGE_SIZE, false);
>  	if (r < 0)
>  		goto out;
> -	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
> +	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE, false);
>  	if (r < 0)
>  		goto out;
>  	data = ~0;
>  	r = kvm_write_guest_page(kvm, fn, &data,
>  				 RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1,
> -				 sizeof(u8));
> +				 sizeof(u8), false);
>  out:
>  	srcu_read_unlock(&kvm->srcu, idx);
>  	return r;
> @@ -3498,7 +3498,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
>  		goto out2;
>  
>  	idx = srcu_read_lock(&kvm->srcu);
> -	r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
> +	r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE, false);
>  	if (r < 0)
>  		goto out;
>  	/* Set up identity-mapping pagetable for EPT in real mode */
> @@ -3506,7 +3506,8 @@ static int init_rmode_identity_map(struct kvm *kvm)
>  		tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
>  			_PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
>  		r = kvm_write_guest_page(kvm, identity_map_pfn,
> -				&tmp, i * sizeof(tmp), sizeof(tmp));
> +					 &tmp, i * sizeof(tmp),
> +					 sizeof(tmp), false);
>  		if (r < 0)
>  			goto out;
>  	}
> @@ -7265,7 +7266,8 @@ static int vmx_write_pml_buffer(struct kvm_vcpu *vcpu)
>  		dst = vmcs12->pml_address + sizeof(u64) * vmcs12->guest_pml_index;
>  
>  		if (kvm_write_guest_page(vcpu->kvm, gpa_to_gfn(dst), &gpa,
> -					 offset_in_page(dst), sizeof(gpa)))
> +					 offset_in_page(dst), sizeof(gpa),
> +					 false))
>  			return 0;
>  
>  		vmcs12->guest_pml_index--;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2ea1ea79befd..4e34cf97ca90 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -734,7 +734,7 @@ int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
>  int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>  			   void *data, unsigned long len);
>  int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
> -			 int offset, int len);
> +			 int offset, int len, bool track_dirty);
>  int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
>  		    unsigned long len);
>  int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
> @@ -744,7 +744,8 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>  				  unsigned long len);
>  int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>  			      gpa_t gpa, unsigned long len);
> -int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
> +int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len,
> +			 bool track_dirty);
>  int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
>  bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7ee28af9eb48..b1047173d78e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2051,7 +2051,8 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
>  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
>  
>  static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
> -			          const void *data, int offset, int len)
> +			          const void *data, int offset, int len,
> +				  bool track_dirty)
>  {
>  	int r;
>  	unsigned long addr;
> @@ -2062,16 +2063,19 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
>  	r = __copy_to_user((void __user *)addr + offset, data, len);
>  	if (r)
>  		return -EFAULT;
> -	mark_page_dirty_in_slot(memslot, gfn);
> +	if (track_dirty)
> +		mark_page_dirty_in_slot(memslot, gfn);
>  	return 0;
>  }
>  
>  int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
> -			 const void *data, int offset, int len)
> +			 const void *data, int offset, int len,
> +			 bool track_dirty)
>  {
>  	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>  
> -	return __kvm_write_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_write_guest_page(slot, gfn, data, offset, len,
> +				      track_dirty);
>  }
>  EXPORT_SYMBOL_GPL(kvm_write_guest_page);
>  
> @@ -2080,7 +2084,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  {
>  	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>  
> -	return __kvm_write_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_write_guest_page(slot, gfn, data, offset,
> +				      len, true);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
>  
> @@ -2093,7 +2098,7 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
>  	int ret;
>  
>  	while ((seg = next_segment(len, offset)) != 0) {
> -		ret = kvm_write_guest_page(kvm, gfn, data, offset, seg);
> +		ret = kvm_write_guest_page(kvm, gfn, data, offset, seg, true);
>  		if (ret < 0)
>  			return ret;
>  		offset = 0;
> @@ -2232,11 +2237,13 @@ int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>  }
>  EXPORT_SYMBOL_GPL(kvm_read_guest_cached);
>  
> -int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
> +int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len,
> +			 bool track_dirty)
>  {
>  	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
>  
> -	return kvm_write_guest_page(kvm, gfn, zero_page, offset, len);
> +	return kvm_write_guest_page(kvm, gfn, zero_page, offset, len,
> +				    track_dirty);
>  }
>  EXPORT_SYMBOL_GPL(kvm_clear_guest_page);
>  
> @@ -2248,7 +2255,7 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
>  	int ret;
>  
>  	while ((seg = next_segment(len, offset)) != 0) {
> -		ret = kvm_clear_guest_page(kvm, gfn, offset, seg);
> +		ret = kvm_clear_guest_page(kvm, gfn, offset, seg, true);
>  		if (ret < 0)
>  			return ret;
>  		offset = 0;
> 



* Re: [PATCH RESEND v2 03/17] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2019-12-21 13:51   ` Paolo Bonzini
@ 2019-12-23 17:27     ` Peter Xu
  2019-12-23 17:59       ` Paolo Bonzini
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2019-12-23 17:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

On Sat, Dec 21, 2019 at 02:51:52PM +0100, Paolo Bonzini wrote:
> On 21/12/19 02:49, Peter Xu wrote:
> > Originally, we have three code paths that can dirty a page without
> > vcpu context for X86:
> > 
> >   - init_rmode_identity_map
> >   - init_rmode_tss
> >   - kvmgt_rw_gpa
> > 
> > init_rmode_identity_map and init_rmode_tss will be setup on
> > destination VM no matter what (and the guest cannot even see them), so
> > it does not make sense to track them at all.
> > 
> > To do this, a new parameter is added to kvm_[write|clear]_guest_page()
> > to show whether we would like to track dirty bits for the operations.
> > With that, pass in "false" to this new parameter for any guest memory
> > write of the ioctls (KVM_SET_TSS_ADDR, KVM_SET_IDENTITY_MAP_ADDR).
> 
> We can also return the hva from x86_set_memory_region and
> __x86_set_memory_region.

Yes.  Though it is a bit tricky in that we'll also need to make
sure to take slots_lock or srcu to protect that hva (say, we must drop
that hva reference before we release the locks, otherwise the hva
could be gone under us, IIUC).  So if we want to do that we'd better
comment on that hva value very explicitly, just in case some future
callers of __x86_set_memory_region could cache it somewhere.

(Side topic: I feel like the srcu_read_lock() pair in
 init_rmode_identity_map() is redundant..)

-- 
Peter Xu



* Re: [PATCH RESEND v2 03/17] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2019-12-23 17:27     ` Peter Xu
@ 2019-12-23 17:59       ` Paolo Bonzini
  2019-12-23 20:10         ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Paolo Bonzini @ 2019-12-23 17:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

On 23/12/19 18:27, Peter Xu wrote:
> Yes.  Though it is a bit tricky in that we'll also need to make
> sure to take slots_lock or srcu to protect that hva (say, we must drop
> that hva reference before we release the locks, otherwise the hva
> could be gone under us, IIUC).

Yes, kvm->slots_lock is taken by x86_set_memory_region.  We need to move
that to the callers, of which several are already taking the lock (all
except vmx_set_tss_addr and kvm_arch_destroy_vm).
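
For concreteness, a minimal sketch of the resulting calling pattern
(hypothetical code, not from the series; exact signatures may differ):

	mutex_lock(&kvm->slots_lock);
	hva = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT,
				      addr, PAGE_SIZE * 3);
	if (IS_ERR((void *)hva)) {
		mutex_unlock(&kvm->slots_lock);
		return PTR_ERR((void *)hva);
	}
	/* the hva is only stable while slots_lock is held */
	/* ... initialize the TSS contents through hva ... */
	mutex_unlock(&kvm->slots_lock);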

Paolo

> So if we want to do that we'd better
> comment on that hva value very explicitly, just in case some future
> callers of __x86_set_memory_region could cache it somewhere.



* Re: [PATCH RESEND v2 03/17] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2019-12-23 17:59       ` Paolo Bonzini
@ 2019-12-23 20:10         ` Peter Xu
  2020-01-08 17:46           ` Paolo Bonzini
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2019-12-23 20:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

On Mon, Dec 23, 2019 at 06:59:01PM +0100, Paolo Bonzini wrote:
> On 23/12/19 18:27, Peter Xu wrote:
> > Yes.  Though it is a bit tricky in that we'll also need to make
> > sure to take slots_lock or srcu to protect that hva (say, we must drop
> > that hva reference before we release the locks, otherwise the hva
> > could be gone under us, IIUC).
> 
> Yes, kvm->slots_lock is taken by x86_set_memory_region.  We need to move
> that to the callers, of which several are already taking the lock (all
> except vmx_set_tss_addr and kvm_arch_destroy_vm).

OK, will do.  I'll directly replace the x86_set_memory_region() calls
in kvm_arch_destroy_vm() with __x86_set_memory_region(), since IIUC
the slots_lock is useless when destroying the vm... then drop the
x86_set_memory_region() helper in the next version.
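
For kvm_arch_destroy_vm() that would be roughly (sketch only, assuming
the three private memslots that are torn down there today):

	/* no slots_lock: nothing else can touch the VM at destroy time */
	__x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, 0, 0);
	__x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT, 0, 0);
	__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);

Thanks,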

-- 
Peter Xu



* Re: [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking
  2019-12-21  1:49 ` [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
@ 2019-12-24  6:16   ` Jason Wang
  2019-12-24 15:08     ` Peter Xu
  2020-01-08 15:52   ` Peter Xu
  1 sibling, 1 reply; 45+ messages in thread
From: Jason Wang @ 2019-12-24  6:16 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Dr . David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Vitaly Kuznetsov, Lei Cao


On 2019/12/21 9:49 AM, Peter Xu wrote:
> This patch is heavily based on previous work from Lei Cao
> <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
>
> KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> are copied to userspace when userspace queries KVM for its dirty page
> information.  The use of bitmaps is mostly sufficient for live
> migration, as large parts of memory are dirtied from one log-dirty
> pass to another.  However, in a checkpointing system, the number of
> dirty pages is small and in fact it is often bounded---the VM is
> paused when it has dirtied a pre-defined number of pages. Traversing a
> large, sparsely populated bitmap to find set bits is time-consuming,
> as is copying the bitmap to user-space.
>
> A similar issue exists for live migration when the guest memory is
> huge while the rate of page dirtying is low.  In that case, for each
> dirty sync we need to pull the whole dirty bitmap to userspace and
> analyse every bit even if it's mostly zeros.
>
> The preferred data structure for above scenarios is a dense list of
> guest frame numbers (GFN).  This patch series stores the dirty list in
> kernel memory that can be memory mapped into userspace to allow speedy
> harvesting.
>
> This patch enables dirty ring for X86 only.  However it should be
> easily extended to other archs as well.
>
> [1] https://patchwork.kernel.org/patch/10471409/
>
> Signed-off-by: Lei Cao <lei.cao@stratus.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   Documentation/virt/kvm/api.txt  |  89 ++++++++++++++
>   arch/x86/include/asm/kvm_host.h |   3 +
>   arch/x86/include/uapi/asm/kvm.h |   1 +
>   arch/x86/kvm/Makefile           |   3 +-
>   arch/x86/kvm/mmu.c              |   6 +
>   arch/x86/kvm/vmx/vmx.c          |   7 ++
>   arch/x86/kvm/x86.c              |   9 ++
>   include/linux/kvm_dirty_ring.h  |  57 +++++++++
>   include/linux/kvm_host.h        |  28 +++++
>   include/trace/events/kvm.h      |  78 +++++++++++++
>   include/uapi/linux/kvm.h        |  31 +++++
>   virt/kvm/dirty_ring.c           | 201 ++++++++++++++++++++++++++++++++
>   virt/kvm/kvm_main.c             | 172 ++++++++++++++++++++++++++-
>   13 files changed, 682 insertions(+), 3 deletions(-)
>   create mode 100644 include/linux/kvm_dirty_ring.h
>   create mode 100644 virt/kvm/dirty_ring.c
>
> diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> index 4833904d32a5..c141b285e673 100644
> --- a/Documentation/virt/kvm/api.txt
> +++ b/Documentation/virt/kvm/api.txt
> @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
>   It is thus encouraged to use the vm ioctl to query for capabilities (available
>   with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
>   
> +
>   4.5 KVM_GET_VCPU_MMAP_SIZE
>   
>   Capability: basic
> @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
>   memory region.  This ioctl returns the size of that region.  See the
>   KVM_RUN documentation for details.
>   
> +Besides the size of the KVM_RUN communication region, other areas of
> +the VCPU file descriptor can be mmap-ed, including:
> +
> +- if KVM_CAP_COALESCED_MMIO is available, a page at
> +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> +  KVM_CAP_COALESCED_MMIO is not documented yet.
> +
> +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> +
>   
>   4.6 KVM_SET_MEMORY_REGION
>   
> @@ -5302,6 +5315,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
>   AArch64, this value will be reported in the ISS field of ESR_ELx.
>   
>   See KVM_CAP_VCPU_EVENTS for more details.
> +
>   8.20 KVM_CAP_HYPERV_SEND_IPI
>   
>   Architectures: x86
> @@ -5309,6 +5323,7 @@ Architectures: x86
>   This capability indicates that KVM supports paravirtualized Hyper-V IPI send
>   hypercalls:
>   HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +
>   8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
>   
>   Architecture: x86
> @@ -5322,3 +5337,77 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
>   flush hypercalls by Hyper-V) so userspace should disable KVM identification
>   in CPUID and only exposes Hyper-V identification. In this case, guest
>   thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> +
> +8.22 KVM_CAP_DIRTY_LOG_RING
> +
> +Architectures: x86
> +Parameters: args[0] - size of the dirty log ring
> +
> +KVM is capable of tracking dirty memory using ring buffers that are
> +mmaped into userspace; there is one dirty ring per vcpu.
> +
> +One dirty ring is defined as below internally:
> +
> +struct kvm_dirty_ring {
> +	u32 dirty_index;
> +	u32 reset_index;
> +	u32 size;
> +	u32 soft_limit;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +	struct kvm_dirty_ring_indices *indices;
> +	int index;
> +};
> +
> +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
> +For each of the dirty entry it's defined as:
> +
> +struct kvm_dirty_gfn {
> +        __u32 pad;
> +        __u32 slot; /* as_id | slot_id */
> +        __u64 offset;
> +};
> +
> +Most of the ring structure is used by KVM internally, while only the
> +indices are exposed to userspace:
> +
> +struct kvm_dirty_ring_indices {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 fetch_index; /* set by userspace */
> +};
> +
> +The two indices in the ring buffer are free running counters.
> +
> +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> +to enable this capability for the new guest and set the size of the
> +rings.  It is only allowed before creating any vCPU, and the size of
> +the ring must be a power of two.  The larger the ring buffer, the less
> +likely the ring is full and the VM is forced to exit to userspace. The
> +optimal size depends on the workload, but it is recommended that it be
> +at least 64 KiB (4096 entries).
> +
> +Just like for dirty page bitmaps, the buffer tracks writes to
> +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> +with the flag set, userspace can start harvesting dirty pages from the
> +ring buffer.
> +
> +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> +to read the dirty GFNs up to avail_index, and sets the fetch_index
> +accordingly.  This can be done when the guest is running or paused,
> +and dirty pages need not be collected all at once.  After processing
> +one or more entries in the ring buffer, userspace calls the VM ioctl
> +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> +fetch_index and to mark those pages clean.  Therefore, the ioctl
> +must be called *before* reading the content of the dirty pages.
> +
> +However, there is a major difference compared to the
> +KVM_GET_DIRTY_LOG interface: when reading the dirty ring from
> +userspace, it is still possible that the kernel has not yet flushed
> +the hardware dirty buffers into the kernel buffer (something that
> +KVM_GET_DIRTY_LOG did implicitly).  To get that flush, one needs to
> +kick the vcpu out for a hardware buffer flush (vmexit) to make sure
> +all the existing dirty gfns are flushed to the dirty rings.
> +
> +If one of the ring buffers is full, the guest will exit to userspace
> +with the exit reason set to KVM_EXIT_DIRTY_RING_FULL, and the KVM_RUN
> +ioctl will return to userspace with zero.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4fc61483919a..7e5e2d3f0509 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1159,6 +1159,7 @@ struct kvm_x86_ops {
>   					   struct kvm_memory_slot *slot,
>   					   gfn_t offset, unsigned long mask);
>   	int (*write_log_dirty)(struct kvm_vcpu *vcpu);
> +	int (*cpu_dirty_log_size)(void);
>   
>   	/* pmu operations of sub-arch */
>   	const struct kvm_pmu_ops *pmu_ops;
> @@ -1641,4 +1642,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
>   #define GET_SMSTATE(type, buf, offset)		\
>   	(*(type *)((buf) + (offset) - 0x7e00))
>   
> +int kvm_cpu_dirty_log_size(void);
> +
>   #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 503d3f42da16..b59bf356c478 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -12,6 +12,7 @@
>   
>   #define KVM_PIO_PAGE_OFFSET 1
>   #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
>   
>   #define DE_VECTOR 0
>   #define DB_VECTOR 1
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 31ecf7a76d5a..a66ddb552208 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
>   KVM := ../../../virt/kvm
>   
>   kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> +				$(KVM)/dirty_ring.o
>   kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
>   
>   kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2ce9da58611e..5f7d73730f73 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1818,7 +1818,13 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
>   {
>   	if (kvm_x86_ops->write_log_dirty)
>   		return kvm_x86_ops->write_log_dirty(vcpu);
> +	return 0;
> +}
>   
> +int kvm_cpu_dirty_log_size(void)
> +{
> +	if (kvm_x86_ops->cpu_dirty_log_size)
> +		return kvm_x86_ops->cpu_dirty_log_size();
>   	return 0;
>   }
>   
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 1ff5a428f489..c3565319b481 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7686,6 +7686,7 @@ static __init int hardware_setup(void)
>   		kvm_x86_ops->slot_disable_log_dirty = NULL;
>   		kvm_x86_ops->flush_log_dirty = NULL;
>   		kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
> +		kvm_x86_ops->cpu_dirty_log_size = NULL;
>   	}
>   
>   	if (!cpu_has_vmx_preemption_timer())
> @@ -7750,6 +7751,11 @@ static __exit void hardware_unsetup(void)
>   	free_kvm_area();
>   }
>   
> +static int vmx_cpu_dirty_log_size(void)
> +{
> +	return enable_pml ? PML_ENTITY_NUM : 0;
> +}
> +
>   static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>   	.cpu_has_kvm_support = cpu_has_kvm_support,
>   	.disabled_by_bios = vmx_disabled_by_bios,
> @@ -7873,6 +7879,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>   	.flush_log_dirty = vmx_flush_log_dirty,
>   	.enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked,
>   	.write_log_dirty = vmx_write_pml_buffer,
> +	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
>   
>   	.pre_block = vmx_pre_block,
>   	.post_block = vmx_post_block,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5d530521f11d..f93262025a61 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7965,6 +7965,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>   
>   	bool req_immediate_exit = false;
>   
> +	/* Forbid vmenter if vcpu dirty ring is soft-full */
> +	if (unlikely(vcpu->kvm->dirty_ring_size &&
> +		     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
> +		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> +		trace_kvm_dirty_ring_exit(vcpu);
> +		r = 0;
> +		goto out;
> +	}
> +
>   	if (kvm_request_pending(vcpu)) {
>   		if (kvm_check_request(KVM_REQ_GET_VMCS12_PAGES, vcpu)) {
>   			if (unlikely(!kvm_x86_ops->get_vmcs12_pages(vcpu))) {
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> new file mode 100644
> index 000000000000..06db2312b383
> --- /dev/null
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -0,0 +1,57 @@
> +#ifndef KVM_DIRTY_RING_H
> +#define KVM_DIRTY_RING_H
> +
> +/**
> + * kvm_dirty_ring: KVM internal dirty ring structure
> + *
> + * @dirty_index: free running counter that points to the next slot in
> + *               dirty_ring->dirty_gfns, where a new dirty page should go
> + * @reset_index: free running counter that points to the next dirty page
> + *               in dirty_ring->dirty_gfns for which dirty trap needs to
> + *               be reenabled
> + * @size:        size of the compact list, dirty_ring->dirty_gfns
> + * @soft_limit:  when the number of dirty pages in the list reaches this
> + *               limit, vcpu that owns this ring should exit to userspace
> + *               to allow userspace to harvest all the dirty pages
> + * @dirty_gfns:  the array to keep the dirty gfns
> + * @indices:     the pointer to the @kvm_dirty_ring_indices structure
> + *               of this specific ring
> + * @index:       index of this dirty ring
> + */
> +struct kvm_dirty_ring {
> +	u32 dirty_index;


Does this always equal indices->avail_index?


> +	u32 reset_index;
> +	u32 size;
> +	u32 soft_limit;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +	struct kvm_dirty_ring_indices *indices;


Any reason to keep the dirty gfns and the indices in different places?
I guess it is because you want to map dirty_gfns as a read-only page,
but I couldn't find such code...


> +	int index;
> +};
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void);
> +int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indices *indices,
> +			 int index, u32 size);
> +struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm);
> +void kvm_dirty_ring_put(struct kvm *kvm,
> +			struct kvm_dirty_ring *ring);
> +
> +/*
> + * called with kvm->slots_lock held, returns the number of
> + * processed pages.
> + */
> +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
> +
> +/*
> + * returns =0: successfully pushed
> + *         <0: unable to push, need to wait
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
> +
> +/* for use in vm_operations_struct */
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> +bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
> +
> +#endif
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index b4f7bef38e0d..dff214ab72eb 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -34,6 +34,7 @@
>   #include <linux/kvm_types.h>
>   
>   #include <asm/kvm_host.h>
> +#include <linux/kvm_dirty_ring.h>
>   
>   #ifndef KVM_MAX_VCPU_ID
>   #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> @@ -321,6 +322,7 @@ struct kvm_vcpu {
>   	bool ready;
>   	struct kvm_vcpu_arch arch;
>   	struct dentry *debugfs_dentry;
> +	struct kvm_dirty_ring dirty_ring;
>   };
>   
>   static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> @@ -502,6 +504,9 @@ struct kvm {
>   	struct srcu_struct srcu;
>   	struct srcu_struct irq_srcu;
>   	pid_t userspace_pid;
> +	u32 dirty_ring_size;
> +	struct spinlock dirty_ring_lock;
> +	wait_queue_head_t dirty_ring_waitqueue;
>   };
>   
>   #define kvm_err(fmt, ...) \
> @@ -813,6 +818,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>   					gfn_t gfn_offset,
>   					unsigned long mask);
>   
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> +
>   int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
>   				struct kvm_dirty_log *log);
>   int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> @@ -1392,4 +1399,25 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>   				uintptr_t data, const char *name,
>   				struct task_struct **thread_ptr);
>   
> +/*
> + * This defines how many reserved entries we want to keep before we
> + * kick the vcpu out to userspace to avoid the dirty ring getting full.
> + * This value can be tuned higher if e.g. PML is enabled on the host.
> + */
> +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> +
> +/* Max number of entries allowed for each kvm dirty ring */
> +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> +
> +/*
> + * Arch needs to define these macros after implementing the dirty ring
> + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> + * starting page offset of the dirty ring structures, while
> + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> + * feature is off on all archs.
> + */
> +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> +#endif
> +
>   #endif
> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
> index 2c735a3e6613..3d850997940c 100644
> --- a/include/trace/events/kvm.h
> +++ b/include/trace/events/kvm.h
> @@ -399,6 +399,84 @@ TRACE_EVENT(kvm_halt_poll_ns,
>   #define trace_kvm_halt_poll_ns_shrink(vcpu_id, new, old) \
>   	trace_kvm_halt_poll_ns(false, vcpu_id, new, old)
>   
> +TRACE_EVENT(kvm_dirty_ring_push,
> +	TP_PROTO(struct kvm_dirty_ring *ring, u32 slot, u64 offset),
> +	TP_ARGS(ring, slot, offset),
> +
> +	TP_STRUCT__entry(
> +		__field(int, index)
> +		__field(u32, dirty_index)
> +		__field(u32, reset_index)
> +		__field(u32, slot)
> +		__field(u64, offset)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->index          = ring->index;
> +		__entry->dirty_index    = ring->dirty_index;
> +		__entry->reset_index    = ring->reset_index;
> +		__entry->slot           = slot;
> +		__entry->offset         = offset;
> +	),
> +
> +	TP_printk("ring %d: dirty 0x%x reset 0x%x "
> +		  "slot %u offset 0x%llx (used %u)",
> +		  __entry->index, __entry->dirty_index,
> +		  __entry->reset_index,  __entry->slot, __entry->offset,
> +		  __entry->dirty_index - __entry->reset_index)
> +);
> +
> +TRACE_EVENT(kvm_dirty_ring_reset,
> +	TP_PROTO(struct kvm_dirty_ring *ring),
> +	TP_ARGS(ring),
> +
> +	TP_STRUCT__entry(
> +		__field(int, index)
> +		__field(u32, dirty_index)
> +		__field(u32, reset_index)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->index          = ring->index;
> +		__entry->dirty_index    = ring->dirty_index;
> +		__entry->reset_index    = ring->reset_index;
> +	),
> +
> +	TP_printk("ring %d: dirty 0x%x reset 0x%x (used %u)",
> +		  __entry->index, __entry->dirty_index, __entry->reset_index,
> +		  __entry->dirty_index - __entry->reset_index)
> +);
> +
> +TRACE_EVENT(kvm_dirty_ring_waitqueue,
> +	TP_PROTO(bool enter),
> +	TP_ARGS(enter),
> +
> +	TP_STRUCT__entry(
> +	    __field(bool, enter)
> +	),
> +
> +	TP_fast_assign(
> +	    __entry->enter = enter;
> +	),
> +
> +	TP_printk("%s", __entry->enter ? "wait" : "awake")
> +);
> +
> +TRACE_EVENT(kvm_dirty_ring_exit,
> +	TP_PROTO(struct kvm_vcpu *vcpu),
> +	TP_ARGS(vcpu),
> +
> +	TP_STRUCT__entry(
> +	    __field(int, vcpu_id)
> +	),
> +
> +	TP_fast_assign(
> +	    __entry->vcpu_id = vcpu->vcpu_id;
> +	),
> +
> +	TP_printk("vcpu %d", __entry->vcpu_id)
> +);
> +
>   #endif /* _TRACE_KVM_MAIN_H */
>   
>   /* This part must be outside protection */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 52641d8ca9e8..5ea98e35a129 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -235,6 +235,7 @@ struct kvm_hyperv_exit {
>   #define KVM_EXIT_S390_STSI        25
>   #define KVM_EXIT_IOAPIC_EOI       26
>   #define KVM_EXIT_HYPERV           27
> +#define KVM_EXIT_DIRTY_RING_FULL  28
>   
>   /* For KVM_EXIT_INTERNAL_ERROR */
>   /* Emulate instruction failed. */
> @@ -246,6 +247,11 @@ struct kvm_hyperv_exit {
>   /* Encounter unexpected vm-exit reason */
>   #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
>   
> +struct kvm_dirty_ring_indices {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 fetch_index; /* set by userspace */


Would it be better to make those two cacheline-aligned?
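
As a purely illustrative sketch (not from the patch; explicit padding
is used since ____cacheline_aligned_in_smp is unavailable in uapi
headers), the suggestion could look like:

	struct kvm_dirty_ring_indices {
		__u32 avail_index;	/* set by kernel */
		__u32 padding[15];	/* keep fetch_index on its own
					   64-byte cache line */
		__u32 fetch_index;	/* set by userspace */
	};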


> +};
> +
>   /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
>   struct kvm_run {
>   	/* in */
> @@ -415,6 +421,8 @@ struct kvm_run {
>   		struct kvm_sync_regs regs;
>   		char padding[SYNC_REGS_SIZE_BYTES];
>   	} s;
> +
> +	struct kvm_dirty_ring_indices vcpu_ring_indices;
>   };
>   
>   /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> @@ -1000,6 +1008,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_PMU_EVENT_FILTER 173
>   #define KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 174
>   #define KVM_CAP_HYPERV_DIRECT_TLBFLUSH 175
> +#define KVM_CAP_DIRTY_LOG_RING 176
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
> @@ -1461,6 +1470,9 @@ struct kvm_enc_region {
>   /* Available with KVM_CAP_ARM_SVE */
>   #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
>   
> +/* Available with KVM_CAP_DIRTY_LOG_RING */
> +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> +
>   /* Secure Encrypted Virtualization command */
>   enum sev_cmd_id {
>   	/* Guest initialization commands */
> @@ -1611,4 +1623,23 @@ struct kvm_hyperv_eventfd {
>   #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
>   #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
>   
> +/*
> + * The following are the requirements for supporting dirty log ring
> + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> + *
> + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> + *    of kvm_write_* so that the global dirty ring is not filled up
> + *    too quickly.
> + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> + *    enabling dirty logging.
> + * 3. There should not be a separate step to synchronize hardware
> + *    dirty bitmap with KVM's.
> + */
> +
> +struct kvm_dirty_gfn {
> +	__u32 pad;
> +	__u32 slot;
> +	__u64 offset;
> +};
> +
>   #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> new file mode 100644
> index 000000000000..c614822493ff
> --- /dev/null
> +++ b/virt/kvm/dirty_ring.c
> @@ -0,0 +1,201 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * KVM dirty ring implementation
> + *
> + * Copyright 2019 Red Hat, Inc.
> + */
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kvm_dirty_ring.h>
> +#include <trace/events/kvm.h>
> +
> +int __weak kvm_cpu_dirty_log_size(void)
> +{
> +	return 0;
> +}
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void)
> +{
> +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> +}
> +
> +static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> +{
> +	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
> +}
> +
> +bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
> +{
> +	return kvm_dirty_ring_used(ring) >= ring->soft_limit;
> +}
> +
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> +{
> +	return kvm_dirty_ring_used(ring) >= ring->size;
> +}
> +
> +struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
> +{
> +	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> +
> +	/*
> +	 * TODO: Currently use vcpu0 as the default ring.  Note that this
> +	 * should only happen when called by kvmgt_rw_gpa on x86.
> +	 * After the kvmgt code refactoring we should remove this,
> +	 * together with the kvm->dirty_ring_lock.
> +	 */
> +	if (!vcpu) {
> +		pr_warn_once("Detected page dirty without vcpu context. "
> +			     "Probably because kvm-gt is used. "
> +			     "May expect unbalanced loads on vcpu0.");
> +		vcpu = kvm->vcpus[0];
> +	}
> +
> +	WARN_ON_ONCE(vcpu->kvm != kvm);
> +
> +	if (vcpu == kvm->vcpus[0])
> +		spin_lock(&kvm->dirty_ring_lock);
> +
> +	return &vcpu->dirty_ring;
> +}
> +
> +void kvm_dirty_ring_put(struct kvm *kvm,
> +			struct kvm_dirty_ring *ring)
> +{
> +	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> +
> +	if (!vcpu)
> +		vcpu = kvm->vcpus[0];
> +
> +	WARN_ON_ONCE(vcpu->kvm != kvm);
> +	WARN_ON_ONCE(&vcpu->dirty_ring != ring);
> +
> +	if (vcpu == kvm->vcpus[0])
> +		spin_unlock(&kvm->dirty_ring_lock);
> +}
> +
> +int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indices *indices,
> +			 int index, u32 size)
> +{
> +	ring->dirty_gfns = vmalloc(size);
> +	if (!ring->dirty_gfns)
> +		return -ENOMEM;
> +	memset(ring->dirty_gfns, 0, size);
> +
> +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> +	ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
> +	ring->dirty_index = 0;
> +	ring->reset_index = 0;
> +	ring->index = index;
> +	ring->indices = indices;
> +
> +	return 0;
> +}
> +
> +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +{
> +	u32 cur_slot, next_slot;
> +	u64 cur_offset, next_offset;
> +	unsigned long mask;
> +	u32 fetch;
> +	int count = 0;
> +	struct kvm_dirty_gfn *entry;
> +	struct kvm_dirty_ring_indices *indices = ring->indices;
> +	bool first_round = true;
> +
> +	fetch = READ_ONCE(indices->fetch_index);
> +
> +	/*
> +	 * Note that fetch_index is written by userspace, which
> +	 * should not be trusted.  If the check below fails, userspace
> +	 * has probably written a bogus fetch_index.
> +	 */
> +	if (fetch - ring->reset_index > ring->size)
> +		return -EINVAL;
> +
> +	if (fetch == ring->reset_index)
> +		return 0;
> +
> +	/* This is only needed to make compilers happy */
> +	cur_slot = cur_offset = mask = 0;
> +	while (ring->reset_index != fetch) {
> +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> +		next_slot = READ_ONCE(entry->slot);
> +		next_offset = READ_ONCE(entry->offset);
> +		ring->reset_index++;
> +		count++;
> +		/*
> +		 * Try to coalesce the reset operations when the guest is
> +		 * scanning pages in the same slot.
> +		 */
> +		if (!first_round && next_slot == cur_slot) {


Initialize cur_slot to -1, then we can drop first_round here?


> +			s64 delta = next_offset - cur_offset;
> +
> +			if (delta >= 0 && delta < BITS_PER_LONG) {
> +				mask |= 1ull << delta;
> +				continue;
> +			}
> +
> +			/* Backwards visit, careful about overflows!  */
> +			if (delta > -BITS_PER_LONG && delta < 0 &&
> +			    (mask << -delta >> -delta) == mask) {
> +				cur_offset = next_offset;
> +				mask = (mask << -delta) | 1;
> +				continue;
> +			}
> +		}
> +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +		cur_slot = next_slot;
> +		cur_offset = next_offset;
> +		mask = 1;
> +		first_round = false;
> +	}
> +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +
> +	trace_kvm_dirty_ring_reset(ring);
> +
> +	return count;
> +}
> +
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> +{
> +	struct kvm_dirty_gfn *entry;
> +	struct kvm_dirty_ring_indices *indices = ring->indices;
> +
> +	/*
> +	 * Note: here we start waiting even when only soft-full, because
> +	 * we can't risk making the ring completely full: vcpu0 could use
> +	 * it right after us, and if the vcpu0 context fills it up we
> +	 * could deadlock waiting with the mmu_lock held.
> +	 */
> +	if (kvm_get_running_vcpu() == NULL &&
> +	    kvm_dirty_ring_soft_full(ring))
> +		return -EBUSY;
> +
> +	/* It never gets completely full when we have a vcpu context */
> +	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
> +
> +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> +	entry->slot = slot;
> +	entry->offset = offset;
> +	smp_wmb();


Better to add a comment to explain this barrier, e.g. what it pairs with.
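
For reference, the consumer side this smp_wmb() pairs with would look
roughly like the below (a hypothetical userspace harvester, not part
of this patch; collect() stands in for whatever processes an entry):

	avail = READ_ONCE(indices->avail_index);
	smp_rmb();	/* pairs with the producer's smp_wmb(); entries
			 * up to avail are now visible */
	while (fetch != avail) {
		entry = &dirty_gfns[fetch & (size - 1)];
		collect(entry->slot, entry->offset);
		fetch++;
	}
	WRITE_ONCE(indices->fetch_index, fetch);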


> +	ring->dirty_index++;
> +	WRITE_ONCE(indices->avail_index, ring->dirty_index);


Is WRITE_ONCE() a must here?


> +
> +	trace_kvm_dirty_ring_push(ring, slot, offset);
> +
> +	return 0;
> +}
> +
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
> +{
> +	return vmalloc_to_page((void *)ring->dirty_gfns + offset * PAGE_SIZE);
> +}
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> +{
> +	vfree(ring->dirty_gfns);
> +	ring->dirty_gfns = NULL;
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5c606d158854..4050631d05f3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -64,6 +64,8 @@
>   #define CREATE_TRACE_POINTS
>   #include <trace/events/kvm.h>
>   
> +#include <linux/kvm_dirty_ring.h>
> +
>   /* Worst case buffer size needed for holding an integer. */
>   #define ITOA_MAX_LEN 12
>   
> @@ -148,6 +150,9 @@ static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
>   static void mark_page_dirty_in_slot(struct kvm *kvm,
>   				    struct kvm_memory_slot *memslot,
>   				    gfn_t gfn);
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn);
>   
>   __visible bool kvm_rebooting;
>   EXPORT_SYMBOL_GPL(kvm_rebooting);
> @@ -357,11 +362,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>   	vcpu->preempted = false;
>   	vcpu->ready = false;
>   
> +	if (kvm->dirty_ring_size) {
> +		r = kvm_dirty_ring_alloc(&vcpu->dirty_ring,
> +					 &vcpu->run->vcpu_ring_indices,
> +					 id, kvm->dirty_ring_size);
> +		if (r)
> +			goto fail_free_run;
> +	}
> +
>   	r = kvm_arch_vcpu_init(vcpu);
>   	if (r < 0)
> -		goto fail_free_run;
> +		goto fail_free_ring;
>   	return 0;
>   
> +fail_free_ring:
> +	if (kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>   fail_free_run:
>   	free_page((unsigned long)vcpu->run);
>   fail:
> @@ -379,6 +395,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
>   	put_pid(rcu_dereference_protected(vcpu->pid, 1));
>   	kvm_arch_vcpu_uninit(vcpu);
>   	free_page((unsigned long)vcpu->run);
> +	if (vcpu->kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>   }
>   EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
>   
> @@ -693,6 +711,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>   		return ERR_PTR(-ENOMEM);
>   
>   	spin_lock_init(&kvm->mmu_lock);
> +	spin_lock_init(&kvm->dirty_ring_lock);
>   	mmgrab(current->mm);
>   	kvm->mm = current->mm;
>   	kvm_eventfd_init(kvm);
> @@ -700,6 +719,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>   	mutex_init(&kvm->irq_lock);
>   	mutex_init(&kvm->slots_lock);
>   	INIT_LIST_HEAD(&kvm->devices);
> +	init_waitqueue_head(&kvm->dirty_ring_waitqueue);
>   
>   	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>   
> @@ -2283,7 +2303,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>   	if (memslot && memslot->dirty_bitmap) {
>   		unsigned long rel_gfn = gfn - memslot->base_gfn;
>   
> -		set_bit_le(rel_gfn, memslot->dirty_bitmap);
> +		if (kvm->dirty_ring_size)
> +			mark_page_dirty_in_ring(kvm, memslot, gfn);
> +		else
> +			set_bit_le(rel_gfn, memslot->dirty_bitmap);
>   	}
>   }
>   
> @@ -2630,6 +2653,16 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>   }
>   EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>   
> +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> +{
> +	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
> +		return false;
> +
> +	return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> +	    (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> +	     kvm->dirty_ring_size / PAGE_SIZE);
> +}
> +
>   static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>   {
>   	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> @@ -2645,6 +2678,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>   	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
>   		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
>   #endif
> +	else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> +		page = kvm_dirty_ring_get_page(
> +		    &vcpu->dirty_ring,
> +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
>   	else
>   		return kvm_arch_vcpu_fault(vcpu, vmf);
>   	get_page(page);
> @@ -3239,12 +3276,138 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>   #endif
>   	case KVM_CAP_NR_MEMSLOTS:
>   		return KVM_USER_MEM_SLOTS;
> +	case KVM_CAP_DIRTY_LOG_RING:
> +#ifdef CONFIG_X86
> +		return KVM_DIRTY_RING_MAX_ENTRIES;
> +#else
> +		return 0;
> +#endif
>   	default:
>   		break;
>   	}
>   	return kvm_vm_ioctl_check_extension(kvm, arg);
>   }
>   
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn)
> +{
> +	struct kvm_dirty_ring *ring;
> +	u64 offset;
> +	int ret;
> +
> +	if (!kvm->dirty_ring_size)
> +		return;
> +
> +	offset = gfn - slot->base_gfn;
> +
> +	ring = kvm_dirty_ring_get(kvm);
> +
> +retry:
> +	ret = kvm_dirty_ring_push(ring, (slot->as_id << 16) | slot->id,
> +				  offset);
> +	if (ret < 0) {
> +		/* We must be without a vcpu context. */
> +		WARN_ON_ONCE(kvm_get_running_vcpu());
> +
> +		trace_kvm_dirty_ring_waitqueue(1);
> +		/*
> +		 * The ring is full; put us onto the per-vm waitqueue
> +		 * and wait for another KVM_RESET_DIRTY_RINGS to retry.
> +		 */
> +		wait_event_killable(kvm->dirty_ring_waitqueue,
> +				    !kvm_dirty_ring_soft_full(ring));
> +
> +		trace_kvm_dirty_ring_waitqueue(0);
> +
> +		/* If we're killed, no need to worry about losing dirty bits */
> +		if (fatal_signal_pending(current))
> +			return;
> +
> +		goto retry;
> +	}
> +
> +	kvm_dirty_ring_put(kvm, ring);
> +}
> +
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> +{
> +	struct kvm_memory_slot *memslot;
> +	int as_id, id;
> +
> +	as_id = slot >> 16;
> +	id = (u16)slot;
> +	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> +		return;
> +
> +	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> +	if (offset >= memslot->npages)
> +		return;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> +	spin_unlock(&kvm->mmu_lock);
> +}
> +
> +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> +{
> +	int r;
> +
> +	/* the size should be a power of 2 */
> +	if (!size || (size & (size - 1)))
> +		return -EINVAL;
> +
> +	/* Should be big enough to hold the reserved entries, and at least a page */
> +	if (size < kvm_dirty_ring_get_rsvd_entries() *
> +	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> +		return -EINVAL;
> +
> +	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> +	    sizeof(struct kvm_dirty_gfn))
> +		return -E2BIG;
> +
> +	/* We only allow it to be set once */
> +	if (kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	if (kvm->created_vcpus) {
> +		/* We don't allow changing this value after vcpus are created */
> +		r = -EINVAL;
> +	} else {
> +		kvm->dirty_ring_size = size;
> +		r = 0;
> +	}
> +
> +	mutex_unlock(&kvm->lock);
> +	return r;
> +}
> +
> +static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +	int cleared = 0;
> +
> +	if (!kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	if (cleared)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	wake_up_all(&kvm->dirty_ring_waitqueue);
> +
> +	return cleared;
> +}
> +
>   int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>   						  struct kvm_enable_cap *cap)
>   {
> @@ -3262,6 +3425,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>   		kvm->manual_dirty_log_protect = cap->args[0];
>   		return 0;
>   #endif
> +	case KVM_CAP_DIRTY_LOG_RING:
> +		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>   	default:
>   		return kvm_vm_ioctl_enable_cap(kvm, cap);
>   	}
> @@ -3449,6 +3614,9 @@ static long kvm_vm_ioctl(struct file *filp,
>   	case KVM_CHECK_EXTENSION:
>   		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
>   		break;
> +	case KVM_RESET_DIRTY_RINGS:
> +		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
> +		break;
>   	default:
>   		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>   	}
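
(Putting the new pieces together, the intended userspace flow is roughly
the following sketch -- error handling omitted; vm_fd, vcpu_fd,
ring_bytes and page_size are placeholders:)

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_DIRTY_LOG_RING,
		.args[0] = ring_bytes,	/* power of 2, >= PAGE_SIZE */
	};

	/* must happen before any vcpu is created */
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

	/* per vcpu: map the ring pages at the fixed offset */
	gfns = mmap(NULL, ring_bytes, PROT_READ, MAP_SHARED, vcpu_fd,
		    page_size * KVM_DIRTY_LOG_PAGE_OFFSET);

	/* ... collect entries, advance fetch_index ... then: */
	ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);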



* Re: [PATCH RESEND v2 15/17] KVM: selftests: Add dirty ring buffer test
  2019-12-21  2:04 ` [PATCH RESEND v2 15/17] KVM: selftests: Add dirty ring buffer test Peter Xu
@ 2019-12-24  6:18   ` Jason Wang
  2019-12-24 15:22     ` Peter Xu
  2019-12-24  6:50   ` Jason Wang
  1 sibling, 1 reply; 45+ messages in thread
From: Jason Wang @ 2019-12-24  6:18 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Dr David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Vitaly Kuznetsov


On 2019/12/21 10:04 AM, Peter Xu wrote:
> Add the initial dirty ring buffer test.
>
> The current test implements the userspace dirty ring collection, by
> only reaping the dirty ring when the ring is full.
>
> So it's still running asynchronously like this:
>
>              vcpu                             main thread
>
>    1. vcpu dirties pages
>    2. vcpu gets dirty ring full
>       (userspace exit)
>
>                                         3. main thread waits until full
>                                            (so hardware buffers flushed)
>                                         4. main thread collects
>                                         5. main thread continues vcpu
>
>    6. vcpu continues, goes back to 1
>
> We can't directly collect dirty bits during vcpu execution, because
> otherwise we can't guarantee the hardware dirty bits were flushed when
> we collect, and we're strict on the dirty bits, so otherwise the
> future verify procedure can fail.  A follow up patch will make this
> test support async collection just like the existing dirty log test,
> by adding a vcpu kick mechanism.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   tools/testing/selftests/kvm/dirty_log_test.c  | 174 +++++++++++++++++-
>   .../testing/selftests/kvm/include/kvm_util.h  |   3 +
>   tools/testing/selftests/kvm/lib/kvm_util.c    |  56 ++++++
>   .../selftests/kvm/lib/kvm_util_internal.h     |   3 +
>   4 files changed, 234 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
> index 3542311f56ff..af9b1a16c7d1 100644
> --- a/tools/testing/selftests/kvm/dirty_log_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_test.c
> @@ -12,8 +12,10 @@
>   #include <unistd.h>
>   #include <time.h>
>   #include <pthread.h>
> +#include <semaphore.h>
>   #include <linux/bitmap.h>
>   #include <linux/bitops.h>
> +#include <asm/barrier.h>
>   
>   #include "test_util.h"
>   #include "kvm_util.h"
> @@ -57,6 +59,8 @@
>   # define test_and_clear_bit_le	test_and_clear_bit
>   #endif
>   
> +#define TEST_DIRTY_RING_COUNT		1024
> +
>   /*
>    * Guest/Host shared variables. Ensure addr_gva2hva() and/or
>    * sync_global_to/from_guest() are used when accessing from
> @@ -128,6 +132,10 @@ static uint64_t host_dirty_count;
>   static uint64_t host_clear_count;
>   static uint64_t host_track_next_count;
>   
> +/* Whether dirty ring reset is requested, or finished */
> +static sem_t dirty_ring_vcpu_stop;
> +static sem_t dirty_ring_vcpu_cont;
> +
>   enum log_mode_t {
>   	/* Only use KVM_GET_DIRTY_LOG for logging */
>   	LOG_MODE_DIRTY_LOG = 0,
> @@ -135,6 +143,9 @@ enum log_mode_t {
>   	/* Use both KVM_[GET|CLEAR]_DIRTY_LOG for logging */
>   	LOG_MODE_CLERA_LOG = 1,
>   
> +	/* Use dirty ring for logging */
> +	LOG_MODE_DIRTY_RING = 2,
> +
>   	LOG_MODE_NUM,
>   };
>   
> @@ -177,6 +188,118 @@ static void default_after_vcpu_run(struct kvm_vm *vm)
>   		    exit_reason_str(run->exit_reason));
>   }
>   
> +static void dirty_ring_create_vm_done(struct kvm_vm *vm)
> +{
> +	/*
> +	 * Switch to dirty ring mode after VM creation but before any
> +	 * vcpu is created.
> +	 */
> +	vm_enable_dirty_ring(vm, TEST_DIRTY_RING_COUNT *
> +			     sizeof(struct kvm_dirty_gfn));
> +}
> +
> +static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
> +				       struct kvm_dirty_ring_indices *indices,
> +				       int slot, void *bitmap,
> +				       uint32_t num_pages, int index)
> +{
> +	struct kvm_dirty_gfn *cur;
> +	uint32_t avail, fetch, count = 0;
> +
> +	/*
> +	 * Userspace should normally keep fetch_index cached somewhere,
> +	 * but to be simple we just read it back here too.
> +	 */
> +	fetch = READ_ONCE(indices->fetch_index);
> +	avail = READ_ONCE(indices->avail_index);
> +
> +	/* Make sure we always read valid entries */
> +	rmb();
> +
> +	DEBUG("ring %d: fetch: 0x%x, avail: 0x%x\n", index, fetch, avail);
> +
> +	while (fetch != avail) {
> +		cur = &dirty_gfns[fetch % TEST_DIRTY_RING_COUNT];
> +		TEST_ASSERT(cur->pad == 0, "Padding is non-zero: 0x%x", cur->pad);
> +		TEST_ASSERT(cur->slot == slot, "Slot number didn't match: "
> +			    "%u != %u", cur->slot, slot);
> +		TEST_ASSERT(cur->offset < num_pages, "Offset overflow: "
> +			    "0x%llx >= 0x%llx", cur->offset, num_pages);
> +		DEBUG("fetch 0x%x offset 0x%llx\n", fetch, cur->offset);
> +		test_and_set_bit(cur->offset, bitmap);
> +		fetch++;


Any reason to use test_and_set_bit()? I guess set_bit() should be 
sufficient.


> +		count++;
> +	}
> +	WRITE_ONCE(indices->fetch_index, fetch);


Is WRITE_ONCE a must here?


> +
> +	return count;
> +}
> +
> +static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
> +					   void *bitmap, uint32_t num_pages)
> +{
> +	/* We only have one vcpu */
> +	struct kvm_run *state = vcpu_state(vm, VCPU_ID);
> +	uint32_t count = 0, cleared;
> +
> +	/*
> +	 * Before fetching the dirty pages, we need a vmexit of the
> +	 * worker vcpu to make sure the hardware dirty buffers were
> +	 * flushed.  This is not needed for dirty-log/clear-log tests
> +	 * because getting the dirty log will naturally do so.
> +	 *
> +	 * For now we do it in the simple way - we simply wait until
> +	 * the vcpu uses up the soft dirty ring, then it'll always
> +	 * do a vmexit to make sure that PML buffers will be flushed.
> +	 * In real hypervisors, we probably need a vcpu kick or to
> +	 * stop the vcpus (before the final sync) to make sure we'll
> +	 * get all the existing dirty PFNs even cached in hardware.
> +	 */
> +	sem_wait(&dirty_ring_vcpu_stop);
> +
> +	/* Only have one vcpu */
> +	count = dirty_ring_collect_one(vcpu_map_dirty_ring(vm, VCPU_ID),
> +				       &state->vcpu_ring_indices,
> +				       slot, bitmap, num_pages, VCPU_ID);
> +
> +	cleared = kvm_vm_reset_dirty_ring(vm);
> +
> +	/* Cleared pages should be the same as collected */
> +	TEST_ASSERT(cleared == count, "Reset dirty pages (%u) mismatch "
> +		    "with collected (%u)", cleared, count);
> +
> +	DEBUG("Notifying vcpu to continue\n");
> +	sem_post(&dirty_ring_vcpu_cont);
> +
> +	DEBUG("Iteration %ld collected %u pages\n", iteration, count);
> +}
> +
> +static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
> +{
> +	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
> +
> +	/* A ucall-sync or ring-full event is allowed */
> +	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
> +		/* We should allow this to continue */
> +		;
> +	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
> +		sem_post(&dirty_ring_vcpu_stop);
> +		DEBUG("vcpu stops because dirty ring full...\n");
> +		sem_wait(&dirty_ring_vcpu_cont);
> +		DEBUG("vcpu continues now.\n");
> +	} else {
> +		TEST_ASSERT(false, "Invalid guest sync status: "
> +			    "exit_reason=%s\n",
> +			    exit_reason_str(run->exit_reason));
> +	}
> +}
> +
> +static void dirty_ring_before_vcpu_join(void)
> +{
> +	/* Kick another round of vcpu just to make sure it will quit */
> +	sem_post(&dirty_ring_vcpu_cont);
> +}
> +
>   struct log_mode {
>   	const char *name;
>   	/* Hook when the vm creation is done (before vcpu creation) */
> @@ -186,6 +309,7 @@ struct log_mode {
>   				     void *bitmap, uint32_t num_pages);
>   	/* Hook to call when after each vcpu run */
>   	void (*after_vcpu_run)(struct kvm_vm *vm);
> +	void (*before_vcpu_join)(void);
>   } log_modes[LOG_MODE_NUM] = {
>   	{
>   		.name = "dirty-log",
> @@ -199,6 +323,13 @@ struct log_mode {
>   		.collect_dirty_pages = clear_log_collect_dirty_pages,
>   		.after_vcpu_run = default_after_vcpu_run,
>   	},
> +	{
> +		.name = "dirty-ring",
> +		.create_vm_done = dirty_ring_create_vm_done,
> +		.collect_dirty_pages = dirty_ring_collect_dirty_pages,
> +		.before_vcpu_join = dirty_ring_before_vcpu_join,
> +		.after_vcpu_run = dirty_ring_after_vcpu_run,
> +	},
>   };
>   
>   /*
> @@ -245,6 +376,14 @@ static void log_mode_after_vcpu_run(struct kvm_vm *vm)
>   		mode->after_vcpu_run(vm);
>   }
>   
> +static void log_mode_before_vcpu_join(void)
> +{
> +	struct log_mode *mode = &log_modes[host_log_mode];
> +
> +	if (mode->before_vcpu_join)
> +		mode->before_vcpu_join();
> +}
> +
>   static void generate_random_array(uint64_t *guest_array, uint64_t size)
>   {
>   	uint64_t i;
> @@ -292,14 +431,41 @@ static void vm_dirty_log_verify(unsigned long *bmap)
>   		}
>   
>   		if (test_and_clear_bit_le(page, bmap)) {
> +			bool matched;
> +
>   			host_dirty_count++;
> +
>   			/*
>   			 * If the bit is set, the value written onto
>   			 * the corresponding page should be either the
>   			 * previous iteration number or the current one.
> +			 *
> +			 * The (*value_ptr == iteration - 2) case is
> +			 * special only for the dirty ring test, where
> +			 * the page is the last page dirtied before a
> +			 * kvm dirty ring full userspace exit of the
> +			 * 2nd iteration; without this we'll probably
> +			 * fail on the 4th iteration.  Anyway, let's
> +			 * just loosen the test case a little bit for
> +			 * simplicity.
>   			 */
> -			TEST_ASSERT(*value_ptr == iteration ||
> -				    *value_ptr == iteration - 1,
> +			matched = (*value_ptr == iteration ||
> +				   *value_ptr == iteration - 1 ||
> +				   *value_ptr == iteration - 2);
> +
> +			/*
> +			 * This is the common path for dirty ring
> +			 * where this page is exactly the last page
> +			 * touched before KVM_EXIT_DIRTY_RING_FULL.
> +			 * If it happens, we should expect it to be
> +			 * there for the next round.
> +			 */
> +			if (host_log_mode == LOG_MODE_DIRTY_RING && !matched) {
> +				set_bit_le(page, host_bmap_track);
> +				continue;
> +			}
> +
> +			TEST_ASSERT(matched,
>   				    "Set page %"PRIu64" value %"PRIu64
>   				    " incorrect (iteration=%"PRIu64")",
>   				    page, *value_ptr, iteration);
> @@ -460,6 +626,7 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
>   
>   	/* Tell the vcpu thread to quit */
>   	host_quit = true;
> +	log_mode_before_vcpu_join();
>   	pthread_join(vcpu_thread, NULL);
>   
>   	DEBUG("Total bits checked: dirty (%"PRIu64"), clear (%"PRIu64"), "
> @@ -524,6 +691,9 @@ int main(int argc, char *argv[])
>   	unsigned int host_ipa_limit;
>   #endif
>   
> +	sem_init(&dirty_ring_vcpu_stop, 0, 0);
> +	sem_init(&dirty_ring_vcpu_cont, 0, 0);
> +
>   #ifdef __x86_64__
>   	vm_guest_mode_params_init(VM_MODE_PXXV48_4K, true, true);
>   #endif
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index 29cccaf96baf..4b78a8d3e773 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -67,6 +67,7 @@ enum vm_mem_backing_src_type {
>   
>   int kvm_check_cap(long cap);
>   int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap);
> +void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
>   
>   struct kvm_vm *vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm);
>   struct kvm_vm *_vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm);
> @@ -76,6 +77,7 @@ void kvm_vm_release(struct kvm_vm *vmp);
>   void kvm_vm_get_dirty_log(struct kvm_vm *vm, int slot, void *log);
>   void kvm_vm_clear_dirty_log(struct kvm_vm *vm, int slot, void *log,
>   			    uint64_t first_page, uint32_t num_pages);
> +uint32_t kvm_vm_reset_dirty_ring(struct kvm_vm *vm);
>   
>   int kvm_memcmp_hva_gva(void *hva, struct kvm_vm *vm, const vm_vaddr_t gva,
>   		       size_t len);
> @@ -137,6 +139,7 @@ void vcpu_nested_state_get(struct kvm_vm *vm, uint32_t vcpuid,
>   int vcpu_nested_state_set(struct kvm_vm *vm, uint32_t vcpuid,
>   			  struct kvm_nested_state *state, bool ignore_error);
>   #endif
> +void *vcpu_map_dirty_ring(struct kvm_vm *vm, uint32_t vcpuid);
>   
>   const char *exit_reason_str(unsigned int exit_reason);
>   
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 41cf45416060..a119717bc84c 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -85,6 +85,26 @@ int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap)
>   	return ret;
>   }
>   
> +void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size)
> +{
> +	struct kvm_enable_cap cap = {};
> +	int ret;
> +
> +	ret = kvm_check_cap(KVM_CAP_DIRTY_LOG_RING);
> +
> +	TEST_ASSERT(ret >= 0, "KVM_CAP_DIRTY_LOG_RING");
> +
> +	if (ret == 0) {
> +		fprintf(stderr, "KVM does not support dirty ring, skipping tests\n");
> +		exit(KSFT_SKIP);
> +	}
> +
> +	cap.cap = KVM_CAP_DIRTY_LOG_RING;
> +	cap.args[0] = ring_size;
> +	vm_enable_cap(vm, &cap);
> +	vm->dirty_ring_size = ring_size;
> +}
> +
>   static void vm_open(struct kvm_vm *vm, int perm)
>   {
>   	vm->kvm_fd = open(KVM_DEV_PATH, perm);
> @@ -297,6 +317,11 @@ void kvm_vm_clear_dirty_log(struct kvm_vm *vm, int slot, void *log,
>   		    strerror(-ret));
>   }
>   
> +uint32_t kvm_vm_reset_dirty_ring(struct kvm_vm *vm)
> +{
> +	return ioctl(vm->fd, KVM_RESET_DIRTY_RINGS);
> +}
> +
>   /*
>    * Userspace Memory Region Find
>    *
> @@ -408,6 +433,13 @@ static void vm_vcpu_rm(struct kvm_vm *vm, uint32_t vcpuid)
>   	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
>   	int ret;
>   
> +	if (vcpu->dirty_gfns) {
> +		ret = munmap(vcpu->dirty_gfns, vm->dirty_ring_size);
> +		TEST_ASSERT(ret == 0, "munmap of VCPU dirty ring failed, "
> +			    "rc: %i errno: %i", ret, errno);
> +		vcpu->dirty_gfns = NULL;
> +	}
> +
>   	ret = munmap(vcpu->state, sizeof(*vcpu->state));
>   	TEST_ASSERT(ret == 0, "munmap of VCPU fd failed, rc: %i "
>   		"errno: %i", ret, errno);
> @@ -1409,6 +1441,29 @@ int _vcpu_ioctl(struct kvm_vm *vm, uint32_t vcpuid,
>   	return ret;
>   }
>   
> +void *vcpu_map_dirty_ring(struct kvm_vm *vm, uint32_t vcpuid)
> +{
> +	struct vcpu *vcpu;
> +	uint32_t size = vm->dirty_ring_size;
> +
> +	TEST_ASSERT(size > 0, "Should enable dirty ring first");
> +
> +	vcpu = vcpu_find(vm, vcpuid);
> +
> +	TEST_ASSERT(vcpu, "Cannot find vcpu %u", vcpuid);
> +
> +	if (!vcpu->dirty_gfns) {
> +		vcpu->dirty_gfns_count = size / sizeof(struct kvm_dirty_gfn);
> +		vcpu->dirty_gfns = mmap(NULL, size, PROT_READ | PROT_WRITE,
> +					MAP_SHARED, vcpu->fd, vm->page_size *
> +					KVM_DIRTY_LOG_PAGE_OFFSET);


It looks to me that we don't write to dirty_gfn.

So PROT_READ should be sufficient.
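
(i.e., presumably:)

	vcpu->dirty_gfns = mmap(NULL, size, PROT_READ, MAP_SHARED,
				vcpu->fd, vm->page_size *
				KVM_DIRTY_LOG_PAGE_OFFSET);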

Thanks


> +		TEST_ASSERT(vcpu->dirty_gfns != MAP_FAILED,
> +			    "Dirty ring map failed");
> +	}
> +
> +	return vcpu->dirty_gfns;
> +}
> +
>   /*
>    * VM Ioctl
>    *
> @@ -1503,6 +1558,7 @@ static struct exit_reason {
>   	{KVM_EXIT_INTERNAL_ERROR, "INTERNAL_ERROR"},
>   	{KVM_EXIT_OSI, "OSI"},
>   	{KVM_EXIT_PAPR_HCALL, "PAPR_HCALL"},
> +	{KVM_EXIT_DIRTY_RING_FULL, "DIRTY_RING_FULL"},
>   #ifdef KVM_EXIT_MEMORY_NOT_PRESENT
>   	{KVM_EXIT_MEMORY_NOT_PRESENT, "MEMORY_NOT_PRESENT"},
>   #endif
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util_internal.h b/tools/testing/selftests/kvm/lib/kvm_util_internal.h
> index ac50c42750cf..87edcc6746a2 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util_internal.h
> +++ b/tools/testing/selftests/kvm/lib/kvm_util_internal.h
> @@ -39,6 +39,8 @@ struct vcpu {
>   	uint32_t id;
>   	int fd;
>   	struct kvm_run *state;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +	uint32_t dirty_gfns_count;
>   };
>   
>   struct kvm_vm {
> @@ -61,6 +63,7 @@ struct kvm_vm {
>   	vm_paddr_t pgd;
>   	vm_vaddr_t gdt;
>   	vm_vaddr_t tss;
> +	uint32_t dirty_ring_size;
>   };
>   
>   struct vcpu *vcpu_find(struct kvm_vm *vm, uint32_t vcpuid);



* Re: [PATCH RESEND v2 00/17] KVM: Dirty ring interface
  2019-12-21  1:49 [PATCH RESEND v2 00/17] KVM: Dirty ring interface Peter Xu
                   ` (16 preceding siblings ...)
  2019-12-21  2:04 ` [PATCH RESEND v2 17/17] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
@ 2019-12-24  6:34 ` Jason Wang
  17 siblings, 0 replies; 45+ messages in thread
From: Jason Wang @ 2019-12-24  6:34 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Dr . David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Vitaly Kuznetsov


On 2019/12/21 9:49 AM, Peter Xu wrote:
> * Why not virtio?
>
> There's already some discussion during v1 patchset on whether it's
> good to use virtio for the data path of delivering dirty pages [1].
> I'd confess the only thing that we might consider to use is the vring
> layout (because virtqueue is tightly bound to devices, while we don't
> have a device contet here), however it's a pity that even we only use
> the most low-level vring api it'll be at least iov based which is
> already an overkill for dirty ring (which is literally an array of
> addresses).  So I just kept things easy.


If iov is the only reason, we can simply extend the vringh helpers to
access the descriptors directly.

For the split ring, there is some redundant stuff:

- the dirty ring has the simple assumption used_idx = last_avail_idx (which is
fetch_index), so there's no need for two rings
- the descriptor is self contained (dirty_gfns), so there's no need for another
indirection (but we can reuse vring descriptors for sure)

For the packed ring, it looks like it doesn't, but I'm not sure it's worthwhile to try.
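
(For comparison, the two entry layouts in question -- the dirty ring
entry as used by the selftest in this thread, and a split-ring
descriptor:)

	/* dirty ring entry: self-contained, literally an array of addresses */
	struct kvm_dirty_gfn {
		__u32 pad;
		__u32 slot;	/* (as_id << 16) | slot_id */
		__u64 offset;	/* page offset within the slot */
	};

	/* split virtqueue descriptor: iov-style, points at a buffer */
	struct vring_desc {
		__virtio64 addr;	/* buffer guest-physical address */
		__virtio32 len;
		__virtio16 flags;	/* NEXT / WRITE / INDIRECT */
		__virtio16 next;	/* descriptor chaining */
	};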

Thanks



* Re: [PATCH RESEND v2 15/17] KVM: selftests: Add dirty ring buffer test
  2019-12-21  2:04 ` [PATCH RESEND v2 15/17] KVM: selftests: Add dirty ring buffer test Peter Xu
  2019-12-24  6:18   ` Jason Wang
@ 2019-12-24  6:50   ` Jason Wang
  2019-12-24 15:24     ` Peter Xu
  1 sibling, 1 reply; 45+ messages in thread
From: Jason Wang @ 2019-12-24  6:50 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Dr David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Vitaly Kuznetsov


On 2019/12/21 10:04 AM, Peter Xu wrote:
> Add the initial dirty ring buffer test.
>
> The current test implements the userspace dirty ring collection, by
> only reaping the dirty ring when the ring is full.
>
> So it's still running asynchronously like this:


I guess you meant "synchronously" here.

Thanks


>
>              vcpu                             main thread
>
>    1. vcpu dirties pages
>    2. vcpu gets dirty ring full
>       (userspace exit)
>
>                                         3. main thread waits until full
>                                            (so hardware buffers flushed)
>                                         4. main thread collects
>                                         5. main thread continues vcpu
>
>    6. vcpu continues, goes back to 1



* Re: [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking
  2019-12-24  6:16   ` Jason Wang
@ 2019-12-24 15:08     ` Peter Xu
  2019-12-25  3:23       ` Jason Wang
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2019-12-24 15:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Paolo Bonzini,
	Michael S . Tsirkin, Vitaly Kuznetsov, Lei Cao

On Tue, Dec 24, 2019 at 02:16:04PM +0800, Jason Wang wrote:
> > +struct kvm_dirty_ring {
> > +	u32 dirty_index;
> 
> 
> > Does this always equal indices->avail_index?

Yes, but here we keep dirty_index as the internal one, so we never
need to worry about illegal userspace writes to avail_index (the
kernel then never reads it back).

> 
> 
> > +	u32 reset_index;
> > +	u32 size;
> > +	u32 soft_limit;
> > +	struct kvm_dirty_gfn *dirty_gfns;
> > +	struct kvm_dirty_ring_indices *indices;
> 
> 
> > Any reason to keep dirty gfns and indices in different places? I guess it is
> > because you want to map dirty_gfns as a readonly page but I couldn't find
> > such code...

That's a good point!  We should actually map the dirty gfns as read
only.  I've added the check, something like this:

static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct kvm_vcpu *vcpu = file->private_data;
	unsigned long pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;

	/* Refuse to map any page within the dirty ring as writable */
	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
	    vma->vm_flags & VM_WRITE)
		return -EINVAL;

	vma->vm_ops = &kvm_vcpu_vm_ops;
	return 0;
}

I also changed the test code to cover this case.
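
(Presumably something along the lines of asserting that a writable
mapping of the ring now fails:)

	void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			  vcpu->fd, vm->page_size * KVM_DIRTY_LOG_PAGE_OFFSET);
	TEST_ASSERT(addr == MAP_FAILED,
		    "Writable mapping of the dirty ring should fail");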

[...]

> > +struct kvm_dirty_ring_indices {
> > +	__u32 avail_index; /* set by kernel */
> > +	__u32 fetch_index; /* set by userspace */
> 
> 
> > Would it be better to make those two cacheline aligned?

Yes, Paolo should have mentioned that but I must have missed it!  I
hope I didn't miss anything else.
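
(A hypothetical UAPI-friendly way to do that, using explicit padding
since the kernel's alignment macros aren't usable in userspace headers:)

	struct kvm_dirty_ring_indices {
		__u32 avail_index;	/* set by kernel */
		__u32 padding1[15];	/* pad to a 64-byte cacheline */
		__u32 fetch_index;	/* set by userspace */
		__u32 padding2[15];
	};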

[...]

> > +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > +{
> > +	u32 cur_slot, next_slot;
> > +	u64 cur_offset, next_offset;
> > +	unsigned long mask;
> > +	u32 fetch;
> > +	int count = 0;
> > +	struct kvm_dirty_gfn *entry;
> > +	struct kvm_dirty_ring_indices *indices = ring->indices;
> > +	bool first_round = true;
> > +
> > +	fetch = READ_ONCE(indices->fetch_index);
> > +
> > +	/*
> > +	 * Note that fetch_index is written by the userspace, which
> > +	 * should not be trusted.  If this happens, then it's probably
> > +	 * that the userspace has written a wrong fetch_index.
> > +	 */
> > +	if (fetch - ring->reset_index > ring->size)
> > +		return -EINVAL;
> > +
> > +	if (fetch == ring->reset_index)
> > +		return 0;
> > +
> > +	/* This is only needed to make compilers happy */
> > +	cur_slot = cur_offset = mask = 0;
> > +	while (ring->reset_index != fetch) {
> > +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > +		next_slot = READ_ONCE(entry->slot);
> > +		next_offset = READ_ONCE(entry->offset);
> > +		ring->reset_index++;
> > +		count++;
> > +		/*
> > +		 * Try to coalesce the reset operations when the guest is
> > +		 * scanning pages in the same slot.
> > +		 */
> > +		if (!first_round && next_slot == cur_slot) {
> 
> 
> > Initialize cur_slot to -1, then we can drop first_round here?

cur_slot is unsigned.  We can force cur_slot to be s64, but maybe we
can also simply keep first_round, which is clearer from its name.
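
(For the record, the s64 variant under discussion would look roughly
like this; slot values fit in a u32, so -1 can never collide with a
real one:)

	s64 cur_slot = -1;	/* sentinel: no real slot value is negative */
	u64 cur_offset = 0;

	while (ring->reset_index != fetch) {
		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
		next_slot = READ_ONCE(entry->slot);
		next_offset = READ_ONCE(entry->offset);
		ring->reset_index++;
		count++;
		/* the first_round flag is then unnecessary: */
		if (next_slot == cur_slot) {
			/* ... coalesce as before ... */
		}
	}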

[...]

> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> > +{
> > +	struct kvm_dirty_gfn *entry;
> > +	struct kvm_dirty_ring_indices *indices = ring->indices;
> > +
> > +	/*
> > +	 * Note: here we will start waiting even when only soft-full,
> > +	 * because we can't risk making it completely full, since vcpu0
> > +	 * could use it right after us and if vcpu0's context gets full
> > +	 * it could deadlock if we wait with mmu_lock held.
> > +	 */
> > +	if (kvm_get_running_vcpu() == NULL &&
> > +	    kvm_dirty_ring_soft_full(ring))
> > +		return -EBUSY;
> > +
> > +	/* It will never get completely full with a vcpu context */
> > +	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
> > +
> > +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > +	entry->slot = slot;
> > +	entry->offset = offset;
> > +	smp_wmb();
> 
> 
> Better to add a comment to explain this barrier, e.g. what it pairs with.

Will do.

> 
> 
> > +	ring->dirty_index++;
> > +	WRITE_ONCE(indices->avail_index, ring->dirty_index);
> 
> 
> Is WRITE_ONCE() a must here?

I think not, but it seems clearer that we're publishing something
explicitly to userspace.  Since you asked, I'm actually curious whether
immediate memory writes like this could start to affect perf, from any
of your previous perf work?

Thanks,

-- 
Peter Xu



* Re: [PATCH RESEND v2 15/17] KVM: selftests: Add dirty ring buffer test
  2019-12-24  6:18   ` Jason Wang
@ 2019-12-24 15:22     ` Peter Xu
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2019-12-24 15:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, linux-kernel, Dr David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Vitaly Kuznetsov

On Tue, Dec 24, 2019 at 02:18:37PM +0800, Jason Wang wrote:

[...]

> > +	while (fetch != avail) {
> > +		cur = &dirty_gfns[fetch % TEST_DIRTY_RING_COUNT];
> > +		TEST_ASSERT(cur->pad == 0, "Padding is non-zero: 0x%x", cur->pad);
> > +		TEST_ASSERT(cur->slot == slot, "Slot number didn't match: "
> > +			    "%u != %u", cur->slot, slot);
> > +		TEST_ASSERT(cur->offset < num_pages, "Offset overflow: "
> > +			    "0x%llx >= 0x%llx", cur->offset, num_pages);
> > +		DEBUG("fetch 0x%x offset 0x%llx\n", fetch, cur->offset);
> > +		test_and_set_bit(cur->offset, bitmap);
> > +		fetch++;
> 
> 
> Any reason to use test_and_set_bit()? I guess set_bit() should be
> sufficient.

Yes.

> 
> 
> > +		count++;
> > +	}
> > +	WRITE_ONCE(indices->fetch_index, fetch);
> 
> 
> Is WRITE_ONCE a must here?

No.

[...]

> > +void *vcpu_map_dirty_ring(struct kvm_vm *vm, uint32_t vcpuid)
> > +{
> > +	struct vcpu *vcpu;
> > +	uint32_t size = vm->dirty_ring_size;
> > +
> > +	TEST_ASSERT(size > 0, "Should enable dirty ring first");
> > +
> > +	vcpu = vcpu_find(vm, vcpuid);
> > +
> > +	TEST_ASSERT(vcpu, "Cannot find vcpu %u", vcpuid);
> > +
> > +	if (!vcpu->dirty_gfns) {
> > +		vcpu->dirty_gfns_count = size / sizeof(struct kvm_dirty_gfn);
> > +		vcpu->dirty_gfns = mmap(NULL, size, PROT_READ | PROT_WRITE,
> > +					MAP_SHARED, vcpu->fd, vm->page_size *
> > +					KVM_DIRTY_LOG_PAGE_OFFSET);
> 
> 
> It looks to me that we don't write to dirty_gfn.
> 
> So PROT_READ should be sufficient.

Yes.  Thanks,

-- 
Peter Xu



* Re: [PATCH RESEND v2 15/17] KVM: selftests: Add dirty ring buffer test
  2019-12-24  6:50   ` Jason Wang
@ 2019-12-24 15:24     ` Peter Xu
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2019-12-24 15:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, linux-kernel, Dr David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Vitaly Kuznetsov

On Tue, Dec 24, 2019 at 02:50:48PM +0800, Jason Wang wrote:
> 
On 2019/12/21 10:04 AM, Peter Xu wrote:
> > Add the initial dirty ring buffer test.
> > 
> > The current test implements the userspace dirty ring collection, by
> > only reaping the dirty ring when the ring is full.
> > 
> > So it's still running asynchronously like this:
> 
> 
> I guess you meant "synchronously" here.

Yes, definitely. :)

-- 
Peter Xu



* Re: [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking
  2019-12-24 15:08     ` Peter Xu
@ 2019-12-25  3:23       ` Jason Wang
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Wang @ 2019-12-25  3:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Paolo Bonzini,
	Michael S . Tsirkin, Vitaly Kuznetsov, Lei Cao


On 2019/12/24 11:08 PM, Peter Xu wrote:
> On Tue, Dec 24, 2019 at 02:16:04PM +0800, Jason Wang wrote:
>>> +struct kvm_dirty_ring {
>>> +	u32 dirty_index;
>>
>> Does this always equal indices->avail_index?
> Yes, but here we keep dirty_index as the internal one, so we never
> need to worry about illegal userspace writes to avail_index (then we
> never read it from kernel).


I get you. But I'm not sure it's worth the bother. We met a similar issue
in virtio: the used_idx is not expected to be written by userspace. We
simply add checks.

But anyway, I'm fine if you want to keep it (maybe with a comment to 
explain).


>
>>
>>> +	u32 reset_index;
>>> +	u32 size;
>>> +	u32 soft_limit;
>>> +	struct kvm_dirty_gfn *dirty_gfns;
>>> +	struct kvm_dirty_ring_indices *indices;
>>
>> Any reason to keep dirty gfns and indices in different places? I guess it is
>> because you want to map dirty_gfns as a readonly page but I couldn't find
>> such code...
> That's a good point!  We should actually map the dirty gfns as read
> only.  I've added the check, something like this:
>
> static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
> {
> 	struct kvm_vcpu *vcpu = file->private_data;
> 	unsigned long pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>
> 	/* Refuse to map any page within the dirty ring as writable */
> 	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
> 	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
> 	    vma->vm_flags & VM_WRITE)
> 		return -EINVAL;
>
> 	vma->vm_ops = &kvm_vcpu_vm_ops;
> 	return 0;
> }
>
> I also changed the test code to cover this case.
>
> [...]


Looks good.


>
>>> +struct kvm_dirty_ring_indices {
>>> +	__u32 avail_index; /* set by kernel */
>>> +	__u32 fetch_index; /* set by userspace */
>>
>> Would it be better to make those two cacheline aligned?
> Yes, Paolo should have mentioned that but I must have missed it!  I
> hope I didn't miss anything else.
>
> [...]
>
>>> +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
>>> +{
>>> +	u32 cur_slot, next_slot;
>>> +	u64 cur_offset, next_offset;
>>> +	unsigned long mask;
>>> +	u32 fetch;
>>> +	int count = 0;
>>> +	struct kvm_dirty_gfn *entry;
>>> +	struct kvm_dirty_ring_indices *indices = ring->indices;
>>> +	bool first_round = true;
>>> +
>>> +	fetch = READ_ONCE(indices->fetch_index);
>>> +
>>> +	/*
>>> +	 * Note that fetch_index is written by the userspace, which
>>> +	 * should not be trusted.  If this happens, then it's probably
>>> +	 * that the userspace has written a wrong fetch_index.
>>> +	 */
>>> +	if (fetch - ring->reset_index > ring->size)
>>> +		return -EINVAL;
>>> +
>>> +	if (fetch == ring->reset_index)
>>> +		return 0;
>>> +
>>> +	/* This is only needed to make compilers happy */
>>> +	cur_slot = cur_offset = mask = 0;
>>> +	while (ring->reset_index != fetch) {
>>> +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
>>> +		next_slot = READ_ONCE(entry->slot);
>>> +		next_offset = READ_ONCE(entry->offset);
>>> +		ring->reset_index++;
>>> +		count++;
>>> +		/*
>>> +		 * Try to coalesce the reset operations when the guest is
>>> +		 * scanning pages in the same slot.
>>> +		 */
>>> +		if (!first_round && next_slot == cur_slot) {
>>
>> Initialize cur_slot to -1, then we can drop first_round here?
> cur_slot is unsigned.  We can force cur_slot to be s64, but maybe we
> can also simply keep first_round, which is clearer from its name.
>
> [...]


Sure.


>
>>> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
>>> +{
>>> +	struct kvm_dirty_gfn *entry;
>>> +	struct kvm_dirty_ring_indices *indices = ring->indices;
>>> +
>>> +	/*
>>> +	 * Note: here we will start waiting even when only soft-full,
>>> +	 * because we can't risk making it completely full, since vcpu0
>>> +	 * could use it right after us and if vcpu0's context gets full
>>> +	 * it could deadlock if we wait with mmu_lock held.
>>> +	 */
>>> +	if (kvm_get_running_vcpu() == NULL &&
>>> +	    kvm_dirty_ring_soft_full(ring))
>>> +		return -EBUSY;
>>> +
>>> +	/* It will never get completely full with a vcpu context */
>>> +	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
>>> +
>>> +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
>>> +	entry->slot = slot;
>>> +	entry->offset = offset;
>>> +	smp_wmb();
>>
>> Better to add a comment to explain this barrier, e.g. what it pairs with.
> Will do.
>
>>
>>> +	ring->dirty_index++;
>>> +	WRITE_ONCE(indices->avail_index, ring->dirty_index);
>>
>> Is WRITE_ONCE() a must here?
> I think not, but it seems clearer that we're publishing something
> explicitly to userspace.  Since you asked, I'm actually curious whether
> immediate memory writes like this could start to affect perf, from any
> of your previous perf work?


I never measured the impact of a specific WRITE_ONCE(). But we don't do
this in virtio/vhost. Maybe the maintainers can give more comments on this.

Thanks


>
> Thanks,
>



* Re: [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking
  2019-12-21  1:49 ` [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
  2019-12-24  6:16   ` Jason Wang
@ 2020-01-08 15:52   ` Peter Xu
  2020-01-08 17:41     ` Paolo Bonzini
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-01-08 15:52 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Dr . David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Paolo Bonzini, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov, Lei Cao

On Fri, Dec 20, 2019 at 08:49:29PM -0500, Peter Xu wrote:
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> +{
> +	struct kvm_dirty_gfn *entry;
> +	struct kvm_dirty_ring_indices *indices = ring->indices;
> +
> +	/*
> +	 * Note: here we will start waiting even when only soft-full,
> +	 * because we can't risk making it completely full, since vcpu0
> +	 * could use it right after us and if vcpu0's context gets full
> +	 * it could deadlock if we wait with mmu_lock held.
> +	 */
> +	if (kvm_get_running_vcpu() == NULL &&
> +	    kvm_dirty_ring_soft_full(ring))
> +		return -EBUSY;

I plan to repost next week, but before that I'd like to know whether
there's any further (negative) feedback design-wise, especially
here, where it's still a bit tricky to work around the kvmgt issue.

Now we still have the waitqueue but it'll only be used for
no-vcpu-context dirtyings, so:

- For no-vcpu-context: the thread could wait in the waitqueue if it
  makes vcpu0's ring soft-full (note, previously it waited only on
  hard-full, so here we start waiting earlier, to make sure the ring
  never gets completely full)

- For with-vcpu-context: we should never wait, guaranteed by the fact
  that KVM_RUN will now return if that vcpu's ring is soft-full, and
  the above waitqueue will make sure even vcpu0's ring won't be
  filled up by kvmgt

Again, this is still a workaround for kvmgt, and I think it should not
be needed after the refactoring.  It's just a way to avoid depending on
that work, so this series should work even with the current kvmgt.
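
(The with-vcpu-context half, as described, presumably reduces to a check
on the KVM_RUN path; the exact placement and return value here are
assumptions, not quoted from the series:)

	if (kvm_dirty_ring_soft_full(&vcpu->dirty_ring)) {
		/* back out to userspace to collect and reset the rings */
		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
		return 0;
	}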

> +
> +	/* It will never get completely full with a vcpu context */
> +	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
> +
> +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> +	entry->slot = slot;
> +	entry->offset = offset;
> +	smp_wmb();
> +	ring->dirty_index++;
> +	WRITE_ONCE(indices->avail_index, ring->dirty_index);
> +
> +	trace_kvm_dirty_ring_push(ring, slot, offset);
> +
> +	return 0;
> +}

-- 
Peter Xu



* Re: [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-08 15:52   ` Peter Xu
@ 2020-01-08 17:41     ` Paolo Bonzini
  2020-01-08 19:06       ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Paolo Bonzini @ 2020-01-08 17:41 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Dr . David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Michael S . Tsirkin, Jason Wang,
	Vitaly Kuznetsov, Lei Cao

On 08/01/20 16:52, Peter Xu wrote:
> here, where it's still a bit tricky to work around the kvmgt issue.
> 
> Now we still have the waitqueue but it'll only be used for
> no-vcpu-context dirtyings, so:
> 
> - For no-vcpu-context: the thread could wait in the waitqueue if it
>   makes vcpu0's ring soft-full (note, previously it waited only on
>   hard-full, so here we start waiting earlier, to make sure the ring
>   never gets completely full)
> 
> - For with-vcpu-context: we should never wait, guaranteed by the fact
>   that KVM_RUN will now return if that vcpu's ring is soft-full, and
>   the above waitqueue will make sure even vcpu0's ring won't be
>   filled up by kvmgt
> 
> Again, this is still a workaround for kvmgt, and I think it should not
> be needed after the refactoring.  It's just a way to avoid depending on
> that work, so this series should work even with the current kvmgt.

The kvmgt patches were posted, you could just include them in your next
series and clean everything up.  You can get them at
https://patchwork.kernel.org/cover/11316219/.

Paolo



* Re: [PATCH RESEND v2 01/17] KVM: Remove kvm_read_guest_atomic()
  2019-12-21  1:49 ` [PATCH RESEND v2 01/17] KVM: Remove kvm_read_guest_atomic() Peter Xu
@ 2020-01-08 17:45   ` Paolo Bonzini
  0 siblings, 0 replies; 45+ messages in thread
From: Paolo Bonzini @ 2020-01-08 17:45 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Dr . David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Michael S . Tsirkin, Jason Wang,
	Vitaly Kuznetsov

On 21/12/19 02:49, Peter Xu wrote:
> Remove kvm_read_guest_atomic() because it's not used anywhere.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/kvm_host.h |  2 --
>  virt/kvm/kvm_main.c      | 11 -----------
>  2 files changed, 13 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d41c521a39da..2ea1ea79befd 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -730,8 +730,6 @@ void kvm_get_pfn(kvm_pfn_t pfn);
>  
>  int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>  			int len);
> -int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
> -			  unsigned long len);
>  int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
>  int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>  			   void *data, unsigned long len);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 13efc291b1c7..7ee28af9eb48 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2039,17 +2039,6 @@ static int __kvm_read_guest_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
>  	return 0;
>  }
>  
> -int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
> -			  unsigned long len)
> -{
> -	gfn_t gfn = gpa >> PAGE_SHIFT;
> -	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> -	int offset = offset_in_page(gpa);
> -
> -	return __kvm_read_guest_atomic(slot, gfn, data, offset, len);
> -}
> -EXPORT_SYMBOL_GPL(kvm_read_guest_atomic);
> -
>  int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
>  			       void *data, unsigned long len)
>  {
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>



* Re: [PATCH RESEND v2 02/17] KVM: X86: Change parameter for fast_page_fault tracepoint
  2019-12-21  1:49 ` [PATCH RESEND v2 02/17] KVM: X86: Change parameter for fast_page_fault tracepoint Peter Xu
@ 2020-01-08 17:46   ` Paolo Bonzini
  0 siblings, 0 replies; 45+ messages in thread
From: Paolo Bonzini @ 2020-01-08 17:46 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Dr . David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Michael S . Tsirkin, Jason Wang,
	Vitaly Kuznetsov

On 21/12/19 02:49, Peter Xu wrote:
> It would be clearer to dump the return value, to easily see whether
> we went through the fast path for handling the current page fault.
> Remove the old last two parameters because the old/new sptes are
> already dumped in the same line.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/kvm/mmutrace.h | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
> index 7ca8831c7d1a..09bdc5c91650 100644
> --- a/arch/x86/kvm/mmutrace.h
> +++ b/arch/x86/kvm/mmutrace.h
> @@ -244,9 +244,6 @@ TRACE_EVENT(
>  		  __entry->access)
>  );
>  
> -#define __spte_satisfied(__spte)				\
> -	(__entry->retry && is_writable_pte(__entry->__spte))
> -
>  TRACE_EVENT(
>  	fast_page_fault,
>  	TP_PROTO(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code,
> @@ -274,12 +271,10 @@ TRACE_EVENT(
>  	),
>  
>  	TP_printk("vcpu %d gva %lx error_code %s sptep %p old %#llx"
> -		  " new %llx spurious %d fixed %d", __entry->vcpu_id,
> +		  " new %llx ret %d", __entry->vcpu_id,
>  		  __entry->gva, __print_flags(__entry->error_code, "|",
>  		  kvm_mmu_trace_pferr_flags), __entry->sptep,
> -		  __entry->old_spte, __entry->new_spte,
> -		  __spte_satisfied(old_spte), __spte_satisfied(new_spte)
> -	)
> +		  __entry->old_spte, __entry->new_spte, __entry->retry)
>  );
>  
>  TRACE_EVENT(
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>



* Re: [PATCH RESEND v2 03/17] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2019-12-23 20:10         ` Peter Xu
@ 2020-01-08 17:46           ` Paolo Bonzini
  2020-01-08 19:15             ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Paolo Bonzini @ 2020-01-08 17:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

On 23/12/19 21:10, Peter Xu wrote:
>> Yes, kvm->slots_lock is taken by x86_set_memory_region.  We need to move
>> that to the callers, of which several are already taking the lock (all
>> except vmx_set_tss_addr and kvm_arch_destroy_vm).
> OK, will do.  I'll directly replace the x86_set_memory_region() calls
> in kvm_arch_destroy_vm() with __x86_set_memory_region() since IIUC
> the slots_lock is of no use when destroying the vm... then drop the
> x86_set_memory_region() helper in the next version.  Thanks,

Be careful because it may cause issues with lockdep.  Better just take
the lock.

Paolo



* Re: [PATCH RESEND v2 04/17] KVM: Cache as_id in kvm_memory_slot
  2019-12-21  1:49 ` [PATCH RESEND v2 04/17] KVM: Cache as_id in kvm_memory_slot Peter Xu
@ 2020-01-08 17:47   ` Paolo Bonzini
  0 siblings, 0 replies; 45+ messages in thread
From: Paolo Bonzini @ 2020-01-08 17:47 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Dr . David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Michael S . Tsirkin, Jason Wang,
	Vitaly Kuznetsov

On 21/12/19 02:49, Peter Xu wrote:
> Let's cache the address space ID just like the slot ID.

Please add a note that it will be useful in order to fill in the dirty
page ring buffer.

Paolo

> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/kvm_host.h | 1 +
>  virt/kvm/kvm_main.c      | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4e34cf97ca90..24854c9e3717 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -348,6 +348,7 @@ struct kvm_memory_slot {
>  	unsigned long userspace_addr;
>  	u32 flags;
>  	short id;
> +	u8 as_id;
>  };
>  
>  static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b1047173d78e..cea4b8dd4ac9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1027,6 +1027,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  
>  	new = old = *slot;
>  
> +	BUILD_BUG_ON(U8_MAX < KVM_ADDRESS_SPACE_NUM);
> +	new.as_id = as_id;
>  	new.id = id;
>  	new.base_gfn = base_gfn;
>  	new.npages = npages;
> 



* Re: [PATCH RESEND v2 06/17] KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
  2019-12-21  1:49 ` [PATCH RESEND v2 06/17] KVM: Pass in kvm pointer into mark_page_dirty_in_slot() Peter Xu
@ 2020-01-08 17:47   ` Paolo Bonzini
  0 siblings, 0 replies; 45+ messages in thread
From: Paolo Bonzini @ 2020-01-08 17:47 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Dr . David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Michael S . Tsirkin, Jason Wang,
	Vitaly Kuznetsov

On 21/12/19 02:49, Peter Xu wrote:
> The context will be needed to implement the kvm dirty ring.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  virt/kvm/kvm_main.c | 24 ++++++++++++++----------
>  1 file changed, 14 insertions(+), 10 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index c80a363831ae..17969cf110dd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -144,7 +144,9 @@ static void hardware_disable_all(void);
>  
>  static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
>  
> -static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
> +static void mark_page_dirty_in_slot(struct kvm *kvm,
> +				    struct kvm_memory_slot *memslot,
> +				    gfn_t gfn);
>  
>  __visible bool kvm_rebooting;
>  EXPORT_SYMBOL_GPL(kvm_rebooting);
> @@ -2053,8 +2055,9 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
>  
> -static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
> -			          const void *data, int offset, int len,
> +static int __kvm_write_guest_page(struct kvm *kvm,
> +				  struct kvm_memory_slot *memslot, gfn_t gfn,
> +				  const void *data, int offset, int len,
>  				  bool track_dirty)
>  {
>  	int r;
> @@ -2067,7 +2070,7 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
>  	if (r)
>  		return -EFAULT;
>  	if (track_dirty)
> -		mark_page_dirty_in_slot(memslot, gfn);
> +		mark_page_dirty_in_slot(kvm, memslot, gfn);
>  	return 0;
>  }
>  
> @@ -2077,7 +2080,7 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
>  {
>  	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>  
> -	return __kvm_write_guest_page(slot, gfn, data, offset, len,
> +	return __kvm_write_guest_page(kvm, slot, gfn, data, offset, len,
>  				      track_dirty);
>  }
>  EXPORT_SYMBOL_GPL(kvm_write_guest_page);
> @@ -2087,7 +2090,7 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  {
>  	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>  
> -	return __kvm_write_guest_page(slot, gfn, data, offset,
> +	return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset,
>  				      len, true);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
> @@ -2202,7 +2205,7 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
>  	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
>  	if (r)
>  		return -EFAULT;
> -	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
> +	mark_page_dirty_in_slot(kvm, ghc->memslot, gpa >> PAGE_SHIFT);
>  
>  	return 0;
>  }
> @@ -2269,7 +2272,8 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
>  }
>  EXPORT_SYMBOL_GPL(kvm_clear_guest);
>  
> -static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
> +static void mark_page_dirty_in_slot(struct kvm *kvm,
> +				    struct kvm_memory_slot *memslot,
>  				    gfn_t gfn)
>  {
>  	if (memslot && memslot->dirty_bitmap) {
> @@ -2284,7 +2288,7 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
>  	struct kvm_memory_slot *memslot;
>  
>  	memslot = gfn_to_memslot(kvm, gfn);
> -	mark_page_dirty_in_slot(memslot, gfn);
> +	mark_page_dirty_in_slot(kvm, memslot, gfn);
>  }
>  EXPORT_SYMBOL_GPL(mark_page_dirty);
>  
> @@ -2293,7 +2297,7 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
>  	struct kvm_memory_slot *memslot;
>  
>  	memslot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> -	mark_page_dirty_in_slot(memslot, gfn);
> +	mark_page_dirty_in_slot(vcpu->kvm, memslot, gfn);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>  
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>



* Re: [PATCH RESEND v2 07/17] KVM: Move running VCPU from ARM to common code
  2019-12-21  1:49 ` [PATCH RESEND v2 07/17] KVM: Move running VCPU from ARM to common code Peter Xu
@ 2020-01-08 17:47   ` Paolo Bonzini
  0 siblings, 0 replies; 45+ messages in thread
From: Paolo Bonzini @ 2020-01-08 17:47 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Dr . David Alan Gilbert, Christophe de Dinechin,
	Sean Christopherson, Michael S . Tsirkin, Jason Wang,
	Vitaly Kuznetsov

On 21/12/19 02:49, Peter Xu wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> 
> For ring-based dirty log tracking, it will be more efficient to account
> writes during schedule-out or schedule-in to the currently running VCPU.
> We would like to do it even if the write doesn't use the current VCPU's
> address space, as is the case for cached writes (see commit 4e335d9e7ddb,
> "Revert "KVM: Support vCPU-based gfn->hva cache"", 2017-05-02).
> 
> Therefore, add a mechanism to track the currently-loaded kvm_vcpu struct.
> There is already something similar in KVM/ARM; one important difference
> is that kvm_arch_vcpu_{load,put} have two callers in virt/kvm/kvm_main.c:
> we have to update both the architecture-independent vcpu_{load,put} and
> the preempt notifiers.
> 
> Another change made in the process is to allow using kvm_get_running_vcpu()
> in preemptible code.  This is allowed because preempt notifiers ensure
> that the value does not change even after the VCPU thread is migrated.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/arm/include/asm/kvm_host.h   |  2 --
>  arch/arm64/include/asm/kvm_host.h |  2 --
>  include/linux/kvm_host.h          |  3 +++
>  virt/kvm/arm/arch_timer.c         |  2 +-
>  virt/kvm/arm/arm.c                | 29 -----------------------------
>  virt/kvm/arm/perf.c               |  6 +++---
>  virt/kvm/arm/vgic/vgic-mmio.c     | 15 +++------------
>  virt/kvm/kvm_main.c               | 25 ++++++++++++++++++++++++-
>  8 files changed, 34 insertions(+), 50 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 8a37c8e89777..40eff9cc3744 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -274,8 +274,6 @@ int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
>  int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
>  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
>  
> -struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
> -struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
>  void kvm_arm_halt_guest(struct kvm *kvm);
>  void kvm_arm_resume_guest(struct kvm *kvm);
>  
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index f656169db8c3..df8d72f7c20e 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -430,8 +430,6 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>  int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
>  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
>  
> -struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
> -struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
>  void kvm_arm_halt_guest(struct kvm *kvm);
>  void kvm_arm_resume_guest(struct kvm *kvm);
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 24854c9e3717..b4f7bef38e0d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1323,6 +1323,9 @@ static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
>  }
>  #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
>  
> +struct kvm_vcpu *kvm_get_running_vcpu(void);
> +struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
> +
>  #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
>  bool kvm_arch_has_irq_bypass(void);
>  int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *,
> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> index e2bb5bd60227..085e7fed850c 100644
> --- a/virt/kvm/arm/arch_timer.c
> +++ b/virt/kvm/arm/arch_timer.c
> @@ -1022,7 +1022,7 @@ static bool timer_irqs_are_valid(struct kvm_vcpu *vcpu)
>  
>  bool kvm_arch_timer_get_input_level(int vintid)
>  {
> -	struct kvm_vcpu *vcpu = kvm_arm_get_running_vcpu();
> +	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>  	struct arch_timer_context *timer;
>  
>  	if (vintid == vcpu_vtimer(vcpu)->irq.irq)
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 86c6aa1cb58e..f7dbb94ec525 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -47,9 +47,6 @@ __asm__(".arch_extension	virt");
>  DEFINE_PER_CPU(kvm_host_data_t, kvm_host_data);
>  static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
>  
> -/* Per-CPU variable containing the currently running vcpu. */
> -static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu);
> -
>  /* The VMID used in the VTTBR */
>  static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
>  static u32 kvm_next_vmid;
> @@ -58,31 +55,8 @@ static DEFINE_SPINLOCK(kvm_vmid_lock);
>  static bool vgic_present;
>  
>  static DEFINE_PER_CPU(unsigned char, kvm_arm_hardware_enabled);
> -
> -static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu)
> -{
> -	__this_cpu_write(kvm_arm_running_vcpu, vcpu);
> -}
> -
>  DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
>  
> -/**
> - * kvm_arm_get_running_vcpu - get the vcpu running on the current CPU.
> - * Must be called from non-preemptible context
> - */
> -struct kvm_vcpu *kvm_arm_get_running_vcpu(void)
> -{
> -	return __this_cpu_read(kvm_arm_running_vcpu);
> -}
> -
> -/**
> - * kvm_arm_get_running_vcpus - get the per-CPU array of currently running vcpus.
> - */
> -struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
> -{
> -	return &kvm_arm_running_vcpu;
> -}
> -
>  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>  {
>  	return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
> @@ -374,7 +348,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	vcpu->cpu = cpu;
>  	vcpu->arch.host_cpu_context = &cpu_data->host_ctxt;
>  
> -	kvm_arm_set_running_vcpu(vcpu);
>  	kvm_vgic_load(vcpu);
>  	kvm_timer_vcpu_load(vcpu);
>  	kvm_vcpu_load_sysregs(vcpu);
> @@ -398,8 +371,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  	kvm_vcpu_pmu_restore_host(vcpu);
>  
>  	vcpu->cpu = -1;
> -
> -	kvm_arm_set_running_vcpu(NULL);
>  }
>  
>  static void vcpu_power_off(struct kvm_vcpu *vcpu)
> diff --git a/virt/kvm/arm/perf.c b/virt/kvm/arm/perf.c
> index 918cdc3839ea..d45b8b9a4415 100644
> --- a/virt/kvm/arm/perf.c
> +++ b/virt/kvm/arm/perf.c
> @@ -13,14 +13,14 @@
>  
>  static int kvm_is_in_guest(void)
>  {
> -        return kvm_arm_get_running_vcpu() != NULL;
> +        return kvm_get_running_vcpu() != NULL;
>  }
>  
>  static int kvm_is_user_mode(void)
>  {
>  	struct kvm_vcpu *vcpu;
>  
> -	vcpu = kvm_arm_get_running_vcpu();
> +	vcpu = kvm_get_running_vcpu();
>  
>  	if (vcpu)
>  		return !vcpu_mode_priv(vcpu);
> @@ -32,7 +32,7 @@ static unsigned long kvm_get_guest_ip(void)
>  {
>  	struct kvm_vcpu *vcpu;
>  
> -	vcpu = kvm_arm_get_running_vcpu();
> +	vcpu = kvm_get_running_vcpu();
>  
>  	if (vcpu)
>  		return *vcpu_pc(vcpu);
> diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c
> index 0d090482720d..d656ebd5f9d4 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio.c
> @@ -190,15 +190,6 @@ unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu,
>   * value later will give us the same value as we update the per-CPU variable
>   * in the preempt notifier handlers.
>   */
> -static struct kvm_vcpu *vgic_get_mmio_requester_vcpu(void)
> -{
> -	struct kvm_vcpu *vcpu;
> -
> -	preempt_disable();
> -	vcpu = kvm_arm_get_running_vcpu();
> -	preempt_enable();
> -	return vcpu;
> -}
>  
>  /* Must be called with irq->irq_lock held */
>  static void vgic_hw_irq_spending(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
> @@ -221,7 +212,7 @@ void vgic_mmio_write_spending(struct kvm_vcpu *vcpu,
>  			      gpa_t addr, unsigned int len,
>  			      unsigned long val)
>  {
> -	bool is_uaccess = !vgic_get_mmio_requester_vcpu();
> +	bool is_uaccess = !kvm_get_running_vcpu();
>  	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
>  	int i;
>  	unsigned long flags;
> @@ -274,7 +265,7 @@ void vgic_mmio_write_cpending(struct kvm_vcpu *vcpu,
>  			      gpa_t addr, unsigned int len,
>  			      unsigned long val)
>  {
> -	bool is_uaccess = !vgic_get_mmio_requester_vcpu();
> +	bool is_uaccess = !kvm_get_running_vcpu();
>  	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
>  	int i;
>  	unsigned long flags;
> @@ -335,7 +326,7 @@ static void vgic_mmio_change_active(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
>  				    bool active)
>  {
>  	unsigned long flags;
> -	struct kvm_vcpu *requester_vcpu = vgic_get_mmio_requester_vcpu();
> +	struct kvm_vcpu *requester_vcpu = kvm_get_running_vcpu();
>  
>  	raw_spin_lock_irqsave(&irq->irq_lock, flags);
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 17969cf110dd..5c606d158854 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -108,6 +108,7 @@ struct kmem_cache *kvm_vcpu_cache;
>  EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
>  
>  static __read_mostly struct preempt_ops kvm_preempt_ops;
> +static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
>  
>  struct dentry *kvm_debugfs_dir;
>  EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
> @@ -199,6 +200,8 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>  void vcpu_load(struct kvm_vcpu *vcpu)
>  {
>  	int cpu = get_cpu();
> +
> +	__this_cpu_write(kvm_running_vcpu, vcpu);
>  	preempt_notifier_register(&vcpu->preempt_notifier);
>  	kvm_arch_vcpu_load(vcpu, cpu);
>  	put_cpu();
> @@ -210,6 +213,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
>  	preempt_disable();
>  	kvm_arch_vcpu_put(vcpu);
>  	preempt_notifier_unregister(&vcpu->preempt_notifier);
> +	__this_cpu_write(kvm_running_vcpu, NULL);
>  	preempt_enable();
>  }
>  EXPORT_SYMBOL_GPL(vcpu_put);
> @@ -4294,8 +4298,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
>  	WRITE_ONCE(vcpu->preempted, false);
>  	WRITE_ONCE(vcpu->ready, false);
>  
> +	__this_cpu_write(kvm_running_vcpu, vcpu);
>  	kvm_arch_sched_in(vcpu, cpu);
> -
>  	kvm_arch_vcpu_load(vcpu, cpu);
>  }
>  
> @@ -4309,6 +4313,25 @@ static void kvm_sched_out(struct preempt_notifier *pn,
>  		WRITE_ONCE(vcpu->ready, true);
>  	}
>  	kvm_arch_vcpu_put(vcpu);
> +	__this_cpu_write(kvm_running_vcpu, NULL);
> +}
> +
> +/**
> + * kvm_get_running_vcpu - get the vcpu running on the current CPU.
> + * Thanks to preempt notifiers, this can also be called from
> + * preemptible context.
> + */
> +struct kvm_vcpu *kvm_get_running_vcpu(void)
> +{
> +        return __this_cpu_read(kvm_running_vcpu);
> +}
> +
> +/**
> + * kvm_get_running_vcpus - get the per-CPU array of currently running vcpus.
> + */
> +struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
> +{
> +        return &kvm_running_vcpu;
>  }
>  
>  static void check_processor_compat(void *rtn)
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
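
To make the commit message concrete: after this patch a caller can read
the running vcpu without disabling preemption, because kvm_sched_out()
clears the per-CPU slot and kvm_sched_in() refills it whenever the vcpu
thread migrates.  A minimal sketch of such a caller -- the accounting
helpers named here are illustrative assumptions, not part of the patch:

    void account_write(struct kvm *kvm, gfn_t gfn)
    {
            /*
             * Safe in preemptible context: if this thread migrates CPUs,
             * the preempt notifiers rewrite the new CPU's slot before it
             * runs again, so the value read here stays correct.
             */
            struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

            if (vcpu)
                    account_to_vcpu_ring(vcpu, gfn);  /* hypothetical helper */
            else
                    account_without_vcpu(kvm, gfn);   /* e.g. kvmgt-style write */
    }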


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-08 17:41     ` Paolo Bonzini
@ 2020-01-08 19:06       ` Peter Xu
  2020-01-08 19:44         ` Paolo Bonzini
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-01-08 19:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov, Lei Cao

On Wed, Jan 08, 2020 at 06:41:06PM +0100, Paolo Bonzini wrote:
> On 08/01/20 16:52, Peter Xu wrote:
> > here, which is still a bit tricky to make up for the kvmgt issue.
> > 
> > Now we still have the waitqueue but it'll only be used for
> > no-vcpu-context dirtyings, so:
> > 
> > - For no-vcpu-context: the thread can wait in the waitqueue if it
> >   makes vcpu0's ring soft-full (note: previously the trigger was
> >   hard-full, so waiting already at soft-full makes sure the ring can
> >   never go hard-full)
> > 
> > - For with-vcpu-context: we should never wait, guaranteed by the fact
> >   that KVM_RUN now returns once that vcpu's ring is soft-full, and the
> >   waitqueue above makes sure even vcpu0's ring won't be filled up by
> >   kvmgt
> > 
> > Again, this is still a workaround for kvmgt, and I think it should not
> > be needed after the refactoring.  It's just a way to avoid depending on
> > that work, so this series works even with the current kvmgt.
> 
> The kvmgt patches were posted, you could just include them in your next
> series and clean everything up.  You can get them at
> https://patchwork.kernel.org/cover/11316219/.

Good to know!

Maybe I'll simply drop all the redundant parts in the dirty ring series,
assuming that series is there?  These patchsets should not overlap with
each other (so it looks more like an ordering constraint for merging).

Thanks,

-- 
Peter Xu
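
A rough sketch of the soft-full scheme described above; the structure
and helper names are assumptions modeled on this discussion, not code
lifted from the series:

    /* Assumed ring bookkeeping; u32 arithmetic handles index wrap. */
    struct dirty_ring {
            u32 dirty_index;    /* producer: next slot KVM fills */
            u32 reset_index;    /* consumer: last slot userspace collected */
            u32 soft_limit;     /* below the hard ring size, leaving headroom */
    };

    static bool dirty_ring_soft_full(struct dirty_ring *ring)
    {
            return ring->dirty_index - ring->reset_index >= ring->soft_limit;
    }

    /*
     * vcpu context: never sleep.  KVM_RUN bails out when the vcpu's own
     * ring is soft-full, so userspace must collect it before re-entering.
     * Only no-vcpu-context writers (the kvmgt fallback onto vcpu0's ring)
     * may wait on the waitqueue, and the headroom between soft_limit and
     * the hard size keeps that fallback from ever hitting hard-full.
     */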


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RESEND v2 03/17] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-08 17:46           ` Paolo Bonzini
@ 2020-01-08 19:15             ` Peter Xu
  2020-01-08 19:44               ` Paolo Bonzini
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-01-08 19:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

On Wed, Jan 08, 2020 at 06:46:30PM +0100, Paolo Bonzini wrote:
> On 23/12/19 21:10, Peter Xu wrote:
> >> Yes, kvm->slots_lock is taken by x86_set_memory_region.  We need to move
> >> that to the callers, of which several are already taking the lock (all
> >> except vmx_set_tss_addr and kvm_arch_destroy_vm).
> > OK, will do.  I'll directly replace the x86_set_memory_region() calls
> > in kvm_arch_destroy_vm() to be __x86_set_memory_region() since IIUC
> > the slots_lock is helpless when destroying the vm... then drop the
> > x86_set_memory_region() helper in the next version.  Thanks,
> 
> Be careful because it may cause issues with lockdep.  Better just take
> the lock.

But you seem to have fixed that already? :)

3898da947bba ("KVM: avoid using rcu_dereference_protected", 2017-08-02)

And this path is after kvm_destroy_vm(), so kvm->users_count should be 0.
Otherwise I feel like we'd need to take the lock in more places...

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RESEND v2 03/17] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-08 19:15             ` Peter Xu
@ 2020-01-08 19:44               ` Paolo Bonzini
  2020-01-08 21:02                 ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Paolo Bonzini @ 2020-01-08 19:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

On 08/01/20 20:15, Peter Xu wrote:
> But you seem to have fixed that already? :)

Perhaps. :)

> 3898da947bba ("KVM: avoid using rcu_dereference_protected", 2017-08-02)
> 
> And this path is after kvm_destroy_vm(), so kvm->users_count should be 0.
> Otherwise I feel like we'd need to take the lock in more places...

Yeah, it should be okay assuming you test with lockdep.

Paolo
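
The locking direction being agreed on here, sketched with the helper
names used in the thread (treat the exact call sites and arguments as
assumptions):

    /* Callers take kvm->slots_lock themselves and call the
     * double-underscore variant directly, e.g. when setting the TSS
     * memslot: */
    mutex_lock(&kvm->slots_lock);
    r = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
                                PAGE_SIZE * 3);
    mutex_unlock(&kvm->slots_lock);

    /* ...while kvm_arch_destroy_vm() calls __x86_set_memory_region()
     * without the lock, on the grounds that kvm->users_count is already
     * 0 there -- which is what the lockdep testing above is meant to
     * confirm. */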


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-08 19:06       ` Peter Xu
@ 2020-01-08 19:44         ` Paolo Bonzini
  2020-01-08 19:59           ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Paolo Bonzini @ 2020-01-08 19:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov, Lei Cao

On 08/01/20 20:06, Peter Xu wrote:
>> The kvmgt patches were posted, you could just include them in your next
>> series and clean everything up.  You can get them at
>> https://patchwork.kernel.org/cover/11316219/.
> Good to know!
> 
> Maybe I'll simply drop all the redundant parts in the dirty ring series,
> assuming that series is there?  These patchsets should not overlap with
> each other (so it looks more like an ordering constraint for merging).

Just include the patches, we'll make sure to get an ACK from Alex.

Paolo


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-08 19:44         ` Paolo Bonzini
@ 2020-01-08 19:59           ` Peter Xu
  2020-01-08 20:06             ` Paolo Bonzini
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-01-08 19:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov, Lei Cao

On Wed, Jan 08, 2020 at 08:44:32PM +0100, Paolo Bonzini wrote:
> On 08/01/20 20:06, Peter Xu wrote:
> >> The kvmgt patches were posted, you could just include them in your next
> >> series and clean everything up.  You can get them at
> >> https://patchwork.kernel.org/cover/11316219/.
> > Good to know!
> > 
> > Maybe I'll simply drop all the redundant parts in the dirty ring series,
> > assuming that series is there?  These patchsets should not overlap with
> > each other (so it looks more like an ordering constraint for merging).
> 
> Just include the patches, we'll make sure to get an ACK from Alex.

Sure.  I can even wait a few more days until that settles down
(just in case we'd otherwise need to change this series back and forth).

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-08 19:59           ` Peter Xu
@ 2020-01-08 20:06             ` Paolo Bonzini
  0 siblings, 0 replies; 45+ messages in thread
From: Paolo Bonzini @ 2020-01-08 20:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov, Lei Cao

On 08/01/20 20:59, Peter Xu wrote:
> On Wed, Jan 08, 2020 at 08:44:32PM +0100, Paolo Bonzini wrote:
>> On 08/01/20 20:06, Peter Xu wrote:
>>>> The kvmgt patches were posted, you could just include them in your next
>>>> series and clean everything up.  You can get them at
>>>> https://patchwork.kernel.org/cover/11316219/.
>>> Good to know!
>>>
>>> Maybe I'll simply drop all the redundant parts in the dirty ring series,
>>> assuming that series is there?  These patchsets should not overlap with
>>> each other (so it looks more like an ordering constraint for merging).
>>
>> Just include the patches, we'll make sure to get an ACK from Alex.
> 
> Sure.  I can even wait a few more days until that settles down
> (just in case we'd otherwise need to change this series back and forth).

Don't worry, I'll keep track of that.

Paolo


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RESEND v2 03/17] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-08 19:44               ` Paolo Bonzini
@ 2020-01-08 21:02                 ` Peter Xu
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2020-01-08 21:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Dr . David Alan Gilbert,
	Christophe de Dinechin, Sean Christopherson, Michael S . Tsirkin,
	Jason Wang, Vitaly Kuznetsov

On Wed, Jan 08, 2020 at 08:44:09PM +0100, Paolo Bonzini wrote:
> Yeah, it should be okay assuming you test with lockdep.

I didn't turn it on for this work, but I'll make sure to run with it
from now on.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2020-01-08 21:02 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-21  1:49 [PATCH RESEND v2 00/17] KVM: Dirty ring interface Peter Xu
2019-12-21  1:49 ` [PATCH RESEND v2 01/17] KVM: Remove kvm_read_guest_atomic() Peter Xu
2020-01-08 17:45   ` Paolo Bonzini
2019-12-21  1:49 ` [PATCH RESEND v2 02/17] KVM: X86: Change parameter for fast_page_fault tracepoint Peter Xu
2020-01-08 17:46   ` Paolo Bonzini
2019-12-21  1:49 ` [PATCH RESEND v2 03/17] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] Peter Xu
2019-12-21 13:51   ` Paolo Bonzini
2019-12-23 17:27     ` Peter Xu
2019-12-23 17:59       ` Paolo Bonzini
2019-12-23 20:10         ` Peter Xu
2020-01-08 17:46           ` Paolo Bonzini
2020-01-08 19:15             ` Peter Xu
2020-01-08 19:44               ` Paolo Bonzini
2020-01-08 21:02                 ` Peter Xu
2019-12-21  1:49 ` [PATCH RESEND v2 04/17] KVM: Cache as_id in kvm_memory_slot Peter Xu
2020-01-08 17:47   ` Paolo Bonzini
2019-12-21  1:49 ` [PATCH RESEND v2 05/17] KVM: Add build-time error check on kvm_run size Peter Xu
2019-12-21  1:49 ` [PATCH RESEND v2 06/17] KVM: Pass in kvm pointer into mark_page_dirty_in_slot() Peter Xu
2020-01-08 17:47   ` Paolo Bonzini
2019-12-21  1:49 ` [PATCH RESEND v2 07/17] KVM: Move running VCPU from ARM to common code Peter Xu
2020-01-08 17:47   ` Paolo Bonzini
2019-12-21  1:49 ` [PATCH RESEND v2 08/17] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
2019-12-24  6:16   ` Jason Wang
2019-12-24 15:08     ` Peter Xu
2019-12-25  3:23       ` Jason Wang
2020-01-08 15:52   ` Peter Xu
2020-01-08 17:41     ` Paolo Bonzini
2020-01-08 19:06       ` Peter Xu
2020-01-08 19:44         ` Paolo Bonzini
2020-01-08 19:59           ` Peter Xu
2020-01-08 20:06             ` Paolo Bonzini
2019-12-21  1:49 ` [PATCH RESEND v2 09/17] KVM: Make dirty ring exclusive to dirty bitmap log Peter Xu
2019-12-21  1:58 ` [PATCH RESEND v2 10/17] KVM: Don't allocate dirty bitmap if dirty ring is enabled Peter Xu
2019-12-21  2:04 ` [PATCH RESEND v2 11/17] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
2019-12-21  2:04 ` [PATCH RESEND v2 12/17] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
2019-12-21  2:04 ` [PATCH RESEND v2 13/17] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
2019-12-21  2:04 ` [PATCH RESEND v2 14/17] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
2019-12-21  2:04 ` [PATCH RESEND v2 15/17] KVM: selftests: Add dirty ring buffer test Peter Xu
2019-12-24  6:18   ` Jason Wang
2019-12-24 15:22     ` Peter Xu
2019-12-24  6:50   ` Jason Wang
2019-12-24 15:24     ` Peter Xu
2019-12-21  2:04 ` [PATCH RESEND v2 16/17] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
2019-12-21  2:04 ` [PATCH RESEND v2 17/17] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
2019-12-24  6:34 ` [PATCH RESEND v2 00/17] KVM: Dirty ring interface Jason Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).