* [PATCH v8 00/14] KVM: Dirty ring interface
@ 2020-03-31 18:59 Peter Xu
  2020-03-31 18:59 ` [PATCH v8 01/14] KVM: X86: Change parameter for fast_page_fault tracepoint Peter Xu
                   ` (14 more replies)
  0 siblings, 15 replies; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx

KVM branch:
  https://github.com/xzpeter/linux/tree/kvm-dirty-ring

QEMU branch for testing:
  https://github.com/xzpeter/qemu/tree/kvm-dirty-ring

v8:
- rebase to kvm/next
- fix test bisection issues [Drew]
- reword comment for __x86_set_memory_region [Sean]
- document fixup on "mutual exclusive", etc. [Sean]

For previous versions, please refer to:

V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com
V3: https://lore.kernel.org/kvm/20200109145729.32898-1-peterx@redhat.com
V4: https://lore.kernel.org/kvm/20200205025105.367213-1-peterx@redhat.com
V5: https://lore.kernel.org/kvm/20200304174947.69595-1-peterx@redhat.com
V6: https://lore.kernel.org/kvm/20200309214424.330363-1-peterx@redhat.com
V7: https://lore.kernel.org/kvm/20200318163720.93929-1-peterx@redhat.com

Overview
============

This is a continued work from Lei Cao <lei.cao@stratus.com> and Paolo
Bonzini on the KVM dirty ring interface.

The new dirty ring interface is another way to collect dirty pages for
virtual machines.  It differs from the existing dirty logging interface
in a few major ways:

  - Data format: The dirty data is stored in a ring rather than a
    bitmap, so the amount of data to sync for dirty logging no longer
    depends on the size of guest memory, but on the speed of dirtying.
    Also, the dirty ring is per-vcpu, while the dirty bitmap is per-vm.

  - Data copy: Syncing dirty pages no longer needs a data copy;
    instead, the ring is shared between userspace and the kernel by
    page sharing (mmap() on the vcpu fd).

  - Interface: Instead of the old KVM_GET_DIRTY_LOG and
    KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses the new
    KVM_RESET_DIRTY_RINGS ioctl when we want to re-protect the
    collected dirty pages (it works like KVM_CLEAR_DIRTY_LOG, but ring
    based).  To collect dirty bits, we only need to read the ring data;
    no ioctl is needed (see the usage sketch after this list).
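
As a quick usage illustration only (not code from this series; vm_fd
and vcpu_fd are assumed to be already-open KVM file descriptors, the
4096-entry size is just an example, and error handling is trimmed),
enabling and mapping a ring could look like:

#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Call right after KVM_CREATE_VM, before creating any vCPU. */
static int enable_dirty_ring(int vm_fd, __u32 ring_bytes)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_DIRTY_LOG_RING,
		.args = { ring_bytes },	/* must be a power of two */
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/* Map one vCPU's ring; it lives at a fixed offset of the vcpu fd. */
static struct kvm_dirty_gfn *map_dirty_ring(int vcpu_fd, __u32 ring_bytes)
{
	void *ring = mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE,
			  MAP_SHARED, vcpu_fd,
			  KVM_DIRTY_LOG_PAGE_OFFSET * getpagesize());

	return ring == MAP_FAILED ? NULL : ring;
}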

Ring Layout
===========

KVM dirty ring is per-vcpu.  Each ring is an array of kvm_dirty_gfn
defined as:

struct kvm_dirty_gfn {
        __u32 flags;
        __u32 slot; /* as_id | slot_id */
        __u64 offset;
};

Each GFN is a state machine in itself.  The state is embedded in the
flags field, as defined in the uapi header:

/*
 * KVM dirty GFN flags, defined as:
 *
 * |---------------+---------------+--------------|
 * | bit 1 (reset) | bit 0 (dirty) | Status       |
 * |---------------+---------------+--------------|
 * |             0 |             0 | Invalid GFN  |
 * |             0 |             1 | Dirty GFN    |
 * |             1 |             X | GFN to reset |
 * |---------------+---------------+--------------|
 *
 * Lifecycle of a dirty GFN goes like:
 *
 *      dirtied         collected        reset
 * 00 -----------> 01 -------------> 1X -------+
 *  ^                                          |
 *  |                                          |
 *  +------------------------------------------+
 *
 * The userspace program is only responsible for the 01->1X state
 * conversion (to collect dirty bits).  Also, it must not skip any
 * dirty bits so that dirty bits are always collected in sequence.
 */
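
For illustration only (not code in this series), a userspace harvester
following the state machine above could look like the sketch below.  It
assumes the uapi definitions added by this series (struct kvm_dirty_gfn,
the KVM_DIRTY_GFN_F_* flags and KVM_RESET_DIRTY_RINGS), and it leaves
out the memory barriers and error handling a real implementation needs:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * ring:       mmap()ed per-vcpu array of kvm_dirty_gfn
 * nr_entries: ring size in entries (a power of two)
 * fetch:      userspace-owned cursor of entries collected so far
 */
static int harvest_ring(int vm_fd, struct kvm_dirty_gfn *ring,
			__u32 nr_entries, __u32 *fetch)
{
	int collected = 0;

	for (;;) {
		struct kvm_dirty_gfn *e = &ring[*fetch & (nr_entries - 1)];

		/* Stop at the first entry that is not in the dirty (01) state. */
		if (!(e->flags & KVM_DIRTY_GFN_F_DIRTY))
			break;

		/* ... record (e->slot, e->offset) for the migration here ... */

		/* 01b -> 1Xb: mark as collected, waiting for a reset. */
		e->flags |= KVM_DIRTY_GFN_F_RESET;
		(*fetch)++;
		collected++;
	}

	/* Let the kernel re-protect everything collected so far. */
	if (collected)
		ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);

	return collected;
}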

Testing
=======

This series provides both the implementation of the KVM dirty ring and
the test case.  I've also implemented the QEMU counterpart that can
run with the new KVM; the link can be found at the top of the cover
letter.  However, that is still a very early version which is prone to
change and future optimizations.

I did some measurements with the new method on a 24G guest running a
dirty workload.  I don't see any speedup so far; under some heavy dirty
loads it is even slower (e.g., at an 800MB/s random dirty rate, the kvm
dirty ring takes ~73s on average to complete migration while dirty
logging only needs ~55s on average).  However, that is understandable,
because a 24G guest means only a ~1MB dirty bitmap, which is still a
suitable case for dirty logging, while a heavier workload is the worst
case for the dirty ring (see the arithmetic below).
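
For reference, the bitmap-vs-ring arithmetic behind that claim (my own
back-of-the-envelope numbers, assuming 4K pages and a 4096-entry ring):

  24 GiB guest / 4 KiB pages  = 6,291,456 pages to track
  bitmap: 6,291,456 bits / 8  = 786,432 bytes (~0.75MB) synced per pass
  ring:   4096 * 16 bytes     = 64 KiB per vcpu, but it only holds 4096
                                dirty GFNs before a reset is required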

More tests are welcome on bigger hosts/guests, especially with
COLO-like workloads.

Please review, thanks.

Peter Xu (14):
  KVM: X86: Change parameter for fast_page_fault tracepoint
  KVM: Cache as_id in kvm_memory_slot
  KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
  KVM: X86: Implement ring-based dirty memory tracking
  KVM: Make dirty ring exclusive to dirty bitmap log
  KVM: Don't allocate dirty bitmap if dirty ring is enabled
  KVM: selftests: Always clear dirty bitmap after iteration
  KVM: selftests: Sync uapi/linux/kvm.h to tools/
  KVM: selftests: Use a single binary for dirty/clear log test
  KVM: selftests: Introduce after_vcpu_run hook for dirty log test
  KVM: selftests: Add dirty ring buffer test
  KVM: selftests: Let dirty_log_test async for dirty ring test
  KVM: selftests: Add "-c" parameter to dirty log test

 Documentation/virt/kvm/api.rst                | 123 +++++
 arch/x86/include/asm/kvm_host.h               |   6 +-
 arch/x86/include/uapi/asm/kvm.h               |   1 +
 arch/x86/kvm/Makefile                         |   3 +-
 arch/x86/kvm/mmu/mmu.c                        |  10 +-
 arch/x86/kvm/mmutrace.h                       |   9 +-
 arch/x86/kvm/svm.c                            |   9 +-
 arch/x86/kvm/vmx/vmx.c                        |  89 +--
 arch/x86/kvm/x86.c                            |  48 +-
 include/linux/kvm_dirty_ring.h                | 103 ++++
 include/linux/kvm_host.h                      |  19 +
 include/trace/events/kvm.h                    |  78 +++
 include/uapi/linux/kvm.h                      |  53 ++
 tools/include/uapi/linux/kvm.h                | 100 +++-
 tools/testing/selftests/kvm/Makefile          |   2 -
 .../selftests/kvm/clear_dirty_log_test.c      |   6 -
 tools/testing/selftests/kvm/dirty_log_test.c  | 505 ++++++++++++++++--
 .../testing/selftests/kvm/include/kvm_util.h  |   4 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  68 +++
 .../selftests/kvm/lib/kvm_util_internal.h     |   4 +
 virt/kvm/dirty_ring.c                         | 195 +++++++
 virt/kvm/kvm_main.c                           | 162 +++++-
 22 files changed, 1459 insertions(+), 138 deletions(-)
 create mode 100644 include/linux/kvm_dirty_ring.h
 delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
 create mode 100644 virt/kvm/dirty_ring.c

-- 
2.24.1


* [PATCH v8 01/14] KVM: X86: Change parameter for fast_page_fault tracepoint
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-03-31 18:59 ` [PATCH v8 02/14] KVM: Cache as_id in kvm_memory_slot Peter Xu
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx

It would be clearer to dump the return value, so that it is easy to
see whether we went through the fast path when handling the current
page fault.  Remove the last two parameters, because the old/new sptes
are already dumped on the same line.
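
With this change, a fast_page_fault trace line would look roughly like
the following (values are purely illustrative):

  fast_page_fault: vcpu 0 gva 7f1234567000 error_code P|W sptep 00000000deadbeef old 0x0 new 600000012345c77 ret 1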

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmutrace.h | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index ffcd96fc02d0..ef523e760743 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -244,9 +244,6 @@ TRACE_EVENT(
 		  __entry->access)
 );
 
-#define __spte_satisfied(__spte)				\
-	(__entry->retry && is_writable_pte(__entry->__spte))
-
 TRACE_EVENT(
 	fast_page_fault,
 	TP_PROTO(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u32 error_code,
@@ -274,12 +271,10 @@ TRACE_EVENT(
 	),
 
 	TP_printk("vcpu %d gva %llx error_code %s sptep %p old %#llx"
-		  " new %llx spurious %d fixed %d", __entry->vcpu_id,
+		  " new %llx ret %d", __entry->vcpu_id,
 		  __entry->cr2_or_gpa, __print_flags(__entry->error_code, "|",
 		  kvm_mmu_trace_pferr_flags), __entry->sptep,
-		  __entry->old_spte, __entry->new_spte,
-		  __spte_satisfied(old_spte), __spte_satisfied(new_spte)
-	)
+		  __entry->old_spte, __entry->new_spte, __entry->retry)
 );
 
 TRACE_EVENT(
-- 
2.24.1


* [PATCH v8 02/14] KVM: Cache as_id in kvm_memory_slot
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
  2020-03-31 18:59 ` [PATCH v8 01/14] KVM: X86: Change parameter for fast_page_fault tracepoint Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-03-31 18:59 ` [PATCH v8 03/14] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] Peter Xu
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx

Cache the address space ID just like the slot ID.  It will be used in
order to fill in the dirty ring entries.
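
For illustration only (the helpers below are hypothetical, not part of
this patch), the cached as_id ends up packed into the dirty ring's
"slot" field the same way later patches encode and decode it:

static inline u32 dirty_ring_slot_pack(u16 as_id, u16 slot_id)
{
	return ((u32)as_id << 16) | slot_id;
}

static inline void dirty_ring_slot_unpack(u32 slot, u16 *as_id, u16 *slot_id)
{
	*as_id = slot >> 16;
	*slot_id = (u16)slot;
}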

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f6a1905da9bf..515570116b60 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -346,6 +346,7 @@ struct kvm_memory_slot {
 	unsigned long userspace_addr;
 	u32 flags;
 	short id;
+	u16 as_id;
 };
 
 static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f744bc603c53..c04612726e85 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1243,6 +1243,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	if (!mem->memory_size)
 		return kvm_delete_memslot(kvm, mem, &old, as_id);
 
+	new.as_id = as_id;
 	new.id = id;
 	new.base_gfn = mem->guest_phys_addr >> PAGE_SHIFT;
 	new.npages = mem->memory_size >> PAGE_SHIFT;
-- 
2.24.1


* [PATCH v8 03/14] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
  2020-03-31 18:59 ` [PATCH v8 01/14] KVM: X86: Change parameter for fast_page_fault tracepoint Peter Xu
  2020-03-31 18:59 ` [PATCH v8 02/14] KVM: Cache as_id in kvm_memory_slot Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-04-23 20:39   ` Sean Christopherson
  2020-03-31 18:59 ` [PATCH v8 04/14] KVM: Pass in kvm pointer into mark_page_dirty_in_slot() Peter Xu
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx

Originally, we have three code paths that can dirty a page without a
vcpu context on X86:

  - init_rmode_identity_map
  - init_rmode_tss
  - kvmgt_rw_gpa

init_rmode_identity_map and init_rmode_tss will be set up on the
destination VM no matter what (and the guest cannot even see them), so
it does not make sense to track them at all.

To do this, allow __x86_set_memory_region() to return the userspace
address that was just allocated to the caller.  Then, in both of the
functions, we write directly to the userspace address instead of
calling the kvm_write_*() APIs.

Another trivial change is that we don't need to explicitly clear the
identity page table root in init_rmode_identity_map(), because we will
overwrite the whole page with 4M huge page entries anyway.
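
For illustration, the new calling convention (mirrored from the vmx
hunks below; the setup_internal_slot() helper itself is hypothetical)
looks like:

static int setup_internal_slot(struct kvm *kvm, int id, gpa_t gpa,
			       const void *data, size_t len)
{
	void __user *uaddr = __x86_set_memory_region(kvm, id, gpa, PAGE_SIZE);

	if (IS_ERR(uaddr))
		return PTR_ERR(uaddr);

	/* Write through the HVA directly; no dirty tracking involved. */
	return __copy_to_user(uaddr, data, len) ? -EFAULT : 0;
}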

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +-
 arch/x86/kvm/svm.c              |  9 ++--
 arch/x86/kvm/vmx/vmx.c          | 82 ++++++++++++++++-----------------
 arch/x86/kvm/x86.c              | 39 +++++++++++++---
 4 files changed, 81 insertions(+), 52 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9a183e9d4cb1..a8c68f626fb5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1645,7 +1645,8 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);
 
 int kvm_is_in_guest(void);
 
-int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size);
+void __user *__x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
+				     u32 size);
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
 bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 05cb45bc0e08..140bff1946b1 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1785,7 +1785,8 @@ static u64 *avic_get_physical_id_entry(struct kvm_vcpu *vcpu,
  */
 static int avic_update_access_page(struct kvm *kvm, bool activate)
 {
-	int ret = 0;
+	void __user *ret;
+	int r = 0;
 
 	mutex_lock(&kvm->slots_lock);
 	/*
@@ -1801,13 +1802,15 @@ static int avic_update_access_page(struct kvm *kvm, bool activate)
 				      APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
 				      APIC_DEFAULT_PHYS_BASE,
 				      activate ? PAGE_SIZE : 0);
-	if (ret)
+	if (IS_ERR(ret)) {
+		r = PTR_ERR(ret);
 		goto out;
+	}
 
 	kvm->arch.apic_access_page_done = activate;
 out:
 	mutex_unlock(&kvm->slots_lock);
-	return ret;
+	return r;
 }
 
 static int avic_init_backing_page(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a7dd67859bd4..529b04ca0ac8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3432,34 +3432,26 @@ static bool guest_state_valid(struct kvm_vcpu *vcpu)
 	return true;
 }
 
-static int init_rmode_tss(struct kvm *kvm)
+static int init_rmode_tss(struct kvm *kvm, void __user *ua)
 {
-	gfn_t fn;
-	u16 data = 0;
-	int idx, r;
+	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
+	u16 data;
+	int i, r;
+
+	for (i = 0; i < 3; i++) {
+		r = __copy_to_user(ua + PAGE_SIZE * i, zero_page, PAGE_SIZE);
+		if (r)
+			return -EFAULT;
+	}
 
-	idx = srcu_read_lock(&kvm->srcu);
-	fn = to_kvm_vmx(kvm)->tss_addr >> PAGE_SHIFT;
-	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
-	if (r < 0)
-		goto out;
 	data = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE;
-	r = kvm_write_guest_page(kvm, fn++, &data,
-			TSS_IOPB_BASE_OFFSET, sizeof(u16));
-	if (r < 0)
-		goto out;
-	r = kvm_clear_guest_page(kvm, fn++, 0, PAGE_SIZE);
-	if (r < 0)
-		goto out;
-	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
-	if (r < 0)
-		goto out;
+	r = __copy_to_user(ua + TSS_IOPB_BASE_OFFSET, &data, sizeof(u16));
+	if (r)
+		return -EFAULT;
+
 	data = ~0;
-	r = kvm_write_guest_page(kvm, fn, &data,
-				 RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1,
-				 sizeof(u8));
-out:
-	srcu_read_unlock(&kvm->srcu, idx);
+	r = __copy_to_user(ua + RMODE_TSS_SIZE - 1, &data, sizeof(u8));
+
 	return r;
 }
 
@@ -3468,6 +3460,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
 	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
 	int i, r = 0;
 	kvm_pfn_t identity_map_pfn;
+	void __user *uaddr;
 	u32 tmp;
 
 	/* Protect kvm_vmx->ept_identity_pagetable_done. */
@@ -3480,22 +3473,24 @@ static int init_rmode_identity_map(struct kvm *kvm)
 		kvm_vmx->ept_identity_map_addr = VMX_EPT_IDENTITY_PAGETABLE_ADDR;
 	identity_map_pfn = kvm_vmx->ept_identity_map_addr >> PAGE_SHIFT;
 
-	r = __x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT,
-				    kvm_vmx->ept_identity_map_addr, PAGE_SIZE);
-	if (r < 0)
+	uaddr = __x86_set_memory_region(kvm,
+					IDENTITY_PAGETABLE_PRIVATE_MEMSLOT,
+					kvm_vmx->ept_identity_map_addr,
+					PAGE_SIZE);
+	if (IS_ERR(uaddr)) {
+		r = PTR_ERR(uaddr);
 		goto out;
+	}
 
-	r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
-	if (r < 0)
-		goto out;
 	/* Set up identity-mapping pagetable for EPT in real mode */
 	for (i = 0; i < PT32_ENT_PER_PAGE; i++) {
 		tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
 			_PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
-		r = kvm_write_guest_page(kvm, identity_map_pfn,
-				&tmp, i * sizeof(tmp), sizeof(tmp));
-		if (r < 0)
+		r = __copy_to_user(uaddr + i * sizeof(tmp), &tmp, sizeof(tmp));
+		if (r) {
+			r = -EFAULT;
 			goto out;
+		}
 	}
 	kvm_vmx->ept_identity_pagetable_done = true;
 
@@ -3522,19 +3517,22 @@ static void seg_setup(int seg)
 static int alloc_apic_access_page(struct kvm *kvm)
 {
 	struct page *page;
-	int r = 0;
+	void __user *r;
+	int ret = 0;
 
 	mutex_lock(&kvm->slots_lock);
 	if (kvm->arch.apic_access_page_done)
 		goto out;
 	r = __x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
 				    APIC_DEFAULT_PHYS_BASE, PAGE_SIZE);
-	if (r)
+	if (IS_ERR(r)) {
+		ret = PTR_ERR(r);
 		goto out;
+	}
 
 	page = gfn_to_page(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
 	if (is_error_page(page)) {
-		r = -EFAULT;
+		ret = -EFAULT;
 		goto out;
 	}
 
@@ -3546,7 +3544,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
 	kvm->arch.apic_access_page_done = true;
 out:
 	mutex_unlock(&kvm->slots_lock);
-	return r;
+	return ret;
 }
 
 int allocate_vpid(void)
@@ -4473,7 +4471,7 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
 {
-	int ret;
+	void __user *ret;
 
 	if (enable_unrestricted_guest)
 		return 0;
@@ -4483,10 +4481,12 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
 				      PAGE_SIZE * 3);
 	mutex_unlock(&kvm->slots_lock);
 
-	if (ret)
-		return ret;
+	if (IS_ERR(ret))
+		return PTR_ERR(ret);
+
 	to_kvm_vmx(kvm)->tss_addr = addr;
-	return init_rmode_tss(kvm);
+
+	return init_rmode_tss(kvm, ret);
 }
 
 static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b6d9ac9533c..faa702c4d37b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9791,7 +9791,32 @@ void kvm_arch_sync_events(struct kvm *kvm)
 	kvm_free_pit(kvm);
 }
 
-int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
+#define  ERR_PTR_USR(e)  ((void __user *)ERR_PTR(e))
+
+/**
+ * __x86_set_memory_region: Setup KVM internal memory slot
+ *
+ * @kvm: the kvm pointer to the VM.
+ * @id: the slot ID to setup.
+ * @gpa: the GPA to install the slot (unused when @size == 0).
+ * @size: the size of the slot. Set to zero to uninstall a slot.
+ *
+ * This function helps to setup a KVM internal memory slot.  Specify
+ * @size > 0 to install a new slot, while @size == 0 to uninstall a
+ * slot.  The return code can be one of the following:
+ *
+ *   HVA:           on success (uninstall will return a bogus HVA)
+ *   -errno:        on error
+ *
+ * The caller should always use IS_ERR() to check the return value
+ * before use.  Note, the KVM internal memory slots are guaranteed to
+ * remain valid and unchanged until the VM is destroyed, i.e., the
+ * GPA->HVA translation will not change.  However, the HVA is a user
+ * address, i.e. its accessibility is not guaranteed, and must be
+ * accessed via __copy_{to,from}_user().
+ */
+void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
+				      u32 size)
 {
 	int i, r;
 	unsigned long hva, uninitialized_var(old_npages);
@@ -9800,12 +9825,12 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
 
 	/* Called with kvm->slots_lock held.  */
 	if (WARN_ON(id >= KVM_MEM_SLOTS_NUM))
-		return -EINVAL;
+		return ERR_PTR_USR(-EINVAL);
 
 	slot = id_to_memslot(slots, id);
 	if (size) {
 		if (slot && slot->npages)
-			return -EEXIST;
+			return ERR_PTR_USR(-EEXIST);
 
 		/*
 		 * MAP_SHARED to prevent internal slot pages from being moved
@@ -9814,10 +9839,10 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
 		hva = vm_mmap(NULL, 0, size, PROT_READ | PROT_WRITE,
 			      MAP_SHARED | MAP_ANONYMOUS, 0);
 		if (IS_ERR((void *)hva))
-			return PTR_ERR((void *)hva);
+			return (void __user *)hva;
 	} else {
 		if (!slot || !slot->npages)
-			return 0;
+			return ERR_PTR_USR(0);
 
 		/*
 		 * Stuff a non-canonical value to catch use-after-delete.  This
@@ -9838,13 +9863,13 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
 		m.memory_size = size;
 		r = __kvm_set_memory_region(kvm, &m);
 		if (r < 0)
-			return r;
+			return ERR_PTR_USR(r);
 	}
 
 	if (!size)
 		vm_munmap(hva, old_npages * PAGE_SIZE);
 
-	return 0;
+	return (void __user *)hva;
 }
 EXPORT_SYMBOL_GPL(__x86_set_memory_region);
 
-- 
2.24.1


* [PATCH v8 04/14] KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
                   ` (2 preceding siblings ...)
  2020-03-31 18:59 ` [PATCH v8 03/14] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-03-31 18:59 ` [PATCH v8 05/14] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx

The context will be needed to implement the kvm dirty ring.
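
For context (this mirrors what a later patch in the series does with
the pointer; shown here for illustration only), having the kvm pointer
in mark_page_dirty_in_slot() lets the helper route a dirty page either
into the per-vcpu dirty ring or into the per-memslot bitmap:

	if (kvm->dirty_ring_size)
		kvm_dirty_ring_push(kvm_dirty_ring_get(kvm), slot, rel_gfn);
	else
		set_bit_le(rel_gfn, memslot->dirty_bitmap);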

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 virt/kvm/kvm_main.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c04612726e85..1f869dda8110 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -144,7 +144,9 @@ static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
+static void mark_page_dirty_in_slot(struct kvm *kvm,
+				    struct kvm_memory_slot *memslot,
+				    gfn_t gfn);
 
 __visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
@@ -2120,7 +2122,8 @@ int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_map);
 
-static void __kvm_unmap_gfn(struct kvm_memory_slot *memslot,
+static void __kvm_unmap_gfn(struct kvm *kvm,
+			struct kvm_memory_slot *memslot,
 			struct kvm_host_map *map,
 			struct gfn_to_pfn_cache *cache,
 			bool dirty, bool atomic)
@@ -2145,7 +2148,7 @@ static void __kvm_unmap_gfn(struct kvm_memory_slot *memslot,
 #endif
 
 	if (dirty)
-		mark_page_dirty_in_slot(memslot, map->gfn);
+		mark_page_dirty_in_slot(kvm, memslot, map->gfn);
 
 	if (cache)
 		cache->dirty |= dirty;
@@ -2159,7 +2162,7 @@ static void __kvm_unmap_gfn(struct kvm_memory_slot *memslot,
 int kvm_unmap_gfn(struct kvm_vcpu *vcpu, struct kvm_host_map *map, 
 		  struct gfn_to_pfn_cache *cache, bool dirty, bool atomic)
 {
-	__kvm_unmap_gfn(gfn_to_memslot(vcpu->kvm, map->gfn), map,
+	__kvm_unmap_gfn(vcpu->kvm, gfn_to_memslot(vcpu->kvm, map->gfn), map,
 			cache, dirty, atomic);
 	return 0;
 }
@@ -2167,8 +2170,8 @@ EXPORT_SYMBOL_GPL(kvm_unmap_gfn);
 
 void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty)
 {
-	__kvm_unmap_gfn(kvm_vcpu_gfn_to_memslot(vcpu, map->gfn), map, NULL,
-			dirty, false);
+	__kvm_unmap_gfn(vcpu->kvm, kvm_vcpu_gfn_to_memslot(vcpu, map->gfn),
+			map, NULL, dirty, false);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
 
@@ -2342,7 +2345,8 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
-static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
+static int __kvm_write_guest_page(struct kvm *kvm,
+				  struct kvm_memory_slot *memslot, gfn_t gfn,
 			          const void *data, int offset, int len)
 {
 	int r;
@@ -2354,7 +2358,7 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
 	r = __copy_to_user((void __user *)addr + offset, data, len);
 	if (r)
 		return -EFAULT;
-	mark_page_dirty_in_slot(memslot, gfn);
+	mark_page_dirty_in_slot(kvm, memslot, gfn);
 	return 0;
 }
 
@@ -2363,7 +2367,7 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_page);
 
@@ -2372,7 +2376,7 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
 
@@ -2491,7 +2495,7 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
 	if (r)
 		return -EFAULT;
-	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
+	mark_page_dirty_in_slot(kvm, ghc->memslot, gpa >> PAGE_SHIFT);
 
 	return 0;
 }
@@ -2558,7 +2562,8 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
+static void mark_page_dirty_in_slot(struct kvm *kvm,
+				    struct kvm_memory_slot *memslot,
 				    gfn_t gfn)
 {
 	if (memslot && memslot->dirty_bitmap) {
@@ -2573,7 +2578,7 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	struct kvm_memory_slot *memslot;
 
 	memslot = gfn_to_memslot(kvm, gfn);
-	mark_page_dirty_in_slot(memslot, gfn);
+	mark_page_dirty_in_slot(kvm, memslot, gfn);
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty);
 
@@ -2582,7 +2587,7 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
 	struct kvm_memory_slot *memslot;
 
 	memslot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-	mark_page_dirty_in_slot(memslot, gfn);
+	mark_page_dirty_in_slot(vcpu->kvm, memslot, gfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
 
-- 
2.24.1


* [PATCH v8 05/14] KVM: X86: Implement ring-based dirty memory tracking
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
                   ` (3 preceding siblings ...)
  2020-03-31 18:59 ` [PATCH v8 04/14] KVM: Pass in kvm pointer into mark_page_dirty_in_slot() Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-03-31 18:59 ` [PATCH v8 06/14] KVM: Make dirty ring exclusive to dirty bitmap log Peter Xu
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx, Lei Cao

This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]

KVM currently uses large bitmaps to track dirty memory.  These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information.  The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are dirtied from one log-dirty
pass to another.  However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue exists for live migration when the guest memory is
huge while only a small portion of it gets dirtied.  In that case, for
each dirty sync we need to pull the whole dirty bitmap to userspace
and analyse every bit even if it is mostly zeros.

The preferred data structure for above scenarios is a dense list of
guest frame numbers (GFN).  This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.

This patch enables the dirty ring for X86 only.  However, it should be
easy to extend it to other archs as well.

[1] https://patchwork.kernel.org/patch/10471409/
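
As a usage illustration only (not part of this patch; vm_fd, vcpu_fd
and run are assumed to be set up already, and harvest_vcpu_rings() is
a hypothetical helper), a userspace run loop would handle the new exit
reason roughly like:

	ioctl(vcpu_fd, KVM_RUN, 0);

	switch (run->exit_reason) {
	case KVM_EXIT_DIRTY_RING_FULL:
		/* Collect entries from all vcpu rings, then reset them. */
		harvest_vcpu_rings();
		ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);
		break;
	/* ... other exit reasons ... */
	}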

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 Documentation/virt/kvm/api.rst  | 116 +++++++++++++++++++
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/include/uapi/asm/kvm.h |   1 +
 arch/x86/kvm/Makefile           |   3 +-
 arch/x86/kvm/mmu/mmu.c          |   6 +
 arch/x86/kvm/vmx/vmx.c          |   7 ++
 arch/x86/kvm/x86.c              |   9 ++
 include/linux/kvm_dirty_ring.h  | 103 +++++++++++++++++
 include/linux/kvm_host.h        |  13 +++
 include/trace/events/kvm.h      |  78 +++++++++++++
 include/uapi/linux/kvm.h        |  53 +++++++++
 virt/kvm/dirty_ring.c           | 195 ++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c             | 112 +++++++++++++++++-
 13 files changed, 697 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/kvm_dirty_ring.h
 create mode 100644 virt/kvm/dirty_ring.c

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index efbbe570aa9b..aa54a34077b7 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -249,6 +249,7 @@ Based on their initialization different VMs may have different capabilities.
 It is thus encouraged to use the vm ioctl to query for capabilities (available
 with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
 
+
 4.5 KVM_GET_VCPU_MMAP_SIZE
 --------------------------
 
@@ -262,6 +263,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
 memory region.  This ioctl returns the size of that region.  See the
 KVM_RUN documentation for details.
 
+Besides the size of the KVM_RUN communication region, other areas of
+the VCPU file descriptor can be mmap-ed, including:
+
+- if KVM_CAP_COALESCED_MMIO is available, a page at
+  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
+  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
+  KVM_CAP_COALESCED_MMIO is not documented yet.
+
+- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
+  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
+  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
+
 
 4.6 KVM_SET_MEMORY_REGION
 -------------------------
@@ -6109,3 +6122,106 @@ KVM can therefore start protected VMs.
 This capability governs the KVM_S390_PV_COMMAND ioctl and the
 KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
 guests when the state change is invalid.
+
+8.24 KVM_CAP_DIRTY_LOG_RING
+
+Architectures: x86
+Parameters: args[0] - size of the dirty log ring
+
+KVM is capable of tracking dirty memory using ring buffers that are
+mmaped into userspace; there is one dirty ring per vcpu.
+
+One dirty ring is defined as below internally:
+
+struct kvm_dirty_ring {
+	u32 dirty_index;
+	u32 reset_index;
+	u32 size;
+	u32 soft_limit;
+	struct kvm_dirty_gfn *dirty_gfns;
+	int index;
+};
+
+Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
+For each of the dirty entry it's defined as:
+
+struct kvm_dirty_gfn {
+        __u32 flags;
+        __u32 slot; /* as_id | slot_id */
+        __u64 offset;
+};
+
+Each GFN is a state machine in itself.  The state is embedded in the
+flags field, as defined in the uapi header:
+
+/*
+ * KVM dirty GFN flags, defined as:
+ *
+ * |---------------+---------------+--------------|
+ * | bit 1 (reset) | bit 0 (dirty) | Status       |
+ * |---------------+---------------+--------------|
+ * |             0 |             0 | Invalid GFN  |
+ * |             0 |             1 | Dirty GFN    |
+ * |             1 |             X | GFN to reset |
+ * |---------------+---------------+--------------|
+ *
+ * Lifecycle of a dirty GFN goes like:
+ *
+ *      dirtied         collected        reset
+ * 00 -----------> 01 -------------> 1X -------+
+ *  ^                                          |
+ *  |                                          |
+ *  +------------------------------------------+
+ *
+ * The userspace program is only responsible for the 01->1X state
+ * conversion (to collect dirty bits).  Also, it must not skip any
+ * dirty bits so that dirty bits are always collected in sequence.
+ */
+#define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
+#define KVM_DIRTY_GFN_F_RESET           BIT(1)
+#define KVM_DIRTY_GFN_F_MASK            0x3
+
+Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
+to enable this capability for the new guest and set the size of the
+rings.  It is only allowed before creating any vCPU, and the size of
+the ring must be a power of two.  The larger the ring buffer, the less
+likely the ring is full and the VM is forced to exit to userspace. The
+optimal size depends on the workload, but it is recommended that it be
+at least 64 KiB (4096 entries).
+
+Just like for dirty page bitmaps, the buffer tracks writes to
+all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
+set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
+with the flag set, userspace can start harvesting dirty pages from the
+ring buffer.
+
+To harvest the dirty pages, userspace accesses the mmaped ring buffer
+to read the dirty GFNs starting from zero.  If the flags field has the
+DIRTY bit set (at this stage the RESET bit must be cleared), then it
+means this GFN is a dirty GFN.  Userspace should collect this GFN,
+mark the flags from state 01b to 1Xb (bit 0 will be ignored by KVM,
+but bit 1 must be set to show that this GFN is collected and waiting
+for a reset), and move on to the next GFN.  Userspace should continue
+doing this until it finds a GFN whose flags have the DIRTY bit
+cleared, which means all the dirty GFNs available for now have been
+collected.  Userspace does not have to collect all the dirty GFNs at
+once.  However, it must collect the dirty GFNs in sequence, i.e., the
+userspace program cannot skip one dirty GFN to collect the one next to
+it.
+
+After processing one or more entries in the ring buffer, userspace
+calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about
+it, so that the kernel will reprotect those collected GFNs.
+Therefore, the ioctl must be called *before* reading the content of
+the dirty pages.
+
+The dirty ring interface has a major difference compared to the
+KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from
+userspace, it is still possible that the kernel has not yet flushed
+the hardware dirty buffers into the kernel buffer (the flushing was
+previously done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that,
+one needs to kick the vcpu out for a hardware buffer flush (vmexit)
+to make sure all the existing dirty gfns are flushed to the dirty rings.
+
+The dirty ring can get full.  When it happens, the KVM_RUN of the
+vcpu will return with exit reason KVM_EXIT_DIRTY_RING_FULL.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a8c68f626fb5..970770bfafbd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1201,6 +1201,7 @@ struct kvm_x86_ops {
 					   struct kvm_memory_slot *slot,
 					   gfn_t offset, unsigned long mask);
 	int (*write_log_dirty)(struct kvm_vcpu *vcpu);
+	int (*cpu_dirty_log_size)(void);
 
 	/* pmu operations of sub-arch */
 	const struct kvm_pmu_ops *pmu_ops;
@@ -1693,4 +1694,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 #define GET_SMSTATE(type, buf, offset)		\
 	(*(type *)((buf) + (offset) - 0x7e00))
 
+int kvm_cpu_dirty_log_size(void);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 3f3f780c8c65..99b15ce39e75 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -12,6 +12,7 @@
 
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
 
 #define DE_VECTOR 0
 #define DB_VECTOR 1
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index e553f0fdd87d..8e12a6fe1524 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -6,7 +6,8 @@ ccflags-$(CONFIG_KVM_WERROR) += -Werror
 KVM := ../../../virt/kvm
 
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
-				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
+				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
+				$(KVM)/dirty_ring.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
 
 kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 560e85ebdf22..e770a5dd0c30 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1749,7 +1749,13 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
 {
 	if (kvm_x86_ops->write_log_dirty)
 		return kvm_x86_ops->write_log_dirty(vcpu);
+	return 0;
+}
 
+int kvm_cpu_dirty_log_size(void)
+{
+	if (kvm_x86_ops->cpu_dirty_log_size)
+		return kvm_x86_ops->cpu_dirty_log_size();
 	return 0;
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 529b04ca0ac8..3c6bc41cd6a6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7761,6 +7761,7 @@ static __init int hardware_setup(void)
 		kvm_x86_ops->slot_disable_log_dirty = NULL;
 		kvm_x86_ops->flush_log_dirty = NULL;
 		kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
+		kvm_x86_ops->cpu_dirty_log_size = NULL;
 	}
 
 	if (!cpu_has_vmx_preemption_timer())
@@ -7835,6 +7836,11 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
 	return supported & BIT(bit);
 }
 
+static int vmx_cpu_dirty_log_size(void)
+{
+	return enable_pml ? PML_ENTITY_NUM : 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -7945,6 +7951,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.flush_log_dirty = vmx_flush_log_dirty,
 	.enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked,
 	.write_log_dirty = vmx_write_pml_buffer,
+	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
 
 	.pre_block = vmx_pre_block,
 	.post_block = vmx_post_block,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index faa702c4d37b..3a12f931a045 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8176,6 +8176,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	bool req_immediate_exit = false;
 
+	/* Forbid vmenter if vcpu dirty ring is soft-full */
+	if (unlikely(vcpu->kvm->dirty_ring_size &&
+		     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
+		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+		trace_kvm_dirty_ring_exit(vcpu);
+		r = 0;
+		goto out;
+	}
+
 	if (kvm_request_pending(vcpu)) {
 		if (kvm_check_request(KVM_REQ_GET_VMCS12_PAGES, vcpu)) {
 			if (unlikely(!kvm_x86_ops->get_vmcs12_pages(vcpu))) {
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
new file mode 100644
index 000000000000..120e5e90fa1d
--- /dev/null
+++ b/include/linux/kvm_dirty_ring.h
@@ -0,0 +1,103 @@
+#ifndef KVM_DIRTY_RING_H
+#define KVM_DIRTY_RING_H
+
+#include <linux/kvm.h>
+
+/**
+ * kvm_dirty_ring: KVM internal dirty ring structure
+ *
+ * @dirty_index: free running counter that points to the next slot in
+ *               dirty_ring->dirty_gfns, where a new dirty page should go
+ * @reset_index: free running counter that points to the next dirty page
+ *               in dirty_ring->dirty_gfns for which dirty trap needs to
+ *               be reenabled
+ * @size:        size of the compact list, dirty_ring->dirty_gfns
+ * @soft_limit:  when the number of dirty pages in the list reaches this
+ *               limit, vcpu that owns this ring should exit to userspace
+ *               to allow userspace to harvest all the dirty pages
+ * @dirty_gfns:  the array to keep the dirty gfns
+ * @index:       index of this dirty ring
+ */
+struct kvm_dirty_ring {
+	u32 dirty_index;
+	u32 reset_index;
+	u32 size;
+	u32 soft_limit;
+	struct kvm_dirty_gfn *dirty_gfns;
+	int index;
+};
+
+#if (KVM_DIRTY_LOG_PAGE_OFFSET == 0)
+/*
+ * If KVM_DIRTY_LOG_PAGE_OFFSET not defined, kvm_dirty_ring.o should
+ * not be included as well, so define these nop functions for the arch.
+ */
+static inline u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+	return 0;
+}
+
+static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
+				       int index, u32 size)
+{
+	return 0;
+}
+
+static inline struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
+{
+	return NULL;
+}
+
+static inline int kvm_dirty_ring_reset(struct kvm *kvm,
+				       struct kvm_dirty_ring *ring)
+{
+	return 0;
+}
+
+static inline void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+				       u32 slot, u64 offset)
+{
+}
+
+static inline struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring,
+						   u32 offset)
+{
+	return NULL;
+}
+
+static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+}
+
+static inline bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
+{
+	return true;
+}
+
+#else /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */
+
+u32 kvm_dirty_ring_get_rsvd_entries(void);
+int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
+struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm);
+
+/*
+ * called with kvm->slots_lock held, returns the number of
+ * processed pages.
+ */
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
+
+/*
+ * Push one dirty gfn onto the ring.  The ring is never expected to be
+ * full, thanks to the soft-full check done before entering the guest.
+ */
+void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
+
+/* for use in vm_operations_struct */
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
+bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
+
+#endif /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */
+
+#endif	/* KVM_DIRTY_RING_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 515570116b60..291a9a9a1239 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -34,6 +34,7 @@
 #include <linux/kvm_types.h>
 
 #include <asm/kvm_host.h>
+#include <linux/kvm_dirty_ring.h>
 
 #ifndef KVM_MAX_VCPU_ID
 #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
@@ -319,6 +320,7 @@ struct kvm_vcpu {
 	bool ready;
 	struct kvm_vcpu_arch arch;
 	struct dentry *debugfs_dentry;
+	struct kvm_dirty_ring dirty_ring;
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
@@ -504,6 +506,7 @@ struct kvm {
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
+	u32 dirty_ring_size;
 };
 
 #define kvm_err(fmt, ...) \
@@ -1422,4 +1425,14 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
 				uintptr_t data, const char *name,
 				struct task_struct **thread_ptr);
 
+/*
+ * This defines how many reserved entries we want to keep before we
+ * kick the vcpu to the userspace to avoid dirty ring full.  This
+ * value can be tuned to higher if e.g. PML is enabled on the host.
+ */
+#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
+
+/* Max number of entries allowed for each kvm dirty ring */
+#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
+
 #endif
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 2c735a3e6613..3d850997940c 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -399,6 +399,84 @@ TRACE_EVENT(kvm_halt_poll_ns,
 #define trace_kvm_halt_poll_ns_shrink(vcpu_id, new, old) \
 	trace_kvm_halt_poll_ns(false, vcpu_id, new, old)
 
+TRACE_EVENT(kvm_dirty_ring_push,
+	TP_PROTO(struct kvm_dirty_ring *ring, u32 slot, u64 offset),
+	TP_ARGS(ring, slot, offset),
+
+	TP_STRUCT__entry(
+		__field(int, index)
+		__field(u32, dirty_index)
+		__field(u32, reset_index)
+		__field(u32, slot)
+		__field(u64, offset)
+	),
+
+	TP_fast_assign(
+		__entry->index          = ring->index;
+		__entry->dirty_index    = ring->dirty_index;
+		__entry->reset_index    = ring->reset_index;
+		__entry->slot           = slot;
+		__entry->offset         = offset;
+	),
+
+	TP_printk("ring %d: dirty 0x%x reset 0x%x "
+		  "slot %u offset 0x%llx (used %u)",
+		  __entry->index, __entry->dirty_index,
+		  __entry->reset_index,  __entry->slot, __entry->offset,
+		  __entry->dirty_index - __entry->reset_index)
+);
+
+TRACE_EVENT(kvm_dirty_ring_reset,
+	TP_PROTO(struct kvm_dirty_ring *ring),
+	TP_ARGS(ring),
+
+	TP_STRUCT__entry(
+		__field(int, index)
+		__field(u32, dirty_index)
+		__field(u32, reset_index)
+	),
+
+	TP_fast_assign(
+		__entry->index          = ring->index;
+		__entry->dirty_index    = ring->dirty_index;
+		__entry->reset_index    = ring->reset_index;
+	),
+
+	TP_printk("ring %d: dirty 0x%x reset 0x%x (used %u)",
+		  __entry->index, __entry->dirty_index, __entry->reset_index,
+		  __entry->dirty_index - __entry->reset_index)
+);
+
+TRACE_EVENT(kvm_dirty_ring_waitqueue,
+	TP_PROTO(bool enter),
+	TP_ARGS(enter),
+
+	TP_STRUCT__entry(
+	    __field(bool, enter)
+	),
+
+	TP_fast_assign(
+	    __entry->enter = enter;
+	),
+
+	TP_printk("%s", __entry->enter ? "wait" : "awake")
+);
+
+TRACE_EVENT(kvm_dirty_ring_exit,
+	TP_PROTO(struct kvm_vcpu *vcpu),
+	TP_ARGS(vcpu),
+
+	TP_STRUCT__entry(
+	    __field(int, vcpu_id)
+	),
+
+	TP_fast_assign(
+	    __entry->vcpu_id = vcpu->vcpu_id;
+	),
+
+	TP_printk("vcpu %d", __entry->vcpu_id)
+);
+
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 428c7dde6b4b..74f150c69ee6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
 #define KVM_EXIT_ARM_NISV         28
+#define KVM_EXIT_DIRTY_RING_FULL  29
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -1017,6 +1018,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_VCPU_RESETS 179
 #define KVM_CAP_S390_PROTECTED 180
 #define KVM_CAP_PPC_SECURE_GUEST 181
+#define KVM_CAP_DIRTY_LOG_RING 182
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1518,6 +1520,9 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_S390_PROTECTED */
 #define KVM_S390_PV_COMMAND		_IOWR(KVMIO, 0xc5, struct kvm_pv_cmd)
 
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc6)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
@@ -1671,4 +1676,52 @@ struct kvm_hyperv_eventfd {
 #define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE    (1 << 0)
 #define KVM_DIRTY_LOG_INITIALLY_SET            (1 << 1)
 
+/*
+ * Arch needs to define the macro after implementing the dirty ring
+ * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
+ * starting page offset of the dirty ring structures.
+ */
+#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
+#define KVM_DIRTY_LOG_PAGE_OFFSET 0
+#endif
+
+/*
+ * KVM dirty GFN flags, defined as:
+ *
+ * |---------------+---------------+--------------|
+ * | bit 1 (reset) | bit 0 (dirty) | Status       |
+ * |---------------+---------------+--------------|
+ * |             0 |             0 | Invalid GFN  |
+ * |             0 |             1 | Dirty GFN    |
+ * |             1 |             X | GFN to reset |
+ * |---------------+---------------+--------------|
+ *
+ * Lifecycle of a dirty GFN goes like:
+ *
+ *      dirtied         collected        reset
+ * 00 -----------> 01 -------------> 1X -------+
+ *  ^                                          |
+ *  |                                          |
+ *  +------------------------------------------+
+ *
+ * The userspace program is only responsible for the 01->1X state
+ * conversion (to collect dirty bits).  Also, it must not skip any
+ * dirty bits so that dirty bits are always collected in sequence.
+ */
+#define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
+#define KVM_DIRTY_GFN_F_RESET           BIT(1)
+#define KVM_DIRTY_GFN_F_MASK            0x3
+
+/*
+ * KVM dirty rings should be mapped at KVM_DIRTY_LOG_PAGE_OFFSET of
+ * per-vcpu mmaped regions as an array of struct kvm_dirty_gfn.  The
+ * size of the gfn buffer is decided by the first argument when
+ * enabling KVM_CAP_DIRTY_LOG_RING.
+ */
+struct kvm_dirty_gfn {
+	__u32 flags;
+	__u32 slot;
+	__u64 offset;
+};
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
new file mode 100644
index 000000000000..12d09802b8a7
--- /dev/null
+++ b/virt/kvm/dirty_ring.c
@@ -0,0 +1,195 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * KVM dirty ring implementation
+ *
+ * Copyright 2019 Red Hat, Inc.
+ */
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/vmalloc.h>
+#include <linux/kvm_dirty_ring.h>
+#include <trace/events/kvm.h>
+
+int __weak kvm_cpu_dirty_log_size(void)
+{
+	return 0;
+}
+
+u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
+}
+
+static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
+{
+	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
+}
+
+bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
+{
+	return kvm_dirty_ring_used(ring) >= ring->soft_limit;
+}
+
+bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
+{
+	return kvm_dirty_ring_used(ring) >= ring->size;
+}
+
+struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+	WARN_ON_ONCE(vcpu->kvm != kvm);
+
+	return &vcpu->dirty_ring;
+}
+
+static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
+{
+	struct kvm_memory_slot *memslot;
+	int as_id, id;
+
+	as_id = slot >> 16;
+	id = (u16)slot;
+	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+		return;
+
+	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
+	if (offset >= memslot->npages)
+		return;
+
+	spin_lock(&kvm->mmu_lock);
+	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
+	spin_unlock(&kvm->mmu_lock);
+}
+
+int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
+{
+	ring->dirty_gfns = vmalloc(size);
+	if (!ring->dirty_gfns)
+		return -ENOMEM;
+	memset(ring->dirty_gfns, 0, size);
+
+	ring->size = size / sizeof(struct kvm_dirty_gfn);
+	ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
+	ring->dirty_index = 0;
+	ring->reset_index = 0;
+	ring->index = index;
+
+	return 0;
+}
+
+static inline void kvm_dirty_gfn_set_invalid(struct kvm_dirty_gfn *gfn)
+{
+	gfn->flags = 0;
+}
+
+static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn)
+{
+	gfn->flags = KVM_DIRTY_GFN_F_DIRTY;
+}
+
+static inline bool kvm_dirty_gfn_invalid(struct kvm_dirty_gfn *gfn)
+{
+	return gfn->flags == 0;
+}
+
+static inline bool kvm_dirty_gfn_collected(struct kvm_dirty_gfn *gfn)
+{
+	return gfn->flags & KVM_DIRTY_GFN_F_RESET;
+}
+
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
+{
+	u32 cur_slot, next_slot;
+	u64 cur_offset, next_offset;
+	unsigned long mask;
+	int count = 0;
+	struct kvm_dirty_gfn *entry;
+	bool first_round = true;
+
+	/* This is only needed to make compilers happy */
+	cur_slot = cur_offset = mask = 0;
+
+	while (true) {
+		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+
+		if (!kvm_dirty_gfn_collected(entry))
+			break;
+
+		next_slot = READ_ONCE(entry->slot);
+		next_offset = READ_ONCE(entry->offset);
+
+		/* Update the flags to reflect that this GFN is reset */
+		kvm_dirty_gfn_set_invalid(entry);
+
+		ring->reset_index++;
+		count++;
+		/*
+		 * Try to coalesce the reset operations when the guest is
+		 * scanning pages in the same slot.
+		 */
+		if (!first_round && next_slot == cur_slot) {
+			s64 delta = next_offset - cur_offset;
+
+			if (delta >= 0 && delta < BITS_PER_LONG) {
+				mask |= 1ull << delta;
+				continue;
+			}
+
+			/* Backwards visit, careful about overflows!  */
+			if (delta > -BITS_PER_LONG && delta < 0 &&
+			    (mask << -delta >> -delta) == mask) {
+				cur_offset = next_offset;
+				mask = (mask << -delta) | 1;
+				continue;
+			}
+		}
+		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+		cur_slot = next_slot;
+		cur_offset = next_offset;
+		mask = 1;
+		first_round = false;
+	}
+
+	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+	trace_kvm_dirty_ring_reset(ring);
+
+	return count;
+}
+
+void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
+{
+	struct kvm_dirty_gfn *entry;
+
+	/* It should never get full */
+	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
+
+	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
+
+	/* It should always be an invalid entry to fill in */
+	WARN_ON_ONCE(!kvm_dirty_gfn_invalid(entry));
+
+	entry->slot = slot;
+	entry->offset = offset;
+	/*
+	 * Make sure the data is filled in before we publish this to
+	 * the userspace program.  There's no paired kernel-side reader.
+	 */
+	smp_wmb();
+	kvm_dirty_gfn_set_dirtied(entry);
+	ring->dirty_index++;
+	trace_kvm_dirty_ring_push(ring, slot, offset);
+}
+
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
+{
+	return vmalloc_to_page((void *)ring->dirty_gfns + offset * PAGE_SIZE);
+}
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+	vfree(ring->dirty_gfns);
+	ring->dirty_gfns = NULL;
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1f869dda8110..eacdedf8d122 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -64,6 +64,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>
 
+#include <linux/kvm_dirty_ring.h>
+
 /* Worst case buffer size needed for holding an integer. */
 #define ITOA_MAX_LEN 12
 
@@ -358,6 +360,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 
 void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
+	kvm_dirty_ring_free(&vcpu->dirty_ring);
 	kvm_arch_vcpu_destroy(vcpu);
 
 	/*
@@ -2568,8 +2571,13 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
 {
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
+		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-		set_bit_le(rel_gfn, memslot->dirty_bitmap);
+		if (kvm->dirty_ring_size)
+			kvm_dirty_ring_push(kvm_dirty_ring_get(kvm),
+					    slot, rel_gfn);
+		else
+			set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
 
@@ -2916,6 +2924,16 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 
+static bool kvm_page_in_dirty_ring(struct kvm *kvm, unsigned long pgoff)
+{
+	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
+		return false;
+
+	return (pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
+	    (pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
+	     kvm->dirty_ring_size / PAGE_SIZE);
+}
+
 static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 {
 	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
@@ -2931,6 +2949,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
 		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
 #endif
+	else if (kvm_page_in_dirty_ring(vcpu->kvm, vmf->pgoff))
+		page = kvm_dirty_ring_get_page(
+		    &vcpu->dirty_ring,
+		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
 	else
 		return kvm_arch_vcpu_fault(vcpu, vmf);
 	get_page(page);
@@ -2944,6 +2966,14 @@ static const struct vm_operations_struct kvm_vcpu_vm_ops = {
 
 static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct kvm_vcpu *vcpu = file->private_data;
+	unsigned long pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
+	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
+	    ((vma->vm_flags & VM_EXEC) || !(vma->vm_flags & VM_SHARED)))
+		return -EINVAL;
+
 	vma->vm_ops = &kvm_vcpu_vm_ops;
 	return 0;
 }
@@ -3037,6 +3067,13 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 	if (r)
 		goto vcpu_free_run_page;
 
+	if (kvm->dirty_ring_size) {
+		r = kvm_dirty_ring_alloc(&vcpu->dirty_ring,
+					 id, kvm->dirty_ring_size);
+		if (r)
+			goto arch_vcpu_destroy;
+	}
+
 	kvm_create_vcpu_debugfs(vcpu);
 
 	mutex_lock(&kvm->lock);
@@ -3072,6 +3109,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 unlock_vcpu_destroy:
 	mutex_unlock(&kvm->lock);
 	debugfs_remove_recursive(vcpu->debugfs_dentry);
+	kvm_dirty_ring_free(&vcpu->dirty_ring);
+arch_vcpu_destroy:
 	kvm_arch_vcpu_destroy(vcpu);
 vcpu_free_run_page:
 	free_page((unsigned long)vcpu->run);
@@ -3543,12 +3582,78 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 	case KVM_CAP_NR_MEMSLOTS:
 		return KVM_USER_MEM_SLOTS;
+	case KVM_CAP_DIRTY_LOG_RING:
+#ifdef CONFIG_X86
+		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
+#else
+		return 0;
+#endif
 	default:
 		break;
 	}
 	return kvm_vm_ioctl_check_extension(kvm, arg);
 }
 
+static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
+{
+	int r;
+
+	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
+		return -EINVAL;
+
+	/* the size should be a power of 2 */
+	if (!size || (size & (size - 1)))
+		return -EINVAL;
+
+	/* Should be big enough for the reserved entries, and at least one page */
+	if (size < kvm_dirty_ring_get_rsvd_entries() *
+	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
+		return -EINVAL;
+
+	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
+	    sizeof(struct kvm_dirty_gfn))
+		return -E2BIG;
+
+	/* We only allow it to be set once */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->lock);
+
+	if (kvm->created_vcpus) {
+		/* We don't allow changing this value after vcpus are created */
+		r = -EINVAL;
+	} else {
+		kvm->dirty_ring_size = size;
+		r = 0;
+	}
+
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
+static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	int cleared = 0;
+
+	if (!kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->slots_lock);
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
+
+	mutex_unlock(&kvm->slots_lock);
+
+	if (cleared)
+		kvm_flush_remote_tlbs(kvm);
+
+	return cleared;
+}
+
 int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 						  struct kvm_enable_cap *cap)
 {
@@ -3572,6 +3677,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 		return 0;
 	}
 #endif
+	case KVM_CAP_DIRTY_LOG_RING:
+		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
@@ -3759,6 +3866,9 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;
+	case KVM_RESET_DIRTY_RINGS:
+		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
+		break;
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v8 06/14] KVM: Make dirty ring exclusive to dirty bitmap log
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
                   ` (4 preceding siblings ...)
  2020-03-31 18:59 ` [PATCH v8 05/14] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-03-31 18:59 ` [PATCH v8 07/14] KVM: Don't allocate dirty bitmap if dirty ring is enabled Peter Xu
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx

There's no good reason to use both the dirty bitmap logging and the
new dirty ring buffer to track dirty bits.  We could probably support
both at the same time, but that would complicate things while helping
little.  Let's simply make it the rule, before we enable dirty ring on
any arch, that the two interfaces cannot be used together.

The big switch is the enablement of the KVM_CAP_DIRTY_LOG_RING
capability.  That's where we move from the default dirty logging
scheme to the dirty ring scheme.  As long as kvm->dirty_ring_size is
set up correctly, the current virtual machine switches to the dirty
ring buffer mode once and for all.
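
As a rough sketch only (the vm_fd variable, the helper name and the
error handling below are illustrative, not part of this series), the
userspace side of that switch could look like:

#include <err.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vm_fd is an already-created VM file descriptor */
static void enable_dirty_ring(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_DIRTY_LOG_RING,
		/* ring size in bytes; must be a power of 2 */
		.args[0] = 1024 * sizeof(struct kvm_dirty_gfn),
	};

	/* Must be done after VM creation, before any vcpu is created */
	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
		err(1, "KVM_CAP_DIRTY_LOG_RING");
}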

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 Documentation/virt/kvm/api.rst |  7 +++++++
 virt/kvm/kvm_main.c            | 12 ++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index aa54a34077b7..d56f86ba05a0 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6225,3 +6225,10 @@ make sure all the existing dirty gfns are flushed to the dirty rings.
 
 The dirty ring can get full.  When it happens, the KVM_RUN of the
 vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL.
+
+NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
+ioctl KVM_RESET_DIRTY_RINGS are mutually exclusive with the existing ioctl
+KVM_GET_DIRTY_LOG.  After enabling KVM_CAP_DIRTY_LOG_RING with an
+acceptable dirty ring size, the virtual machine will switch to the
+dirty ring tracking mode.  Further ioctls to either KVM_GET_DIRTY_LOG
+or KVM_CLEAR_DIRTY_LOG will fail.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index eacdedf8d122..1a4cc20c5a3c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1355,6 +1355,10 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
 	unsigned long n;
 	unsigned long any = 0;
 
+	/* Dirty ring tracking is exclusive to dirty log tracking */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
 	*memslot = NULL;
 	*is_dirty = 0;
 
@@ -1416,6 +1420,10 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
+	/* Dirty ring tracking is exclusive to dirty log tracking */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
@@ -1524,6 +1532,10 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
+	/* Dirty ring tracking is exclusive to dirty log tracking */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v8 07/14] KVM: Don't allocate dirty bitmap if dirty ring is enabled
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
                   ` (5 preceding siblings ...)
  2020-03-31 18:59 ` [PATCH v8 06/14] KVM: Make dirty ring exclusive to dirty bitmap log Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-03-31 18:59 ` [PATCH v8 08/14] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx

Because the kvm dirty ring and the kvm dirty log are used in an
exclusive way, let's avoid creating the dirty_bitmap when the kvm
dirty ring is enabled.  Meanwhile, since the dirty_bitmap will now be
created conditionally, we can't use it as a sign of "whether this
memory slot enabled dirty tracking" any more.  Change such users to
check against the kvm memory slot flags instead.

Note that a kvm memory slot can still end up with its dirty_bitmap
allocated: _if_ the memory slot is created before the dirty ring is
enabled, and with the dirty tracking flag set, it will still carry the
dirty_bitmap.  However it should not hurt much (e.g., the bitmaps will
always be freed if they are there), and real users normally won't
trigger this, because the dirty tracking flag is in most cases only
applied to kvm slots right before migration starts, which is far later
than kvm initialization (VM start).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c   | 4 ++--
 include/linux/kvm_host.h | 5 +++++
 virt/kvm/kvm_main.c      | 4 ++--
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e770a5dd0c30..970025875abc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1276,8 +1276,8 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
 		return NULL;
-	if (no_dirty_log && slot->dirty_bitmap)
-		return NULL;
+	if (no_dirty_log && kvm_slot_dirty_track_enabled(slot))
+		return NULL;
 
 	return slot;
 }
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 291a9a9a1239..2061cafaf56d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -351,6 +351,11 @@ struct kvm_memory_slot {
 	u16 as_id;
 };
 
+static inline bool kvm_slot_dirty_track_enabled(struct kvm_memory_slot *slot)
+{
+	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
+}
+
 static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
 {
 	return ALIGN(memslot->npages, BITS_PER_LONG) / 8;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1a4cc20c5a3c..ae4930e404d1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1294,7 +1294,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	/* Allocate/free page dirty bitmap as needed */
 	if (!(new.flags & KVM_MEM_LOG_DIRTY_PAGES))
 		new.dirty_bitmap = NULL;
-	else if (!new.dirty_bitmap) {
+	else if (!new.dirty_bitmap && !kvm->dirty_ring_size) {
 		r = kvm_alloc_dirty_bitmap(&new);
 		if (r)
 			return r;
@@ -2581,7 +2581,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
 				    struct kvm_memory_slot *memslot,
 				    gfn_t gfn)
 {
-	if (memslot && memslot->dirty_bitmap) {
+	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v8 08/14] KVM: selftests: Always clear dirty bitmap after iteration
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
                   ` (6 preceding siblings ...)
  2020-03-31 18:59 ` [PATCH v8 07/14] KVM: Don't allocate dirty bitmap if dirty ring is enabled Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-04-01  7:04   ` Andrew Jones
  2020-03-31 18:59 ` [PATCH v8 09/14] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx

We didn't clear the dirty bitmap before because KVM_GET_DIRTY_LOG will
clear it for us before copying the dirty log onto it.  However it's
still better to clear it explicitly instead of assuming the kernel
will always do it for us.

More importantly, in the upcoming dirty ring tests we'll start to
fetch dirty pages from a ring buffer, so no one is going to clear the
dirty bitmap for us.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 752ec158ac59..6a8275a22861 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -195,7 +195,7 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
 				    page);
 		}
 
-		if (test_bit_le(page, bmap)) {
+		if (test_and_clear_bit_le(page, bmap)) {
 			host_dirty_count++;
 			/*
 			 * If the bit is set, the value written onto
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v8 09/14] KVM: selftests: Sync uapi/linux/kvm.h to tools/
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
                   ` (7 preceding siblings ...)
  2020-03-31 18:59 ` [PATCH v8 08/14] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-03-31 18:59 ` [PATCH v8 10/14] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx

This will be needed to extend the kvm selftest program.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/include/uapi/linux/kvm.h | 100 ++++++++++++++++++++++++++++++++-
 1 file changed, 98 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 4b95f9a31a2f..74f150c69ee6 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
 #define KVM_EXIT_ARM_NISV         28
+#define KVM_EXIT_DIRTY_RING_FULL  29
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -474,12 +475,17 @@ struct kvm_s390_mem_op {
 	__u32 size;		/* amount of bytes */
 	__u32 op;		/* type of operation */
 	__u64 buf;		/* buffer in userspace */
-	__u8 ar;		/* the access register number */
-	__u8 reserved[31];	/* should be set to 0 */
+	union {
+		__u8 ar;	/* the access register number */
+		__u32 sida_offset; /* offset into the sida */
+		__u8 reserved[32]; /* should be set to 0 */
+	};
 };
 /* types for kvm_s390_mem_op->op */
 #define KVM_S390_MEMOP_LOGICAL_READ	0
 #define KVM_S390_MEMOP_LOGICAL_WRITE	1
+#define KVM_S390_MEMOP_SIDA_READ	2
+#define KVM_S390_MEMOP_SIDA_WRITE	3
 /* flags for kvm_s390_mem_op->flags */
 #define KVM_S390_MEMOP_F_CHECK_ONLY		(1ULL << 0)
 #define KVM_S390_MEMOP_F_INJECT_EXCEPTION	(1ULL << 1)
@@ -1010,6 +1016,9 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_NISV_TO_USER 177
 #define KVM_CAP_ARM_INJECT_EXT_DABT 178
 #define KVM_CAP_S390_VCPU_RESETS 179
+#define KVM_CAP_S390_PROTECTED 180
+#define KVM_CAP_PPC_SECURE_GUEST 181
+#define KVM_CAP_DIRTY_LOG_RING 182
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1478,6 +1487,42 @@ struct kvm_enc_region {
 #define KVM_S390_NORMAL_RESET	_IO(KVMIO,   0xc3)
 #define KVM_S390_CLEAR_RESET	_IO(KVMIO,   0xc4)
 
+struct kvm_s390_pv_sec_parm {
+	__u64 origin;
+	__u64 length;
+};
+
+struct kvm_s390_pv_unp {
+	__u64 addr;
+	__u64 size;
+	__u64 tweak;
+};
+
+enum pv_cmd_id {
+	KVM_PV_ENABLE,
+	KVM_PV_DISABLE,
+	KVM_PV_SET_SEC_PARMS,
+	KVM_PV_UNPACK,
+	KVM_PV_VERIFY,
+	KVM_PV_PREP_RESET,
+	KVM_PV_UNSHARE_ALL,
+};
+
+struct kvm_pv_cmd {
+	__u32 cmd;	/* Command to be executed */
+	__u16 rc;	/* Ultravisor return code */
+	__u16 rrc;	/* Ultravisor return reason code */
+	__u64 data;	/* Data or address */
+	__u32 flags;    /* flags for future extensions. Must be 0 for now */
+	__u32 reserved[3];
+};
+
+/* Available with KVM_CAP_S390_PROTECTED */
+#define KVM_S390_PV_COMMAND		_IOWR(KVMIO, 0xc5, struct kvm_pv_cmd)
+
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc6)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
@@ -1628,4 +1673,55 @@ struct kvm_hyperv_eventfd {
 #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
 #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
 
+#define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE    (1 << 0)
+#define KVM_DIRTY_LOG_INITIALLY_SET            (1 << 1)
+
+/*
+ * Arch needs to define the macro after implementing the dirty ring
+ * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
+ * starting page offset of the dirty ring structures.
+ */
+#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
+#define KVM_DIRTY_LOG_PAGE_OFFSET 0
+#endif
+
+/*
+ * KVM dirty GFN flags, defined as:
+ *
+ * |---------------+---------------+--------------|
+ * | bit 1 (reset) | bit 0 (dirty) | Status       |
+ * |---------------+---------------+--------------|
+ * |             0 |             0 | Invalid GFN  |
+ * |             0 |             1 | Dirty GFN    |
+ * |             1 |             X | GFN to reset |
+ * |---------------+---------------+--------------|
+ *
+ * Lifecycle of a dirty GFN goes like:
+ *
+ *      dirtied         collected        reset
+ * 00 -----------> 01 -------------> 1X -------+
+ *  ^                                          |
+ *  |                                          |
+ *  +------------------------------------------+
+ *
+ * The userspace program is only responsible for the 01->1X state
+ * conversion (to collect dirty bits).  Also, it must not skip any
+ * dirty bits so that dirty bits are always collected in sequence.
+ */
+#define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
+#define KVM_DIRTY_GFN_F_RESET           BIT(1)
+#define KVM_DIRTY_GFN_F_MASK            0x3
+
+/*
+ * KVM dirty rings should be mapped at KVM_DIRTY_LOG_PAGE_OFFSET of
+ * per-vcpu mmaped regions as an array of struct kvm_dirty_gfn.  The
+ * size of the gfn buffer is decided by the first argument when
+ * enabling KVM_CAP_DIRTY_LOG_RING.
+ */
+struct kvm_dirty_gfn {
+	__u32 flags;
+	__u32 slot;
+	__u64 offset;
+};
+
 #endif /* __LINUX_KVM_H */
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v8 10/14] KVM: selftests: Use a single binary for dirty/clear log test
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
                   ` (8 preceding siblings ...)
  2020-03-31 18:59 ` [PATCH v8 09/14] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-03-31 18:59 ` [PATCH v8 11/14] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx, Andrew Jones

Remove the clear_dirty_log test and instead merge it into the existing
dirty_log_test.  It should be cleaner to use this single binary for
both tests, and it also prepares for the upcoming dirty ring test.

The default behavior will run all the modes in sequence.
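
For example (invocation shown for illustration only), one could run
"./dirty_log_test -M clear-log" to exercise just the clear-log mode,
or omit -M to cycle through all supported modes in sequence.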

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/Makefile          |   2 -
 .../selftests/kvm/clear_dirty_log_test.c      |   6 -
 tools/testing/selftests/kvm/dirty_log_test.c  | 187 +++++++++++++++---
 3 files changed, 156 insertions(+), 39 deletions(-)
 delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 712a2ddd2a27..fee0393f10da 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -28,13 +28,11 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
 TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
-TEST_GEN_PROGS_x86_64 += clear_dirty_log_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_x86_64 += steal_time
 
-TEST_GEN_PROGS_aarch64 += clear_dirty_log_test
 TEST_GEN_PROGS_aarch64 += demand_paging_test
 TEST_GEN_PROGS_aarch64 += dirty_log_test
 TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/clear_dirty_log_test.c b/tools/testing/selftests/kvm/clear_dirty_log_test.c
deleted file mode 100644
index 11672ec6f74e..000000000000
--- a/tools/testing/selftests/kvm/clear_dirty_log_test.c
+++ /dev/null
@@ -1,6 +0,0 @@
-#define USE_CLEAR_DIRTY_LOG
-#define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (1 << 0)
-#define KVM_DIRTY_LOG_INITIALLY_SET         (1 << 1)
-#define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
-		KVM_DIRTY_LOG_INITIALLY_SET)
-#include "dirty_log_test.c"
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 6a8275a22861..139ccb550618 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -128,6 +128,78 @@ static uint64_t host_dirty_count;
 static uint64_t host_clear_count;
 static uint64_t host_track_next_count;
 
+enum log_mode_t {
+	/* Only use KVM_GET_DIRTY_LOG for logging */
+	LOG_MODE_DIRTY_LOG = 0,
+
+	/* Use both KVM_[GET|CLEAR]_DIRTY_LOG for logging */
+	LOG_MODE_CLEAR_LOG = 1,
+
+	LOG_MODE_NUM,
+
+	/* Run all supported modes */
+	LOG_MODE_ALL = LOG_MODE_NUM,
+};
+
+/* Mode of logging to test.  Default is to run all supported modes */
+static enum log_mode_t host_log_mode_option = LOG_MODE_ALL;
+/* Logging mode for current run */
+static enum log_mode_t host_log_mode;
+
+static bool clear_log_supported(void)
+{
+	return kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
+}
+
+static void clear_log_create_vm_done(struct kvm_vm *vm)
+{
+	struct kvm_enable_cap cap = {};
+	u64 manual_caps;
+
+	manual_caps = kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
+	TEST_ASSERT(manual_caps, "MANUAL_CAPS is zero!");
+	manual_caps &= (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
+			KVM_DIRTY_LOG_INITIALLY_SET);
+	cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2;
+	cap.args[0] = manual_caps;
+	vm_enable_cap(vm, &cap);
+}
+
+static void dirty_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					  void *bitmap, uint32_t num_pages)
+{
+	kvm_vm_get_dirty_log(vm, slot, bitmap);
+}
+
+static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					  void *bitmap, uint32_t num_pages)
+{
+	kvm_vm_get_dirty_log(vm, slot, bitmap);
+	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
+}
+
+struct log_mode {
+	const char *name;
+	/* Return true if this mode is supported, otherwise false */
+	bool (*supported)(void);
+	/* Hook when the vm creation is done (before vcpu creation) */
+	void (*create_vm_done)(struct kvm_vm *vm);
+	/* Hook to collect the dirty pages into the bitmap provided */
+	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
+				     void *bitmap, uint32_t num_pages);
+} log_modes[LOG_MODE_NUM] = {
+	{
+		.name = "dirty-log",
+		.collect_dirty_pages = dirty_log_collect_dirty_pages,
+	},
+	{
+		.name = "clear-log",
+		.supported = clear_log_supported,
+		.create_vm_done = clear_log_create_vm_done,
+		.collect_dirty_pages = clear_log_collect_dirty_pages,
+	},
+};
+
 /*
  * We use this bitmap to track some pages that should have its dirty
  * bit set in the _next_ iteration.  For example, if we detected the
@@ -137,6 +209,44 @@ static uint64_t host_track_next_count;
  */
 static unsigned long *host_bmap_track;
 
+static void log_modes_dump(void)
+{
+	int i;
+
+	printf("all");
+	for (i = 0; i < LOG_MODE_NUM; i++)
+		printf(", %s", log_modes[i].name);
+	printf("\n");
+}
+
+static bool log_mode_supported(void)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->supported)
+		return mode->supported();
+
+	return true;
+}
+
+static void log_mode_create_vm_done(struct kvm_vm *vm)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->create_vm_done)
+		mode->create_vm_done(vm);
+}
+
+static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					 void *bitmap, uint32_t num_pages)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	TEST_ASSERT(mode->collect_dirty_pages != NULL,
+		    "collect_dirty_pages() is required for any log mode!");
+	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
+}
+
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -257,6 +367,7 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, uint32_t vcpuid,
 #ifdef __x86_64__
 	vm_create_irqchip(vm);
 #endif
+	log_mode_create_vm_done(vm);
 	vm_vcpu_add_default(vm, vcpuid, guest_code);
 	return vm;
 }
@@ -264,10 +375,6 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, uint32_t vcpuid,
 #define DIRTY_MEM_BITS 30 /* 1G */
 #define PAGE_SHIFT_4K  12
 
-#ifdef USE_CLEAR_DIRTY_LOG
-static u64 dirty_log_manual_caps;
-#endif
-
 static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 		     unsigned long interval, uint64_t phys_offset)
 {
@@ -275,6 +382,12 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 	struct kvm_vm *vm;
 	unsigned long *bmap;
 
+	if (!log_mode_supported()) {
+		print_skip("Log mode '%s' not supported",
+			   log_modes[host_log_mode].name);
+		return;
+	}
+
 	/*
 	 * We reserve page table for 2 times of extra dirty mem which
 	 * will definitely cover the original (1G+) test range.  Here
@@ -317,14 +430,6 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 	bmap = bitmap_alloc(host_num_pages);
 	host_bmap_track = bitmap_alloc(host_num_pages);
 
-#ifdef USE_CLEAR_DIRTY_LOG
-	struct kvm_enable_cap cap = {};
-
-	cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2;
-	cap.args[0] = dirty_log_manual_caps;
-	vm_enable_cap(vm, &cap);
-#endif
-
 	/* Add an extra memory slot for testing dirty logging */
 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
 				    guest_test_phys_mem,
@@ -362,11 +467,8 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 	while (iteration < iterations) {
 		/* Give the vcpu thread some time to dirty some pages */
 		usleep(interval * 1000);
-		kvm_vm_get_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap);
-#ifdef USE_CLEAR_DIRTY_LOG
-		kvm_vm_clear_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap, 0,
-				       host_num_pages);
-#endif
+		log_mode_collect_dirty_pages(vm, TEST_MEM_SLOT_INDEX,
+					     bmap, host_num_pages);
 		vm_dirty_log_verify(mode, bmap);
 		iteration++;
 		sync_global_to_guest(vm, iteration);
@@ -410,6 +512,9 @@ static void help(char *name)
 	       TEST_HOST_LOOP_INTERVAL);
 	printf(" -p: specify guest physical test memory offset\n"
 	       "     Warning: a low offset can conflict with the loaded test code.\n");
+	printf(" -M: specify the host logging mode "
+	       "(default: run all log modes).  Supported modes: \n\t");
+	log_modes_dump();
 	printf(" -m: specify the guest mode ID to test "
 	       "(default: test all supported modes)\n"
 	       "     This option may be used multiple times.\n"
@@ -429,18 +534,7 @@ int main(int argc, char *argv[])
 	bool mode_selected = false;
 	uint64_t phys_offset = 0;
 	unsigned int mode;
-	int opt, i;
-
-#ifdef USE_CLEAR_DIRTY_LOG
-	dirty_log_manual_caps =
-		kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
-	if (!dirty_log_manual_caps) {
-		print_skip("KVM_CLEAR_DIRTY_LOG not available");
-		exit(KSFT_SKIP);
-	}
-	dirty_log_manual_caps &= (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
-				  KVM_DIRTY_LOG_INITIALLY_SET);
-#endif
+	int opt, i, j;
 
 #ifdef __x86_64__
 	guest_mode_init(VM_MODE_PXXV48_4K, true, true);
@@ -464,7 +558,7 @@ int main(int argc, char *argv[])
 	guest_mode_init(VM_MODE_P40V48_4K, true, true);
 #endif
 
-	while ((opt = getopt(argc, argv, "hi:I:p:m:")) != -1) {
+	while ((opt = getopt(argc, argv, "hi:I:p:m:M:")) != -1) {
 		switch (opt) {
 		case 'i':
 			iterations = strtol(optarg, NULL, 10);
@@ -486,6 +580,26 @@ int main(int argc, char *argv[])
 				    "Guest mode ID %d too big", mode);
 			guest_modes[mode].enabled = true;
 			break;
+		case 'M':
+			if (!strcmp(optarg, "all")) {
+				host_log_mode_option = LOG_MODE_ALL;
+				break;
+			}
+			for (i = 0; i < LOG_MODE_NUM; i++) {
+				if (!strcmp(optarg, log_modes[i].name)) {
+					pr_info("Setting log mode to: '%s'\n",
+						optarg);
+					host_log_mode_option = i;
+					break;
+				}
+			}
+			if (i == LOG_MODE_NUM) {
+				printf("Log mode '%s' invalid. Please choose "
+				       "from: ", optarg);
+				log_modes_dump();
+				exit(1);
+			}
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
@@ -507,7 +621,18 @@ int main(int argc, char *argv[])
 		TEST_ASSERT(guest_modes[i].supported,
 			    "Guest mode ID %d (%s) not supported.",
 			    i, vm_guest_mode_string(i));
-		run_test(i, iterations, interval, phys_offset);
+		if (host_log_mode_option == LOG_MODE_ALL) {
+			/* Run each log mode */
+			for (j = 0; j < LOG_MODE_NUM; j++) {
+				pr_info("Testing Log Mode '%s'\n",
+					log_modes[j].name);
+				host_log_mode = j;
+				run_test(i, iterations, interval, phys_offset);
+			}
+		} else {
+			host_log_mode = host_log_mode_option;
+			run_test(i, iterations, interval, phys_offset);
+		}
 	}
 
 	return 0;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v8 11/14] KVM: selftests: Introduce after_vcpu_run hook for dirty log test
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
                   ` (9 preceding siblings ...)
  2020-03-31 18:59 ` [PATCH v8 10/14] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-04-01  7:03   ` Andrew Jones
  2020-03-31 18:59 ` [PATCH v8 12/14] KVM: selftests: Add dirty ring buffer test Peter Xu
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx

Provide a hook for the checks after vcpu_run() completes.  This is a
preparation for the dirty ring test, where we'll need to take care of
another exit reason.

While at it, drop pages_count, because we already have a better
summary with the statistics, and clean things up a bit.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 36 +++++++++++++-------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 139ccb550618..a2160946bcf5 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -178,6 +178,15 @@ static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
 }
 
+static void default_after_vcpu_run(struct kvm_vm *vm)
+{
+	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+
+	TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
+		    "Invalid guest sync status: exit_reason=%s\n",
+		    exit_reason_str(run->exit_reason));
+}
+
 struct log_mode {
 	const char *name;
 	/* Return true if this mode is supported, otherwise false */
@@ -187,16 +196,20 @@ struct log_mode {
 	/* Hook to collect the dirty pages into the bitmap provided */
 	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
 				     void *bitmap, uint32_t num_pages);
+	/* Hook to call after each vcpu run */
+	void (*after_vcpu_run)(struct kvm_vm *vm);
 } log_modes[LOG_MODE_NUM] = {
 	{
 		.name = "dirty-log",
 		.collect_dirty_pages = dirty_log_collect_dirty_pages,
+		.after_vcpu_run = default_after_vcpu_run,
 	},
 	{
 		.name = "clear-log",
 		.supported = clear_log_supported,
 		.create_vm_done = clear_log_create_vm_done,
 		.collect_dirty_pages = clear_log_collect_dirty_pages,
+		.after_vcpu_run = default_after_vcpu_run,
 	},
 };
 
@@ -247,6 +260,14 @@ static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
 }
 
+static void log_mode_after_vcpu_run(struct kvm_vm *vm)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->after_vcpu_run)
+		mode->after_vcpu_run(vm);
+}
+
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -261,25 +282,16 @@ static void *vcpu_worker(void *data)
 	struct kvm_vm *vm = data;
 	uint64_t *guest_array;
 	uint64_t pages_count = 0;
-	struct kvm_run *run;
-
-	run = vcpu_state(vm, VCPU_ID);
 
 	guest_array = addr_gva2hva(vm, (vm_vaddr_t)random_array);
-	generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
 
 	while (!READ_ONCE(host_quit)) {
+		generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
+		pages_count += TEST_PAGES_PER_LOOP;
 		/* Let the guest dirty the random pages */
 		ret = _vcpu_run(vm, VCPU_ID);
 		TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-		if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
-			pages_count += TEST_PAGES_PER_LOOP;
-			generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
-		} else {
-			TEST_FAIL("Invalid guest sync status: "
-				  "exit_reason=%s\n",
-				  exit_reason_str(run->exit_reason));
-		}
+		log_mode_after_vcpu_run(vm);
 	}
 
 	pr_info("Dirtied %"PRIu64" pages\n", pages_count);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v8 12/14] KVM: selftests: Add dirty ring buffer test
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
                   ` (10 preceding siblings ...)
  2020-03-31 18:59 ` [PATCH v8 11/14] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-03-31 18:59 ` [PATCH v8 13/14] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx

Add the initial dirty ring buffer test.

The current test implements userspace dirty ring collection by reaping
the dirty ring only when the ring is full.

So it's still running synchronously like this:

            vcpu                             main thread

  1. vcpu dirties pages
  2. vcpu gets dirty ring full
     (userspace exit)

                                       3. main thread waits until full
                                          (so hardware buffers flushed)
                                       4. main thread collects
                                       5. main thread continues vcpu

  6. vcpu continues, goes back to 1

We can't directly collect dirty bits during vcpu execution, because
otherwise we can't guarantee the hardware dirty bits have been flushed
by the time we collect.  We're very strict about the dirty bits, so
otherwise the later verify procedure could fail.  A follow-up patch
will make this test support async operation just like the existing
dirty log test, by adding a vcpu kick mechanism.
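
For reference, a minimal sketch of what one collect-and-reset pass
boils down to (single vcpu, ring already mmap()ed; function and
variable names are illustrative, the bitmap handling is simplified,
and the acquire load is a conservative choice to pair with the
kernel-side smp_wmb() -- the test below just uses plain reads):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * ring:  mmap()ed array of "entries" struct kvm_dirty_gfn
 * fetch: collector-private cursor, persists across calls
 */
static uint32_t collect_dirty_ring(int vm_fd, struct kvm_dirty_gfn *ring,
				   uint32_t entries, uint32_t *fetch,
				   uint64_t *bitmap)
{
	struct kvm_dirty_gfn *cur;
	uint32_t collected = 0;

	for (;;) {
		cur = &ring[*fetch % entries];
		/* Stop at the first entry that is not dirty (00 or 1X) */
		if (__atomic_load_n(&cur->flags, __ATOMIC_ACQUIRE) !=
		    KVM_DIRTY_GFN_F_DIRTY)
			break;
		/* Record the dirty page (cur->slot ignored in this sketch) */
		bitmap[cur->offset / 64] |= 1ULL << (cur->offset % 64);
		cur->flags = KVM_DIRTY_GFN_F_RESET;	/* 01 -> 1X */
		(*fetch)++;
		collected++;
	}

	/* Ask KVM to reset the harvested entries and re-protect the pages */
	ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);

	return collected;
}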

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c  | 201 +++++++++++++++++-
 .../testing/selftests/kvm/include/kvm_util.h  |   3 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  59 +++++
 .../selftests/kvm/lib/kvm_util_internal.h     |   4 +
 4 files changed, 265 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index a2160946bcf5..531431cff4fc 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -12,8 +12,10 @@
 #include <unistd.h>
 #include <time.h>
 #include <pthread.h>
+#include <semaphore.h>
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
+#include <asm/barrier.h>
 
 #include "test_util.h"
 #include "kvm_util.h"
@@ -57,6 +59,8 @@
 # define test_and_clear_bit_le	test_and_clear_bit
 #endif
 
+#define TEST_DIRTY_RING_COUNT		1024
+
 /*
  * Guest/Host shared variables. Ensure addr_gva2hva() and/or
  * sync_global_to/from_guest() are used when accessing from
@@ -128,6 +132,10 @@ static uint64_t host_dirty_count;
 static uint64_t host_clear_count;
 static uint64_t host_track_next_count;
 
+/* Whether dirty ring reset is requested, or finished */
+static sem_t dirty_ring_vcpu_stop;
+static sem_t dirty_ring_vcpu_cont;
+
 enum log_mode_t {
 	/* Only use KVM_GET_DIRTY_LOG for logging */
 	LOG_MODE_DIRTY_LOG = 0,
@@ -135,6 +143,9 @@ enum log_mode_t {
 	/* Use both KVM_[GET|CLEAR]_DIRTY_LOG for logging */
 	LOG_MODE_CLEAR_LOG = 1,
 
+	/* Use dirty ring for logging */
+	LOG_MODE_DIRTY_RING = 2,
+
 	LOG_MODE_NUM,
 
 	/* Run all supported modes */
@@ -187,6 +198,120 @@ static void default_after_vcpu_run(struct kvm_vm *vm)
 		    exit_reason_str(run->exit_reason));
 }
 
+static bool dirty_ring_supported(void)
+{
+	return kvm_check_cap(KVM_CAP_DIRTY_LOG_RING);
+}
+
+static void dirty_ring_create_vm_done(struct kvm_vm *vm)
+{
+	/*
+	 * Switch to dirty ring mode after VM creation but before any
+	 * of the vcpu creation.
+	 */
+	vm_enable_dirty_ring(vm, TEST_DIRTY_RING_COUNT *
+			     sizeof(struct kvm_dirty_gfn));
+}
+
+static inline bool dirty_gfn_is_dirtied(struct kvm_dirty_gfn *gfn)
+{
+	return gfn->flags == KVM_DIRTY_GFN_F_DIRTY;
+}
+
+static inline void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn)
+{
+	gfn->flags = KVM_DIRTY_GFN_F_RESET;
+}
+
+static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
+				       int slot, void *bitmap,
+				       uint32_t num_pages, uint32_t *fetch_index)
+{
+	struct kvm_dirty_gfn *cur;
+	uint32_t count = 0;
+
+	while (true) {
+		cur = &dirty_gfns[*fetch_index % TEST_DIRTY_RING_COUNT];
+		if (!dirty_gfn_is_dirtied(cur))
+			break;
+		TEST_ASSERT(cur->slot == slot, "Slot number didn't match: "
+			    "%u != %u", cur->slot, slot);
+		TEST_ASSERT(cur->offset < num_pages, "Offset overflow: "
+			    "0x%llx >= 0x%x", cur->offset, num_pages);
+		pr_info("fetch 0x%x page %llu\n", *fetch_index, cur->offset);
+		set_bit(cur->offset, bitmap);
+		dirty_gfn_set_collected(cur);
+		(*fetch_index)++;
+		count++;
+	}
+
+	return count;
+}
+
+static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					   void *bitmap, uint32_t num_pages)
+{
+	/* We only have one vcpu */
+	static uint32_t fetch_index = 0;
+	uint32_t count = 0, cleared;
+
+	/*
+	 * Before fetching the dirty pages, we need a vmexit of the
+	 * worker vcpu to make sure the hardware dirty buffers were
+	 * flushed.  This is not needed for dirty-log/clear-log tests
+	 * because get dirty log will naturally do so.
+	 *
+	 * For now we do it in the simple way - we simply wait until
+	 * the vcpu uses up the soft dirty ring, then it'll always
+	 * do a vmexit to make sure that PML buffers will be flushed.
+	 * In real hypervisors, we probably need a vcpu kick or to
+	 * stop the vcpus (before the final sync) to make sure we'll
+	 * get all the existing dirty PFNs even cached in hardware.
+	 */
+	sem_wait(&dirty_ring_vcpu_stop);
+
+	/* Only have one vcpu */
+	count = dirty_ring_collect_one(vcpu_map_dirty_ring(vm, VCPU_ID),
+				       slot, bitmap, num_pages, &fetch_index);
+
+	cleared = kvm_vm_reset_dirty_ring(vm);
+
+	/* Cleared pages should be the same as collected */
+	TEST_ASSERT(cleared == count, "Reset dirty pages (%u) mismatch "
+		    "with collected (%u)", cleared, count);
+
+	pr_info("Notifying vcpu to continue\n");
+	sem_post(&dirty_ring_vcpu_cont);
+
+	pr_info("Iteration %ld collected %u pages\n", iteration, count);
+}
+
+static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
+{
+	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+
+	/* A ucall-sync or ring-full event is allowed */
+	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
+		/* We should allow this to continue */
+		;
+	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
+		sem_post(&dirty_ring_vcpu_stop);
+		pr_info("vcpu stops because dirty ring full...\n");
+		sem_wait(&dirty_ring_vcpu_cont);
+		pr_info("vcpu continues now.\n");
+	} else {
+		TEST_ASSERT(false, "Invalid guest sync status: "
+			    "exit_reason=%s\n",
+			    exit_reason_str(run->exit_reason));
+	}
+}
+
+static void dirty_ring_before_vcpu_join(void)
+{
+	/* Kick the vcpu one more time just to make sure it will quit */
+	sem_post(&dirty_ring_vcpu_cont);
+}
+
 struct log_mode {
 	const char *name;
 	/* Return true if this mode is supported, otherwise false */
@@ -198,6 +323,7 @@ struct log_mode {
 				     void *bitmap, uint32_t num_pages);
 	/* Hook to call after each vcpu run */
 	void (*after_vcpu_run)(struct kvm_vm *vm);
+	void (*before_vcpu_join) (void);
 } log_modes[LOG_MODE_NUM] = {
 	{
 		.name = "dirty-log",
@@ -211,6 +337,14 @@ struct log_mode {
 		.collect_dirty_pages = clear_log_collect_dirty_pages,
 		.after_vcpu_run = default_after_vcpu_run,
 	},
+	{
+		.name = "dirty-ring",
+		.supported = dirty_ring_supported,
+		.create_vm_done = dirty_ring_create_vm_done,
+		.collect_dirty_pages = dirty_ring_collect_dirty_pages,
+		.before_vcpu_join = dirty_ring_before_vcpu_join,
+		.after_vcpu_run = dirty_ring_after_vcpu_run,
+	},
 };
 
 /*
@@ -268,6 +402,14 @@ static void log_mode_after_vcpu_run(struct kvm_vm *vm)
 		mode->after_vcpu_run(vm);
 }
 
+static void log_mode_before_vcpu_join(void)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->before_vcpu_join)
+		mode->before_vcpu_join();
+}
+
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -318,14 +460,65 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
 		}
 
 		if (test_and_clear_bit_le(page, bmap)) {
+			bool matched;
+
 			host_dirty_count++;
+
 			/*
 			 * If the bit is set, the value written onto
 			 * the corresponding page should be either the
 			 * previous iteration number or the current one.
 			 */
-			TEST_ASSERT(*value_ptr == iteration ||
-				    *value_ptr == iteration - 1,
+			matched = (*value_ptr == iteration ||
+				   *value_ptr == iteration - 1);
+
+			if (host_log_mode == LOG_MODE_DIRTY_RING && !matched) {
+				if (*value_ptr == iteration - 2) {
+					/*
+					 * Short answer: this case is special
+					 * only for dirty ring test where the
+					 * page is the last page before a kvm
+					 * dirty ring full in iteration N-2.
+					 *
+					 * Long answer: Assuming ring size R,
+					 * one possible condition is:
+					 *
+					 *      main thr       vcpu thr
+					 *      --------       --------
+					 *    iter=1
+					 *                   write 1 to page 0~(R-1)
+					 *                   full, vmexit
+					 *    collect 0~(R-1)
+					 *    kick vcpu
+					 *                   write 1 to (R-1)~(2R-2)
+					 *                   full, vmexit
+					 *    iter=2
+					 *    collect (R-1)~(2R-2)
+					 *    kick vcpu
+					 *                   write 1 to (2R-2)
+					 *                   (NOTE!!! "1" cached in cpu reg)
+					 *                   write 2 to (2R-1)~(3R-3)
+					 *                   full, vmexit
+					 *    iter=3
+					 *    collect (2R-2)~(3R-3)
+					 *    (here if we read value on page
+					 *     "2R-2" is 1, while iter=3!!!)
+					 */
+					matched = true;
+				} else {
+					/*
+					 * This is also special for dirty ring
+					 * when this page is exactly the last
+					 * page touched before vcpu ring full.
+					 * If it happens, we should expect the
+					 * value to change in the next round.
+					 */
+					set_bit_le(page, host_bmap_track);
+					continue;
+				}
+			}
+
+			TEST_ASSERT(matched,
 				    "Set page %"PRIu64" value %"PRIu64
 				    " incorrect (iteration=%"PRIu64")",
 				    page, *value_ptr, iteration);
@@ -488,6 +681,7 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 
 	/* Tell the vcpu thread to quit */
 	host_quit = true;
+	log_mode_before_vcpu_join();
 	pthread_join(vcpu_thread, NULL);
 
 	pr_info("Total bits checked: dirty (%"PRIu64"), clear (%"PRIu64"), "
@@ -548,6 +742,9 @@ int main(int argc, char *argv[])
 	unsigned int mode;
 	int opt, i, j;
 
+	sem_init(&dirty_ring_vcpu_stop, 0, 0);
+	sem_init(&dirty_ring_vcpu_cont, 0, 0);
+
 #ifdef __x86_64__
 	guest_mode_init(VM_MODE_PXXV48_4K, true, true);
 #endif
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index a99b875f50d2..554fdb294bef 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -62,6 +62,7 @@ enum vm_mem_backing_src_type {
 
 int kvm_check_cap(long cap);
 int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap);
+void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
 
 struct kvm_vm *vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm);
 struct kvm_vm *_vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm);
@@ -71,6 +72,7 @@ void kvm_vm_release(struct kvm_vm *vmp);
 void kvm_vm_get_dirty_log(struct kvm_vm *vm, int slot, void *log);
 void kvm_vm_clear_dirty_log(struct kvm_vm *vm, int slot, void *log,
 			    uint64_t first_page, uint32_t num_pages);
+uint32_t kvm_vm_reset_dirty_ring(struct kvm_vm *vm);
 
 int kvm_memcmp_hva_gva(void *hva, struct kvm_vm *vm, const vm_vaddr_t gva,
 		       size_t len);
@@ -192,6 +194,7 @@ void vcpu_nested_state_get(struct kvm_vm *vm, uint32_t vcpuid,
 int vcpu_nested_state_set(struct kvm_vm *vm, uint32_t vcpuid,
 			  struct kvm_nested_state *state, bool ignore_error);
 #endif
+void *vcpu_map_dirty_ring(struct kvm_vm *vm, uint32_t vcpuid);
 
 const char *exit_reason_str(unsigned int exit_reason);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 8a3523d4434f..e632d1f4a112 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -85,6 +85,16 @@ int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap)
 	return ret;
 }
 
+void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size)
+{
+	struct kvm_enable_cap cap = { 0 };
+
+	cap.cap = KVM_CAP_DIRTY_LOG_RING;
+	cap.args[0] = ring_size;
+	vm_enable_cap(vm, &cap);
+	vm->dirty_ring_size = ring_size;
+}
+
 static void vm_open(struct kvm_vm *vm, int perm)
 {
 	vm->kvm_fd = open(KVM_DEV_PATH, perm);
@@ -295,6 +305,11 @@ void kvm_vm_clear_dirty_log(struct kvm_vm *vm, int slot, void *log,
 		    __func__, strerror(-ret));
 }
 
+uint32_t kvm_vm_reset_dirty_ring(struct kvm_vm *vm)
+{
+	return ioctl(vm->fd, KVM_RESET_DIRTY_RINGS);
+}
+
 /*
  * Userspace Memory Region Find
  *
@@ -406,6 +421,13 @@ static void vm_vcpu_rm(struct kvm_vm *vm, uint32_t vcpuid)
 	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
 	int ret;
 
+	if (vcpu->dirty_gfns) {
+		ret = munmap(vcpu->dirty_gfns, vm->dirty_ring_size);
+		TEST_ASSERT(ret == 0, "munmap of VCPU dirty ring failed, "
+			    "rc: %i errno: %i", ret, errno);
+		vcpu->dirty_gfns = NULL;
+	}
+
 	ret = munmap(vcpu->state, sizeof(*vcpu->state));
 	TEST_ASSERT(ret == 0, "munmap of VCPU fd failed, rc: %i "
 		"errno: %i", ret, errno);
@@ -1475,6 +1497,42 @@ int _vcpu_ioctl(struct kvm_vm *vm, uint32_t vcpuid,
 	return ret;
 }
 
+void *vcpu_map_dirty_ring(struct kvm_vm *vm, uint32_t vcpuid)
+{
+	struct vcpu *vcpu;
+	uint32_t size = vm->dirty_ring_size;
+
+	TEST_ASSERT(size > 0, "Should enable dirty ring first");
+
+	vcpu = vcpu_find(vm, vcpuid);
+
+	TEST_ASSERT(vcpu, "Cannot find vcpu %u", vcpuid);
+
+	if (!vcpu->dirty_gfns) {
+		void *addr;
+
+		addr = mmap(NULL, size, PROT_READ,
+			    MAP_PRIVATE, vcpu->fd,
+			    vm->page_size * KVM_DIRTY_LOG_PAGE_OFFSET);
+		TEST_ASSERT(addr == MAP_FAILED, "Dirty ring mapped private");
+
+		addr = mmap(NULL, size, PROT_READ | PROT_EXEC,
+			    MAP_PRIVATE, vcpu->fd,
+			    vm->page_size * KVM_DIRTY_LOG_PAGE_OFFSET);
+		TEST_ASSERT(addr == MAP_FAILED, "Dirty ring mapped exec");
+
+		addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			    MAP_SHARED, vcpu->fd,
+			    vm->page_size * KVM_DIRTY_LOG_PAGE_OFFSET);
+		TEST_ASSERT(addr != MAP_FAILED, "Dirty ring map failed");
+
+		vcpu->dirty_gfns = addr;
+		vcpu->dirty_gfns_count = size / sizeof(struct kvm_dirty_gfn);
+	}
+
+	return vcpu->dirty_gfns;
+}
+
 /*
  * VM Ioctl
  *
@@ -1569,6 +1627,7 @@ static struct exit_reason {
 	{KVM_EXIT_INTERNAL_ERROR, "INTERNAL_ERROR"},
 	{KVM_EXIT_OSI, "OSI"},
 	{KVM_EXIT_PAPR_HCALL, "PAPR_HCALL"},
+	{KVM_EXIT_DIRTY_RING_FULL, "DIRTY_RING_FULL"},
 #ifdef KVM_EXIT_MEMORY_NOT_PRESENT
 	{KVM_EXIT_MEMORY_NOT_PRESENT, "MEMORY_NOT_PRESENT"},
 #endif
diff --git a/tools/testing/selftests/kvm/lib/kvm_util_internal.h b/tools/testing/selftests/kvm/lib/kvm_util_internal.h
index ca56a0133127..22c84d9c8b03 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util_internal.h
+++ b/tools/testing/selftests/kvm/lib/kvm_util_internal.h
@@ -28,6 +28,9 @@ struct vcpu {
 	uint32_t id;
 	int fd;
 	struct kvm_run *state;
+	struct kvm_dirty_gfn *dirty_gfns;
+	uint32_t fetch_index;
+	uint32_t dirty_gfns_count;
 };
 
 struct kvm_vm {
@@ -50,6 +53,7 @@ struct kvm_vm {
 	vm_paddr_t pgd;
 	vm_vaddr_t gdt;
 	vm_vaddr_t tss;
+	uint32_t dirty_ring_size;
 };
 
 struct vcpu *vcpu_find(struct kvm_vm *vm, uint32_t vcpuid);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v8 13/14] KVM: selftests: Let dirty_log_test async for dirty ring test
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
                   ` (11 preceding siblings ...)
  2020-03-31 18:59 ` [PATCH v8 12/14] KVM: selftests: Add dirty ring buffer test Peter Xu
@ 2020-03-31 18:59 ` Peter Xu
  2020-04-01  7:48   ` Andrew Jones
  2020-03-31 19:00 ` [PATCH v8 14/14] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
  2020-04-22 18:51 ` [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
  14 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-03-31 18:59 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx

Previously the dirty ring test worked in a synchronous way, because
only with a vmexit (which used to be the ring-full event) do we know
that the hardware dirty bits have been flushed to the dirty ring.

With this patch we introduce a vcpu kick mechanism using SIGUSR1,
which guarantees a vmexit and hence the flushing of the hardware dirty
bits.  With that, we can keep the vcpu dirtying work asynchronous from
the whole collection procedure.  Still, we need to be careful: we can
only run asynchronously if the vcpu has not reached the soft limit
(no KVM_EXIT_DIRTY_RING_FULL).  Otherwise we must collect the dirty
bits before continuing the vcpu.

Further increase the dirty ring size to the current maximum to make
sure we torture the no-ring-full case more, which should be the major
scenario when hypervisors like QEMU want to use this feature.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c  | 126 +++++++++++++-----
 .../testing/selftests/kvm/include/kvm_util.h  |   1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   9 ++
 3 files changed, 106 insertions(+), 30 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 531431cff4fc..4b404dfdc2f9 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -13,6 +13,9 @@
 #include <time.h>
 #include <pthread.h>
 #include <semaphore.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <errno.h>
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
 #include <asm/barrier.h>
@@ -59,7 +62,9 @@
 # define test_and_clear_bit_le	test_and_clear_bit
 #endif
 
-#define TEST_DIRTY_RING_COUNT		1024
+#define TEST_DIRTY_RING_COUNT		65536
+
+#define SIG_IPI SIGUSR1
 
 /*
  * Guest/Host shared variables. Ensure addr_gva2hva() and/or
@@ -135,6 +140,12 @@ static uint64_t host_track_next_count;
 /* Whether dirty ring reset is requested, or finished */
 static sem_t dirty_ring_vcpu_stop;
 static sem_t dirty_ring_vcpu_cont;
+/*
+ * This is updated by the vcpu thread to tell the host whether it's a
+ * ring-full event.  It should only be read after a sem_wait() on
+ * dirty_ring_vcpu_stop and before the vcpu continues to run.
+ */
+static bool dirty_ring_vcpu_ring_full;
 
 enum log_mode_t {
 	/* Only use KVM_GET_DIRTY_LOG for logging */
@@ -156,6 +167,33 @@ enum log_mode_t {
 static enum log_mode_t host_log_mode_option = LOG_MODE_ALL;
 /* Logging mode for current run */
 static enum log_mode_t host_log_mode;
+static pthread_t vcpu_thread;
+
+/* Only way to pass this to the signal handler */
+static struct kvm_vm *current_vm;
+
+static void vcpu_sig_handler(int sig)
+{
+	TEST_ASSERT(sig == SIG_IPI, "unknown signal: %d", sig);
+}
+
+static void vcpu_kick(void)
+{
+	pthread_kill(vcpu_thread, SIG_IPI);
+}
+
+/*
+ * In our test we do signal tricks, let's use a better version of
+ * sem_wait to avoid signal interrupts
+ */
+static void sem_wait_until(sem_t *sem)
+{
+	int ret;
+
+	do
+		ret = sem_wait(sem);
+	while (ret == -1 && errno == EINTR);
+}
 
 static bool clear_log_supported(void)
 {
@@ -189,10 +227,13 @@ static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
 }
 
-static void default_after_vcpu_run(struct kvm_vm *vm)
+static void default_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
 
+	TEST_ASSERT(ret == 0 || (ret == -1 && err == EINTR),
+		    "vcpu run failed: errno=%d", err);
+
 	TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
 		    "Invalid guest sync status: exit_reason=%s\n",
 		    exit_reason_str(run->exit_reason));
@@ -248,27 +289,37 @@ static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
 	return count;
 }
 
+static void dirty_ring_wait_vcpu(void)
+{
+	/* This makes sure that the hardware PML cache is flushed */
+	vcpu_kick();
+	sem_wait_until(&dirty_ring_vcpu_stop);
+}
+
+static void dirty_ring_continue_vcpu(void)
+{
+	pr_info("Notifying vcpu to continue\n");
+	sem_post(&dirty_ring_vcpu_cont);
+}
+
 static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
 					   void *bitmap, uint32_t num_pages)
 {
 	/* We only have one vcpu */
 	static uint32_t fetch_index = 0;
 	uint32_t count = 0, cleared;
+	bool continued_vcpu = false;
 
-	/*
-	 * Before fetching the dirty pages, we need a vmexit of the
-	 * worker vcpu to make sure the hardware dirty buffers were
-	 * flushed.  This is not needed for dirty-log/clear-log tests
-	 * because get dirty log will naturally do so.
-	 *
-	 * For now we do it in the simple way - we simply wait until
-	 * the vcpu uses up the soft dirty ring, then it'll always
-	 * do a vmexit to make sure that PML buffers will be flushed.
-	 * In real hypervisors, we probably need a vcpu kick or to
-	 * stop the vcpus (before the final sync) to make sure we'll
-	 * get all the existing dirty PFNs even cached in hardware.
-	 */
-	sem_wait(&dirty_ring_vcpu_stop);
+	dirty_ring_wait_vcpu();
+
+	if (!dirty_ring_vcpu_ring_full) {
+		/*
+		 * This is not a ring-full event, it's safe to allow
+		 * vcpu to continue
+		 */
+		dirty_ring_continue_vcpu();
+		continued_vcpu = true;
+	}
 
 	/* Only have one vcpu */
 	count = dirty_ring_collect_one(vcpu_map_dirty_ring(vm, VCPU_ID),
@@ -280,13 +331,16 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	TEST_ASSERT(cleared == count, "Reset dirty pages (%u) mismatch "
 		    "with collected (%u)", cleared, count);
 
-	pr_info("Notifying vcpu to continue\n");
-	sem_post(&dirty_ring_vcpu_cont);
+	if (!continued_vcpu) {
+		TEST_ASSERT(dirty_ring_vcpu_ring_full,
+			    "Didn't continue vcpu even without ring full");
+		dirty_ring_continue_vcpu();
+	}
 
 	pr_info("Iteration %ld collected %u pages\n", iteration, count);
 }
 
-static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
+static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
 
@@ -294,10 +348,16 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
 	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
 		/* We should allow this to continue */
 		;
-	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
+	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL ||
+		   (ret == -1 && err == EINTR)) {
+		/* Update the flag first before pause */
+		WRITE_ONCE(dirty_ring_vcpu_ring_full,
+			   run->exit_reason == KVM_EXIT_DIRTY_RING_FULL);
 		sem_post(&dirty_ring_vcpu_stop);
-		pr_info("vcpu stops because dirty ring full...\n");
-		sem_wait(&dirty_ring_vcpu_cont);
+		pr_info("vcpu stops because %s...\n",
+			dirty_ring_vcpu_ring_full ?
+			"dirty ring is full" : "vcpu is kicked out");
+		sem_wait_until(&dirty_ring_vcpu_cont);
 		pr_info("vcpu continues now.\n");
 	} else {
 		TEST_ASSERT(false, "Invalid guest sync status: "
@@ -322,7 +382,7 @@ struct log_mode {
 	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
 				     void *bitmap, uint32_t num_pages);
 	/* Hook to call when after each vcpu run */
-	void (*after_vcpu_run)(struct kvm_vm *vm);
+	void (*after_vcpu_run)(struct kvm_vm *vm, int ret, int err);
 	void (*before_vcpu_join) (void);
 } log_modes[LOG_MODE_NUM] = {
 	{
@@ -394,12 +454,12 @@ static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
 }
 
-static void log_mode_after_vcpu_run(struct kvm_vm *vm)
+static void log_mode_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct log_mode *mode = &log_modes[host_log_mode];
 
 	if (mode->after_vcpu_run)
-		mode->after_vcpu_run(vm);
+		mode->after_vcpu_run(vm, ret, err);
 }
 
 static void log_mode_before_vcpu_join(void)
@@ -420,20 +480,27 @@ static void generate_random_array(uint64_t *guest_array, uint64_t size)
 
 static void *vcpu_worker(void *data)
 {
-	int ret;
+	int ret, vcpu_fd;
 	struct kvm_vm *vm = data;
 	uint64_t *guest_array;
 	uint64_t pages_count = 0;
+	struct sigaction sigact;
+
+	current_vm = vm;
+	vcpu_fd = vcpu_get_fd(vm, VCPU_ID);
+	memset(&sigact, 0, sizeof(sigact));
+	sigact.sa_handler = vcpu_sig_handler;
+	sigaction(SIG_IPI, &sigact, NULL);
 
 	guest_array = addr_gva2hva(vm, (vm_vaddr_t)random_array);
 
 	while (!READ_ONCE(host_quit)) {
+		/* Clear any existing kick signals */
 		generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
 		pages_count += TEST_PAGES_PER_LOOP;
 		/* Let the guest dirty the random pages */
-		ret = _vcpu_run(vm, VCPU_ID);
-		TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-		log_mode_after_vcpu_run(vm);
+		ret = ioctl(vcpu_fd, KVM_RUN, NULL);
+		log_mode_after_vcpu_run(vm, ret, errno);
 	}
 
 	pr_info("Dirtied %"PRIu64" pages\n", pages_count);
@@ -583,7 +650,6 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, uint32_t vcpuid,
 static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 		     unsigned long interval, uint64_t phys_offset)
 {
-	pthread_t vcpu_thread;
 	struct kvm_vm *vm;
 	unsigned long *bmap;
 
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 554fdb294bef..62254375ec50 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -144,6 +144,7 @@ vm_paddr_t addr_gva2gpa(struct kvm_vm *vm, vm_vaddr_t gva);
 struct kvm_run *vcpu_state(struct kvm_vm *vm, uint32_t vcpuid);
 void vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
 int _vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
+int vcpu_get_fd(struct kvm_vm *vm, uint32_t vcpuid);
 void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid);
 void vcpu_set_mp_state(struct kvm_vm *vm, uint32_t vcpuid,
 		       struct kvm_mp_state *mp_state);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index e632d1f4a112..0e79bde7a2a8 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1207,6 +1207,15 @@ int _vcpu_run(struct kvm_vm *vm, uint32_t vcpuid)
 	return rc;
 }
 
+int vcpu_get_fd(struct kvm_vm *vm, uint32_t vcpuid)
+{
+	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
+
+	TEST_ASSERT(vcpu != NULL, "vcpu not found, vcpuid: %u", vcpuid);
+
+	return vcpu->fd;
+}
+
 void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid)
 {
 	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v8 14/14] KVM: selftests: Add "-c" parameter to dirty log test
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
                   ` (12 preceding siblings ...)
  2020-03-31 18:59 ` [PATCH v8 13/14] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
@ 2020-03-31 19:00 ` Peter Xu
  2020-04-22 18:51 ` [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
  14 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-03-31 19:00 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, peterx, Andrew Jones

It's only used to override the default dirty ring size/count.  With a
bigger ring count we test the async path of the dirty ring, while with
a smaller ring count we test the ring-full code path.  Async is the
default.

It has no effect on non-dirty-ring tests.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 4b404dfdc2f9..80c42c87265e 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -168,6 +168,7 @@ static enum log_mode_t host_log_mode_option = LOG_MODE_ALL;
 /* Logging mode for current run */
 static enum log_mode_t host_log_mode;
 static pthread_t vcpu_thread;
+static uint32_t test_dirty_ring_count = TEST_DIRTY_RING_COUNT;
 
 /* Only way to pass this to the signal handler */
 static struct kvm_vm *current_vm;
@@ -250,7 +251,7 @@ static void dirty_ring_create_vm_done(struct kvm_vm *vm)
 	 * Switch to dirty ring mode after VM creation but before any
 	 * of the vcpu creation.
 	 */
-	vm_enable_dirty_ring(vm, TEST_DIRTY_RING_COUNT *
+	vm_enable_dirty_ring(vm, test_dirty_ring_count *
 			     sizeof(struct kvm_dirty_gfn));
 }
 
@@ -272,7 +273,7 @@ static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
 	uint32_t count = 0;
 
 	while (true) {
-		cur = &dirty_gfns[*fetch_index % TEST_DIRTY_RING_COUNT];
+		cur = &dirty_gfns[*fetch_index % test_dirty_ring_count];
 		if (!dirty_gfn_is_dirtied(cur))
 			break;
 		TEST_ASSERT(cur->slot == slot, "Slot number didn't match: "
@@ -778,6 +779,9 @@ static void help(char *name)
 	printf("usage: %s [-h] [-i iterations] [-I interval] "
 	       "[-p offset] [-m mode]\n", name);
 	puts("");
+	printf(" -c: specify dirty ring size, in number of entries\n");
+	printf("     (only useful for dirty-ring test; default: %"PRIu32")\n",
+	       TEST_DIRTY_RING_COUNT);
 	printf(" -i: specify iteration counts (default: %"PRIu64")\n",
 	       TEST_HOST_LOOP_N);
 	printf(" -I: specify interval in ms (default: %"PRIu64" ms)\n",
@@ -833,8 +837,11 @@ int main(int argc, char *argv[])
 	guest_mode_init(VM_MODE_P40V48_4K, true, true);
 #endif
 
-	while ((opt = getopt(argc, argv, "hi:I:p:m:M:")) != -1) {
+	while ((opt = getopt(argc, argv, "c:hi:I:p:m:M:")) != -1) {
 		switch (opt) {
+		case 'c':
+			test_dirty_ring_count = strtol(optarg, NULL, 10);
+			break;
 		case 'i':
 			iterations = strtol(optarg, NULL, 10);
 			break;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v8 11/14] KVM: selftests: Introduce after_vcpu_run hook for dirty log test
  2020-03-31 18:59 ` [PATCH v8 11/14] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
@ 2020-04-01  7:03   ` Andrew Jones
  2020-04-01 23:24     ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Andrew Jones @ 2020-04-01  7:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Kevin Tian, Michael S . Tsirkin, Jason Wang,
	Sean Christopherson, Christophe de Dinechin, Yan Zhao,
	Alex Williamson, Paolo Bonzini, Vitaly Kuznetsov,
	Dr . David Alan Gilbert

On Tue, Mar 31, 2020 at 02:59:57PM -0400, Peter Xu wrote:
> Provide a hook for the checks after vcpu_run() completes.  Preparation
> for the dirty ring test because we'll need to take care of another
> exit reason.
> 
> Since at it, drop the pages_count because after all we have a better
> summary right now with statistics, and clean it up a bit.

I don't see what you mean by "drop the pages_count", because it's still
there. But otherwise

Reviewed-by: Andrew Jones <drjones@redhat.com>

> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  tools/testing/selftests/kvm/dirty_log_test.c | 36 +++++++++++++-------
>  1 file changed, 24 insertions(+), 12 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
> index 139ccb550618..a2160946bcf5 100644
> --- a/tools/testing/selftests/kvm/dirty_log_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_test.c
> @@ -178,6 +178,15 @@ static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
>  	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
>  }
>  
> +static void default_after_vcpu_run(struct kvm_vm *vm)
> +{
> +	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
> +
> +	TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
> +		    "Invalid guest sync status: exit_reason=%s\n",
> +		    exit_reason_str(run->exit_reason));
> +}
> +
>  struct log_mode {
>  	const char *name;
>  	/* Return true if this mode is supported, otherwise false */
> @@ -187,16 +196,20 @@ struct log_mode {
>  	/* Hook to collect the dirty pages into the bitmap provided */
>  	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
>  				     void *bitmap, uint32_t num_pages);
> +	/* Hook to call when after each vcpu run */
> +	void (*after_vcpu_run)(struct kvm_vm *vm);
>  } log_modes[LOG_MODE_NUM] = {
>  	{
>  		.name = "dirty-log",
>  		.collect_dirty_pages = dirty_log_collect_dirty_pages,
> +		.after_vcpu_run = default_after_vcpu_run,
>  	},
>  	{
>  		.name = "clear-log",
>  		.supported = clear_log_supported,
>  		.create_vm_done = clear_log_create_vm_done,
>  		.collect_dirty_pages = clear_log_collect_dirty_pages,
> +		.after_vcpu_run = default_after_vcpu_run,
>  	},
>  };
>  
> @@ -247,6 +260,14 @@ static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
>  	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
>  }
>  
> +static void log_mode_after_vcpu_run(struct kvm_vm *vm)
> +{
> +	struct log_mode *mode = &log_modes[host_log_mode];
> +
> +	if (mode->after_vcpu_run)
> +		mode->after_vcpu_run(vm);
> +}
> +
>  static void generate_random_array(uint64_t *guest_array, uint64_t size)
>  {
>  	uint64_t i;
> @@ -261,25 +282,16 @@ static void *vcpu_worker(void *data)
>  	struct kvm_vm *vm = data;
>  	uint64_t *guest_array;
>  	uint64_t pages_count = 0;
> -	struct kvm_run *run;
> -
> -	run = vcpu_state(vm, VCPU_ID);
>  
>  	guest_array = addr_gva2hva(vm, (vm_vaddr_t)random_array);
> -	generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
>  
>  	while (!READ_ONCE(host_quit)) {
> +		generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
> +		pages_count += TEST_PAGES_PER_LOOP;
>  		/* Let the guest dirty the random pages */
>  		ret = _vcpu_run(vm, VCPU_ID);
>  		TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
> -		if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
> -			pages_count += TEST_PAGES_PER_LOOP;
> -			generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
> -		} else {
> -			TEST_FAIL("Invalid guest sync status: "
> -				  "exit_reason=%s\n",
> -				  exit_reason_str(run->exit_reason));
> -		}
> +		log_mode_after_vcpu_run(vm);
>  	}
>  
>  	pr_info("Dirtied %"PRIu64" pages\n", pages_count);
> -- 
> 2.24.1
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v8 08/14] KVM: selftests: Always clear dirty bitmap after iteration
  2020-03-31 18:59 ` [PATCH v8 08/14] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
@ 2020-04-01  7:04   ` Andrew Jones
  0 siblings, 0 replies; 30+ messages in thread
From: Andrew Jones @ 2020-04-01  7:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Kevin Tian, Michael S . Tsirkin, Jason Wang,
	Sean Christopherson, Christophe de Dinechin, Yan Zhao,
	Alex Williamson, Paolo Bonzini, Vitaly Kuznetsov,
	Dr . David Alan Gilbert

On Tue, Mar 31, 2020 at 02:59:54PM -0400, Peter Xu wrote:
> We don't clear the dirty bitmap before because KVM_GET_DIRTY_LOG will
> clear it for us before copying the dirty log onto it.  However we'd
> still better to clear it explicitly instead of assuming the kernel
> will always do it for us.
> 
> More importantly, in the upcoming dirty ring tests we'll start to
> fetch dirty pages from a ring buffer, so no one is going to clear the
> dirty bitmap for us.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  tools/testing/selftests/kvm/dirty_log_test.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
> index 752ec158ac59..6a8275a22861 100644
> --- a/tools/testing/selftests/kvm/dirty_log_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_test.c
> @@ -195,7 +195,7 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
>  				    page);
>  		}
>  
> -		if (test_bit_le(page, bmap)) {
> +		if (test_and_clear_bit_le(page, bmap)) {
>  			host_dirty_count++;
>  			/*
>  			 * If the bit is set, the value written onto
> -- 
> 2.24.1
>

Reviewed-by: Andrew Jones <drjones@redhat.com>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v8 13/14] KVM: selftests: Let dirty_log_test async for dirty ring test
  2020-03-31 18:59 ` [PATCH v8 13/14] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
@ 2020-04-01  7:48   ` Andrew Jones
  0 siblings, 0 replies; 30+ messages in thread
From: Andrew Jones @ 2020-04-01  7:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Kevin Tian, Michael S . Tsirkin, Jason Wang,
	Sean Christopherson, Christophe de Dinechin, Yan Zhao,
	Alex Williamson, Paolo Bonzini, Vitaly Kuznetsov,
	Dr . David Alan Gilbert

On Tue, Mar 31, 2020 at 02:59:59PM -0400, Peter Xu wrote:
> Previously the dirty ring test was working in synchronous way, because
> only with a vmexit (with that it was the ring full event) we'll know
> the hardware dirty bits will be flushed to the dirty ring.
> 
> With this patch we first introduced the vcpu kick mechanism by using
> SIGUSR1, meanwhile we can have a guarantee of vmexit and also the
> flushing of hardware dirty bits.  With all these, we can keep the vcpu
> dirty work asynchronous of the whole collection procedure now.  Still,
> we need to be very careful that we can only do it async if the vcpu is
> not reaching soft limit (no KVM_EXIT_DIRTY_RING_FULL).  Otherwise we
> must collect the dirty bits before continuing the vcpu.
> 
> Further increase the dirty ring size to current maximum to make sure
> we torture more on the no-ring-full case, which should be the major
> scenario when the hypervisors like QEMU would like to use this feature.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  tools/testing/selftests/kvm/dirty_log_test.c  | 126 +++++++++++++-----
>  .../testing/selftests/kvm/include/kvm_util.h  |   1 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |   9 ++
>  3 files changed, 106 insertions(+), 30 deletions(-)
>

For the vcpu_kick and sem_wait stuff

Reviewed-by: Andrew Jones <drjones@redhat.com>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v8 11/14] KVM: selftests: Introduce after_vcpu_run hook for dirty log test
  2020-04-01  7:03   ` Andrew Jones
@ 2020-04-01 23:24     ` Peter Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-04-01 23:24 UTC (permalink / raw)
  To: Andrew Jones
  Cc: linux-kernel, kvm, Kevin Tian, Michael S . Tsirkin, Jason Wang,
	Sean Christopherson, Christophe de Dinechin, Yan Zhao,
	Alex Williamson, Paolo Bonzini, Vitaly Kuznetsov,
	Dr . David Alan Gilbert

On Wed, Apr 01, 2020 at 09:03:22AM +0200, Andrew Jones wrote:
> On Tue, Mar 31, 2020 at 02:59:57PM -0400, Peter Xu wrote:
> > Provide a hook for the checks after vcpu_run() completes.  Preparation
> > for the dirty ring test because we'll need to take care of another
> > exit reason.
> > 
> > Since at it, drop the pages_count because after all we have a better
> > summary right now with statistics, and clean it up a bit.
> 
> I don't see what you mean by "drop the pages_count", because it's still
> there. But otherwise
> 
> Reviewed-by: Andrew Jones <drjones@redhat.com>

I think the pages_count was dropped in some versions, at least in the 1st
version when I wrote the commit, but it must have gone back in during one
of the rebases on top of dirty_log_test.c...  To make it simple, I'll just
remove this paragraph and pick up the r-b (assuming it's still valid with that).

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v8 00/14] KVM: Dirty ring interface
  2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
                   ` (13 preceding siblings ...)
  2020-03-31 19:00 ` [PATCH v8 14/14] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
@ 2020-04-22 18:51 ` Peter Xu
  2020-04-23  6:28   ` Tian, Kevin
  14 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-04-22 18:51 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Kevin Tian, Michael S . Tsirkin, Jason Wang, Sean Christopherson,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert

Hi,

TL;DR: I'm wondering whether we should record the pure GPA/GFN instead of the
(slot_id, slot_offset) tuple for dirty pages in the kvm dirty ring, to decouple
kvm_dirty_gfn from memslots.

(A slightly longer version starts...)

The problem is that binding dirty tracking operations to KVM memslots is a
restriction that needs synchronization with memslot changes, which in turn needs
synchronization across all the vcpus because they're the consumers of memslots.
E.g., when we remove a memory slot, we need to flush all the dirty bits
correctly before we do the removal of the memslot.  That's actually a known
defect for QEMU/KVM [1] (I bet it could be a defect for many other
hypervisors...) right now with the current dirty logging.  Meanwhile, even if we
fix it, that procedure does not scale at all, and is prone to deadlocks.

Here memory removal is really an (still corner-cased but relatively) important
scenario to think about for dirty logging compared to memory additions &
moves, because memory addition always starts with no dirty pages, and we
don't really move RAM a lot (or do we ever?!) in a general VM use case.

Then I took a step back to think about why we need this dirty bit information
at all if the memslot is going to be removed.

There're two cases:

  - When the memslot is going to be removed forever, then the dirty information
    is indeed meaningless and can be dropped, and,

  - When the memslot is going to be removed but quickly added back with changed
    size, then we need to keep those dirty bits because it's just a common way
    to e.g. punch an MMIO hole in an existing RAM region (here I'd confess I
    feel like using "slot_id" to identify a memslot is a really unfriendly
    syscall design for things like "hole punching" in the RAM address space...
    However such a "punch hole" operation is really needed even for a common
    guest, e.g. for system reboots or device hotplugs, etc.).

The real scenario we want to cover for dirty tracking is the 2nd one.

If we can track dirty pages using the raw GPA, the 2nd scenario solves itself.
Because we know we'll add those memslots back (though possibly with a
different slot ID), the GPA values will still make sense, which means we
should be able to avoid any kind of synchronization for things like memory
removals, as long as the userspace is aware of that.
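
Just to illustrate the idea, the ring entry could then look something like the
below (a rough sketch only, not part of this series; the exact layout doesn't
matter here):

struct kvm_dirty_gfn {
        __u32 flags;
        __u32 pad;
        __u64 gpa;      /* raw guest physical address instead of (slot, offset) */
};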

With that, when we fetch the dirty bits, we look up the memslot dynamically,
drop the bits if no memslot exists at that address any more (e.g., permanent
removals), and use whatever memslot is there for that guest physical address.
We would for sure still need to handle memory moves, i.e. the userspace still
needs to take care of dirty bit flushing and syncing for a memory move, but
that barely ever happens in practice, so there's little to take care of either.

Does this make sense?  Comments greatly welcomed.

Thanks,

[1] https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg08361.html

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v8 00/14] KVM: Dirty ring interface
  2020-04-22 18:51 ` [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
@ 2020-04-23  6:28   ` Tian, Kevin
  2020-04-23 15:22     ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Tian, Kevin @ 2020-04-23  6:28 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, kvm
  Cc: Michael S . Tsirkin, Jason Wang, Christopherson, Sean J,
	Christophe de Dinechin, Zhao, Yan Y, Alex Williamson,
	Paolo Bonzini, Vitaly Kuznetsov, Dr . David Alan Gilbert

> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, April 23, 2020 2:52 AM
> 
> Hi,
> 
> TL;DR: I'm thinking whether we should record pure GPA/GFN instead of
> (slot_id,
> slot_offset) tuple for dirty pages in kvm dirty ring to unbind kvm_dirty_gfn
> with memslots.
> 
> (A slightly longer version starts...)
> 
> The problem is that binding dirty tracking operations to KVM memslots is a
> restriction that needs synchronization to memslot changes, which further
> needs
> synchronization across all the vcpus because they're the consumers of
> memslots.
> E.g., when we remove a memory slot, we need to flush all the dirty bits
> correctly before we do the removal of the memslot.  That's actually an
> known
> defect for QEMU/KVM [1] (I bet it could be a defect for many other
> hypervisors...) right now with current dirty logging.  Meanwhile, even if we
> fix it, that procedure is not scale at all, and error prone to dead locks.
> 
> Here memory removal is really an (still corner-cased but relatively) important
> scenario to think about for dirty logging comparing to memory additions &
> movings.  Because memory addition will always have no initial dirty page,
> and
> we don't really move RAM a lot (or do we ever?!) for a general VM use case.
> 
> Then I went a step back to think about why we need these dirty bit
> information
> after all if the memslot is going to be removed?
> 
> There're two cases:
> 
>   - When the memslot is going to be removed forever, then the dirty
> information
>     is indeed meaningless and can be dropped, and,
> 
>   - When the memslot is going to be removed but quickly added back with
> changed
>     size, then we need to keep those dirty bits because it's just a commmon
> way
>     to e.g. punch an MMIO hole in an existing RAM region (here I'd confess I
>     feel like using "slot_id" to identify memslot is really unfriendly syscall
>     design for things like "hole punchings" in the RAM address space...
>     However such "punch hold" operation is really needed even for a common
>     guest for either system reboots or device hotplugs, etc.).

why would device hotplug punch a hole in an existing RAM region? 

> 
> The real scenario we want to cover for dirty tracking is the 2nd one.
> 
> If we can track dirty using raw GPA, the 2nd scenario is solved itself.
> Because we know we'll add those memslots back (though it might be with a
> different slot ID), then the GPA value will still make sense, which means we
> should be able to avoid any kind of synchronization for things like memory
> removals, as long as the userspace is aware of that.

A curious question. What about the backing storage of the affected GPA 
is changed after adding back? Is recorded dirty info for previous backing 
storage still making sense for the newer one?

Thanks
Kevin

> 
> With that, when we fetch the dirty bits, we lookup the memslot dynamically,
> drop bits if the memslot does not exist on that address (e.g., permanent
> removals), and use whatever memslot is there for that guest physical
> address.
> Though we for sure still need to handle memory move, that the userspace
> needs
> to still take care of dirty bit flushing and sync for a memory move, however
> that's merely not happening so nothing to take care about either.
> 
> Does this makes sense?  Comments greatly welcomed..
> 
> Thanks,
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg08361.html
> 
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v8 00/14] KVM: Dirty ring interface
  2020-04-23  6:28   ` Tian, Kevin
@ 2020-04-23 15:22     ` Peter Xu
  2020-04-24  6:01       ` Tian, Kevin
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-04-23 15:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: linux-kernel, kvm, Michael S . Tsirkin, Jason Wang,
	Christopherson, Sean J, Christophe de Dinechin, Zhao, Yan Y,
	Alex Williamson, Paolo Bonzini, Vitaly Kuznetsov,
	Dr . David Alan Gilbert

On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Thursday, April 23, 2020 2:52 AM
> > 
> > Hi,
> > 
> > TL;DR: I'm thinking whether we should record pure GPA/GFN instead of
> > (slot_id,
> > slot_offset) tuple for dirty pages in kvm dirty ring to unbind kvm_dirty_gfn
> > with memslots.
> > 
> > (A slightly longer version starts...)
> > 
> > The problem is that binding dirty tracking operations to KVM memslots is a
> > restriction that needs synchronization to memslot changes, which further
> > needs
> > synchronization across all the vcpus because they're the consumers of
> > memslots.
> > E.g., when we remove a memory slot, we need to flush all the dirty bits
> > correctly before we do the removal of the memslot.  That's actually an
> > known
> > defect for QEMU/KVM [1] (I bet it could be a defect for many other
> > hypervisors...) right now with current dirty logging.  Meanwhile, even if we
> > fix it, that procedure is not scale at all, and error prone to dead locks.
> > 
> > Here memory removal is really an (still corner-cased but relatively) important
> > scenario to think about for dirty logging comparing to memory additions &
> > movings.  Because memory addition will always have no initial dirty page,
> > and
> > we don't really move RAM a lot (or do we ever?!) for a general VM use case.
> > 
> > Then I went a step back to think about why we need these dirty bit
> > information
> > after all if the memslot is going to be removed?
> > 
> > There're two cases:
> > 
> >   - When the memslot is going to be removed forever, then the dirty
> > information
> >     is indeed meaningless and can be dropped, and,
> > 
> >   - When the memslot is going to be removed but quickly added back with
> > changed
> >     size, then we need to keep those dirty bits because it's just a commmon
> > way
> >     to e.g. punch an MMIO hole in an existing RAM region (here I'd confess I
> >     feel like using "slot_id" to identify memslot is really unfriendly syscall
> >     design for things like "hole punchings" in the RAM address space...
> >     However such "punch hold" operation is really needed even for a common
> >     guest for either system reboots or device hotplugs, etc.).
> 
> why would device hotplug punch a hole in an existing RAM region? 

I thought it could happen because I used to trace the KVM ioctls and saw
memslot changes during driver loading.  But later when I tried to hotplug a
device I saw that it doesn't...  The new MMIO regions are added only at
0xfe000000 for a virtio-net:

  00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
  00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
  00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
  00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
  00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
  00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba

Does it mean that device hotplug is guaranteed not to trigger RAM changes?  I
am really curious about which cases we need to consider where we still need to
keep the dirty bits across a memory removal, and if system reset is the only
case, then it could be even easier (because we might be able to avoid the sync
on memory removal and instead do it once in a system reset hook)...

> 
> > 
> > The real scenario we want to cover for dirty tracking is the 2nd one.
> > 
> > If we can track dirty using raw GPA, the 2nd scenario is solved itself.
> > Because we know we'll add those memslots back (though it might be with a
> > different slot ID), then the GPA value will still make sense, which means we
> > should be able to avoid any kind of synchronization for things like memory
> > removals, as long as the userspace is aware of that.
> 
> A curious question. What about the backing storage of the affected GPA 
> is changed after adding back? Is recorded dirty info for previous backing 
> storage still making sense for the newer one?

It's the case of a permanent removal, plus another addition iiuc.  Then the
worst case is we get some extra dirty bits set on that new memory region, but
IMHO that's benign (we'll migrate some extra pages even if they could be zero pages).

Thanks,

> 
> Thanks
> Kevin
> 
> > 
> > With that, when we fetch the dirty bits, we lookup the memslot dynamically,
> > drop bits if the memslot does not exist on that address (e.g., permanent
> > removals), and use whatever memslot is there for that guest physical
> > address.
> > Though we for sure still need to handle memory move, that the userspace
> > needs
> > to still take care of dirty bit flushing and sync for a memory move, however
> > that's merely not happening so nothing to take care about either.
> > 
> > Does this makes sense?  Comments greatly welcomed..
> > 
> > Thanks,
> > 
> > [1] https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg08361.html
> > 
> > --
> > Peter Xu
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v8 03/14] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-03-31 18:59 ` [PATCH v8 03/14] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] Peter Xu
@ 2020-04-23 20:39   ` Sean Christopherson
  2020-04-24 15:21     ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2020-04-23 20:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Kevin Tian, Michael S . Tsirkin, Jason Wang,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert

On Tue, Mar 31, 2020 at 02:59:49PM -0400, Peter Xu wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1b6d9ac9533c..faa702c4d37b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9791,7 +9791,32 @@ void kvm_arch_sync_events(struct kvm *kvm)
>  	kvm_free_pit(kvm);
>  }
>  
> -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> +#define  ERR_PTR_USR(e)  ((void __user *)ERR_PTR(e))

Heh, my first thought when reading the below code was "cool, I didn't know
there was ERR_PTR_USR!".  This probably should be in include/linux/err.h,
or maybe a new arch specific implementation if it's not universally safe.

An alternative, which looks enticing given that proper user variants will
be a bit of an explosion, would be to do:

  static void *____x86_set_memory_region(...)
  {
	<actual function>
  }

  void __user *__x86_set_memory_region(...)
  {
	return (void __user *)____x86_set_memory_region(...);
  }

A second alternative would be to return an "unsigned long", i.e. force the
one function that actually accesses the hva to do the cast.  I think I like
this option the best as it would minimize the churn in
__x86_set_memory_region().  Callers can use IS_ERR_VALUE() to detect failure.
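
E.g. (sketch only, reusing the existing APIC access page setup as the example
caller):

	unsigned long hva;

	hva = __x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
				      APIC_DEFAULT_PHYS_BASE, PAGE_SIZE);
	if (IS_ERR_VALUE(hva))
		return (int)hva;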

> +/**
> + * __x86_set_memory_region: Setup KVM internal memory slot
> + *
> + * @kvm: the kvm pointer to the VM.
> + * @id: the slot ID to setup.
> + * @gpa: the GPA to install the slot (unused when @size == 0).
> + * @size: the size of the slot. Set to zero to uninstall a slot.
> + *
> + * This function helps to setup a KVM internal memory slot.  Specify
> + * @size > 0 to install a new slot, while @size == 0 to uninstall a
> + * slot.  The return code can be one of the following:
> + *
> + *   HVA:           on success (uninstall will return a bogus HVA)

I think it's important to call out that it returns '0' on uninstall, e.g.
otherwise it's not clear how a caller can detect failure.

> + *   -errno:        on error
> + *
> + * The caller should always use IS_ERR() to check the return value
> + * before use.  Note, the KVM internal memory slots are guaranteed to
> + * remain valid and unchanged until the VM is destroyed, i.e., the
> + * GPA->HVA translation will not change.  However, the HVA is a user
> + * address, i.e. its accessibility is not guaranteed, and must be
> + * accessed via __copy_{to,from}_user().
> + */
> +void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> +				      u32 size)
>  {
>  	int i, r;
>  	unsigned long hva, uninitialized_var(old_npages);
> @@ -9800,12 +9825,12 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
>  
>  	/* Called with kvm->slots_lock held.  */
>  	if (WARN_ON(id >= KVM_MEM_SLOTS_NUM))
> -		return -EINVAL;
> +		return ERR_PTR_USR(-EINVAL);
>  
>  	slot = id_to_memslot(slots, id);
>  	if (size) {
>  		if (slot && slot->npages)
> -			return -EEXIST;
> +			return ERR_PTR_USR(-EEXIST);
>  
>  		/*
>  		 * MAP_SHARED to prevent internal slot pages from being moved
> @@ -9814,10 +9839,10 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
>  		hva = vm_mmap(NULL, 0, size, PROT_READ | PROT_WRITE,
>  			      MAP_SHARED | MAP_ANONYMOUS, 0);
>  		if (IS_ERR((void *)hva))

IS_ERR_VALUE() can be used to avoid the double cast.

> -			return PTR_ERR((void *)hva);
> +			return (void __user *)hva;

If we still want to go down the route of ERR_PTR_USR, then an ERR_CAST_USR
seems in order.

>  	} else {
>  		if (!slot || !slot->npages)
> -			return 0;
> +			return ERR_PTR_USR(0);

"return ERR_PTR_USR(NULL)" or "return NULL" would be more intuitive.  Moot
point if the return is changed to "unsigned long".

>  
>  		/*
>  		 * Stuff a non-canonical value to catch use-after-delete.  This
> @@ -9838,13 +9863,13 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
>  		m.memory_size = size;
>  		r = __kvm_set_memory_region(kvm, &m);
>  		if (r < 0)
> -			return r;
> +			return ERR_PTR_USR(r);
>  	}
>  
>  	if (!size)
>  		vm_munmap(hva, old_npages * PAGE_SIZE);
>  
> -	return 0;
> +	return (void __user *)hva;
>  }
>  EXPORT_SYMBOL_GPL(__x86_set_memory_region);
>  
> -- 
> 2.24.1
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v8 00/14] KVM: Dirty ring interface
  2020-04-23 15:22     ` Peter Xu
@ 2020-04-24  6:01       ` Tian, Kevin
  2020-04-24 14:19         ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Tian, Kevin @ 2020-04-24  6:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Michael S . Tsirkin, Jason Wang,
	Christopherson, Sean J, Christophe de Dinechin, Zhao, Yan Y,
	Alex Williamson, Paolo Bonzini, Vitaly Kuznetsov,
	Dr . David Alan Gilbert

> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, April 23, 2020 11:23 PM
> 
> On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, April 23, 2020 2:52 AM
> > >
> > > Hi,
> > >
> > > TL;DR: I'm thinking whether we should record pure GPA/GFN instead of
> > > (slot_id,
> > > slot_offset) tuple for dirty pages in kvm dirty ring to unbind
> kvm_dirty_gfn
> > > with memslots.
> > >
> > > (A slightly longer version starts...)
> > >
> > > The problem is that binding dirty tracking operations to KVM memslots is
> a
> > > restriction that needs synchronization to memslot changes, which further
> > > needs
> > > synchronization across all the vcpus because they're the consumers of
> > > memslots.
> > > E.g., when we remove a memory slot, we need to flush all the dirty bits
> > > correctly before we do the removal of the memslot.  That's actually an
> > > known
> > > defect for QEMU/KVM [1] (I bet it could be a defect for many other
> > > hypervisors...) right now with current dirty logging.  Meanwhile, even if
> we
> > > fix it, that procedure is not scale at all, and error prone to dead locks.
> > >
> > > Here memory removal is really an (still corner-cased but relatively)
> important
> > > scenario to think about for dirty logging comparing to memory additions
> &
> > > movings.  Because memory addition will always have no initial dirty page,
> > > and
> > > we don't really move RAM a lot (or do we ever?!) for a general VM use
> case.
> > >
> > > Then I went a step back to think about why we need these dirty bit
> > > information
> > > after all if the memslot is going to be removed?
> > >
> > > There're two cases:
> > >
> > >   - When the memslot is going to be removed forever, then the dirty
> > > information
> > >     is indeed meaningless and can be dropped, and,
> > >
> > >   - When the memslot is going to be removed but quickly added back with
> > > changed
> > >     size, then we need to keep those dirty bits because it's just a commmon
> > > way
> > >     to e.g. punch an MMIO hole in an existing RAM region (here I'd confess
> I
> > >     feel like using "slot_id" to identify memslot is really unfriendly syscall
> > >     design for things like "hole punchings" in the RAM address space...
> > >     However such "punch hold" operation is really needed even for a
> common
> > >     guest for either system reboots or device hotplugs, etc.).
> >
> > why would device hotplug punch a hole in an existing RAM region?
> 
> I thought it could happen because I used to trace the KVM ioctls and see the
> memslot changes during driver loading.  But later when I tried to hotplug a

Is there more detail on why driver loading may lead to memslot changes?

> device I do see that it won't...  The new MMIO regions are added only into
> 0xfe000000 for a virtio-net:
> 
>   00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
>   00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
>   00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
>   00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
>   00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
>   00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba
> 
> Does it mean that device plugging is guaranteed to not trigger RAM changes?

I'd think so. Otherwise, from the guest's p.o.v., any device hotplug would imply
doing a memory hot-unplug first, which would be a bad design.

> I
> am really curious about what cases we need to consider in which we need to
> keep
> the dirty bits for a memory removal, and if system reset is the only case, then
> it could be even easier (because we might be able to avoid the sync in
> memory
> removal but do that once in a sys reset hook)...

Possibly memory hot-unplug, as allowed by recent virtio-mem? 

btw, VFIO faces a similar problem when unmapping a DMA range (e.g. when
vIOMMU is enabled) in the dirty log phase. There could be some dirty bits which
are not yet retrieved when the unmapping happens. VFIO chooses to return the
dirty bits in a buffer passed in via the unmapping parameters. Could the memslot
interface do a similar thing by allowing userspace to specify a buffer pointer to
hold whatever dirty pages have been recorded for the slot that is being removed?
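
Something like below, purely to illustrate the idea (hypothetical, not a
proposed uapi):

struct kvm_userspace_memory_region_ext {
	struct kvm_userspace_memory_region region;
	__u64 dirty_bitmap;	/* userspace buffer to receive the remaining
				 * dirty bits when the slot is removed */
};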

> 
> >
> > >
> > > The real scenario we want to cover for dirty tracking is the 2nd one.
> > >
> > > If we can track dirty using raw GPA, the 2nd scenario is solved itself.
> > > Because we know we'll add those memslots back (though it might be with
> a
> > > different slot ID), then the GPA value will still make sense, which means
> we
> > > should be able to avoid any kind of synchronization for things like
> memory
> > > removals, as long as the userspace is aware of that.
> >
> > A curious question. What about the backing storage of the affected GPA
> > is changed after adding back? Is recorded dirty info for previous backing
> > storage still making sense for the newer one?
> 
> It's the case of a permanent removal, plus another addition iiuc.  Then the
> worst case is we get some extra dirty bits set on that new memory region,
> but
> IMHO that's benigh (we'll migrate some extra pages even they could be zero
> pages).

yes, reporting more dirty bits than necessary doesn't hurt. 

> 
> Thanks,
> 
> >
> > Thanks
> > Kevin
> >
> > >
> > > With that, when we fetch the dirty bits, we lookup the memslot
> dynamically,
> > > drop bits if the memslot does not exist on that address (e.g., permanent
> > > removals), and use whatever memslot is there for that guest physical
> > > address.
> > > Though we for sure still need to handle memory move, that the
> userspace
> > > needs
> > > to still take care of dirty bit flushing and sync for a memory move,
> however
> > > that's merely not happening so nothing to take care about either.
> > >
> > > Does this makes sense?  Comments greatly welcomed..
> > >
> > > Thanks,
> > >
> > > [1] https://lists.gnu.org/archive/html/qemu-devel/2020-
> 03/msg08361.html
> > >
> > > --
> > > Peter Xu
> >
> 
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v8 00/14] KVM: Dirty ring interface
  2020-04-24  6:01       ` Tian, Kevin
@ 2020-04-24 14:19         ` Peter Xu
  2020-04-26 10:29           ` Tian, Kevin
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-04-24 14:19 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: linux-kernel, kvm, Michael S . Tsirkin, Jason Wang,
	Christopherson, Sean J, Christophe de Dinechin, Zhao, Yan Y,
	Alex Williamson, Paolo Bonzini, Vitaly Kuznetsov,
	Dr . David Alan Gilbert

On Fri, Apr 24, 2020 at 06:01:46AM +0000, Tian, Kevin wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Thursday, April 23, 2020 11:23 PM
> > 
> > On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > > > From: Peter Xu <peterx@redhat.com>
> > > > Sent: Thursday, April 23, 2020 2:52 AM
> > > >
> > > > Hi,
> > > >
> > > > TL;DR: I'm thinking whether we should record pure GPA/GFN instead of
> > > > (slot_id,
> > > > slot_offset) tuple for dirty pages in kvm dirty ring to unbind
> > kvm_dirty_gfn
> > > > with memslots.
> > > >
> > > > (A slightly longer version starts...)
> > > >
> > > > The problem is that binding dirty tracking operations to KVM memslots is
> > a
> > > > restriction that needs synchronization to memslot changes, which further
> > > > needs
> > > > synchronization across all the vcpus because they're the consumers of
> > > > memslots.
> > > > E.g., when we remove a memory slot, we need to flush all the dirty bits
> > > > correctly before we do the removal of the memslot.  That's actually an
> > > > known
> > > > defect for QEMU/KVM [1] (I bet it could be a defect for many other
> > > > hypervisors...) right now with current dirty logging.  Meanwhile, even if
> > we
> > > > fix it, that procedure is not scale at all, and error prone to dead locks.
> > > >
> > > > Here memory removal is really an (still corner-cased but relatively)
> > important
> > > > scenario to think about for dirty logging comparing to memory additions
> > &
> > > > movings.  Because memory addition will always have no initial dirty page,
> > > > and
> > > > we don't really move RAM a lot (or do we ever?!) for a general VM use
> > case.
> > > >
> > > > Then I went a step back to think about why we need these dirty bit
> > > > information
> > > > after all if the memslot is going to be removed?
> > > >
> > > > There're two cases:
> > > >
> > > >   - When the memslot is going to be removed forever, then the dirty
> > > > information
> > > >     is indeed meaningless and can be dropped, and,
> > > >
> > > >   - When the memslot is going to be removed but quickly added back with
> > > > changed
> > > >     size, then we need to keep those dirty bits because it's just a commmon
> > > > way
> > > >     to e.g. punch an MMIO hole in an existing RAM region (here I'd confess
> > I
> > > >     feel like using "slot_id" to identify memslot is really unfriendly syscall
> > > >     design for things like "hole punchings" in the RAM address space...
> > > >     However such "punch hold" operation is really needed even for a
> > common
> > > >     guest for either system reboots or device hotplugs, etc.).
> > >
> > > why would device hotplug punch a hole in an existing RAM region?
> > 
> > I thought it could happen because I used to trace the KVM ioctls and see the
> > memslot changes during driver loading.  But later when I tried to hotplug a
> 
> Is there more detail why driver loading may lead to memslot changes?

E.g., I can observe these after Linux boots and before the prompt shows up, on
the simplest VM with only the default devices:

41874@1587736345.192636:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.192760:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.193884:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.193956:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.195788:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.195838:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.196769:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.196827:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.197787:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.197832:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.198777:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.198836:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.200491:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.200537:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.201592:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.201649:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.202415:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.202461:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.203169:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.203225:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.204037:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.204083:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.204983:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.205041:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.205940:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.206022:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.206981:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.207038:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41875@1587736351.141052:kvm_set_user_memory Slot#9 flags=0x1 gpa=0xa0000 size=0x10000 ua=0x7fadf6800000 ret=0

After a careful look, I noticed it's mostly the VGA device turning slot 3
off & on.  Frankly speaking I don't know why it does that.

> 
> > device I do see that it won't...  The new MMIO regions are added only into
> > 0xfe000000 for a virtio-net:
> > 
> >   00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
> >   00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
> >   00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
> >   00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
> >   00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
> >   00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba
> > 
> > Does it mean that device plugging is guaranteed to not trigger RAM changes?
> 
> I'd think so. Otherwise from guest p.o.v any device hotplug implies doing
> a memory hot-unplug first then it's a bad design.

Right that's what I was confused about.  Then maybe you're right. :)

> 
> > I
> > am really curious about what cases we need to consider in which we need to
> > keep
> > the dirty bits for a memory removal, and if system reset is the only case, then
> > it could be even easier (because we might be able to avoid the sync in
> > memory
> > removal but do that once in a sys reset hook)...
> 
> Possibly memory hot-unplug, as allowed by recent virtio-mem? 

That should belong to the case where dirty bits do not matter at all after the
removal, right?  I would be mostly curious about when we (1) remove a memory
slot, and at the same time (2) still care about the dirty bits of that slot.

I'll see whether I can remove the dirty bit sync in kvm_set_phys_mem(), which I
think is really nasty.

> 
> btw VFIO faces a similar problem when unmapping a DMA range (e.g. when
> vIOMMU is enabled) in dirty log phase. There could be some dirty bits which are
> not retrieved when unmapping happens. VFIO chooses to return the dirty
> bits in a buffer passed in the unmapping parameters. Can memslot interface
> do similar thing by allowing the userspace to specify a buffer pointer to hold
> whatever dirty pages recorded for the slot that is being removed?

Yes I think we can, but it may not be necessary.  Actually IMHO CPU accesses to
pages are slightly different from device DMAs in that we can use this sequence
to collect the dirty bits of a slot safely:

  - mark the slot as READONLY
  - KVM_GET_DIRTY_LOG on the slot
  - finally remove the slot

I guess VFIO cannot do that because there's no way to really "mark the region
as read-only" for a device: DMA could still happen, and the DMA would then fail
when writing to a readonly slot.

On the KVM/CPU side, after we mark the slot as READONLY the CPU writes will
page fault and fall back to QEMU userspace, and QEMU will take care of the
writes (so those writes could be slow but still work); so even if we mark it
READONLY it won't fail the writes but just fall back to QEMU.
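
In rough C, the userspace side of that sequence could look like below (a
minimal sketch only: no error handling, illustrative parameter names, and it
assumes KVM_MEM_READONLY can be combined with KVM_MEM_LOG_DIRTY_PAGES on an
existing slot):

#include <linux/kvm.h>
#include <sys/ioctl.h>

static void drain_dirty_then_remove(int vm_fd, __u32 slot, __u64 gpa,
				    __u64 size, __u64 hva, void *bitmap)
{
	struct kvm_userspace_memory_region r = {
		.slot = slot,
		/* Assumption: READONLY + LOG_DIRTY_PAGES can coexist */
		.flags = KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY,
		.guest_phys_addr = gpa,
		.memory_size = size,
		.userspace_addr = hva,
	};
	struct kvm_dirty_log log = { .slot = slot, .dirty_bitmap = bitmap };

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);	/* 1) mark it READONLY */
	ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);		/* 2) final dirty sync */
	r.memory_size = 0;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);	/* 3) remove the slot */
}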

Btw, since we're discussing the VFIO dirty logging across memory removal... the
unmapping DMA range you're talking about needs to be added back later, right?
Then what if the device does DMA after the removal but before it's added back?
I think this is a more general question not for dirty logging but also for when
dirty logging is not enabled - I never understand how this could be fixed with
existing facilities.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v8 03/14] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-04-23 20:39   ` Sean Christopherson
@ 2020-04-24 15:21     ` Peter Xu
  2020-04-27 18:10       ` Sean Christopherson
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-04-24 15:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Kevin Tian, Michael S . Tsirkin, Jason Wang,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert

On Thu, Apr 23, 2020 at 01:39:44PM -0700, Sean Christopherson wrote:
> On Tue, Mar 31, 2020 at 02:59:49PM -0400, Peter Xu wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1b6d9ac9533c..faa702c4d37b 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9791,7 +9791,32 @@ void kvm_arch_sync_events(struct kvm *kvm)
> >  	kvm_free_pit(kvm);
> >  }
> >  
> > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > +#define  ERR_PTR_USR(e)  ((void __user *)ERR_PTR(e))
> 
> Heh, my first thought when reading the below code was "cool, I didn't know
> there was ERR_PTR_USR!".  This probably should be in include/linux/err.h,
> or maybe a new arch specific implementation if it's not universally safe.

Yeah, I just wanted to avoid introducing things in common headers before I'm
sure it'll be used in the rest of the world..  We can always replace them with
a global definition when it comes.

> 
> An alternative, which looks enticing given that proper user variants will
> be a bit of an explosion, would be to do:
> 
>   static void *____x86_set_memory_region(...)
>   {
> 	<actual function>
>   }
> 
>   void __user *__x86_set_memory_region(...)
>   {
> 	return (void __user *)____x86_set_memory_region(...);
>   }
> 
> A second alternative would be to return an "unsigned long", i.e. force the
> one function that actually accesses the hva to do the cast.  I think I like
> this option the best as it would minimize the churn in
> __x86_set_memory_region().  Callers can use IS_ERR_VALUE() to detect failure.

If you don't mind, I would prefer a 2nd opinion (maybe Paolo?) so we can
consolidate the idea before I change them... (I would for sure still prefer the
current approach for simplicity, since after all I don't have a strong opinion..)

> 
> > +/**
> > + * __x86_set_memory_region: Setup KVM internal memory slot
> > + *
> > + * @kvm: the kvm pointer to the VM.
> > + * @id: the slot ID to setup.
> > + * @gpa: the GPA to install the slot (unused when @size == 0).
> > + * @size: the size of the slot. Set to zero to uninstall a slot.
> > + *
> > + * This function helps to setup a KVM internal memory slot.  Specify
> > + * @size > 0 to install a new slot, while @size == 0 to uninstall a
> > + * slot.  The return code can be one of the following:
> > + *
> > + *   HVA:           on success (uninstall will return a bogus HVA)
> 
> I think it's important to call out that it returns '0' on uninstall, e.g.
> otherwise it's not clear how a caller can detect failure.

It will "return (0xdeadull << 48)" as you proposed in abbed4fa94f6? :-)

Frankly speaking I always preferred zero, but that's just not true any more
after the above change.  This also reminded me that maybe we should also return
the same thing at [1] below.

> 
> > + *   -errno:        on error
> > + *
> > + * The caller should always use IS_ERR() to check the return value
> > + * before use.  Note, the KVM internal memory slots are guaranteed to
> > + * remain valid and unchanged until the VM is destroyed, i.e., the
> > + * GPA->HVA translation will not change.  However, the HVA is a user
> > + * address, i.e. its accessibility is not guaranteed, and must be
> > + * accessed via __copy_{to,from}_user().
> > + */
> > +void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> > +				      u32 size)
> >  {
> >  	int i, r;
> >  	unsigned long hva, uninitialized_var(old_npages);
> > @@ -9800,12 +9825,12 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> >  
> >  	/* Called with kvm->slots_lock held.  */
> >  	if (WARN_ON(id >= KVM_MEM_SLOTS_NUM))
> > -		return -EINVAL;
> > +		return ERR_PTR_USR(-EINVAL);
> >  
> >  	slot = id_to_memslot(slots, id);
> >  	if (size) {
> >  		if (slot && slot->npages)
> > -			return -EEXIST;
> > +			return ERR_PTR_USR(-EEXIST);
> >  
> >  		/*
> >  		 * MAP_SHARED to prevent internal slot pages from being moved
> > @@ -9814,10 +9839,10 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> >  		hva = vm_mmap(NULL, 0, size, PROT_READ | PROT_WRITE,
> >  			      MAP_SHARED | MAP_ANONYMOUS, 0);
> >  		if (IS_ERR((void *)hva))
> 
> IS_ERR_VALUE() can be used to avoid the double cast.

Agreed.  But it's a context cleanup, so I normally will keep it as is (or use a
standalone patch).

> 
> > -			return PTR_ERR((void *)hva);
> > +			return (void __user *)hva;
> 
> If we still want to go down the route of ERR_PTR_USR, then an ERR_CAST_USR
> seems in order.

Sure.  But I'll still keep it kvm-only if you won't mind...

> 
> >  	} else {
> >  		if (!slot || !slot->npages)
> > -			return 0;
> > +			return ERR_PTR_USR(0);

[1]

> 
> "return ERR_PTR_USR(NULL)" or "return NULL" would be more intuitive.  Moot
> point if the return is changed to "unsigned long".

ERR_PTR_USR() takes a "long".  I can use ERR_CAST_USR(NULL) if you prefer me to
explicitly use NULL.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v8 00/14] KVM: Dirty ring interface
  2020-04-24 14:19         ` Peter Xu
@ 2020-04-26 10:29           ` Tian, Kevin
  2020-04-27 14:27             ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Tian, Kevin @ 2020-04-26 10:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Michael S . Tsirkin, Jason Wang,
	Christopherson, Sean J, Christophe de Dinechin, Zhao, Yan Y,
	Alex Williamson, Paolo Bonzini, Vitaly Kuznetsov,
	Dr . David Alan Gilbert

> From: Peter Xu
> Sent: Friday, April 24, 2020 10:20 PM
> 
> On Fri, Apr 24, 2020 at 06:01:46AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, April 23, 2020 11:23 PM
> > >
> > > On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > > > > From: Peter Xu <peterx@redhat.com>
> > > > > Sent: Thursday, April 23, 2020 2:52 AM
> > > > >
> > > > > Hi,
> > > > >
> > > > > TL;DR: I'm thinking whether we should record pure GPA/GFN instead
> of
> > > > > (slot_id,
> > > > > slot_offset) tuple for dirty pages in kvm dirty ring to unbind
> > > kvm_dirty_gfn
> > > > > with memslots.
> > > > >
> > > > > (A slightly longer version starts...)
> > > > >
> > > > > The problem is that binding dirty tracking operations to KVM
> memslots is
> > > a
> > > > > restriction that needs synchronization to memslot changes, which
> further
> > > > > needs
> > > > > synchronization across all the vcpus because they're the consumers of
> > > > > memslots.
> > > > > E.g., when we remove a memory slot, we need to flush all the dirty
> bits
> > > > > correctly before we do the removal of the memslot.  That's actually a
> > > > > known
> > > > > defect for QEMU/KVM [1] (I bet it could be a defect for many other
> > > > > hypervisors...) right now with current dirty logging.  Meanwhile, even
> if
> > > we
> > > > > fix it, that procedure does not scale at all, and is error-prone to deadlocks.
> > > > >
> > > > > Here memory removal is really an (still corner-cased but relatively)
> > > important
> > > > > scenario to think about for dirty logging comparing to memory
> additions
> > > &
> > > > > movings.  Because memory addition will always have no initial dirty
> page,
> > > > > and
> > > > > we don't really move RAM a lot (or do we ever?!) for a general VM
> use
> > > case.
> > > > >
> > > > > Then I went a step back to think about why we need these dirty bit
> > > > > information
> > > > > after all if the memslot is going to be removed?
> > > > >
> > > > > There're two cases:
> > > > >
> > > > >   - When the memslot is going to be removed forever, then the dirty
> > > > > information
> > > > >     is indeed meaningless and can be dropped, and,
> > > > >
> > > > >   - When the memslot is going to be removed but quickly added back
> with
> > > > > changed
> > > > >     size, then we need to keep those dirty bits because it's just a
> common
> > > > > way
> > > > >     to e.g. punch an MMIO hole in an existing RAM region (here I'd
> confess
> > > I
> > > > >     feel like using "slot_id" to identify memslot is really unfriendly
> syscall
> > > > >     design for things like "hole punchings" in the RAM address space...
> > > > >     However such a "punch hole" operation is really needed even for a
> > > common
> > > > >     guest for either system reboots or device hotplugs, etc.).
> > > >
> > > > why would device hotplug punch a hole in an existing RAM region?
> > >
> > > I thought it could happen because I used to trace the KVM ioctls and see
> the
> > > memslot changes during driver loading.  But later when I tried to hotplug
> a
> >
> > Is there more detail why driver loading may lead to memslot changes?
> 
> E.g., I can observe these after Linux loads and before the prompt, which is a
> simplest VM with default devices on:
> 
> 41874@1587736345.192636:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.192760:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.193884:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.193956:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.195788:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.195838:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.196769:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.196827:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.197787:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.197832:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.198777:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.198836:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.200491:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.200537:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.201592:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.201649:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.202415:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.202461:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.203169:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.203225:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.204037:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.204083:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.204983:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.205041:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.205940:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.206022:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.206981:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.207038:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41875@1587736351.141052:kvm_set_user_memory Slot#9 flags=0x1
> gpa=0xa0000 size=0x10000 ua=0x7fadf6800000 ret=0
> 
> After a careful look, I noticed it's only the VGA device mostly turning slot 3
> off & on.  Frankly speaking I don't know why it happens to do so.
> 
> >
> > > device I do see that it won't...  The new MMIO regions are added only
> into
> > > 0xfe000000 for a virtio-net:
> > >
> > >   00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
> > >   00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
> > >   00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
> > >   00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
> > >   00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
> > >   00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba
> > >
> > > Does it mean that device plugging is guaranteed to not trigger RAM
> changes?
> >
> > I'd think so. Otherwise, from the guest p.o.v., any device hotplug implies
> > doing a memory hot-unplug first, which would be a bad design.
> 
> Right that's what I was confused about.  Then maybe you're right. :)
> 
> >
> > > I
> > > am really curious about what cases we need to consider in which we
> need to
> > > keep
> > > the dirty bits for a memory removal, and if system reset is the only case,
> then
> > > it could be even easier (because we might be able to avoid the sync in
> > > memory
> > > removal but do that once in a sys reset hook)...
> >
> > Possibly memory hot-unplug, as allowed by recent virtio-mem?
> 
> That should belong to the case where dirty bits do not matter at all after the
> removal, right?  I would be mostly curious about when we (1) remove a
> memory
> slot, and at the same time (2) we still care about the dirty bits of that slot.

I remember one case. The PCIe spec defines a resizable BAR capability, allowing
hardware to communicate supported resource sizes and software to set
an optimal size back to hardware. If such a capability is presented to the guest
and the BAR is backed by memory, it's possible to observe a removal-and-
add-back scenario. However, in such a case the spec requires the software
to clear the memory space enable bit in the command register before changing
the BAR size. I suppose such a thing should happen in the boot phase, where
dirty bits related to the old BAR size don't matter.

> 
> I'll see whether I can remove the dirty bit sync in kvm_set_phys_mem(),
> which I
> think is really nasty.
> 
> >
> > btw VFIO faces a similar problem when unmapping a DMA range (e.g.
> when
> > vIOMMU is enabled) in dirty log phase. There could be some dirty bits
> which are
> > not retrieved when unmapping happens. VFIO chooses to return the dirty
> > bits in a buffer passed in the unmapping parameters. Can memslot
> interface
> > do similar thing by allowing the userspace to specify a buffer pointer to
> hold
> > whatever dirty pages recorded for the slot that is being removed?
> 
> Yes I think we can, but it may not be necessary.  Actually IMHO CPU accesses
> to pages are slightly different from device DMAs in that we can do the
> following sequence to collect the dirty bits of a slot safely:
> 
>   - mark slot as READONLY
>   - KVM_GET_DIRTY_LOG on the slot
>   - finally remove the slot
> 
> I guess VFIO cannot do that because there's no way to really "mark the
> region as read-only" for a device: DMA could still happen, and the DMA would
> then fail when writing to a readonly slot.
> 
> On the KVM/CPU side, after we mark the slot as READONLY, CPU writes will
> page fault and fall back to QEMU userspace, and QEMU will take care of the
> writes (so those writes could be slow but still work); even with the slot
> marked READONLY, the writes won't fail, they just fall back to QEMU.

Yes, you are right.
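
(Just to spell that sequence out, a rough userspace sketch -- assuming the slot
already has KVM_MEM_LOG_DIRTY_PAGES set, with vm_fd/existing_slot as
placeholders and all error handling omitted:

	struct kvm_userspace_memory_region mem = existing_slot;
	struct kvm_dirty_log log = {
		.slot = mem.slot,
		.dirty_bitmap = calloc(1, mem.memory_size / 4096 / 8),
	};

	mem.flags |= KVM_MEM_READONLY;                  /* 1. stop further CPU dirtying */
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mem);
	ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);          /* 2. harvest the last dirty bits */
	mem.memory_size = 0;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mem); /* 3. finally drop the slot */

so between steps 1 and 3 the guest can still read the region, and any write
traps to userspace as you described.)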

> 
> Btw, since we're discussing the VFIO dirty logging across memory removal...
> the
> unmapping DMA range you're talking about needs to be added back later,
> right?

pure memory removal doesn't need add-back. Or are you specifically
referring to removal-and-add-back case?

> Then what if the device does DMA after the removal but before it's added
> back?

then it just gets an IOMMU page fault, but I guess I didn't get your real
question...

> I think this is a more general question not for dirty logging but also for when
> dirty logging is not enabled - I never understand how this could be fixed with
> existing facilities.
> 

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v8 00/14] KVM: Dirty ring interface
  2020-04-26 10:29           ` Tian, Kevin
@ 2020-04-27 14:27             ` Peter Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-04-27 14:27 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: linux-kernel, kvm, Michael S . Tsirkin, Jason Wang,
	Christopherson, Sean J, Christophe de Dinechin, Zhao, Yan Y,
	Alex Williamson, Paolo Bonzini, Vitaly Kuznetsov,
	Dr . David Alan Gilbert

On Sun, Apr 26, 2020 at 10:29:51AM +0000, Tian, Kevin wrote:
> > From: Peter Xu
> > Sent: Friday, April 24, 2020 10:20 PM
> > 
> > On Fri, Apr 24, 2020 at 06:01:46AM +0000, Tian, Kevin wrote:
> > > > From: Peter Xu <peterx@redhat.com>
> > > > Sent: Thursday, April 23, 2020 11:23 PM
> > > >
> > > > On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > > > > > From: Peter Xu <peterx@redhat.com>
> > > > > > Sent: Thursday, April 23, 2020 2:52 AM
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > TL;DR: I'm thinking whether we should record pure GPA/GFN instead
> > of
> > > > > > (slot_id,
> > > > > > slot_offset) tuple for dirty pages in kvm dirty ring to unbind
> > > > kvm_dirty_gfn
> > > > > > with memslots.
> > > > > >
> > > > > > (A slightly longer version starts...)
> > > > > >
> > > > > > The problem is that binding dirty tracking operations to KVM
> > memslots is
> > > > a
> > > > > > restriction that needs synchronization to memslot changes, which
> > further
> > > > > > needs
> > > > > > synchronization across all the vcpus because they're the consumers of
> > > > > > memslots.
> > > > > > E.g., when we remove a memory slot, we need to flush all the dirty
> > bits
> > > > > > correctly before we do the removal of the memslot.  That's actually a
> > > > > > known
> > > > > > defect for QEMU/KVM [1] (I bet it could be a defect for many other
> > > > > > hypervisors...) right now with current dirty logging.  Meanwhile, even
> > if
> > > > we
> > > > > > fix it, that procedure does not scale at all, and is error-prone to deadlocks.
> > > > > >
> > > > > > Here memory removal is really an (still corner-cased but relatively)
> > > > important
> > > > > > scenario to think about for dirty logging comparing to memory
> > additions
> > > > &
> > > > > > movings.  Because memory addition will always have no initial dirty
> > page,
> > > > > > and
> > > > > > we don't really move RAM a lot (or do we ever?!) for a general VM
> > use
> > > > case.
> > > > > >
> > > > > > Then I went a step back to think about why we need these dirty bit
> > > > > > information
> > > > > > after all if the memslot is going to be removed?
> > > > > >
> > > > > > There're two cases:
> > > > > >
> > > > > >   - When the memslot is going to be removed forever, then the dirty
> > > > > > information
> > > > > >     is indeed meaningless and can be dropped, and,
> > > > > >
> > > > > >   - When the memslot is going to be removed but quickly added back
> > with
> > > > > > changed
> > > > > >     size, then we need to keep those dirty bits because it's just a
> > common
> > > > > > way
> > > > > >     to e.g. punch an MMIO hole in an existing RAM region (here I'd
> > confess
> > > > I
> > > > > >     feel like using "slot_id" to identify memslot is really unfriendly
> > syscall
> > > > > >     design for things like "hole punchings" in the RAM address space...
> > > > > >     However such a "punch hole" operation is really needed even for a
> > > > common
> > > > > >     guest for either system reboots or device hotplugs, etc.).
> > > > >
> > > > > why would device hotplug punch a hole in an existing RAM region?
> > > >
> > > > I thought it could happen because I used to trace the KVM ioctls and see
> > the
> > > > memslot changes during driver loading.  But later when I tried to hotplug
> > a
> > >
> > > Is there more detail why driver loading may lead to memslot changes?
> > 
> > E.g., I can observe these after Linux loads and before the prompt, which is a
> > simplest VM with default devices on:
> > 
> > 41874@1587736345.192636:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.192760:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.193884:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.193956:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.195788:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.195838:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.196769:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.196827:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.197787:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.197832:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.198777:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.198836:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.200491:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.200537:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.201592:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.201649:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.202415:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.202461:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.203169:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.203225:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.204037:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.204083:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.204983:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.205041:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.205940:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.206022:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.206981:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.207038:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41875@1587736351.141052:kvm_set_user_memory Slot#9 flags=0x1
> > gpa=0xa0000 size=0x10000 ua=0x7fadf6800000 ret=0
> > 
> > After a careful look, I noticed it's only the VGA device mostly turning slot 3
> > off & on.  Frankly speaking I don't know why it happens to do so.
> > 
> > >
> > > > device I do see that it won't...  The new MMIO regions are added only
> > into
> > > > 0xfe000000 for a virtio-net:
> > > >
> > > >   00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
> > > >   00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
> > > >   00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
> > > >   00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
> > > >   00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
> > > >   00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba
> > > >
> > > > Does it mean that device plugging is guaranteed to not trigger RAM
> > changes?
> > >
> > > I'd think so. Otherwise, from the guest p.o.v., any device hotplug implies
> > > doing a memory hot-unplug first, which would be a bad design.
> > 
> > Right that's what I was confused about.  Then maybe you're right. :)
> > 
> > >
> > > > I
> > > > am really curious about what cases we need to consider in which we
> > need to
> > > > keep
> > > > the dirty bits for a memory removal, and if system reset is the only case,
> > then
> > > > it could be even easier (because we might be able to avoid the sync in
> > > > memory
> > > > removal but do that once in a sys reset hook)...
> > >
> > > Possibly memory hot-unplug, as allowed by recent virtio-mem?
> > 
> > That should belong to the case where dirty bits do not matter at all after the
> > removal, right?  I would be mostly curious about when we (1) remove a
> > memory
> > slot, and at the same time (2) we still care about the dirty bits of that slot.
> 
> I remember one case. The PCIe spec defines a resizable BAR capability, allowing
> hardware to communicate supported resource sizes and software to set
> an optimal size back to hardware. If such a capability is presented to the guest
> and the BAR is backed by memory, it's possible to observe a removal-and-
> add-back scenario. However, in such a case the spec requires the software
> to clear the memory space enable bit in the command register before changing
> the BAR size. I suppose such a thing should happen in the boot phase, where
> dirty bits related to the old BAR size don't matter.

Good to know.  Yes, it makes sense to have no valid data if the driver is trying
to decide the BAR size at boot time.

> 
> > 
> > I'll see whether I can remove the dirty bit sync in kvm_set_phys_mem(),
> > which I
> > think is really nasty.
> > 
> > >
> > > btw VFIO faces a similar problem when unmapping a DMA range (e.g.
> > when
> > > vIOMMU is enabled) in dirty log phase. There could be some dirty bits
> > which are
> > > not retrieved when unmapping happens. VFIO chooses to return the dirty
> > > bits in a buffer passed in the unmapping parameters. Can memslot
> > interface
> > > do similar thing by allowing the userspace to specify a buffer pointer to
> > hold
> > > whatever dirty pages recorded for the slot that is being removed?
> > 
> > Yes I think we can, but it may not be necessary.  Actually IMHO CPU accesses
> > to pages are slightly different from device DMAs in that we can do the
> > following sequence to collect the dirty bits of a slot safely:
> > 
> >   - mark slot as READONLY
> >   - KVM_GET_DIRTY_LOG on the slot
> >   - finally remove the slot
> > 
> > I guess VFIO cannot do that because there's no way to really "mark the
> > region as read-only" for a device: DMA could still happen, and the DMA would
> > then fail when writing to a readonly slot.
> > 
> > On the KVM/CPU side, after we mark the slot as READONLY, CPU writes will
> > page fault and fall back to QEMU userspace, and QEMU will take care of the
> > writes (so those writes could be slow but still work); even with the slot
> > marked READONLY, the writes won't fail, they just fall back to QEMU.
> 
> Yes, you are right.
> 
> > 
> > Btw, since we're discussing the VFIO dirty logging across memory removal...
> > the
> > unmapping DMA range you're talking about needs to be added back later,
> > right?
> 
> pure memory removal doesn't need add-back. Or are you specifically
> referring to removal-and-add-back case?

Oh right, I get the point now - VFIO unmap is different in that it does not mean
the RAM is going away; the RAM just becomes inaccessible from the device, so
it's actually a different scenario compared to real memory removals.  Sorry I
obviously missed that... :)

Thanks,

> 
> > Then what if the device does DMA after the removal but before it's added
> > back?
> 
> > then it just gets an IOMMU page fault, but I guess I didn't get your real
> > question...
> 
> > I think this is a more general question not for dirty logging but also for when
> > dirty logging is not enabled - I never understand how this could be fixed with
> > existing facilities.
> > 
> 
> Thanks,
> Kevin

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v8 03/14] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-04-24 15:21     ` Peter Xu
@ 2020-04-27 18:10       ` Sean Christopherson
  2020-04-28 20:22         ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2020-04-27 18:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Kevin Tian, Michael S . Tsirkin, Jason Wang,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert

On Fri, Apr 24, 2020 at 11:21:51AM -0400, Peter Xu wrote:
> On Thu, Apr 23, 2020 at 01:39:44PM -0700, Sean Christopherson wrote:
> > On Tue, Mar 31, 2020 at 02:59:49PM -0400, Peter Xu wrote:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 1b6d9ac9533c..faa702c4d37b 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -9791,7 +9791,32 @@ void kvm_arch_sync_events(struct kvm *kvm)
> > >  	kvm_free_pit(kvm);
> > >  }
> > >  
> > > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > > +#define  ERR_PTR_USR(e)  ((void __user *)ERR_PTR(e))
> > 
> > Heh, my first thought when reading the below code was "cool, I didn't know
> > there was ERR_PTR_USR!".  This probably should be in include/linux/err.h,
> > or maybe a new arch specific implementation if it's not universally safe.
> 
> Yeah, I just wanted to avoid introducing things in common headers before I'm
> sure it'll be used in the rest of the world..  We can always replace them with
> a global definition when it comes.

Gotcha.

> > An alternative, which looks enticing given that proper user variants will
> > be a bit of an explosion, would be to do:
> > 
> >   static void *____x86_set_memory_region(...)
> >   {
> > 	<actual function>
> >   }
> > 
> >   void __user *__x86_set_memory_region(...)
> >   {
> > 	return (void __user *)____x86_set_memory_region(...);
> >   }
> > 
> > A second alternative would be to return an "unsigned long", i.e. force the
> > one function that actually accesses the hva to do the cast.  I think I like
> > this option the best as it would minimize the churn in
> > __x86_set_memory_region().  Callers can use IS_ERR_VALUE() to detect failure.
> 
> If you won't mind, I would prefer a 2nd opinion (maybe Paolo?) so we can
> consolidate the idea before I change them... (I would for sure still prefer the
> current approach for simplicity since after all I don't have a strong opinion..)

Definitely makes sense for Paolo to weigh in.

> > > +/**
> > > + * __x86_set_memory_region: Setup KVM internal memory slot
> > > + *
> > > + * @kvm: the kvm pointer to the VM.
> > > + * @id: the slot ID to setup.
> > > + * @gpa: the GPA to install the slot (unused when @size == 0).
> > > + * @size: the size of the slot. Set to zero to uninstall a slot.
> > > + *
> > > + * This function helps to setup a KVM internal memory slot.  Specify
> > > + * @size > 0 to install a new slot, while @size == 0 to uninstall a
> > > + * slot.  The return code can be one of the following:
> > > + *
> > > + *   HVA:           on success (uninstall will return a bogus HVA)
> > 
> > I think it's important to call out that it returns '0' on uninstall, e.g.
> > otherwise it's not clear how a caller can detect failure.
> 
> It will "return (0xdeadull << 48)" as you proposed in abbed4fa94f6? :-)
> 
> Frankly speaking I always preferred zero but that's just not true any more
> after the above change.  This also reminded me that maybe we should also return the
> same thing at [1] below.

Ah, I was looking at this code:

	if (!slot || !slot->npages)
		return 0;

That means deletion returns different success values for "deletion was a
nop" and "deletion was successful".  The nop path should probably return
(or fill in) "(unsigned long)(0xdeadull << 48)" as well.
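
Something like the below, presumably -- just sketching the shape, the actual
respin is Peter's call:

	} else {
		if (!slot || !slot->npages)
			/* Match the bogus HVA returned by a real deletion. */
			return (void __user *)(unsigned long)(0xdeadull << 48);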

> > > + *   -errno:        on error
> > > + *
> > > + * The caller should always use IS_ERR() to check the return value
> > > + * before use.  Note, the KVM internal memory slots are guaranteed to
> > > + * remain valid and unchanged until the VM is destroyed, i.e., the
> > > + * GPA->HVA translation will not change.  However, the HVA is a user
> > > + * address, i.e. its accessibility is not guaranteed, and must be
> > > + * accessed via __copy_{to,from}_user().
> > > + */
> > > +void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> > > +				      u32 size)
> > >  {
> > >  	int i, r;
> > >  	unsigned long hva, uninitialized_var(old_npages);
> > > @@ -9800,12 +9825,12 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > >  
> > >  	/* Called with kvm->slots_lock held.  */
> > >  	if (WARN_ON(id >= KVM_MEM_SLOTS_NUM))
> > > -		return -EINVAL;
> > > +		return ERR_PTR_USR(-EINVAL);
> > >  
> > >  	slot = id_to_memslot(slots, id);
> > >  	if (size) {
> > >  		if (slot && slot->npages)
> > > -			return -EEXIST;
> > > +			return ERR_PTR_USR(-EEXIST);
> > >  
> > >  		/*
> > >  		 * MAP_SHARED to prevent internal slot pages from being moved
> > > @@ -9814,10 +9839,10 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > >  		hva = vm_mmap(NULL, 0, size, PROT_READ | PROT_WRITE,
> > >  			      MAP_SHARED | MAP_ANONYMOUS, 0);
> > >  		if (IS_ERR((void *)hva))
> > 
> > IS_ERR_VALUE() can be used to avoid the double cast.
> 
> Agreed.  But it's a context cleanup, so I normally will keep it as is (or use a
> standalone patch).
> 
> > 
> > > -			return PTR_ERR((void *)hva);
> > > +			return (void __user *)hva;
> > 
> > If we still want to go down the route of ERR_PTR_USR, then an ERR_CAST_USR
> > seems in order.
> 
> Sure.  But I'll still keep it kvm-only if you won't mind...
> 
> > 
> > >  	} else {
> > >  		if (!slot || !slot->npages)
> > > -			return 0;
> > > +			return ERR_PTR_USR(0);
> 
> [1]
> 
> > 
> > "return ERR_PTR_USR(NULL)" or "return NULL" would be more intuitive.  Moot
> > point if the return is changed to "unsigned long".
> 
> ERR_PTR_USR() takes a "long".  I can use ERR_CAST_USR(NULL) if you prefer me to
> explicitly use NULL.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v8 03/14] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-04-27 18:10       ` Sean Christopherson
@ 2020-04-28 20:22         ` Peter Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-04-28 20:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Kevin Tian, Michael S . Tsirkin, Jason Wang,
	Christophe de Dinechin, Yan Zhao, Alex Williamson, Paolo Bonzini,
	Vitaly Kuznetsov, Dr . David Alan Gilbert

On Mon, Apr 27, 2020 at 11:10:54AM -0700, Sean Christopherson wrote:

[...]

> > It will "return (0xdeadull << 48)" as you proposed in abbed4fa94f6? :-)
> > 
> > Frankly speaking I always preferred zero but that's just not true any more
> > after the above change.  This also reminded me that maybe we should also return the
> > same thing at [1] below.
> 
> Ah, I was looking at this code:
> 
> 	if (!slot || !slot->npages)
> 		return 0;
> 
> That means deletion returns different success values for "deletion was a
> nop" and "deletion was successful".  The nop path should probably return
> (or fill in) "(unsigned long)(0xdeadull << 48)" as well.

Yep.  Since I touched the line here after all, I'll directly squash this small
fix into this patch too when I repost.  Thanks,

[...]

> > > 
> > > >  	} else {
> > > >  		if (!slot || !slot->npages)
> > > > -			return 0;
> > > > +			return ERR_PTR_USR(0);
> > 
> > [1]

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2020-04-28 20:22 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-31 18:59 [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
2020-03-31 18:59 ` [PATCH v8 01/14] KVM: X86: Change parameter for fast_page_fault tracepoint Peter Xu
2020-03-31 18:59 ` [PATCH v8 02/14] KVM: Cache as_id in kvm_memory_slot Peter Xu
2020-03-31 18:59 ` [PATCH v8 03/14] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] Peter Xu
2020-04-23 20:39   ` Sean Christopherson
2020-04-24 15:21     ` Peter Xu
2020-04-27 18:10       ` Sean Christopherson
2020-04-28 20:22         ` Peter Xu
2020-03-31 18:59 ` [PATCH v8 04/14] KVM: Pass in kvm pointer into mark_page_dirty_in_slot() Peter Xu
2020-03-31 18:59 ` [PATCH v8 05/14] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
2020-03-31 18:59 ` [PATCH v8 06/14] KVM: Make dirty ring exclusive to dirty bitmap log Peter Xu
2020-03-31 18:59 ` [PATCH v8 07/14] KVM: Don't allocate dirty bitmap if dirty ring is enabled Peter Xu
2020-03-31 18:59 ` [PATCH v8 08/14] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
2020-04-01  7:04   ` Andrew Jones
2020-03-31 18:59 ` [PATCH v8 09/14] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
2020-03-31 18:59 ` [PATCH v8 10/14] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
2020-03-31 18:59 ` [PATCH v8 11/14] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
2020-04-01  7:03   ` Andrew Jones
2020-04-01 23:24     ` Peter Xu
2020-03-31 18:59 ` [PATCH v8 12/14] KVM: selftests: Add dirty ring buffer test Peter Xu
2020-03-31 18:59 ` [PATCH v8 13/14] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
2020-04-01  7:48   ` Andrew Jones
2020-03-31 19:00 ` [PATCH v8 14/14] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
2020-04-22 18:51 ` [PATCH v8 00/14] KVM: Dirty ring interface Peter Xu
2020-04-23  6:28   ` Tian, Kevin
2020-04-23 15:22     ` Peter Xu
2020-04-24  6:01       ` Tian, Kevin
2020-04-24 14:19         ` Peter Xu
2020-04-26 10:29           ` Tian, Kevin
2020-04-27 14:27             ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).